12 rats tied together
Sep 7, 2006

I've been using AWX as a replacement for Jenkins and so far have been pretty happy with it.

New Yorp New Yorp posted:

The tasks are just wrappers around ARM templates and invocation of same. I'm pushing for infrastructure-as-code -- ARM templates and deployment scripts live as first-class citizens alongside the application source code, and all we're doing at deployment time is running that stuff. I don't want to tightly couple them to a particular deployment orchestration platform, create a maintenance nightmare, or create a system that makes rapid iteration difficult.

Pretty much exactly this except with CloudFormation and sometimes we'll trigger a post deploy config playbook.

e: in our case, instead of powershell, ansible will ssh to the node and do normal ansible things, so I get to take advantage of the several hundred thousand lines of code other people have already written instead of writing any powershell or python of my own

12 rats tied together
Sep 7, 2006

cheese-cube posted:

This seems kinda jacked. Ansible is a configuration management solution, not CI/CD.

Also for those using Jenkins you should be using this feature, it's brilliant: https://jenkins.io/doc/book/pipeline/shared-libraries/#defining-declarative-pipelines

An old version of the ansible website tried to skip the comparison to chef etc by proudly displaying "Ansible is task orchestration".

One of the tasks you can orchestrate is configuring a server. Another is rolling updates. A quick trip through the documentation these days reveals tons of features explicitly for tasks relevant to CI/CD.

The plus side is you can also use it for config management. Pretty much the only thing I wouldn't do with ansible in 2018 is compile code. Everything else goes straight into AWX.

12 rats tied together
Sep 7, 2006

Legacy .net application takes multiple hours to build and sometimes days to successfully deploy.

Stuff that was implemented during my employment here varies based on the stack, but it's usually about 5-15 minutes from a merge to master until we're running the new code in prod.

12 rats tied together
Sep 7, 2006

Warbird posted:

Point made. Let me ask you all this then: how do you go about gaining enough experience in non-work related tech or applications to be able to use it down the line? Certs? I've only ever built directly off of what I did in the last job(s), so I'm not super certain how one branches out in a meaningful way.

A lot of it is just a mindset thing, IMO, and demonstrating a strong grasp of fundamentals.

If you want to do "devops" stuff I think you just need to be able to speak about, roughly:

- 12 factor
- pets vs cattle
- 0 touch provisioning
- task orchestration (build / deploy)
- environment management or infrastructure as code

It's unreasonable to expect a new hire to be an expert at a particular tool (unless the job posting is like Sr. Engineer - OpenStack or whatever).

You can totally answer a question about deploying a new kubernetes application with the puppet+VMware version, or whatever. As an interviewer I'd be mostly looking for you to demonstrate knowledge about the above items anyway, so if you give me the answer for your own tech stack that's really just an even better opportunity for me to get at those fundamentals instead of talking about replica set pod bullshit which ultimately does not matter.

Comedy option, though: I usually learn a new tech stack during the technical exercises at job interviews.

12 rats tied together
Sep 7, 2006

Docjowles posted:

Are you vaguely aware of how to write and operate modern applications, where modern is like TYOOL 2012? It is that. https://12factor.net/. Plus the usual "make your app stateless!!!" "ok but the state has to live someplace, what do we do with that?" "???????????"

I wasn't familiar with this part of 12 factor but actually it's right here:

quote:

Twelve-factor processes are stateless and share-nothing. Any data that needs to persist must be stored in a stateful backing service, typically a database.

This is pretty reasonable, I'm not sure why anyone would object to this.

Cancelbot posted:

So we're mounting a two-pronged attack with AWS CodeDeploy & Docker

AWS CodeDeploy is really good. Docker is not a requirement in this scenario: since you're running in EC2 you already have just about everything Docker gives you from an orchestration perspective, it's just a matter of arranging the nuts and bolts to your liking. Sometimes Docker makes sense, but most often I've found that if the push towards containers comes from the development half of an organization (I'm generalizing, I know), it ends up an untenable mess of bullshit in anywhere from 1 to 6 months.

I strongly recommend that, if you can, approach Docker / containers as an organization with a shared contract of "just put your containers here: _____". That underscore can be nexus, AWS ECR, dockerhub, whatever, but starting with instructions for a developer to push their containers and trigger a deploy is, in my experience, the best overall approach for operational sanity. From that point you can build your orchestration assuming that someone checks in code -> a container is built and pushed, our automation takes over from here.

Your organization is probably different, but in all of the orgs I've worked in, once developers (or really, anybody beholden to a PM / business planning) get involved in orchestration (even via something like a helm chart), operational sanity gets thrown out the window immediately in favor of shipping features before an arbitrary deadline that was decided by someone who barely knows what a container is. Container troubleshooting is the worst kind of troubleshooting, so definitely fight stuff like that tooth and nail, IMO.

You also mentioned that most of your services are 100MB ram and <1% CPU? ECS/Fargate are both excellent services, but I'd highly recommend engaging your TAM or some other kind of AWS support before deciding on anything; hopefully they can work with you to spin up a proof of concept applicable to what you're hoping to accomplish with EC2/Docker.

It's really hard to beat either of those services though, assuming you are only running on EC2. In particular, I would start off this migration journey by thinking about IAM. Getting a container scheduled onto an instance is the easy part -- an intern can bang that out in half a day. The hard part is usually managing secrets, credentials, and AWS API access.

12 rats tied together fucked around with this message at 21:12 on Oct 1, 2018

12 rats tied together
Sep 7, 2006

Cancelbot posted:

end goal is each team having an AWS account

Awesome, this is a pretty solid idea all around. Orgs I've worked in that tackled multi-account either early on, or from the start, are generally much healthier from an operational standpoint 2-3 years down the line than orgs that had "the aws account" until some external factor forced them into multiple accounts abruptly.

Cancelbot posted:

but some are rebuilding things in lambda + S3 as they don't even want to give a poo poo about an instance going down.

This is another great idea. My current org has been going through an "oh this lambda stuff is pretty good huh" phase for the past 12-15 months, and one thing that caught most of the developers off-guard is that an s3 event notification can only have one target per unique combo of prefix and suffix. Once people realized they could spin up functions in response to objects being placed (our platform is basically a huge s3 state machine), tons of people wanted to run functions off of the same prefix, so we naively implemented the first one, failed on the followups, and then everything had to be delayed while we rewrote them all to subscribe to SNS topics.


Moving an application's infrastructure from one region to another would take us probably 15 minutes to 2 hours in my current role, depending on the application. CloudFormation has a lot of built-in helpers here now that we used to use ansible for, in particular StackSets. As long as you build all of your cfn templates assuming that region (and possibly vpc id) is a type of primary key that you'll use to lookup subnet ids, amis, and security group ids, you're like 90% done with just being able to swap eu-west-1 to eu-west-2 in github and then pushing a button.
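To illustrate what I mean by treating region as a primary key, a minimal sketch with made-up ansible group_vars feeding the templates:
code:
# group_vars/all.yaml (all values are placeholders)
region_map:
  eu-west-1:
    vpc_id: vpc-aaaa1111
    app_subnets: [subnet-aaaa, subnet-bbbb]
    base_ami: ami-0123456789abcdef0
  eu-west-2:
    vpc_id: vpc-cccc2222
    app_subnets: [subnet-cccc, subnet-dddd]
    base_ami: ami-0fedcba9876543210

# and inside the cfn template (template.yaml.j2), everything keys off `region`:
#   ImageId: {{ region_map[region].base_ami }}
#   Subnets: {{ region_map[region].app_subnets | to_json }}
Swapping eu-west-1 to eu-west-2 is then a one-line change to `region`, which is the "push a button" part.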

Biggest time sinks in my experience are, of course, application configuration, and if you need to move anything heavy (especially redshift clusters) that can take a couple hours. It's not too bad though -- it really helps if you have your basic network configs in cloudformation.

12 rats tied together
Sep 7, 2006

Helianthus Annuus posted:

You can achieve something similar in AWS EC2 by baking AMIs when you build your software, and then treating each EC2 instance running some version of your AMI as your atomic unit of infrastructure. But in this case, you're using AWS specific tooling to achieve CI/CD instead of something more provider-agnostic, like k8s.

I agree mostly except I think it's a mindset thing more than a tech thing. You don't need to bake amis to have immutable infrastructure, the only thing you need to do is not mutate your infrastructure.

Last job we jokingly called it 'deterministic infrastructure' in that we assumed two servers, when fed the same inputs, resulted in the same outputs at least at a service level. This is ~usually true.

You don't need k8s or ec2, you can provision a server with PXE and run a post provisioning task. You just need cloud-init, something that can talk ipmi (ansible, for example), and your PXE distribution tool of choice and you're pretty much done.

Boot a network image, server comes up and requests a deploy for whatever it's supposed to be, deploy kicks off and sets up monitoring etc, and then ideally your node starts responding affirmatively to some kind of health check and you're in business.

Post provisioning step can include literally just dropping a docker compose file into a server and configuring an upstart service that runs it. I've worked places where this is how we shipped some applications and it works great assuming you don't need any of the fancier k8s features.

Not that there isn't value in k8s or ec2 outside of "this is a hardware abstraction" -- it's just important to note IMO that none of the tech is magic or even that hard to replicate in a physical DC, if it makes business sense to do so.

12 rats tied together
Sep 7, 2006

Ansible is agentless, if you can ssh into those servers it's where I would recommend starting.

e: or winrm

12 rats tied together
Sep 7, 2006

Warbird posted:

Suffice it to say we don’t and are federally prohibited from having any. The entire thing is run by a team of less than a dozen and they still create Linux boxes by putting a disk in a tray.

You don't actually need root to run ansible, fortunately. You set your ssh user as part of your remote block (https://docs.ansible.com/ansible/2.5/user_guide/intro_getting_started.html#remote-connection-information), which by default just runs with whatever OpenSSH config you already have.

Ability to SSH to the machine is the only hard requirement, for linux boxes anyway.
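If it helps, a minimal sketch of what that looks like -- the inventory, hostnames, and user are made up, and remote_user can also live in ansible.cfg or the inventory instead of the play:
code:
# inventory.ini
[app_servers]
legacyhost01.example.com
legacyhost02.example.com

# site.yaml -- no root, no agent, just whatever ssh access you already have
- hosts: app_servers
  remote_user: svc_deploy
  tasks:
    - name: smoke test that ansible can reach the box
      command: uptime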

12 rats tied together
Sep 7, 2006

The only thing I'd mention is that by default ansible runs on a push model, so if you need this automation to keep the environment consistent because folks are making changes, you should be aware that you are going to have to re-run a playbook every time you want to enforce this desired state.

There are some approaches to this problem (AWX, ansible-pull), but they're kind of heavyweight for your use case IMHO.

12 rats tied together
Sep 7, 2006

Here's my take on the cloud-infracode problem that's been working great for me for the past ~3 years or so:

Do use ARM templates. Use ansible to manage your arm templates (https://docs.ansible.com/ansible/2.5/modules/azure_rm_deployment_module.html). Instead of specifying the template inline like in the examples, or in the template_link field, use an inline template lookup:

code:
[...]
    template: "{{ lookup('template', '/path/to/template.yaml.j2') }}"
[...]
Using an inline lookup here lets you take advantage of ansible's rich variable inheritance and templating mechanisms, so you get pretty much full jinja2 functionality, all of the extra ansible addons for jinja2, and you can write your own custom ones by putting python scripts in your ansible repo.

This lets you do things like loop through data structures of arbitrary complexity and resolve values via either local parsing or remote lookups (from other machines or from a web service); it lets you define jinja macros, gives you fully functional if/else/etc flow logic, and fully supports reasonably complex inheritance, including the ability to pull a default data structure and extend it, combine it in some way, null out various keys or values, etc.
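To make that concrete, a rough sketch of the whole task -- the resource group, location, and parameter names are placeholders, and the exact field names can differ a bit between ansible versions, so check the azure_rm_deployment docs linked above:
code:
- name: deploy the ARM template rendered through jinja2
  azure_rm_deployment:
    resource_group_name: my-app-rg
    location: westeurope
    deployment_mode: incremental
    template: "{{ lookup('template', '/path/to/template.yaml.j2') }}"
    parameters:
      environmentName:
        value: "{{ env_name }}"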

And then, afterwards, you can use ansible to deploy to your windows servers too, and all of the data you used for your ARM templates is available for you in an extremely consistent manner inside your server deploy code, or the other way around, or both ways around if required.

I prefer this approach to something like Terraform for a lot of reasons, but the main one is that jinja2 has already existed for like 10 years so there's a wealth of documentation, support, and tons of edge cases and gotchas already figured out, compared to something like Terraform where mitchellh needed to be convinced that nullable keys/values were a genuine requirement for infracode tooling.

12 rats tied together
Sep 7, 2006

Yeah you can definitely replace just about all of that with just ansible, even helm (sort of), and even jenkins (sort of).

If you've read the "choose boring technology" blog article, embracing ansible gets you reaaaally close to the 2nd version of the Problems -> Technical Solutions image just by itself. It's an extremely versatile tool and I consider it about as useful as, and indispensable as, a bash shell.

I considered doing the ansible -> j2ed terraform thing in my current role but decided that everything I'd need to do in Terraform I would be much better served doing in CloudFormation anyway, so there wasn't a lot of value in it. If you're interacting with a vmware thing though, like the article you linked, it does seem like templating terraform is one of your only options at the moment.

12 rats tied together
Sep 7, 2006

Vulture Culture posted:

Terraform's also a solid and well-put-together product—we manage something like 5,000 compute instances with it and it's mostly been painless since the 0.9 release—but it does so many things that everyone has thirty feature requests and a bug story to tell

In my experience Terraform doesn't really fall apart until you have to start managing several disparate environments for it, and for that something like provider + provider region count is usually a more useful measure of "how much is Terraform ruining my day to day life right now".

It's an absolutely fantastic product if you just need to configure 5000 compute instances. var.count plus the splat operator is actually pretty clever and intuitive (generally). Once you start needing to configure vpcs, subnets, regions, multiple cloud accounts, anything with different authentication mechanisms, stuff that might exist sometimes and other times might not exist, it turns into a nightmare shitshow almost instantly.

The combination of "works great for things that are simple" plus "is probably worse than clicking buttons in the ui" once you want to build a single reusable abstraction makes it a particularly dangerous tool IMO. It's like they intentionally built a tool for executable blog articles.

e: I think if you're really excited about for loops in terraform (I am not), you should consider that basically what you want to do is document templating, and that document templating languages have existed for a long time and you definitely do not need to wait for terraform x.y.z to solve this problem for yourself.

12 rats tied together fucked around with this message at 20:21 on Oct 26, 2018

12 rats tied together
Sep 7, 2006

necrobobsledder posted:

In my defense,

I'm sorry, I didn't mean to imply that you made a poor choice or anything. There are tons of great reasons to not use ansible, "nobody here knows ansible" is definitely one of them. Because the tool is so easy to use, I often find myself getting into tit-for-tat slack threads where basically all I do is say "you can actually do that in ansible already: <docs link>".

I'll have someone link me a blog article where the author just did not notice that the limitation they are complaining about is directly addressed by a core feature, or I'll have someone chime in and be like "well at my last job we used ansible from an employee laptop to copy a 1 GB file across the internet 3000 times to 3000 servers and it was slow, therefore, ansible is a bad tool for deploying software".

Since I've been spending a lot of time doing that lately, it is kind of my default reaction to a technical problem description even though it's usually not appropriate.

12 rats tied together
Sep 7, 2006

Exhibitor is pretty nice and, while not perfect, is pretty much the only way to run zookeeper IMO.

12 rats tied together
Sep 7, 2006

Votlook posted:

I'm using ansible for the first time today, and wow it feels scary! I can't stop thinking that running glorified shell scripts over SSH is a terrible idea.
[...]
Time to start lobbying for Kubernetes then! Is it any good?

If your level of involvement in the application is "how do I perform <some configuration task>?" it's worse in every possible way. Minor correction also: you're not running a glorified shell script over SSH, you're invoking a module against a machine. It's a shell script in the same way that /bin/bash -c "python script.py args" is a shell script, I guess.

It's pretty much push-mode chef except instead of writing a cookbook you feed the orchestration a list of objects serialized to yaml. It's a way better approach because making assertions about a data structure is way easier than trying to infer meaning from arbitrary ruby/python/golang/whatever.
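For a small, made-up example of the "list of objects serialized to yaml" part:
code:
- hosts: web
  become: true
  tasks:
    # each of these is structured data handed to a module, not a shell one-liner
    - name: install nginx
      package:
        name: nginx
        state: present

    - name: make sure nginx is running and enabled
      service:
        name: nginx
        state: started
        enabled: true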

Votlook posted:

In my previous job I did use immutable servers (without kubernetes unfortunately), and while it takes more effort upfront I loved that when I had an AMI that was tested, it was pretty much ensured to work in production.
Pushing updates with Ansible just feels so loving brittle in comparison

It doesn't have to. It's totally possible to write really lovely ansible (it's just a task orchestration tool -- configuring a server is only one of the many tasks you might choose to orchestrate). It's also possible to write extremely robust ansible -- we did immutable infrastructure except without the images by just having really good habits in a pure ansible shop at my previous role.

I've jokingly referred to it as "deterministic infrastructure" in this very thread in that we assume a fresh server, when fed the same input as another fresh server, ends up in an identical state. It's like rolling an AMI except instead of baking an AMI you configure it every time. Packer/etc can run ansible for you and save the results into a machine image or you can just run ansible yourself, it's the same thing +/- a few minutes of bootstrap time.

It seems like the thing you're worried about here is that ansible can be run at any time? This is 100% on you guys to enforce some kind of procedure or policy here. Ansible modules are idempotent, but it's up to you to write playbooks that don't take down your application at random throughout the day. There's no fundamental design choice in ansible that makes this any more dangerous than any other type of automation; it's totally possible to accidentally brick your kubernetes application in pretty much the same way.

You could even use the ansible helm module to brick your kubernetes application if you wanted.

12 rats tied together
Sep 7, 2006

freeasinbeer posted:

But I truly think that Ansible/puppet/chef are not the right tools to control your workloads. They are super fiddly at times and I'd rather control everything up a level than worry about nodes.

I agree with you in theory but again, in practice, ansible can only help here. If you have a list of steps in a readme somewhere, that should be a playbook, even (especially) if those steps are kubectl apply, helm create, or whatever else.
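i.e. the readme version of "do this, then that" translates pretty much 1:1 into a playbook; the commands and paths here are hypothetical:
code:
- hosts: localhost
  connection: local
  tasks:
    - name: apply the namespace and rbac pieces first
      command: kubectl apply -f manifests/namespace.yaml

    - name: then install or upgrade the chart
      command: helm upgrade --install myapp charts/myapp -f values/prod.yaml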

There's a reasonable argument that ansible is not worth the extra complexity compared to a makefile with your scheduler orchestration commands in it, or even just having them in that readme, but pretending that ansible is a fundamentally different approach or a solution to a different problem entirely is kind of missing the mark a bit.

12 rats tied together
Sep 7, 2006

minato posted:

We've used Kubernetes for years and haven't felt the need to automate anything with Ansible (or Chef, Puppet, etc). We use a combination of Jenkins to monitor our gitops repos that contain the kubernetes manifest files, which in turn triggers Helm/Tiller re-deployments. It works very well for 95% of the apps we run. We use AWS RDS for databases and EBS for persistent storage (which Kubernetes supports).

I think you'd have to be crazy to seriously endorse doing anything related to container scheduling with Chef or Puppet. For your use case ansible is a drop-in replacement for Jenkins -- whether or not that's good for you pretty much just depends on how much you guys hate jenkins.

necrobobsledder posted:

I do admit that applications developed from the get go in containers will likely follow stateless, 12F-ish patterns throughout their lifespan.

It's not like you can't do (and people haven't been doing) all of this _without_ containers for the past 6? 7? years though. It's unfortunate that k8s is what it took to get development shops as a whole to embrace 12 factor instead of, I dunno, 12 factor?

12 rats tied together
Sep 7, 2006

necrobobsledder posted:

The reason I’d advocate for using Terraform is that it’s easier to pick up for newcomers overall than trying to learn a cloud-specific DSL like CloudFormation. Hell, nothing stops me from deploying CloudFormation stacks as a Terraform resource anyway but I can’t do the reverse so the migration path is there.

Right, the thing that should stop you though is that Terraform does not support a rich templating language of any sort, which should result in the question: what the gently caress is this tool actually doing for me?

Something that is reasonably interesting to me lately is actually reversing this, using Ansible to manage the composition and deployment of a terraform state through the new-ish terraform module. Unfortunately the module kind of sucks because it's still impossible to get terraform plan output in any kind of machine readable format (and it's not coming in 0.12 either).

It's a comedy option tool -- something to embrace if your idea of infrastructure as code is chaining together shell commands, kind of like the aforementioned AWS CDK which turns CloudFormation into "npm install; cdk init app".
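The ansible side of that ansible-drives-terraform arrangement looks roughly like this, for reference -- the paths and variables are made up, and the module's exact parameters can vary between ansible versions:
code:
- name: apply the network terraform state, with vars fed in from ansible
  terraform:
    project_path: "{{ playbook_dir }}/terraform/network"
    state: present
    variables:
      vpc_cidr: "{{ vpc_cidr }}"
      environment: "{{ env_name }}"
  register: network_tf

- name: the state's outputs come back for use elsewhere in the play
  debug:
    var: network_tf.outputs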

Sidenote: I understand the ansible guy is working on a new tool: https://github.com/opsmop/opsmop which is allegedly designed with the intention of being easier to attach to existing software. This is pretty interesting because one of my least favorite parts of ansible is that you really do need to run AWX to get the most out of it, and I hate running things just in general.

12 rats tied together
Sep 7, 2006

Smugworth posted:

The more PRs we get that are just "a set of files where someone just search+replaced the environment"

To give some context to my rather strong opinions about terraform: I have absolutely 0 tolerance for copy paste and change a thing.

It's criminal that they went through all the trouble to write yet another DSL but you can't subclass a module. You can't even run interpolations in state configs or module paths last I checked which is an extremely galaxy brain approach to a declarative infracode tool.

If something is different, override that property and surface it to me. Don't make me code review a +400 line diff where 394 of them are an exact copy paste. I have no patience for that kind of poo poo when inheritance as a concept has existed for longer than I've been alive.

12 rats tied together
Sep 7, 2006

NihilCredo posted:

Wonder if anybody has ever had an incident where one of their cloud resources had a startup trigger that spun up another 2+ resources and it Von Neumann'd up all their budget and/or the datacenter capacity in a few minutes.

There are some horror stories with recursive lambda functions that go something like this.

At my previous job we also provided basically container scheduling as a service, and the product was only up in internal preview for a couple of weeks before someone pushed a workflow with a slowly expanding cycle (some kind of model scoring job or something) and blew up production.

The thing that gave out first was actually the VPC DNS server at the +2 address, where we blew past the 1k requests per second per eni limit from a caching resolver.

We started investigating what looked like a DNS problem only to find that inbound requests had gone from a couple hundred per second up to the high 20,000s or something like that.

12 rats tied together
Sep 7, 2006

12 rats tied together posted:

Something that is reasonably interesting to me lately is actually reversing this, using Ansible to manage the composition and deployment of a terraform state through the new-ish terraform module.

I spent some time today tinkering with this and my findings are:

- Don't use it if you can avoid it
- After spending 6 months away from terraform I never want to go back to that hell

However, you do get for loops (anywhere in terraform, not just on the resource types that will support it in 0.12), nullable keys, and the ability to run interpolations anywhere in a terraform config file, such as state configuration and module sources. With some effort you can stop using modules entirely and start declaring/inheriting/overriding resource definitions from some external source that doesn't need to exist as a resource inside the terraform dag. You can also link states together, so you don't need to write a makefile for applying from A and then B, which is pretty nice.

As expected it will be better if terraform ever supports json plan output, but until then it's about as bad as using the CLI, so it's pretty solid overall if your toolchain already includes both terraform and ansible.

12 rats tied together
Sep 7, 2006

abigserve posted:

Other benefits are;
- It interacts very well with Vault, so you can do things like store S3 keys straight into the password vault without having to ever see them
- The state file is JSON so you can easily scrape it for information (within reason, remember there's secrets in there)
- it plays remarkably nicely with existing resources, that is, you don't have to terraform your entire environment but you can still use it without fear of blowing poo poo away.

I'm actually incredibly disappointed with the vault integration as a feature. When using ansible-vault we have the secret obfuscated throughout the entire lifecycle -- it gets pulled out of a vault file with a psk and included as a cloudformation or alicloud ros parameter with NoEcho: True. Persisting the secrets in plaintext is not totally untenable but it's just yet another reason to not use terraform, unfortunately. The state file schema is totally loving insane btw, if you've ever had to write automation for it (we have a "with_tfstate" ansible plugin at $current_job).

Point 3 is fair as long as you take it to "and eventually you can import existing resources into terraform (and then get to worry about blowing poo poo away)". Otherwise using automation for some stacks but not all isn't terribly specific to Terraform.

abigserve posted:

I echo the sentiment that I would absolutely one hundred million percent not waste a second of time on any cloud specific deployment language.

I'm curious how many people echoing this sentiment have actually gone multi cloud, and which providers if so? The terraform repo we have at $current-job was pushing something like 5 aws accounts, probably ~20 regions across them, 3-4 regions in 2 alibaba cloud accounts, and then some experiments with other providers like the ns1 provider (which ended up being buggy as poo poo).

I thought it was still trash. It's not like the DSL is cloud agnostic -- in alicloud you still need to set PrePaid/PostPaid, vswitch id instead of subnet id, data_drives instead of block device mappings, and other cloud specific parameters. You still need different providers (one for each region, even!), and you still can't interpolate any dynamic values in a provider config for some reason. Even with AWS <--> Alicloud where one provider is essentially a copy paste of the other this is way harder to handle in Terraform than it is in any other applicable tooling.

Every relevant cloud provider (even openstack if you want to squint really hard at what 'provider' means) provides document based orchestration tooling. Any weirdness this tooling exposes is generally going to be a reflection of weirdness in the provider API itself, for example, vpc_security_group_ids vs security_groups, ec2 classic vs main vpc vs any other vpc -- Terraform is never going to handle these for you automatically, and the gotchas are generally lifted straight out of RunInstances.

If you're going multi cloud, or if you want some facsimile of true agnostic, you unfortunately still actually need to learn the cloud providers. In my experience learning the providers is learning whatever version of CloudFormation they have, so layering Terraform on top of that has historically been a huge waste of effort for me when other tooling exists with a superset of the functionality and generally way less restrictions on where I can put a for loop.

12 rats tied together
Sep 7, 2006

Docjowles posted:

Yeah, it’s that. It spends 15 minutes refreshing the state. And we haven’t even imported all the zones we would have in production yet, lol, this is just a subset for a test.

Probably going to end up writing our own tool to do this which isn’t terribly hard. I just always prefer to use popular off the shelf stuff first if possible.

I was wondering if there was some obvious workaround or something since I assume we are not the first team wanting to manage large zones via terraform. But maybe I am uniquely dumb :pseudo:

I just wanted to echo that I hated doing anything route53-related with Terraform; the ansible -> jinja2 for loop -> cloudformation template approach is managing something like 3k records in <30sec using Route53 RecordSets.

It looks like Terraform doesn't support the RecordSet resource so my unironic recommendation would be to use the cloudformation terraform resource to push a stack composed of RecordSet objects.
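For reference, the jinja2 loop version is about this much template -- the record data structure and variable names are made up:
code:
# records.yaml.j2, pushed with the ansible cloudformation module (or via the
# aws_cloudformation_stack resource if you want to stay inside terraform)
Resources:
  AppRecords:
    Type: AWS::Route53::RecordSetGroup
    Properties:
      HostedZoneId: {{ hosted_zone_id }}
      RecordSets:
{% for record in dns_records %}
        - Name: {{ record.name }}
          Type: {{ record.type }}
          TTL: "300"
          ResourceRecords: {{ record.values | to_json }}
{% endfor %}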

Mr Shiny Pants posted:

Let's say I will have some Linux machines that will be deployed around the world with intermittent connectivity ( think satellites ) running docker on some Linux variant.

What would be a good way to keep these under control and up to date?

Would it be possible to just run regular docker on them and push updates using something like Ansible?

Sure, this sounds fine. Definitely go with the compose file though instead of "docker run whatever".

If you can distribute the ansible repo (including secrets), I'd suggest tinkering with ansible-pull to see if it works better than push-mode updates. Ansible is a great tool but it's not ~super good at handling hosts that are expected to be consistently unreachable.

12 rats tied together
Sep 7, 2006

PBS posted:

Sorry if this has already been answered, I've dug through the thread some and don't see anything recent that seems fairly specific.

What do your deployment pipelines look like? How do you go from source code in a repo to something running in k8s?

I've been doing some research and I've been surprised by how little information I'm finding. Most of what I have been able to find is just trying to sell me a product.

We're in the very early stages of getting clusters and the tooling around them set up. I'd be interested in hearing others' experience in this area, tips, tricks, pitfalls, etc.

ansible playbook(s) that contain a series of k8s module calls that push j2 templated manifests to k8s. The j2 templating hooks into the rest of our ansible ecosystem which has hooks into and data about most of our infrastructure.

The playbook steps are nice for managing ordering and dependencies of disparate k8s object applications, for example, push this thing and then watch this metric until it stabilizes, and then push the next thing, followed by the rest of the things in chunks of 20%. I also haven't needed to learn any of the 14 other types of k8s templating systems and my documentation search radius so far has been the k8s api documentation and k8s.py. I really like it.
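Roughly like this -- the manifest paths and the metric check are invented, but it's the shape of it:
code:
- hosts: localhost
  connection: local
  tasks:
    - name: push the config and the new deployment, in order
      k8s:
        state: present
        namespace: myapp
        definition: "{{ lookup('template', item) }}"
      loop:
        - manifests/configmap.yaml.j2
        - manifests/deployment.yaml.j2

    - name: wait for the relevant metric to stabilize before the next batch
      uri:
        url: "https://prometheus.internal/api/v1/query?query=myapp_error_rate"
        return_content: true
      register: metric
      until: (metric.json.data.result | length) == 0
      retries: 30
      delay: 10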

I do think one of the strengths of the k8s approach though is that you can pick whatever orchestration system you want. Templating text files and running shell commands in sequence has been trivial for about as long as computers have existed.

12 rats tied together
Sep 7, 2006

Hadlock posted:

[...] ansible to control k8s seems like some sort of 666 unholy antipattern [...]

It's not unreasonable to feel this way, but it helps if you remember that ansible is a deployment tool more than it is a configuration management tool.

It's an extremely supported, common, and well known pattern to have ansible run tasks against its control machine. You don't have to leap very far from "use my control machine to make a series of AWS API calls" to land on "use my control machine to make a series of k8s API calls".

It's been nice not having to learn helm at all, but there's nothing stopping you from using ansible to orchestrate a series of helm commands either: https://docs.ansible.com/ansible/latest/modules/helm_module.html :wink:


Basically -- any time you have a readme, comment, wiki article, or otherwise that says "do this, then that", ansible is really good at handling that in a way that can be trivially understood and code reviewed.

12 rats tied together
Sep 7, 2006

In my experience "dev on call" only works when you put a lot of effort into it. The teams need to be responsible for deploying their own application on their own schedule with no oversight, they need an on call rotation dedicated to their team, and their phones should only ring when their own application breaks.

From an operational perspective you need to (rightly) consider your pagerduty (or whatever) configuration to be a production stack component. Schedules, escalation policies, group membership should be managed with automation and the production configuration of it should be the result of a deploy from some master branch always, so you can peer review proposed changes and roll back things that break. There needs to be clearly defined policies on how new people get added to the rotations, when they get added, what rotations they are on (if they are on multiple), etc.

The moment you slip on any of this and start to annoy developers needlessly, you lose all of their trust in the system and they will quickly start to (rightly) ignore it, or develop awful operational practices around it that allow them to accomplish their job while giving it as little time as they can.

Umbreon posted:

If anyone here has some spare time to answer:

What's a day in this career field like? What do you do every day, and what are some of the more difficult parts of your job?(and how do you handle them?)

I've held a few of these jobs in the past 5-ish years. I would broadly describe what I do as providing reusable infrastructure abstractions to development teams. I almost always have a ticket queue of some sort with planned tasks in it, generally the tasks are related to allowing a feature development team to begin work on their team's project.

The goal is to unblock feature development teams as quickly as possible without generating a large amount of technical debt, while sticking to an agreed upon infrastructure philosophy as closely as possible. Personal judgement is a huge factor in deciding when to rigidly follow internal philosophy, when to bend the rules a little bit to get something done faster or more cleanly, and when to spend a little extra time on a project because the likelihood of repeating it in the future is very high. I spend a lot of time considering and executing on these tasks, and a similarly large amount of time providing peer review on other team members' proposals for their own tasks.

The difficult parts of the job generally come from 2 sources: getting accurate information from project/product management, and deciding what solutions best fit the internal philosophy of whatever team I'm on. Usually the difficulty is one or the other of these things, not both, because in my experience a schizophrenic project/product planning team will result in there being almost no time to consider philosophy; the team goal mutates into accomplishing all work as quickly as possible, and automation code becomes a means to that end instead of something that fits into a more holistic "devops" workflow.

12 rats tied together
Sep 7, 2006

Kevin Mitnick P.E. posted:

[...]
EC2 kinda sucks for complexity, but it's a lot cheaper than Fargate.

In my experience ec2 complexity is easier to manage if you totally abandon trying to use the ec2 api at all and just shove everything into autoscaling groups and suspend all of the scaling processes.

12 rats tied together
Sep 7, 2006

Kevin Mitnick P.E. posted:

do you want to use on-demand or spot in your ASG. how much spot? do you want multiple instance pools. what's your bid. are your autoscaled nodes hooked into an ALB. tags, subnets, security groups, AMI, key pair. After making a few dozen decisions you're ready to spin up an instance. To be clear, I'm not complaining. EC2 is flexible and that costs complexity.

Totally - it gets even worse when you start running into capacity issues on the AWS side like, we literally cannot sell you any more i3.8xlarge in this AZ, but your application requires a cluster placement group, so you're SOL.

Shoving everything into an ASG lets you defer on capacity and scaling issues until AWS is ready to let you give them money; you can sit there and let the group fail to launch you the i3s until some of them open up. It's basically an absolute requirement for me from an automation perspective, I hate having to revert feature branch merges because we couldn't tell (and there's no API for asking) that a particular environment is at capacity.

It's also nice to be able to answer the "how do we launch more of these?" question with "just change the numbers on the asg". Lifecycle hooks are great too. I had an internal application launch requested in like a sub-30 node proof of concept configuration and then a followup issue to scale the application to 1024 nodes, the only thing I had to do was change some numbers and mark the ticket as done.

Complex subnet mutations are totally possible too: We filled up subnet X, but we need 100 more nodes, and we need you to not turn off the old nodes until we can shift traffic over to the new nodes in the bigger subnet. Just add the new subnet, change the number, wait for your folks to shift traffic, set the termination policy to OldestInstance, lower the number, wait for the ASG to terminate the instances, remove the old subnet, and then turn the number back up. You can do it all in maybe 30 minutes and only have to merge 2 pull requests into your terraform/cloudformation/whatever repository.

The best part is though, if you can't do it in 30 minutes, you just merge the first pull request and wait. Letting the ASG schedule instances for you works really well with declarative infrastructure management.

12 rats tied together
Sep 7, 2006

Cancelbot posted:

Launch templates + EC2/spot fleets are great too - "When I scale, I want whatever you have to fill the gap, ordered by cheapest/my preference". Most of the work is in instance provisioning anyway so it's worth spending that extra 10% of effort to put it into an ASG.

I'm really excited for EC2 fleets to get CloudFormation support, yeah. Having to use AWS::AutoScaling::AutoScalingGroup is a little confusing/intimidating for folks we hire that come from, say, a vmware background or something. It would be nice to be able to declare "I want X instances of types Y,Z with properties blah blah" in a template and have it be really obvious what the intent is.

One minor problem is I don't believe you can permanently suspend scaling processes with cloudformation alone. In ansible we just have the cloudformation module push the stack, the stack outputs contain the ASG name, and ansible uses that name with the ec2_asg module to manage the scaling processes. I believe terraform lets you statically set scaling processes on the asg resource, which I will begrudgingly admit is rather nice.
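Sketched out, that flow is something like this -- the stack name, template path, and output name are made up:
code:
- name: push the stack that owns the ASG
  cloudformation:
    stack_name: "myapp-{{ env_name }}"
    state: present
    template: "templates/myapp.yaml"
  register: stack

- name: permanently suspend scaling processes on the resulting group
  ec2_asg:
    name: "{{ stack.stack_outputs.AsgName }}"
    suspend_processes:
      - ReplaceUnhealthy
      - AZRebalance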

Votlook posted:

Yeah I've used a setup like that with pre baked AMI's, on update we would just configure the launch configuration to use the new AMI, and the ASG would take care of replacing the old instance with the new instance. Very convenient.

We have a couple scaling groups that use cloud-init's phone_home directive to reach out to ansible AWX, which allows us to do some interesting things like rolling updates to the entire ASG every time a new node joins, or "restart the brokers for this cluster, one at a time, wait for the prometheus under-replicated partition metric to hit 0 after restarting, restart the ActiveController broker last, push a grafana annotation when we start and another when we finish".

That particular example ended up not being necessary though. Most of the time it's just "add the new node to some config files on the other nodes, restart services on them in small batches, maybe make some api calls to register yourself with other systems".

12 rats tied together
Sep 7, 2006

I assume you're talking about terraform but it is my experience that you're making a mostly untrue statement, although to be fair I stopped using terraform sometime in the middle of 0.11.

I could write you a small book about why I use cfn instead of terraform but I'll try to summarize briefly:
- terraform has no real templating support, the tool is extremely WET, which is unacceptable in my opinion
- terraform plan will straight up lie to you and is, in general, not very useful (especially compared to CloudFormation changesets)
- terraform is extremely slow once you push past ~100 resources, much slower than cloudformation would be with a similar set
- multi-state management in terraform is extremely cumbersome compared to what you can accomplish with standard CloudFormation
- I'm having a hard time communicating how much more useful the "stack" is as an api primitive, compared to a terraform state file, again especially in large environments

This is aside from the obvious problems with terraform that they've been attempting to address in 0.12 with the full rewrite of HCL: that ternary expressions are not lazily evaluated, looping is extremely limited, the "sub block" property of a terraform resource is basically unusable.

Terraform is totally fine to use for limited scope environments. Single applications, maybe a small collection of resources. The syntax is reasonably intuitive, and "count" is (while limited in functionality) a great piece of common language you can give to a development team who might not be familiar with using a standard templating language on resource declarations. I think we manage somewhere around 200 unique terraform state environments at my current employer, and it works fine if you are working in (at most) 3 or 4 of them. Your parent stack, their parent stack, any siblings to your stack, and maybe you have some child objects. That's all totally fine -- I am strongly of the opinion that it's a great tool if you are a consumer of some kind of PaaS facsimile.

Managing the parents and grandparents, especially if they need to contain multiple provider and region combos, is a complete disaster. To try and briefly summarize again: the value proposition of terraform is that you just make declarations about your desired state and terraform handles the dependencies and ordering of tasks in order to reach that state. Terraform encourages you to break your states apart into small, isolated states and then link them together with the remote state data source. Terraform has no native mechanisms for declaring dependencies or task ordering across state boundaries.

The realities of actually using the tool in production, in a reasonably complex environment, directly counter the benefits of the tool. You don't have to manage "security group comes before instance", but you do have to manage "AWS comes before AWS network comes before standardized security groups comes before instance of module application comes before module children", and you have to do the plan -> review -> apply loop for each layer of inheritance one at a time, because you can't accurately terraform plan a change in child 4 until children 1, 2, and 3 have been successfully planned and applied. Your work tree of managing the "plan -> review -> apply" loop explodes exponentially as you add parents, children, and especially children at similar levels of inheritance that need to check each other for outputs.

12 rats tied together
Sep 7, 2006

The changes to attributes of undefined j2 objects are going to clean up a lot of bullshit; the changes to the k8s set of modules are also really nice.

e: the porting guide for a release is usually a better reference than the full changelog.

12 rats tied together
Sep 7, 2006

necrobobsledder posted:

Also, nothing like Salt Reactor built into Ansible either and it’s been handy from time to time.

Thanks for the note, this is a really cool feature.

I have a PoC running at work of cloudwatch events -> awx which is similar, but it requires you to run awx and have a cloudwatch events bus. Considering awx a part of ansible gives it the ability to compete with something like salt, puppet, or chef, but it also increases the setup burden quite a bit, which undercuts what is generally considered one of the huge advantages of ansible. You still don't have to install an agent though.

This ansible release, though, is only good things -- there are no breaks in functionality, only the removal of some features that have been deprecated for a bunch of versions now. The only time ansible releases have given me trouble was 1.9.x to 2.x; otherwise they have all been really painless.

12 rats tied together
Sep 7, 2006

The dependency chain that you posted is a little strange, one of the nice things about working in AWS is that a lot of these things are decoupled from each other. I usually see people have an AMI build process that is totally separate from their instance provisioning process, for example, and you don't need to recreate your ELB and ASG every time you recreate an instance.

I think a more typical workflow would be that you'd have one terraform state file that manages your "create ami from instance snapshot" setup. You can use the aws_instances data source to query your ASG for any of the instances in it, assuming they are identical, and then use the returned instance id as a source for your aws_ami_from_instance resource; in this state file you'd probably also want to have an output for your created ami's id.

Back in your main state file, you can consume the output from your ami management state file by using the terraform_remote_state data source, use the ami id as an input for your autoscaling group's launch configuration.

To "roll a new ami" you would have to do something like:
    - terraform taint your aws_ami_from_instance resource in your ami state
    - terraform apply to grab an instance from your ASG and create a new ami from it
    - terraform get -update in your main state to pick up the newly output ami id from the ami state
    - terraform apply in your main state to apply your new ami id to your configuration

The only changes that should need to happen for this are a replacement of the launch configuration, and an update (no replacement) to the ASG to configure it to use the new launch configuration. The last time I worked with terraform and ASGs, you had to roll new instances on the group yourself; there was no analogue to CloudFormation's UpdatePolicy ASG attribute.

You can probably do something with a local-exec creation provisioner on your launch configuration resource that changes your ASG's desired capacity to 0, waits for your nodes to disappear, and then sets it back up to whatever you configured in terraform.

e: added links

12 rats tied together fucked around with this message at 18:09 on May 23, 2019

12 rats tied together
Sep 7, 2006

I'm sorry -- I didn't read your post closely enough and was too focused on the weirdness about recreating the ELB and ASG resources. It's more likely that your random id resource is not being deleted and recreated when you taint -- I believe that the random provider has a "keepers" concept for this use case.

Generally you want to let terraform uniquely name everything it can but it seems like the aws_ami_from_instance resource is weird in this case as it does not have a name_prefix attribute and it _requires_ that you specify a name. You can also probably taint your random id resource along with the instance, if you don't want to use keepers.

12 rats tied together
Sep 7, 2006

This comes up a lot with terraform but it runs into the (recently very common) trap of trying to provide a purely declarative interface, which works great until something doesn't behave as expected, and then you need to wrap your mind around where your expectations differed from terraform's and then unroll your probably somewhat complicated declarations into something terraform can manage intelligently.

In my experience you split up your state files for primarily 2 reasons. One, terraform tells you to, and it takes like 15+ minutes to run a terraform plan with >300 resources, which is barely acceptable. Two, having two different terraform states gives you a really obvious and explicit "this, then that" interface, which you can use to isolate dependencies or provide some kind of inheritance chain or similar logical (ideally intuitive) topology.

The example of "create an ami" and "use an ami" is a really good example of a producer state and a consumer state, identifying which parts of your terraform stack produce shared resources vs which parts of your stack consume those resources is a great first step towards taking your terraform out of "can tweak the examples" territory into "can talk about terraform in a job interview" territory.

After splitting up producers vs consumers you could also think about how you'd organize repository growth: what if you needed to build a second type of ami, where would that go? What if you needed another ASG that used the same ami? What if you needed to build both amis and ASGs in a new AWS region, or a different AWS account?

A feature in 0.12 that I'm really excited about is the ability to pass an entire module to another module:
code:
module "consul_cluster" {
  source = "./modules/aws-consul-cluster"

  network = module.network
}
I definitely recommend you experiment with this feature asap, doing this was indescribably worse for the past ~3 years.

The last pending question for me was, given the linear-to-exponential growth of terraform state folders, how do you manage the ordering and execution of apply actions across them? That's pretty much when I gave up on the tool and switched to ansible + cloudformation, but you could totally still use ansible as an orchestrator for your terraform states as well. There are also a couple of third party tools I'm aware of such as terragrunt and pulumi which might have answers for you here too.

12 rats tied together
Sep 7, 2006

Most 3rd party services that want access to your amazon account should really be asking for role arns and the ability to sts:AssumeRole that role. You can add an extra piece of "security" here by requiring that the role assumer pass an external id, they go into detail on this here (especially note the "confused deputy problem" diagram): https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user_externalid.html

Broadly, giving external entities access to your AWS account is a totally safe, expected, and supported thing to do. You still have to be smart about it, of course, you shouldn't give CircleCI "delete my only vpc" permissions or anything like that, but properly integrating your account with 3rd party services is just an operational reality of using AWS in 2019. Most services will have a documentation page that goes over the API calls they need access to (in general, I wouldn't approve a service that does not have this). You should cross reference the api calls they need with the context and condition keys documentation for the related service to see if you can further limit their permissions in some way.

For a really bad, do-not-actually-do-this example: in the chart at https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonec2.html you can see that one of the available conditions for TerminateInstances is "ec2:Region", so you could put all of your CircleCI managed infrastructure in us-east-2, for example, and then only allow the CircleCI service to terminate instances in us-east-2. This way you can be sure that no CircleCI code bug will accidentally delete something critical of yours in us-east-1, but the service role itself will hum along happily as if nothing were different.
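Purely to illustrate the shape of both pieces (the role name, account id, external id, and region lock are all placeholders), in CloudFormation-style YAML it's roughly:
code:
ThirdPartyRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Action: sts:AssumeRole
          Principal:
            AWS: arn:aws:iam::111122223333:root   # the 3rd party's account
          Condition:
            StringEquals:
              sts:ExternalId: "their-generated-external-id"
    Policies:
      - PolicyName: limited-ec2
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action: ec2:TerminateInstances
              Resource: "*"
              Condition:
                StringEquals:
                  ec2:Region: us-east-2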

More realistically, for things that you cannot ever tolerate going down, you should be setting appropriate termination protection and update policies, and ideally propagating tags to those resources through something like CloudFormation. Most api calls will let you condition on tag key/tag value in some way, so you can be like "you can terminate any ec2 instance except those marked MoneyFirehose".

12 rats tied together
Sep 7, 2006

I don't do a ton of webserver work anymore but a common approach I see to atomic static content is to keep the last n releases as local directories, have your application read content from a "latest" symlink, and then update the symlink as you deploy. Pretty sure capistrano will do this for you, for example, if you have a rails app. You could do something similar with docker applications by building volumes-per-release and treating your container's volume spec the same as you would normally treat a symlink I guess? That's probably where I would start, anyway.
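A minimal ansible sketch of the symlink flavor, with made-up paths:
code:
- name: create the release directory
  file:
    path: "/srv/www/releases/{{ release_version }}"
    state: directory

- name: unpack the release into its own directory
  unarchive:
    src: "artifacts/site-{{ release_version }}.tar.gz"
    dest: "/srv/www/releases/{{ release_version }}"

- name: flip 'latest' to the new release
  file:
    src: "/srv/www/releases/{{ release_version }}"
    dest: /srv/www/latest
    state: link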

Docker deploys: it depends on what you're scheduling containers with and the rest of their operational context. I've worked with containerized applications where we just ssh -> change compose.yaml -> docker-compose down -> docker-compose up on production deploys. If your containers just serve web traffic you can lean heavily on load balancer health checks; if your containers are task workers or cluster members you might need to restart them in some type of sequence or batched sequence, etc. For example, if you have containerized kafka brokers then you might want to restart the current ActiveController last, if you have a zookeeper cluster you want to do a rolling restart in a way that means you still have a quorum throughout the deployment, etc.

Ansible is really good at handling batch updates or other types of complex deployment logic, it's what I'm currently pushing for at my employer and it handles servers as well as container orchestration, but it's a pretty heavyweight solution unless you're already using it elsewhere or have other ansible experience. You would run AWX (like jenkins, but with ansible), embed api credentials in your gitlab runner jobs, and then use ansible to describe your deployment logic and configure your repo to "run job `production deploy for $app` on merge to master".
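The batching part is mostly just the serial keyword plus whatever health check makes sense for the app; the service name and health endpoint here are made up:
code:
- hosts: task_workers
  serial: "20%"             # one batch at a time, 20% of the group
  tasks:
    - name: restart the worker service on this batch
      service:
        name: myapp-worker
        state: restarted

    - name: wait for this batch to report healthy before moving on
      uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"
      register: health
      until: health.status == 200
      retries: 12
      delay: 10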

A more standard recommendation would be that your scheduler probably has some primitives for declaratively managing certain application requirements. If you use k8s, probably a good place to start reading is about pod disruption budgets and deployment strategies. If you're using docker swarm, I have less experience here but I believe service update behavior is a similarly good place to start.

These are both really similar to AWS AutoScaling UpdatePolicies if you've done any work with them, and is in my experience a very common thing to be working on/thinking about in 2019. Generally in this scenario your runner, or deploy script run by your runner, is just going to authenticate to your scheduler and run some basic cli commands. It's hard to go into more detail, IMO, without a specific description of the application and environment.

12 rats tied together
Sep 7, 2006

I haven't done a lot of work with operators yet (although "ansible as a k8s operator" is a hot topic right now in ansible) but I've always found the procedural playbook approach easier to deal with than trying to wrap deployment logic in some kind of declarative state manager.

Almost certainly this is because I've spent the past ~6 years working heavily with ansible, combined with a general distaste for any type of polling or watching, but I think investing in robust deployment logic is superior to the operator's definition -> watch -> reconcile pattern. It also seems like no matter what you do in k8s you end up writing a bunch of yaml, so if I'm going to do that anyway I'd really rather just write the playbook.

I haven't yet encountered a compelling argument for, for example, baking some kind of partition-aware rolling update logic into a kafka broker operator instead of just having the kafka broker playbook be partition-aware and perform rolling updates, especially considering that writing the operator means maintaining code, and given the choice I'd pick maintaining yaml over maintaining golang.

12 rats tied together
Sep 7, 2006

Since you're already in AWS I would recommend SSM parameters or Secrets Manager secrets.

If neither service directly ties into beanstalk, you can always pass IAM credentials to your beanstalk servers and then write a simple startup script that pulls and decrypts your secrets before launching your application.

Regarding your nginx settings, I'm not really sure -- I don't spend a lot of time in beanstalk, or working with php, and when I do I usually just use an ELB.
