|
It's not K8S stuff but I wanted to bring up this news for those of us who are a little more on the hosted-services side: self-hosted runners for GitHub Actions are now in beta.
|
# ? Nov 7, 2019 18:53 |
|
|
# ? Jun 5, 2024 03:50 |
|
crazysim posted:It's not K8S stuff but I wanted to bring up this news for those of us who are a little more on the hosted services side. It's worth noting that "GitHub Actions" are the exact same thing as Azure DevOps build/release pipelines under the hood.
|
# ? Nov 7, 2019 19:59 |
|
New Yorp New Yorp posted:It's worth noting that "GitHub Actions" are the exact same thing as Azure DevOps build/release pipelines under the hood. I like the pricing difference for self-hosted runners running on private repositories: $0/mo, versus $15/mo on Azure DevOps.
|
# ? Nov 7, 2019 21:46 |
|
crazysim posted:I like the pricing difference for self hosted runners running on private repositories: $0/mo. $15/mo on Azure DevOps. The GitHub Actions page seems to imply that will change after the beta.
|
# ? Nov 7, 2019 23:05 |
|
So I’ve seen a bunch of stuff that tries to tie those AWS primitives together and most of it is dog poo poo; even if it was “perfect” at release, it starts to atrophy almost immediately. I use k8s on top of AWS so I don’t need to spend the time building that or have a large team maintain it. Tying poo poo together in AWS with some sort of custom ASG manager at this point is just an exercise in pissing away engineering time. With k8s I can have someone scratch an itch if we really have one and get it upstream, and its self-healing primitives around apps are way past anything AWS offers. Edit: I have two of those wrappers around ASG scaling still kicking around and they are a nightmare to deal with. K8s isn’t perfect, but people way smarter than me put a fair bit of thought into some really common problems and produced some amazing tooling.
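For anyone who hasn't seen them, the "self-healing primitives" being talked about here are mostly just declared desired state plus health checks. A minimal sketch (name, image, and probe path are all made up):

```yaml
# Hypothetical Deployment: if a container crashes or its liveness probe
# fails, the kubelet restarts it; the controller keeps 3 replicas running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp                        # made-up name
spec:
  replicas: 3
  selector:
    matchLabels: {app: webapp}
  template:
    metadata:
      labels: {app: webapp}
    spec:
      containers:
      - name: webapp
        image: example/webapp:1.0     # made-up image
        livenessProbe:
          httpGet: {path: /healthz, port: 8080}
          initialDelaySeconds: 10
```

Getting the same restart/replace-on-failure behavior out of raw ASGs and health checks is exactly the kind of custom glue being complained about above.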
|
# ? Nov 8, 2019 01:39 |
|
12 rats tied together posted:cloud ops teams are generally pretty inexperienced and unskilled. It's hard to build good abstractions out of AWS primitives, so we just run k8s and developers can post manifests with ELB labels. It's hard to run real service discovery, so we just use k8s and developers can use the /services endpoint. 12 rats tied together posted:It's hard to manage secrets so we use k8s but everyone has ssh to the boxes anyway so nothing is actually secured, and we don't bother with namespaces or federated secrets so we have the same secret object all over the place, poo poo like that. 12 rats tied together posted:Usually someone on a noble but ultimately misguided journey has also spun up a hashicorp vault server somewhere too which has some fraction of your secrets on it for no other reason than they had a ticket that said "try out hashicorp vault". Please don't post things this real 12 rats tied together posted:"How do I maintain overhead capacity on my application that scales based on demand" is a fairly solved problem these days and I'm mad and sad that it's giving you any trouble at all.
|
# ? Nov 8, 2019 02:29 |
|
freeasinbeer posted:So I’ve seen a bunch of stuff that tries to tie those AWS primitives together and most are dog poo poo, even if it was “perfect” at release it starts to atrophy almost immediately. I use k8s on top of AWS so I don’t need to spend the time building that, and having a large team maintain it. Tying poo poo together in AWS with some sort of custom ASG manager at this point is just an exercise in pissing away engineering time.
|
# ? Nov 9, 2019 15:08 |
|
It's sad that literally every CI/CD talk I've been to involving Jenkins tooling ultimately becomes a group therapy session. Always the most packed talk among the other subject areas, always the most well received. Maybe if I went to a CloudBees conference it'd be different.
|
# ? Nov 9, 2019 15:15 |
|
Had a really weird issue building an image recently; I've never seen anything like it. It was a basic Spring Boot app. The container uses a base image someone else created with Java bundled in; in our Dockerfile we copy down a tarball with our app's binaries and unzip it, then remove an older dependency and replace it with a newer one we pull down from our artifact repository. If we unzip the tarball, remove the dependency, and download the new one in the same layer, there are issues starting the embedded Tomcat server (I'll talk more about that later), but if we download the updated dependency in a later layer, it works fine. The observed issue with the image is that the developer could run it in Docker on their laptop just fine, but when they deployed it to Kubernetes, the embedded Tomcat server wouldn't start, with an error stating it was unable to call a method from the updated dependency. It gets a little odder: on my local machine I could also execute the image just fine, but another engineer was seeing the same behavior we were observing in our Kubernetes cluster. We verified the image hash we all had was the same, and we were all running the same version of Docker on Mac. Has anyone seen anything like that before? I'm at a loss as to what could cause behavior like that.
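For anyone following along, the two layouts in question look roughly like this (base image, paths, and artifact URLs are all invented; this is a sketch, not our actual Dockerfile). The first RUN collapses unpack, remove, and replace into a single layer; the second splits the dependency swap into its own layer:

```dockerfile
# Hypothetical base image with Java bundled in
FROM internal/java-base:11

COPY app.tar.gz /tmp/

# Variant 1: unpack, remove, and replace in a single layer --
# this is the layout that produced the broken Tomcat startup.
RUN tar -xzf /tmp/app.tar.gz -C /opt/app \
 && rm /opt/app/lib/old-dep-1.0.jar \
 && curl -fSo /opt/app/lib/new-dep-2.0.jar \
      https://artifacts.example.com/new-dep-2.0.jar

# Variant 2: doing the download in a separate, later layer worked fine.
# RUN tar -xzf /tmp/app.tar.gz -C /opt/app \
#  && rm /opt/app/lib/old-dep-1.0.jar
# RUN curl -fSo /opt/app/lib/new-dep-2.0.jar \
#       https://artifacts.example.com/new-dep-2.0.jar
```

In theory the filesystem contents should be identical either way, which is what makes the behavior difference so strange.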
|
# ? Nov 13, 2019 05:21 |
|
Maybe file permissions? Some k8s clusters are set up to randomize the user/group that the container starts with.
|
# ? Nov 13, 2019 08:13 |
|
minato posted:Maybe file permissions? Some k8s clusters are setup to randomize the user/group that the container starts with. OpenShift forces this on each pod it runs. Mainline Kubernetes doesn't support it yet.
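If you want to test for this class of bug, you can approximate what OpenShift assigns with an explicit pod securityContext (the UID below is arbitrary by design; this one is made up):

```yaml
# Approximation of OpenShift's arbitrary-UID assignment: a random high
# UID that does not exist in /etc/passwd, with group 0.
apiVersion: v1
kind: Pod
metadata:
  name: uid-test                    # made-up name
spec:
  securityContext:
    runAsUser: 1000620000           # arbitrary high UID
    runAsGroup: 0
  containers:
  - name: app
    image: example/app:latest       # made-up image
```

On a laptop, `docker run --user 1000620000:0 example/app:latest` gives a similar effect, which makes it easy to check whether your image's file permissions survive running as a non-root, passwd-less user.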
|
# ? Nov 14, 2019 01:52 |
|
Quick question for anyone running Terraform in a CI/CD pipeline - is the ultimate goal to have the `terraform apply` command run automatically, or do you just want to validate your code and formatting to get the final ok before applying the changes manually? I'm using Travis, and the goal is to have changes propagate across three different AWS accounts. I think I have a working solution but I'm curious as to what the consensus best practice is here.
|
# ? Nov 14, 2019 15:42 |
|
Necronomicon posted:Quick question for anyone running Terraform in a CI/CD pipeline - is the ultimate goal to have the `terraform apply` command run automatically, or do you just want to validate your code and formatting to get the final ok before applying the changes manually? I'm using Travis, and the goal is to have changes propagate across three different AWS accounts. I think I have a working solution but I'm curious as to what the consensus best practice is here. The ultimate goal of continuous delivery is to never do anything manually. Doing things manually is where mistakes and human error creep in. See: Knight Capital losing 500 million dollars in 45 minutes due to manual processes.
|
# ? Nov 14, 2019 18:20 |
|
Don't run terraform without a human reading the plan diff first.
|
# ? Nov 14, 2019 18:22 |
Methanar posted:Don't run terraform without a human reading the plan diff first. In production, sure. For the testing environment though, we have automated terraform plan/apply several times a day.
|
|
# ? Nov 14, 2019 19:29 |
|
Methanar posted:Don't run terraform without a human reading the plan diff first. That seems really backwards to me. Why would you not want infrastructure changes to be automatically applied? People shouldn't be manually changing poo poo and your lower environments should be production-like, so there should be no surprises.
|
# ? Nov 14, 2019 20:52 |
|
The problem with that sentence is that "should" has to be bolded, underlined, and in 24 point font
|
# ? Nov 14, 2019 20:56 |
|
Bhodi posted:The problem with that sentence is that "should" has to be bolded, underlined, and in 24 point font Well, yes. But that's also a maturity thing that's achievable. The solution to the problem of "things might be out of sync with our infrastructure-as-code provider" shouldn't be "let's manually validate all the changes it's going to make before we run it", because that still leaves a huge manual error gap.
|
# ? Nov 14, 2019 22:20 |
|
New Yorp New Yorp posted:Well, yes. But that's also a maturity thing that's achievable. The solution to the problem of "things might be out of sync with our infrastructure-as-code provider" shouldn't be "let's manually validate all the changes it's going to make before we run it", because that still leaves a huge manual error gap. The goal isn't to have everything execute or deploy on git push for the sake of doing so. The goal is to remove human error where possible. Not having a human sanity-check a terraform plan output introduces human error. It's not that unthinkable that a mistake can be made that removes some middle dependency, which results in removing a bunch of SGs from things you don't want it to, or that has an incorrect string interpolation somewhere.
|
# ? Nov 14, 2019 22:44 |
|
FWIW I implemented auto-apply, but with some safeguards in place. 1. Travis runs init, validate, fmt, and plan on all branches that get pushed; checks will fail if your code has errors. 2. The git repos holding our Terraform code (we have a few different ones for different sections of AWS) have their master branches protected by the three of us who make up the devops team. 3. Travis only runs apply on branches that get merged into master, which requires approval by at least one of the devops team. So you still have a safeguard in place. The main issue I ran into was a lack of clarity into how AWS IAM roles and Travis interacted with each other. I ran into a lot of errors until I added an "assume_role" block to the provider definition, assuming a role that had access to the state bucket. The thing I am pretty happy about is how portable this solution is - I have a bash script with a little one-liner that finds all directories containing Terraform code (disregarding generic modules) and ignores everything else.
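The one-liner in question is roughly this (the excluded directory name is a guess; ours skips a shared modules/ tree):

```shell
#!/bin/sh
# List every directory that contains .tf files, skipping the shared
# modules/ tree, so CI can run plan/apply in each one.
find . -path ./modules -prune -o -name '*.tf' -print \
  | xargs -n1 dirname | sort -u
```

CI then loops over that list and runs terraform in each directory.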
|
# ? Nov 14, 2019 22:47 |
|
Methanar posted:The goal isn't to have everything execute or deploy run on git push for the sake of doing so. That's why you have production-like lower environments that things are tested against. The goal to work toward is making sure there's never a question of "is this thing that's being done to our production environment going to do the right thing?" I don't do a lot of work with Terraform, but I've been using ARM templates for years, and yes, people introduce errors into ARM templates occasionally, which are caught in lower environments or during integration testing. I'm honestly surprised that it's a common practice in Terraform land to hold up deployments while someone manually verifies output from a Terraform command.
|
# ? Nov 14, 2019 23:02 |
|
Part of the issue with terraform is that it's not easy to recover if your state file gets deleted/corrupted. Which is not something I've had happen to myself, but have heard of it happening enough times that I'm wary.
|
# ? Nov 14, 2019 23:11 |
|
The Fool posted:Part of the issue with terraform is that it's not easy to recover if your state file gets deleted/corrupted. Which is not something I've had happen to myself, but have heard of it happening enough times that I'm wary. I thought that you could import resources into a state file, and beyond that you're supposed to use Terraform Enterprise to track state if you're serious about using Terraform.
|
# ? Nov 14, 2019 23:13 |
|
Methanar posted:The goal isn't to have everything execute or deploy run on git push for the sake of doing so. There's also a compelling argument to be made that if you have ad-hoc composition of global security groups on Terraform-managed resources, you're doing Terraform wrong. Keep your SGs tightly scoped and close to the resources they're applied to and you avoid a big rat's nest of untracked dependencies. (Yeah, yeah, real world.) Vulture Culture fucked around with this message at 00:31 on Nov 15, 2019 |
# ? Nov 15, 2019 00:27 |
|
The Fool posted:Part of the issue with terraform is that it's not easy to recover if your state file gets deleted/corrupted. Which is not something I've had happen to myself, but have heard of it happening enough times that I'm wary. One extremely low-lift way to resolve this is to use remote state with S3 and object versioning on the bucket.
|
# ? Nov 15, 2019 00:30 |
|
Vulture Culture posted:One extremely low-lift way to resolve this is to use remote state with S3 and object versioning on the bucket.
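For reference, the remote-state setup being described is only a few lines of backend config (bucket and table names below are made up); object versioning is enabled on the S3 bucket itself, not in this block:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"       # versioned S3 bucket (hypothetical)
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-tf-locks"       # optional state locking (hypothetical)
  }
}
```

With versioning on, a clobbered state file can usually be rolled back to a prior object version from the S3 console or CLI.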
|
# ? Nov 15, 2019 01:33 |
Vulture Culture posted:One extremely low-lift way to resolve this is to use remote state with S3 and object versioning on the bucket. Yeah, we use remote state with S3 and object versioning on the bucket. It didn't help in the scenario I mentioned in my previous comment, though. The terraform apply run had already started modifying state and then failed partway through due to a network interruption, which also caused it not to be able to write the state back to the remote. In this scenario it's supposed to at least dump the state file to the local filesystem so you can do something with it, but it was either empty or non-existent, I can't remember. I googled around and found an open ticket on GitHub where others had run into the same issue. The old version of the state file that we still had in S3 was not useful to recover from.
|
|
# ? Nov 15, 2019 03:50 |
|
fletcher posted:In production, sure. For the testing environment though, we have automated terraform plan/apply several times a day. re: versioning - I have no idea how the hell I deleted an S3 object-versioned remote state file, but I had a reproducible test case when I was running a refresh against the wrong AWS account. I'd be OK with it creating a blank remote state version, but Terraform decided that I didn't need any remote state. I didn't apply a plan or even run an apply at all, and still managed to delete it.
|
# ? Nov 15, 2019 04:09 |
|
I haven't used Terraform or Jenkins specifically, but if you want a human to verify the plan output, could you put an approval step in your pipeline that requires a human to look at it and say "yes, this is good" before letting the pipeline do the automated deployment, rather than doing the whole deployment manually?
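Jenkins does support exactly this with its input step; a rough declarative-pipeline sketch (stage names and the plan-file name are invented, and a real setup would pin credentials and workspace handling):

```groovy
// Sketch of a plan -> human approval -> apply pipeline.
pipeline {
  agent any
  stages {
    stage('Plan') {
      steps {
        sh 'terraform init -input=false && terraform plan -out=tfplan'
      }
    }
    stage('Approve') {
      // Pipeline pauses here until a human clicks Proceed.
      steps { input message: 'Plan looks good -- apply it?' }
    }
    stage('Apply') {
      // Applying the saved plan file guarantees exactly the reviewed
      // changes are applied, not a fresh plan.
      steps { sh 'terraform apply -input=false tfplan' }
    }
  }
}
```

The `-out=tfplan` / `apply tfplan` pairing is the important part: the thing a human approved is the thing that gets applied.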
|
# ? Nov 15, 2019 16:53 |
|
FISHMANPET posted:I haven't used Terraform or Jenkins specifically but if you want a human to verify the plan output could you put an approval in your pipeline that requires a human to look at it and say "yes this is good" and then approve it and let the pipeline still do the automated deployment, rather than doing a whole deployment manually. Right, this is why I'm using git merge into master as the final approval process.
|
# ? Nov 15, 2019 20:32 |
|
So our setup (and a lot of other companies I've worked for): Git, Jenkins, Terraform. Workflow:
Anywho FISHMANPET posted:I haven't used Terraform or Jenkins specifically but if you want a human to verify the plan output could you put an approval in your pipeline that requires a human to look at it and say "yes this is good" and then approve it and let the pipeline still do the automated deployment, rather than doing a whole deployment manually. Highly recommend setting up environments via code and "deploying" via a CI/CD system. You get visibility on changes, you gate so that only "approved" things get pushed out to run, and you automate the process, taking away chances for errors when running things as one-off deploys... It is a bit of setup, but the Law of Threes suggests the more you deploy, the more time automation can save you.
|
# ? Nov 17, 2019 01:23 |
|
For those folks who have Jenkins or something else apply their Terraform changes, how do you handle cases where the apply fails because Terraform can't work out the correct order to create things, or you get rate limited, or your IAM role's (you're using roles, right?) temporary creds expire, or you hit a resource limit during the apply, or any number of other things cause a plan to succeed but an apply to fail? Our platform team has been a bit shy about the idea of automating applies, but I'd love to be able to do it if someone has a good answer for how to recover TF from a bad state that a machine put it in.
|
# ? Nov 17, 2019 01:37 |
|
Blinkz0rz posted:For those folks who have Jenkins or something else apply their Terraform changes, how do you handle cases where the apply fails because Terraform can't work out the correct order to create things, or if you get rate limited, or if your IAM role's (you're using roles, right?) temporary creds expire, or if you have a resource limit that you hit during the apply, or if really any number of other things that might cause a plan to succeed but an apply to fail? We wrap the terraform binary in a thin shell script so we can specify a retry for clouds that are #eventuallyconsistent. We also do other fun stuff with this script, such as a Slack hook/integration requiring 2FA chat authentication, which is pretty sweet. This lets us gate upper-env deployments by pausing the TF apply until someone responds/approves via the Slack app. Our script also runs terraform validate/plan prior to any apply or destroy operation. Plan output is attached to Jira/pull requests where applicable. I'm doing SRE in a highly regulated industry right now, fwiw. Also, one other thing: we literally use the same TF between environments. By the time we're deploying production, we've already run the same code apply multiple times with only our $ENVIRONMENT tfvar changing between deploys, which is part of the Jenkins/CI environment job declaration. e: also if you have the luxury, make your servers immutable and leverage load balancing. Our workflows are usually destroy => apply. Depends heavily on your workloads but this gives us a much higher predictability in deployed infrastructure Gyshall fucked around with this message at 02:42 on Nov 17, 2019 |
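A minimal version of that retry wrapper might look like this (ours does more -- 2FA gating, Slack hooks, validate/plan -- but the core is just a loop; the terraform example in the comment is hypothetical):

```shell
#!/bin/sh
# retry N DELAY CMD...: run CMD up to N times, sleeping DELAY seconds
# between attempts, for eventually-consistent cloud APIs.
retry() {
  n=$1; delay=$2; shift 2
  i=1
  while ! "$@"; do
    [ "$i" -ge "$n" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (hypothetical): retry 5 30 terraform apply -input=false tfplan
```

The function returns 0 as soon as the command succeeds and 1 once the attempt budget is exhausted, so CI can fail the build on persistent errors.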
# ? Nov 17, 2019 02:38 |
|
Gyshall posted:e: also if you have the luxury, make your servers immutable and leverage load balancing. Our workflows are usually destroy => apply. Depends heavily on your workloads but this gives us a much higher predictability in deployed infrastructure. If you can, run Docker containers in ECR/ECS or Kubernetes. This enforces the immutability and allows for some fun things like dynamically scaling based on load (if you use AWS there's a whole elastic scaling cluster option based on triggers with limits).
|
# ? Nov 17, 2019 02:47 |
|
True, but you're (probably, hopefully) not using Terraform for containers.
|
# ? Nov 18, 2019 00:32 |
|
Gyshall posted:True, but you're (probably, hopefully) not using Terraform for containers. No, true. Just automating the creation of roles, task definitions, etc. Typically just a docker build and calling java -c or python setup.py or what have you for packing the code itself.
|
# ? Nov 20, 2019 03:46 |
|
I don't know how to even begin deploying changes that affect over 100 developers worth of services.
|
# ? Nov 21, 2019 04:24 |
|
iirc on AWS you could add a scaling rule that terminates the oldest instance in a group on a regular interval, while another scaling rule starts new ones to maintain capacity is there an off-the-shelf way to do that in google cloud? especially with GKE node pools, and hell, why not kubernetes pods
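For the pod version of this on plain Kubernetes, one option is a scheduled rollout restart, which cycles a deployment's pods while respecting its update strategy. A sketch (names and schedule are invented, the ServiceAccount needs RBAC rights to patch deployments, and the apiVersion for CronJob varies with cluster version):

```yaml
# CronJob that restarts a deployment nightly, recycling all its pods.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-recycle                 # made-up name
spec:
  schedule: "0 4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: restarter # needs patch rights on deployments
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest  # image whose entrypoint is kubectl
            args: ["rollout", "restart", "deployment/webapp"]
```

For GKE nodes specifically, I'm not aware of a built-in oldest-first recycler like the AWS ASG rule described above, so people tend to lean on node auto-upgrade/auto-repair or something like this at the pod layer.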
|
# ? Nov 21, 2019 06:29 |
|
Doc Hawkins posted:iirc on AWS you could add a scaling rule that terminates the oldest instance in a group on a regular interval, while another scaling rule starts new ones to maintain capacity Just feed your chaos monkey some bath salts.
|
# ? Nov 21, 2019 07:19 |
|
|
|
Methanar posted:I don't know how to even begin deploying changes that affect over 100 developers worth of services. Methanar posted:Just feed your chaos monkey some bath salts. You have just answered your own question
|
# ? Nov 21, 2019 09:18 |