|
It's not K8S stuff but I wanted to bring up this news for those of us who are a little more on the hosted-services side: self-hosted runners for GitHub Actions are now in beta.
|
# ? Nov 7, 2019 18:53 |
|
|
# ? Jun 5, 2024 03:50 |
|
crazysim posted:It's not K8S stuff but I wanted to bring up this news for those of us who are a little more on the hosted services side. It's worth noting that "GitHub Actions" are the exact same thing as Azure DevOps build/release pipelines under the hood.
|
# ? Nov 7, 2019 19:59 |
|
New Yorp New Yorp posted:It's worth noting that "GitHub Actions" are the exact same thing as Azure DevOps build/release pipelines under the hood. I like the pricing difference for self-hosted runners running on private repositories: $0/mo, versus $15/mo on Azure DevOps.
|
# ? Nov 7, 2019 21:46 |
|
crazysim posted:I like the pricing difference for self hosted runners running on private repositories: $0/mo. $15/mo on Azure DevOps. The GitHub Actions page seems to imply that will change after the beta.
|
# ? Nov 7, 2019 23:05 |
|
So I’ve seen a bunch of stuff that tries to tie those AWS primitives together and most of it is dog poo poo; even if it was “perfect” at release, it starts to atrophy almost immediately. I use k8s on top of AWS so I don’t need to spend the time building that or have a large team maintain it. Tying poo poo together in AWS with some sort of custom ASG manager at this point is just an exercise in pissing away engineering time. With k8s I can have someone scratch an itch if we really have one and get it upstream, and its self-healing primitives around apps are way past anything AWS offers. Edit: I have two of those wrappers around ASG scaling still kicking around and they are a nightmare to deal with. K8s isn’t perfect, but people way smarter than me put a fair bit of thought into some really common problems and produced some amazing tooling.
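For anyone who hasn't seen them, the "self-healing primitives" being talked about here are mostly just declared desired state plus health checks. A minimal sketch (name, image, and probe path are all made up):

```yaml
# Hypothetical Deployment: if a container crashes or its liveness probe
# fails, the kubelet restarts it; the controller keeps 3 replicas running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp                        # made-up name
spec:
  replicas: 3
  selector:
    matchLabels: {app: webapp}
  template:
    metadata:
      labels: {app: webapp}
    spec:
      containers:
      - name: webapp
        image: example/webapp:1.0     # made-up image
        livenessProbe:
          httpGet: {path: /healthz, port: 8080}
          initialDelaySeconds: 10
```

Getting the same restart/replace-on-failure behavior out of raw ASGs and health checks is exactly the kind of custom glue being complained about above.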
|
# ? Nov 8, 2019 01:39 |
|
12 rats tied together posted:cloud ops teams are generally pretty inexperienced and unskilled. It's hard to build good abstractions out of AWS primitives, so we just run k8s and developers can post manifests with ELB labels. It's hard to run real service discovery, so we just use k8s and developers can use the /services endpoint. 12 rats tied together posted:It's hard to manage secrets so we use k8s but everyone has ssh to the boxes anyway so nothing is actually secured, and we don't bother with namespaces or federated secrets so we have the same secret object all over the place, poo poo like that. 12 rats tied together posted:Usually someone on a noble but ultimately misguided journey has also spun up a hashicorp vault server somewhere too which has some fraction of your secrets on it for no other reason than they had a ticket that said "try out hashicorp vault". Please don't post things this real 12 rats tied together posted:"How do I maintain overhead capacity on my application that scales based on demand" is a fairly solved problem these days and I'm mad and sad that it's giving you any trouble at all.
|
# ? Nov 8, 2019 02:29 |
|
freeasinbeer posted:So I’ve seen a bunch of stuff that tries to tie those AWS primitives together and most are dog poo poo, even if it was “perfect” at release it starts to atrophy almost immediately. I use k8s on top of AWS so I don’t need to spend the time building that, and having a large team maintain it. Tying poo poo together in AWS with some sort of custom ASG manager at this point is just an exercise in pissing away engineering time.
|
# ? Nov 9, 2019 15:08 |
|
It's sad that literally every CI/CD talk I've been to involving Jenkins tooling ultimately becomes a group therapy session. Always the most packed talk among the other subject areas, always the most well received. Maybe if I went to a CloudBees conference it'd be different.
|
# ? Nov 9, 2019 15:15 |
|
Had a really weird issue building an image recently; I've never seen anything like it. It was a basic Spring Boot app. The container uses a base image someone else created with Java bundled in; in our Dockerfile we copy down a tarball with our app's binaries and unzip it, then remove an older dependency and replace it with a newer one we pull down from our artifact repository. If we unzip the tarball, remove the dependency, and download the new one in the same layer, there are issues starting the embedded Tomcat server (I'll talk more about that later), but if we download the updated dependency in a later layer, it works fine. The observed issue with the image is that the developer could run it in Docker on their laptop just fine, but when they deployed it to Kubernetes, the embedded Tomcat server wouldn't start, with an error stating it was unable to call a method from the updated dependency. It gets a little odder: on my local machine I could also execute the image just fine, but another engineer was seeing the same behavior we were observing in our Kubernetes cluster. We verified the image hash we all had was the same, and we were all running the same version of Docker on Mac. Has anyone seen anything like that before? I'm at a loss as to what could cause behavior like that.
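For anyone following along, the two layouts in question look roughly like this (base image, paths, and artifact URLs are all invented; this is a sketch, not our actual Dockerfile). The first RUN collapses unpack, remove, and replace into a single layer; the second splits the dependency swap into its own layer:

```dockerfile
# Hypothetical base image with Java bundled in
FROM internal/java-base:11

COPY app.tar.gz /tmp/

# Variant 1: unpack, remove, and replace in a single layer --
# this is the layout that produced the broken Tomcat startup.
RUN tar -xzf /tmp/app.tar.gz -C /opt/app \
 && rm /opt/app/lib/old-dep-1.0.jar \
 && curl -fSo /opt/app/lib/new-dep-2.0.jar \
      https://artifacts.example.com/new-dep-2.0.jar

# Variant 2: doing the download in a separate, later layer worked fine.
# RUN tar -xzf /tmp/app.tar.gz -C /opt/app \
#  && rm /opt/app/lib/old-dep-1.0.jar
# RUN curl -fSo /opt/app/lib/new-dep-2.0.jar \
#       https://artifacts.example.com/new-dep-2.0.jar
```

In theory the filesystem contents should be identical either way, which is what makes the behavior difference so strange.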
|
# ? Nov 13, 2019 05:21 |
|
Maybe file permissions? Some k8s clusters are set up to randomize the user/group that the container starts with.
|
# ? Nov 13, 2019 08:13 |
|
minato posted:Maybe file permissions? Some k8s clusters are setup to randomize the user/group that the container starts with. OpenShift forces this on each pod it runs. Mainline Kubernetes doesn't support it yet.
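If you want to test for this class of bug, you can approximate what OpenShift assigns with an explicit pod securityContext (the UID below is arbitrary by design; this one is made up):

```yaml
# Approximation of OpenShift's arbitrary-UID assignment: a random high
# UID that does not exist in /etc/passwd, with group 0.
apiVersion: v1
kind: Pod
metadata:
  name: uid-test                    # made-up name
spec:
  securityContext:
    runAsUser: 1000620000           # arbitrary high UID
    runAsGroup: 0
  containers:
  - name: app
    image: example/app:latest       # made-up image
```

On a laptop, `docker run --user 1000620000:0 example/app:latest` gives a similar effect, which makes it easy to check whether your image's file permissions survive running as a non-root, passwd-less user.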
|
# ? Nov 14, 2019 01:52 |
|
Quick question for anyone running Terraform in a CI/CD pipeline - is the ultimate goal to have the `terraform apply` command run automatically, or do you just want to validate your code and formatting to get the final ok before applying the changes manually? I'm using Travis, and the goal is to have changes propagate across three different AWS accounts. I think I have a working solution but I'm curious as to what the consensus best practice is here.
|
# ? Nov 14, 2019 15:42 |
|
Necronomicon posted:Quick question for anyone running Terraform in a CI/CD pipeline - is the ultimate goal to have the `terraform apply` command run automatically, or do you just want to validate your code and formatting to get the final ok before applying the changes manually? I'm using Travis, and the goal is to have changes propagate across three different AWS accounts. I think I have a working solution but I'm curious as to what the consensus best practice is here. The ultimate goal of continuous delivery is to never do anything manually. Doing things manually is where mistakes and human error creep in. See: Knight Capital losing 500 million dollars in 45 minutes due to manual processes.
|
# ? Nov 14, 2019 18:20 |
|
Don't run terraform without a human reading the plan diff first.
|
# ? Nov 14, 2019 18:22 |
Methanar posted:Don't run terraform without a human reading the plan diff first. In production, sure. For the testing environment though, we have automated terraform plan/apply several times a day.
|
|
# ? Nov 14, 2019 19:29 |
|
Methanar posted:Don't run terraform without a human reading the plan diff first. That seems really backwards to me. Why would you not want infrastructure changes to be automatically applied? People shouldn't be manually changing poo poo and your lower environments should be production-like, so there should be no surprises.
|
# ? Nov 14, 2019 20:52 |
|
The problem with that sentence is that "should" has to be bolded, underlined, and in 24 point font
|
# ? Nov 14, 2019 20:56 |
|
Bhodi posted:The problem with that sentence is that "should" has to be bolded, underlined, and in 24 point font Well, yes. But that's also a maturity thing that's achievable. The solution to the problem of "things might be out of sync with our infrastructure-as-code provider" shouldn't be "let's manually validate all the changes it's going to make before we run it", because that still leaves a huge manual error gap.
|
# ? Nov 14, 2019 22:20 |
|
New Yorp New Yorp posted:Well, yes. But that's also a maturity thing that's achievable. The solution to the problem of "things might be out of sync with our infrastructure-as-code provider" shouldn't be "let's manually validate all the changes it's going to make before we run it", because that still leaves a huge manual error gap. The goal isn't to have everything execute or deploy on git push for the sake of doing so. The goal is to remove human error where possible. Not having a human sanity-check a terraform plan output introduces human error. It's not that unthinkable that a mistake can be made that removes some middle dependency, which results in removing a bunch of SGs from things you don't want it to, or that has an incorrect string interpolation somewhere.
|
# ? Nov 14, 2019 22:44 |
|
FWIW I implemented auto-apply, but with some safeguards in place. 1. Travis runs init, validate, fmt, and plan on all branches that get pushed; checks will fail if your code has errors. 2. The git repos holding our Terraform code (we have a few different ones for different sections of AWS) have their master branches protected by the three of us who make up the devops team. 3. Travis only runs apply on branches that get merged into master, which requires approval by at least one of the devops team. So you still have a safeguard in place. The main issue I ran into was a lack of clarity into how AWS IAM roles and Travis interacted with each other. I ran into a lot of errors until I added an "assume_role" block to the provider definition, assuming a role that had access to the state bucket. The thing I am pretty happy about is how portable this solution is - I have a bash script with a little one-liner that finds all directories containing Terraform code (disregarding generic modules) and ignores everything else.
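The one-liner in question is roughly this (the excluded directory name is a guess; ours skips a shared modules/ tree):

```shell
#!/bin/sh
# List every directory that contains .tf files, skipping the shared
# modules/ tree, so CI can run plan/apply in each one.
find . -path ./modules -prune -o -name '*.tf' -print \
  | xargs -n1 dirname | sort -u
```

CI then loops over that list and runs terraform in each directory.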
|
# ? Nov 14, 2019 22:47 |
|
Methanar posted:The goal isn't to have everything execute or deploy run on git push for the sake of doing so. That's why you have production-like lower environments that things are tested against. The goal to work toward is making sure there's never a question of "is this thing that's being done to our production environment going to do the right thing?" I don't do a lot of work with Terraform, but I've been using ARM templates for years, and yes, people introduce errors into ARM templates occasionally, which are caught in lower environments or during integration testing. I'm honestly surprised that it's a common practice in Terraform land to hold up deployments while someone manually verifies output from a Terraform command.
|
# ? Nov 14, 2019 23:02 |
|
Part of the issue with terraform is that it's not easy to recover if your state file gets deleted/corrupted. Which is not something I've had happen to myself, but have heard of it happening enough times that I'm wary.
|
# ? Nov 14, 2019 23:11 |
|
The Fool posted:Part of the issue with terraform is that it's not easy to recover if your state file gets deleted/corrupted. Which is not something I've had happen to myself, but have heard of it happening enough times that I'm wary. I thought that you could import resources into a state file, and beyond that you're supposed to use Terraform Enterprise to track state if you're serious about using Terraform.
|
# ? Nov 14, 2019 23:13 |
|
Methanar posted:The goal isn't to have everything execute or deploy run on git push for the sake of doing so. There's also a compelling argument to be made that if you have ad-hoc composition of global security groups on Terraform-managed resources, you're doing Terraform wrong. Keep your SGs tightly scoped and close to the resources they're applied to and you avoid a big rat's nest of untracked dependencies. (Yeah, yeah, real world.) Vulture Culture fucked around with this message at 00:31 on Nov 15, 2019 |
# ? Nov 15, 2019 00:27 |
|
The Fool posted:Part of the issue with terraform is that it's not easy to recover if your state file gets deleted/corrupted. Which is not something I've had happen to myself, but have heard of it happening enough times that I'm wary. One extremely low-lift way to resolve this is to use remote state with S3 and object versioning on the bucket.
|
# ? Nov 15, 2019 00:30 |
|
Vulture Culture posted:One extremely low-lift way to resolve this is to use remote state with S3 and object versioning on the bucket.
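For reference, the remote-state setup being described is only a few lines of backend config (bucket and table names below are made up); object versioning is enabled on the S3 bucket itself, not in this block:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"       # versioned S3 bucket (hypothetical)
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-tf-locks"       # optional state locking (hypothetical)
  }
}
```

With versioning on, a clobbered state file can usually be rolled back to a prior object version from the S3 console or CLI.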
|
# ? Nov 15, 2019 01:33 |
Vulture Culture posted:One extremely low-lift way to resolve this is to use remote state with S3 and object versioning on the bucket. Yeah, we use remote state with S3 and object versioning on the bucket. It didn't help in the scenario I mentioned in my previous comment, though. The terraform apply run had already started modifying state and then failed partway through due to a network interruption, which also caused it not to be able to write the state back to the remote. In this scenario it's supposed to at least dump the state file to the local filesystem so you can do something with it, but it was either empty or non-existent, I can't remember. I googled around and found an open ticket on GitHub where others had run into the same issue. The old version of the state file that we still had in S3 was not useful to recover from.
|
|
# ? Nov 15, 2019 03:50 |
|
fletcher posted:In production, sure. For the testing environment though, we have automated terraform plan/apply several times a day. re: versioning - I have no idea how the hell I deleted an S3 object-versioned remote state file, but I had a reproducible test case when I was running a refresh against the wrong AWS account. I'd be OK with it creating a blank remote state version, but Terraform decided that I didn't need any remote state. I didn't apply a plan or even run an apply at all, and still managed to delete it.
|
# ? Nov 15, 2019 04:09 |
|
I haven't used Terraform or Jenkins specifically, but if you want a human to verify the plan output, could you put an approval step in your pipeline that requires a human to look at it and say "yes, this is good" before letting the pipeline do the automated deployment, rather than doing the whole deployment manually?
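Jenkins does support exactly this with its input step; a rough declarative-pipeline sketch (stage names and the plan-file name are invented, and a real setup would pin credentials and workspace handling):

```groovy
// Sketch of a plan -> human approval -> apply pipeline.
pipeline {
  agent any
  stages {
    stage('Plan') {
      steps {
        sh 'terraform init -input=false && terraform plan -out=tfplan'
      }
    }
    stage('Approve') {
      // Pipeline pauses here until a human clicks Proceed.
      steps { input message: 'Plan looks good -- apply it?' }
    }
    stage('Apply') {
      // Applying the saved plan file guarantees exactly the reviewed
      // changes are applied, not a fresh plan.
      steps { sh 'terraform apply -input=false tfplan' }
    }
  }
}
```

The `-out=tfplan` / `apply tfplan` pairing is the important part: the thing a human approved is the thing that gets applied.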
|
# ? Nov 15, 2019 16:53 |
|
FISHMANPET posted:I haven't used Terraform or Jenkins specifically but if you want a human to verify the plan output could you put an approval in your pipeline that requires a human to look at it and say "yes this is good" and then approve it and let the pipeline still do the automated deployment, rather than doing a whole deployment manually. Right, this is why I'm using git merge into master as the final approval process.
|
# ? Nov 15, 2019 20:32 |
|
So our setup (and a lot of other companies I've worked for): Git, Jenkins, Terraform. Workflow:
Anywho FISHMANPET posted:I haven't used Terraform or Jenkins specifically but if you want a human to verify the plan output could you put an approval in your pipeline that requires a human to look at it and say "yes this is good" and then approve it and let the pipeline still do the automated deployment, rather than doing a whole deployment manually. Highly recommend setting up environments via code and "deploying" via a CI/CD system. You get visibility on changes, you gate so that only "approved" things get pushed out to run, and you automate the process, taking away chances for errors when running things as one-off deploys... It is a bit of setup, but the Law of Threes suggests the more you deploy, the more time automation can save you.
|
# ? Nov 17, 2019 01:23 |
|
For those folks who have Jenkins or something else apply their Terraform changes, how do you handle cases where the apply fails because Terraform can't work out the correct order to create things, or you get rate limited, or your IAM role's (you're using roles, right?) temporary creds expire, or you hit a resource limit during the apply, or any number of other things cause a plan to succeed but an apply to fail? Our platform team has been a bit shy about the idea of automating applies, but I'd love to be able to do it if someone has a good answer for how to recover TF from a bad state that a machine put it in.
|
# ? Nov 17, 2019 01:37 |
|
Blinkz0rz posted:For those folks who have Jenkins or something else apply their Terraform changes, how do you handle cases where the apply fails because Terraform can't work out the correct order to create things, or if you get rate limited, or if your IAM role's (you're using roles, right?) temporary creds expire, or if you have a resource limit that you hit during the apply, or if really any number of other things that might cause a plan to succeed but an apply to fail? We wrap the terraform binary in a thin shell script so we can specify a retry for clouds that are #eventuallyconsistent. We also do other fun stuff with this script, such as a Slack hook/integration requiring 2FA chat authentication, which is pretty sweet. This lets us gate upper-env deployments by pausing the TF apply until someone responds/approves via the Slack app. Our script also runs terraform validate/plan prior to any apply or destroy operation. Plan output is attached to Jira/pull requests where applicable. I'm doing SRE in a highly regulated industry right now, fwiw. Also, one other thing: we literally use the same TF between environments. By the time we're deploying production, we've already run the same code apply multiple times with only our $ENVIRONMENT tfvar changing between deploys, which is part of the Jenkins/CI environment job declaration. e: also if you have the luxury, make your servers immutable and leverage load balancing. Our workflows are usually destroy => apply. Depends heavily on your workloads but this gives us a much higher predictability in deployed infrastructure Gyshall fucked around with this message at 02:42 on Nov 17, 2019 |
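A minimal version of that retry wrapper might look like this (ours does more -- 2FA gating, Slack hooks, validate/plan -- but the core is just a loop; the terraform example in the comment is hypothetical):

```shell
#!/bin/sh
# retry N DELAY CMD...: run CMD up to N times, sleeping DELAY seconds
# between attempts, for eventually-consistent cloud APIs.
retry() {
  n=$1; delay=$2; shift 2
  i=1
  while ! "$@"; do
    [ "$i" -ge "$n" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (hypothetical): retry 5 30 terraform apply -input=false tfplan
```

The function returns 0 as soon as the command succeeds and 1 once the attempt budget is exhausted, so CI can fail the build on persistent errors.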
# ? Nov 17, 2019 02:38 |
|
Gyshall posted:e: also if you have the luxury, make your servers immutable and leverage load balancing. Our workflows are usually destroy => apply. Depends heavily on your workloads but this gives us a much higher predictability in deployed infrastructure. If you can, run Docker containers in ECR/ECS or Kubernetes. This enforces the immutability and allows for some fun things like dynamically scaling based on load (if you use AWS there's a whole elastic scaling cluster option based on triggers with limits).
|
# ? Nov 17, 2019 02:47 |
|
True, but you're (probably, hopefully) not using Terraform for containers.
|
# ? Nov 18, 2019 00:32 |
|
Gyshall posted:True, but you're (probably, hopefully) not using Terraform for containers. No, true. Just automating the creation of roles, task definitions, etc. Typically just a docker build and calling java -c or python setup.py or what have you for packing the code itself.
|
# ? Nov 20, 2019 03:46 |
|
I don't know how to even begin deploying changes that affect over 100 developers worth of services.
|
# ? Nov 21, 2019 04:24 |
|
iirc on AWS you could add a scaling rule that terminates the oldest instance in a group on a regular interval, while another scaling rule starts new ones to maintain capacity is there an off-the-shelf way to do that in google cloud? especially with GKE node pools, and hell, why not kubernetes pods
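For the pod version of this on plain Kubernetes, one option is a scheduled rollout restart, which cycles a deployment's pods while respecting its update strategy. A sketch (names and schedule are invented, the ServiceAccount needs RBAC rights to patch deployments, and the apiVersion for CronJob varies with cluster version):

```yaml
# CronJob that restarts a deployment nightly, recycling all its pods.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-recycle                 # made-up name
spec:
  schedule: "0 4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: restarter # needs patch rights on deployments
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest  # image whose entrypoint is kubectl
            args: ["rollout", "restart", "deployment/webapp"]
```

For GKE nodes specifically, I'm not aware of a built-in oldest-first recycler like the AWS ASG rule described above, so people tend to lean on node auto-upgrade/auto-repair or something like this at the pod layer.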
|
# ? Nov 21, 2019 06:29 |
|
Doc Hawkins posted:iirc on AWS you could add a scaling rule that terminates the oldest instance in a group on a regular interval, while another scaling rule starts new ones to maintain capacity Just feed your chaos monkey some bath salts.
|
# ? Nov 21, 2019 07:19 |
|
|
|
Methanar posted:I don't know how to even begin deploying changes that affect over 100 developers worth of services. Methanar posted:Just feed your chaos monkey some bath salts. You have just answered your own question
|
# ? Nov 21, 2019 09:18 |