Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
Can I please get some advice from those who have done this before? My org wants to start hosting Windows containers and support some CI for a handful of devs who are each doing their own thing. Some are on TFS and others on an old version of GitLab, but nobody here has any CI/CD experience. We're 100% on-prem, no butt stuff. There's a need to keep things as simple as possible so that I'm not creating a nightmare for the rest of the Ops team.

My current plan is to stand up a couple of Docker Swarm clusters with Traefik for ingress, and then move all the devs to an upgraded version of GitLab for image registries and CI jobs. I'd like to make them a sample pipeline to use as a reference, and then make them responsible for their own crap. I'm not sure yet whether I should set up a dedicated build environment or have them build on their workstations and push to the registry. Does this approach make sense?
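
In my head the build step of that reference pipeline boils down to something like this (registry URL, image name, and token handling are placeholders; a real GitLab job would pull them from its predefined CI variables):
code:
# Build-AndPush.ps1 -- minimal sketch of the sample pipeline's build/push step.
# Registry, image name and token source are placeholders, not our real values.
param(
    [string]$Registry = "registry.example.local",
    [string]$Image    = "myteam/webapp",
    [string]$Tag      = (git rev-parse --short HEAD)
)

$fullName = "{0}/{1}:{2}" -f $Registry, $Image, $Tag

# Authenticate against the GitLab container registry using the job token
$env:CI_JOB_TOKEN | docker login $Registry -u gitlab-ci-token --password-stdin

# Build from the Dockerfile in the repo root and push the tagged image
docker build -t $fullName .
docker push $fullName
if ($LASTEXITCODE -ne 0) { throw "docker push failed for $fullName" }
Whether that runs on a shared build box or a dev workstation is the part I haven't decided yet.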

I don't have a clear idea of our devs' typical workflows, but they mostly make little .NET webapps with databases on an existing SQL cluster. They manually update UAT/prod by copying files over. Is there anything in my proposed plan that would be a no-go for normal dev work? What should I be asking them or looking for?

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
Thanks for the good thoughts, I appreciate it. I'll sit down with each project and get an idea of how they do things, but my expectations are as low as yours. I agree that starting with a whole build pipeline will be too much.

I have to worry about buy-in from both sides here, and the easier sell for Ops is containerizing all these tiny crap apps to cut down on VM sprawl and especially Windows Server licensing costs. Storage-wise, 6.3GB per node for microsoft/windowsservercore is nothing when they're already using dedicated Dev, UAT and Prod VMs (plus backups).

My prototypes so far can definitely confirm that docker on Windows is crap; there's no end of gotchas. Overlay network egress didn't work at all in a mixed Linux/Windows swarm. And docker, why did you lower the MTU of the host's external adapter to 1450 but leave the bridged adapter at 1500 when you know MSS announcements don't work between them? Hello semi-randomly reset connections. And your MTU config option does sweet fuckall on Windows. gently caress.
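
For anyone hitting the same thing, the only workaround I've found is forcing the MTUs to match by hand. Rough sketch (the interface alias below is an example; check what the first command shows on your own hosts):
code:
# Spot the mismatch between the external adapter and the vEthernet/HNS adapters
# that Docker creates.
Get-NetIPInterface -AddressFamily IPv4 |
    Sort-Object InterfaceAlias |
    Format-Table InterfaceAlias, NlMtu

# Example workaround: pin the container-facing adapter down to the overlay's MTU.
# "vEthernet (Ethernet)" is just an example alias; substitute yours.
Set-NetIPInterface -InterfaceAlias "vEthernet (Ethernet)" -AddressFamily IPv4 -NlMtuBytes 1450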

But I think for simple stuff it's pretty feasible. I containerized two ASP.NET 3.5 sites without much hassle, and I've never done this before. I don't know that I'd want to try it with any of the legacy off-the-shelf apps we're hosting though.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

Hadlock posted:

If you get it to work, please post your notes and experiences here.

I've yet to hear of anyone getting Windows containers to actually work, so good luck with your endeavors. Linux is pretty well baked at this point, but I have yet to see a good writeup of a functional Windows container system. I haven't really been looking lately though.

Good luck sir :patriot:

I did this last spring. It was awful so I wrote a guide on how I did it. https://github.com/mooseracer/WindowsDocker

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
yes but it looks good on the resume :pseudo:

My main complaints are about Swarm and the overlay network. The Windows hosts are so unreliable that I had to build swarm-rejoining scripts. I feel like uptime would be better with no cluster at all, just a single manager.
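
The rejoin scripts aren't much more than this, run from a scheduled task (manager address and token come from our config management; shown here as placeholder parameters):
code:
# Rejoin-Swarm.ps1 -- sketch of the scheduled-task script that puts a flaky
# Windows worker back into the swarm.
param(
    [string]$ManagerAddr = "10.0.0.10:2377",
    [string]$JoinToken   = $env:SWARM_WORKER_TOKEN
)

# Check whether this node still thinks it's part of the swarm
$state = docker info --format '{{.Swarm.LocalNodeState}}'
if ($state -ne 'active') {
    Write-Output "Swarm state is '$state', attempting rejoin."
    docker swarm leave --force 2>$null   # clear any half-dead membership first
    docker swarm join --token $JoinToken $ManagerAddr
}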

But it did improve with patching as I was building it out. Started with Server 1709 and Docker EE 17.06-06 and it barely worked at all. Maybe someday this will be a viable option for hosting a containerized version of your bullshit mission critical legacy ASP app until the end of time.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
My little SaaS shop still has the majority of its components on EC2, with newer stuff developed locally in docker and then GitHub Actions'd into ECS. Almost every component is a dependency of something else, so you need the full stack in every dev environment. But creating a dev environment means getting the EC2 side going, then editing and manually running terraform in every component's repo to get its ECS service up, with a PR per repo. This ECS pattern is obviously poo poo and won't scale beyond a handful of components.

Is this finally a good time for us to use EKS? Have a helm chart from each repo, then just pull down the ones you want when a new environment is needed and deploy them to a new namespace in some premade cluster? And how do I sell my devs on it? It took them like a year to stop bitching about managing docker locally.
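
The mechanics I'm imagining are roughly this (component names and chart paths are made up), which already feels simpler than the terraform-per-repo dance:
code:
# Sketch: after pulling down the component repos you need, deploy each one's chart
# into a per-environment namespace on a shared cluster. Names/paths are hypothetical.
$envName    = "feature-1234"
$components = @("auth", "billing", "frontend")

foreach ($c in $components) {
    helm upgrade --install $c ".\$c\chart" `
        --namespace $envName --create-namespace `
        --set image.tag=$envName --wait
}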

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

fletcher posted:

Ugh I was dealing with this the other day. I have like 40 of these repeated in my terraform code and I could not for the life of me figure out how to just create a list variable to do it:


Sure, we've implemented this. Here are the lifecycle rules; we wanted ours to keep images tagged for an environment and purge the rest:
code:
variable "ecr_tags" {
  type = list(string)
  default = [
    "dev",
    "staging",
    "prod"
  ]
}

locals {
  tag_rules = [
    for env in var.ecr_tags :
    {
      rulePriority = index(var.ecr_tags, env) + 1
      description  = "keep ${env} images"
      selection = {
        tagStatus     = "tagged",
        tagPrefixList = [env],
        countType     = "sinceImagePushed",
        countUnit     = "days",
        countNumber   = 1095
      },
      action = {
        type = "expire"
      }
    }
  ]
  default_rules = [
    {
      rulePriority = length(var.ecr_tags) + 1
      description  = "Remove untagged images"
      selection = {
        tagStatus   = "untagged",
        countType   = "sinceImagePushed",
        countUnit   = "days",
        countNumber = 7
      },
      action = {
        type = "expire"
      }
    },
    {
      rulePriority = length(var.ecr_tags) + 2,
      description  = "Purge the rest of the tagged images",
      selection = {
        tagStatus   = "any",
        countType   = "sinceImagePushed",
        countUnit   = "days",
        countNumber = 30
      }
      action = {
        type = "expire"
      }
    }

  ]
  ecr_lifecycle_rules = concat(local.tag_rules, local.default_rules)
}
Then you apply them to whatever ECR repositories. Mine were made in the same state but you could use a data lookup here.
code:
resource "aws_ecr_lifecycle_policy" "components" {
  for_each   = var.repository_names
  repository = aws_ecr_repository.components[each.key].name

  policy = jsonencode(
    {
      rules = local.ecr_lifecycle_rules
    }
  )
}

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
Cloudflare Zero Trust / Access has a free plan for up to 50 users. You can basically use it instead of a VPN. Relatively new but seems decent so far.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

The Iron Rose posted:

... Put simply, reliability for any message M != reliability of all services through which M *might* traverse. I can easily measure reliability per service. But the more *useful* metric to provide to users (and customers!) is the success rate for the :techno: ...

Presenting a metric to customers is a completely different goal from internal engineering/troubleshooting, and you shouldn't handle them the same way. For example, an initial customer request might create a bunch of failed traces, but if the automatic failure handling was good enough then maybe the customer never noticed -- should that transaction count against your metric? Probably not.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
What about introducing another branch before main, say 'staging', that doesn't have all the branch protections slowing you down? Have the dev cluster manifests track it instead of main and enable auto sync. Then you can get fancy with automated testing, like if tests on the dev cluster pass it auto-PRs from staging to main and Slacks you for final approval.
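
Rough sketch of the moving parts (app name, repo, and webhook are placeholders; $testsPassed comes from whatever earlier test step you have):
code:
# Point the dev cluster's ArgoCD Application at the 'staging' branch with auto-sync.
argocd app set my-app-dev --revision staging --sync-policy automated

# In the test pipeline: if the dev-cluster tests passed, open the promotion PR
# and ping Slack for a human to approve.
if ($testsPassed) {
    gh pr create --base main --head staging `
        --title "Promote staging to main" `
        --body "Dev cluster tests passed; approve to promote."

    Invoke-RestMethod -Uri $env:SLACK_WEBHOOK_URL -Method Post `
        -ContentType 'application/json' `
        -Body (@{ text = "staging -> main PR is ready for approval" } | ConvertTo-Json)
}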

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
Sure, it's reasonable to hate that branching pattern. Your alternative is to track via git tags or commit SHAs. Same idea: have ArgoCD track the repo differently for dev than prod.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

Hughmoris posted:

my eyes start to glaze over when I start looking at application code. I am able to read and write basic apps but I don't particularly enjoy it. I really enjoy scripting/automating and have more fun putting the lego blocks together.
...
For those of you working professionally in this field, are you writing lots of .NET/Node/Java etc?

I'm a <stupid Cloud title> at a small (~100 devs) SaaS shop and came from an infra/SOE background. The platform's all .NET for the mid/backend components. The traditional programmer types massively outnumber anyone with operational knowledge, so we're spread pretty thin and are encouraged to push whatever work we can back onto the scrum teams. I need to know where to find a component's init/config code, and how to read it, but I don't write any C#. Like if they want to implement OpenTelemetry for APM or something that's cool and we'll help, but it's their story. What I do write a lot of is poo poo to automate and glue things together, mostly in powershell. The rest is terraform, yaml/json configs for deployment tooling, some python and bash. I absolutely never need to work on some app's internal business logic or API contracts or data schemas or any of that boring rear end poo poo.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
We use New Relic instead of Prometheus, but yeah, the pattern is to run an OpenTelemetry collector as a sidecar container next to your .NET app's container. That's not the same as installing the agent in your Dockerfile, which I suspect is what you were thinking. AWS has a sample task definition. If you're using Fargate your options are pretty limited, and this one is probably the least poo poo.
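
The shape of it is something like this, trimmed way down -- every name, ARN, and image here is a placeholder, so start from AWS's sample task definition rather than this:
code:
# Sketch of registering a Fargate task with the app container plus an
# OpenTelemetry collector sidecar. All identifiers are placeholders.
$taskDef = @'
{
  "family": "myapp",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest",
      "essential": true
    },
    {
      "name": "otel-collector",
      "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
      "essential": true
    }
  ]
}
'@

$taskDef | Set-Content taskdef.json
aws ecs register-task-definition --cli-input-json file://taskdef.json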

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
I wound up building our ephemeral environments based around Crossplane and ArgoCD in EKS, with a custom powershell module to wrap all the CLI crap so folks didn't need to learn 5 different utilities. Our environments consist of container stacks, a fat Windows EC2 instance with MSSQL and our legacy platform, mysql backed by EFS, vhosts on a Cloudamqp instance, Cloudflare and ALBs for ingress, telepresence.io, etc. There's like 9 kubernetes operators going and the learning cliff was steep, but I'm super happy with how reliable it's proving to be now that it's tuned for our scale.
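
The wrapper module is nothing fancy -- it mostly exists so a dev can type one command instead of learning kubectl/argocd/helm. A trimmed sketch (the repo, chart path, and label conventions are specific to our setup, so treat them as placeholders):
code:
# EphemeralEnv.psm1 -- trimmed sketch of the wrapper module.
function New-EphemeralEnvironment {
    param(
        [Parameter(Mandatory)] [string]$Name,
        [string]$Branch = $Name
    )

    # Namespace per environment, labelled so the purge job can find expired ones later.
    $created = Get-Date -Format 'yyyyMMdd'
    kubectl create namespace $Name
    kubectl label namespace $Name ephemeral=true "created=$created"

    # One ArgoCD app-of-apps per environment; everything else (Crossplane claims,
    # mysql, the fat EC2 instance) hangs off it.
    argocd app create "$Name-root" `
        --repo https://github.com/example/deploy-charts.git `
        --path apps/root --revision $Branch `
        --dest-namespace $Name --dest-server https://kubernetes.default.svc `
        --sync-policy automated
}

function Remove-EphemeralEnvironment {
    param([Parameter(Mandatory)] [string]$Name)
    argocd app delete "$Name-root" --yes
    kubectl delete namespace $Name
}

Export-ModuleMember -Function New-EphemeralEnvironment, Remove-EphemeralEnvironment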

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
We use Cloudflare instead of Cloudfront, but the idea is that externaldns provisions Cloudflare DNS records for both static assets and APIs. We use a Cloudflare Worker (you'd use Lambda@Edge) to rewrite static asset requests to the right path prefix for that ephemeral environment -- by default we want them using the prod assets, but folks can set a branch name in an HTTP header. So for S3 in particular you don't need to handle it from k8s, and if that lets you avoid running ACK or Crossplane or some poo poo, that's a big plus.

But if you do have other AWS resources that you must handle from k8s then I'd suggest looking at Crossplane's aws-provider-family over ACK.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

Docjowles posted:

We went the route of many microsegmented accounts for better and worse.
...
Instead we have many other problems :pseudo:

We're starting to implement this after our TAM pushed hard for it for 2 years. I feel like it's going to be a lot of effort for a sidegrade, but I don't care enough to fight it. Any advice on how to minimize the pain?

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
A team member's stood up Account Factory / Control Tower and has just started switching us over to Identity Center (the only part of this I actually like), so hopefully we can get it as automated as you described.

Nothing will be truly greenfield. All of our containerized microservices will need migrations. The main thing we'll get out of this IMO is having small blast radii, so we can run terraform from dozens of pipeline-specific roles (presently 100% manual) without a nightmare spiderweb of bespoke IAM policies given the wildly inconsistent naming/structure of the legacy account. We also have to redo most of the networking to support the shared VPC model, so it's a good time to clean house and ditch our awful security group design.

But neither of these things requires micro accounts. I've had a bit of an adversarial relationship with our TAM and tend to think that their proposals are a bad fit for our org given our small headcount. I definitely have the impression that they push what's best for AWS rather than what's best for us.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

Hadlock posted:

How do you tell Argo to redeploy the helm chart with the updated container image? Does Joe the part-time release manager/QA dude just manually update the helm chart, in perpetuity? What if you have dev and staging environments? Are you now maintaining three nearly identical helm charts? Do you have three different values files you update? What's going to update it? Will it live near ArgoCD, or maybe some other group will own the custom tooling/business process? Who the gently caress knows? It's not ArgoCD's problem. :airquote: we don't like to force choices on the end user. gently caress you ArgoCD. gently caress you

I'll chime in with how I handled the above points, because yeah they're definitely things that need dealing with that are kinda outside Argo's scope. You're essentially asking, "how do you handle updated container images across different environments?"

I approach this in two ways, both from the CICD pipeline (I use GitHub Actions). All my helm charts are in their own monorepo. I update charts on the main branch when prod images are published, and on feature/hotfix branches when their images are published (usually by a pull request in the microservice's own repo). So given that, when the pipeline for a microservice runs it can do things like:

1) Update the image tag in values.yml whenever it publishes a new container image, on the appropriate monorepo branch.

2) Patch specific ArgoCD Application / ApplicationSet CRs for a given environment. They have optional values.yml overrides for the image tag and other environment-specific things. In general this is where I differentiate the deployment variables for different environments, though there can be overlap/duplication with ones stored in GHA.
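
In pipeline terms both steps boil down to something like this (the chart monorepo layout and app naming are our conventions, and the regex is deliberately naive -- a real pipeline would use proper YAML tooling):
code:
# Sketch of the two pipeline steps. $Component/$Tag/$Branch come from the GHA workflow.
param([string]$Component, [string]$Tag, [string]$Branch)

# 1) Bump the image tag in the chart monorepo on the appropriate branch.
git checkout $Branch
$valuesFile = "charts/$Component/values.yaml"
(Get-Content $valuesFile) -replace '^(\s*tag:\s*).*$', "`${1}$Tag" | Set-Content $valuesFile
git commit -am "Bump $Component image to $Tag"
git push origin $Branch

# 2) Point the environment-specific ArgoCD Application at the new tag via a values override.
argocd app set "$Component-dev" --helm-set image.tag=$Tag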

Secrets are always messy. I typically have ones required for builds & automation testing stored in GHA, and ones required by containers at runtime stored in AWS Secrets Manager (sync'd via externalsecrets).

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

Hughmoris posted:

Are any of you devoppers(?) heavy on the database side? If so, could you tell me a little bit on how databases fit into your ci/cd process?

I'm comfortable with the hobbyist basics of setting up a new postgresql cluster, loading data, and writing queries. Now, I'm trying to better understand how databases fit into ci/cd workflows that one might realistically see on the job.

For schema changes we use tooling that runs the app's SQL scripts idempotently. It's not too smart, just has a hierarchy for running things in the right sequence, and keeps track of whether any given script has run before or not (or if it should run every time).
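
It's roughly equivalent to this (server, database, folder, and table names are illustrative, not our real tooling):
code:
# Sketch of an idempotent runner: scripts run in filename order, and a history
# table records which ones have already been applied. Uses Invoke-Sqlcmd from
# the SqlServer module.
$server   = "sqlcluster.example.local"
$database = "AppDb"

Invoke-Sqlcmd -ServerInstance $server -Database $database -Query @"
IF OBJECT_ID('dbo.MigrationHistory') IS NULL
    CREATE TABLE dbo.MigrationHistory (ScriptName nvarchar(260) PRIMARY KEY, AppliedAt datetime2 NOT NULL);
"@

foreach ($script in Get-ChildItem .\migrations\*.sql | Sort-Object Name) {
    $applied = Invoke-Sqlcmd -ServerInstance $server -Database $database `
        -Query "SELECT 1 AS Applied FROM dbo.MigrationHistory WHERE ScriptName = '$($script.Name)'"
    if (-not $applied) {
        Invoke-Sqlcmd -ServerInstance $server -Database $database -InputFile $script.FullName
        Invoke-Sqlcmd -ServerInstance $server -Database $database `
            -Query "INSERT INTO dbo.MigrationHistory (ScriptName, AppliedAt) VALUES ('$($script.Name)', SYSUTCDATETIME())"
    }
}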

Where it gets tricky is marrying up schema changes with rolling deployments. You could easily introduce a breaking schema change that requires a new app version, which means all your blue/green or staggered rolling deployments are going to poo poo a brick once the database updates.

We handle it through policy -- breaking schema changes MUST be implemented gradually across multiple deployments and a distinct release phase. For example, if you wanted to rename a column:
- 1st deployment adds a new column with the new name. Don't loving touch the old one. App version n+1 will initially use the old column, but is feature flagged to use the new one. You deploy the new code, and app versions n and n+1 run happily side by side until the rollout completes and every instance is n+1, still using the old column.
- 1st release is when you sync the rows from the old column to the new one and toggle that feature flag. n+1 is now using the new column and you didn't break anything.
- 2nd deployment for n+2 deletes the old column to clean up and reduce DBA grumpiness. Also de-feature flag it in the app.
- After 2nd deployment is complete you can delete the feature flag itself.
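
To make the rename concrete (table and column names here are made up), the phased scripts amount to:
code:
# Illustrative only -- dbo.Orders and CustName -> CustomerName are made-up names.
# 1st deployment: additive change only.
Invoke-Sqlcmd -ServerInstance $server -Database $database -Query @"
ALTER TABLE dbo.Orders ADD CustomerName nvarchar(200) NULL;
"@

# 1st release: backfill, then flip the app's feature flag to use the new column.
Invoke-Sqlcmd -ServerInstance $server -Database $database -Query @"
UPDATE dbo.Orders SET CustomerName = CustName WHERE CustomerName IS NULL;
"@

# 2nd deployment (version n+2 no longer references the old column): clean up.
Invoke-Sqlcmd -ServerInstance $server -Database $database -Query @"
ALTER TABLE dbo.Orders DROP COLUMN CustName;
"@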

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

Hadlock posted:

Can you explain just how the hell you're doing this?

I think I want to create a cloudfront distribution using an ACK* CRD and have external dns read a value (arn?) out of the ACK CRD

I'm guessing I need to program external dns to point at a different resource than an ingress

*Aws controller for Kubernetes, a helm chart thing for AWS services, among them S3 and cloudfront

Also open to pointing external-dns at cloudflare CDN, since clearly you have that working, and we're also cloudflare customers

I have 3 copies of externaldns running, one each for Cloudflare, public route53, and private route53. They watch for their own custom annotations on both ingresses & services. I can get a static CNAME in Cloudflare just by making a service of type ExternalName and giving it the same annotation I configured my externaldns-cloudflare provider with. You can use the other types (ClusterIP/NodePort) to get dynamic CNAMEs pointing to load balancers -- externaldns will figure it out from the related ingress.
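
The static-CNAME trick is literally just this (hostname and target are placeholders; the annotation is the stock externaldns one):
code:
# A Service of type ExternalName that externaldns (Cloudflare provider) turns
# into a static CNAME. Hostname and target are placeholders.
@'
apiVersion: v1
kind: Service
metadata:
  name: assets
  annotations:
    external-dns.alpha.kubernetes.io/hostname: assets.example.com
spec:
  type: ExternalName
  externalName: static-assets.example-cdn.net
'@ | kubectl apply -f -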

I think your problem is the need to look up that S3 bucket URI and patch it to the service. Do you really need to create/destroy S3 buckets on the fly? If you can have static bucket URIs for your CNAMEs then your problem goes away. Otherwise, this is where ACK falls on its face and you have to use a Crossplane XRD (which supports arbitrary patching of attributes from one resource to another). This is something that's dead simple in terraform/cloudformation but is pretty poo poo in kubernetes. Like I wanted to create EC2 instances and then use their new private IPs for target group registrations as well as private route53 records, and a Crossplane XRD was the only decent option I found.

I think ACK support from AWS got yanked hard. Once my TAM got wind that I was even considering using EKS they started having Come to Jesus interventions with me, and even put together a call with 3 other AWS specialist support engineers to try to dissuade me. (They didn't offer a better solution, just a nebulous "automate Cloudformation!") It's cynical of me but I suspect they have orders from on high to discourage kubernetes use however they can, to make sure you're locked in to the AWS-specific services.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.

Hughmoris posted:

Thanks for the detailed insight!

You mention DBA grumpiness, so I'm guessing you have a stand-alone DBA team that you work with? How involved are they with your team on day to day work?

We have a pair of dedicated DBAs / data architects, who work alongside a BI team. My team (infra/ops) gets fairly regular requests for work from them, but DBA involvement in any CICD-related stuff is rare. We might work together when migrating a component, or for standing up a new ETL process, or general troubleshooting when poo poo's blowing up. I'm a bit of an outlier in that I'm the only one on my team who jumps into our component repos and submits PRs and deeply understands the entire deployment process(es). I typically work with the software architects, developer leads, and platform folks.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
My static environments use RDS, but my ephemeral environments each have a mysql container backed by EFS. Microservices run their sql scripts as init containers, which adds a lot of startup time but guarantees their schemas are correct. It was relatively painless to set up and hasn't given us much trouble. Cleaning up expired environments from EFS was a few more lines in the Purge cronjob.
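
The init container bit is nothing exotic -- each microservice's Deployment just front-loads its schema scripts before the app starts. Trimmed sketch (image, secret, and script ConfigMap names are placeholders):
code:
# Pattern: an init container applies the service's SQL scripts against the
# environment's mysql before the app container starts. Names are placeholders.
@'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing
spec:
  replicas: 1
  selector:
    matchLabels: { app: billing }
  template:
    metadata:
      labels: { app: billing }
    spec:
      initContainers:
        - name: apply-schema
          image: mysql:8.0
          command: ["sh", "-c", "mysql -h mysql -u root -p\"$MYSQL_ROOT_PASSWORD\" billing < /sql/schema.sql"]
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom: { secretKeyRef: { name: mysql-root, key: password } }
          volumeMounts:
            - { name: sql-scripts, mountPath: /sql }
      containers:
        - name: billing
          image: registry.example.com/billing:latest
      volumes:
        - name: sql-scripts
          configMap: { name: billing-sql }
'@ | kubectl apply -f -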


crazypenguin posted:

Maybe, but as a former aws engineer (on the build services side, not the handhold customers side), kubernetes' popularity was just baffling.

It really does/did look to us like a way of spending more money (sometimes a LOT more money, so many people ignored cross-AZ bandwidth costs) on aws, in order to get a worse result.

I would legitimately believe they've transitioned from "lol our customers are demanding to spend more money haha give them what they want" to "oh god, the kubernetes-shaped money firehose they built is getting so expensive they're starting to have a bad experience with 'the cloud' in general, we don't want them to end up motivated to gtfo, let's help them get their poo poo together even if it severely reduces their spend short term"

That's probably a more realistic view than mine, I'll try to give them the benefit of the doubt. Honestly the biggest factor for using EKS vs ECS for these ephemeral environments was the available tooling / user experience. My devs have had access to the ECS UI for years and still get lost in there. They are better with Docker Desktop, so I wanted to present them with something like that (or Podman). ArgoCD's UI isn't about to win any awards but it does a much better job of showing you your containers, their health & logs, and letting you change crap like environment variables on the fly.

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
Can I get some opinions on how common it is to have a dedicated Architect role for handling upfront cloud infra & CICD design, versus Engineers or Sysadmins throwing poo poo at the wall and iterating forever?

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
Thanks for the replies. For context, our small SaaS shop (automotive, ~100 devs & engineers) has a dedicated architect team, but all four of them are developers and focus on working with the app teams. Infra architecture has always been handled by engineering, and we basically work like Iron Rose says. It works, but only if everyone's trying hard to fit established patterns and figure out where a project fits into the big picture. When we don't, we build random patchwork projects that aren't maintainable. There's no real oversight, just peer feedback in a culture of folks who hate rocking the boat. I mostly asked to get an idea of how unreasonable it would be to push for creating another role on the architect team.
