Hadlock
Nov 9, 2004

minato posted:

PE @ Meta == SRE everywhere else. Facebook decided to call it something different because they refuse to follow Google's lead or even use any Google products internally, despite copy/pasting most of their culture from Google. I vaguely recall their rationale being along the lines of "Reliability implies that the engineer just cares about uptime, but really the scope is about engineering products to work at a production scale". Yeah whatever, it's just SRE.

Cheers, thanks sire goon; that clears a lot up for me

Methanar
Sep 26, 2013

by the sex ghost

whats for dinner posted:

We switched all of our CI agents to this (GitLab runner using the Kubernetes executor) and build times didn't change up or down but we didn't really expect them to because they aren't compute bound. The real benefit for us was vastly increased spot availability and lower spot prices compared to the equiv intel instance types.

Increased spot availability is interesting. Do you have any sort of numbers quantifying that? Just judging by the interruption rate advertised by AWS, it would seem that the AMD stuff actually gets slightly more interruptions for the larger instance types that I'd be interested in.
I also see that despite a price difference in on-demand, the spot price is actually the same.

code:
interruption rate
c6a.16xlarge	>20%
c6a.24xlarge	5-10%

c6i.16xlarge	5-10%
c6i.24xlarge	5-10%

on-demand price (instance / $ per hour / vCPU / memory / storage / network)
c6a.16xlarge	$2.448	64	128 GiB	EBS Only	25000 Megabit
c6a.24xlarge	$3.672	96	192 GiB	EBS Only	37500 Megabit

c6i.16xlarge	$2.72	64	128 GiB	EBS Only	25000 Megabit
c6i.24xlarge	$4.08	96	192 GiB	EBS Only	37500 Megabit

spot price
c6a.16xlarge	$1.0818 per Hour
c6a.24xlarge	$1.6226 per Hour

c6i.16xlarge	$1.0818 per Hour
c6i.24xlarge	$1.8081 per Hour
https://aws.amazon.com/ec2/spot/instance-advisor/
https://aws.amazon.com/ec2/pricing/on-demand/
https://aws.amazon.com/ec2/spot/pricing/

This is all nebulous enough that I'm probably just going to start with c6i across the board, and then do a trial run of converting some subset to c6a later on to compare.

Methanar fucked around with this message at 20:07 on Sep 27, 2022

whats for dinner
Sep 25, 2006

IT TURN OUT METAL FOR DINNER!

Methanar posted:

Increased spot availability is interesting. Do you have any sort of numbers quantifying that?

When we made the move, I did the same calculations as you for the region we're operating in and the instance types we cared about (c5a.4xlarge and c5.4xlarge), and the AMD boxes came out ahead on both counts, especially interruption rate. On Intel boxes we were seeing bad interruption rates and also running into a lot of cases where spot availability was exhausted in the region. Now, about a year later, spot interruption and savings for both are pretty much at parity.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

fletcher posted:

I assume you are backing up your git repos in some manner, since you mentioned you keep most of them locally. Definitely set up backups if you haven't already.

That being said, I just use Bitbucket and their Pipelines feature to run stuff when I push to my repos. GitLab and GitHub have their equivalents as well, and they all work about the same as far as I know. I use Bitbucket just because that's what I started with way back when. If I was starting today I'd probably use Gitlab since that's what I'm familiar with using at work. I think each of these providers offers a certain amount of free "build minutes". I pay a small fee for additional build minutes on Bitbucket.

I think bitbucket still offers unlimited free private repositories, so it was a no brainer for me rather than trying to self host something.

My photoprism build kept failing on bitbucket pipelines so I ended up trying it on gitlab, and it fixed the issue. I think the bitbucket runner didn't have enough memory, even with the 2x setting enabled.
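
For reference, this is roughly what the memory bump looks like in bitbucket-pipelines.yml; a sketch only, with a hypothetical step name and script:
code:
pipelines:
  default:
    - step:
        name: build-photoprism    # hypothetical step name
        size: 2x                  # doubles the memory available to the step (and the build minutes it burns)
        services:
          - docker
        script:
          - docker build -t photoprism-custom ./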

I had been thinking about migrating from bitbucket to gitlab anyways, and I wanted to host my own runner so I don't have to pay for build minutes. This ended up being a good excuse to do both, because in order to use a custom image in the gitlab pipeline, you have to host your own runner. In this regard, I think bitbucket has an advantage, since I was able to use my custom image on shared runners on bitbucket.

Using private images for gitlab runners and pushing to ecr registries was kind of a pain in the rear end overall compared to doing it on bitbucket. I can't help but think that is intentional, to get you to use gitlab's container registry service.

To push images I'm building to ECR on bitbucket it was simply:
code:
image:
  name: debian:bullseye-slim

pipelines:
  default:
    - step:
        name: my-image-to-do-builds-with image
        services:
          - docker
        caches:
          - docker
        script:
          - export IMAGE_NAME="fletcher/my-image-to-do-builds-with"
          - docker build -t "$IMAGE_NAME" ./
          - pipe: atlassian/aws-ecr-push-image:1.5.0
            variables:
              AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID
              AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY
              AWS_DEFAULT_REGION: $AWS_DEFAULT_REGION
              IMAGE_NAME: $IMAGE_NAME
On gitlab, that part wasn't too much trouble:
code:
publish:
  stage: build
  image:
    name: docker:latest
  services:
    - docker:dind
  variables:
    IMAGE_NAME: "fletcher/my-image-to-do-builds-with"
  before_script:
    - apk add --no-cache curl jq python3 py3-pip
    - pip install awscli
    - aws ecr get-login-password | docker login --username AWS --password-stdin $DOCKER_REGISTRY
  script:
    - docker build -t "$DOCKER_REGISTRY/$IMAGE_NAME:$CI_COMMIT_BRANCH" ./
    - docker push "$DOCKER_REGISTRY/$IMAGE_NAME:$CI_COMMIT_BRANCH"
On bitbucket to use a private image to do my build I just had to include this in my bitbucket-pipelines.yml (and specify the env vars on the repo):
code:
image:
  name: 123456789.dkr.ecr.us-west-1.amazonaws.com/fletcher/my-image-to-do-builds-with
  aws:
    access-key: $AWS_ACCESS_KEY_ID
    secret-key: $AWS_SECRET_ACCESS_KEY
On gitlab, using a private image to do my builds was a lot more involved: you are required to host your own runner, and I had to sift through a bunch of poo poo in this ticket to get it working. At least it all works now :toot:

Methanar
Sep 26, 2013

by the sex ghost
It's 12:38, past midnight.

You're a bit bummed out.
You lie down in bed, it's not that comfortable and your tinnitus is flaring up.
You close your eyes and wait for the next work day.
Your phone chimes.
"Maybe it's a bumble match"
It's not.

"I'm sure it's everything is fine."
code:
{
timestamp: 1665392458894,
status: 999,
error: "None",
message: "No message available"
}
It's 3:05 AM
Spinnaker was very not fine.
I am very not fine.

Methanar fucked around with this message at 11:05 on Oct 10, 2022

Junkiebev
Jan 18, 2002


Feel the progress.

Methanar posted:

It's 12:38, past midnight.

You're a bit bummed out.
You lie down in bed, it's not that comfortable and your tinnitus is flaring up.
You close your eyes and wait for the next work day.
Your phone chimes.
"Maybe it's a bumble match"
It's not.

"I'm sure it's everything is fine."
code:
{
timestamp: 1665392458894,
status: 999,
error: "None",
message: "No message available"
}
It's 3:05 AM
Spinnaker was very not fine.
I am very not fine.

#HugOps

Junkiebev
Jan 18, 2002


Feel the progress.

Junkiebev posted:

I can't decide if something is a crazy anti-pattern for terraform

I have a bunch of vcenters (some linked, but links don't propagate tag categories or values)
I have a tag category (department number - single cardinality) and tags (the actual department number values) I'd like to put on them in a uniform way so that they may be applied to VMs and such.
What I'm thinking is
JSON with the vCenter URI available via REST call
hard-code tag categories in TF module
JSON with the tag values available via REST call
tagging done in a terraform module with a provider populated by provided variables in main.tf
for-each the vCenters, run the module
within the module, for-each the tags and create them

is this madness because it's not super declarative, or shrewd? I'm sure I'd end up using dynamics, but you can't initialize or reference different providers within a dynamic afaik

I came up with a remarkably cursed solution for this: a shell script that parses the JSON and dynamically builds providers.tf and main.tf in the module that does the tagging, plus a thrice-damned dynamic map iteration that makes me nauseous but also :smug:

poo poo works, ship it

code:
/*
    ___           __     __________ __        __      _       __ 
   /   |         / /__  / __/ __/ //_/       / /___  (_)___  / /_
  / /| |    __  / / _ \/ /_/ /_/ ,<     __  / / __ \/ / __ \/ __/
 / ___ |   / /_/ /  __/ __/ __/ /| |   / /_/ / /_/ / / / / / /_  
/_/  |_|   \____/\___/_/ /_/ /_/ |_|   \____/\____/_/_/ /_/\__/  
                                                                                                                                                                                        
*/

Junkiebev fucked around with this message at 17:48 on Oct 10, 2022

Methanar
Sep 26, 2013

by the sex ghost
It's 11 am. I finally did a handoff to a European at 4 am, after making my last change of the night to significantly increase the size of Redis, which solved the problem enough for the moment.

I wake up to 30-plus messages on my phone at about 9 am. I lie in bed for 2 hours trying to go back to sleep, but I can't, so I just lie here anyway and answer all of the Slack stuff from my phone.

I was already not doing great but I'm a solid 1 out of 10 right now after last night. I legitimately don't want to do anything. There's so much follow up to be done now when I was already 3x overloaded. Doing what I can to delegate, but man, I don't even want to do that.

Last night's incident barely even registers on the severity scale of the poo poo I've seen. But this is really kicking me when I'm down.

Methanar fucked around with this message at 19:20 on Oct 10, 2022

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
Why does your org rely so heavily on spinnaker such that it has to page someone in the US? Like, if you’ve got Euro colleagues and they’re using it why can’t they fix it rather than paging you in the middle of the night?

Methanar
Sep 26, 2013

by the sex ghost

Blinkz0rz posted:

Why does your org rely so heavily on spinnaker such that it has to page someone in the US? Like, if you’ve got Euro colleagues and they’re using it why can’t they fix it rather than paging you in the middle of the night?

I'm on call this week. Nobody really knows how to manage Spinnaker; I'm just somehow, by accident, the most familiar with it after dealing with it for so long. I didn't want to throw the incident response completely over the wall to a relatively new European hire, who knows absolutely nothing about Spinnaker, when it's completely on fire and everyone and their dog is jumping into the pile to +1 that they're having issues. Which is why I only did the handoff many hours later, once things were stabilized and reasonably understood.

Spinnaker is the worst thing we own by a huge margin and I want to get rid of it so bad. I'm aware of like 5 different problems with it, but I don't have time to deal with it myself. I've assigned a few work items out on fixing it, but it's just very slow to get anything done, and last night was an example of the fixes taking too long to be made.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
It's also just a release bottleneck unless you're horribly abusing it. The worst thing that should happen if Spinnaker goes down is that devs lose the ability to release services, which can either be dealt with by someone geographically closer to the devs in question or wait until morning.

If I were expected to be on call for spinnaker after hours I’d quit effective immediately.

Methanar
Sep 26, 2013

by the sex ghost
I'm just dumb and stress myself out constantly.

3 years ago I did push back with the idea that issues with a dev environment are not wake-me-up-in-the-middle-of-the-night worthy. When I did that, the SRE escalating just went down the list paging people until he got his response. I thought it was bs that I looked bad when the senior at the time got woken up to deal with something I pushed back on. But whatever.

I wasn't about to tell the 15 people dogpiling with complaints, who are just trying to do their jobs, that I'm going to compromise their deadlines and their ability to do integration testing or whatever else is coming up with the end of the company quarter at the end of October. Real things that were mentioned to me. It reflects badly on everyone and is non-productive for me to leave all of the dev groups in a bad spot when the thing that's broken is explicitly owned by us and I am on call. That's not how I do things.

Like I mentioned, the Spinnaker support situation is trash and severely neglected. It would have been extremely unfair for me to dump this on somebody else who knows nothing about Spinnaker and wasn't on call. It's just a bad situation and I was holding the bag when things went sideways.

I have a long list of things to be fixed. I just don't have enough time to throw at it. My own or otherwise. Idk

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?

Methanar posted:

I'm just dumb and stress myself out constantly.

3 years ago I did push back with the idea that issues with a dev environment are not wake-me-up-in-the-middle-of-the-night worthy. When I did that, the SRE escalating just went down the list paging people until he got his response. I thought it was bs that I looked bad when the senior at the time got woken up to deal with something I pushed back on. But whatever.

I wasn't about to tell the 15 people dogpiling with complaints, who are just trying to do their jobs, that I'm going to compromise their deadlines and their ability to do integration testing or whatever else is coming up with the end of the company quarter at the end of October. Real things that were mentioned to me. It reflects badly on everyone and is non-productive for me to leave all of the dev groups in a bad spot when the thing that's broken is explicitly owned by us and I am on call. That's not how I do things.

Like I mentioned, the Spinnaker support situation is trash and severely neglected. It would have been extremely unfair for me to dump this on somebody else who knows nothing about Spinnaker and wasn't on call. It's just a bad situation and I was holding the bag when things went sideways.

I have a long list of things to be fixed. I just don't have enough time to throw at it. My own or otherwise. Idk

Sounds like you need about four more people on your team

MightyBigMinus
Jan 26, 2020

or, and this is the hardest part of all, a mature understanding of how to scope and when to say no

crazypenguin
Mar 9, 2005
nothing witty here, move along
"Engineers getting pages in the middle of the night about mere dev infrastructure" is exactly the sort of thing that should be resulting in post-incident engineering post-mortems.

It's absolutely something that shouldn't be happening, and there's lots of solutions. Many of them are very good.

I'm personally fond of devops actually meaning devops, not this sre or "devops just means ops runs kubernetes now lol" garbage. Then teams with broken dev infrastructure would of course just fix their own stuff and paging someone else about it would be completely unthinkable.

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

Methanar posted:

3 years ago I did push back with the idea that issues with a dev environment are not wake-me-up-in-the-middle-of-the-night worthy. When I did that, the SRE escalating just went down the list paging people until he got his response. I thought it was bs that I looked bad when the senior at the time got woken up to deal with something I pushed back on. But whatever.

I wasn't about to tell the 15 people dogpiling with complaints, who are just trying to do their jobs, that I'm going to compromise their deadlines and their ability to do integration testing or whatever else is coming up with the end of the company quarter at the end of October. Real things that were mentioned to me. It reflects badly on everyone and is non-productive for me to leave all of the dev groups in a bad spot when the thing that's broken is explicitly owned by us and I am on call. That's not how I do things.

One of the things that leads to less work and more money is becoming much more comfortable with allowing bad things to happen to other people. All that struggling and waking up in the middle of the night is doing is giving you more work, limiting your career, and bandaging over a problem that should be someone else's.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
sorry, disagree, if you own “development environment” infrastructure, it doesn’t matter that it’s being used for development, that’s production infrastructure. teams in other regions being able to do development is business critical.

if you own the infrastructure, you of course need to be granted the resources and development time owning it requires, but that’s a separate issue.

LochNessMonster
Feb 3, 2005

I need about three fitty


my homie dhall posted:

if you own the infrastructure, you of course need to be granted the resources and development time owning it requires, but that’s a separate issue.

This is what goes wrong 99 out of 100 times which means dev environments are going to be treated as low(er) prio.

Usually companies that do this well also have teams in multiple timezones to support such infra and don’t need to page someone for a broken dev environment.

Vanadium
Jan 8, 2005

I think we tend to treat outages in the deploy infra as critical just because we may not be able to respond to an actual customer-facing outage if we can't deploy fixes or maybe even roll back. idk to what extent it's normal to have CI and production deploys all coupled together.

Hadlock
Nov 9, 2004

Yeah "can't deploy to prod" is two things

1) when they actually have the fix, now there's no way to build and deploy it

possibly more important for your career,
2) everyone from your boss all the way up to the CEO, then back down again, in a series of meetings that are basically "you had ONE job" while the company is losing money waiting to deploy. Our analytics dept had a dashboard and could project how much money we lost in revenue for each outage. Our department is well funded and has very few outages now

Methanar
Sep 26, 2013

by the sex ghost
I'm at the life stage where I'm thinking about jira workflows in the shower.

LochNessMonster
Feb 3, 2005

I need about three fitty


Methanar posted:

I'm at the life stage where I'm thinking about jira workflows in the shower.

If you’re considering a career switch to Product Owner or Scrum Master this is probably the wrong thread.

Hadlock
Nov 9, 2004

I've been there

It's time to switch jobs to a less hosed up work environment for more money

Sylink
Apr 17, 2004

I'm tasked with building out EKS with terraform, and I'm new to terraform but have used k8s quite a bit, so this is a fun disaster in the making.


However, I'm not sure how to reconcile certain problems in terraform when connecting all the dots.

For example, I have an EKS cluster, then various apps on it for development purposes. We want to have the same app cloned into separate namespaces on this cluster, representing git branches/PRs etc. Simple enough via a helm chart.

However, to make them useful, they need to be exposed via an ALB. So it's not clear to me how to design the various components of the ALB part so that it takes in a list of potential endpoints and then generates all the updated target groups and listeners based off those domains.

So we can call a helm chart install via terraform, and then it goes back up and updates the existing ALB with all the proper endpoint configurations. Or maybe I just don't understand how to use terraform to iterate over lists the right way.

Hadlock
Nov 9, 2004

When you spin up the k8s cluster, you'll also bootstrap nginx or traefik via helm charts, and then each custom helm chart will have an ingress that gets served by nginx or traefik. You'll also probably want to install cert-manager to handle SSL certs using Let's Encrypt. Your ingresses will inform how the load balancer per cluster is configured automatically
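
For the cert-manager piece, the Let's Encrypt issuer is roughly this; a sketch only, with a placeholder name and email, assuming the nginx ingress class from above:
code:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod                 # hypothetical issuer name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com               # placeholder contact email
    privateKeySecretRef:
      name: letsencrypt-prod-account-key # secret where the ACME account key is stored
    solvers:
      - http01:
          ingress:
            class: nginx                 # assumes the nginx ingress controller installed above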

I think EKS only works with ELB not ALB? Maybe that has changed

So yeah your terraform will look like this

Spin up cluster
Install nginx
Install cert manager
Install flux/argocd

And then create your branch and references to the helm chart and let your CI/CD install the helm chart to the correct namespaces

Someday nginx will support let's encrypt out of the box

Edit: installing helm charts via terraform is primarily for bootstrapping the cluster: nginx, cert-manager etc. You should not be installing/deleting helm/hadlockco/myapp via terraform; that should be handled by your gitops branches and CI/CD

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
Use the aws load balancer ingress controller instead of caddy/traefik/nginx. It handles mapping ingresses to ALBs.
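
With that controller installed, an ingress that provisions an ALB looks something like this; a sketch, with placeholder names and hostname:
code:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp                                   # hypothetical app name
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip   # route straight to pod IPs
spec:
  ingressClassName: alb                         # handled by the AWS Load Balancer Controller
  rules:
    - host: myapp.example.com                   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80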

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hadlock posted:

When you spin up the k8s cluster, you'll also bootstrap nginx or traefik via helm charts, and then each custom helm chart will have an ingress that gets served by nginx or traefik. You'll also probably want to install cert-manager to handle SSL certs using Let's Encrypt. Your ingresses will inform how the load balancer per cluster is configured automatically

I think EKS only works with ELB not ALB? Maybe that has changed

So yeah your terraform will look like this

Spin up cluster
Install nginx
Install cert manager
Install flux/argocd

And then create your branch and references to the helm chart and let your CI/CD install the helm chart to the correct namespaces

Someday nginx will support let's encrypt out of the box

Edit: installing helm charts via terraform is primarily for bootstrapping the cluster: nginx, cert-manager etc. You should not be installing/deleting helm/hadlockco/myapp via terraform; that should be handled by your gitops branches and CI/CD

I don’t know if I would want to manage the helm charts for nginx/cert-manager via terraform at all. I think I’d rather have a data structure accessible in a K/V store with all your clusters, and then have a repo with all your managed applications that can deploy to all those clusters. Use something like helmfile for declarative specifications of your chart releases and use your CI/CD of choice to deploy the Helm releases.

That way you can deploy config changes to all your clusters at once rather than go cluster by cluster, *and* you don’t need to deal with the misery that’s managing k8s resources via terraform.


If I can recommend nothing else, you want to manage as few kubernetes resources with terraform as possible. Helm is much better suited for that work. In EKS you’ll largely be provisioning things like load balancers using annotations on service or ingress objects, so you wouldn’t manage or define them in terraform at all. They’re aws managed and governed by your k8s templates, which are governed by helm.
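
For what it's worth, a helmfile.yaml for that kind of shared tooling is roughly this; a sketch, with placeholder chart versions and values paths:
code:
repositories:
  - name: ingress-nginx
    url: https://kubernetes.github.io/ingress-nginx
  - name: jetstack
    url: https://charts.jetstack.io

releases:
  - name: ingress-nginx
    namespace: ingress-nginx
    chart: ingress-nginx/ingress-nginx
    version: 4.7.1                   # placeholder version
    values:
      - values/ingress-nginx.yaml    # per-environment overrides
  - name: cert-manager
    namespace: cert-manager
    chart: jetstack/cert-manager
    version: v1.12.0                 # placeholder version
    set:
      - name: installCRDs
        value: "true"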

The Iron Rose fucked around with this message at 19:55 on Oct 24, 2022

Sylink
Apr 17, 2004

Thanks for the suggestions, I agree that managing k8s with terraform is weird as gently caress and I don't like it. So I have been arriving at the conclusion that terraform should just do the top level resources, and then helm etc for everything else.

I did try the aws alb controller, it worked, but I never got external-dns to work properly with it. Maybe there is a simpler way to update a Route53 record in response to an ingress?

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Sylink posted:

Thanks for the suggestions, I agree that managing k8s with terraform is weird as gently caress and I don't like it. So I have been arriving at the conclusion that terraform should just do the top level resources, and then helm etc for everything else.

I did try the aws alb controller, it worked, but I never got external-dns to work properly with it. Maybe there is a simpler way to update a Route53 record in response to an ingress?

You can survive easily enough by managing DNS separately altogether. Have a terraform repo with your R53 records, expose it to the entire engineering org, and set A records to resolve to your ingress controller's service IP, using an internal/external load balancer IP as needed. The ingress controller will then route to the backing service based on the host header, which is defined in each ingress object.

It’s a separate action altogether. External DNS is a perfectly valid way to do it, but I just haven’t used it personally (we use our own self managed DNS because of very silly reasons).

Hadlock
Nov 9, 2004

I have never run into a situation where I unilaterally changed all my nginx instances at once. We always used dev clusters to test changes before promoting to prod. The few times we didn't, it was the CTO and his lieutenant cowboying changes into prod, and it almost always broke something

I'm sure you could configure nginx to read secrets/config from KMS or something? If that was actually a concern?

Not super worried about secrets in dev clusters. And secrets on prod clusters? If you have access to the prod cluster, you probably have the ability to read/modify those secrets via (choose your favorite roundabout method). Limiting access to production via the normal methods has always passed security audit. Not everyone needs the beyondcorp security model

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hadlock posted:

I have never run into a situation where I unilaterally changed all my nginx instances at once. We always used dev clusters to test changes before promoting to prod. The few times we didn't, it was the CTO and his lieutenant cowboying changes into prod, and it almost always broke something

I'm sure you could configure nginx to read secrets/config from KMS or something? If that was actually a concern?

Not super worried about secrets in dev clusters. And secrets on prod clusters? If you have access to the prod cluster, you probably have the ability to read/modify those secrets via (choose your favorite roundabout method). Limiting access to production via the normal methods has always passed security audit. Not everyone needs the beyondcorp security model

Do you not want to update the version of your helm deployments of nginx or cert manager? APIs get deprecated, config changes need to be made, replicas need to be added, annotations adjusted…

You wouldn't change all of them at once, obviously test in dev first, or have release rings, whatever. But I'd much rather set yourself up for scaling now than have to go and update one terraform state per cluster every time you want to make a change, which also allows you to have a consistent config across your environment.

This is specifically for *your* managed services you want to put on all clusters though. I would include the ingress controller, security tooling, etc. here, but ingress objects should obviously be managed on an environment-by-environment, if not cluster-by-cluster, basis. Same with cert-manager, since you probably want to limit access for providing domain certs to specific clusters.

luminalflux
May 27, 2005



The Iron Rose posted:

If I can recommend nothing else, you want to manage as few kubernetes resources with terraform as possible

This is the way. Terraform is not great at all at managing Kubernetes - we only let it install ArgoCD and the ArgoCD bootstrap application (which installs ALB controller, ExternalDNS, Datadog etc.) and even that's too much for it. We're looking to move to managing nothing in k8s via terraform.
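
The bootstrap application in that kind of setup is just an ArgoCD Application pointed at a directory of more Application manifests; a sketch, with a placeholder repo URL and path:
code:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bootstrap
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/cluster-apps.git   # placeholder repo
    targetRevision: main
    path: apps                    # directory holding Applications for ALB controller, ExternalDNS, Datadog, ...
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true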

Docjowles
Apr 9, 2009

I know I’m late to this but I had been away from k8s for a couple years and I am blown away by how awesome ArgoCD is. I just push changes to git and within a minute or two it’s reflected in the cluster.

I’m sure as I use it more it will have warts just like any software. But my initial experience is overwhelmingly positive.

luminalflux
May 27, 2005



I'm deep into ArgoCD right now and it's real nice... except when you find warts and it isn't. Thankfully there's a lot of traction around it compared to Spinnaker which I'm coming from.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Sylink posted:

Thanks for the suggestions, I agree that managing k8s with terraform is weird as gently caress and I don't like it. So I have been arriving at the conclusion that terraform should just do the top level resources, and then helm etc for everything else.

I did try the aws alb controller, it worked, but I never got external-dns to work properly with it. Maybe there is a simpler way to update a Route53 record in response to an ingress?
External DNS is extremely easy to get working with the AWS Load Balancer Controller nowadays. It mostly Just Works. The documentation is very good, and most of it is superfluous if you understand what you're doing (assuming roles into the correct accounts if the zone is hosted in a different account, etc.). Enable the registry to use TXT records for ownership, then keep an eye on the logs to see what's failing.
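
The relevant container args for external-dns in that setup look roughly like this; a sketch, with a placeholder domain filter and owner ID:
code:
# args on the external-dns Deployment (the Helm chart values map to these flags)
args:
  - --source=ingress              # watch Ingress objects created for the ALBs
  - --provider=aws
  - --aws-zone-type=public
  - --domain-filter=example.com   # placeholder hosted zone
  - --registry=txt                # track record ownership in TXT records
  - --txt-owner-id=my-cluster     # placeholder owner id, unique per cluster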

LochNessMonster
Feb 3, 2005

I need about three fitty


Is there a good way to start Azure DevOps pipelines in batches? I'm trying to find a way to trigger over 1k downstream pipelines after my initial pipeline runs successfully.

I'm not sure if our Azure DevOps infra will like it if I start them all at once as we share the build agents company wide. On busy days we're already running into some limitations where we see 30+ min of queues. Bad scaling/sizing on their part, I know, but I don't want to make the problem worse. The plan is to start this process outside of business hours to minimize impact on the rest of the organization, but you just know there's going to be one day that somebody can't deploy a hotfix for a prio 1 incident because there's a 4 hour queue for the build agents.

The main pipeline will create a feature branch and update a config file with versions for each downstream repo, which will build on commit. The downstream repos are managed in a config file in the main repo, so it's iterable. The only thing I came up with so far is to externalize updating the downstream repos so it can be done in batches. Was hoping I'm missing something and there's an easier way.

The Fool
Oct 16, 2003


use the ado api, either right in your initial pipeline or from a function app, to trigger and manage the downstream pipelines

or redo your design to be less silly
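
if you go the first route, a sketch of what that looks like as a step in the initial pipeline, using the Runs REST API and the job access token (the org, project, and pipeline ID are placeholders, and you'd loop over your config file and throttle instead of hard-coding one ID):
code:
steps:
  - bash: |
      # Queue one downstream pipeline run via the Azure DevOps REST API.
      curl -sS -X POST \
        -H "Authorization: Bearer $(System.AccessToken)" \
        -H "Content-Type: application/json" \
        -d '{"resources": {"repositories": {"self": {"refName": "refs/heads/main"}}}}' \
        "https://dev.azure.com/my-org/my-project/_apis/pipelines/1234/runs?api-version=7.0"
    displayName: Trigger downstream pipeline (placeholder IDs)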

LochNessMonster
Feb 3, 2005

I need about three fitty


The Fool posted:

or redo your design to be less silly

Been thinking about a better approach but drawing blanks.

All these downstream pipelines are building container images based on customer specific configs.

I guess I could put the configs in an S3 bucket and download them on container start, but that would mean it's not immutable anymore. Messing with the config for active services means I could have multiple configs live for the same version.

It’s an inherited service that’s less optimal in many other ways and won’t see the end of 2023 (hopefully) so rewriting the entire thing isn’t worth it I think.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

LochNessMonster posted:

Been thinking about a better approach but drawing blanks.

All these downstream pipelines are building container images based on customer specific configs.

I guess I could put the configs in an S3 bucket and download them on container start, but that would mean it's not immutable anymore. Messing with the config for active services means I could have multiple configs live for the same version.

It’s an inherited service that’s less optimal in many other ways and won’t see the end of 2023 (hopefully) so rewriting the entire thing isn’t worth it I think.

It's a common pattern to pull down some config at runtime. You wouldn't want secrets baked into the images, for example.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

LochNessMonster posted:

All these downstream pipelines are building container images based on customer specific configs.

Is the build config-dependent, or the runtime? Because if it's the latter, you should absolutely be injecting the config at runtime, not during the build. While it can be convenient to ship config baked into the container, it results in situations like this where you now need 1 container per config. If the config is injected at runtime, you only need 1 build.
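
A sketch of the runtime-injection version, if these land on Kubernetes: one shared image, with the per-customer config mounted from a ConfigMap (all names are placeholders):
code:
apiVersion: v1
kind: Pod
metadata:
  name: customer-a-service                        # hypothetical workload
spec:
  containers:
    - name: app
      image: registry.example.com/service:1.2.3   # one image shared by every customer
      volumeMounts:
        - name: customer-config
          mountPath: /etc/service                 # app reads its config from here at startup
  volumes:
    - name: customer-config
      configMap:
        name: customer-a-config                   # per-customer ConfigMap, swapped without rebuilding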
