Dyscrasia
Jun 23, 2003
Give Me Hamms Premium Draft or Give Me DEATH!!!!
Is there any sort of consensus on IaC management platforms? We use ARM templates at this time and I'm looking to not just improve our Azure capabilities but also other platforms we use. Drift detection is a key feature for me.

Terraform Cloud seems to be on the way out, while Spacelift is quite interesting. Pulumi is also interesting, but I am working with an infrastructure team that will have an easier time adopting Terraform/OpenTofu than the languages Pulumi works with.


Hadlock
Nov 9, 2004

It seems like most people are using GitHub Actions or Terraform Cloud, from what I can tell. You can self-host Atlantis, but I've never tried it personally.

Pulumi is something people like to talk about but I don't know anyone using it in production as their primary go to

The Fool
Oct 16, 2003


Spacelift seems solid but I've never used it

If you're just doing this for your own team, or the infrastructure is going to be managed by a smaller group and you're not doing a self-service platform thing, I wouldn't bother

just run the terraform/opentofu in pipelines/actions and keep remote state in storage accounts/s3
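
A minimal sketch of how a scheduled pipeline job could double as a drift check, assuming the Terraform CLI is on the runner's PATH and the backend is already initialised. It relies on `terraform plan -detailed-exitcode`, which exits 0 when the plan is empty, 1 on error, and 2 when there are changes to make. The function names and the `workdir` argument are hypothetical, and passing `binary="tofu"` assumes OpenTofu keeps the same flag and exit codes.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
EXIT_MEANINGS = {0: "in_sync", 1: "error", 2: "drift"}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a status; unknown codes are treated as errors."""
    return EXIT_MEANINGS.get(code, "error")


def check_drift(workdir: str, binary: str = "terraform") -> str:
    """Run a non-interactive plan in workdir and classify the result."""
    result = subprocess.run(
        [binary, "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Note that a non-empty plan only means drift if the checked-out code has already been applied; on a branch with unapplied changes the same exit code just reflects pending work.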

Docjowles
Apr 9, 2009

Hadlock posted:

It seems like most people are using GitHub Actions or Terraform Cloud, from what I can tell. You can self-host Atlantis, but I've never tried it personally.

Pulumi is something people like to talk about but I don't know anyone using it in production as their primary go to

We run Atlantis integrated with Gitlab for pushing out Terraform. It's worlds better than people yoloing poo poo from their workstation obviously. There's locking and the opportunity to force code/plan reviews if desired and making sure the branch you're about to deploy isn't a month behind main. It's an important piece but I would not call it a full "IaC management platform" by any means. Atlantis IS free, though, and by god my company will tie itself into a pretzel to avoid buying software if at all possible :v:

whats for dinner
Sep 25, 2006

IT TURN OUT METAL FOR DINNER!

Hadlock posted:

It seems like most people are using GitHub Actions or Terraform Cloud, from what I can tell. You can self-host Atlantis, but I've never tried it personally.

Pulumi is something people like to talk about but I don't know anyone using it in production as their primary go to

We use Pulumi for isolated application stacks that devs work with frequently and Terraform for shared services like networking, shared load balancers, etc. which belong to ops staff. We run purely in AWS so all the relevant info like VPC IDs, cache endpoints, etc. are stored in SSM parameters by Terraform and sourced by Pulumi at runtime. It's fine for that use case but I wouldn't want to go any bigger or more complex.
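
A rough sketch of the hand-off described above, where one tool writes shared values to SSM and another reads them at deploy time. This is not the poster's code: it assumes a boto3-style SSM client is passed in (for example `boto3.client("ssm")`), and the parameter names in the usage note are made up for illustration.

```python
from typing import Any, Dict, Iterable


def load_shared_config(ssm_client: Any, names: Iterable[str]) -> Dict[str, str]:
    """Fetch each named SSM parameter and return a name -> value mapping."""
    values: Dict[str, str] = {}
    for name in names:
        # get_parameter returns {"Parameter": {"Name": ..., "Value": ...}}
        response = ssm_client.get_parameter(Name=name)
        values[name] = response["Parameter"]["Value"]
    return values
```

With a real client this might be called as `load_shared_config(boto3.client("ssm"), ["/shared/vpc_id", "/shared/cache_endpoint"])`, where both paths are hypothetical.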

Part of why I picked it was the automation API - it made it very easy to write deployment tooling around our ECS tasks. The smaller community means less knowledge available when something goes wrong, their docs are pretty tragic and badly organised and a lot of the state manipulation stuff is either painful or non-existent which makes importing unmanaged resources or refactoring a massive chore.

We might be migrating away from it to Terraform because that's what the rest of the dev teams seem to be consolidating around and I certainly won't be shedding any tears. And it turns out the automation API (for C# at least) just shells out to a local pulumi binary and loads the results into some models anyway; so for our use case, redoing that with Terraform's an option.

Terraform or Pulumi we just use GitLab runner. When a commit is made to a branch other than main, it runs a plan without refreshing state. If a commit is made to main it runs a plan, waits for manual approval and then runs apply.
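
The branch rule above can be sketched as a small function that picks the steps a runner would execute. This only covers the Terraform side and is an illustration under assumptions: the function name, the `APPROVAL_GATE` marker, and the `tfplan` file name are hypothetical, and the approval step stands in for whatever manual gate the CI system provides. The flags used (`-refresh=false`, `-out`) are standard `terraform plan` options.

```python
from typing import List

# Marker for the CI system's manual approval step (not a real command).
APPROVAL_GATE = "<wait for manual approval>"


def pipeline_steps(branch: str, binary: str = "terraform") -> List[str]:
    """Return the ordered steps to run for a commit on the given branch."""
    if branch == "main":
        return [
            f"{binary} plan -out=tfplan",  # save the plan that gets reviewed
            APPROVAL_GATE,
            f"{binary} apply tfplan",      # apply exactly the saved plan
        ]
    # Other branches: quick plan that skips refreshing state.
    return [f"{binary} plan -refresh=false"]
```

Saving the plan on main and applying that file means the approved plan is the one that runs, rather than a fresh plan computed after approval.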

whats for dinner fucked around with this message at 01:42 on May 11, 2024

Hadlock
Nov 9, 2004

I'm not arguing with you but for a full "IaC management platform" what additional boxes would it need to tick for you to give it the green light

Dyscrasia
Jun 23, 2003
Give Me Hamms Premium Draft or Give Me DEATH!!!!
This overall effort is about modernizing our processes and ensuring compliance. Terraform as the input is required, but kubernetes is a plus with OS level config being a bonus. Config drift detection is the primary factor though. My goal is to ensure standards compliance and defeat pushes by our projects to dig themselves into the holes that they do.

We currently have a very flexible ARM template practice that cannot really be enforced. So state management is my main thing. A proper repo strategy, which we lack today, will be a part of this effort. We use git as storage right now, not as any sort of change control mechanism.

We have a solid devops practice for applications, but our infrastructure is lacking. I'm looking at products because I won't get developer hours to roll our own solution.

And no, please challenge what I'm looking for, if I'm off base, I'm interested.

Dyscrasia fucked around with this message at 04:02 on May 11, 2024

Hadlock
Nov 9, 2004

Dyscrasia posted:

This overall effort is about modernizing our processes and ensuring compliance. Terraform as the input is required, but kubernetes is a plus with OS level config being a bonus. Config drift detection is the primary factor though. My goal is to ensure standards compliance and defeat pushes by our projects to dig themselves into the holes that they do.

I ended up in my current field because I was looking for a way to solve configuration drift and went to work for a company that monitored it, but ultimately they had to pivot out of the space because it was obvious that declarative config and iac were going to destroy that market

Between git ops, iac and declarative config (k8s) you effectively eliminate the entire class of configuration drift errors instantly

I guess the point I'm trying to make is that by using git ops and declarative config there's really no need for configuration drift monitoring

The favorite story I like to tell is that we used to do war games, where on Friday afternoon someone would do something nefarious like, turn off nginx, comment out the database password in a config file, or rename the SSL cert. Then engineers would race to find and fix the problem

When we switched to Kubernetes and gitops we had to end that because if you broke something it would launch a replacement, or if you made a change to the git repo, they would just revert the breaking change and it'd start working again

You'll still probably need something to do scanning to show the auditors but I guess my point is you're heading in the right direction if you want to solve configuration drift once and for all
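
The self-healing behaviour described in this post can be illustrated with a toy reconcile step: compare the state declared in git with what is actually running, then push the live state back to the declared one. This is a conceptual sketch only, not how Kubernetes or any GitOps controller is implemented, and the dictionary keys are invented for the example.

```python
from typing import Dict, Optional, Tuple

Diff = Dict[str, Tuple[Optional[str], Optional[str]]]


def find_drift(desired: Dict[str, str], live: Dict[str, str]) -> Diff:
    """Return key -> (desired, live) for every key whose values differ.

    None marks a key that is missing on that side.
    """
    drift: Diff = {}
    for key in desired.keys() | live.keys():
        want, have = desired.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


def reconcile(desired: Dict[str, str], live: Dict[str, str]) -> Dict[str, str]:
    """Return the live state after undoing every drifted key."""
    fixed = dict(live)
    for key, (want, _have) in find_drift(desired, live).items():
        if want is None:
            fixed.pop(key, None)  # not declared in git, so remove it
        else:
            fixed[key] = want     # restore the declared value
    return fixed


desired = {"nginx": "running", "db_password": "set"}
live = {"nginx": "stopped", "db_password": "set", "debug_port": "open"}
repaired = reconcile(desired, live)
```

In the war-game scenario, stopping nginx by hand shows up as a one-key diff and is reverted on the next pass, which is why the exercise stopped being interesting.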

Warbird
May 23, 2012

America's Favorite Dumbass

Hadlock posted:

Pulumi is something people like to talk about but I don't know anyone using it in production as their primary go to

Hi, hello.

Dyscrasia posted:

the languages Pulumi works with.

It’s a little known language called Python. Shows real promise.

12 rats tied together
Sep 7, 2006

i've used it in production in the past and at the current place i'll be switching us to it later.

seems like OP has a pretty good handle on it. i would just note that pulumi has a yaml mode which makes it head & shoulders the best choice for any greenfield iac application since it's feature-compatible with terraform at a minimum, likely better in some dimensions (aws, azure), you can write it in almost any relevant programming language + the only relevant markup language, and things you build in it are shareable as cross-language constructs

i think the only reason to stay on terraform these days is if you have a specific need to consume an OSS thing that only works in terraform like community modules, which is not a place i would really want to be, but im sure its reality for a bunch of folks and i wish them well

Docjowles
Apr 9, 2009

Hadlock posted:

I ended up in my current field because I was looking for a way to solve configuration drift and went to work for a company that monitored it, but ultimately they had to pivot out of the space because it was obvious that declarative config and iac were going to destroy that market

Between git ops, iac and declarative config (k8s) you effectively eliminate the entire class of configuration drift errors instantly

I guess the point I'm trying to make is that by using git ops and declarative config there's really no need for configuration drift monitoring

The favorite story I like to tell is that we used to do war games, where on Friday afternoon someone would do something nefarious like, turn off nginx, comment out the database password in a config file, or rename the SSL cert. Then engineers would race to find and fix the problem

When we switched to Kubernetes and gitops we had to end that because if you broke something it would launch a replacement, or if you made a change to the git repo, they would just revert the breaking change and it'd start working again

You'll still probably need something to do scanning to show the auditors but I guess my point is you're heading in the right direction if you want to solve configuration drift once and for all

I'm envious of your leadership's discipline that's allowed you to eliminate config drift. Everywhere I've worked since config management / iac / gitops have existed as concepts, the development side of the house has been able to successfully argue to senior management BUT WHAT IF AN OUTAGE??? I NEED A BACK DOOR JUST IN CASE. Which they then abuse to circumvent procedure and make random rear end changes when it suits them. The one that made me want to keep a handle of booze in my desk at a past job was devs editing some config file then doing "chattr +i foo" to make it immutable so Chef couldn't revert changes. Why are you doing this instead of checking the change into version control and letting it roll out automatically??? We have made this extremely easy for you :guinness:

I don't want to complain too much about my current employer because overall I am very happy. But the ops and security leadership always get steamrolled politically and we have to deal with stupid things that fall out of that sometimes. The company is pretty drat old in terms of businesses born on the web and turning the ship is a slow loving process.

Hadlock
Nov 9, 2004

Docjowles posted:

I'm envious of your leadership's discipline that's allowed you to eliminate config drift.

Yeah developers have no access to production currently. It's pretty great. I'm gonna keep it on lock down until (if) they fire me.

vanity slug
Jul 20, 2010

I'm pretty happy with Spacelift, pricing's good (compared to Terraform Cloud) and their support is very helpful and responsive.

Warbird
May 23, 2012

America's Favorite Dumbass

12 rats tied together posted:

i've used it in production in the past and at the current place i'll be switching us to it later.

seems like OP has a pretty good handle on it. i would just note that pulumi has a yaml mode which makes it head & shoulders the best choice for any greenfield iac application since it's feature-compatible with terraform at a minimum, likely better in some dimensions (aws, azure), you can write it in almost any relevant programming language + the only relevant markup language, and things you build in it are shareable as cross-language constructs

i think the only reason to stay on terraform these days is if you have a specific need to consume an OSS thing that only works in terraform like community modules, which is not a place i would really want to be, but im sure its reality for a bunch of folks and i wish them well

We need to sit down and review it now that we’re “going legit” and also, you know, IBM, but it seems like an excellent option. I don’t love that it requires an account and phones home your state or whatever but I’d presume that if you give them a bunch of money you can get an airgapped version or whatever. They’re going to be big mad when they find out what the guy that genned up our infrastructure is using it for.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
We have a free tier Terraform account for one project, with state stored in Terraform Cloud (I don't know if it's possible to set it up otherwise, it was here when I started). We're also on some old tier of the free plan that's limited by users rather than resources. We're over 500 resources so we can't switch over to the "new" free tier.

I don't like how running a plan locally just triggers a plan in TFC, because it ends up being incredibly slow. I don't like how all it reports back to GitHub is a green or red check, and you have to actually go into TFC to see the plan output for a PR. That requires an account, and I don't know whether there are still limits on the number of accounts.

Doing a POC on Atlantis is a low priority project for us. Actually a second POC, but the first was also before my time so not sure where it went. We're interested in putting plan output in the PRs, so devs can see what their changes are going to do. Also I'd like for state to still be in s3 and be able to run a plan locally in a reasonable amount of time.

Currently most of our infrastructure for various services is configured in a monorepo, with different modules for each service/environment. We recently had an outage because nobody knew whose job it was to actually apply terraform changes after a PR gets merged, and so we broke a service when we finally did run an apply. And so we're hoping that Atlantis is something we can drop into our environment and help clean things up a bit. It might be the first service we'd run that isn't one of our own products, but the alternative seems to be managing a bunch of custom configuration in our CI to just replicate what something like Atlantis does.

Docjowles
Apr 9, 2009

We have Atlantis + gitlab set up such that you cannot merge a branch unless Atlantis cleanly ran terraform apply first. Then when it does it gets merged automatically. I would be kind of shocked if TFC and GitHub cannot be configured similarly but I’ve not actually used it so I guess it’s possible. But anyway yeah it sounds like Atlantis can do everything you want. “Someone” shouldn’t be responsible for applying; it should just happen in CI/CD when the MR is approved, whether that’s after a manual review or just when tests pass if you’re confident enough.

Docjowles fucked around with this message at 16:24 on May 11, 2024

George Wright
Nov 20, 2005
The people creating the terraform PRs should be the ones who own, apply, and shepherd the changes.

We also use Atlantis + Gitlab. It’s not perfect, but it’s free and it’s better than running apply from your laptop.

I haven’t used TFC for years so I’m sure my opinion of it is out of date, hopefully.

Resdfru
Jun 4, 2004

I'm a freak on a leash.
I recently set up Atlantis, it's definitely got some nice pros. I think my preference is still gonna be to roll my own pipeline if I'm on github or gitlab because you can use oidc and if you're using their runners you don't need any infra for the terraform pipeline. If you're using bitbucket or something lovely like that then Atlantis is the way to go. Atlantis is also probably a good option to get something deployed ASAP. Definitely always set up apply before merge and add branch protection rules so you can't merge until apply. Agree with whoever said the person writing the terraform is responsible for applying it. Wild that people are pushing terraform prs and just assuming someone else is gonna finish their work for them.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
We use Atlantis and have a pretty significant footprint. As a dev I was skeptical at first but it's pretty decent so far. State locking is incredibly useful and plan output in PRs is great. Only thing that's not ideal is just that it's still terraform under the hood so it's still all the sucky bits of hcl, etc.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
Our terraform is in a right state, we've got like 18 layers of "it shouldn't be that way" and yet... Thankfully they've brought me on, and I just keep finding things and fixing them.

Dyscrasia
Jun 23, 2003
Give Me Hamms Premium Draft or Give Me DEATH!!!!
Seriously, thank you everyone, this is all extremely helpful information and I'll have to do some more consideration, particularly regarding terraform in general.

whats for dinner
Sep 25, 2006

IT TURN OUT METAL FOR DINNER!

Warbird posted:

We need to sit down and review it now that we’re “going legit” and also, you know, IBM, but it seems like an excellent option. I don’t love that it requires an account and phones home your state or whatever but I’d presume that if you give them a bunch of money you can get an airgapped version or whatever. They’re going to be big mad when they find out what the guy that genned up our infrastructure is using it for.

It does let you use self-managed backends like S3: https://www.pulumi.com/docs/concepts/state/#using-a-self-managed-backend

That's what we do and it works just fine

Extremely Penetrated
Aug 8, 2004
Hail Spwwttag.
Has anyone looked at using Digger instead of Atlantis? Seems potentially less clunky and with better PR output but I haven't bothered yet.

Hadlock
Nov 9, 2004

Just FYI, apparently back in February argo-helm released v6.0.0 of their argo-cd chart. They're already on v6.9.2

The big breaking change is that you don't need to do absurd backflips to get the standard helm chart to play nice with modern ingress controllers :woop:

Warbird
May 23, 2012

America's Favorite Dumbass

I don't really understand Argo despite working with it in the past. Is my vague understanding that it's a fancy helm repo frontend that can also deploy the things as well more or less on point? My K8s ecosystem understanding is still laughably underdeveloped.

I also made the extremely stupid mistake of volunteering to handle compliance stuff to get our exempted app, now in prod with customers, into internal line with infosec and other bureaucracy and have been in email hell for a couple of weeks now.

George Wright
Nov 20, 2005

Warbird posted:

I don't really understand Argo despite working with it in the past. Is my vague understanding that it's a fancy helm repo frontend that can also deploy the things as well more or less on point? My K8s ecosystem understanding is still laughably underdeveloped.

I also made the extremely stupid mistake of volunteering to handle compliance stuff to get our exempted app, now in prod with customers, into internal line with infosec and other bureaucracy and have been in email hell for a couple of weeks now.

What do you need help with, specifically?

The way I explain it to internal customers is that it’s like helm apply but they don’t have to manage or automate it, and it will optionally prevent drift due to manual changes. That’s it.

People try it. Some like it because it’s less to think about. Some like it because of the UI. Some don’t like it. They probably don’t like it because it forces them to be honest about how often they need to make small, unplanned, and unannounced tweaks to keep things humming along.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
If you aren't using Argo Rollouts it's debatable if you ever needed Argo at all

Like, seriously, unless you're doing something dumb with sync waves, the features in Rollouts are the whole reason you choose Argo over something with fewer moving parts

Warbird
May 23, 2012

America's Favorite Dumbass

idk, I don't see a great deal of benefit there vs just having CI/CD call helm install/upgrade itself, but I fully admit my ignorance in the area. We don't have to dwell on it, I'm sure I'll get more contextual understanding in time.

Hadlock
Nov 9, 2004

Argo is nice because it gives you a single pane of glass, a dedicated notification controller, a health check for the overall system, and a way to pause specific helm deployments

Bhodi
Dec 9, 2007

Oh, it's just a cat.
Pillbug
that’s what we use it for, it’s a good watcher and has integration so you can see pod log messages and health, terminate and manage deployments and stuff like that from the console. there are lots of tools to get your code from git into active running but it’s that plus a pretty neat package that has some dashboards and usability tools built in. it’s also not just for helm, not all our appsets have charts and it works for them too.

Hadlock
Nov 9, 2004

I have been poring over the helm options to deploy, I think

Prometheus
Tempo
Loki
Grafana
Promtail/Alloy

Grafana offers no fewer than 5 helm charts for Loki, at least one is completely deprecated, and then there's a secret sixth one distributed with the main app

None of them seem to be updated with any particular regularity, every couple of months

Grafana went v11.0.0 over a week ago and the helm chart hasn't been updated despite having multiple rc in the lead up

There's a helm chart called "lgtm-stack" that also includes their Thanos (Prometheus distribution proxy) alternative, Mimir

Seems like the best option for deploying Prometheus/grafana is to just use the poorly named "kube-prometheus-stack" which installs Prometheus, Thanos, and grafana. Then install... I guess, the promtail or alloy helm chart (alloy just hit v1.0 and they're getting rid of promtail/grafana agent), then install loki, then install tempo?

Looks like CNCF is producing perses as a CNCF alternative to grafana

https://github.com/perses/perses

Looks like they hit v0.43 the other day, looks roughly on par with grafana 2/3 based on a single screenshot

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
my dirty k8s secret is that I’ve never actually deployed Prometheus/grafana in 5 years of kubernetesing

datadog and opentelemetry go a long way yall

The Fool
Oct 16, 2003


datadog costs money though

The Fool
Oct 16, 2003


grafana and prometheus are popular because it's the same garbage nerds are running in their home labs

digitalist
Nov 17, 2000

journey into Kirk's unknown


The Fool posted:

grafana and prometheus are popular because it's the same garbage nerds are running in their home labs

:colbert:

Resdfru
Jun 4, 2004

I'm a freak on a leash.
Lol I just tried to get prometheus grafana loki etc on my home lab k8s cluster. Like Hadlock I decided that it looks like the best way would be that kube prometheus stack. I've been trying to avoid helm whenever possible but that seems to be the only choice. The issue I ran into is that some of the things I'd like to change aren't exposed in values. Like the number of grafana replicas (I don't need 3) and I need to add an annotation to the service for external dns but they don't let you add annotations for it.

I tried patching with kustomize but I'm doing something wrong, because I'm not getting any errors but the service isn't being patched. I might poke at it a little more before just disabling that chart's grafana and installing grafana on my own.


well I guess im dumb. looks like I could just add the service/annotations to the values even though the default values file didn't have that

Resdfru fucked around with this message at 07:30 on May 18, 2024

Hadlock
Nov 9, 2004

I'm waiting for my company to XYZ and then we can move over to datadog or whatever, if necessary

There's a three year old jira ticket (which predates me by 2.5 years) that says "ship logging somewhere that engineers can see it" so that's where I'm coming from

We have a pretty small stack with ~500 active 9-5 users so "computer janitoring grafana" seems like decent make-work once everything is in place and stable, rather than farming it out to grafana cloud or datadog

There's no helm chart that formally supports grafana 11 which feels like a pretty big upgrade, plus I accidentally ran it so it already migrated my db so I'm stuck with it, so working on a solution for that

In the mean time, having the "lgtm" stack (lol) is nice as I'm not dropping down to the command line constantly, and helps build confidence with my boss that the added expense is worth it.

Importantly, building/struggle bussing through running your own monitoring + alerting stack also forces you to really look at what's going on, which is important if you're the principal over everything

Hadlock
Nov 9, 2004

The Fool posted:

datadog costs money though

Yeah this

At some point I'll make the financial case that datadog is cheaper than hiring a jr to janitor Prometheus/grafana. We just haven't crossed that bridge to operational maturity yet

Collateral Damage
Jun 13, 2009

Hadlock posted:

At some point I'll make the financial case that datadog is cheaper than hiring a jr to janitor Prometheus/grafana. We just haven't crossed that bridge to operational maturity yet
Datadog ... cheaper? :confused:

Datadog has a great service but holy poo poo it's expensive if you have even a modest amount of data you want to feed it. If you're in a multi dev organization you really need to police what people are putting into it because the cost can very quickly run away if people aren't careful, and if you're the one who championed Datadog guess who gets the blame when that unexpectedly big invoice shows up?


xzzy
Mar 5, 2009

Prometheus was a huge upgrade for me because it finally weaned the nerds I work with from the feeling that they needed to store metrics going back to the birth of Christ. Every time a ganglia rrd got lost I heard about it and it drove me nuts.

But when our server count outgrew what ganglia could handle I got to swap to Prometheus and was all "sorry, it can only handle 6 months, beyond that it tosses chunks." There was grumbling but they adapted.

It's been pretty maintenance free too. Wish I could say the same about grafana.
