trem_two
Oct 22, 2002

it is better if you keep saying I'm fat, as I will continue to score goals
Fun Shoe

xzzy posted:

Prometheus was a huge upgrade for me because it finally weaned the nerds I work with from the feeling that they needed to store metrics going back to the birth of Christ. Every time a ganglia rrd got lost I heard about it and it drove me nuts.

But when our server count outgrew what ganglia could handle I got to swap to Prometheus and was all "sorry, it can only handle 6 months, beyond that it tosses chunks." There was grumbling but they adapted.

That struggle is real, fighting a very similar fight right now

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

Collateral Damage posted:

Datadog ... cheaper? :confused:

Datadog has a great service but holy poo poo it's expensive if you have even a modest amount of data you want to feed it. If you're in a multi-dev organization you really need to police what people are putting into it, because the cost can very quickly run away if people aren't careful, and if you're the one who championed Datadog, guess who gets the blame when that unexpectedly big invoice shows up?

We're currently migrating off of Datadog because of the cost.

Why yes we do have a team whose job it is to police metrics until we're migrated. And yes, it's utterly thankless work and I genuinely feel bad for them.

Collateral Damage
Jun 13, 2009

Blinkz0rz posted:

We're currently migrating off of Datadog because of the cost.
Likewise, less than two years after we started using them.

The time you save on logging and presentation you instead have to spend making sure nobody lazily enables every metric/statistic on their project and blows your yearly Datadog budget in two days. The fear of sending too much data also made some teams get too restrictive with their logging, which caused its own set of issues.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
Blessedly we've only been using DD for metrics and do our logging internally, so a ton of the work is figuring out how to efficiently scale Prometheus, and converting apps that previously published directly to DD over to a statsd implementation whose backend can be switched during the migration.

Also dashboards and monitors. Jesus I don't envy that team.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Datadog custom metrics and logs are an obscenity cost wise.

Un/fortunately my business decided to go with elasticsearch instead for everything other than ootb infra metrics, which is… ok for logs, terrible for metrics, and kinda lovely at APM.

The latter one is kind of inherent to elastic/lucene though. A great columnar database it is not.

xzzy
Mar 5, 2009

I feel bad for our ES team, week after week of endless requests to log random poo poo and make a grafana view out of it.

No mister scientists you do not need a chart of how many times your crappy analysis software got deadlocked because you're trying to do everything over nfs and blew up the server because you were too smart for databases and decided to log your data as millions of 5k files.

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
We use Coralogix, which has a daily quota and will cut you off each day when you've exceeded your quota. It sucks to lose your metrics and logs for the day, but nice that you can't blow all your money at once if someone screws up.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
We acquired a log analysis product years ago, both as a product offering and so we could write off our own internal logging costs as R&D lol

In hindsight it wasn't the worst business decision we've made

Docjowles
Apr 9, 2009

trem_two posted:

That struggle is real, fighting a very similar fight right now

+1. Our devs were horrified to learn that when we introduced k8s and Prometheus they would no longer be able to look back at garbage collection stats from 5 years ago or whatever. As if that has any bearing on anything. They got over it.

Hadlock
Nov 9, 2004

I'm doing log retention at six weeks and metrics at 400 days

When troubleshooting stuff it's nice to look back a month and see if it was happening then; the code changes enough from week to week that logs older than that are mostly useless. For metrics, 400 days gives you annual trends plus about a month of margin, and it's a nice round number to plug into config files
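
For anyone wanting to wire up similar numbers, a minimal sketch, assuming a Prometheus + Loki stack (which may not be what Hadlock actually runs); the flag and field names come from those projects, the values are just the retention periods above:

code:
# Prometheus: keep metrics for ~400 days (startup flag / container arg)
#   --storage.tsdb.retention.time=400d

# Loki: keep logs for six weeks (42d = 1008h); retention is enforced by the compactor
limits_config:
  retention_period: 1008h
compactor:
  retention_enabled: true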

Mehsticles
Jul 2, 2012
I was hired as a DevOps apprentice here in the UK, late last year. It's been a big departure from what I was doing previously, but it's been so much more engaging. It's helped me get out of a mental rut that architecture had me in.

I'm the first 'DevOps-y' named role at the company, so it is a bit of a free-form "if you find it and think it's useful, make a case for spending the time on it" sort of deal, aside from learning the platform / server basics from our platform engineer & my manager. Which is both a blessing and a curse sometimes, getting my head turned every which way trying to figure out if something would be useful to us, and whether I can wrap my head around it in the first place...

I've deployed a Prometheus / Grafana / Alertmanager container set on a VM, which felt pretty awesome once I got it all running. I had been charged with manual monitoring 'check-ins' for things like UTC / HDD space monthly, and I'm happy that time is now automated away - it's saved me the manual logging we do for ISO purposes. It's just scraping node_exporter on a few machines, so I'll be looking at the statsd everyone's been mentioning - no doubt it'll end up looking like something I should have picked in the first place!

Centralised logging also seems like a worthwhile endeavour, rather than having to SSH into different machines to access them.

A big challenge for me, which a poster touched on a few pages back, is struggling to decipher 'best practices' for documentation for tools. I suppose architecture / requirements are so different from company to company, but sometimes it's a bit daunting figuring out if something I've been working on is 'production ready' for our servers or kept to my local machine. Not spending all my 'researching time' pissing in the wind figuring that out has been tricky, but I think that may be some 'imposter syndrome' telling me what I've been doing isn't useful at all.

Thanks all for sharing your experiences here! Hoping to absorb some of your wisdom by osmosis. It's good to see GitHub Actions mentioned, as that is something I've been learning as well, to see if it can be implemented in a new project we're looking at. Do you guys use GitHub's runners for that, or do you run your own?
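
For what it's worth, the node_exporter scraping you describe really is just a few lines of prometheus.yml; a minimal sketch with made-up hostnames:

code:
# prometheus.yml - minimal node_exporter scrape config; hostnames are hypothetical
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - server01.example.internal:9100   # node_exporter's default port
          - server02.example.internal:9100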

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Most tech metrics rot fast, but whatever your system, it's good to keep rollups of business metrics for a year or more in case you need to look back on seasonal trends

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I have log retention for 7 years :negative:

In fairness after 60 days it’s all in blob storage and still queryable on demand via searchable snapshots for a year, but still, yeesh

APM telemetry gets 60 days and I’m thinking of bumping to 90. It’s not that much data, and you can sample it pretty efficiently whereas people freak the gently caress out about the idea of sampling logs.

Hadlock
Nov 9, 2004

Vulture Culture posted:

Most tech metrics rot fast, but whatever your system, it's good to keep rollups of business metrics for a year or more in case you need to look back on seasonal trends

Yeah this is my main thing. If we're rising by 3% per month is that because summer is our peak, or are we actually growing?

Six weeks of logs gives you weekly trends, and a lot of jobs only run weekly or monthly; usually it's a couple of days after that monthly job fails that you get a chance to dig into the logs

At one place I worked we only kept logs for 31 days, and I had a script set up to scrape job logs to my local disk so I could troubleshoot issues

7 years retention IMO means export it to plain text and zip it up; ship it to glacier after 1 year or whatever your WORM solution is. Nobody needs searchable access to logs that old
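
If that archive lives in S3, the Glacier hand-off is a one-rule lifecycle policy. A sketch in CloudFormation-style YAML; the bucket name is hypothetical and the 7-year expiry is just the retention figure above:

code:
# Lifecycle sketch: move archived logs to Glacier after a year, delete at ~7 years
Resources:
  LogArchiveBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: example-log-archive        # hypothetical
      LifecycleConfiguration:
        Rules:
          - Id: glacier-after-one-year
            Status: Enabled
            Transitions:
              - StorageClass: GLACIER
                TransitionInDays: 365
            ExpirationInDays: 2555            # ~7 years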

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hadlock posted:

Yeah this is my main thing. If we're rising by 3% per month is that because summer is our peak, or are we actually growing?

Six weeks of logs gives you weekly trends, and a lot of jobs only run weekly or monthly; usually it's a couple of days after that monthly job fails that you get a chance to dig into the logs

At one place I worked we only kept logs for 31 days, and I had a script set up to scrape job logs to my local disk so I could troubleshoot issues

7 years retention IMO means export it to plain text and zip it up; ship it to glacier after 1 year or whatever your WORM solution is. Nobody needs searchable access to logs that old

These ones thankfully go to glacier and intelligent tiering is actually configured.

unfortunately we have like 50 PB of data from years ago that’s still in s3 standard and nobody knows what’s safe to delete 😭

Docjowles
Apr 9, 2009

The Iron Rose posted:

These ones thankfully go to glacier and intelligent tiering is actually configured.

unfortunately we have like 50 PB of data from years ago that’s still in s3 standard and nobody knows what’s safe to delete 😭

Yeah we struggle with this too. Not quite 50 PB (:stare:) but we are cutting Amazon a big rear end check every year for a few buckets dating back like a dozen years that nobody in the business is comfortable giving the OK to delete, or even putting a retention policy in place for new objects.

We are just a typical web company, not in finance or health care or anything, I cannot fathom what is in these logs that we are required to keep until the heat death of the universe. But “legal says do not delete” so welp. Above my pay grade.

George Wright
Nov 20, 2005
We once had a filer that we couldn’t retire because it had a single empty file on it named “caca”. Storage admins refused to retire it until they could find the owner, because they had no way of knowing if it was some critical file in use by a seldom-run job or something like that. They didn’t want to be responsible for the outage.

caca.

Docjowles
Apr 9, 2009

George Wright posted:

We once had a filer that we couldn’t retire because it had a single empty file on it named “caca”. Storage admins refused to retire it until they could find the owner, because they had no way of knowing if it was some critical file in use by a seldom-run job or something like that. They didn’t want to be responsible for the outage.

caca.

:lol:

xzzy
Mar 5, 2009

Docjowles posted:

But “legal says do not delete” so welp. Above my pay grade.

Legal actually said "don't give a gently caress it's not my money, don't delete it because I'm too busy playing golf to figure it out."

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
We have a bit of a sliding scale with metrics; basically the aggregation window gets longer the further back you go - 1-minute metrics are retained for like 15 days, then 5-minute metrics for 30 days, 1-hour for 180 days, and then I think our daily metrics are retained for years at least, not sure how long.

It's a pretty reasonable middle ground, since you don't have to retain the super huge volume of raw metrics from years ago but you can still go 'hey uh, what was our utilization like a few years ago during that one event?' It sucks sometimes because it means there's only a window in which to snapshot the high-resolution metrics, there isn't a convenient way to do it, and the tool we wrote for it sucks.
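
Prometheus by itself doesn't downsample like that (Thanos and Mimir do), but you can approximate the coarser tiers with recording rules and keep the rollup series on a longer retention than the raw ones. A hedged sketch; the metric and rule names are hypothetical:

code:
# Recording rules that pre-aggregate a raw counter into 5m and 1h rollups.
# The raw series can live on short retention while the rollups are kept longer
# (e.g. remote-written elsewhere) - an approximation, not true downsampling.
groups:
  - name: rollups
    rules:
      - record: instance:http_requests:rate5m      # hypothetical
        expr: sum by (instance) (rate(http_requests_total[5m]))
      - record: instance:http_requests:rate1h
        expr: sum by (instance) (rate(http_requests_total[1h]))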

Docjowles
Apr 9, 2009

xzzy posted:

Legal actually said "don't give a gently caress it's not my money, don't delete it because I'm too busy playing golf to figure it out."

You're not supposed to say the quiet part out loud

Warbird
May 23, 2012

America's Favorite Dumbass

On the subject, Grafana itself is a bit of a nightmare in the time I’ve spent with it. I need to sit down and spend a good amount of time with the query language and so on, but unless that’s a substantial portion of your actual job, getting going seems to be a somewhat fraught game of looking for a dashboard that more or less does what you want and hoping it’s only slightly broken.

It was weirdly hard to find a solution that would let me view container metrics on a given host when I went a-courting a few weeks back. I expect most of that is because people want to do that sort of thing with K8s instead, so eh.

Speaking of, the helm Grafana/prometheus/loki chart we ended up using was pretty simple to deal with. I can go pull it up if there is interest. I’m not sure what versions it’s using but still.

Hadlock
Nov 9, 2004

Grafana 11 came out and I was amused to see that it has OpenAI/LLM api support now. You can use it to name and describe your graphs

The describe-graph thing might actually be useful, because there have been a couple of times where I built what I thought was the correct graph, but it was using adjacent data

I haven't played around with it yet personally

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?

Resdfru posted:

well I guess im dumb. looks like I could just add the service/annotations to the values even though the default values file didn't have that

You can always look at the chart's templates folder and search for the key names you want to use, to see if they're populated via the values file or not.
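
For context on why that works even when the default values.yaml never mentions the key: charts commonly guard optional keys with `with`, so the block only renders if you set it. A typical (hypothetical, not from any specific chart) service template looks something like:

code:
# templates/service.yaml - typical pattern for optional annotations
apiVersion: v1
kind: Service
metadata:
  name: {{ include "mychart.fullname" . }}    # hypothetical helper
  {{- with .Values.service.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  type: {{ .Values.service.type | default "ClusterIP" }}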

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
In a multi-tenant K8s environment, how do you set up either Crossplane or the AWS Controller for Kubernetes so that the attack surface isn't completely insane? Both of them have busted designs that don't allow you to effectively use separate IRSA roles as a backstop to misuse. There's got to be a better answer than "just put all the logic in your policy engine".

Warbird
May 23, 2012

America's Favorite Dumbass

I finally got my k8s MacOS vm cluster up and running the other day and promptly got sick from one of kiddo’s daycare surprises. I don’t know if you ever had a night of cluster based fever dreams but I don’t fukkin recommend it.

JehovahsWetness
Dec 9, 2005

bang that shit retarded

Vulture Culture posted:

In a multi-tenant K8s environment, how do you set up either Crossplane or the AWS Controller for Kubernetes so that the attack surface isn't completely insane? Both of them have busted designs that don't allow you to effectively use separate IRSA roles as a backstop to misuse. There's got to be a better answer than "just put all the logic in your policy engine".

Lol, Crossplane's model and the resulting IAM exposure is dumb as all hell, and piercing the workload orchestration plane to _also_ be a resource management plane is just the biggest stink. Completely fucks the workload isolation up and moves basically all the responsibility off of AWS IAM and onto k8s RBAC, where for sure nobody fucks that up.

Crossplane's solution to it is to have a ProviderConfig per namespace and to patch it in the Composition so a specific target IAM Role is used depending on namespace: https://docs.crossplane.io/latest/guides/multi-tenant/#namespaces-as-an-isolation-mechanism. You get one ProviderConfig per namespace because its name has to match the namespace, since you're patching in "spec.claimRef.namespace". Crossplane is really just hacking in a way to use a specific ProviderConfig; all the rest of the multi-tenancy controls are up to the operator to make sure there's no cross-tenant abuse, etc.

Really, the whole IAM Role / Policy management from within Crossplane is busted, and just about every example hand waves it or just says "gently caress it" and probably has the EC2 profile with "*:*".

e: and really the whole model smells because workload least privilege is wayyyyy different than resource management least privilege and Crossplane makes you expose stupidly powerful IAM Roles to what _should_ really only be workload poo poo. I hate it.
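
For anyone who hasn't hit that doc yet, the per-namespace ProviderConfig pattern boils down to roughly the sketch below. Treat it as illustrative: the provider apiVersion, credentials source, and names are assumptions, and the Composition fragment only shows the patch that pins the ProviderConfig to the claim's namespace.

code:
# One ProviderConfig per tenant namespace, named after the namespace (sketch)
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: team-a                        # must match the tenant namespace
spec:
  credentials:
    source: IRSA                      # or whatever per-tenant role/assume-role setup you use
---
# Fragment of a Composition's composed-resource entry: patch the claim's
# namespace into providerConfigRef.name so team-a claims can only use "team-a"
patches:
  - type: FromCompositeFieldPath
    fromFieldPath: spec.claimRef.namespace
    toFieldPath: spec.providerConfigRef.name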

JehovahsWetness fucked around with this message at 00:17 on May 26, 2024

Collateral Damage
Jun 13, 2009

Warbird posted:

I finally got my k8s MacOS vm cluster up and running the other day and promptly got sick from one of kiddo’s daycare surprises. I don’t know if you ever had a night of cluster based fever dreams but I don’t fukkin recommend it.
When I was first learning Kubernetes I had spent a couple of days entirely absorbed into vscode and kubernetes config. For unrelated reasons I was sleeping pretty poorly at the time and ended up with a nightmare where I was literally drowning in YAML. Not recommended.

whats for dinner
Sep 25, 2006

IT TURN OUT METAL FOR DINNER!

Collateral Damage posted:

When I was first learning Kubernetes I had spent a couple of days entirely absorbed into vscode and kubernetes config. For unrelated reasons I was sleeping pretty poorly at the time and ended up with a nightmare where I was literally drowning in YAML. Not recommended.

During a series of particularly awful incidents, I was having nightmares about needing to properly deploy a helm chart to get some sleep but none of the "have a functional life" CRDs were installing right. Dark times.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

JehovahsWetness posted:

Lol, Crossplane's model and the resulting IAM exposure is dumb as all hell, and piercing the workload orchestration plane to _also_ be a resource management plane is just the biggest stink. Completely fucks the workload isolation up and moves basically all the responsibility off of AWS IAM and onto k8s RBAC, where for sure nobody fucks that up.

Crossplane's solution to it is to have a ProviderConfig per namespace and to patch it in the Composition so a specific target IAM Role is used depending on namespace: https://docs.crossplane.io/latest/guides/multi-tenant/#namespaces-as-an-isolation-mechanism. You get one ProviderConfig per namespace because its name has to match the namespace, since you're patching in "spec.claimRef.namespace". Crossplane is really just hacking in a way to use a specific ProviderConfig; all the rest of the multi-tenancy controls are up to the operator to make sure there's no cross-tenant abuse, etc.

Really, the whole IAM Role / Policy management from within Crossplane is busted, and just about every example hand waves it or just says "gently caress it" and probably has the EC2 profile with "*:*".

e: and really the whole model smells because workload least privilege is wayyyyy different than resource management least privilege and Crossplane makes you expose stupidly powerful IAM Roles to what _should_ really only be workload poo poo. I hate it.
What would the right architecture for a K8s operator look like here?

Hadlock
Nov 9, 2004

I'm using AWS load balancer controller to generate an ALB to route traffic to a fairly vanilla Django backend. Right now it's just a single pod

Getting 502 bad gateway errors (usually just a single one, so 3-5 errors per day, varies with the number of prod deploys) when a request is sent while the new pod starts picking up the traffic

Can I enforce like, sticky sessions to fix this, or will going to 2 pods resolve the issue?

LochNessMonster
Feb 3, 2005

I need about three fitty


Hadlock posted:

I'm using AWS load balancer controller to generate an ALB to route traffic to a fairly vanilla Django backend. Right now it's just a single pod

Getting 502 bad gateway errors (usually just a single one, so 3-5 errors per day, varies with the number of prod deploys) when a request is sent while the new pod starts picking up the traffic

Can I enforce like, sticky sessions to fix this, or will going to 2 pods resolve the issue?

Sticky sessions won’t make a difference if there’s no pod.

Sounds like you’re using a recreate deployment strategy and you want a rolling deployment strategy.
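
A minimal sketch of the Deployment bits that usually matter here (names, image, and probe endpoint are hypothetical); the readiness probe is the other half, since RollingUpdate only shifts traffic once the new pod reports ready:

code:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: django-backend                # hypothetical
spec:
  replicas: 1                         # two replicas is safer, per the posts below
  selector:
    matchLabels:
      app: django-backend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0               # never drop below the current ready count
      maxSurge: 1                     # start the new pod before the old one goes away
  template:
    metadata:
      labels:
        app: django-backend
    spec:
      containers:
        - name: app
          image: example/django-backend:latest   # hypothetical
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /healthz          # hypothetical health endpoint
              port: 8000
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "10"]   # let the ALB drain in-flight requests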

JehovahsWetness
Dec 9, 2005

bang that shit retarded

Vulture Culture posted:

What would the right architecture for a K8s operator look like here?

I don't think there is a right architecture here, mainly because the multi-tenant segmentation concerns and the intent of crossplane (cloud poo poo from YAML! *jazz hands*) are kinda orthogonal, or at least the teeth don't mesh right. I honestly believe the whole model is broken because:
- mingling the workload orchestration (and isolation!) and resource management planes removes previously hard/external boundaries of identity and puts it entirely on the cluster RBAC (poke a bunch of holes through AWS IAM and backstop it with k8s RBAC + OPA. Cool.)
- a big rear end reconciliation loop to manage external resources that _aren't changing_ 99% of the time is a recipe for poo poo scaling
- "automatic" rectification of drift is an ops disaster on the horizon

I think there could be something like Argo's operator-declared Applications? Where the Crossplane config/namespace includes ProviderConfigs, defined by the cluster operator, each of which includes a set of selectors that define how/where the ProviderConfig may be used. It means the cluster operators have to define all the ProviderConfigs, but meh, if you're running a multi-tenant cluster then tenants poking IAM holes / adding external trusts with Crossplane-level privs should be an opt-in event. You still end up having the same issue where there's a common root identity used by the Crossplane controller that has to jump / assume the ProviderConfig roles, so there's an underlying potential big gently caress-up if you lose control of the operator identity, but you can soften that a bit with really restrictive trust policies.

(Which is what we kinda ended up doing: restrictive trust policies, permissions boundaries, and some custom cloudtrail alerting for things we deem fishy.)

e: (AWS only) You move ProviderConfig to namespace-scoped, allowing tenants to define their own. _But_ crossplane always passes an external-id of .. the containing namespace? ... during the AssumeRole call. Then you can have RoleA that's usable only from NamespaceA and NamespaceB can't make their own ProviderConfig that can use it, stopping cross-namespace/tenant abuse.
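
On the IAM side, that external-id backstop is just a condition on each tenant role's trust policy. A sketch in CloudFormation-style YAML; the account ID, role names, and the namespace value are hypothetical, and whether Crossplane really sends the namespace as the ExternalId is subject to the caveat above:

code:
# Trust policy sketch: the tenant role is only assumable by the controller's
# identity, and only when the AssumeRole call carries ExternalId "namespace-a"
Resources:
  TenantARole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: crossplane-tenant-a            # hypothetical
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action: sts:AssumeRole
            Principal:
              AWS: arn:aws:iam::111111111111:role/crossplane-controller   # hypothetical
            Condition:
              StringEquals:
                "sts:ExternalId": namespace-a  # the tenant namespace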

JehovahsWetness fucked around with this message at 03:19 on May 29, 2024

trem_two
Oct 22, 2002

it is better if you keep saying I'm fat, as I will continue to score goals
Fun Shoe

Hadlock posted:

I'm using AWS load balancer controller to generate an ALB to route traffic to a fairly vanilla Django backend. Right now it's just a single pod

Getting 502 bad gateway errors (usually just a single one, so 3-5 errors per day, varies with the number of prod deploys) when a request is sent while the new pod starts picking up the traffic

Can I enforce like, sticky sessions to fix this, or will going to 2 pods resolve the issue?

If you’re using IP mode rather than instance mode, make sure that you have pod readiness gates enabled https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.8/deploy/pod_readiness_gate/

That said, with a single pod you’re still at risk of 502s in scenarios where your pod is deleted and recreated outside of a deployment, like if the node where the pod is running is drained due to cluster autoscaler activity, or eviction due to priority. So it probably is worth running two pods and adding a pod disruption budget with a max unavailable value of 1 to ensure there is always at least one pod available in the target group(s) associated with the ALB.

Also, it’s possible that the 502s are happening because of the old pod that is shutting down, due to a mismatch between the HTTP keep-alive configs on the ALB and your app. This could be the case regardless of whether you’re using IP mode (ALB routes requests directly to your pods) or instance mode (ALB routes requests to your pods via kube-proxy instances running on the nodes in your cluster). So double-check that those settings line up: https://www.tessian.com/blog/how-to-fix-http-502-errors/
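
For reference, the two cluster-side pieces of that are small (names are hypothetical; the namespace label is what tells the AWS Load Balancer Controller to inject the readiness gate):

code:
# Opt the namespace in to readiness-gate injection
apiVersion: v1
kind: Namespace
metadata:
  name: django-backend                          # hypothetical
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
# Keep at least one pod registered during voluntary disruptions (drains, etc.)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: django-backend
  namespace: django-backend
spec:
  maxUnavailable: 1                             # with 2 replicas, one always stays up
  selector:
    matchLabels:
      app: django-backend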

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

JehovahsWetness posted:

I don't think there is a right architecture here, mainly because the multi-tenant segmentations concerns and the intent of crossplane (cloud poo poo from YAML! *jazz hands*) are kinda orthogonal or at least the teeth don't mesh right. I honestly believe the whole model is broken because:
- mingling the workload orchestration (and isolation!) and resource management plane removes previously hard/external boundaries of identity and puts it entirely on the cluster RBAC (poke a bunch of holes through AWS IAM and backstop it with k8s RBAC + OPA. Cool.)
- a big rear end reconciliation loop to manage external resource that _aren't changing_ 99% of the time is recipe for poo poo scaling
- "automatic" rectification of drift is a ops disaster on the horizon

I think there could be something like Argo's operator-declared Applications? Where the Crossplane config/namespace includes ProviderConfigs, defined by the cluster operator, each of which includes a set of selectors that define how/where the ProviderConfig may be used. It means the cluster operators have to define all the ProviderConfigs, but meh, if you're running a multi-tenant cluster then tenants poking IAM holes / adding external trusts with Crossplane-level privs should be a opt-in event. You still end up having the same issue where there's a common root identity used by the Crossplane controller that has to jump / assume the ProviderConfig roles so there's an underlying potential big gently caress-up if you lose control of the operator identity, but you can soften that a bit with really restrictive trust policies.

(Which is what we kinda ended up doing: restrictive trust policies, permissions boundaries, and some custom cloudtrail alerting for things we deem fishy.)

e: (AWS only) You move ProviderConfig to namespace-scoped, allowing tenants to define their own. _But_ crossplane always passes an external-id of .. the containing namespace? ... during the AssumeRole call. Then you can have RoleA that's usable only from NamespaceA and NamespaceB can't make their own ProviderConfig that can use it, stopping cross-namespace/tenant abuse.
Yeah, this is a little bit of a doozy to figure out even as someone who agrees that the current model is totally broken. The key problem is that the controllers are too numerous and too heavyweight to duplicate the whole configuration per namespace, so there needs to be some kind of compromise in the design somewhere. (Cluster-scoped resources are a fracture point already.)

If you're using Crossplane for n-ary cardinality of deployments, the buck's going to stop with your deployment system anyway, so it doesn't hugely matter whether the permissions rest within Kubernetes RBAC or a Terraform Cloud instance or something else. But there's a good chance that whatever's upstream (GitHub Actions, GitLab CI, etc.) is going to have a less tightly-coupled identity model. For example, both GitHub and GitLab function as OAuth2 IdPs and every pipeline gets a distinct identity with the organization and project name encoded as attributes. Terraform Cloud/Enterprise and Spacelift both take this a step further and their respective built-in IdPs can assign different credentials depending on the run phase. This feels "right" to me, as far as a highly-permissioned service can be; your management roles are coupled to hosted resources in the service, without being tightly coupled to the specific installation or implementation of the service itself.

It's still a little unclear how a pattern like this would play out in the context of a K8s operator. You'd need some kind of agent living in each namespace and pulling work into its namespace/SA scope and execution context; hopefully something leaner than the massive proliferation of controllers that's needed for ACK or some Crossplane configurations. But once you've built a remote execution model into the controller with on-demand controller pods for each provider, you might as well just run an instance of the whole thing per namespace, right?

Alternatively, within the context of an external pipeline, the IdP and RBAC in Kubernetes could be used, but to avoid the service having "run arbitrary code in any namespace" permissions, it would need to be able to pass tokens from the managing service (something like GitLab CI). But Crossplane reconciles resources, it doesn't handle requests, so there's no particularly effective or secure way to get those tokens into Crossplane with the current lifecycle. And with token lifetimes being limited how they are, you would lose the key feature of Crossplane (hooking the lifetime of your external dependencies to the lifetime of the Kubernetes namespace).

It feels like there's got to be something here that I'm missing.

JehovahsWetness posted:

- "automatic" rectification of drift is a ops disaster on the horizon
Calling out this part specifically, this is a great reason why global permissions are really terrible, because denying the ability to "rectify" by deleting production data just feels like table stakes for safety. There shouldn't be anything exceptional about prod workers for a service having different permissions than dev/staging or ephemeral environments, but the design of role chaining into both these operator families begs you to handle these cases in an exceptional way.

Vulture Culture fucked around with this message at 13:02 on May 29, 2024

whats for dinner
Sep 25, 2006

IT TURN OUT METAL FOR DINNER!

What's the current hotness for managing DB schemas? I've inherited a few hundred Postgres DBs that should, in theory, all have schemas matching the versions of the application (big ol' java monolith) but they most definitely don't. I'm not talking a few missing indexes either; there are columns missing null and unique constraints ('cause the unattended SQL scripts failed due to duplicate/null fields) that 100% cause issues when the right code paths are hit.

We've got work going on to try and reconcile the schemas, but I want to make sure a lot of the underlying process problems are solved too. And part of that is having versioned migration scripts where it's easy to see what's been run and failures are nice and loud. In a previous job we used Flyway, which seemed fine, but that was four or five years ago now.

LochNessMonster
Feb 3, 2005

I need about three fitty


I don’t have an answer for that besides managing config as code.

But the Data engineering thread might have better solutions: https://forums.somethingawful.com/showthread.php?threadid=4050611&perpage=40&pagenumber=2&noseen=1

Docjowles
Apr 9, 2009

Not really my area but I believe we use Liquibase
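
Since Liquibase came up: its versioned changelogs (which can be plain YAML) fit the "easy to see what ran, failures are loud" requirement, since each changeSet is recorded in the DATABASECHANGELOG table and a failure halts the run. A hedged sketch of the kind of constraint backfill described above; table and column names are made up:

code:
# db/changelog.yaml - Liquibase changelog sketch (hypothetical schema)
databaseChangeLog:
  - changeSet:
      id: 042-users-email-not-null
      author: dba-team
      changes:
        - addNotNullConstraint:
            tableName: users
            columnName: email
            columnDataType: varchar(255)
  - changeSet:
      id: 043-users-email-unique
      author: dba-team
      changes:
        - addUniqueConstraint:
            tableName: users
            columnNames: email
            constraintName: uq_users_email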

Hadlock
Nov 9, 2004

I was gonna suggest Flyway too, but the last time I touched it was 2019

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
Flyway is still good enough
