Continuous Integration/build engineering/devops thread

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Continuous Integration/build engineering/devops thread

«‹›158 »

New Yorp New Yorp: Jul 18, 2003; Only in Kenya.; Pillbug

zynga dot com posted:

DB rollbacks: how is everyone managing them in a cloud environment where there's no public access? I'm trying to come up with a more turnkey way of performing a DB rollback using Liquibase that isn't "spin up a VM instance in the same virtual network and do it manually". I'm asking somewhat generally, but also the specific implementation will be in Azure. We don't own the overall account so I can't do certain things like spin up a new resource group, although I don't think that will affect the solution at all.

Some initial spitballing:

Just open up the firewall locally for the few seconds it takes to run the rollback. Pros: easy; Cons: might be a hard no from security

Azure Functions. Pros: easy, Terraform friendly; Cons: cost (need Premium to work within a virtual network), getting the files in place (subscribing to a message queue should work)

VM templates that contains the script we need as user data. Pros: repeatable and can be done through the portal; Cons: still need to get our files in place (could maybe mitigate this by pushing the files we need to storage before spinning up the VM), heavyweight

We use App Services, so just use the same deploy process with a new name and override the Docker entrypoint. Pros: probably the most maintainable; Cons: also seems heavyweight

?

Ideally you just design your databases to always support the current and previous version of the application so rollbacks aren't necessary.

# ? Jul 29, 2021 01:11

Adbot: ADBOT LOVES YOU

# ? Jun 4, 2024 01:12

zynga dot com: Nov 11, 2001; wtf jill im not a bear!!!

A dossier and a state of melted brains: The Jess campaign has it all.

New Yorp New Yorp posted:

Ideally you just design your databases to always support the current and previous version of the application so rollbacks aren't necessary.

Agreed, but let's say for the sake of argument that this needs to be an option, regardless of whether or not you're following best practices. I could also generalize the question to "need to perform an operation against a DB" rather than specifically a rollback.

# ? Jul 29, 2021 03:37

New Yorp New Yorp: Jul 18, 2003; Only in Kenya.; Pillbug

zynga dot com posted:

Agreed, but let's say for the sake of argument that this needs to be an option, regardless of whether or not you're following best practices. I could also generalize the question to "need to perform an operation against a DB" rather than specifically a rollback.

Then you run a rollback via whatever continuous delivery process you use to do a rollforward. There doesn't need to be any fancy additional process involved. You obviously have a way to upgrade the database, so use the same process to downgrade it.

Ideally the necessary scripts are generated right alongside the upgrade scripts and it's just a matter of running the "oh poo poo" pipeline.

# ? Jul 29, 2021 04:44

Chunks Hammerdong: Nov 1, 2009

You can have rollback/"down" migrations to go along with your "up" migrations. This has the potential advantage of being quick, because it's already written. The disadvantage is that every database change comes with a bunch of extra work to reverse it even if that will never happen, as well as sort of being a separate process to the usual release.

Alternatively you can just push another release that has new "up" migrations that reverse the old ones. This means you aren't doing any extra unnecessary work most of the time and also requires no changes or adjustments to the existing release pipeline. The main disadvantage is that it's not as quick to roll out because someone has to actually write the migrations.

If you can't rollback anyway, because of data changes or something, then you just have to run a backup and push a new release that removes the migrations that made the bad change so they don't get run again.

Obviously if you don't do anything like migrations then... :rip:

I guess.

# ? Jul 29, 2021 12:27

zynga dot com: Nov 11, 2001; wtf jill im not a bear!!!

A dossier and a state of melted brains: The Jess campaign has it all.

I think the rollback portion of my question is getting too much of the focus. We have a down migration written (via Liquibase) so it's just a CLI operation, and it's a stakeholder requirement that I'm not able to change (i.e. someone's scared that a standard roll forward via the pipeline will take too long). I'm trying to figure out the best way to run this in a scenario where the DB isn't publicly accessible - we'd need repo access or at least some files from it, the ability to install arbitrary packages (Liquibase), etc.

# ? Jul 29, 2021 16:32

New Yorp New Yorp: Jul 18, 2003; Only in Kenya.; Pillbug

zynga dot com posted:

(i.e. someone's scared that a standard roll forward via the pipeline will take too long).

They need to clearly articulate what constitutes an acceptable length of time for the operation to take so you can do some testing to determine how long it actually takes to do it via the normal, sane process instead of building some gargantuan bespoke custom process to shave 30 seconds off of a 2 minute process when they're fine with it taking under 5 minutes.

# ? Jul 29, 2021 17:10

Hadlock: Nov 9, 2004

I've gone back and forth on this

One option is to tell your CTO the emergency revert button is to just tell them if it's bad enough, you'll just restore the db to the backup made before the update, and then put a team of developers on recovering any data from the bad db and inserting it into the restored db. Option B is hiring a dedicated DBA whose job is to write revert SQL for every migration "just in case".

But yeah as others have pointed out this is largely a data architect/engineering problem, there's no magic bullet to devops your way out of writing unreversible changes to your state store unless it was baked in from the beginning.

Every place I've worked at has had a "always roll forward, if it's really bad we'll consider a roll back" policy, but I've personally never seen it come to actually rolling back. Usually they'll just run with degraded performance for a week or two and everyone eats poo poo, it's cheaper than trying to engineer the perfect roll back database scheme

# ? Jul 29, 2021 18:05

necrobobsledder: Mar 21, 2005; Lay down your soul to the gods rock 'n roll; Nap Ghost

I'll never forget the time at an old job a dev decided to shove a DROP INDEX into their liquibase files for an application upgrade and all hell broke loose as every transaction started failing due to the lock in prod (because all applications were written in a way that they couldn't tolerate the DB being down basically ever just like it's 1999, yay). People just snoozed on that line in the PR somehow. When deployment failure risks are greater, you need to have greater protections in your processes before an action is taken. A staging environment, better observability tools in development environments, getting rid of relational DBs, sharding, etc. are all forms of mitigations for these kinds of technical risks.

It takes a lot of discipline in processes to do continuous delivery and deployment when DB operations like alteration of schemas can occur. Really, the rule I've found is that deletion or even renaming of an entity should be strictly forbidden in automated processes and that it should be performed manually or in a separate job or deployment task once it's clear that the entity isn't in use anymore. It's similar to decommissioning old services.

# ? Jul 29, 2021 19:29

zynga dot com: Nov 11, 2001; wtf jill im not a bear!!!

A dossier and a state of melted brains: The Jess campaign has it all.

Appreciate everyone's responses! I have been arguing these points in parallel to investigating this, I just left that part out for the sake of question clarity. I think I may have gotten an agreement on parameterizing the normal CI/CD workflow to accept an optional tag to roll forward/back to rather than having to build some sort of Mousetrap-esque framework. Based on how easy it was to shave days worth of work off, I'm getting the impression that the (new to me) team really didn't have the right people in the room to write these tickets.

# ? Jul 29, 2021 21:55

Hadlock: Nov 9, 2004

zynga dot com posted:

Appreciate everyone's responses! I have been arguing these points in parallel to investigating this, I just left that part out for the sake of question clarity. I think I may have gotten an agreement on parameterizing the normal CI/CD workflow to accept an optional tag to roll forward/back to rather than having to build some sort of Mousetrap-esque framework. Based on how easy it was to shave days worth of work off, I'm getting the impression that the (new to me) team really didn't have the right people in the room to write these tickets.

New thread title?

CI/CD Devopsy Thread: Having to build some sort of Mousetrap-esque framework

# ? Jul 29, 2021 22:04

Gyshall: Feb 24, 2009; Had a couple of drinks.
Saw a couple of things.

Is the answer liquibase/flyway?

# ? Jul 30, 2021 02:17

Methanar: Sep 26, 2013; by the sex ghost

Methanar fucked around with this message at 23:39 on Aug 4, 2021

# ? Aug 2, 2021 22:57

The Fool: Oct 16, 2003

lol, titles are meaningless but how were you not a senior already

gently caress even I have senior in front of my title

# ? Aug 5, 2021 15:50

Methanar: Sep 26, 2013; by the sex ghost

The Fool posted:

lol, titles are meaningless but how were you not a senior already

gently caress even I have senior in front of my title

I work for a newish big company that hasn't inflated all of its titles yet

# ? Aug 5, 2021 16:33

madmatt112: Jul 11, 2016; Is that a cat in your pants, or are you just a lonely excuse for an adult?

Methanar posted:

I work for a newish big company that hasn't inflated all of its titles yet

My strategy is to start with the inflated title and just bloat it from there.

Select Executive Senior Engineering Partner III reporting for duty.

# ? Aug 5, 2021 16:53

madmatt112: Jul 11, 2016; Is that a cat in your pants, or are you just a lonely excuse for an adult?

madmatt112 posted:

My strategy is to start with the inflated title and just bloat it from there.

Select Executive Senior Engineering Partner III reporting for duty.

Well the technical interviews went well enough that one of them told me �you nailed it� soooo :shrug:

# ? Aug 10, 2021 01:12

Methanar: Sep 26, 2013; by the sex ghost

# ? Aug 11, 2021 02:55

madmatt112: Jul 11, 2016; Is that a cat in your pants, or are you just a lonely excuse for an adult?

Methanar posted:

Ok I�ll bite. Why is that process using thousand of percentage points?

Wait, is this to do with your 80-node scale up that overloaded your kubeapis?

# ? Aug 11, 2021 04:31

Methanar: Sep 26, 2013; by the sex ghost

madmatt112 posted:

Ok I�ll bite. Why is that process using thousand of percentage points?

Something something large and sudden node scale ups causes to etcd run out of memory because I've been tuning it badly causing a leader election causing the api servers to evict their watcherCaches and simultaneously all try to repopulate the caches, but OOMing themselves in doing so. The APIservers are now in a doomed loop of trying to come back, but die under the backlogged pressure and failing.

While the APIservers are dead the controller-managers continue to attempt to keep state up to date except the APIservers are dead so they accumulate a backlog of work to do. The longer you take the more backlogged events and stale events that must be refreshed accumulate by both the kubelets and the controller-managers. Which makes the APIservers even more likely to keelover and die as they hammered with thousands and thousands of requests per second the moment they come back, right when they are trying to repopulate their watches.

Eventually the backlog becomes ridiculous and the apiservers need 128gb of ram and 32 cores to survive the initial thundering herd onsaught when they try to come back. Also the controller-manager and etcd themselves demand more than usual too.

That is the current working theory after seeing some strange logs about evicting watcher caches, etcd OOMing and looking at metrics for the number of objects in etcd, RPC rates and heapsizes

This is the first time the problem has happened in about 4-5 months (g r o w t h), and at least some of my mitigations I put in place last time have helped, but clearly not enough.

Methanar fucked around with this message at 05:07 on Aug 11, 2021

# ? Aug 11, 2021 05:04

bad boys for life: Jun 6, 2003; by sebmojo

Anyone have any good examples of writing manifests for IaC deployments? Were moving beyond just automating our server infrastructure to tie in network changes and wanted to settle on a package manifest at some point that defines the intent for deployment.

# ? Aug 11, 2021 15:06

Methanar: Sep 26, 2013; by the sex ghost

bad boys for life posted:

Anyone have any good examples of writing manifests for IaC deployments? Were moving beyond just automating our server infrastructure to tie in network changes and wanted to settle on a package manifest at some point that defines the intent for deployment.

I don't understand the question. by iac you mean terraform?

# ? Aug 12, 2021 02:06

Gyshall: Feb 24, 2009; Had a couple of drinks.
Saw a couple of things.

We just zip up a git hash/release and include a Json of the deploy with whatever necessary

# ? Aug 13, 2021 03:19

madmatt112: Jul 11, 2016; Is that a cat in your pants, or are you just a lonely excuse for an adult?

I did it. Thanks to Methanar�s extreme coaching.

# ? Aug 19, 2021 00:43

Nohearum: Nov 2, 2013

I'm not very familiar with this kind of stuff...but I maintain a Docker image that is stored in a github repository. The Dockerfile pulls from someone else's repository on github and builds the application from source inside the container. I'm setting up github actions to automatically build/push the image to dockerhub. Is there any way to automatically trigger a new docker build if the git repository I'm dependent on gets updated? I've looked at dependabot a bit but I can't seem to figure out how to make it check another github repository.

# ? Aug 19, 2021 05:23

fletcher: Jun 27, 2003; ken park is my favorite movie; Cybernetic Crumb

Nohearum posted:

I'm not very familiar with this kind of stuff...but I maintain a Docker image that is stored in a github repository. The Dockerfile pulls from someone else's repository on github and builds the application from source inside the container. I'm setting up github actions to automatically build/push the image to dockerhub. Is there any way to automatically trigger a new docker build if the git repository I'm dependent on gets updated? I've looked at dependabot a bit but I can't seem to figure out how to make it check another github repository.

Looks like this should do it: https://github.community/t/triggering-by-other-repository/16163

# ? Aug 19, 2021 07:06

Nohearum: Nov 2, 2013

fletcher posted:

Looks like this should do it: https://github.community/t/triggering-by-other-repository/16163

Thanks ended up having to do a slightly more complicated setup since I don't own the other repository but this pointed me in the right direction.

# ? Aug 20, 2021 05:45

The Fool: Oct 16, 2003

I made a post in the job fair thread: https://forums.somethingawful.com/showthread.php?threadid=3075135&pagenumber=113#post517259726

The team I'm on has two openings.

# ? Aug 26, 2021 20:31

Hadlock: Nov 9, 2004

Because oauth

Nginx reverse proxy,

I need to take a header, X and append the value, oauth-proxy=abcde=, to cookies header value

This ought to just be a append-cookie $http_x .. ";" in the nginx.conf right?

Hadlock fucked around with this message at 01:30 on Aug 27, 2021

# ? Aug 27, 2021 01:02

Methanar: Sep 26, 2013; by the sex ghost

Geez, I'm like 0/3 on interviewing people who claim to have extensive experience with prometheus who can't explain to me what the group_left or group_right statements are, or what they do at all.

# ? Aug 27, 2021 02:48

xzzy: Mar 5, 2009

Methanar posted:

Geez, I'm like 0/3 on interviewing people who claim to have extensive experience with prometheus who can't explain to me what the group_left or group_right statements are, or what they do at all.

I don't claim to be a prometheus expert but I do happen to be the one running our instance of it for my group so know the most of anyone where I'm at. I couldn't tell you what group_left or group_right is either but it sounds an awful lot like a join which means it's evil and should be avoided.

# ? Aug 27, 2021 03:06

Methanar: Sep 26, 2013; by the sex ghost

xzzy posted:

I don't claim to be a prometheus expert but I do happen to be the one running our instance of it for my group so know the most of anyone where I'm at. I couldn't tell you what group_left or group_right is either but it sounds an awful lot like a join which means it's evil and should be avoided.

Just saying the word 'join' would be enough.

If somebody is claiming to be an expert on it, or making a big deal of how much prometheus they manage for their company, I do expect them to be able to convince me they actually understand promql beyond typing node_cpu_seconds_total as a query and nodding their head at how the community-provided dashboards that come with the prometheus-operator chart work.

I really don't think its crazy to ask to have this type of query explained to me.

code:

kube_pod_init_container_status_running * on(pod) group_left(node) kube_pod_info

Methanar fucked around with this message at 03:35 on Aug 27, 2021

# ? Aug 27, 2021 03:22

Hadlock: Nov 9, 2004

I managed Prometheus for three companies, total combined valuation $alot and never had to use whatever weird query you're talking about. Built a poo poo ton of dashboards in grafana

Only outage we had was a known issue that got patched in the next point release, I think 2.22, what, four years ago

There's not much to manage unless you want to federate and retain terabytes of data and even most of that is a solved problem. It's pretty drama free software in my experience. Are you guys having enough problems that you're hiring a dedicated Prometheus developer or something

# ? Aug 27, 2021 03:36

Gangsta Lean: Dec 3, 2001; Calm, relaxed...what could be more fulfilling?

Hadlock posted:

Are you guys having enough problems that you're hiring a dedicated Prometheus developer or something

Interview tests for poo poo that doesn�t matter � news at 11.

# ? Aug 27, 2021 03:41

Methanar: Sep 26, 2013; by the sex ghost

Hadlock posted:

I managed Prometheus for three companies, total combined valuation $alot and never had to use whatever weird query you're talking about. Built a poo poo ton of dashboards in grafana

Only outage we had was a known issue that got patched in the next point release, I think 2.22, what, four years ago

There's not much to manage unless you want to federate and retain terabytes of data and even most of that is a solved problem. It's pretty drama free software in my experience. Are you guys having enough problems that you're hiring a dedicated Prometheus developer or something

The job is to help manage what is one of the largest prometheus/thanos footprints and stacks in the world (800TB of data at a 1 year retention) and to support hundreds of developers use it for all of their monitoring and alerting needs.

Methanar fucked around with this message at 03:47 on Aug 27, 2021

# ? Aug 27, 2021 03:42

xzzy: Mar 5, 2009

Hadlock posted:

There's not much to manage unless you want to federate and retain terabytes of data and even most of that is a solved problem.

I keep getting nagged to store metrics "forever." What's your favorite solution for that?

Longest I've been able to retain is one year on a single server. Worked great until an untested software update generated some really high cardinality values and brought everything to a screeching halt.

# ? Aug 27, 2021 03:43

Hadlock: Nov 9, 2004

xzzy posted:

I keep getting nagged to store metrics "forever." What's your favorite solution for that?

Longest I've been able to retain is one year on a single server. Worked great until an untested software update generated some really high cardinality values and brought everything to a screeching halt.

1) I think 90 days retention is really useful, you can look back N major feature releases/updates and see where the Bad Trend started, without blowing out any major complexity angles. 30 days is a really good way to make everyone say "I wish we had 32 days data so we could look back 4 weekly releases and see what was happening". If you're retaining metrics at a higher resolution than 1/min I'd look at dialing that back except for newly rolled out services that are still being tuned

2) Ask them what they want the data for, and if it's actually useful or just a warm fuzzy blanket that they're wishing for. If it's for the analytics team, you might be using the wrong tool for the job. Might want to use some sort of collector to feed data at certain periods into their favorite analytics tool. Get management to support you and say something like "if your team wants N frequency forever, it's going to cost this much, if you want 0.0N/N frequency forever it's going to cost you an additional headcount plus $X budget, sign here and we'll send the req to the VP for approval while you wait here in the meeting". Dollars per month seems to be the only way to explain the complexity of devops tasks to outside groups.

Barring that, I dunno, setup a "forever" prometheus server that pings everything once an hour, once you get to looking at graphs at the three month resolution, that's about what you're samping anyways. Bill it and the maintenance of it to the department that wants their metrics forever. If that doesn't work, see #1

Our data architect demanded that we keep a years worth of his postgres metrics at 1 minute resolution; between that and the core product metrics we ended up at about 1TB/year, I thought that was a reasonable ask.

That said, at my last job, our director would have just rolled over on his back and told us to go boil the ocean and solve for infinity prometheus and not push back or ask why.

Methanar posted:

The job is to help manage what is one of the largest prometheus/thanos footprints and stacks in the world (800TB of data at a 1 year retention) and to support hundreds of developers use it for all of their monitoring and alerting needs.

I guess at that point, assuming they've passed the "how do you design an 800TB prometheus installation? how do you feel about federation, what are the pros and cons of M3, cortex etc" you can get into behavioral stuff like "developer A has written a really heavy query that's running once a minute and trashing performance for everyone else, what do you do". The answer is probably something like "look in the logs, find out where the query is coming from and work with the developer a) to dial the gently caress back on that query frequency and b) investigate with a query analyzer why it's so heavy, and maybe work with them to ~~explain why they're a bad person and they should feel bad~~ optimize the query". It's fun to sprinkle in random domain specific trivia but I can't imagine disqualifying them on it. I can tell you what http codes 200, 301, 404, 418 and 500 do, but if you asked me to list off a 100-class response I'm just gonna stare blankly at you until asked politely to leave. I get that you're trying to filter for bullshit artists but I dunno if that is the way I'd personally go about it

# ? Aug 27, 2021 06:04

freeasinbeer: Mar 26, 2015; by Fluffdaddy

I�ve used group left and group right and I don�t remember off the top of my head; I think to do some fancy joins on two metrics with different labels? But I might be pulling that out of my rear end.

I run a cortex stack that does several billion metrics a day; don�t know about backend size tho.

# ? Aug 27, 2021 08:21

Methanar: Sep 26, 2013; by the sex ghost

freeasinbeer posted:

I�ve used group left and group right and I don�t remember off the top of my head; I think to do some fancy joins on two metrics with different labels? But I might be pulling that out of my rear end.

ding ding ding this guy Prometheuses

group_left and group_right are useful when you are relating two or more time series' together and need to specify on which labels you want the relation to take place. It's basically a join.

This is commonly used to enrich time series with additional data, commonly against series' that are provided by kube-state-metrics to relate some series to a kubernetes construct..
The canned dashboards and queries that come with the kube-prometheus-stack chart do this all the time.

https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching
https://github.com/kubernetes/kube-state-metrics/tree/master/docs#join-metrics

Methanar fucked around with this message at 08:45 on Aug 27, 2021

# ? Aug 27, 2021 08:37

xzzy: Mar 5, 2009

Hadlock posted:

2) Ask them what they want the data for, and if it's actually useful or just a warm fuzzy blanket that they're wishing for. If it's for the analytics team, you might be using the wrong tool for the job. Might want to use some sort of collector to feed data at certain periods into their favorite analytics tool. Get management to support you and say something like "if your team wants N frequency forever, it's going to cost this much, if you want 0.0N/N frequency forever it's going to cost you an additional headcount plus $X budget, sign here and we'll send the req to the VP for approval while you wait here in the meeting". Dollars per month seems to be the only way to explain the complexity of devops tasks to outside groups.

It's mostly from momentum. Our previous tool was ganglia and they really liked having 10 years of CPU stats, they claimed it helped figure out how users have used resources over time and it informed new hardware purchases (we're still an extremely on-prem oriented business). So it's very much write-once-read-maybe.

quote:

Barring that, I dunno, setup a "forever" prometheus server that pings everything once an hour, once you get to looking at graphs at the three month resolution, that's about what you're samping anyways. Bill it and the maintenance of it to the department that wants their metrics forever. If that doesn't work, see #1

That's actually clever and easy and I should try that. I keep looking for prometheus remote_write targets that would downsample information and there are certainly tools that do that.. but they're annoying and add too much poo poo to support. Definitely geared for environments much larger than us.

quote:

Our data architect demanded that we keep a years worth of his postgres metrics at 1 minute resolution; between that and the core product metrics we ended up at about 1TB/year, I thought that was a reasonable ask.

That sounds about like us, I had prometheus configured for 400 days of retention holding about 400 metrics at 1 minute resolution for each of 2800 servers. It happily lived on a 4TB partition, up until the aforementioned config issue that doubled the metrics without warning.

# ? Aug 27, 2021 15:00

Adbot: ADBOT LOVES YOU

# ? Jun 4, 2024 01:12

Hughmoris: Apr 21, 2007; Let's go to the abyss!

I'm currently an RN working with EHR applications but I really want to explore the world of devops as a career pivot. I just built out a website on Azure Static Web Apps to host my resume, using Azure Functions, attached to a Cosmos DB and using CI/CD.

I did all this while clicking around the Azure portal UI. Would a good next step be deploying all of that from the Azure CLI?

If anyone is working with Azure Devops, any entry career advice? I'm new to these tools but I've been scripting/automating work tasks for a long time.

# ? Aug 27, 2021 22:12

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Continuous Integration/build engineering/devops thread

«‹›158 »