|
minato posted:PE @ Meta == SRE everywhere else. Facebook decided to call it something different because they refuse to follow Google's lead or even use any Google products internally, despite copy/pasting most of their culture from Google. I vaguely recall their rationale being along the lines of "Reliability implies that the engineer just cares about uptime, but really the scope is about engineering products to work at a production scale". Yeah whatever, it's just SRE. Cheers, thanks sire goon; that clears a lot up for me
|
# ? Sep 27, 2022 04:19 |
|
|
# ? May 18, 2024 08:16 |
|
whats for dinner posted:We switched all of our CI agents to this (GitLab runner using the Kubernetes executor) and build times didn't change up or down, but we didn't really expect them to because they aren't compute bound. The real benefit for us was vastly increased spot availability and lower spot prices compared to the equivalent Intel instance types. Increased spot availability is interesting. Do you have any sort of numbers quantifying that? Just judging by the interruption rate advertised by AWS, it would seem that the AMD stuff actually gets slightly more interruptions for the larger instance types I'd be interested in. I also see that despite a price difference in on-demand, the spot price is actually the same. code:
https://aws.amazon.com/ec2/pricing/on-demand/
https://aws.amazon.com/ec2/spot/pricing/
This is all sort of nebulous enough that I'm probably going to just start by going to c6i across the board and then do a trial run of converting some subset to c6a later on to compare. Methanar fucked around with this message at 20:07 on Sep 27, 2022 |
# ? Sep 27, 2022 20:04 |
|
Methanar posted:Increased spot availability is interesting. Do you have any sort of numbers quantifying that? When we made the move, I did the same calculations as you in the region we're operating in and for the instance types we cared about (c5a.4xlarge and c5.4xlarge), and the AMD boxes came out ahead on both counts, especially interruption rate. On Intel boxes we were seeing bad interruption rates but also running into a lot of cases where spot availability was exhausted in the region. Now, about a year later, spot interruption and savings for both are pretty much at parity.
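If you want to sanity-check the same comparison yourself, the arithmetic is simple enough to script. A minimal sketch: the prices below are illustrative placeholders, not live AWS quotes, so substitute current numbers from the pricing pages linked above.

```python
# Sketch of the spot-vs-on-demand math discussed above. Prices are
# illustrative placeholders, NOT live AWS quotes -- substitute current
# numbers from the pricing pages linked in the post.

def spot_savings(on_demand: float, spot: float) -> float:
    """Percent saved by running spot instead of on-demand."""
    return round((on_demand - spot) / on_demand * 100, 1)

# Hypothetical hourly prices: on-demand differs between the Intel and AMD
# types, while the spot price is the same -- the situation described above.
prices = {
    "c6i.4xlarge": (0.68, 0.27),
    "c6a.4xlarge": (0.612, 0.27),
}
for itype, (on_demand, spot) in prices.items():
    print(f"{itype}: {spot_savings(on_demand, spot)}% off on-demand")
```

With identical spot prices, the cheaper on-demand type simply shows a smaller savings percentage, which is why the comparison felt nebulous.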
|
# ? Sep 27, 2022 23:44 |
fletcher posted:I assume you are backing up your git repos in some manner, since you mentioned you keep most of them locally. Definitely set up backups if you haven't already. My photoprism build kept failing on bitbucket pipelines so I ended up trying it on gitlab, and it fixed the issue. I think the bitbucket runner didn't have enough memory, even with the 2x setting enabled. I had been thinking about migrating from bitbucket to gitlab anyways, and I wanted to host my own runner so I don't have to pay for build minutes. This ended up being a good excuse to do both, because in order to use a custom image in the gitlab pipeline, you have to host your own runner. In this regard, I think bitbucket has an advantage, since I was able to use my custom image on shared runners on bitbucket. Using private images for gitlab runners and pushing to ecr registries was kind of a pain in the rear end overall compared to doing it on bitbucket. I can't help but think that is intentional, to get you to use gitlab's container registry service. To push images I'm building to ECR on bitbucket it was simply: code:
code:
code:
|
|
# ? Oct 3, 2022 19:02 |
|
It's 12:38, past midnight. You're a bit bummed out. You lie down in bed; it's not that comfortable and your tinnitus is flaring up. You close your eyes and wait for the next work day. Your phone chimes. "Maybe it's a bumble match" It's not. "I'm sure everything is fine." code:
Spinnaker was very not fine. I am very not fine. Methanar fucked around with this message at 11:05 on Oct 10, 2022 |
# ? Oct 10, 2022 09:59 |
|
Methanar posted:It's 12:38, past midnight. #HugOps
|
# ? Oct 10, 2022 17:43 |
|
Junkiebev posted:I can't decide if something is a crazy anti-pattern for terraform i came up with a remarkably cursed solution for this which is a shell script to parse json and dynamically build providers.tf and main.tf in the module which does the tagging and a thrice-damned dynamic map iteration which makes me nauseous but also poo poo works, ship it code:
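The pasted block didn't survive the scrape, but the shape of the trick can be sketched: parse a JSON inventory and emit a providers.tf with one aliased provider per account. This is a Python stand-in for illustration, not the original shell script; the JSON schema, aliases, and role ARNs are all invented.

```python
import json

# Hypothetical inventory -- the real script presumably read a JSON file.
inventory = json.loads("""
{
  "accounts": [
    {"alias": "dev",  "region": "us-east-1", "role_arn": "arn:aws:iam::111111111111:role/tagger"},
    {"alias": "prod", "region": "us-west-2", "role_arn": "arn:aws:iam::222222222222:role/tagger"}
  ]
}
""")

def render_providers(accounts):
    """Emit one aliased aws provider block per account for providers.tf."""
    blocks = []
    for acct in accounts:
        blocks.append(
            'provider "aws" {\n'
            f'  alias  = "{acct["alias"]}"\n'
            f'  region = "{acct["region"]}"\n'
            '  assume_role {\n'
            f'    role_arn = "{acct["role_arn"]}"\n'
            '  }\n'
            '}\n'
        )
    return "\n".join(blocks)

print(render_providers(inventory["accounts"]))
```

A main.tf calling the tagging module once per alias would be generated the same way, which is what makes the whole thing cursed but functional.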
Junkiebev fucked around with this message at 17:48 on Oct 10, 2022 |
# ? Oct 10, 2022 17:46 |
|
It's 11 am. I finally did a handoff to a European at 4 am, after making my last change of the night: significantly increasing the size of redis, which solved the problem enough for the moment. I wake up to 30-plus messages on my phone at about 9 am. I lie in bed for 2 hours trying to go back to sleep but I can't, and instead just lie here anyway and answer all of the slack stuff from my phone. I was already not doing great but I'm a solid 1 out of 10 right now after last night. I legitimately don't want to do anything. There's so much follow-up to be done now when I was already 3x overloaded. Doing what I can to delegate, but man, I don't even want to do that. Last night's incident barely even registers on the severity scale of the poo poo I've seen. But this is really kicking me when I'm down. Methanar fucked around with this message at 19:20 on Oct 10, 2022 |
# ? Oct 10, 2022 19:18 |
|
Why does your org rely so heavily on spinnaker such that it has to page someone in the US? Like, if you’ve got Euro colleagues and they’re using it why can’t they fix it rather than paging you in the middle of the night?
|
# ? Oct 10, 2022 20:15 |
|
Blinkz0rz posted:Why does your org rely so heavily on spinnaker such that it has to page someone in the US? Like, if you’ve got Euro colleagues and they’re using it why can’t they fix it rather than paging you in the middle of the night? I'm on call this week. Nobody really knows how to manage spinnaker; I'm just somehow, by accident, the most familiar with it after dealing with it for so long. I didn't want to throw the incident response completely over the wall to a relatively new European hire, who knows absolutely nothing about Spinnaker, when it's completely on fire and everyone and their dog is jumping into the pile to +1 that they're having issues. Which is why I only did the handoff many hours later, once things were stabilized and reasonably understood. Spinnaker is the worst thing we own by a huge margin and I want to get rid of it so bad. I'm aware of like 5 different problems with it, but I don't have time to deal with it myself. I've assigned a few work items out on fixing it, but it's just very slow to get anything done, and last night was an example of the fixes taking too long to be made.
|
# ? Oct 10, 2022 21:28 |
|
It’s also just a release bottleneck unless you’re horribly abusing it. Worst thing that should happen if spinnaker goes down is that devs lose the ability to release services which can either be dealt with by someone geographically closer to the devs in question or can wait until morning. If I were expected to be on call for spinnaker after hours I’d quit effective immediately.
|
# ? Oct 10, 2022 23:01 |
|
I'm just dumb and stress myself out constantly. 3 years ago I did push back with the idea that issues with a dev environment are not wake-me-up-in-the-middle-of-the-night worthy. When I did that, the SRE escalating just went down the list paging people until he got his response. I thought it was bs that I looked bad when the senior at the time got woken up to deal with something I pushed back on. But whatever. I wasn't about to tell the 15 people dogpiling and complaining, who are just trying to do their jobs, that I'm going to compromise their deadlines and ability to do integration testing or whatever is coming up with the end of the company quarter at the end of October. Real things that were mentioned to me. It reflects badly on everyone and is non-productive for me to leave all of the dev groups in a bad spot when the thing that's broken is explicitly owned by us and I am on call. That's not how I do things. Like I mentioned, the spinnaker support situation is trash and severely neglected. It would be extremely unfair for me to have dumped this on somebody else who knows nothing about spinnaker and wasn't on call. It's just a bad situation and I was holding the bag when things went sideways. I have a long list of things to be fixed. I just don't have enough time to throw at it. My own or otherwise. Idk
|
# ? Oct 11, 2022 01:42 |
Methanar posted:I'm just dumb and stress myself out constantly. Sounds like you need about four more people on your team
|
|
# ? Oct 11, 2022 02:19 |
|
or, and this is the hardest part of all, a mature understanding of how to scope and when to say no
|
# ? Oct 11, 2022 02:39 |
|
"Engineers getting pages in the middle of the night about mere dev infrastructure" is exactly the sort of thing that should be resulting in post-incident engineering post-mortems. It's absolutely something that shouldn't be happening, and there's lots of solutions. Many of them are very good. I'm personally fond of devops actually meaning devops, not this sre or "devops just means ops runs kubernetes now lol" garbage. Then teams with broken dev infrastructure would of course just fix their own stuff and paging someone else about it would be completely unthinkable.
|
# ? Oct 11, 2022 04:39 |
|
Methanar posted:3 years ago I did push back with the idea that issues with a dev environment are not wake me up in the middle of the night worthy. When I did that the SRE escalating just went down the list paging people until he got his response. I thought it was bs that I looked bad when the senior at the time got woken up to deal with something I pushed back on. But whatever. One of the things that leads to less work and more money is becoming much more comfortable with allowing bad things to happen to other people. All struggling and waking up in the middle of the night is doing is giving you more work, limiting your career, and bandaging over a problem that should be someone else's.
|
# ? Oct 11, 2022 05:31 |
|
sorry, disagree, if you own “development environment” infrastructure, it doesn’t matter that it’s being used for development, that’s production infrastructure. teams in other regions being able to do development is business critical. if you own the infrastructure, you of course need to be granted the resources and development time owning it requires, but that’s a separate issue.
|
# ? Oct 11, 2022 08:45 |
|
my homie dhall posted:if you own the infrastructure, you of course need to be granted the resources and development time owning it requires, but that’s a separate issue. This is what goes wrong 99 out of 100 times which means dev environments are going to be treated as low(er) prio. Usually companies that do this well also have teams in multiple timezones to support such infra and don’t need to page someone for a broken dev environment.
|
# ? Oct 11, 2022 10:28 |
|
I think we tend to treat outages in the deploy infra as critical just because we may not be able to respond to an actual customer-facing outage if we can't deploy fixes or maybe even roll back. idk to what extent it's normal to have CI and production deploys all coupled together.
|
# ? Oct 11, 2022 17:24 |
|
Yeah, "can't deploy to prod" is two things: 1) when they actually have the fix, there's now no way to build and deploy it; 2) possibly more important for your career, everyone from your boss all the way up to the CEO, then back down again, in a series of meetings that are basically "you had ONE job" while the company is losing money waiting to deploy. Our analytics dept had a dashboard and could project how much money we lost in revenue for each outage. Our department is well funded and has very few outages now
|
# ? Oct 11, 2022 17:37 |
|
I'm at the life stage where I'm thinking about jira workflows in the shower.
|
# ? Oct 13, 2022 18:57 |
|
Methanar posted:I'm at the life stage where I'm thinking about jira workflows in the shower. If you’re considering a career switch to Product Owner or Scrum Master this is probably the wrong thread.
|
# ? Oct 13, 2022 19:36 |
|
I've been there It's time to switch jobs to a less hosed up work environment for more money
|
# ? Oct 13, 2022 19:43 |
|
I'm tasked with building out EKS with terraform, and I'm new to terraform but have used k8s quite a bit, so this is a fun disaster in the making. However, I'm not sure how to reconcile certain problems in terraform when connecting all the dots. For example, I have an EKS cluster, then various apps on it for development purposes. We want to have the same app cloned in separate namespaces on this cluster representing git branches/PRs etc. Simple via helm chart. However, to make them useful, they need to be exposed at an ALB. So it's not clear to me how to design the various components of the ALB part so that it takes in a list of potential endpoints and then generates all the updated target groups and listeners based off those domains. So we can call a helm chart install via terraform and then it goes back up and updates the existing ALB with all the proper endpoint configurations. Or maybe I just don't understand how to use terraform to iterate over lists the right way.
|
# ? Oct 24, 2022 19:17 |
|
When you spin up the k8s cluster, you'll also bootstrap nginx or traefik via helm charts, and then each custom helm chart will have an ingress controller that does work on nginx or traefik. You'll also probably want to install cert-manager to handle ssl certs using let's encrypt. Your ingress controllers will inform how the load balancer per cluster is configured automatically. I think EKS only works with ELB not ALB? Maybe that has changed. So yeah, your terraform will look like this:
Spin up cluster
Install nginx
Install cert-manager
Install flux/argocd
And then create your branch and references to the helm chart and let your CI/CD install the helm chart to the correct namespaces. Someday nginx will support let's encrypt out of the box. Edit: installing helm charts via terraform is primarily for bootstrapping the cluster; nginx, cert-manager etc. You should not be installing/deleting helm/hadlockco/myapp via terraform, that should be handled by your gitops branches and CI/CD
|
# ? Oct 24, 2022 19:44 |
|
Use the aws load balancer ingress controller instead of caddy/traefik/nginx. It handles mapping ingresses to ALBs.
|
# ? Oct 24, 2022 19:46 |
|
Hadlock posted:When you spin up the k8s cluster, you'll also bootstrap nginx or traefik via helm charts, and then each custom helm chart will have an ingress controller that does work on nginx or traefik. You'll also probably want to install cert manager to handle ssl certs using let's encrypt. Your ingress controllers will inform how the load balancer per cluster is configured automatically I don’t know if I would want to manage the helm charts for nginx/cert-manager via terraform at all. I think I’d rather have a data structure accessible in a K/V store with all your clusters, and then have a repo with all your managed applications that can deploy to all those clusters. Use something like helmfile for declarative specifications of your chart releases and use your CI/CD of choice to deploy the Helm releases. That way you can deploy config changes to all your clusters at once rather than go cluster by cluster, *and* you don’t need to deal with the misery that’s managing k8s resources via terraform. If I can recommend nothing else, you want to manage as few kubernetes resources with terraform as possible. Helm is much better suited for that work. In EKS you’ll largely be provisioning things like load balancers using annotations on service or ingress objects, so you wouldn’t manage or define them in terraform at all. They’re aws managed and governed by your k8s templates, which are governed by helm. The Iron Rose fucked around with this message at 19:55 on Oct 24, 2022 |
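To make the "one data structure, many clusters" idea concrete, here is a minimal sketch that renders one helm invocation per (cluster, chart) pair for CI/CD to fan out. The cluster names, kube contexts, and chart versions are all invented for illustration; the chart references are the upstream community chart names.

```python
# Hypothetical cluster inventory -- in practice this would live in the
# K/V store described above.
clusters = [
    {"name": "dev-us",  "context": "eks-dev-us"},
    {"name": "prod-eu", "context": "eks-prod-eu"},
]

# Managed charts you want on every cluster, pinned to one version.
# Chart names are the upstream ones; versions are illustrative.
managed_apps = [
    {"release": "ingress-nginx", "chart": "ingress-nginx/ingress-nginx", "version": "4.8.0"},
    {"release": "cert-manager",  "chart": "jetstack/cert-manager",       "version": "v1.13.0"},
]

def render_commands(clusters, apps):
    """One 'helm upgrade --install' per (cluster, app) pair for CI to run."""
    cmds = []
    for cluster in clusters:
        for app in apps:
            cmds.append(
                f"helm upgrade --install {app['release']} {app['chart']} "
                f"--version {app['version']} --kube-context {cluster['context']}"
            )
    return cmds

for cmd in render_commands(clusters, managed_apps):
    print(cmd)
```

Bumping a chart version in one place then rolls out everywhere on the next CI run, which is the consistency argument being made above; release rings just mean filtering the cluster list per stage.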
# ? Oct 24, 2022 19:52 |
|
Thanks for the suggestions, I agree that managing k8s with terraform is weird as gently caress and I don't like it. So I have been arriving at the conclusion that terraform should just do the top level resources, and then helm etc. for everything else. I did try the aws alb controller, and it worked, but I never got external-dns to work properly with it. Maybe there is a simpler way to update a Route53 record in response to an ingress?
|
# ? Oct 24, 2022 20:06 |
|
Sylink posted:Thanks for the suggestions, I agree that managing k8s with terraform is weird as gently caress and I don't like it. So I have been arriving at the conclusion that terraform should just do the top level resources, and then helm etc for everything else. You can survive easily enough by managing DNS separately altogether. Have a terraform repo with your R53 records, expose it to the entire engineering org, and set A records to resolve to your ingress controller’s service IP using an internal/external load balancer IP as needed. The ingress controller will then route to the backing service based on the host header, which is defined in each ingress object. It’s a separate action altogether. External DNS is a perfectly valid way to do it, but I just haven’t used it personally (we use our own self-managed DNS for very silly reasons).
|
# ? Oct 24, 2022 20:10 |
|
I have never run into a situation where I unilaterally changed all my nginx instances at once. We always used dev clusters to test changes before promoting to prod. The few times we didn't, it was the CTO and his lieutenant cowboying changes into prod, and they almost always broke something. I'm sure you could configure nginx to read secrets/config from KMS or something? If that was actually a concern? Not super worried about secrets in dev clusters. And secrets on prod clusters? If you have access to the prod cluster, you probably have the ability to read/modify those secrets via (choose your favorite roundabout method). Limiting access to production via the normal methods has always passed security audit. Not everyone needs the beyondcorp security model
|
# ? Oct 24, 2022 20:17 |
|
Hadlock posted:I have never run into a situation where I unilaterally changed all my nginx instances at once. We always used dev clusters to test changes before promoting to prod. The few times we didn't it was the CTO and his lieutenant cowboying changes into prod and almost always broke something Do you not want to update the version of your helm deployments of nginx or cert manager? APIs get deprecated, config changes need to be made, replicas need to be added, annotations adjusted… You wouldn’t change all of them at once, obviously; test in dev first, or have release rings, whatever. But I’d much rather set yourself up for scaling now than have to go and update one terraform state per cluster every time you want to make a change, which also allows you to have a consistent config across your environment. This is specifically for *your* managed services that you want to put on all clusters, though. I would include the ingress controller, security tooling, etc. here, but ingress objects should obviously be managed on an environment-by-environment if not cluster-by-cluster basis. Same with cert manager, since you probably want to limit access for providing domain certs to specific clusters.
|
# ? Oct 24, 2022 20:40 |
|
The Iron Rose posted:If I can recommend nothing else, you want to manage as few kubernetes resources with terraform as possible This is the way. Terraform is not great at all at managing Kubernetes - we only let it install ArgoCD and the ArgoCD bootstrap application (which installs the ALB controller, ExternalDNS, Datadog etc.) and even that's too much for it. We're looking to move to managing nothing in k8s via terraform.
|
# ? Oct 24, 2022 22:28 |
|
I know I’m late to this but I had been away from k8s for a couple years and I am blown away by how awesome ArgoCD is. I just push changes to git and within a minute or two it’s reflected in the cluster. I’m sure as I use it more it will have warts just like any software. But my initial experience is overwhelmingly positive.
|
# ? Oct 24, 2022 22:46 |
|
I'm deep into ArgoCD right now and it's real nice... except when you find warts and it isn't. Thankfully there's a lot of traction around it compared to Spinnaker which I'm coming from.
|
# ? Oct 24, 2022 23:14 |
|
Sylink posted:Thanks for the suggestions, I agree that managing k8s with terraform is weird as gently caress and I don't like it. So I have been arriving at the conclusion that terraform should just do the top level resources, and then helm etc for everything else.
|
# ? Oct 31, 2022 06:49 |
|
Is there a good way to start Azure DevOps pipelines in batches? I'm trying to find a way to trigger over 1k downstream pipelines after my initial pipeline runs successfully. I'm not sure if our Azure DevOps infra will like it if I start them all at once, as we share the build agents company wide. On busy days we're already running into some limitations where we see 30+ min queues. Bad scaling/sizing on their part, I know, but I don't want to make the problem worse. The plan is to start this process outside of business hours to minimize impact on the rest of the organization, but you just know there's going to be one day that somebody can't deploy a hotfix for a prio 1 incident because there's a 4 hour queue for the build agents. The main pipeline will create a feature branch and update a config file with versions for each downstream repo, which will build on commit. The downstream repos are managed in a config file in the main repo, so it's iterable. The only thing I came up with so far is externalizing the updates to the downstream repos so they can be done in batches. Was hoping I'm missing something and there's an easier way.
|
# ? Oct 31, 2022 11:23 |
|
use the ado api, either right in your initial pipeline or from a function app, to trigger and manage the downstream pipelines or redo your design to be less silly
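A sketch of what the API route might look like: the Pipelines REST API has a run endpoint (POST to .../_apis/pipelines/{id}/runs), and throttling the fan-out is just chunking the downstream IDs. The org/project names and batch size are assumptions, and the auth header (a PAT) plus the sleep between batches are omitted.

```python
import json
import urllib.request

ORG, PROJECT = "my-org", "my-project"  # assumptions -- substitute your own
BATCH_SIZE = 50                        # tune to what your agent pool tolerates

def batches(items, size):
    """Chunk the downstream pipeline ids so the agents aren't flooded."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_url(pipeline_id):
    """Azure DevOps Pipelines 'runs' endpoint for one pipeline."""
    return (f"https://dev.azure.com/{ORG}/{PROJECT}"
            f"/_apis/pipelines/{pipeline_id}/runs?api-version=7.0")

def build_request(pipeline_id):
    """POST request to queue one run (auth header omitted -- needs a PAT)."""
    return urllib.request.Request(
        run_url(pipeline_id),
        data=json.dumps({}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

downstream = list(range(1, 1001))  # stand-in for the ~1k downstream pipelines
print(f"{len(batches(downstream, BATCH_SIZE))} batches of <= {BATCH_SIZE}")
```

Driving this from the initial pipeline (or a function app, as suggested) also lets you poll run status per batch before releasing the next one, instead of relying on commit triggers you can't pace.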
|
# ? Oct 31, 2022 14:43 |
|
The Fool posted:or redo your design to be less silly Been thinking about a better approach but drawing blanks. All these downstream pipelines are building container images based on customer specific configs. I guess I could put the config in an S3 bucket and download it on container start, but that would mean it’s not immutable anymore. Messing with the config for active services means I could have multiple configs live for the same version. It’s an inherited service that’s suboptimal in many other ways and won’t see the end of 2023 (hopefully), so rewriting the entire thing isn’t worth it I think.
|
# ? Oct 31, 2022 18:11 |
LochNessMonster posted:Been thinking about a better approach but drawing blanks. It's a common pattern to pull down some config at runtime. You wouldn't want secrets baked into the images, for example.
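One way to square runtime config with the immutability worry above is to bake only a config *version* into the image and have the entrypoint fetch that exact versioned key, so a given image always resolves the same object. The bucket layout, env var names, and tenant name below are invented for illustration, and the S3 fetch is simulated with a local file so the sketch runs standalone.

```python
import json
import os
import pathlib
import tempfile

# CONFIG_VERSION would be baked into the image at build time (e.g. via a
# docker build-arg), so a given image always resolves the same config key.
CONFIG_VERSION = os.environ.get("CONFIG_VERSION", "2022-10-31-r1")
CUSTOMER = os.environ.get("CUSTOMER", "acme")  # hypothetical tenant name

def config_key(customer: str, version: str) -> str:
    """Versioned object key: written once, never overwritten."""
    return f"configs/{customer}/{version}.json"

# On container start you'd fetch s3://<bucket>/<key> (aws cli, boto3, ...).
# Simulated here with a local file so the sketch is runnable:
key = config_key(CUSTOMER, CONFIG_VERSION)
path = pathlib.Path(tempfile.gettempdir()) / key.replace("/", "_")
path.write_text(json.dumps({"feature_flags": {"beta": False}}))
config = json.loads(path.read_text())
print(key, "->", config)
```

Since keys are versioned rather than overwritten, "messing with the config for active services" becomes publishing a new version and cutting a new image reference, not mutating what running containers read.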
|
|
# ? Oct 31, 2022 18:24 |
|
|
|
LochNessMonster posted:All these downstream pipelines are building container images based on customer specific configs.
|
# ? Oct 31, 2022 18:47 |