Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
I'd contend that while the foo-api service shouldn't explicitly be aware of the read replica, the library it uses to handle connections and pooling or the abstraction they've hopefully written over that library should handle the distinction.

If the service is aware of the nature of the DB op (write/mutate vs read) then it should implement the logic to route requests accordingly.

Also doing round-robin DB requests is going to result in pain beyond belief. Better to have SELECT statements go to the read replica automatically unless they need to be wrapped within a transaction.
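
Roughly the shape of what I mean, as a quick sketch (python + psycopg2 purely for illustration; the class and DSNs are made up):

code:
import psycopg2

class RoutingDB:
    """Sketch only: route plain SELECTs to the replica, everything else
    (and anything inside an explicit transaction) to the primary."""

    def __init__(self, primary_dsn, replica_dsn):
        self.primary = psycopg2.connect(primary_dsn)
        self.replica = psycopg2.connect(replica_dsn)
        self.replica.autocommit = True   # reads only, no tx state to manage
        self.in_txn = False

    def begin(self):
        # explicit transaction: pin everything to the primary until commit
        self.in_txn = True

    def commit(self):
        self.primary.commit()
        self.in_txn = False

    def execute(self, sql, params=None):
        is_read = sql.lstrip().lower().startswith("select")
        conn = self.replica if (is_read and not self.in_txn) else self.primary
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur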

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
Also, I'm not sure nginx can handle postgres's wire protocol and work out, from what's being sent over the wire, what type of operation the client wants to do.

If you really must do this outside of the app I bet pgbouncer has some stuff that supports redirecting requests based on parameters but I've never used it so I can't confirm.

Methanar
Sep 26, 2013

by the sex ghost
I'm having a real hard time interpreting this as an ops problem at all. You really don't want to be making their DB problems your infra problems.
If they've outgrown single DB assumptions, then your developers need to fix that.

How is this going to continue to grow and scale? Do you really want to be supporting this forever, on the hook for the next time they need more performance?
I would flatly tell them not my problem.

Methanar fucked around with this message at 22:49 on Jul 22, 2022

12 rats tied together
Sep 7, 2006

The Iron Rose posted:

The developers of foo-api came to me asking if I could help them load balance between the main DB and the read replica, only for GET requests, with session affinity so as to account for the varying performance characteristics and replication lag (i.e. we don't need session affinity if we balance between multiple read replicas). The developers are reluctant to implement this in the service themselves, since the service assumes only one database and they don't want to rewrite their DB connection logic. I can push back on all of that if I have to, but I don't know if services should be aware of the database infrastructure configuration in the first place.

Sadly, your best option in this scenario is to implement this in the service yourself by rewriting the DB connection logic on your own. It's absolutely normal for a service to have a complex understanding of its data storage needs, including the existence of read and write replicas if they exist.

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender

Blinkz0rz posted:

Better to have SELECT statements go to the read replica automatically unless they need to be wrapped within a transaction.
This kind of situation is why it's impossible to solve this purely on the load balancer/DB side. An app often needs to do some reads before performing a write, and it may or may not have transaction isolation requirements... it's up to the app. The backend DB can't figure this out just by inspecting statements; if it could, at that point it would become its own app.

When I worked on a web-app back in the day, we ran into a similar issue. The scarce resource was R/W DB connections, but RO DB conns were plentiful because they could be spread over many replicas (and we assumed that slight delays between master -> replica were not an issue).

The HTTP method suggested whether RO or RW was necessary (e.g. GET => RO, POST/PUT => RW), but this wasn't always the case; some GET pages also had internal side effects like updating page-read counters. So we couldn't rely on the HTTP method alone to know which DB connection type we'd need.

We solved this by having a DB abstraction layer in the app itself with a single API call, "getDBConnection(need_read_write: bool)". Whenever the app code needed to talk to the DB, it called this Singleton to get a connection, and it might be called repeatedly throughout the lifespan of the page being generated. Internally, the method was smart enough to cache a previous connection, and to upgrade an existing connection from RO to RW if the app suddenly changed its mind about what type it needed.

This design meant the app didn't have to decide at the beginning of processing whether it needed RO or RW; that might be something it decided mid-way through processing. This can be important if page processing involves libraries / plugin architectures where the main controller is not always 100% aware of what its sub-libraries might be doing with the DB. For example, a page renderer might call an internal library "update_page_seen(request.user, page_url)" that will store a record that the user has seen the page, but do nothing if the user has already seen it; so the main page renderer doesn't know whether a RW connection will be needed when it calls that.
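
Roughly what that abstraction looked like, sketched in python (psycopg2 and the DSNs here are just stand-ins for whatever your stack uses):

code:
import psycopg2

RO_DSN = "host=replica dbname=app"   # placeholder
RW_DSN = "host=primary dbname=app"   # placeholder

class DBConnectionManager:
    """Sketch: hands out a cached connection and silently upgrades it
    from RO to RW the first time anyone asks for read-write."""

    _instance = None

    @classmethod
    def instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        self._conn = None
        self._is_rw = False

    def get_db_connection(self, need_read_write: bool):
        if self._conn is None:
            # first caller decides the initial connection type
            self._conn = psycopg2.connect(RW_DSN if need_read_write else RO_DSN)
            self._is_rw = need_read_write
        elif need_read_write and not self._is_rw:
            # the app changed its mind mid-request: upgrade RO -> RW
            self._conn.close()
            self._conn = psycopg2.connect(RW_DSN)
            self._is_rw = True
        return self._conn

# e.g. update_page_seen() can just ask for RW without knowing what the
# page controller already requested:
# conn = DBConnectionManager.instance().get_db_connection(need_read_write=True)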

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Thanks folks. pgbouncer seems like a possible solution to do this outside the app, and making the service aware of its database needs is preferable by far. Appreciate the advice!

The Iron Rose fucked around with this message at 23:45 on Jul 22, 2022

12 rats tied together
Sep 7, 2006

The lua plugin idea is cool, by the way, and I think you could make it work well enough with some post PoC tweaks. It's a testament to your skills as an engineer that you could turn what looks like a developer code quality problem into an nginx plugin.

I know everyone was kind of immediately in "don't do this mode", which is absolutely the correct response, I just wanted to also highlight the "huh, no poo poo" reaction I had to seeing the lua idea.

jaegerx
Sep 10, 2012

Maybe this post will get me on your ignore list!


I’m pretty sure you can use istio for this exact use case.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

The Iron Rose posted:

Thanks folks. pgbouncer seems like a possible solution to do this outside the app, and making the service aware of its database needs is preferable by far. Appreciate the advice!

The app has to be the fundamental arbiter of how it uses its data store, especially since there's a major difference in how primary and replicas behave. You can't just round-robin traffic because a whole class of operations can't be done on the replica. If you try to solve for that now you're putting business logic in your network layer, or introducing another dependency as a half-baked substitute for engineers doing the legwork they need to do.

Like Methanar said, this isn’t an ops problem and any attempt to make it one is going to be a bandaid that becomes more and more painful to maintain.

Methanar
Sep 26, 2013

by the sex ghost
I am 100% sure this is overkill for your developers and they can live with 2005-tier postgres read replicas, but if you want to sound smart and eventual consistency is fine: introduce them to cloud-native CQRS design patterns

https://docs.aws.amazon.com/prescriptive-guidance/latest/modernization-data-persistence/cqrs-pattern.html
https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs
https://github.com/andrewwebber/cqrs
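
The gist of the pattern, as a toy in-memory sketch (no framework, all names made up):

code:
from dataclasses import dataclass

@dataclass
class CreateOrder:            # command
    order_id: str
    total: float

@dataclass
class OrderCreated:           # event
    order_id: str
    total: float

write_store = {}              # "primary": authoritative state
read_model = {}               # projection: denormalized, queries only

def handle_command(cmd: CreateOrder) -> OrderCreated:
    write_store[cmd.order_id] = {"total": cmd.total}   # write side
    return OrderCreated(cmd.order_id, cmd.total)

def project(event: OrderCreated) -> None:
    # in real life this runs async, so the read side is eventually consistent
    read_model[event.order_id] = {"total": event.total, "status": "new"}

def query_order(order_id: str):
    return read_model.get(order_id)                    # read side only

project(handle_command(CreateOrder("o-1", 99.5)))
print(query_order("o-1"))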

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

jaegerx posted:

I’m pretty sure you can use istio for this exact use case.

I could, but the fact that customers are banging on the drums here is a limiting factor. There are lots of approaches here with service meshes though. There was a whole 'nother solution I briefly entertained of using canary traffic splitting to do the same thing before I realized I was trying to fit a square peg into a round hole.

Which frankly is what’s happening in general, and there’s no way I’d want to actually implement the lua plugin.

Anyways this thread remains the best grey tech thread, I continue to learn a phenomenal amount from reading what folks post here. Never even heard of CQRS before and it’s fascinating. And 2000s era tech read replicas may be, but it’s quite new and exciting round these here parts.

E: With regards to splitting traffic between read and write services, it’s really a terrible idea and the difficulty of implementation reflects that. While a fun afternoon’s research and testing, this is not the model we’re going to use. With the requirement to split traffic across multiple services with session affinity gone, the problem set becomes much simpler. More importantly, the service becomes easier to comprehend and maintain, we reduce reliability and quality problems, and we preserve flexibility in how we design our application going forwards.

The Iron Rose fucked around with this message at 21:41 on Jul 23, 2022

Mr Shiny Pants
Nov 12, 2012

The Iron Rose posted:

I could, but the fact that customers are banging on the drums here is a limiting factor. There are lots of approaches here with service meshes though. There was a whole 'nother solution I briefly entertained of using canary traffic splitting to do the same thing before I realized I was trying to fit a square peg into a round hole.

Which frankly is what’s happening in general, and there’s no way I’d want to actually implement the lua plugin.

Anyways this thread remains the best grey tech thread, I continue to learn a phenomenal amount from reading what folks post here. Never even heard of CQRS before and it’s fascinating. And 2000s era tech read replicas may be, but it’s quite new and exciting round these here parts.

E: With regards to splitting traffic between read and write services, it’s really a terrible idea and the difficulty of implementation reflects that. While a fun afternoon’s research and testing, this is not the model we’re going to use. With the requirement to split traffic across multiple services with session affinity gone, the problem set becomes much simpler. More importantly, the service becomes easier to comprehend and maintain, we reduce reliability and quality problems, and we preserve flexibility in how we design our application going forwards.

CQRS is amazing; good luck finding good examples and people who understand it well enough, though. Combined with event sourcing it is really powerful, but it will take quite a rewrite.

Zapf Dingbat
Jan 9, 2001


Can someone point me in the direction of a good structure for IAM permissions? I'm working for a startup in my first cloud toucher job, and I don't have super admin privileges over the AWS infrastructure. That's fine, but it's too restrictive for what they want me to do. A lot of my tasks get bottlenecked because I hit a point where, for example, I have to delete something and I have absolutely no delete privileges.

The owner of the company is cautious but admits he knows no good way of handling this.

Hadlock
Nov 9, 2004

Which cloud provider

Basically the cloud toucher either has full admin rights, or the ability to assume a role (named "admin" is common) that they can use, with an audit log of who assumed what role and when
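
The assume-role flow is about this much code (boto3 sketch; the role ARN and session name are made up, and CloudTrail records the AssumeRole call so you get the audit log for free):

code:
import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/admin",   # hypothetical role
    RoleSessionName="zapf-dingbat",                   # shows up in CloudTrail
    DurationSeconds=3600,
)["Credentials"]

admin = boto3.session.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
# do the privileged thing with admin.client(...), then let the creds expire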

Tell your boss that someone on the internet thinks he's stupid for hiring a cloud admin, then refusing to give that person admin access

Depending on X, sometimes they will revoke your ability to see billing or close the account, but that should be the only thing you're limited from. There's a reason why cloud touchers make bank; great power, great responsibility, etc etc

Zapf Dingbat
Jan 9, 2001


It's AWS. Yeah, I wouldn't expect to see billing or do account-related things.

I think there's a disconnect here. He seems to want to prevent accidental deletion from ever happening in the first place, rather than blame someone after the fact. But that's what backups are for, right?

Happiness Commando
Feb 1, 2002
$$ joy at gunpoint $$

It's the classic tension between needing expansive permissions to get things done and what can be done with those expansive permissions.

But yes. You should be using some IaC tool and have snapshots and backups and resilient architectures.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
The more common pattern these days is a sort of cloud-centric sudo: you only use the permissions you need, on demand, and when you do it's via non-repudiable forms of access such as MFA, with a system of access control layers such that even the admin can't cover up their audit trail. This is essentially what I had to prove for myself as part of SOC 2 Type 1, where we're making sure we have CloudTrail set up and anyone or anything that performs a write action that would delete or stop recording audit information is going to trigger a page to a lot of people.
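
One way to wire up that paging piece, assuming CloudTrail management events are being delivered to EventBridge (the rule name and SNS topic ARN are made up):

code:
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="cloudtrail-tamper-alert",       # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.cloudtrail"],
        "detail": {
            "eventSource": ["cloudtrail.amazonaws.com"],
            # write actions that would delete or stop recording audit info
            "eventName": ["StopLogging", "DeleteTrail",
                          "UpdateTrail", "PutEventSelectors"],
        },
    }),
)

events.put_targets(
    Rule="cloudtrail-tamper-alert",
    Targets=[{"Id": "page-oncall",
              "Arn": "arn:aws:sns:us-east-1:123456789012:oncall"}],  # hypothetical topic
)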

Also, deploy Gravitational Teleport for your ssh access to machines if you're not ever going to escape the sweet deathly vendor-lock embrace of SSM in AWS. Just freakin' do it. Don't open port 22 in 2022 in a fresh cloud, for the love of Almighty God and Satan.

Zapf Dingbat
Jan 9, 2001


You know, this actually got me thinking about this company's whole setup. Like I said, this is my first cloud toucher job. I have more of a background in traditional networking, Linux admin, and VoIP. This is the first time I've had this much power and this much backend access. I also have a feeling that most people in this company are on second careers like I am.

The company I work for is a fintech that provides a service to banks and credit unions. Since it's financial data and our application needs access to the bank's core or mainframe, everything is very segregated. Usually the path is Bank's mainframe -> AWS VPN -> static routes through a VPC -> VM -> docker container on that VM. The container can talk to a DB that's hosted elsewhere in AWS, and the VMs can't talk to each other unless explicitly allowed by network routes and security groups.

It's always one container per instance. I'm not sure if there's a better way to do this because of the need for separation. I'm running into what seems like old world IT problems though, like having to manually shut down, upgrade, then start up a VM whenever development for that customer requires more CPU or RAM.

Over the last couple of months I've taught myself Terraform which has really helped me out in deploying new customers. However, the problem I run into is not having sufficient permissions to make changes using Terraform after the fact, which is why I was asking for advice on permissions earlier. This is making it a little hard to do the whole infrastructure as code thing.

A config management tool hasn't been fully implemented yet. A more senior admin than me uses Ansible, but that's only for him since only he has SSH access to everything because it's built into the VM image. I connect to the VMs using AWS EC2 Connect, which is convenient, but there's no interoperability between that tool and Ansible, so I can't make config changes unless I manually connect to it. We're not sure how to solve this problem.

Oh, also there's no automated updates to the docker images yet. What tends to happen is that the code gets worked on and the image gets updated in our repo, but there is no plan to update it unless they run into a problem on a specific customer, then they ask me to manually pull, relaunch, and monitor the container for a minute to make sure it doesn't throw any errors. There's nothing continuous about this, of course.

What would this thread do differently, if anything?

Zapf Dingbat fucked around with this message at 17:33 on Jul 27, 2022

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?

Zapf Dingbat posted:

You know, this actually got me thinking about this company's whole setup. Like I said, this is my first cloud toucher job. I have more of a background in traditional networking, Linux admin, and VoIP. This is the first time I've had this much power and this much backend access. I also have a feeling that most people in this company are on second careers like I am.

The company I work for is a fintech that provides a service to banks and credit unions. Since it's financial data and our application needs access to the bank's core or mainframe, everything is very segregated. Usually the path is Bank's mainframe -> AWS VPN -> static routes through a VPC -> VM -> docker container on that VM. The container can talk to a DB that's hosted elsewhere in AWS, and the VMs can't talk to each other unless explicitly allowed by network routes and security groups.

It's always one container per instance. I'm not sure if there's a better way to do this because of the need for separation. I'm running into what seems like old world IT problems though, like having to manually shut down, upgrade, then start up a VM whenever development for that customer requires more CPU or RAM.

Over the last couple of months I've taught myself Terraform which has really helped me out in deploying new customers. However, the problem I run into is not having sufficient permissions to make changes using Terraform after the fact, which is why I was asking for advice on permissions earlier. This is making it a little hard to do the whole infrastructure as code thing.

Oh, also there's no automated updates to the docker images yet. What tends to happen is that the code gets worked on and the image gets updated in our repo, but there is no plan to update it unless they run into a problem on a specific customer, then they ask me to manually pull, relaunch, and monitor the container for a minute to make sure it doesn't throw any errors. There's nothing continuous about this, of course.

What would this thread do differently, if anything?

This is a classic DevOps problem domain. I'm on mobile rn but I'll throw my own two cents on the pile later. I'm sure someone else here is gonna write you a helpful essay in the meantime.

Also kubernetes is probably your friend here too

Hadlock
Nov 9, 2004

Zapf Dingbat posted:

You know, this actually got me thinking about this company's whole setup. Like I said, this is my first cloud toucher job.

. The container can talk to a DB that's hosted elsewhere in AWS, and the VMs can't talk to each other unless explicitly allowed by network routes and security groups.

It's always one container per instance. I'm not sure if there's a better way to do this because of the need for separation.

What would this thread do differently, if anything?

If it were me, I'd move this to kubernetes

Each container on a VM represents, effectively, a service for the company, right

I would set up each service as a kubernetes Deployment. Once you get the cluster up you should be able to create babbys first deployment via your favorite online tutorial. Converting your container to a k8s deployment should be pretty straightforward. Put the Deployment for service Foo in its own Namespace, maybe call it "Foo". You can specify the Namespace in the Deployment yaml file

Then convert the other service containers to their own Deployments in their own Namespaces

Once you're done with that, build a helm chart (kubernetes Deployment templating system) for each service, redeploy the services using helm.

Then, pick argocd or flux2 to update your Helm releases

voila, your skill set just doubled in value, and you have full modern CD

oh yeah see if you can squeeze in spinning up k8s clusters in terraform, but not strictly required for this exercise

Edit: the reason you're putting the different deployments in different Namespaces is that it fully separates the containers so they share no resources they can access, this is the same as putting a container on a separate VM

This also eliminates the need to run and manage VMs, manage SSH/SSM etc, and lines you up to scale up in the future (just change deployment replicas from 1 to 99 or whatever you need)

For monitoring and alerting, you can deploy Prometheus and Grafana which are gold standard helm charts to practice with and learn about helm

Double edit: for web facing stuff you'll need to go through babbys first ingress controller using static DNS. Once you get access to R53 you can set up let's encrypt to handle dynamic ssl certificates. I recommend just going with wildcard SSL certs out of the box (https://*.example.com instead of fooservice.example.com); slightly harder, but you'll never hit the service's rate limit for certs that way in five years
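
To make the Namespace + Deployment step concrete, here's roughly what those objects look like, sketched with the kubernetes python client (the image and names are placeholders; in practice you'd write this as yaml or a helm chart):

code:
from kubernetes import client, config

config.load_kube_config()   # uses your local kubeconfig

core = client.CoreV1Api()
apps = client.AppsV1Api()

# Namespace "foo" for the foo service
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name="foo")))

# one-replica Deployment running the foo container inside that Namespace
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="foo", namespace="foo"),
    spec=client.V1DeploymentSpec(
        replicas=1,   # bump this later instead of resizing VMs
        selector=client.V1LabelSelector(match_labels={"app": "foo"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "foo"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="foo",
                    image="registry.example.com/foo:1.0"),  # placeholder image
            ]),
        ),
    ),
)
apps.create_namespaced_deployment(namespace="foo", body=deployment)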

Hadlock fucked around with this message at 18:48 on Jul 27, 2022

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?

Hadlock posted:

If it were me, I'd move this to kubernetes

Each container on a VM represents, effectively, a service for the company, right

I would set up each service as a kubernetes Deployment. Once you get the cluster up you should be able to create babbys first deployment via your favorite online tutorial. Converting your container to a k8s deployment should be pretty straightforward. Put the Deployment for service Foo in its own Namespace, maybe call it "Foo". You can specify the Namespace in the Deployment yaml file

Then convert the other service containers to their own Deployments in their own Namespaces

Once you're done with that, build a helm chart (kubernetes Deployment templating system) for each service, redeploy the services using helm.

Then, pick argocd or flux2 to update your Helm releases

voila, your skill set just doubled in value, and you have full modern CD

oh yeah see if you can squeeze in spinning up k8s clusters in terraform, but not strictly required for this exercise

Edit: the reason you're putting the different deployments in different Namespaces is that it fully separates the containers so they share no resources they can access, this is the same as putting a container on a separate VM

This also eliminates the need to run and manage VMs, manage SSH/SSM etc, and lines you up to scale up in the future (just change deployment replicas from 1 to 99 or whatever you need)

For monitoring and alerting, you can deploy Prometheus and Grafana which are gold standard helm charts to practice with and learn about helm

Alternatively, instead of doing namespaces for hardware segregation, just use taints+tolerations and anti affinities.

Hadlock
Nov 9, 2004

The containers still live on the same CNI right? Physical separation of the containers on different hosts doesn't matter, last I checked. The ability for two pods in the same namespace to talk to each other on the same cluster regardless of where/which node they live on via kubernetes networking is a big part of the Kubernetes Magic™

Edit: namespacing also allows you in the future to prevent the finance team from messing with the order management teams' stuff and vice versa by limiting what Namespaces they're allowed to read/write

You might be able to do permissions via the same tags the node affinity is setup for but that would be super ugly and brittle

Hadlock fucked around with this message at 18:53 on Jul 27, 2022

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
I strongly suspect that if your admin team hasn't figured out how to properly use Ansible amongst themselves, they're not ready to use Kubernetes. I feel that Kubernetes is a game changer but it requires a competent DevOps team to run it, and if you don't have that then you're just setting yourself up for more pain.

As someone upthread stated, your situation is a classic DevOps problem and you need to get your infra into code. Nobody should be doing anything manually; I'm guessing that if your boss is nervous about handing permissions to individuals, it's precisely because they've been bitten in the past by people cowboying it and making manual adjustments. The whole point of getting infra into code is that you have a source-of-truth that's outside of your cloud, and it has a well-defined and audited change process.
- The fact that only the Ops Bot has permissions to make changes gives your boss peace of mind that no-one is going to Leeroy Jenkins it and tweak stuff manually in production.
- The fact that changes go through Git (and thus should have a change log and ideally a review process) gives you the confidence that changes aren't being made unilaterally, and also gives you the ability to roll back changes if things go sideways.
- The fact that they're defined as code means you can parameterize them, so it's easy to (say) spin up testing/staging environments with different parameters, giving you even more confidence that your production changes will be successful.

In your situation I would start with what you have, which is Ansible + Terraform:
- Your whole team should be using Ansible, not just one person
- All the Ansible playbooks & Terraform modules need to be under version control (i.e. Git) ASAP, and you need to implement a process for making changes to those.
- All your credentials and other secrets need to be in a vault, either Ansible Vault or Hashicorp Vault.
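
If you go the Hashicorp Vault route, consuming a secret from your tooling ends up looking roughly like this (hvac client sketch; the address, token handling, and path are placeholders):

code:
import hvac

client = hvac.Client(url="https://vault.example.com:8200",  # placeholder address
                     token="s.xxxxx")      # use a real auth method in practice
# read a secret from the KV v2 engine at a placeholder path
secret = client.secrets.kv.v2.read_secret_version(path="app/db")
db_password = secret["data"]["data"]["password"]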

Slightly more advanced: if you don't want people running Ansible/Terraform from their laptops (and they really shouldn't be), then look into a product like Ansible Tower. The whole point of that is to provide a hosted GUI/API where a DevOps team can run their playbooks.

The 1-Docker Container-per-VM thing is wasteful and likely a PITA to manage. It's true that you could probably clump those onto a single larger VM and manage them all using Ansible or (ideally) Kubernetes. But IMHO you've got more fundamental problems to deal with first.

New Yorp New Yorp
Jul 18, 2003

Only in Kenya.
Pillbug
I still don't get why everyone has such a boner for ansible. It seems like yet another configuration management tool for managing vm farms, yet everyone seems to want to hyperextend it to do everything

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?

Hadlock is right, nvm what I said about namespaces vs affinity/taints

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

New Yorp New Yorp posted:

I still don't get why everyone has such a boner for ansible. It seems like yet another configuration management tool for managing vm farms, yet everyone seems to want to hyperextend it to do everything

Tbh it’s really just that one poster here

Aside from that, Ansible, Chef, and Salt are basically dead products

The Fool
Oct 16, 2003


Blinkz0rz posted:

Aside from that, Ansible, Chef, and Salt are basically dead products

I don't get this attitude either. They're still actively developed and have huge customer bases. They're just not the flavor of the month anymore.

xzzy
Mar 5, 2009

Puppet is out there, struggling to prove it has a place in the cloud universe.

(IMO it's still the best choice for traditional configuration management)

Hadlock
Nov 9, 2004

New Yorp New Yorp posted:

I still don't get why everyone has such a boner for ansible. It seems like yet another configuration management tool for managing vm farms, yet everyone seems to want to hyperextend it to do everything

This was kind of my take as well

At a previous job I had a coworker who basically lied through his teeth to use ansible to create a new templating engine for CloudFormation :smithicide:

(immediately thereafter, he ran away to work for a Blockchain company, lol)

When all you know is Ansible, every problem looks like an Ansible playbook I guess

You can run terraform from Jenkins, GitHub Actions, or Terraform Cloud just as easily

JehovahsWetness
Dec 9, 2005

bang that shit retarded

Hadlock posted:

different Namespaces is that it fully separates the containers so they share no resources they can access, this is the same as putting a container on a separate VM

Namespaces only provide resource name isolation and an RBAC boundary, they don't provide any actual workload isolation out of the box. Containers in separate namespaces can absolutely reach each other and kube-dns provides namespaced DNS names for Services in other namespaces. If you want actual isolation then you'll need to enforce namespace boundaries with CNI-supported NetworkPolicies and deny common pod priv escalation capabilities with PodSecurity Admission rules or something like OPA Gatekeeper.
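
For reference, a namespace-wide default-deny ingress policy is about this much (sketched with the kubernetes python client, placeholder namespace; your CNI has to actually enforce NetworkPolicies for it to mean anything):

code:
from kubernetes import client, config

config.load_kube_config()
net = client.NetworkingV1Api()

deny_all = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="default-deny-ingress", namespace="foo"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),   # empty selector = every pod
        policy_types=["Ingress"],                # no ingress rules => deny all ingress
    ),
)
net.create_namespaced_network_policy(namespace="foo", body=deny_all)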

Hardening a cluster to secure mixed-tenant workloads in k8s is hard, and there's TONS of wiggling an attacker can do with the ability to launch a pod (or even just an RCE). Just this month I purple-teamed two clusters whose teams intended them to be used by other tenants, and I was able to steal k8s SA tokens, read secrets from every namespace, pivot into other AWS accounts from IRSA tokens, break NetworkPolicy namespace boundaries and use sensitive Services, read secrets from rendered ENV vars, etc., all from launching a single Pod.

Multi-tenancy is hard and it's better to know that k8s doesn't do anything to enforce it at the workload level, just in k8s API RBAC, than to assume there's any segregation of containers.

JehovahsWetness fucked around with this message at 20:27 on Jul 27, 2022

Zapf Dingbat
Jan 9, 2001


Yeah, this is a lot to think about. I don't do well with something unless I can conceptualize it first, and I don't think I've gotten that far with Kubernetes. That is, I don't have a good visual in my mind like I do with VMs, containers and networks. I'll have to watch some videos and read some articles until it clicks, and then maybe I can apply it to what we're doing.

I think part of the problem too is that all the techs in this company are devs, and I'm the first pure "Ops" person they've hired. I don't think they have a good idea of what an efficient operation looks like. To be fair, neither do I.

Hadlock posted:

The containers still live on the same CNI right? Physical separation of the containers on different hosts doesn't matter, last I checked. The ability for two pods in the same namespace to talk to each other on the same cluster regardless of where/which node they live on via kubernetes networking is a big part of the Kubernetes Magic™

The reason I mention the segregation is that the boss really wants an environment where things can't talk to each other unless you explicitly tell them to, because he doesn't want those finance systems to bleed into each other. So the concern is not how easily things can talk to each other, but how hard it natively is for them to do so.

edit: And to another point: yes, I'm going to have to solve my change management problem first. I created a git repo for my Terraform changes, but it only applies to new customers, and I'm at the point where I'm going to broach the subject with the boss. Hopefully soon we'll be in a place where we review, approve and execute infrastructure changes based on what's in the git repo.

Zapf Dingbat fucked around with this message at 20:37 on Jul 27, 2022

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
Kubernetes is absolutely overkill for this. Telling the OP to use it is like being told someone needs to fix a Civic and telling them to just buy a Lambo instead.

Isolating workloads on individual VMs is fine. Using a container on a single VM is also fine as it gives you more options in terms of how you might want to deploy in the future.

At this point it's probably worth doing what this poster said:

minato posted:

In your situation I would start with what you have, which is Ansible + Terraform:
- Your whole team should be using Ansible, not just one person
- All the Ansible playbooks & Terraform modules need to be under version control (i.e. Git) ASAP, and you need to implement a process for making changes to those.
- All your credentials and other secrets need to be in a vault, either Ansible Vault or Hashicorp Vault.

The other thing to consider is whether any regulations might require actual workload isolation. If not, then Kubernetes might be the play in the future but golden image AMIs still work fine, they're just not the hotness anymore.

New Yorp New Yorp
Jul 18, 2003

Only in Kenya.
Pillbug

Hadlock posted:

This was kind of my take as well

Thank god others feel the same way. I was half expecting to get raked across the coals. Or an eye opening explanation of the benefits of ansible.

12 rats tied together
Sep 7, 2006

New Yorp New Yorp posted:

I still don't get why everyone has such a boner for ansible. It seems like yet another configuration management tool for managing vm farms, yet everyone seems to want to hyperextend it to do everything

It's hard to respond to this without breaking the grey forums rules but, to keep this quick, you are mistaken.

If your takeaway from reading the ansible documentation is that it's a tool for managing config on virtual machine farms I would encourage you to give it another read.

People keep extending it to support whatever the new thing is this year because it's good at solving hard problems.

MightyBigMinus
Jan 26, 2020

I'm gonna go ahead and bet that most of the "mainframe + vpn/vpc + dedicated vm + 1:1 container" chain there is an almost perfect reflection of the stakeholders involved[1] and the sales-cycle/regulatory-compliance processes they've been through. going kube would require firing 2/3rds of them and re-doing all the compliance stuff with every customer. utterly bananas. take what's there and just encode it as close to as-is as possible in tf and with a git workflow. if your customers have mainframes and demanded dedicated environments then who gives a poo poo if their wildly overprovisioned vm sits idle with one container on it. I sincerely doubt you have a cogs margin problem.

for your permissions problem, have your boss give you an iam role with the ReadOnly policy and set up athena querying for your cloudtrail audit logs[2] (rough sketch of that lookup below the footnotes). use your readonly role to look into whatever without doing any harm, and then when your TF runs into trouble, query the logs for the exact permissions you need (the ones that failed) and craft that into an update to your policy. manage the policy in TF, and use a PR to document the specific scenario/use-case that warrants the permissions.


1 - customer + network engineer + sysadmin + devs, also known as conway's law
2 - https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html
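
rough sketch of the athena lookup with boto3 (the table/database names, role name, and results bucket are placeholders that come from however you set the table up per [2]):

code:
import boto3

athena = boto3.client("athena")

query = """
SELECT eventsource, eventname, errorcode, errormessage
FROM cloudtrail_logs                                -- placeholder table name
WHERE useridentity.arn LIKE '%ZapfReadOnly%'        -- placeholder role name
  AND errorcode IN ('AccessDenied', 'UnauthorizedOperation')
  AND eventtime > to_iso8601(current_timestamp - interval '1' day)
ORDER BY eventtime DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},                      # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)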

MightyBigMinus fucked around with this message at 23:22 on Jul 27, 2022

New Yorp New Yorp
Jul 18, 2003

Only in Kenya.
Pillbug

12 rats tied together posted:


People keep extending it to support whatever the new thing is this year because it's good at solving hard problems.

Okay, how is building a bunch of extra ansible poo poo to run a helm chart better than just running a helm chart? That's what I'm missing. It looks like it's adding an extra layer of tooling on top of tools that already solve hard problems. What's the benefit?

I'm not trying to be difficult. I've just never seen a scenario where adding ansible would improve a process, only complicate it.

New Yorp New Yorp fucked around with this message at 23:56 on Jul 27, 2022

drunk mutt
Jul 5, 2011

I just think they're neat

New Yorp New Yorp posted:

Okay, how is building a bunch of extra ansible poo poo to run a helm chart better than just running a helm chart? That's what I'm missing. It looks like it's adding an extra layer of tooling on top of tools that already solve hard problems. What's the benefit?

I'm not trying to be difficult. I've just never seen a scenario where adding ansible would improve a process, only complicate it.

It really depends on the environment of the solution(s), but generally, being able to rapidly introduce/reproduce changes across an entire fleet in an idempotent manner really changes the ball game. The same work that a four-line task does in Ansible winds up being like 50 lines of bash.

One of the best examples I've seen where there was pushback on introducing Ansible was having it execute the TF plan/apply: instead of having 50 different variable files in the repo, there was a simple dictionary defined within an Ansible variable file. Whenever there was an infrastructure or configuration change that needed to be applied to all environments, you just pushed it into the pipeline and rested assured that the exact change you tested was going to be delivered across the fleet.

I hope this doesn't come off wrong, but I feel maybe it's just not a tool you see fit for your environment, so you're willing to wave it off as having no value.

Edit to add: The Ansible variable file was quickly replaced with a Consul backend which Ansible just plucked from and very minimal changes were needed to any of the tasks/playbooks.

12 rats tied together
Sep 7, 2006

Sure, absolutely understood, I don't think you're trying to be difficult, I have just found that leading with the explanation stifles the discussion rather than encourages it.

To your question, there are multiple overlapping dimensions of "better" that ansible makes this, but much like you don't necessarily need ansible to deploy software to 1 server, you're right that you don't really need ansible to run one helm chart on one cluster. I think that's pretty intuitive, if you don't have hard problems the "hard problems" tool is going to look stupid and pointless.

If you need to run one helm chart on two dozen clusters, with different k8s versions, and some of the clusters you can only access through a jumpbox over ssh, but other clusters you need to launch a container inside of, and some of the clusters have the chart already applied but the config needs to be adjusted in some way because of a special case, and you also want to make sure the helm chart runs on all new clusters too, but the chart can only run after some other resources already exist, and those resources aren't things that helm manages directly (maybe they are some databricks resources or a particular cosmos db connection string which comes from a per-cluster azure account), and maybe some of the data you need comes from a cloud ops team that only uses terraform, but there's also data you need from DBRE that is only ARM, then you would want to pull out the hard problems tool.

Obviously the best solution here is to not have hard problems, many of the things above could have been designed away by a competent architecture, but actual infrastructure at a profit generating enterprise is rarely that easy or clean.

Hadlock
Nov 9, 2004

Blinkz0rz posted:

Kubernetes sux0rz

Counterpoint

Half of our nodes in an AZ went down due to a power outage

I did not know this until I was debugging a single Jenkins failure when it happened, and then saw the notification from AWS. The load got shunted to healthy (online) nodes and hardly saw a blip

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

Hadlock posted:

Counterpoint

Half of our nodes in an AZ went down due to a power outage

I did not know this until I was debugging a single Jenkins failure when it happened, and then saw the notification from AWS. The load got shunted to healthy (online) nodes and hardly saw a blip

It's certainly possible to have a well architected solution where an AZ going down barely causes a blip without Kubernetes as well. In fact, that should be the norm for a single AZ going down, no matter what you're using. Not that I'm arguing for or against a particular way of doing things, just making the point that it's not really something specific to k8s.
