George Wright
Nov 20, 2005
Azure Support/TAMs at all levels have been god awful. The only thing they truly care about, on every single phone call and every single ticket, is how quickly they can reduce the ticket severity. Their SLAs are also meaningless. I can’t understand why anyone would willingly go with Azure. It’s so bad that we’ve had conversations with them and told them it’s one of a few big reasons why we don’t increase our spend with them.

That said, GrafanaLabs support and TAM teams have been great for the last ~5 years. Obviously a much smaller company, so that makes sense.

George Wright
Nov 20, 2005
Helm is great if you never have to look at it beyond helm install <some public chart>. The things people do in the helm templates can be breathtaking.

ArgoCD, on the other hand, is spectacular.

George Wright
Nov 20, 2005
What Linux and K8s distros are folks using for K8s on metal these days?

Looks like on the Linux distro side there are the usual suspects, but also distros like Flatcar, Bottlerocket, and Talos. Any experience here with those? Any horror stories?

As for K8s, EKS-D looks appealing so we could have the same distro on metal as we do in AWS. Any experience with EKS-D or any other K8s distros? Any horror stories?

George Wright
Nov 20, 2005

freeasinbeer posted:

Cilium is very alpha quality at the moment unless you are just replacing your existing CNI. I’d wait, but it’s still the right direction

From a CNI perspective, a service mesh perspective, or both?

George Wright
Nov 20, 2005

Docjowles posted:

Yeah... I mean for both of us, StatefulSets exist. But why are you running this app in k8s at all, except to say you can?

edit: I guess this is a good time to ask the audience if any of you are running important databases or elasticsearch clusters or something in k8s and are happy about it or doing it at gunpoint. We run a lot of k8s but really try to limit it to just stateless services here.

Our DBA team has been collectively dying for this resume bullet point.

George Wright
Nov 20, 2005
We use Prometheus internally, so if teams are very reluctant to add a health check endpoint, using the metrics endpoint for health checks is better than nothing, as long as it initializes last.
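For anyone who wants the shape of that, it’s a container spec fragment along these lines (the /metrics path, port, and timings here are placeholders for whatever the app actually exposes, not a real setup):

code:
# Sketch only: path, port, and timings are assumptions.
containers:
  - name: app
    image: registry.example.com/app:1.2.3
    ports:
      - name: metrics
        containerPort: 9090
    readinessProbe:
      httpGet:
        path: /metrics        # only a useful signal if metrics init happens last
        port: metrics
      initialDelaySeconds: 10
      periodSeconds: 15
    # If the team ever adds a real health endpoint, point the probe at it instead:
    #   path: /healthz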

But yes, adding a health check endpoint is probably the most reasonable way to go.

George Wright
Nov 20, 2005
I strongly prefer JSONx over XML, JSON, and YAML.

George Wright
Nov 20, 2005
We switched away from Argo+Kustomize to Argo+Helm. Helm is awful, don’t get me wrong, but we manage 40 “Addons” (and counting) across 38 clusters (and counting) across four different cloud providers (and counting), and the number of Kustomize overlays required was unwieldy at best. Plus, a lot of OSS “Addons” use a Helm chart as their default/official install method, so it’s fairly easy to consume.
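For the shape of it, each addon in this pattern is basically just an ArgoCD Application pointing at an upstream chart, roughly like this (the chart, version, and values here are placeholders, not our actual setup):

code:
# Sketch only: repo, chart version, and values are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: metrics-server
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  source:
    repoURL: https://kubernetes-sigs.github.io/metrics-server/
    chart: metrics-server
    targetRevision: 3.12.1
    helm:
      values: |
        replicas: 2
  syncPolicy:
    automated:
      prune: true
      selfHeal: true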

With Argo+Helm things are much cleaner in the repo and much easier to grok. We also have an in-house tool that runs as a pipeline to render a diff between main and the branch just to ensure the changes we’re intending to make are what ArgoCD will make.

All in all it works well enough. We’ve explored other tools and patterns but they haven’t ever been better.

Helm itself is quite bad. Templating YAML is horrible. But it is what it is.

George Wright
Nov 20, 2005

Warbird posted:

Question for the room: I distinctly remember “never ever ever run a database in a container” being common wisdom back in the mid 20teens. That’s clearly no longer a problem and I find myself wondering if that was ever an actual issue and things have changed, or it was a misunderstanding of the tech/an old engineer’s tale. What’s the word there?

Some databases were built with containers in mind, but they are all relatively new so I don’t trust them yet for any serious workloads. Every other database has support bolted on well after the fact.

We have/had efforts here to run databases on K8s, and the first time an underlying host gets rotated out or patched, the DBA or owner immediately comes and complains, asking for either an instance that never gets retired or two weeks’ heads-up on any maintenance, and we just tell them too bad.

The big problem is that DBAs, and to a lesser extent feature teams, don’t understand the implications and often don’t like to read documentation that goes against their beliefs. This means they get surprised in a bad way, and then you have to have two months of meetings with PMs to discuss how to accommodate them or migrate away.

George Wright
Nov 20, 2005

The Iron Rose posted:

We’re migrating some k8s clusters, which historically have been split between individual dev teams, and would like to start building resources in the same cluster to save some money and operational overhead.

This means NetworkPolicies, namespaced roleBindings, resource quotas/separate node pools, et cetera. My inclination is to require namespaces in the form <service-team>. I’m aware that partial matches are prohibited, so this is really just a logical organization thing so it’s obvious that “redis” is actually owned by “team foo”.

We’re currently using the Azure CNI rather than cilium, which can potentially change.

Any advice here? Mostly in terms of naming conventions. Every deployment is tagged with service and team labels/annotations.

I don’t like to embed team information in namespace names. This should be metadata attached to the namespace by way of labels. Services can change hands, teams can rename themselves, reorgs can delete teams. It’s easier to update metadata than it is to move a service to a new namespace. In most cost tracking services the team label automatically gets applied to any objects within that namespace, so we don’t worry about annotations on the deployments or pods.
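Roughly, the ownership lives in labels on the namespace rather than in its name (the team name and labels here are made up):

code:
# Sketch only: names and labels are placeholders.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout
  labels:
    team: payments        # reorgs become a label edit, not a namespace migration
    cost-center: "1234"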

We heavily use rbac-manager for managing access to namespaces. It creates rolebindings between namespaces and groups from our directory. Our directory structure sucks in that our IT people only offer team-based groups, but that’s better than a single flat group. Granting a team access is just a matter of adding an annotation to the namespace.
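A minimal sketch of the pattern, going from memory and simplified (the group, role, and labels are placeholders; upstream rbac-manager selects namespaces by label, so adjust to however you tag your namespace metadata):

code:
# Sketch only: group name, clusterRole, and labels are placeholders.
apiVersion: rbacmanager.reflect.io/v1beta1
kind: RBACDefinition
metadata:
  name: team-payments-access
rbacBindings:
  - name: payments-editors
    subjects:
      - kind: Group
        name: payments-team          # group from the directory
    roleBindings:
      - clusterRole: edit
        namespaceSelector:
          matchLabels:
            team: payments           # any namespace tagged with the team gets the binding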

We don’t do much network segmentation so I can’t offer insight there.

George Wright
Nov 20, 2005

The NPC posted:

If you are storing team info in metadata, what are your namespace naming conventions? If there isn't team1-redis and team2-redis how do you prevent collisions?

For the record we are using team-app-env so we have webdev-homepage-dev, webdev-homepage-uat, finance-batch-dev etc. with each of these tied to an AD group for permissions. We include the environment in the name because we have 1 nonprod cluster.

Typically where I work a service will own its own cache or data stores in K8s, so it doesn’t make sense to name a namespace after the software powering that cache, let alone to put that cache in a separate namespace from the application consuming it.

If the service name is bombadier, the namespace would be bombadier or bombadier-<env>. Within that namespace you would have your app deployment and your cache store(s) defined.

A database team managing a data store would either offer a shared data store in their own namespace, or they would have an operator that is allowed to create a data store in a service’s namespace.

Exceptions always exist. We try to do our best to steer teams towards best practices, but we still give people enough rope to learn lessons the hard way while we silently judge and tap on the best-practices docs they chose to ignore. At the end of the day it’s their decisions that typically cause downtime, not ours.

George Wright
Nov 20, 2005
If you’re handling PII, or you’ve got a reliable, well-used, integrated, and supported PKI, then you should terminate at the pod. Otherwise it’s easier to terminate at the LB and let your cloud provider deal with certs.
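For the LB case, on AWS for example, it’s roughly a few service annotations (the cert ARN is a placeholder, and the exact annotation names depend on which load balancer controller you’re running):

code:
# Sketch only: the ACM ARN is a placeholder; annotations vary by controller.
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:us-east-1:111111111111:certificate/placeholder
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - name: https
      port: 443
      targetPort: 8080    # TLS ends at the LB, plain HTTP to the pod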

George Wright
Nov 20, 2005
We’ve resisted all service meshes, and so far no one has had a compelling enough use case to consider one. We’re open to them if someone actually has a valid need, but it’s mostly been attempted cargo-culting or resume-driven development.

We don’t have the team size to support it and quite frankly we’ve still got larger problems to solve so we don’t want the distraction.

George Wright
Nov 20, 2005
Infrastructure Janitor or Internet Janitor
