my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
I think ELK/EFK is pretty widespread, no?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
the idea that everyone on a team should be involved in all three of writing, releasing, and deploying code is a bad one

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

crazypenguin posted:

I've got a medium-sized (50 projects/repos) Jenkins build setup going with the newish multibranch pipeline approach. It mostly works great, but I have a few issues.

If I can't fix these... oh well. This works acceptably already. I just wanted to check to see if I was missing anything obvious.

  • To do downstream integration testing/building, I manually put a list of other jobs to build in each Jenkinsfile. This slightly annoys me. It means in addition to downstream projects having to know about their upstream dependencies (of course!), I also have to update the upstream projects to inform them what downstream projects should be rebuilt on a change. Not the worst, but if there's a better way of dealing with this, I don't know about it.
  • I have problems with diamond dependencies. If project A is a dependency of B and C, obviously I have A rebuild B and C on a change. But if D has dependencies on B and C, I kinda get screwed. Right now, if A changes, I end up rebuilding D twice, redundantly, via both B and C. If there's a smart way to handle this stuff, I don't know about it.
  • Coordinated changes are annoying. If we have to change two projects at the same time (e.g. make breaking change in one, fix its use in another), we have a half-assed thing right now where it tries to build branches of the same name, so we can sorta do that. But then when merging these branches in multiple projects at once, we overwhelm jenkins with a lot of rebuilds, most of them redundant again. (If A depends on B, then committing to both rebuilds A and B, then A again, this time downstream of B.)

It seems like these should be common issues, but I dunno if I'm missing something, or if this is just part of the fun.

For the first issue, we keep our automation code separate from the actual code and it works pretty well. You still keep your pom, or whatever is required to build, with the code in the code repo, but you keep all the automation and coordination between separate jobs in another repo. The pattern where the Jenkinsfile lives with the code leads to nothing but headaches.
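
As a rough sketch of that split (job names invented, assuming the Pipeline build step), the orchestration repo can own the fan-out so no project's own Jenkinsfile has to list its downstreams, and the diamond case only builds D once:
code:
// hypothetical scripted Jenkinsfile living in the separate automation repo
node {
    stage('Build A') {
        build job: 'project-a/main', wait: true
    }
    stage('Build B and C') {
        // fan out to the direct downstreams of A in parallel
        parallel(
            B: { build job: 'project-b/main', wait: true },
            C: { build job: 'project-c/main', wait: true }
        )
    }
    stage('Build D') {
        // D depends on both B and C but is only rebuilt here, once
        build job: 'project-d/main', wait: true
    }
}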

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
Is there anyone else doing non-image, artifact-based CM, e.g. chef-zero or ansible bundles? Where/when do you drop your artifacts?

We have a bunch of Terraform for setting up instances of our infrastructure, and currently it lives separately from the provisioning code, but we want to merge the repos so we can release them as one. Currently our hosts pull provisioning code from git (I know, I know), but continuing to do this is for sure going to gently caress us up for any sort of branching once we merge the two repos (here come the self-dependencies), and it's also just bad practice. My thought right now is to generate an instance-specific CM bundle with Terraform that's pushed onto S3, but I'm wondering if anyone has hit any issues with something like that. We also have Artifactory available, but we wouldn't be able to have per-instance artifacts, which are important for us for testing new infra or CM code.

I’ve handled this in the past for smaller projects by delivering a hex-encoded tar of the provisioning code in user data, but unfortunately we’re just past the 16k limit 😉
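
A sketch of the S3 idea, purely illustrative (assumes the hashicorp/aws and hashicorp/archive providers; bucket, path, and variable names are made up):
code:
# package the provisioning code and push a per-instance bundle to S3
data "archive_file" "cm_bundle" {
  type        = "zip"
  source_dir  = "${path.module}/provisioning"
  output_path = "${path.module}/dist/cm-bundle.zip"
}

resource "aws_s3_object" "cm_bundle" {
  bucket = "my-cm-artifacts"                              # placeholder bucket
  key    = "instances/${var.instance_name}/cm-bundle.zip" # per-instance key
  source = data.archive_file.cm_bundle.output_path
  etag   = data.archive_file.cm_bundle.output_md5
}
The host's user data then only needs enough to pull that one object, instead of carrying the whole hex-encoded tar.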

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Hadlock posted:

Our monolith takes about 15 minutes to build, other services are in the 3-10 minute range.

Full commit to live smoke test including generating private dns record etc is probably close to 20 minutes

Do you guys do load tests? Any integration with internal services outside your control?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
Let's say you want to have a version endpoint or string somewhere in your app. Isn't that something that's going to be static in your artifacts? If so, how do you keep the pattern of using the same artifact in QA/Staging/Prod if you'll need to modify this as it gets promoted?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Kevin Mitnick P.E. posted:

If it changes based on deployment environment then it's not a version.

It doesn’t change based on deployment environment, but when it’s in QA it’s still just a release candidate for us, not a real release yet. We promote it to being a real release once it goes through a bunch of integration tests with other systems. (having a “QA process” instead of real CICD is non-ideal, but it’s unfortunately not my decision to make)

Maybe our endpoint should just return a build number?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Kevin Mitnick P.E. posted:

"Build a thing, do a bunch of QA on it, then build a different (but hopefully close enough) thing and release it" is not a process I would be comfortable with.

Right, that's something I don't think I want to do. Here's what I think my options are, but I'm wondering if there's maybe another:
- rebuilding/repackaging artifact at promote time, after QA (bad for the reasons stated)
- not having release candidates and only ever sending real releases to QA (would result in a ton of versions for us because we always find stuff in QA)
- not having version information available in the artifact

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

ThePeavstenator posted:

I'm not sure if I've missed something but why can't you just do the old <Major>.<Minor>.<Release>.<Build> ?

actually I think this would totally work and I don’t know why I didn’t consider doing this. we have been sticking “RC” in our version strings for stuff that hasn’t gone through QA, but dropping that would alleviate all the pain. 🤦‍♂️ thanks!
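
For anyone following along, the stamp-it-once idea looks roughly like this at build time (a hedged sketch; the file name and MAJOR/MINOR/PATCH variables are invented, BUILD_NUMBER is the usual Jenkins one):
code:
# run once when the artifact is built; never touched again at promote time
VERSION="${MAJOR}.${MINOR}.${PATCH}.${BUILD_NUMBER}"
echo "$VERSION" > build/version.txt   # baked into the artifact
# promotion just moves/retags the same artifact, so the string never changes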

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
terraform is certainly an imperfect and frequently buggy tool, but also the least bad one available for certain types of problems

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

New Yorp New Yorp posted:

I've been translating my ARM template for this project into Terraform. Here are my Terraform impressions:

- I like the syntax better than ARM syntax. This is the only thing I like.
- I don't understand the point of a state file. It creates something that's generated by interrogating Azure, but won't work without it? And now I have some extra file generated at deployment time that I have to worry about synchronizing and sharing between deployments? Maybe I'm missing something, but it seems really pointless.
- Creating the resource I need takes upwards of an hour. An equivalent ARM template takes seconds. Literally 5 seconds. I have no idea what Terraform is doing differently.
- It has a hard-coded timeout of an hour for resource creation. If the timeout expires, it just fails. There is no way to change the timeout, at least for the Azure resources that I'm using. This is the dumbest loving thing. Combined with the above point, this means that I have about a 75% chance of failure when creating my resource.
- If Terraform fails creating a resource, my understanding is I can use the "import" command to add it to my state file once it's done, so I should be able to pick up my deployment where I left off. This does not loving work, it tells me that the resource has to be destroyed and recreated, even though it is the exact resource specified in my Terraform configuration.

So, in short: I like the syntax better but it's totally unusable in my specific scenario (usually fails, about 30x slower than ARM), and would complicate deployments by making me persist and share the state file.

[edit] Or we could use a Terraform Enterprise server to persist the state. Because who doesn't love adding additional crap to their toolchain to solve an already-solved problem, but in a slightly different way? :waycool:

The state file is like the entire point of the software!! Terraform bills itself as infrastructure provisioning software, and it does provision infrastructure, but so do lots of other solutions. Its killer feature is keeping track of provisioned resources so it can determine the operations required to go from the current state to the new desired state. And it does this for a ton of “providers”, which can be things like cloud resources, a deployed helm application, or TLS resources. It's great for creating and keeping track of long-lived, relatively static things in a repeatable manner, and it falls down when you need custom or complicated transitions between states.


If you don't care about keeping track of the state, or don't understand why you'd need to, then use another tool. And you should drop the state in whatever Azure's equivalent of S3 is.
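
For reference, the remote-state setup on Azure is just a backend block pointing at a storage account blob container (the names below are placeholders, not a recommendation):
code:
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"        # placeholder
    storage_account_name = "mytfstatestorage"  # placeholder
    container_name       = "tfstate"
    key                  = "project.terraform.tfstate"
  }
}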

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

New Yorp New Yorp posted:

Why does it need an external tracking mechanism? All I care about is the current state (which it can look at by querying the platform) and the desired state (which is defined in my configuration).

If you store the state in Azure's equivalent of S3, it is querying the platform :grin:

To respond to what you mean though, “querying the platform” is not a good way to figure out what it’s responsible for. What is used to uniquely identify a resource varies from resource to resource. Resource names are not guaranteed to be unique for every resource and tags are not available on all types of resources Terraform wants to provision.

And, as mentioned, there are many providers other than the cloud ones which may have no means for being queried.

e: Also, resource deletion.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Spring Heeled Jack posted:

Are there any good online resources regarding networking patterns in container orchestration? We're very much coming from a place of VLAN network-segmentation with ACLs and we're wondering how this translates into the wonderful world of containers.

Assuming we have two separate apps running in a single swarm cluster that share services between them (say a mobile app backend and a component of our customer-facing website that share a common user info API), is there any reason to separate the networks? Or just run them all together under a single overlay? Does k8s handle this kind of thing any differently?

K8s will run everything on the same overlay. If you want to restrict connectivity within that overlay, you would use something like a service mesh or NetworkPolicy resources.
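
A minimal NetworkPolicy sketch for the shared-API case, assuming everything lives in one namespace and with invented label names (cross-namespace rules would also need a namespaceSelector, and enforcement depends on your CNI actually supporting NetworkPolicy):
code:
# only the two known clients may reach the shared user-info API pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: user-info-api-allow-clients
spec:
  podSelector:
    matchLabels:
      app: user-info-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: mobile-backend
        - podSelector:
            matchLabels:
              app: website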

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
redis is a fine place to store things you don't care about. if you don't care about what you're putting in the queue, then yes, it's good for that. if your application's correctness relies on a message being received, then redis is a bad choice.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Volguus posted:

What would be a good and reliable message queue then?

Persistent message queues at the center of your system (your “enterprise message bus”) add an unnecessary level of indirection and a centralized point of failure. Use service discovery for sending messages, and/or a reliable persistent store if you have any need to handle your messages transactionally.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Kevin Mitnick P.E. posted:

Something has gone badly wrong if calling a persistent store "reliable" isn't trivially true.

and yet that’s the operational reality of running rabbit/activemq/redis in my experience. something bad happens in the queue - it fills up, slows down, or is losing messages somehow, so you bounce or purge it to fix it. I hope your devs took into account that their “reliable” queue could drop everything on the floor at any time.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
An admission controller is the way to go.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

rt4 posted:

I use Puppet at work but I just don't like Ruby. What alternatives might I consider that aren't so drat slow?

They’re all bad in their own unique ways, sorry

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
something about the name "data dog" rubs me the wrong way and I can't get over it

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
your nginx controller is probably running as a daemonset in hostNetwork mode or with hostPorts, which means all you need to do is get the traffic to the nodes and you're set (for ingress resources)

needing non-http services is when you'd want something like metallb (although NodePorts will do in a pinch)
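
Stripped way down, that daemonset shape looks something like this (image tag and names are only illustrative; a real controller also needs args, RBAC, and so on):
code:
# ingress controller bound to the host network, so node-IP:80/443 hits
# nginx directly on every node the daemonset runs on
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      hostNetwork: true
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.9.4  # example tag
          ports:
            - containerPort: 80
            - containerPort: 443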

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
the reason I think that config management tools all suck is because they only ever solve one out of the two problems they are supposed to.

one problem is orchestration, when you need control over how and when something happens in your infrastructure which is not going to be immutable. this is something that ansible provides.

the other problem is immutable/stateless/unorchestrated configuration which is provided by tools like chef.

as an operator sometimes i need one or the other, but all the tools I know of will only give me one and, reasonably, most groups only want to use one tool. i think it’s easier to try and pretend you have immutability in ansible than it is to shoehorn orchestration into chef, but I would love a tool that could give me both!

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
I use alembic for a very basic data model and it suits my needs

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
docker is totally fine to use and likely you won’t notice any difference between runtimes except that docker will be easier to find resources online for

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
container runtime is like the least interesting thing to spend time thinking about unless you are executing arbitrary untrusted code

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

a gross fragile single point of failure hack that works by MITMing traffic with iptables dnat rules.

so, uh, just like kube-proxy?
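
(If you want to see kube-proxy's version of the same trick on a node running in iptables mode:)
code:
# kube-proxy programs NAT rules per Service; the KUBE-SERVICES chain is
# where the DNAT-style redirection starts
sudo iptables -t nat -L KUBE-SERVICES -n | head
sudo iptables -t nat -L PREROUTING -n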

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

kube-proxy isn't quite the same, since it's just a control plane component. If kube-proxy starts crashing, traffic still flows properly since the real datapath is through iptables rules, which persist.

If kiam goes down, all requests for API tokens fail and you have an immediate problem.

ah, true

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

Mostly unrelated but we're looking at probably using Kube-router as our CNI with our own ASN that we peer with the main DC network.

If your AZs correspond to e.g. different subnets, you will need to use Calico or something else that gives you control over the IPAM, if I'm assuming correctly that you're going to be advertising pod IPs. Do you have a specific reason for advertising them? In general, advertising pod IPs is a PITA, and the overhead of encapsulation + SNAT has been negligible for the size of clusters I've worked on. Another thing to consider: if you ever want to grow your IP range, it may be more difficult with kube-router than with Calico. With Calico you just add another pool, and you can delete pools that are empty. No experience with kube-router, but for us Calico has been a pretty positive experience.

For somewhat similar reasons I would advise not advertising service IPs, and instead getting a load balancer implementation.

As long as you have >= 3 AZs per DC, though, having etcd span the AZs makes sense. Where you'd want a separate cluster per AZ is if you only had two or something.
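
The grow-the-range trick with Calico is just adding another IPPool resource alongside the existing one (CIDR here is invented); once a pool is empty you can delete it with calicoctl:
code:
# hypothetical second pool; apply with `calicoctl apply -f pool2.yaml`,
# remove later with `calicoctl delete ippool pool-2`
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: pool-2
spec:
  cidr: 10.245.0.0/16
  ipipMode: Always
  natOutgoing: true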

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

I've had 2 separate etcd fires two days in a row.

I'm, like, getting OOMs on instances that have 12 GB of freeable memory (but only 500 MB actually free!). I don't really understand why. But there's some huge vertical spike in memory consumption on my control planes occasionally, like a malloc gone wrong or something. The current theory is that the kernel is getting some bad page thrashing or something and it's either freeing memory pages too quickly or not quickly enough, and everything cascadingly fails from there.

just curious, are the etcds hosting just a single kube cluster? and I'm assuming they have local storage, not something over a network?

only somewhat related, but I watched a talk a few months back from a guy who seemed to know what he was on about. he said that freeable memory is basically a lie, because (IIRC, and e.g. in this case) everything in the page cache, including etcd's pages, is considered freeable but is so hot the kernel would never (and should never) actually free that memory. also, any program paged into memory is technically considered freeable, but again, if it's hot the kernel won't free it because it would spend all its time freeing and then reloading the page. his conclusion was that there isn't (or wasn't at the time of his talk) a single good heuristic to tell how much memory is truly available for use at a given time on a system, which was not reassuring.
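
You can see the gap he was describing on any box by comparing the kernel's own numbers (MemAvailable is the kernel's best-effort estimate of what's reclaimable without swapping, which is exactly the kind of heuristic the talk argued is still only an approximation):
code:
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached)' /proc/meminfo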

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
imo having your clusters dependent on broadcast domains is a bad idea. use an overlay or l3 routing if you can.

is this a requirement for certain public clouds?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

luminalflux posted:

Hi hello I do the release engineering among all other stuff I deal with, and we do CI/CD where if a merge to main passes tests, bucko it's getting deployed whether you want to or not.

Basic flow for our main app:
  • CI runs on each push to your branch
  • If CI is green and you get approval, you can merge to main
  • CI runs on main branch. If this goes green, it merges that SHA to the release branch.
  • Release branch gets deployed to staging
  • CI runs browser-based integration tests against staging to make sure that the backend app and react client play well and we didn't make important butans unclickable
  • If this goes green, and deploys aren't frozen, CD kicks off a deploy to Spinnaker
  • Spinnaker bakes an AMI with the new version of the app (*)
  • Spinnaker makes a new autoscaling group with the new AMI, and attaches it to the loadbalancer
  • When healthchecks go green on the new ASG, Spinnaker drops out the old ASG.

how long does this process take end to end?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

necrobobsledder posted:

Historically, for myself, the most lax customers that are furthest behind in technology tend to be super important clients that are like 30% of the total company revenue, so I have to do horrible things like scanning for the IP ranges advertised from their networks and making an nginx rule that offers certain ciphers only to them, while everyone else gets what I meant to do. Freakin' enterprise I tell ya.

cripes

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
Your concern from the load perspective is how many time series you are creating. If the problems you're trying to solve with prometheus involve slicing and dicing at the customer level, there's really no way around adding a unique per-customer tag to your metrics. For every metric you push with those tags, you just need to be aware that you're creating (customer count) times the number of time series. As long as you are judicious with which metrics you're tagging per-customer (i.e. not pushing the 1000s of metrics that might come from something like node exporter), my guess is that you'll probably be fine.
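
If you want to sanity-check the blast radius before committing, a rough PromQL sketch (assuming the per-customer label is literally called customer, which is just an illustrative name):
code:
# top 10 metric names by number of per-customer series
topk(10, count by (__name__) ({customer!=""}))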

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

zokie posted:

Maybe not the right thread but I hope it fits: we have another team managing an ES cluster we use for regular logging stuff and real user metrics. They started with just the one, but now they have set up a test cluster for their testing and a "pre" cluster that “is a prod environment for your non-prod environments”, and they want us to have only our prod environment logging &c to the prod cluster.

I am flat out refusing, that means 100% more work for us. Not just maintaining reports and dashboards, but also just uuugh. I’m pretty sure Support/Application Management barely knows this stuff exists and me and the rest of our team use it for like 80% test stuff and only ever check the prod data if some issue is escalated all the way to us.

Also it’s not like we treat all environments equal, we purge test indexes much faster.

Am I crazy? Isn’t this just useless?

It sounds like what you're asking is if your infrastructure team should have N > 1 instances of critical infrastructure, which to me seems in your best interest. In doing so, they should be taking steps to make sure this transition is as transparent as possible, meaning they should be providing a way for you to replicate and update whatever existing tooling you have across any new instances they decide to bring up.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

duck monster posted:

So recently I started a new job that partially involves inheriting a giant Kubernetes cluster on DigitalOcean. I've never used Kubernetes so it's all a massive learning curve.

This morning I got into the office and realised the entire cluster was down, with all the pods in "Pending" mode (including about a bazillion cronjob containers that seemed to be piling up).

It would seem that at some point in the night, for reasons I'm completely unsure of, the whole drat thing was reset, causing it to reissue a whole bunch of nodes which were in an unlabeled state. So after labelling them, it all came back up, although I had to delete the node spec for the cronjobs because there were literally hundreds of the bloody things trying to be created. Followed by a slow recycling of nodes to get the drat things to exit the "Terminating" state.

Massive and disruptive pain in the arse.

Is there a way to tell Kubernetes how to label nodes after a rebuild? Because this *sucks*

The process you should be looking at is the kubelet. It looks like you can modify the kubelet config to have it come up with whatever node labels you want.
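
The knob in question is the kubelet's --node-labels flag; how you set it depends entirely on how your nodes launch the kubelet, so the snippet below is only a sketch (label names invented, and on a managed offering like DigitalOcean's you may have to go through their node pool settings instead):
code:
# e.g. on kubeadm-style nodes, via the KUBELET_EXTRA_ARGS environment file
KUBELET_EXTRA_ARGS=--node-labels=role=worker,cronjobs=allowed
The labels are applied when the kubelet registers the node, so they survive a rebuild; kubectl get nodes --show-labels will confirm they came up.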

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
they call them taints because they t'aint here and they t'aint there

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

code:
net.ipv4.conf.default.rp_filter=1
https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html
somebody thought secretly turning on source address validation for infrastructure that is highly dependent on vxlans and nat abuse without any sort of heads up whatsoever was a good idea.

glad you were able to tick your checkbox mr security man. :thumbsup:

this was causing us strange issues for months before we realized it was enabled by default on Ubuntu
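
For anyone else bitten by this, checking and relaxing it is quick (whether off or loose mode is right depends on your topology, so treat this as a sketch):
code:
# see the per-interface values; the effective setting is the max of
# the "all" value and the per-interface value
sysctl -a 2>/dev/null | grep '\.rp_filter'

# turn it off (0) or use loose mode (2), and persist the change
echo 'net.ipv4.conf.all.rp_filter = 0' | sudo tee /etc/sysctl.d/99-rpfilter.conf
sudo sysctl --system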

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
I loving wish it was 2015 where I work

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Dukes Mayo Clinic posted:

“k8s without all the poo poo we don’t need” was all the pitch we needed to go hard on k3s in production. Time will tell.

isn't it the same API, but the binary is just smaller and it supports SQL backends? lol c'mon man

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
does FAANG interview SREs like they interview SWEs? like, do I need to start reviewing stuff like red-black trees?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
dehumanize yourself and face to Jenkins
