my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
I think ELK/EFK is pretty widespread, no?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
the idea that everyone on a team should be involved in all three of writing, releasing, and deploying code is a bad one

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

crazypenguin posted:

I've got a medium-sized (50 projects/repos) Jenkins build setup going with the newish multibranch pipeline approach. It mostly works great, but I have a few issues.

If I can't fix these... oh well. This works acceptably already. I just wanted to check to see if I was missing anything obvious.

  • To do downstream integration testing/building, I manually put a list of other jobs to build in each Jenkinsfile. This slightly annoys me. It means in addition to downstream projects having to know about their upstream dependencies (of course!), I also have to update the upstream projects to inform them what downstream projects should be rebuilt on a change. Not the worst, but if there's a better way of dealing with this, I don't know about it.
  • I have problems with diamond dependencies. If project A is a dependency of B and C, obviously I have A rebuild B and C on a change. But if D has dependencies on B and C, I kinda get screwed. Right now, if A changes, I end up rebuilding D twice, redundantly, via both B and C. If there's a smart way to handle this stuff, I don't know about it.
  • Coordinated changes are annoying. If we have to change two projects at the same time (e.g. make breaking change in one, fix its use in another), we have a half-assed thing right now where it tries to build branches of the same name, so we can sorta do that. But then when merging these branches in multiple projects at once, we overwhelm jenkins with a lot of rebuilds, most of them redundant again. (If A depends on B, then committing to both rebuilds A and B, then A again, this time downstream of B.)

It seems like these should be common issues, but I dunno if I'm missing something, or if this is just part of the fun.

For the first issue, we keep our automation code separate from the actual code and it works pretty well. You still keep your pom, or whatever is required to build, with the code in the code repo, but you keep all the automation and coordination between separate jobs in another repo. The pattern where the Jenkinsfile lives with the code leads to nothing but headaches.
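
As a rough sketch of that split (job names invented, assuming the Pipeline build step), the orchestration repo can own the fan-out so no project's own Jenkinsfile has to list its downstreams, and the diamond case only builds D once:
code:
// hypothetical scripted Jenkinsfile living in the separate automation repo
node {
    stage('Build A') {
        build job: 'project-a/main', wait: true
    }
    stage('Build B and C') {
        // fan out to the direct downstreams of A in parallel
        parallel(
            B: { build job: 'project-b/main', wait: true },
            C: { build job: 'project-c/main', wait: true }
        )
    }
    stage('Build D') {
        // D depends on both B and C but is only rebuilt here, once
        build job: 'project-d/main', wait: true
    }
}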

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
Is there anyone else doing non-image, artifact-based CM, e.g. chef-zero or ansible bundles? Where/when do you drop your artifacts?

We have a bunch of Terraform for setting up instances of our infrastructure, and currently it lives separately from the provisioning code, but we want to merge the repos so we can release them as one. Currently our hosts pull provisioning code from git (I know, I know), but continuing to do this is for sure going to gently caress us up for any sort of branching once we merge the two repos (here come the self-dependencies), and it's also just bad practice. My thought right now is to generate an instance-specific CM bundle with Terraform that's pushed onto S3, but I'm wondering if anyone has hit any issues with something like that. We also have Artifactory available, but we wouldn't be able to have per-instance artifacts, which are important for us for testing new infra or CM code.

I’ve handled this in the past for smaller projects by delivering a hex-encoded tar of the provisioning code in user data, but unfortunately we’re just past the 16k limit 😉
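
A sketch of the S3 idea, purely illustrative (assumes the hashicorp/aws and hashicorp/archive providers; bucket, path, and variable names are made up):
code:
# package the provisioning code and push a per-instance bundle to S3
data "archive_file" "cm_bundle" {
  type        = "zip"
  source_dir  = "${path.module}/provisioning"
  output_path = "${path.module}/dist/cm-bundle.zip"
}

resource "aws_s3_object" "cm_bundle" {
  bucket = "my-cm-artifacts"                              # placeholder bucket
  key    = "instances/${var.instance_name}/cm-bundle.zip" # per-instance key
  source = data.archive_file.cm_bundle.output_path
  etag   = data.archive_file.cm_bundle.output_md5
}
The host's user data then only needs enough to pull that one object, instead of carrying the whole hex-encoded tar.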

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Hadlock posted:

Our monolith takes about 15 minutes to build, other services are in the 3-10 minute range.

Full commit to live smoke test including generating private dns record etc is probably close to 20 minutes

Do you guys do load tests? Any integration with internal services outside your control?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
Let's say you want to have a version endpoint or string somewhere in your app. Isn't that something that's going to be static in your artifacts? If so, how do you keep the pattern of using the same artifact in QA/Staging/Prod if you'll need to modify this as it gets promoted?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Kevin Mitnick P.E. posted:

If it changes based on deployment environment then it's not a version.

It doesn’t change based on deployment environment, but when it’s in QA it’s still just a release candidate for us, not a real release yet. We promote it to being a real release once it goes through a bunch of integration tests with other systems. (having a “QA process” instead of real CICD is non-ideal, but it’s unfortunately not my decision to make)

Maybe our endpoint should just return a build number?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Kevin Mitnick P.E. posted:

"Build a thing, do a bunch of QA on it, then build a different (but hopefully close enough) thing and release it" is not a process I would be comfortable with.

Right, that's something I don't think I want to do. Here's what I think my options are, but I'm wondering if there's maybe another:
- rebuilding/repackaging artifact at promote time, after QA (bad for the reasons stated)
- not having release candidates and only ever sending real releases to QA (would result in a ton of versions for us because we always find stuff in QA)
- not having version information available in the artifact

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

ThePeavstenator posted:

I'm not sure if I've missed something but why can't you just do the old <Major>.<Minor>.<Release>.<Build> ?

actually I think this would totally work and I don’t know why I didn’t consider doing this. we have been sticking “RC” in our version strings for stuff that hasn’t gone through QA, but dropping that would alleviate all the pain. 🤦‍♂️ thanks!
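
For anyone following along, the stamp-it-once idea looks roughly like this at build time (a hedged sketch; the file name and MAJOR/MINOR/PATCH variables are invented, BUILD_NUMBER is the usual Jenkins one):
code:
# run once when the artifact is built; never touched again at promote time
VERSION="${MAJOR}.${MINOR}.${PATCH}.${BUILD_NUMBER}"
echo "$VERSION" > build/version.txt   # baked into the artifact
# promotion just moves/retags the same artifact, so the string never changes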

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
terraform is certainly an imperfect and frequently buggy tool, but also the least bad one available for certain types of problems

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

New Yorp New Yorp posted:

I've been translating my ARM template for this project into Terraform. Here are my Terraform impressions:

- I like the syntax better than ARM syntax. This is the only thing I like.
- I don't understand the point of a state file. It creates something that's generated by interrogating Azure, but won't work without it? And now I have some extra file generated at deployment time that I have to worry about synchronizing and sharing between deployments? Maybe I'm missing something, but it seems really pointless.
- Creating the resource I need takes upwards of an hour. An equivalent ARM template takes seconds. Literally 5 seconds. I have no idea what Terraform is doing differently.
- It has a hard-coded timeout of an hour for resource creation. If the timeout expires, it just fails. There is no way to change the timeout, at least for the Azure resources that I'm using. This is the dumbest loving thing. Combined with the above point, this means that I have about a 75% chance of failure when creating my resource.
- If Terraform fails creating a resource, my understanding is I can use the "import" command to add it to my state file once it's done, so I should be able to pick up my deployment where I left off. This does not loving work, it tells me that the resource has to be destroyed and recreated, even though it is the exact resource specified in my Terraform configuration.

So, in short: I like the syntax better but it's totally unusable in my specific scenario (usually fails, about 30x slower than ARM), and would complicate deployments by making me persist and share the state file.

[edit] Or we could use a Terraform Enterprise server to persist the state. Because who doesn't love adding additional crap to their toolchain to solve an already-solved problem, but in a slightly different way? :waycool:

The state file is like the entire point of the software!! Terraform bills itself as infrastructure provisioning software, and it does provision infrastructure, but so do lots of other solutions. Its killer feature is keeping track of provisioned resources so it can determine the operations required to go from the current state to the new desired state. And it does this for a ton of “providers”, which can be things like cloud resources, a deployed helm application, or TLS resources. It's great for creating and keeping track of long-lived, relatively static things in a repeatable manner, and it falls down when you need custom or complicated transitions between states.


If you don't care about keeping track of the state, or don't understand why you'd need to, then use another tool. And you should drop the state in whatever Azure's equivalent of S3 is.
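
For reference, the remote-state setup on Azure is just a backend block pointing at a storage account blob container (the names below are placeholders, not a recommendation):
code:
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"        # placeholder
    storage_account_name = "mytfstatestorage"  # placeholder
    container_name       = "tfstate"
    key                  = "project.terraform.tfstate"
  }
}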

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

New Yorp New Yorp posted:

Why does it need an external tracking mechanism? All I care about is the current state (which it can look at by querying the platform) and the desired state (which is defined in my configuration).

If you store the state in Azure's equivalent of S3, it is querying the platform :grin:

To respond to what you mean though, “querying the platform” is not a good way to figure out what it’s responsible for. What is used to uniquely identify a resource varies from resource to resource. Resource names are not guaranteed to be unique for every resource and tags are not available on all types of resources Terraform wants to provision.

And, as mentioned, there are many providers other than the cloud ones which may have no means for being queried.

e: Also, resource deletion.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Spring Heeled Jack posted:

Are there any good online resources regarding networking patterns in container orchestration? We're very much coming from a place of VLAN network-segmentation with ACLs and we're wondering how this translates into the wonderful world of containers.

Assuming we have two separate apps running in a single swarm cluster that share services between them (say a mobile app backend and a component of our customer-facing website that share a common user info API), is there any reason to separate the networks? Or just run them all together under a single overlay? Does k8s handle this kind of thing any differently?

K8s will run everything on the same overlay. If you want to restrict connectivity within that overlay, you would use something like a service mesh or NetworkPolicy resources.
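
A minimal NetworkPolicy sketch for the shared-API case, assuming everything lives in one namespace and with invented label names (cross-namespace rules would also need a namespaceSelector, and enforcement depends on your CNI actually supporting NetworkPolicy):
code:
# only the two known clients may reach the shared user-info API pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: user-info-api-allow-clients
spec:
  podSelector:
    matchLabels:
      app: user-info-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: mobile-backend
        - podSelector:
            matchLabels:
              app: website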

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
redis is a fine place to store things you don't care about. if you don't care about what you're putting in the queue, then yes, it's good for that. if your application's correctness relies on a message being received, then redis is a bad choice.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Volguus posted:

What would be a good and reliable message queue then?

Persistent message queues at the center of your system (your “enterprise message bus”) add an unnecessary level of indirection and a centralized point of failure. Use service discovery for sending messages, and/or a reliable persistent store if you have any need to handle your messages transactionally.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Kevin Mitnick P.E. posted:

Something has gone badly wrong if calling a persistent store "reliable" isn't trivially true.

and yet that’s the operational reality of running rabbit/activemq/redis in my experience. something bad happens in the queue - it fills up, slows down, or is losing messages somehow, so you bounce or purge it to fix it. I hope your devs took into account that their “reliable” queue could drop everything on the floor at any time.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
An admission controller is the way to go.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

rt4 posted:

I use Puppet at work but I just don't like Ruby. What alternatives might I consider that aren't so drat slow?

They’re all bad in their own unique ways, sorry

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
something about the name "data dog" rubs me the wrong way and I can't get over it

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
your nginx controller is probably running as a daemonset in hostNetwork mode or with hostPorts, which means all you need to do is get the traffic to the nodes and you're set (for ingress resources)

needing non-http services is when you'd want something like metallb (although NodePorts will do in a pinch)
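
Stripped way down, that daemonset shape looks something like this (image tag and names are only illustrative; a real controller also needs args, RBAC, and so on):
code:
# ingress controller bound to the host network, so node-IP:80/443 hits
# nginx directly on every node the daemonset runs on
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      hostNetwork: true
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.9.4  # example tag
          ports:
            - containerPort: 80
            - containerPort: 443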

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
the reason I think that config management tools all suck is because they only ever solve one out of the two problems they are supposed to.

one problem is orchestration, when you need control over how and when something happens in your infrastructure which is not going to be immutable. this is something that ansible provides.

the other problem is immutable/stateless/unorchestrated configuration which is provided by tools like chef.

as an operator sometimes i need one or the other, but all the tools I know of will only give me one and, reasonably, most groups only want to use one tool. i think it’s easier to try and pretend you have immutability in ansible than it is to shoehorn orchestration into chef, but I would love a tool that could give me both!

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
I use alembic for a very basic data model and it suits my needs

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
docker is totally fine to use and likely you won’t notice any difference between runtimes except that docker will be easier to find resources online for

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
container runtime is like the least interesting thing to spend time thinking about unless you are executing arbitrary untrusted code

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

a gross fragile single point of failure hack that works by MITMing traffic with iptables dnat rules.

so, uh, just like kube-proxy?
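
(If you want to see kube-proxy's version of the same trick on a node running in iptables mode:)
code:
# kube-proxy programs NAT rules per Service; the KUBE-SERVICES chain is
# where the DNAT-style redirection starts
sudo iptables -t nat -L KUBE-SERVICES -n | head
sudo iptables -t nat -L PREROUTING -n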

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

kube-proxy isn't quite the same, since it's just a control plane component. If kube-proxy starts crashing, traffic still flows properly since the real datapath is through iptables rules, which persist.

If kiam goes down, all requests for API tokens fail and you have an immediate problem.

ah, true

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

Mostly unrelated but we're looking at probably using Kube-router as our CNI with our own ASN that we peer with the main DC network.

If your AZs correspond to e.g. different subnets, you will need to use Calico or something else that gives you control over the IPAM, if I'm assuming correctly that you're going to be advertising pod IPs. Do you have a specific reason for advertising them? In general, advertising pod IPs is a PITA, and the overhead of encapsulation + SNAT has been negligible for the size of clusters I've worked on. Another thing to consider: if you ever want to grow your IP range, it may be more difficult with kube-router than with Calico. With Calico you just add another pool, and you can delete pools that are empty. No experience with kube-router, but for us Calico has been a pretty positive experience.

For somewhat similar reasons I would advise not advertising service IPs, and instead getting a load balancer implementation.

As long as you have >= 3 AZs per DC, though, having etcd span the AZs makes sense. Where you'd want a separate cluster per AZ is if you only had two or something.
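
The grow-the-range trick with Calico is just adding another IPPool resource alongside the existing one (CIDR here is invented); once a pool is empty you can delete it with calicoctl:
code:
# hypothetical second pool; apply with `calicoctl apply -f pool2.yaml`,
# remove later with `calicoctl delete ippool pool-2`
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: pool-2
spec:
  cidr: 10.245.0.0/16
  ipipMode: Always
  natOutgoing: true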

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

I've had 2 separate etcd fires two days in a row.

I'm, like, getting OOMs on instances that have 12 GB of freeable memory (but only 500 MB actually free!). I don't really understand why. But there's some huge vertical spike in memory consumption on my control planes occasionally, like a malloc gone wrong or something. The current theory is that the kernel is getting some bad page thrashing or something and it's either freeing memory pages too quickly or not quickly enough, and everything cascadingly fails from there.

just curious, are the etcds hosting just a single kube cluster? and I'm assuming they have local storage, not something over a network?

only somewhat related, but I watched a talk a few months back from a guy who seemed to know what he was on about. he said that freeable memory is basically a lie, because (IIRC, and e.g. in this case) everything in the page cache, including etcd's pages, is considered freeable but is so hot the kernel would never (and should never) actually free that memory. also, any program paged into memory is technically considered freeable, but again, if it's hot the kernel won't free it because it would spend all its time freeing and then reloading the page. his conclusion was that there isn't (or wasn't at the time of his talk) a single good heuristic to tell how much memory is truly available for use at a given time on a system, which was not reassuring.
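
You can see the gap he was describing on any box by comparing the kernel's own numbers (MemAvailable is the kernel's best-effort estimate of what's reclaimable without swapping, which is exactly the kind of heuristic the talk argued is still only an approximation):
code:
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached)' /proc/meminfo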

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
imo having your clusters dependent on broadcast domains is a bad idea. use an overlay or l3 routing if you can.

is this a requirement for certain public clouds?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

luminalflux posted:

Hi hello I do the release engineering among all other stuff I deal with, and we do CI/CD where if a merge to main passes tests, bucko it's getting deployed whether you want to or not.

Basic flow for our main app:
  • CI runs on each push to your branch
  • If CI is green and you get approval, you can merge to main
  • CI runs on main branch. If this goes green, it merges that SHA to the release branch.
  • Release branch gets deployed to staging
  • CI runs browser-based integration tests against staging to make sure that the backend app and react client play well and we didn't make important butans unclickable
  • If this goes green, and deploys aren't frozen, CD kicks off a deploy to Spinnaker
  • Spinnaker bakes an AMI with the new version of the app (*)
  • Spinnaker makes a new autoscaling group with the new AMI, and attaches it to the loadbalancer
  • When healthchecks go green on the new ASG, Spinnaker drops out the old ASG.

how long does this process take end to end?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

necrobobsledder posted:

Historically, for myself, the most lax customers that are furthest behind in technology tend to be super important clients that are like 30% of the total company revenue, so I have to do horrible things like scanning for the IP ranges advertised from their networks and making an nginx rule that offers certain ciphers only to them, while everyone else gets what I meant to do. Freakin' enterprise I tell ya.

cripes

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
Your concern from the load perspective is how many time series you are creating. If the problems you're trying to solve with prometheus involve slicing and dicing at the customer level, there's really no way around adding a unique per-customer tag to your metrics. For every metric you push with those tags, you just need to be aware that you're creating (customer count) times the number of time series. As long as you are judicious with which metrics you're tagging per-customer (i.e. not pushing the 1000s of metrics that might come from something like node exporter), my guess is that you'll probably be fine.
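
If you want to sanity-check the blast radius before committing, a rough PromQL sketch (assuming the per-customer label is literally called customer, which is just an illustrative name):
code:
# top 10 metric names by number of per-customer series
topk(10, count by (__name__) ({customer!=""}))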

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

zokie posted:

Maybe not the right thread but I hope it fits: we have another team managing an ES cluster we use for regular logging stuff and real user metrics. They started with just the one, but now they have set up a test cluster for their testing and a "pre" cluster that “is a prod environment for your non-prod environments”, and they want us to have only our prod environment logging &c to the prod cluster.

I am flat out refusing, that means 100% more work for us. Not just maintaining reports and dashboards, but also just uuugh. I’m pretty sure Support/Application Management barely knows this stuff exists and me and the rest of our team use it for like 80% test stuff and only ever check the prod data if some issue is escalated all the way to us.

Also it’s not like we treat all environments equal, we purge test indexes much faster.

Am I crazy? Isn’t this just useless?

It sounds like what you're asking is if your infrastructure team should have N > 1 instances of critical infrastructure, which to me seems in your best interest. In doing so, they should be taking steps to make sure this transition is as transparent as possible, meaning they should be providing a way for you to replicate and update whatever existing tooling you have across any new instances they decide to bring up.

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

duck monster posted:

So recently I started a new job that partially involves inheriting a giant Kubernetes cluster on DigitalOcean. I've never used Kubernetes so it's all a massive learning curve.

This morning I got into the office and realised the entire cluster was down, with all the pods in "Pending" mode (including about a bazillion cronjob containers that seemed to be piling up).

It would seem that at some point in the night, for reasons I'm completely unsure of, the whole drat thing was reset, causing it to reissue a whole bunch of nodes which were in an unlabeled state. So after labelling them, it all came back up, although I had to delete the node spec for the cronjobs because there were literally hundreds of the bloody things trying to be created. Followed by a slow recycling of nodes to get the drat things to exit the "Terminating" state.

Massive and disruptive pain in the arse.

Is there a way to tell Kubernetes how to label nodes after a rebuild? Because this *sucks*

The process you should be looking at is the kubelet. It looks like you can modify the kubelet config to have it come up with whatever node labels you want.
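
The knob in question is the kubelet's --node-labels flag; how you set it depends entirely on how your nodes launch the kubelet, so the snippet below is only a sketch (label names invented, and on a managed offering like DigitalOcean's you may have to go through their node pool settings instead):
code:
# e.g. on kubeadm-style nodes, via the KUBELET_EXTRA_ARGS environment file
KUBELET_EXTRA_ARGS=--node-labels=role=worker,cronjobs=allowed
The labels are applied when the kubelet registers the node, so they survive a rebuild; kubectl get nodes --show-labels will confirm they came up.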

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
they call them taints because they t'aint here and they t'aint there

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

code:
net.ipv4.conf.default.rp_filter=1
https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html
somebody thought secretly turning on source address validation for infrastructure that is highly dependent on vxlans and nat abuse without any sort of heads up whatsoever was a good idea.

glad you were able to tick your checkbox mr security man. :thumbsup:

this was causing us strange issues for months before we realized it was enabled by default on Ubuntu
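
For anyone else bitten by this, checking and relaxing it is quick (whether off or loose mode is right depends on your topology, so treat this as a sketch):
code:
# see the per-interface values; the effective setting is the max of
# the "all" value and the per-interface value
sysctl -a 2>/dev/null | grep '\.rp_filter'

# turn it off (0) or use loose mode (2), and persist the change
echo 'net.ipv4.conf.all.rp_filter = 0' | sudo tee /etc/sysctl.d/99-rpfilter.conf
sudo sysctl --system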

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
I loving wish it was 2015 where I work

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Dukes Mayo Clinic posted:

“k8s without all the poo poo we don’t need” was all the pitch we needed to go hard on k3s in production. Time will tell.

isn't it the same API, but the binary is just smaller and it supports SQL backends? lol c'mon man

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
does FAANG interview SREs like they interview SWEs? like, do I need to start reviewing stuff like red-black trees?

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
dehumanize yourself and face to Jenkins
