Methanar
Sep 26, 2013

by the sex ghost

quote:

Important
You cannot install certificates with 4096-bit RSA keys or EC keys on your load balancer through integration with ACM. You must upload certificates with 4096-bit RSA keys or EC keys to IAM in order to use them with your load balancer.

That's dumb, but fine.

quote:

Important
You cannot install certificates with RSA keys larger than 2048-bit or EC keys on your Network Load Balancer.

Come on.


crazypenguin
Mar 9, 2005
nothing witty here, move along
NLB has some appallingly huge missing features; it really seemed like a botched service somehow.

You can't put security groups on them. da fu

12 rats tied together
Sep 7, 2006

AFAICT that is a perf based limitation. NLB is high throughput, it makes sense that it would skip whatever software path allows security groups to be evaluated. Same deal with 4096 bit certs, at the scale the NLB service operates at it wouldn't be too surprising to me if allowing 4096 certs would cause the NLB autoscalers to self-DDOS.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I did some crazy shenanigans to support something like security groups on NLBs by finding the underlying ENIs that NLB provisions (there’s a way to determine the underlying IPs and ENIs that an NLB leverages under the hood) and assigning SGs to them. IIRC they had some limitations with the SGs on them, like not being able to reference other SGs, but IP-based rules worked, and that's what I assigned.
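For anyone curious, the ENI-discovery half of that looks roughly like this with aws-sdk-go-v2. Sketch only: the NLB name, the SG ID, and the "ELB net/<name>/*" description filter are assumptions about how the ENIs happen to be labeled in your account, not gospel.

code:
// Rough sketch: find the ENIs an NLB provisions and swap their security groups.
// Assumes NLB ENI descriptions follow the "ELB net/<nlb-name>/<id>" pattern (check your account).
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Find the ENIs the NLB owns by matching on their description.
	out, err := client.DescribeNetworkInterfaces(context.TODO(), &ec2.DescribeNetworkInterfacesInput{
		Filters: []types.Filter{
			{Name: aws.String("description"), Values: []string{"ELB net/my-nlb/*"}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, eni := range out.NetworkInterfaces {
		log.Printf("NLB ENI %s, private IP %s", *eni.NetworkInterfaceId, *eni.PrivateIpAddress)
		// Attach an IP-rule-only security group to the ENI (SG-to-SG references
		// had limitations here, per the post above). SG ID is a placeholder.
		_, err := client.ModifyNetworkInterfaceAttribute(context.TODO(), &ec2.ModifyNetworkInterfaceAttributeInput{
			NetworkInterfaceId: eni.NetworkInterfaceId,
			Groups:             []string{"sg-0123456789abcdef0"},
		})
		if err != nil {
			log.Fatal(err)
		}
	}
}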

Methanar
Sep 26, 2013

by the sex ghost

necrobobsledder posted:

I did some crazy shenanigans to support something like security groups on NLBs by finding the underlying ENIs that NLB provisions (there’s a way to determine the underlying IPs and ENIs that an NLB leverages under the hood) and assigning SGs to them. IIRC they had some limitations with the SGs on them, like not being able to reference other SGs, but IP-based rules worked, and that's what I assigned.

Kubernetes basically does this.

The difference is that you tell k8s which SG is on the workers and owned by k8s via a tag. K8s's integration will then create SG rules on the tagged SG permitting access to whatever port the NLB is forwarding to.

https://github.com/kubernetes/kuber....go#L3975-L4022


Except it sucks because it duplicates each rule for each of the n subnets that your NLB is on: 4 nodeports on 3 subnets = 12 rules. And you very, very rapidly run into the maximum number of SG rules you can have on an SG, which is 60 by default. The real maximum comes from a formula, so you need to adjust the ratio of how many SGs you can have per ENI to SG rules per SG, which is configured AWS account-wide:

1000 ≥ number of SG rules per SG × max number of SGs per ENI
Default is 60 × 15
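For example, at the defaults that's 60 × 15 = 900, which just fits under the 1000 cap; bump the rules per SG to 100 and you're forced down to 10 SGs per ENI, because 100 × 10 = 1000 already hits it.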
code:
		// figure out the CIDRs of the subnets the NLB sits on; every rule below
		// gets duplicated once per subnet CIDR
		subnetCidrs, err := c.getSubnetCidrs(subnetIDs)

		...
		err = c.updateInstanceSecurityGroupsForNLB(loadBalancerName, instances, subnetCidrs, sourceRangeCidrs, v2Mappings)


// builds the desired rule set on the tagged worker SG: one rule per (port, cidr) pair
func (c *Cloud) updateInstanceSecurityGroupForNLBTraffic(sgID string, sgPerms IPPermissionSet, ruleDesc string, protocol string, ports sets.Int64, cidrs []string) error {
	desiredPerms := NewIPPermissionSet()
	for port := range ports {
		for _, cidr := range cidrs {
...

Methanar fucked around with this message at 06:34 on Apr 13, 2021

madsushi
Apr 19, 2009

Baller.
#essereFerrari

12 rats tied together posted:

AFAICT that is a perf based limitation. NLB is high throughput, it makes sense that it would skip whatever software path allows security groups to be evaluated. Same deal with 4096 bit certs, at the scale the NLB service operates at it wouldn't be too surprising to me if allowing 4096 certs would cause the NLB autoscalers to self-DDOS.

Concur, but more specifically, my guess is hardware perf. At any kind of scale, you're offloading SSL operations to whatever ASIC / FPGA / sub-processor, and 2048-bit SSL processing is a lot cheaper / easier / more ubiquitous. So it's not even "4096-bit certs are X% worse for perf", it's "4096-bit certs consume a different hardware resource than 2048-bit certs, and that resource is orders of magnitude slower".

Methanar
Sep 26, 2013

by the sex ghost
I'm interviewing somebody soon. Here are some of the talking points I came up with for discussing k8s. Did I miss anything, or is anything in here not appropriate?

code:
	Why kubernetes. What is kubernetes good for. why bother.
		people have been deploying software for a long time without it.
		immutable infrastructure
			people have had immutable infrastructure patterns based off of packer for 10 years.

	What are the economics of k8s
		multicloud, platform independence, self service, binpacking, centralized views, common infra with clear owner

	Multi tenancy
	    How have you handled granting authentication to k8s before. your resume mentions a lot of auth and okta related work. was k8s access managed by okta? how.
	    Application/team isolation

	Have you ever needed to read through the k8s source to identify how something works?

	Do you use any kubernetes distros like EKS or AKS. Running it yourself?

	Any familiarity with autoscaling within k8s, whether it's the cluster itself or applications within it

	full kubernetes network model
               what is iptables's role
               what is ipvs
               what is kube-proxy
		what is a CNI, which are you familiar with, how does it work. What does it do.
               what is bgp
               what is vxlan

               what is ingress
               what is a service object, how does it work.
 
    What is a namespace
              How can I enter a namespace to debug manually.

    K8s is deprecating docker soon. What does that mean
  		  What is the relationship between docker, dockershim, containerd, runc, CRI, and the OCI specification

	cloud provider integration
        what might you not have running k8s outside of a public cloud

    what is an operator within k8s

	what is a loadbalancer type service, how does it work

	Have you written any tooling using the kubernetes API clients?

	(bonus)	
		familiarity with the actual implementation of control loops
        	What if we wanted to write our own special behaviour around that. Like the out of the box one couldn't do what we wanted. eg we wanted to build our own kubernetes federation story and that meant modifying behaviour around how the loop responsible for Service objects worked and what happened upon creation of one.


	Tell me a war story
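If the API-client question lands, I'd probably also hand them something like this and have them walk through it. Rough client-go sketch only; the kubeconfig path and namespace are placeholders, not anything from our environment.

code:
// Minimal client-go sketch: list Services of type LoadBalancer and their external endpoints.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; in-cluster tooling would use rest.InClusterConfig() instead.
	config, err := clientcmd.BuildConfigFromFlags("", "/home/me/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	svcs, err := clientset.CoreV1().Services("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, svc := range svcs.Items {
		if svc.Spec.Type == corev1.ServiceTypeLoadBalancer {
			fmt.Printf("%s/%s -> %v\n", svc.Namespace, svc.Name, svc.Status.LoadBalancer.Ingress)
		}
	}
}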

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
The tone sounds adversarial to me, and is pushing someone to defend a position, when I think you probably just want people to discuss a tool's advantages/weaknesses.

Also what are you interviewing them for? As a k8s janitor, or a k8s user? A janitor would need to know about etcd, security, handling cluster upgrades, and (perhaps) dealing with apps that need high speed PVs.

Methanar
Sep 26, 2013

by the sex ghost

minato posted:

The tone sounds adversarial to me, and is pushing someone to defend a position, when I think you probably just want people to discuss a tool's advantages/weaknesses.

Also what are you interviewing them for? As a k8s janitor, or a k8s user? A janitor would need to know about etcd, security, handling cluster upgrades, and (perhaps) dealing with apps that need high speed PVs.

General infra/k8s janitor/k8s ecosystem developer. Relatively senior.

In my opinion the use of k8s is something to be defended. I've been involved multiple times now in persuading different teams and suborgs to not migrate their workloads onto k8s due to a bad fit when the initial impetus for doing so was just 'we felt like it or it seemed like what everybody else was doing'. Stewardship of our k8s ecosystem sometimes does involve saying no rather than trying to round peg square hole your way through it.

Methanar fucked around with this message at 00:47 on Apr 16, 2021

Pile Of Garbage
May 28, 2007



Methanar posted:

In my opinion the use of k8s is something to be defended. I've been involved multiple times now in persuading different teams and suborgs to not migrate their workloads onto k8s due to a bad fit when the initial impetus for doing so was just 'we felt like it or it seemed like what everybody else was doing'. Stewardship of our k8s ecosystem sometimes does involve saying no rather than trying to round peg square hole your way through it.

If someone put a gun to my head and told me to use k8s I'd just go with something like AWS EKS so that I can avoid as much janitoring as humanly possible.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
To me k8s is the aircraft carrier of options, and the more important thing to consider is how well applications are architected for a cloud environment to begin with. StatefulSets exist, but in my experience developers never architecting for a cloud environment has been the primary barrier to making containers work: I've been a one-man army trying to push dozens of different crappy cloud-washed containers for 6+ year old software onto K8S, EKS, or even Nomad remotely reliably, and it takes even more effort. As such, I find myself in interviews asking "how much experience does the company have culturally with containers?" before thinking about container orchestration at all.

Scikar
Nov 20, 2005

5? Seriously?

Methanar posted:

In my opinion the use of k8s is something to be defended.

I completely agree with this, but I think an adversarial tone towards Kubernetes itself is a bit unfair in the current state of the industry. If somebody was looking at Swarm, Cattle, Mesos, and Kubernetes in 2016 or whenever that was a meaningful choice then of course you want to know that they evaluated them fairly and didn't just pick the shiniest tool in the box. But today the industry default is to assume that if you're doing containers, you're doing k8s (or at least trying to), so it's like asking someone why they allowed Rancher 1.6 to go EOL. You then follow with technical questions drilling into the details of Kubernetes anyway, because you've been swept down this path like everybody else. I would be totally comfortable having a discussion like that with a colleague or at a tech meetup because it's genuinely interesting, but I'd be very put off in an interview if you started it by implying that I was wrong to choose a technology that is in the title of the job I'm interviewing for in the first place.

It seems that what you're really asking is if they have a realistic grasp of the downsides of Kubernetes (and they definitely should do if they can answer the technical questions you've got), and the overhead of moving an existing workload onto it (both the containerisation part in the first place and the fact that the containers need to work with Kubernetes specifically because there is no other choice right now). That makes perfect sense, but I'd be careful to keep the questions within that framing, making sure they understand the starting point is "you need to host this app", not "you need to host this container". You mentioned war stories at the end, but I'd specifically ask about migrations they've worked on, which ones went well, which ones went badly or were cancelled or were reverted etc., and I'd probably ask that before the technical questions since some of them will be answered by that experience anyway.

disclaimer: I have never interviewed somebody for a k8s engineer role.

Methanar
Sep 26, 2013

by the sex ghost

Scikar posted:

I completely agree with this, but I think an adversarial tone towards Kubernetes itself is a bit unfair in the current state of the industry. If somebody was looking at Swarm, Cattle, Mesos, and Kubernetes in 2016 or whenever that was a meaningful choice then of course you want to know that they evaluated them fairly and didn't just pick the shiniest tool in the box. But today the industry default is to assume that if you're doing containers, you're doing k8s (or at least trying to), so it's like asking someone why they allowed Rancher 1.6 to go EOL. You then follow with technical questions drilling into the details of Kubernetes anyway, because you've been swept down this path like everybody else. I would be totally comfortable having a discussion like that with a colleague or at a tech meetup because it's genuinely interesting, but I'd be very put off in an interview if you started it by implying that I was wrong to choose a technology that is in the title of the job I'm interviewing for in the first place.

It seems that what you're really asking is if they have a realistic grasp of the downsides of Kubernetes (and they definitely should do if they can answer the technical questions you've got), and the overhead of moving an existing workload onto it (both the containerisation part in the first place and the fact that the containers need to work with Kubernetes specifically because there is no other choice right now). That makes perfect sense, but I'd be careful to keep the questions within that framing, making sure they understand the starting point is "you need to host this app", not "you need to host this container". You mentioned war stories at the end, but I'd specifically ask about migrations they've worked on, which ones went well, which ones went badly or were cancelled or were reverted etc., and I'd probably ask that before the technical questions since some of them will be answered by that experience anyway.

disclaimer: I have never interviewed somebody for a k8s engineer role.

Excellent feedback.

Thanks, I'll be sure to tone down the adversarial anti-k8s tone and keep the big-picture framing around which classes of problems k8s is intended for and good at solving.

Methanar fucked around with this message at 20:41 on Apr 16, 2021

minato
Jun 7, 2004

cutty cain't hang, say 7-up.
Taco Defender
In my interviewing experience, candidates often have experience with just a few products and some try to shoehorn every solution into those products, regardless of whether it's actually a good fit. I give points to a candidate who understands the tool well enough to know where it's appropriate, and where it's not.

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture

that list of questions is a lot of trivia and not a lot of questions focusing on experience. i'd ask broader questions that let the candidate flex their knowledge of k8s rather than questions that reveal how many boxes they can check

for instance, if you want to ensure your hire has solid knowledge of k8s networking don't ask them what a vxlan is, ask them about a k8s networking problem they experienced and how they solved it

LochNessMonster
Feb 3, 2005

I need about three fitty


Since people inevitably want to run persistent applications on k8s I’d also include storage related questions.

Like, have you used Portworx or Longhorn, or run your own NFS/Ceph/Gluster storage backend, and what did you like/dislike about it?

freeasinbeer
Mar 26, 2015

by Fluffdaddy
There be dragons.

I’ve used rook; it’s an operator for ceph, so all that headache is hidden from you until it isn’t.


Openebs uses zfs under the hood iirc, so it’s a bit less complicated.

Plain old nfs is probably the easiest

CSI is an evolving area and not well documented.

Basically it’s all a shitshow.

xzzy
Mar 5, 2009

If your users don't need fast I/O, NFS is the way to go.

Ceph will turn you grey.

Gyshall
Feb 24, 2009

Had a couple of drinks.
Saw a couple of things.
NFS is good and simple, and it works natively on Docker too if you want to see how it works under the hood that way.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

freeasinbeer posted:

There be dragons.

I’ve used rook; it’s an operator for ceph, so all that headache is hidden from you until it isn’t.


Openebs uses zfs under the hood iirc, so it’s a bit less complicated.

Plain old nfs is probably the easiest

CSI is an evolving area and not well documented.

Basically it’s all a shitshow.

I'd love for somebody to tell me about their use of rook/ceph on a k8s cluster, I'm familiar with ceph (having deployed it a few times) and curious how well the operator works in real world scenarios. I think the hard part is surely how resources get scheduled as ceph based hosts are going to be heavily utilized on cpu/backend ethernet. TIA

Methanar
Sep 26, 2013

by the sex ghost
lmao this guy ghosted the interview.

Docjowles
Apr 9, 2009

clearly he's a goon and saw your anti-k8s agenda :tinfoil:

LochNessMonster
Feb 3, 2005

I need about three fitty


Docjowles posted:

clearly he's a goon and saw your anti-k8s agenda :tinfoil:

If he’s a goon he’s probably more scared about Methanars onboarding stories.

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

ILikeVoltron posted:

I'd love for somebody to tell me about their use of rook/ceph on a k8s cluster, I'm familiar with ceph (having deployed it a few times) and curious how well the operator works in real world scenarios. I think the hard part is surely how resources get scheduled as ceph based hosts are going to be heavily utilized on cpu/backend ethernet. TIA

I tried Rook out once as somebody who ran standalone Ceph for a couple years at PB scale and lost faith in Rook pretty quickly after it destroyed the monitor quorum. Whoops!

In general I don't like co-locating storage with compute; whatever it saves you in money it will more than cost you in operational headaches. You lose the ability to evolve the two sides of the platform separately, they're going to battle with each other over resources on each host, and any time there's a problem like a host failure you always have two problems instead of just one. Modern network-accessible storage is fundamentally a consensus problem unless you just don't care about data integrity and high availability, and having tried out Rook, OpenEBS and Portworx, none of them come close to providing the level of assurance that is necessary to feel confident that they're not going to trash your data. When you're dealing with anything consensus-related, keep that piece as simple as you can.

chutwig fucked around with this message at 20:04 on Apr 19, 2021

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

chutwig posted:

I tried Rook out once as somebody who ran standalone Ceph for a couple years at PB scale and lost faith in Rook pretty quickly after it destroyed the monitor quorum. Whoops!

In general I don't like co-locating storage with compute; whatever it saves you in money it will more than cost you in operational headaches. You lose the ability to evolve the two sides of the platform separately, they're going to battle with each other over resources on each host, and any time there's a problem like a host failure you always have two problems instead of just one. Modern network-accessible storage is fundamentally a consensus problem unless you just don't care about data integrity and high availability, and having tried out Rook, OpenEBS and Portworx, none of them come close to providing the level of assurance that is necessary to feel confident that they're not going to trash your data. When you're dealing with anything consensus-related, keep that piece as simple as you can.

I'm going to play devil's advocate a bit here and ask some seemingly stupid questions: can't you just use labels to pin the OSD/Mon/etc. onto dedicated hosts and completely separate the two workloads? I'll note that, as far as I know, you can't separate the workloads and their ethernet ports properly in a k8s cluster the way a good reference architecture would, but at the very least you should be able to force the workloads onto the right hosts, and only the right hosts?

Edit: to be clear, I'm not advocating a sort of hyperconverged setup, as perhaps you're alluding to.

ILikeVoltron fucked around with this message at 23:02 on Apr 19, 2021

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I've always gotten the impression that mixing storage, control plane, and general network communications traffic would be a Bad Idea at the very least from a compliance perspective due to needing to physically separate traffic (see: private VLANs required to separate tenants in I think PCI-DSS level 3). Heck, I'm nervous enough having observability traffic over the same networks as the data plane let alone mixing the switches for the storage traffic itself with it. It's similar to the arguments for putting databases like SQL Server on K8S. I mean, you can do it, but it seems ridiculously reductive to think that we should design, provision, and manage databases exactly the same way we manage stateless application deployments without some negative consequences as well as positive.

Methanar
Sep 26, 2013

by the sex ghost
We have a hard rule of not running databases or persistent application storage, like postgres or whatever, in k8s fwiw. Those go in RDS or, in a DC, on vsphere and whatever we used for shared storage there.

Hadlock
Nov 9, 2004

Strong agree. Unless you're running toy dev/unit-testing databases, production DBs shouldn't be run inside k8s.

Methanar
Sep 26, 2013

by the sex ghost
I've had 2 separate etcd fires two days in a row.

I'm like, getting OOMs on instances that have 12GB of freeable memory (but only 500MB actually free!). I don't really understand why. But there's some huge vertical spike in memory consumption on my control planes occasionally, like a malloc gone wrong or something. The current theory is that the kernel is getting some bad page thrashing or something and it's either freeing memory pages too quickly or not quickly enough, and everything cascadingly fails from there.

Methanar fucked around with this message at 23:23 on Apr 22, 2021

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

I've had 2 separate etcd fires two days in a row.

I'm like, getting OOMs on instances that have 12GB of freeable memory (but only 500MB actually free!). I don't really understand why. But there's some huge vertical spike in memory consumption on my control planes occasionally, like a malloc gone wrong or something. The current theory is that the kernel is getting some bad page thrashing or something and it's either freeing memory pages too quickly or not quickly enough, and everything cascadingly fails from there.

just curious, are the etcds hosting just a single kube cluster? and I'm assuming they have local storage, not something over a network?

only somewhat related, but I watched a talk a few months back from a guy who seemed to know what he was on about, who said that freeable memory is basically a lie because, IIRC (and e.g. in this case), everything in the page cache, including etcd pages, would be considered freeable but is so hot the kernel would never (and should never) actually free that memory. also, any program paged into memory is technically considered freeable, but again, if it's hot the kernel won't do it because it would spend all its time freeing and then reloading the page. his conclusion was that there isn't (or wasn't at the time of his talk) a single good heuristic to tell how much memory is truly available for use at a given time on a system, which was not reassuring.
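you can see the gap he was talking about by comparing MemFree against MemAvailable (the kernel's own estimate of what it could reclaim without thrashing). rough sketch, nothing fancy:

code:
// Rough sketch: compare MemFree vs MemAvailable vs Cached from /proc/meminfo.
// MemAvailable is the kernel's estimate of memory usable without heavy reclaim;
// "freeable" page cache counts toward it even when those pages are hot.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fields := map[string]uint64{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Lines look like: "MemAvailable:   12345678 kB"
		parts := strings.Fields(scanner.Text())
		if len(parts) < 2 {
			continue
		}
		key := strings.TrimSuffix(parts[0], ":")
		if kb, err := strconv.ParseUint(parts[1], 10, 64); err == nil {
			fields[key] = kb
		}
	}

	fmt.Printf("MemFree:      %6d MB\n", fields["MemFree"]/1024)
	fmt.Printf("MemAvailable: %6d MB\n", fields["MemAvailable"]/1024)
	fmt.Printf("Cached:       %6d MB (page cache: 'freeable', but possibly too hot to drop)\n", fields["Cached"]/1024)
}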

Methanar
Sep 26, 2013

by the sex ghost

my homie dhall posted:

just curious, are the etcds hosting just a single kube cluster? and I'm assuming they have local storage, not something over a network?

only somewhat related, but I watched a talk a few months back from a guy who seemed to know what he was on about, who said that freeable memory is basically a lie because, IIRC (and e.g. in this case), everything in the page cache, including etcd pages, would be considered freeable but is so hot the kernel would never (and should never) actually free that memory. also, any program paged into memory is technically considered freeable, but again, if it's hot the kernel won't do it because it would spend all its time freeing and then reloading the page. his conclusion was that there isn't (or wasn't at the time of his talk) a single good heuristic to tell how much memory is truly available for use at a given time on a system, which was not reassuring.

Confirming the kernel is a lying bastard because immediately after my last post it happened again except way the gently caress worse this time

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

Slab cache issues? Take a look at slabinfo
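slabtop will show it interactively, or a quick sketch like this dumps the biggest caches (assumes the usual /proc/slabinfo column layout and needs root):

code:
// Rough sketch: print the approximate memory held by the largest slab caches.
// Assumes the standard /proc/slabinfo columns: name active_objs num_objs objsize objperslab pagesperslab ...
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

type slab struct {
	name  string
	bytes uint64
}

func main() {
	f, err := os.Open("/proc/slabinfo") // needs root
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var slabs []slab
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "slabinfo") || strings.HasPrefix(line, "#") {
			continue // skip the version line and column header
		}
		cols := strings.Fields(line)
		if len(cols) < 4 {
			continue
		}
		numObjs, _ := strconv.ParseUint(cols[2], 10, 64)
		objSize, _ := strconv.ParseUint(cols[3], 10, 64)
		slabs = append(slabs, slab{name: cols[0], bytes: numObjs * objSize})
	}

	sort.Slice(slabs, func(i, j int) bool { return slabs[i].bytes > slabs[j].bytes })
	for i := 0; i < 10 && i < len(slabs); i++ {
		fmt.Printf("%-30s %8d MB\n", slabs[i].name, slabs[i].bytes/(1024*1024))
	}
}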

Methanar
Sep 26, 2013

by the sex ghost
I have had 3 etcd fires in 2 days.


well, that was extremely unpleasant, and it was the first time I've had to recover from a particularly catastrophic etcd failure in such an ad-hoc, undocumented, unrunbook'd way. I did the recovery plan in a test environment first and it worked, but it wasn't so great in prod because my prod clusters are 1000x larger. I'm now thinking about no longer colocating the compute control plane and my etcd state on the same VMs, even though the kubeadm community says you should colocate them. I don't know why, but I've now seen the APIservers, nodelocaldnscache, and CoreDNS pin themselves to 100% CPU whenever something funky happens to etcd, sort of regardless of what the funky is. I'm still making this all up in my head, but maybe the pinned CPU is causing context-switching contention when the kernel tries to free its many dirty memory pages, and it can't because of the contention, and somehow that becomes increasingly, cascadingly bad, with etcd losing quorum and then failing health checks, causing the API server to break harder. I don't know yet.

This is now the 2nd time I've had to respond to a nasty prod fire with a director and an army of SREs looking over my shoulder, because this had the potential to get significantly worse and I needed to make sure the risk was being mitigated by having all these other people on standby, aware of the situation, and knowing what might happen and how to respond if it did. I have no idea how others perceive me when I'm responding to these things. My director sounded scared, though, when we were describing the maybe-it-could-get-worse worst cases. I'm kind of worried that it reflects badly on me that poo poo with my name on it is on fire again, even though it wasn't really my fault either time.

I have several more takeaway fixes, because certain "helpful" shutdown hooks are extremely not helpful and keep shooting me in the foot, so I'm going to be removing them. I also need to look at sysctl tuning the kernel's virtual memory manager so it maybe doesn't hold on to so many freeable pages for so long. At least I'm on gp3 volumes with more IOPS now :)


unrelated, but did you know that if a kubelet ever tries to speak to an API server that returns invalid TLS CA cert errors, the kubelet decides the safest thing to do is halt all its pods?


All of this after a huge application fire this morning that ran for several hours, for a ton of reasons: the code was passing strings where functions wanted pointers, which ultimately meant this app had a cache hit rate of 0 for 2 years and nobody noticed (and somehow this wasn't a compile error); a weird HTTP library was being used, and used improperly, so it was creating an entirely new TCP session for every HTTP call, which for more complicated reasons ended up exhausting the available ephemeral ports and TCP tuples; and cool things like tests failing and the build system secretly, silently producing the binaries from the last good branch of the code instead whenever tests fail (?????), so nobody knew that the hotfixed build didn't actually contain any hotfixes until it was debugged and we saw weird HTTP library symbols still being called that shouldn't have existed anymore. And more.

im tired

Methanar fucked around with this message at 06:25 on Apr 23, 2021

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

I just implemented a mitigation for etcd slowly getting crapped up over time (multiple weeks plus) that is basically a weekly cron to dump the page cache. It’s not a satisfying fix at all, but hard to argue with it when it runs and 30 seconds later all the whining in the log about slow read-only range requests vanishes.
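The cron is basically the classic sync-then-poke-/proc/sys/vm/drop_caches trick; for reference, a rough Go equivalent of what that kind of job does (sketch only, not the actual cron):

code:
// Rough equivalent of `sync && echo 1 > /proc/sys/vm/drop_caches`.
// Writing "1" drops the page cache; "2" drops dentries/inodes; "3" drops both. Needs root.
package main

import (
	"os"
	"syscall"
)

func main() {
	// Write back dirty pages first; drop_caches only discards clean cache.
	syscall.Sync()
	if err := os.WriteFile("/proc/sys/vm/drop_caches", []byte("1\n"), 0o200); err != nil {
		panic(err)
	}
}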

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Methanar posted:

I have had 3 etcd fires in 2 days.

[snip]
What's going on with your request rates to etcd during this time (are they normal, or is your issue precipitated by an uptick in etcd getting hit)? Does your memory usage correlate against an increase in the size of the outstanding proposals reported by etcd? If so, which one seems to trend upward first? Does your WAL performance look healthy?

Methanar
Sep 26, 2013

by the sex ghost

Vulture Culture posted:

What's going on with your request rates to etcd during this time (are they normal, or is your issue precipitated by an uptick in etcd getting hit)? Does your memory usage correlate against an increase in the size of the outstanding proposals reported by etcd? If so, which one seems to trend upward first? Does your WAL performance look healthy?


Proper interpretation of this graph requires a bunch of context, but tl;dr the three spikes were recovery events and the APIservers went nuts, just as they had earlier, which initially triggered the issue. On the bottom graph you can first see an OOM on a 16GB instance. Then we bumped to 32GB instances and still OOMed. It's off the graph, but on a 64GB instance we finally had enough memory to barely survive. 64GB memory spikes on something that usually requires, like, <3GB.

After further investigation, the runaway process is not actually etcd, but rather the APIservers losing their minds, pegging all cores and suddenly requesting nonsensical amounts of memory in response to etcd leader elections. CoreDNS pegging itself to 100% CPU in some failure cases is also suspicious. There are some spikes in APIserver request volume at the time of recovery, which is expected since kubelet state will be reported back and require reconciliation after several hours of outage, but that doesn't seem at all proportional to what the load climbs to.

It's the APIserver(s) freaking out in response to etcd elections, sometimes. In a terrible irony, as the APIservers saturate their hardware, etcd begins to fail health checks under all the CPU contention, which triggers further etcd elections and hangs, causing the APIservers to freak out harder. So I'm doubling down on running my etcd state on separate hardware from my compute control plane. The current mitigation is mostly just throwing money at the problem with oversized hardware until next week's end-of-quarter prod freeze ends and we can do something about the situation for real. Setting CPU quota limits on the static pod manifests might also be helpful, we think. At least that way the thing we want to die, the APIservers, dies rather than the entire node.

I have no explanation as to why the APIservers are doing this. I thought about dumping some flame graphs of the APIservers' call stacks or something, but the issue is apparently unreproducible in a small environment for us, and it's questionable whether any profiling tooling would work when this has a tendency to completely kill a node. So ¯\_(ツ)_/¯

Methanar fucked around with this message at 03:32 on Apr 24, 2021

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?

Methanar posted:

¯\_(ツ)_/¯

When you say “compute control plane”, are you referring to K8S master nodes?

Methanar
Sep 26, 2013

by the sex ghost

madmatt112 posted:

When you say “compute control plane”, are you referring to K8S master nodes?

APIserver, controller manager and scheduler processes = compute control plane

madmatt112
Jul 11, 2016

Is that a cat in your pants, or are you just a lonely excuse for an adult?

Methanar posted:

APIserver, controller manager and scheduler processes = compute control plane

This makes sense.


Methanar
Sep 26, 2013

by the sex ghost
I was debugging some network thing for somebody and I was getting annoyed that our containers have zero tools on the filesystem for debugging, and no root.

As a joke I asked myself: given that we use an ubuntu base image, does that mean we shipped with a python interpreter?

lmao, the ubuntu docker image comes with a full python interpreter.

  • Reply