|
12 rats tied together posted: its a fair gripe, this app probably shouldnt be in k8s

prob not but at the time k8s was implemented we had a 12 hour sla and cared extremely little about failure rates. retries will fix everything. now we have some features that care very much about latency and failure rate but are all-in on k8s so now making do is part of my job

Ploft-shell crab posted: the authors of kubernetes would probably say you’re going to have to solve this problem (requests dropping) eventually somewhere and also the problem you want them to solve is intractable. you’re asking them to somehow come up with a routing model that’s a) consistent/atomic across distributed nodes, b) supports dynamic scaling events and c) never dropping requests. how would you do this with any other provider or even conceptually other than a sleep after scaling down?

your a) is stronger than I need. what I’m looking for is an ordering guarantee between iptables updates and SIGKILL. you don’t need consensus for that. write a txid into endpoints when they’re updated, and write the most recent txid visible in iptables into node status. the rest is a pretty simple pre-stop hook. paging someone to unfuck the cluster when a node breaks is acceptable to me. god knows we do that enough with loving docker as it is
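to spell out what that pre-stop hook would actually do, here's a pure-python sketch. to be clear: kubernetes has no txid on Endpoints or node status -- every name below is hypothetical, this is just the wait loop the proposal implies if the ordering data existed.

```python
# hypothetical pre-stop wait: block pod deletion until every node reports
# having applied the iptables state (txid) from the Endpoints update that
# removed this pod. none of these fields exist in k8s today.
import time

def wait_for_iptables_sync(endpoint_txid, get_node_txids, timeout=30.0, poll=0.5):
    """return True once every node's applied txid >= the Endpoints update
    that removed this pod; False on timeout (page someone)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(t >= endpoint_txid for t in get_node_txids()):
            return True
        time.sleep(poll)
    return False

# simulated cluster: node2's kube-proxy lags behind the Endpoints update
applied = {"node1": 7, "node2": 5}
def get_node_txids():
    applied["node2"] = min(applied["node2"] + 1, 7)  # catches up each poll
    return list(applied.values())

assert wait_for_iptables_sync(7, get_node_txids, timeout=5.0, poll=0.01)
```

in a real cluster `get_node_txids` would read node status via the API server and the loop would be wired up as a preStop exec hook, so SIGTERM/SIGKILL only land after every node has stopped routing to the pod.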
|
# ? Jun 11, 2020 16:46 |
|
Nomnom Cookie posted: can you understand me being pissed that the best practice guidance for this situation is to hack in a sleep long enough that you probably won’t have an issue, because making the thing work properly would be a “layering violation”. it’s worse is better pretending to be just plain better and it bugs the hell out of me

i understand that you're pissed, it's ok to be mad that it doesn't work for you. you shouldn't use it if it doesn't work for you. my point is that nothing is perfect, but the tradeoffs are usually worthwhile.

so, my day job is dev rel, right? i work on opentelemetry, opentracing, poo poo like that. i frequently field discussions with frontend devs who complain about the bundle size of javascript tracers. "why is it so big? this is an extra 30kb gzipped, that's too much, that's going to slow down page loads." yeah, it's true! distributed tracing libs aren't like segment or some other analytic lib, the data structure of a trace is more complex and honestly, we haven't done much to create a lightweight version for browsers/frontend use. my response is 'if you can't use it, then don't use it. personally, i think the tradeoff in file size is worth it for the other benefits, but if that isn't the case for you, then don't use it, ask someone else to fix it, or fix it yourself.' those are the options in our brave new oss future.

complaining into the void is fun and cathartic, but at the end of the day you should do what's best for your situation. in this case, if kubernetes solved 99% of the problems you're having now but caused a new problem that you couldn't work around, what would you do? if the new problem was truly something disqualifying then your options are to 1) not use it, 2) fix the underlying problem, 3) work around the problem, 4) ask someone else to fix the problem.
|
# ? Jun 11, 2020 16:48 |
|
obtw 12 rats can you point me to where on the k8s marketing site or docs it says that kubernetes is unsuitable for applications requiring correctness from the cluster orchestrator. I mean, we’ve just established that it is, so isn’t that something that is pretty important to communicate to people considering using k8s
|
# ? Jun 11, 2020 16:49 |
|
i don't think kubernetes has ever claimed that it supports absolute correctness in the way you want it to.
|
# ? Jun 11, 2020 16:57 |
|
meanwhile, everything i can find and everything i've read and everything i've ever done has stressed the importance of graceful termination, adding prestop hooks to deployments, etc. etc. just because you don't like the solution-as-designed doesn't mean it's the wrong solution, it just means it's unsuitable for you.
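for reference, the graceful-termination knobs in question live in the pod spec of a Deployment. a minimal sketch, with illustrative names and durations (the sleep is exactly the hack under discussion):

```yaml
# illustrative fragment of a Deployment pod spec, not a tuned config
spec:
  terminationGracePeriodSeconds: 30   # how long SIGTERM -> SIGKILL waits
  containers:
    - name: app
      image: example/app:latest       # hypothetical image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # buy time for endpoint/iptables updates to propagate
```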
|
# ? Jun 11, 2020 17:04 |
|
uncurable mlady posted: meanwhile, everything i can find and everything i've read and everything i've ever done has stressed the importance of graceful termination, adding prestop hooks to deployments, etc. etc.

Those are all well and good but don’t actually solve the problem either. I’ve fought with this issue as well and it sucks for several social and technical reasons.

Technically, they could make cycling pods a lot safer (but not perfectly safe without taking a long, hard look at the trade offs WRT CAP theorem), they just don’t. They built a house out of baling wire and iptables rules and now they have to live in it.

Socially, it’s a hard conversation to have with devs/managers when they’ve bought into the whole kubernetes stack and it’s worked in the past but now the requirements have changed slightly and you have to create completely separate infra for this app now.

The only thing that kubernetes works for is web/api poo poo where response times don’t matter and retries are fine, or similarly for batch processing. However, kubernetes is sold as the answer for everything (and oh there may be some corner cases you have to worry about, like the entire loving DNS stack). Even the “scale pods with your load” selling points are overblown because the autoscalers suck poo poo
|
# ? Jun 11, 2020 17:19 |
|
Gaukler posted: Those are all well and good but don’t actually solve the problem either. I’ve fought with this issue as well and it sucks for several social and technical reasons.

i mean you and nomnom are both right, k8s has some design tradeoffs that make it broadly unsuitable for many classes of applications. fixing those flaws would require a lot of work that will probably happen eventually, or people will just not really care and use other things for applications/services that need them.
|
# ? Jun 11, 2020 17:35 |
|
uncurable mlady posted: i don't think kubernetes has ever claimed that it supports absolute correctness in the way you want it to.

how often do you publicly claim not to beat your spouse and should i derive any conclusions from the frequency of such claims
|
# ? Jun 11, 2020 17:39 |
|
kube saying "im so wasted lol is it time to murder this pod, idk lets just do it yolo" with no way to change it, i just wanna bitch, and yall bein like

1) here is a thing you are already doing, have you tried it
2) actually this is good, you should stop complaining
3) i would simply do a 360 and rebase my entire infrastructure to a different system

what you should say is "drat bro, that sucks"
|
# ? Jun 11, 2020 17:50 |
|
idk i feel like a lot of people keep saying 'drat bro, that sucks, but here's some productive options and discussion about the issue' but if you just want to scream into the void might i recommend using the no-reply tweets to @thockin
|
# ? Jun 11, 2020 18:01 |
|
uncurable mlady posted: idk i feel like a lot of people keep saying 'drat bro, that sucks, but here's some productive options and discussion about the issue' but if you just want to scream into the void might i recommend using the no-reply tweets to @thockin

you're the one who just got done saying i should either shut up and suck it up or spend a few man years to move off of k8s, so it's really interesting to me that this is how you see your side of it
|
# ? Jun 11, 2020 18:13 |
|
you're the one who keeps getting more and more hostile. no one owes you the response you want. if you don't like the ones you're getting, you don't have to keep posting.
|
# ? Jun 11, 2020 18:58 |
|
just yeet your packets to nullroute, who cares
|
# ? Jun 11, 2020 19:04 |
|
Captain Foo posted: just yeet your packets to nullroute, who cares

mickens monitorama preso edge network guy is the hero we need
|
# ? Jun 11, 2020 19:51 |
|
Nomnom Cookie posted: your a) is stronger than I need. what I’m looking for is an ordering guarantee between iptables updates and SIGKILL. you don’t need consensus for that. write a txid into endpoints when they’re updated, and write the most recent txid visible in iptables into node status. the rest is a pretty simple pre-stop hook. paging someone to unfuck the cluster when a node breaks is acceptable to me. god knows we do that enough with loving docker as it is

just because a host no longer has an iptables rule for a target doesn’t mean it’s not going to route traffic to that target though. existing conntrack flows will continue to use that route which means you’re not just waiting on iptables to update, but also all existing connections (which may or may not even be valid) to that pod to terminate before you can delete it. what happens if neither side closes the connection, should the pod just never delete? it might be workable for your specific use case (in which case, write your own kube-proxy! there are already cni providers that replace it), but I don’t think it’s as trivial as you’re making it sound
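the "neither side closes the connection" question usually gets answered with a bounded drain: wait for active flows to hit zero, but cap the wait so a stuck flow can't block deletion forever. a sketch -- `get_active_conns` is a hypothetical stand-in for however you count flows to the pod (e.g. parsing conntrack output):

```python
# bounded connection drain: "drained" if flows reach zero in time,
# "timeout" if some flow never closes and we proceed with deletion anyway
import time

def drain(get_active_conns, max_wait=60.0, poll=0.5):
    """returns ("drained", seconds_waited) or ("timeout", max_wait)."""
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        if get_active_conns() == 0:
            return "drained", time.monotonic() - start
        time.sleep(poll)
    return "timeout", max_wait

# simulated flow count shrinking over successive polls
conns = [3, 1, 0]
result, _ = drain(lambda: conns.pop(0) if conns else 0, max_wait=5.0, poll=0.01)
assert result == "drained"
```

the point of the cap is exactly the trade-off in the post: you accept dropping whatever flows outlive `max_wait` rather than letting a half-open connection veto the scale-down.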
|
# ? Jun 11, 2020 22:49 |
|
fwiw I think relying on iptables as the core mechanism in your data plane is a pretty questionable decision, but I’m not sure there are/were better alternatives available for kernels without nftables. I have only limited exposure to ipvs
my homie dhall fucked around with this message at 00:21 on Jun 12, 2020 |
# ? Jun 11, 2020 22:53 |
|
Captain Foo posted: just yeet your packets to nullroute, who cares

tired: static default route
wired: static default null route
|
# ? Jun 11, 2020 23:40 |
|
routing packets through linux hosts still kinda sucks poo poo and there's a reason vendors like cumulus or whatever charge 2k+ for software per device
|
# ? Jun 11, 2020 23:41 |
|
i have still never in my career written software which was not targeted to run on a known physical internally managed machine, and i think this will be my curmudgeon rubicon where i will forever go 'hrumph' about doing anything else.
|
# ? Jun 12, 2020 08:03 |
|
Gaukler posted: but now the requirements have changed slightly

switching from a background processing model to a real time SLA is not a slight change in requirements, and it's a shame that their management apparently can't be convinced of that fact

instead they've apparently decided to have nomnom figure out how best to make a best-effort peg fit a guaranteed response hole
|
# ? Jun 12, 2020 14:07 |
|
we tried using the wrong tools for the job and we're all out of ideas!
|
# ? Jun 12, 2020 14:07 |
|
Progressive JPEG posted: switching from a background processing model to a real time SLA is not a slight change in requirements

Nomnom Cookie posted: obtw 12 rats can you point me to where on the k8s marketing site or docs it says that kubernetes is unsuitable for applications requiring correctness from the cluster orchestrator

admitting that k8s isn't an ops panacea, or the google might not know the best way to run any infrastructure for any purpose, would ruin that perception

its true that you can work around the problem. it sucks that you work around it with "sleep 3" or by doing way too much loving work compared to any other load balancer to ever exist except maybe the Cisco (R) Catalyst (Copyright 1995) Content Switching Module
|
# ? Jun 12, 2020 14:46 |
|
did i miss the post where someone said race multiple requests to distinct destinations
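that's request hedging: fire the same request at several backends and take whichever answers first, trading extra load for tail latency. a stdlib sketch -- `fake_send` and the backend names are stand-ins, not a real client library:

```python
# request hedging with concurrent.futures: submit to every backend,
# return the first successful reply
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def hedged_request(send, backends, timeout=1.0):
    """return (backend, reply) from whichever backend answers first."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {pool.submit(send, b): b for b in backends}
        for fut in as_completed(futures, timeout=timeout):
            try:
                return futures[fut], fut.result()
            except Exception:
                continue  # that backend failed; let a slower one win
    raise TimeoutError("no backend answered in time")

def fake_send(backend):  # simulated latencies; "b" wins the race
    time.sleep({"a": 0.2, "b": 0.05, "c": 0.4}[backend])
    return "hello from " + backend

winner, reply = hedged_request(fake_send, ["a", "b", "c"], timeout=2.0)
assert winner == "b"
```

one caveat: the `with` block waits for the losing threads to finish before the function returns; real hedging clients cancel or abandon the stragglers instead, and usually delay the second request slightly so the common case only costs one RPC.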
|
# ? Jun 12, 2020 16:57 |
|
12 rats tied together posted: i wont even pretend for a little bit that k8s isn't at least half marketing the google brand to get people to work there. SRE as a role sucks rear end in general, the best way to get people to stick with it at your multinational megacorp would be to convince them that they are special in some way

are there experienced SREs in the world that are not disillusioned and resigned to the touch computer forever for capitalism life
|
# ? Jun 12, 2020 18:14 |
|
I think the conversation for the past page or so is pretty interesting and I'll definitely consider the innate limitations of k8s next time someone I'm working with suggests it for a solution.
|
# ? Jun 12, 2020 18:26 |
|
Progressive JPEG posted: switching from a background processing model to a real time SLA is not a slight change in requirements, and it's a shame that their management apparently can't be convinced of that fact

I worded that vaguely, I was actually referring to situations I’ve seen. I agree going from batch to an SLA of a few hundred ms is a big change. In my case it was for something that went from “best effort or we’ll just use filler content” to “this has to return content” except in the passive aggressive way of “yeah we understand that this request may fail” and then loudly escalating a bug every time it failed.
|
# ? Jun 12, 2020 19:38 |
|
12 rats tied together posted: except maybe the Cisco (R) Catalyst (Copyright 1995) Content Switching Module

I once had to replace a pair of these with F5's and one of the business owners legitimately argued with me that I shouldn't do that because the risk of breaking production was too great

the css at the time was end of support
|
# ? Jun 12, 2020 22:45 |
|
a huge part of my job in like 20...15? was managing a pair of them in a pair of 6509s. it was really something. they had some config sync poo poo that worked really well actually
|
# ? Jun 12, 2020 23:17 |
|
They actually worked totally fine and the config was easy to understand. Far more straightforward than netscalers, I hated managing those.
|
# ? Jun 13, 2020 03:22 |
|
does f5 stand for “just keep refreshing until it finally points you to a server that’s actually alive” or just the two instances i have to rely on were configured by clowns?
|
# ? Jun 13, 2020 03:23 |
|
mod saas posted: does f5 stand for “just keep refreshing until it finally points you to a server that’s actually alive” or just the two instances i have to rely on were configured by clowns?

Any load balancing solution requires the person operating it to have a beyond-cursory understanding of the apps they are load balancing and that's an unreasonable request for a network team that may have to look after several thousand virtual servers so you get a lot of "tcp port alive" health checks and poo poo like that
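side by side, the lazy check and an app-aware one as kubernetes readinessProbe fragments (port, path, and thresholds are illustrative):

```yaml
# "tcp port alive": passes as long as the socket accepts,
# even if the app behind it is wedged
readinessProbe:
  tcpSocket:
    port: 8080
---
# app-aware: the handler has to actually answer 200 on /healthz
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```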
|
# ? Jun 13, 2020 03:40 |
|
abigserve posted: Any load balancing solution requires the person operating it to have a beyond-cursory understanding of the apps they are load balancing and that's an unreasonable request for a network team that may have to look after several thousand virtual servers so you get a lot of "tcp port alive" health checks and poo poo like that

even better is the ping health check
|
# ? Jun 13, 2020 05:02 |
|
abigserve posted: Any load balancing solution requires the person operating it to have a beyond-cursory understanding of the apps they are load balancing and that's an unreasonable request for a network team that may have to look after several thousand virtual servers so you get a lot of "tcp port alive" health checks and poo poo like that

"requires" here is perhaps more "should have, if you want to balance it well". you can totally set up a load balancer with little understanding of network protocols and troubleshooting them, people do it every day

the results may be less than ideal, but hey, welcome to infrastructure
|
# ? Jun 14, 2020 02:59 |
|
CMYK BLYAT! posted: are there experienced SREs in the world that are not disillusioned and resigned to the touch computer forever for capitalism life

the only thing keeping me touching computer is health insurance.

abigserve posted: so you get a lot of "tcp port alive" health checks and poo poo like that

i do like that i can punt this to the dev team really easy in k8s. i can even just link them api docs, its reasonable to expect developers to be able to read and understand api documentation

we still have a lot of extremely dumb health checks. its either an issue with .net or an issue with the gestalt understanding of running web applications on windows and i am wholly uninterested in finding out which is true
|
# ? Jun 16, 2020 17:24 |
|
I had the css too, replaced them with the 6500 ACE modules. The probe config was great and it tied right into the distribution switch vlans. Never once did they have an issue.
|
# ? Jun 16, 2020 23:55 |
|
It seems like the 6500 era was when Cisco had all the good engineers, who then left to form competitors. I can't think of a single bad thing to say about that platform.
|
# ? Jun 17, 2020 01:19 |
|
I have a love/hate relationship with the 9k series, Cisco still had some good folks. There are so many good ideas but it's tainted by silly crap. No, I don't want to manage 35668 SMUs, tyvm. Also ridiculous for them trying to call the Broadcom NPs Cisco silicon for years. Sure, you guys both used trident, typhoon, tomahawk, ok... Now it's a marketing point on the lightspeed slide deck.
|
# ? Jun 17, 2020 01:49 |
|
abigserve posted: It seems like the 6500 era was when Cisco had all the good engineers, who then left to form competitors. I can't think of a single bad thing to say about that platform.

well the 6500 era also spanned from like 1996-2009 or something. my opinion is that Cisco jumped the shark right around 2006 (when they first hired me, lol).
|
# ? Jun 17, 2020 03:58 |
|
I think the nexus 7k was the first platform that ruffled everyone's feathers. It was such a dramatic architecture change from the 6500 it was literally "we're going to deprecate this good, working platform, and replace it with something far inferior for no reason"

Like sure, on paper the platform was far more performant, it was much closer aligned with the needs of the DC, but in practice most customers don't need twenty petabits of backplane throughput but what they DO need is l3vpns that work, and a solid BGP implementation, a working HA model, etc etc etc

And the thing is, they never really fixed it. That legacy is still around, albeit in the 9K form factor now.
|
# ? Jun 17, 2020 13:40 |
|
don't worry this will all be solved by intent based networking
|
# ? Jun 17, 2020 13:53 |