Nomnom Cookie
Aug 30, 2009



12 rats tied together posted:

its a fair gripe, this app probably shouldnt be in k8s

prob not but at the time k8s was implemented we had a 12 hour sla and cared extremely little about failure rates. retries will fix everything. now we have some features that care very much about latency and failure rate but are all-in on k8s so now making do is part of my job

Ploft-shell crab posted:

the authors of kubernetes would probably say you’re going to have to solve this problem (requests dropping) eventually somewhere and also the problem you want them to solve is intractable. you’re asking them to somehow come up with a routing model that’s a) consistent/atomic across distributed nodes, b) supports dynamic scaling events and c) never dropping requests. how would you do this with any other provider or even conceptually other than a sleep after scaling down?

your a) is stronger than I need. what I’m looking for is an ordering guarantee between iptables updates and SIGKILL. you don’t need consensus for that. write a txid into endpoints when they’re updated, and write the most recent txid visible in iptables into node status. the rest is a pretty simple pre-stop hook. paging someone to unfuck the cluster when a node breaks is acceptable to me. god knows we do that enough with loving docker as it is
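
to be concrete, something like the sketch below is what i mean. none of it exists today: the endpoints txid file, the node status field, and the kubectl/RBAC access from inside the pod are all hypothetical.

code:
# hypothetical sketch only -- kubernetes does not stamp Endpoints updates with
# a txid or mirror "last txid applied to iptables" into node status today
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # txid of the Endpoints update that removed this pod
          # (hypothetical downward-API file)
          want="$(cat /etc/podinfo/endpoints-txid)"
          # block until every node reports it has flushed at least that
          # txid into its iptables rules (hypothetical status field)
          until kubectl get nodes \
                  -o jsonpath='{.items[*].status.iptablesTxid}' \
                | tr ' ' '\n' \
                | awk -v want="$want" '$1 < want { behind=1 } END { exit behind }'; do
            sleep 1
          done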

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

Nomnom Cookie posted:

can you understand me being pissed that the best practice guidance for this situation is to hack in a sleep long enough that you probably won’t have an issue, because making the thing work properly would be a “layering violation”. it’s worse is better pretending to be just plain better and it bugs the hell out of me

i understand that you're pissed, it's ok to be mad that it doesn't work for you. you shouldn't use it if it doesn't work for you.

my point is that nothing is perfect, but the tradeoffs are usually worthwhile.

so, my day job is dev rel, right? i work on opentelemetry, opentracing, poo poo like that. i frequently field discussions with frontend devs who complain about the bundle size of javascript tracers. "why is it so big? this is an extra 30kb gzipped, that's too much, that's going to slow down page loads." yeah, it's true! distributed tracing libs aren't like segment or some other analytic lib, the data structure of a trace is more complex and honestly, we haven't done much to create a lightweight version for browsers/frontend use.

my response is 'if you can't use it, then don't use it. personally, i think the tradeoff in file size is worth it for the other benefits, but if that isn't the case for you, then don't use it, ask someone else to fix it, or fix it yourself.' those are the options in our brave new oss future. complaining into the void is fun and cathartic, but at the end of the day you should do what's best for your situation.

in this case, if kubernetes solved 99% of the problems you're having now but caused a new problem that you couldn't work around, what would you do? if the new problem was truly something disqualifying then your options are to 1) not use it, 2) fix the underlying problem, 3) work around the problem, 4) ask someone else to fix the problem.

Nomnom Cookie
Aug 30, 2009



obtw 12 rats can you point me to where on the k8s marketing site or docs it says that kubernetes is unsuitable for applications requiring correctness from the cluster orchestrator. I mean, we’ve just established that it is, so isn’t that something that is pretty important to communicate to people considering using k8s

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
i don't think kubernetes has ever claimed that it supports absolute correctness in the way you want it to.

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
meanwhile, everything i can find and everything i've read and everything i've ever done has stressed the importance of graceful termination, adding prestop hooks to deployments, etc. etc.
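
for reference, the guidance i'm talking about mostly boils down to a stanza like this. it's a sketch, not a full manifest, and the image name and sleep duration are just placeholders:

code:
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: example/app:latest        # placeholder
      lifecycle:
        preStop:
          exec:
            # delay SIGTERM so endpoint/iptables updates have (probably)
            # propagated before the pod stops taking traffic
            command: ["/bin/sh", "-c", "sleep 15"]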

just because you don't like the solution-as-designed doesn't mean it's the wrong solution, it just means it's unsuitable for you.

Gaukler
Oct 9, 2012


uncurable mlady posted:

meanwhile, everything i can find and everything i've read and everything i've ever done has stressed the importance of graceful termination, adding prestop hooks to deployments, etc. etc.

just because you don't like the solution-as-designed doesn't mean it's the wrong solution, it just means it's unsuitable for you.

Those are all well and good but don’t actually solve the problem either. I’ve fought with this issue as well and it sucks for several social and technical reasons.

Technically, they could make cycling pods a lot safer (but not perfectly safe without taking a long, hard look at the trade offs WRT CAP theorem), they just don’t. They built a house out of baling wire and iptables rules and now they have to live in it.

Socially, it’s a hard conversation to have with devs/managers when they’ve bought into the whole kubernetes stack and it’s worked in the past but now the requirements have changed slightly and you have to create completely separate infra for this app now.

The only thing that kubernetes works for is web/api poo poo where response times don’t matter and retries are fine, or similarly for batch processing. However, kubernetes is sold as the answer for everything (and oh there may be some corner cases you have to worry about, like the entire loving DNS stack). Even the “scale pods with your load” selling points are overblown because the autoscalers suck poo poo

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

Gaukler posted:

Those are all well and good but don’t actually solve the problem either. I’ve fought with this issue as well and it sucks for several social and technical reasons.

Technically, they could make cycling pods a lot safer (but not perfectly safe without taking a long, hard look at the trade offs WRT CAP theorem), they just don’t. They built a house out of baling wire and iptables rules and now they have to live in it.

Socially, it’s a hard conversation to have with devs/managers when they’ve bought into the whole kubernetes stack and it’s worked in the past but now the requirements have changed slightly and you have to create completely separate infra for this app now.

The only thing that kubernetes works for is web/api poo poo where response times don’t matter and retries are fine, or similarly for batch processing. However, kubernetes is sold as the answer for everything (and oh there may be some corner cases you have to worry about, like the entire loving DNS stack). Even the “scale pods with your load” selling points are overblown because the autoscalers suck poo poo

i mean you and nomnom are both right, k8s has some design tradeoffs that make it broadly unsuitable for many classes of applications. fixing those flaws would require a lot of work that will probably happen eventually, or people will just not really care and use other things for applications/services that need them.

Nomnom Cookie
Aug 30, 2009



uncurable mlady posted:

i don't think kubernetes has ever claimed that it supports absolute correctness in the way you want it to.

how often do you publicly claim not to beat your spouse and should i derive any conclusions from the frequency of such claims

Nomnom Cookie
Aug 30, 2009



kube saying "im so wasted lol is it time to murder this pod, idk lets just do it yolo" with no way to change it, i just wanna bitch, and yall bein like

1) here is a thing you are already doing, have you tried it
2) actually this is good, you should stop complaining
3) i would simply do a 360 and rebase my entire infrastructure to a different system

what you should say is "drat bro, that sucks"

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
idk i feel like a lot of people keep saying 'drat bro, that sucks, but here's some productive options and discussion about the issue' but if you just want to scream into the void might i recommend using the no-reply tweets to @thockin

Nomnom Cookie
Aug 30, 2009



uncurable mlady posted:

idk i feel like a lot of people keep saying 'drat bro, that sucks, but here's some productive options and discussion about the issue' but if you just want to scream into the void might i recommend using the no-reply tweets to @thockin

you're the one who just got done saying i should either shut up and suck it up or spend a few man years to move off of k8s, so it's really interesting to me that this is how you see your side of it

carry on then
Jul 10, 2010

by VideoGames

(and can't post for 10 years!)

you're the one who keeps getting more and more hostile. no one owes you the response you want. if you don't like the ones you're getting, you don't have to keep posting.

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

just yeet your packets to nullroute, who cares

Qtotonibudinibudet
Nov 7, 2011



Omich poluyobok, skazhi ty narkoman? ya prosto tozhe gde to tam zhivu, mogli by vmeste uyobyvat' narkotiki

Captain Foo posted:

just yeet your packets to nullroute, who cares

mickens monitorama preso edge network guy is the hero we need

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Nomnom Cookie posted:

your a) is stronger than I need. what I’m looking for is an ordering guarantee between iptables updates and SIGKILL. you don’t need consensus for that. write a txid into endpoints when they’re updated, and write the most recent txid visible in iptables into node status. the rest is a pretty simple pre-stop hook. paging someone to unfuck the cluster when a node breaks is acceptable to me. god knows we do that enough with loving docker as it is

just because a host no longer has an iptables rule for a target doesn’t mean it’s not going to route traffic to that target though. existing conntrack flows will continue to use that route which means you’re not just waiting on iptables to update, but also all existing connections (which may or may not even be valid) to that pod to terminate before you can delete it. what happens if neither side closes the connection, should the pod just never delete?

it might be workable for your specific use case (in which case, write your own kube-proxy! there are already cni providers that replace it), but I don’t think it’s as trivial as you’re making it sound

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
fwiw I think relying on iptables as the core mechanism in your data plane is a pretty questionable decision, but I’m not sure there are/were better alternatives available for kernels without nftables. I have only limited exposure to ipvs
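
(for what it's worth, kube-proxy does ship an ipvs mode; switching is roughly a one-line change in the KubeProxyConfiguration, with everything cluster-specific omitted here:)

code:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; other ipvs schedulers are available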

my homie dhall fucked around with this message at 00:21 on Jun 12, 2020

abigserve
Sep 13, 2009

this is a better avatar than what I had before

Captain Foo posted:

just yeet your packets to nullroute, who cares

tired: static default route
wired: static default null route

abigserve
Sep 13, 2009

this is a better avatar than what I had before
routing packets through linux hosts still kinda sucks poo poo and there's a reason vendors like cumulus or whatever charge 2k+ for software per device

Cybernetic Vermin
Apr 18, 2005

i have still never in my career written software which was not targeted to run on a known physical internally managed machine, and i think this will be my curmudgeon rubicon where i will forever go 'hrumph' about doing anything else.

Progressive JPEG
Feb 19, 2003

Gaukler posted:

but now the requirements have changed slightly

switching from a background processing model to a real time SLA is not a slight change in requirements, and it's a shame that their management apparently can't be convinced of that fact

instead they've apparently decided to have nomnom figure out how best to make a best-effort peg fit a guaranteed response hole

Progressive JPEG
Feb 19, 2003

we tried using the wrong tools for the job and we're all out of ideas!

12 rats tied together
Sep 7, 2006

Progressive JPEG posted:

switching from a background processing model to a real time SLA is not a slight change in requirements
yeah, rofl

Nomnom Cookie posted:

obtw 12 rats can you point me to where on the k8s marketing site or docs it says that kubernetes is unsuitable for applications requiring correctness from the cluster orchestrator
i wont even pretend for a little bit that k8s isn't at least half marketing the google brand to get people to work there. SRE as a role sucks rear end in general, the best way to get people to stick with it at your multinational megacorp would be to convince them that they are special in some way

admitting that k8s isn't an ops panacea, or the google might not know the best way to run any infrastructure for any purpose, would ruin that perception

its true that you can work around the problem. it sucks that you work around it with "sleep 3" or by doing way too much loving work compared to any other load balancer to ever exist except maybe the Cisco (R) Catalyst (Copyright 1995) Content Switching Module

FamDav
Mar 29, 2008
did i miss the post where someone said race multiple requests to distinct destinations

Qtotonibudinibudet
Nov 7, 2011



Omich poluyobok, skazhi ty narkoman? ya prosto tozhe gde to tam zhivu, mogli by vmeste uyobyvat' narkotiki

12 rats tied together posted:

i wont even pretend for a little bit that k8s isn't at least half marketing the google brand to get people to work there. SRE as a role sucks rear end in general, the best way to get people to stick with it at your multinational megacorp would be to convince them that they are special in some way

are there experienced SREs in the world that are not disillusioned and resigned to the touch computer forever for capitalism life

ate shit on live tv
Feb 15, 2004

by Azathoth
I think the conversation for the past page or so is pretty interesting and I'll definitely consider the innate limitations of k8s next time someone I'm working with suggests it as a solution.

Gaukler
Oct 9, 2012


Progressive JPEG posted:

switching from a background processing model to a real time SLA is not a slight change in requirements, and it's a shame that their management apparently can't be convinced of that fact

instead they've apparently decided to have nomnom figure out how best to make a best-effort peg fit a guaranteed response hole

I worded that vaguely; I was actually referring to situations I’ve seen. I agree that going from batch to an SLA of a few hundred ms is a big change.

In my case it was for something that went from “best effort or we’ll just use filler content” to “this has to return content”, except in the passive-aggressive way of “yeah we understand that this request may fail” and then loudly escalating a bug every time it failed.

abigserve
Sep 13, 2009

this is a better avatar than what I had before

12 rats tied together posted:

except maybe the Cisco (R) Catalyst (Copyright 1995) Content Switching Module

I once had to replace a pair of these with F5s and one of the business owners legitimately argued with me that I shouldn't do that because the risk of breaking production was too great

the CSS at the time was end-of-support

12 rats tied together
Sep 7, 2006

a huge part of my job in like 20...15? was managing a pair of them in a pair of 6509s. it was really something

they had some config sync poo poo that worked really well actually

abigserve
Sep 13, 2009

this is a better avatar than what I had before
They actually worked totally fine and the config was easy to understand. Far more straightforward than netscalers, I hated managing those.

mod saas
May 4, 2004

Grimey Drawer
does f5 stand for “just keep refreshing until it finally points you to a server that’s actually alive” or is it just that the two instances i have to rely on were configured by clowns?

abigserve
Sep 13, 2009

this is a better avatar than what I had before

mod saas posted:

does f5 stand for “just keep refreshing until it finally points you to a server that’s actually alive” or is it just that the two instances i have to rely on were configured by clowns?

Any load balancing solution requires the person operating it to have a beyond-cursory understanding of the apps they are load balancing, and that's an unreasonable request for a network team that may have to look after several thousand virtual servers, so you get a lot of "tcp port alive" health checks and poo poo like that

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

abigserve posted:

Any load balancing solution requires the person operating it to have a beyond-cursory understanding of the apps they are load balancing, and that's an unreasonable request for a network team that may have to look after several thousand virtual servers, so you get a lot of "tcp port alive" health checks and poo poo like that

even better is the ping health check

Qtotonibudinibudet
Nov 7, 2011



Omich poluyobok, skazhi ty narkoman? ya prosto tozhe gde to tam zhivu, mogli by vmeste uyobyvat' narkotiki

abigserve posted:

Any load balancing solution requires the person operating it to have a beyond-cursory understanding of the apps they are load balancing, and that's an unreasonable request for a network team that may have to look after several thousand virtual servers, so you get a lot of "tcp port alive" health checks and poo poo like that

"requires" here is perhaps more "should have, if you want to balance it well".

you can totally set up a load balancer with little understanding of network protocols or how to troubleshoot them; people do it every day

the results may be less than ideal, but hey, welcome to infrastructure

12 rats tied together
Sep 7, 2006

CMYK BLYAT! posted:

are there experienced SREs in the world that are not disillusioned and resigned to the touch computer forever for capitalism life

the only thing keeping me touching computer is health insurance.

abigserve posted:

so you get a lot of "tcp port alive" health checks and poo poo like that

i do like that i can punt this to the dev team really easy in k8s. i can even just link them api docs, its reasonable to expect developers to be able to read and understand api documentation

we still have a lot of extremely dumb health checks. its either an issue with .net or an issue with the gestalt understanding of running web applications on windows and i am wholly uninterested in finding out which is true
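
to illustrate, the thing you end up asking the devs for is just a probe stanza in the pod spec, something like this (path and port are made up):

code:
readinessProbe:
  httpGet:
    path: /healthz      # app-specific endpoint, hypothetical
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
# vs the "tcp port alive" style check being complained about:
# readinessProbe:
#   tcpSocket:
#     port: 8080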

FalseNegative
Jul 24, 2007

2>/dev/null
I had the CSS too, replaced them with the 6500 ACE modules. The probe config was great and it tied right into the distribution switch vlans. Never once did they have an issue.

abigserve
Sep 13, 2009

this is a better avatar than what I had before
It seems like the 6500 era was when Cisco had all the good engineers, who then left to form competitors. I can't think of a single bad thing to say about that platform.

FalseNegative
Jul 24, 2007

2>/dev/null
I have a love-hate relationship with the 9k series; Cisco still had some good folks. There are so many good ideas but it's tainted by silly crap. No, I don't want to manage 35668 SMUs, tyvm. Also ridiculous of them to call the broadcom NPs Cisco silicon for years. Sure, you guys both used trident, typhoon, tomahawk, ok... Now it's a marketing point on the lightspeed slide deck.

ate shit on live tv
Feb 15, 2004

by Azathoth

abigserve posted:

It seems like the 6500 era was when Cisco had all the good engineers, who then left to form competitors. I can't think of a single bad thing to say about that platform.

well the 6500 era also spanned from like 1996-2009 or something. my opinion is that Cisco jumped the shark right around 2006 (when they first hired me, lol).

abigserve
Sep 13, 2009

this is a better avatar than what I had before
I think the nexus 7k was the first platform that ruffled everyone's feathers. It was such a dramatic architecture change from the 6500 that it was literally "we're going to deprecate this good, working platform, and replace it with something far inferior for no reason"

Like sure, on paper the platform was far more performant and it was much closer aligned with the needs of the DC, but in practice most customers don't need twenty petabits of backplane throughput; what they DO need is l3vpns that work, a solid BGP implementation, a working HA model, etc etc etc

And the thing is, they never really fixed it. That legacy is still around, albeit in the 9K form factor now.

Forums Medic
Oct 2, 2010

i be out there in orbit
don't worry this will all be solved by intent based networking
