jaegerx
Sep 10, 2012

Maybe this post will get me on your ignore list!


The NPC posted:

I don't see this talked about much and assume I am doing something wrong at this point.

Are you kidding? It's all we discuss. You don't limit cpu but you do requests for cpu. Memory should be limited and probably requested.
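In manifest terms, that advice comes out to something like the following (names and values are illustrative): cpu gets a request but no limit, memory gets both.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app             # hypothetical pod
spec:
  containers:
    - name: app
      image: example/app:latest # placeholder image
      resources:
        requests:
          cpu: "500m"           # guaranteed scheduling share; no cpu limit, so it can burst
          memory: "256Mi"
        limits:
          memory: "256Mi"       # memory is finite, so cap it
```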

Adbot
ADBOT LOVES YOU

The NPC
Nov 21, 2010


jaegerx posted:

Are you kidding? It's all we discuss. You don't limit cpu but you do requests for cpu. Memory should be limited and probably requested.

What's the logic behind not limiting cpu? Do you ever run into some app bogging down a whole node? Right now we are setting limits, and have alerting set up to inform us if anything gets throttled for an extended period of time.

jaegerx
Sep 10, 2012

Maybe this post will get me on your ignore list!


The NPC posted:

What's the logic behind not limiting cpu? Do you ever run into some app bogging down a whole node? Right now we are setting limits, and have alerting set up to inform us if anything gets throttled for an extended period of time.

I don't remember the math anymore, but it was like: memory is finite, which we understand, but cpu is elastic, so why restrict poo poo. You set requests and then the pod will always have that much cpu no matter what and can burst if needed, so if a pod started taking a poo poo load of the cpu the other pods would still have their bare minimum.

12 rats tied together
Sep 7, 2006

lots of types of things run totally fine with degraded CPU, agree that you should not set any requests or limits and let the scheduler(s) figure it out

if access to cpu is causing a meaningful problem for your app it should fail its health check, such that it eventually lands on a node without cpu noisy neighbors, and stops failing

if this happens a lot you should consider tainting / tolerating a busycpu tag, or perhaps moving the noisy neighbor off-cluster entirely
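For anyone following along, the taint/toleration flow described above looks roughly like this (the busycpu key/value is made up for illustration):

```yaml
# Taint the nodes hosting CPU-noisy workloads first, e.g.:
#   kubectl taint nodes <node-name> workload=busycpu:NoSchedule
# Then only pods carrying a matching toleration can schedule there:
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "busycpu"
    effect: "NoSchedule"
```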

jaegerx
Sep 10, 2012

Maybe this post will get me on your ignore list!


Dude has a point, you can make a custom scheduler for that poo poo and move it off when it starts doing whatever the gently caress it does.

Methanar
Sep 26, 2013

by the sex ghost
CPU limits are fine. If one app is running hot for some reason (replaying a kafka queue for example), it's not actually a feature that it bleeds over and starts bulldozing over other colocated apps unrestricted.

Let CPU limits contain the blast radius somewhat until HPA or otherwise kicks in, or just let it be restricted as long as necessary.
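As a sketch of the HPA being referred to (apiVersion autoscaling/v2; the deployment name and target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa                  # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                    # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80 # scale out as usage approaches the cpu request
```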

luminalflux
May 27, 2005



The NPC posted:

What's the logic behind not limiting cpu? Do you ever run into some app bogging down a whole node? Right now we are setting limits, and have alerting set up to inform us if anything gets throttled for an extended period of time.

I was stumbling around this the other day. Here's a good post (despite being on reddit) from one of the kubernetes maintainers:

https://www.reddit.com/r/kubernetes/comments/all1vg/comment/efgyygu/

Basically, not setting CPU limits lets you use spare CPU if it's available.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Memory requests and memory limits should generally also be identical: https://home.robusta.dev/blog/kubernetes-memory-limit
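Per the linked post, that guidance amounts to a resources block where the memory numbers match (values illustrative):

```yaml
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "512Mi"   # identical to the request, so memory overcommit can't bite the pod later
```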

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
When first putting CPU limits in place, be aware that it can wreck your latency if your application has more threads than its CPU allocation, which almost everything on the JVM or CLR will.

https://danluu.com/cgroup-throttling/

I do think that in the big picture we are heading for some type of CPU pinning, but most of the orchestration platforms don't do core pinning comfortably yet.
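One hedged mitigation for the JVM case, assuming a reasonably recent HotSpot JVM: tell the runtime explicitly how many CPUs it actually has, so its thread pools size to the quota rather than the node (container name and image are hypothetical; newer JVMs usually detect the cgroup quota themselves).

```yaml
containers:
  - name: jvm-app
    image: example/jvm-app:latest
    resources:
      requests:
        cpu: "2"
      limits:
        cpu: "2"
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:ActiveProcessorCount=2"   # match the cgroup quota explicitly
```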

Junkiebev
Jan 18, 2002


Feel the progress.

Twerk from Home posted:

When first putting CPU limits in place, be aware that it can wreck your latency if your application has more threads than its CPU allocation, which almost everything on the JVM or CLR will.

https://danluu.com/cgroup-throttling/

I do think that in the big picture we are heading for some type of CPU pinning, but most of the orchestration platforms don't do core pinning comfortably yet.

They need to make an nproc equivalent for k8s

E: can you pull limits/requests from the downward api?

Junkiebev fucked around with this message at 08:49 on Dec 2, 2022

Methanar
Sep 26, 2013

by the sex ghost

Junkiebev posted:

They need to make an nproc equivalent for k8s

E: can you pull limits/requests from the downward api?

For most languages there is a library that can help you with that.
https://github.com/uber-go/automaxprocs
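And for the downward API half of the question, the limit can be injected without any library at all, e.g. as an environment variable (container name hypothetical):

```yaml
env:
  - name: CPU_LIMIT
    valueFrom:
      resourceFieldRef:
        containerName: app      # must name the container whose limit you want
        resource: limits.cpu
        divisor: "1"            # value is rounded up to whole cores
```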


Twerk from Home posted:

When first putting CPU limits in place, be aware that it can wreck your latency if your application has more threads than its CPU allocation, which almost everything on the JVM or CLR will.

https://danluu.com/cgroup-throttling/

I do think that in the big picture we are heading for some type of CPU pinning, but most of the orchestration platforms don't do core pinning comfortably yet.

https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
Kubernetes has had CPU affinity pinning for years. I assigned somebody to get this enabled on all of our clusters last month and it's been working great.
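For reference, the static policy behind that feature is a kubelet-level setting, roughly as below (KubeletConfiguration fragment; the reserved CPUs are an example). Note that pinning only applies to Guaranteed-QoS pods requesting integer CPUs.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static        # default is "none"
reservedSystemCPUs: "0,1"       # example: keep cores 0-1 for system daemons
```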

Methanar fucked around with this message at 08:58 on Dec 2, 2022

luminalflux
May 27, 2005



Junkiebev posted:

They need to make an nproc equivalent for k8s

E: can you pull limits/requests from the downward api?

Yes. We do this to pass down limits and requests set on the application container in the pod to our Ansible init container, along with pod labels and annotations. Since Ansible is rendering configuration based on the number of CPUs and amount of memory as an init container, we couldn't use the automaxprocs-style parsing.

(later on we'll get rid of the ansible init container but this is due to keeping configuration similar between our EC2 deploys and k8s deploys while migrating)
code:
  spec:
      volumes:
        - name: podinfo
          downwardAPI:
            items:
              - path: "labels"
                fieldRef:
                  fieldPath: metadata.labels
              - path: "annotations"
                fieldRef:
                  fieldPath: metadata.annotations
              - path: cpu_requests
                resourceFieldRef:
                  resource: requests.cpu
                  containerName: application
              - path: cpu_limits
                resourceFieldRef:
                  resource: limits.cpu
                  containerName: application
      initContainers:
        - name: ansible
          volumeMounts:
            - name: podinfo
              mountPath: /etc/podinfo

luminalflux fucked around with this message at 16:54 on Dec 4, 2022

Junkiebev
Jan 18, 2002


Feel the progress.

luminalflux posted:

Yes. We do this to pass down limits and requests set on the application container in the pod to our Ansible init container, along with pod labels and annotations. Since Ansible is rendering configuration based on the number of CPUs and amount of memory as an init container, we couldn't use the automaxprocs-style parsing.

(later on we'll get rid of the ansible init container but this is due to keeping configuration similar between our EC2 deploys and k8s deploys while migrating)
code:
  spec:
      volumes:
        - name: podinfo
          downwardAPI:
            items:
              - path: "labels"
                fieldRef:
                  fieldPath: metadata.labels
              - path: "annotations"
                fieldRef:
                  fieldPath: metadata.annotations
              - path: cpu_requests
                resourceFieldRef:
                  resource: requests.cpu
                  containerName: application
              - path: cpu_limits
                resourceFieldRef:
                  resource: limits.cpu
                  containerName: application
      initContainers:
        - name: ansible
          volumeMounts:
            - name: podinfo
              mountPath: /etc/podinfo

noice

The NPC
Nov 21, 2010


Thanks for the links and advice everyone. Looks like our use case (chargeback on a shared cluster) is one of the few reasons to set cpu limits.

Junkiebev
Jan 18, 2002


Feel the progress.

The NPC posted:

Thanks for the links and advice everyone. Looks like our use case (chargeback on a shared cluster) is one of the few reasons to set cpu limits.

we get around this at my company with node pools - common pool? lol QOS. dedicated compute? you can only sit on your own balls, but it costs more.

Methanar
Sep 26, 2013

by the sex ghost
current mood: drawing up network diagrams and tables for bgp anycasting 512 /32s from many different k8s clusters across 9 different datacenters through like 16 pops across two continents for 1000 microservices from the datacenters to AWS AND understanding the ecmp and latency implications at all the different points. Except it is worse than that because not all datacenters are made equal in terms of their geographical positioning relative to the AWS region they're affiliated with, which has some Negative Implications for some of the datacenters more than others. I was late in being informed about this significant latency disparity.
This means some DCs may not be suitable for all applications, which our hardware capacity planning never accounted for; while not yet proven to be a problem, it probably is one.


Somewhat related, it turns out what a developer believed was grpc traffic was in fact not grpc traffic at all but regular rpc carrying gob encoding, which is something I cannot l7 route. Investigating and putting in special support for this traffic made the above situation apparent. In hindsight it was always a problem, just one that I'm now aware of and need to deal with. ECMP routing to anycasted routes from AWS into the pops into the DCs, only to land at an IPVS-based l4 router which may then route the traffic further to another datacenter over leased lines direct to other DCs, independent of the pops. So in addition to all of the AWS direct connect routing-isms of this, I need to figure out data-locality in the IPVS loadbalancer, which only went stable in k8s 1.24, which I am nowhere near at the moment because of an entirely unrelated rat's nest of complicated reasons. What a mess

Methanar fucked around with this message at 11:00 on Dec 8, 2022

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine
lmao if you made pods routable. just lmao

Methanar
Sep 26, 2013

by the sex ghost

my homie dhall posted:

lmao if you made pods routable. just lmao

pod native routing is good. but no

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

pod native routing is good. but no

for the birds!

what loadbalancer solution are you using that handles ecmp properly?

Methanar
Sep 26, 2013

by the sex ghost

my homie dhall posted:

for the birds!

what loadbalancer solution are you using that handles ecmp properly?

There is no real 'loadbalancer' at all here.
MetalLB is present in multiple DCs advertising up a given /32 to their local DC network. The different DC networks then have multiple routes in the table for reaching the /32. These multiple routes propagate up to the different POPs where, by BGP path selection tuning, some POPs will prefer some DCs over others. At each of these POPs there are directConnects to AWS where we peer with our AWS VPC to populate the routeTables of the VPC with our DC/POP BGP routes.

The effect is that the routeTables of AWS are responsible for the ECMP aspect of seeing 'hey I have three ways of getting to 10.0.0.1/32, all are ECMP so I'll just pick one based on my address tuple hash'. From there, the POPs have their biased routing to the compute DCs.

Simplified view.
code:
VPC -> POP1 -> DC1 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
               ||
    -> POP2 -> DC2 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
               ||
    -> POP3 -> DC3 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
               ||
    -> POP4 -> DC3 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
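For readers unfamiliar with MetalLB's BGP mode, the advertisement side of a setup like this is configured with CRDs roughly as follows (ASNs and addresses invented for illustration):

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: svc-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.0.0.1/32              # the anycasted service IP from the diagram
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: dc-router
  namespace: metallb-system
spec:
  myASN: 64512                 # invented private ASN
  peerASN: 64513               # invented; the local DC router's ASN
  peerAddress: 192.0.2.1       # documentation address standing in for the DC router
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: svc-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - svc-pool
```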

Methanar fucked around with this message at 19:56 on Dec 8, 2022

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Methanar posted:

There is no real 'loadbalancer' at all here.
MetalLB is present in multiple DCs advertising up a given /32 to their local DC network. The different DC networks then have multiple routes in the table for reaching the /32. These multiple routes propagate up to the different POPs where, by BGP path selection tuning, some POPs will prefer some DCs over others. At each of these POPs there are directConnects to AWS where we peer with our AWS VPC to populate the routeTables of the VPC with our DC/POP BGP routes.

The effect is that the routeTables of AWS are responsible for the ECMP aspect of seeing 'hey I have three ways of getting to 10.0.0.1/32, all are ECMP so I'll just pick one based on my address tuple hash'. From there, the POPs have their biased routing to the compute DCs.

Simplified view.
code:
VPC -> POP1 -> DC1 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
               ||
    -> POP2 -> DC2 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
               ||
    -> POP3 -> DC3 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
               ||
    -> POP4 -> DC3 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
everything about this sounds like an absolute loving nightmare

Methanar
Sep 26, 2013

by the sex ghost

Vulture Culture posted:

everything about this sounds like an absolute loving nightmare

Yeah kind of, but it'll get me promoted, so.

This is the best I've got for the intermediary phases of migrating workloads out of AWS into on-prem environments. Shuttling everything through ingress-nginx is not going to work for a lot of reasons. There are going to be times where we need to go from AWS -> DC -> AWS because of how interconnected things can get; add in postgres, mysql, redis, memcache, elasticsearch, cassandra, kafka, s3, and others, and it becomes an exponential mess of things that must be incrementally moved over, unravelling the dependency graph every step of the way. This is going to be a 2-year-long company-wide project of moving 1000+ microservices and all of the middleware and databases they consume over, so I need to really make sure whatever pattern I put together is going to work in all of the unforeseen cases yet to come. I've never had more authority and weight in influencing the direction of large scale projects than I do right now.

The multi-DC and multi-pop stuff is less than ideal and full of more beartraps than I wanted there to be. For years now, peers and the DC teams have insisted to me that it's actually good and cool to assume that the DC networking is a blackbox and will Just Work. Which okay, I guess is fine for smaller use cases.

But are you sure?
Are you really fuckin sure?
Are you 'bet the company on it' sure?

I suddenly start caring a lot more about the details when total traffic volumes are going to be measured in 100s of gbps and it's going to be my name at the end of the escalation chain for the infra 500+ developers are going to depend on, to an even greater extent than what it already is.

I've already been burned by some of these latency details for work in the past few months for supporting apache spark in the datacenter. The shuffle phase of map reduce was very not cool with even 5ms of latency incurred by crossing DC boundaries. So this is very much a real thing that I'm going to have to be worrying about. It may not always be as easy to solve as to say set a nodeAffinity of just DC1 on application 1.

I know how my organization communicates. Service owners are going to move stuff over, suddenly have garbage performance for some indeterminate reason. Those issues will bubble up to their local director as a blocker for the high visibility efforts. Their director will tell my director, and my director will tell my boss to tell me to deal with it. It is guaranteed that this pattern will continue, so I'm trying to get ahead of it and understand the situation myself as much as possible before it turns into another 4-month hell-crunch of me dealing with constant P0 high org-visibility fires on the spot.

I can't fully articulate the hypercube of complexity of what's all going on here in a forums post.

So yeah, a little bit of an absolute loving nightmare.

MightyBigMinus
Jan 26, 2020

sure but latency is a function of distance so none of the rube goldberg poo poo is going to matter

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

There is no real 'loadbalancer' at all here.
MetalLB is present in multiple DCs advertising up a given /32 to their local DC network. The different DC networks then have multiple routes in the table for reaching the /32. These multiple routes propagate up to the different POPs where, by BGP path selection tuning, some POPs will prefer some DCs over others. At each of these POPs there are directConnects to AWS where we peer with our AWS VPC to populate the routeTables of the VPC with our DC/POP BGP routes.

The effect is that the routeTables of AWS are responsible for the ECMP aspect of seeing 'hey I have three ways of getting to 10.0.0.1/32, all are ECMP so I'll just pick one based on my address tuple hash'. From there, the POPs have their biased routing to the compute DCs.

Simplified view.
code:
VPC -> POP1 -> DC1 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
               ||
    -> POP2 -> DC2 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
               ||
    -> POP3 -> DC3 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)
               ||
    -> POP4 -> DC3 -> 10.0.0.1/32 -> K8s Service object (may route to another DC yet further, until I figure out k8s 1.25 topologyKeys)

love too build multiple layers of redundancy into the network and then terminate in a single metallb instance lol

Junkiebev
Jan 18, 2002


Feel the progress.

MightyBigMinus posted:

sure but latency is a function of distance so none of the rube goldberg poo poo is going to matter

in a world other than this, simply stating this might matter

Junkiebev
Jan 18, 2002


Feel the progress.

"how can we cut the latency between London and SGX in half?"
"errr - Plate Tectonics?"

Methanar
Sep 26, 2013

by the sex ghost

MightyBigMinus posted:

sure but latency is a function of distance so none of the rube goldberg poo poo is going to matter

Junkiebev posted:

"how can we cut the latency between London and SGX in half?"
"errr - Plate Tectonics?"

Thankfully there is not a need to actually send traffic cross-region in a way I care about. Intra-region traffic of the many regions is the focus. If somebody does send traffic from LAX to FRA, then you'll just need to deal with the speed of light - sorry.
The point of all of the rube goldberg is to reduce the latency hit in the cases it matters, and to ensure that our DC hardware capacity planning and traffic patterns are compatible with each other. Also I want to (need to) not use ingress-nginx at all for anything actually pushing real data volume and this is my excuse to push that through.

The emphasis being again ensuring things are understood, and built out in a way that can be administratively influenced when and if necessary. I am not comfortable whatsoever trusting the DC networks as being blackboxes given the stakes, and my on-call and escalation position for the relevant infra.

I hope it's not too hard to believe that I care about my responsibilities here and just want to do a good job. I'm a point of contact and escalation point for half a dozen efforts right now in one way or another and it's just all A Lot.

my homie dhall posted:

love too build multiple layers of redundancy into the network and then terminate in a single metallb instance lol

This won't happen either! BGP is pretty good at scaling as shown by the whole internet thing. I can run any number of BGP speakers I want, in as many DCs as I want, with each POP with some arbitrary biased distribution down to the compute DCs as I want. There will not be any singular metalLB I can't live without :). The individual /32 advertisements really are not the main source of complexity here: I'm not exactly re-inventing Cloudflare. Ideally the vast overwhelming majority require no special treatment at all and can just route as they may.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
This is by and large fascinating to read, and I’m largely following it, which is also fun! I have a few questions though, bearing in mind I don’t have any experience with datacenter networking.

By and large you’ve described a flexible and redundant network model, with metallb giving you consistent per connection load balancing so within the context of a connection you always get routed to the same pod.

Just to be clear, I thought that metallb when operating in BGP mode advertises a /32 route per service with a LoadBalancer type. Your diagram of the flow is slightly unclear: when you're routing from VPC -> POP -> DC A 10.0.0.1/32 -> k8s service in DC A…Z, does 10.0.0.1/32 represent the downstream k8s service's endpoint, or are you doing service to service routing with custom endpoint slices?

I’m assuming the former because you would have complained about it by now otherwise but wanted to be clear.

Several Qs:
- Do you have a separate address pool per DC or do you share a single address pool across clusters and datacenters? I didn’t see anything in metallb’s docs about cross cluster address pooling, but I’ve only briefly scanned the docs.
- you mentioned that you can get to the same service via multiple POPs and multiple DCs. I gather you’re relying on your DC’s internetworking here? You’re advertising to just the local router, so this advertisement would need to be propagated to your POPs and DC peer routers, yes?
- Can any node access any service (assuming the appropriate policy) in another DC by going DC -> DC, or DC -> POP —> DC, or do you have to go through AWS and back through the whole POP -> DC -> LB chain? I can’t imagine the latter.
- Do you run with local or cluster traffic policies at scale? I’d normally prefer the impaired performance of cross-node kube proxying rather than needing to be significantly more conscious of which nodes my pods get allocated to but maybe that changed at your scale.
- do you have services (whether represented by k8s services or not) that run in multiple datacenters at once?
- all the above works for L4 routing. What do you do when you need the semantics and application aware logic of an ingress? Advertising the ingress controller as a service and routing to a separate DC-internal service for your app?
  - Could you go into more depth about the performance limitations of nginx-Ingress you’ve experienced?
- if I’m in AWS region A, ECMP is great for load balancing… but I still probably want to go to the POP associated with the DC in region A for lowest latency. The POPs use BGP path selections to route to their local DCs, but how do your VPCs route to their closest POPs? Or is each VPC direct connected to exactly and only one POP?
- For making the BGPAdvertisement of the service to your local DC resilient, are you relying solely on running multiple metallb BGP speaker instances within a single cluster?
- you have many k8s clusters across all your DCs. Do any clusters span multiple DCs?

I have probably a dozen more questions but those are the big ones I can think of at 2am I think.

The Iron Rose fucked around with this message at 08:44 on Dec 12, 2022

my homie dhall
Dec 9, 2010

honey, oh please, it's just a machine

Methanar posted:

This won't happen either! BGP is pretty good at scaling as shown by the whole internet thing. I can run any number of BGP speakers I want, in as many DCs as I want, with each POP with some arbitrary biased distribution down to the compute DCs as I want. There will not be any singular metalLB I can't live without :). The individual /32 advertisements really are not the main source of complexity here: I'm not exactly re-inventing Cloudflare. Ideally the vast overwhelming majority require no special treatment at all and can just route as they may.

I just meant when one of your speakers goes down, all of its connections are going to go with it. It sounds like you are distributing services across multiple speakers/IPs though, so this may be tolerable for you.

cool av
Mar 2, 2013

lmfao microservices

Methanar
Sep 26, 2013

by the sex ghost

The Iron Rose posted:

Just to be clear, I thought that metallb when operating in BGP mode advertises a /32 route per service with a LoadBalancer type. Your diagram of the flow is slightly unclear: when you're routing from VPC -> POP -> DC A 10.0.0.1/32 -> k8s service in DC A…Z, does 10.0.0.1/32 represent the downstream k8s service’s endpoint, or are you doing service to service routing with custom endpoint slices?

I’m assuming the former because you would have complained about it by now otherwise but wanted to be clear.


It terminates to the service IP. No magic with endpoint slices necessary. Any special treatment at the Service object level should be addressable by the topologyKeys feature in 1.25, when I get there.

Several Qs:
- Do you have a separate address pool per DC or do you share a single address pool across clusters and datacenters? I didn’t see anything in metallb’s docs about cross cluster address pooling, but I’ve only briefly scanned the docs.

Different subnet per cluster

- you mentioned that you can get to the same service via multiple POPs and multiple DCs. I gather you’re relying on your DC’s internetworking here? You’re advertising to just the local router, so this advertisement would need to be propagated to your POPs and DC peer routers, yes?

The IP 10.0.0.1 is present within each DC and serviceable therein. In the event that DC1 needs to speak to 10.0.0.1, the local routers have a route that will prefer to route to the local IP. In the event something goes bad and the local 10.0.0.1 falls out of the table, it can route by inter-DC connections to another DC to again reach some microservice hopefully still able to process 10.0.0.1.

It's not just that each DC knows how to get to 10.0.0.1. It's that each DC is 10.0.0.1 and has replicas of the microservices and databases and kafka queues, etc and is capable of processing 10.0.0.1-destined traffic on its own.

Somewhat important is that I'm spanning each k8s cluster across multiple DCs, each DC with worker nodes full of microservices and local BGP speakers present and all advertising the same information.


- Can any node access any service (assuming the appropriate policy) in another DC by going DC -> DC
Yes. But it should very rarely need to. Zone awareness and anycasting should strongly prefer that traffic remain intra-DC as much as possible


- Do you run with local or cluster traffic policies at scale? I’d normally prefer the impaired performance of cross-node kube proxying rather than needing to be significantly more conscious of which nodes my pods get allocated to but maybe that changed at your scale.

externalTrafficPolicy: Local is leveraged in some niche cases, but I don't like it because of how much more complicated it makes things. I want to remove all of them except for the one that handles ingress-nginx.

Other than that, topologyKeys when I get to k8s 1.25 is the only other traffic policy element I'm interested in at the moment.
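For context, that knob lives on the Service object itself; a minimal sketch (service name and ports hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app                      # hypothetical service
spec:
  type: LoadBalancer             # the IP MetalLB advertises via BGP
  externalTrafficPolicy: Local   # only nodes with a local endpoint answer, avoiding an extra hop
  selector:
    app: app
  ports:
    - port: 443
      targetPort: 8443
```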


- do you have services (whether represented by k8s services or not) that run in multiple datacenters at once?
All of them.

- all the above works for L4 routing. What do you do when you need the semantics and application aware logic of an ingress? Advertising the ingress controller as a service and routing to a separate DC-internal service for your app?

I say that's not my team's responsibility. There are numerous cases where application-aware logic does exist, and it's implemented in a dedicated microservice that effectively acts as an l7 reverse proxy. Example: receiving and inspecting RPC traffic or proprietary line protocol and routing it to a particular shard/particular kafka queue/particular set of other microservices based on a customer identifier.

That sort of special app-awareness gets special microservices written just for that purpose.
Really though, this is a very expensive solution and one we want to move away from. Proxying the traffic volumes we deal with is ridiculously expensive, and we want to move towards a client-aware model as much as possible where, rather than dataplane loadbalancing, the clients can intelligently know where to send the traffic without the $$$middleman$$$ doing it for them. This can be done with special Consul service discovery mechanisms. This is a long term project, though.

I'm not allowed to tell you how much money we spend on east-west ELBs/ router microservices that can be replaced by special consul service discovery that takes out the dataplane middle man.

There are a small number of simpler cases like wanting to route based on certain http headers, or special session/cookie affinity or http path routing rules to different ports. For those cases, I might just need to continue to leverage ingress-nginx. As far as I know offhand these cases aren't egregious in their traffic volumes so it should be fine to do similar to what you described.



- Could you go into more depth about the performance limitations of nginx-Ingress you’ve experienced?
Not really, because I don't understand it. All I know is that my nginx performance is about 10% of advertised public benchmarks. But I strongly suspect it's caused by a security kernel module we run that most people do not.
I strongly suspect that certain network syscalls are stalled because of this security module, but I've never proven it, and neither has the person I assigned to investigate it and file a bug report over it.


- if I’m in AWS region A, ECMP is great for load balancing… but I still probably want to go to the POP associated with the DC in region A for lowest latency. The POPs use BGP path selections to route to their local DCs, but how do your VPCs route to their closest POPs? Or is each VPC direct connected to exactly and only one POP?


> how do your VPCs route to their closest POPs?
Each VPC has multiple POP connections. And all are equally weighted down to the pop.

It's at the POP level that traffic is firmly within our ASNs and we can influence traffic down to the relevant compute DCs. Each POP has multiple connections to many compute DCs. No routing logic lives at the VPC-to-POP step other than naive ECMP to get the traffic into our hands in the first place.
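That "naive ECMP" step is just per-flow hashing across equal-weight links. A toy Python version of what the routers are doing (the hash function and POP names here are made up; real gear uses its own hardware hash over the same kind of flow key):

```python
import hashlib

def ecmp_next_hop(links, src, dst, sport, dport, proto="tcp"):
    """Pick one of several equal-cost links by hashing the flow 5-tuple.
    Deterministic per flow, so all packets of one connection stay on one
    link, while different flows spread across all links."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return links[digest % len(links)]

pops = ["pop-east", "pop-west", "pop-central"]  # hypothetical POP links
flow = ("10.1.2.3", "203.0.113.7", 49152, 443)

# The same flow always hashes to the same POP link:
assert ecmp_next_hop(pops, *flow) == ecmp_next_hop(pops, *flow)
```

This also shows why ECMP alone can't do latency-aware selection: the hash has no idea which link leads to the closest POP, it only spreads flows evenly.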


- For making the BGPAdvertisement of the service to your local DC resilient, are you relying solely on running multiple metallb BGP speaker instances within a single cluster?

Yes. But as noted above, each k8s cluster spans multiple DCs, so each DC has multiple local BGP speakers within it.

— you have many k8s clusters across all your DCs. Do any clusters span multiple DCs?
All of them, as a rule.
I was strongly against running a standalone k8s cluster in each DC rather than spanning, because it would have been an administrative nightmare. All of my experience with trying to federate k8s (even in ways that aren't technically federating k8s) has been negative. It's much easier to just span a cluster and use the regular cluster primitives to configure things as necessary.


cool av posted:

lmfao microservices

Methanar fucked around with this message at 05:09 on Dec 13, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I continue to learn valuable things from you Methanar. I greatly appreciate you sharing these details and insights.

FamDav
Mar 29, 2008
can i short your company

12 rats tied together
Sep 7, 2006

ECMP routing is fine in general. ECMP routing to multihomed k8s clusters running applications in arbitrary locations across unequal-cost links, well,

cool av posted:

lmfao microservices

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Methanar posted:

This can be done with special Consul service discovery mechanisms.
anything but Consul

Hadlock
Nov 9, 2004

Vulture Culture posted:

anything but Consul

The only time I used Consul in production was as a backend to Vault. The guy before me had set it up and not installed any monitoring, particularly of disk usage. Some process had gone rogue and was slowly filling the disk. Somehow Vault managed to write half a value to Consul before the disk filled up, literally the last few bits on the disk. Anyway, it crashed, and it turns out Vault decrypts all its values before posting a ready status and, surprise, couldn't decrypt a truncated ciphertext, so it would fall over. I forget exactly how I found the particular string; some script that removed and replaced the values in Consul one by one and tried to start Vault, or something.

That was the week I decided cloud-managed services like KMS were probably worthwhile. While I was busy working out how to decrypt all our production secrets, our CI/CD system was down and my boss was drawing up backup plans for harvesting secrets from the memory of running software.

Presumably Consul syncs values character by character across the cluster. I can't think of any other reason why the code wouldn't just validate that there's enough space to write the value and then write the whole thing in one go.
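For what it's worth, the standard defense against exactly this half-write failure mode (a generic sketch, not a claim about what Consul does internally) is to stage the complete value in a temp file and atomically rename it into place, so a full disk fails the write before any reader can ever see a truncated value:

```python
import os
import tempfile

def atomic_write(path, data):
    """All-or-nothing file write: write the complete value to a temp file in
    the same directory, fsync it, then rename over the target. If the disk
    fills up mid-write, the rename never happens and the old value stays
    intact -- no reader ever sees half a value."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

The temp file has to live in the same directory as the target because `os.replace` is only atomic within a single filesystem.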

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Hadlock posted:

The only time I used Consul in production was as a backend to Vault. The guy before me had set it up and not installed any monitoring, particularly of disk usage. Some process had gone rogue and was slowly filling the disk. Somehow Vault managed to write half a value to Consul before the disk filled up, literally the last few bits on the disk. Anyway, it crashed, and it turns out Vault decrypts all its values before posting a ready status and, surprise, couldn't decrypt a truncated ciphertext, so it would fall over. I forget exactly how I found the particular string; some script that removed and replaced the values in Consul one by one and tried to start Vault, or something.

That was the week I decided cloud-managed services like KMS were probably worthwhile. While I was busy working out how to decrypt all our production secrets, our CI/CD system was down and my boss was drawing up backup plans for harvesting secrets from the memory of running software.

Presumably Consul syncs values character by character across the cluster. I can't think of any other reason why the code wouldn't just validate that there's enough space to write the value and then write the whole thing in one go.
This is why HashiCorp has enterprise support, though. In a situation like this, you can have an engineer join your debugging session live and async Slack their coworkers about what to do, then slowly ask you for more difficult-to-obtain diagnostic data while not resolving the issue over a span of 24 days

12 rats tied together
Sep 7, 2006

if you're lucky, and you're also one of their largest customers, you might only be down for 72 contiguous hours, and be experiencing that outage during a limited-time marketing campaign.

maybe the timing of your outage, combined with its proximity to the campaign, convinces your player base that the outage is your advertiser's fault, which pisses them off even more: now not only are they not getting their contractually obligated in-game advertising, you've also mobilized your userbase against them

tortilla_chip
Jun 13, 2007

k-partite
Every Consul customer of sufficient scale invents their own caching layer.


Methanar
Sep 26, 2013

by the sex ghost
https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/

jesus christ
