|
The NPC posted:If you are storing team info in metadata, what are your namespace naming conventions? If there isn't team1-redis and team2-redis how do you prevent collisions? Typically where I work a service will own its own cache or data stores in K8s, so it doesn't make sense to name a namespace after the software powering that cache, let alone to have that cache in a separate namespace from the application consuming it. If the service name is bombadier, the namespace would be bombadier or bombadier-<env>. Within that namespace you would have your app deployment and your cache store(s) defined. A database team managing a data store would either offer a shared data store in their own namespace, or they would have an operator that is allowed to create a data store in a service's namespace. Exceptions always exist. We try our best to steer teams towards best practices, but we still allow people enough rope to learn lessons the hard way while we silently judge, tapping on the best-practices docs they chose to ignore. At the end of the day it's their decisions that typically cause downtime, not ours.
|
# ? Mar 20, 2024 06:17 |
|
Hadlock posted:Deployments are plenty enough organizational division in 85% of cases
|
# ? Mar 20, 2024 12:11 |
|
Hadlock posted:If the zookeeper app needs redis, there's a redis deployment in zookeeper-dev, zookeeper-staging, and zookeeper-prod namespaces (prod should be on a different cluster). If the platform team or the backend team owns zookeeper, that's fine, just update rbac for that user group. So I read something like this and it seems utterly insane. How many nodes do you have in your cluster? What kind of SLAs do you have? How reliable are your infra services (redis, memcache, etc.)? What kind of resources are you throwing at your shared stuff like coredns? How do you not run into noisy neighbor issues all the time?
|
# ? Mar 20, 2024 12:47 |
|
Blinkz0rz posted:So I read something like this and it seems utterly insane. How many nodes do you have in your cluster? What kind of SLAs do you have? How reliable are your infra services (redis, memcache, etc.)? What kind of resources are you throwing at your shared stuff like coredns? How do you not run into noisy neighbor issues all the time? I thought it was me, but a setup like that looks like a recipe for disaster for the companies I've worked for. So I'm really curious what kind of scale you're running this at. We usually had teams running 1 or more services. Each team ran their entire setup, so no shared resources. By default we didn't want teams to access each other's services unless explicitly exposed. Most services within a team did need to access each other. For us it made sense to give each team their own namespace and keep things separated from other teams while giving them more or less free rein within their own namespace (guardrails were of course in place)
|
# ? Mar 20, 2024 20:40 |
|
Annoyed at the terraform AWS provider devs today. They released a new minor version that "fixes" an issue where you could add the same route to a VPC route table multiple times. Which, yeah, that probably shouldn't be allowed. But in practice it didn't hurt anything; it's not like you ended up with multiple routes in reality. AWS just silently ignored the subsequent attempts to create dupes. Now your terraform apply hard fails on the same code. A module we wrote had a bug and was creating some harmless dupe routes. I tried to upgrade the provider today and it broke the module. If I remove one of the duplicate declarations, terraform wants to delete the routes. A second plan/apply will restore them since the other declaration is still present, but this still means eating a 30 second network outage. I tried some fuckery with their moved{} syntax but it didn't help in this case; TF still insists on deleting the routes. The best workaround I came up with is manually doing a "terraform state rm" on the resources I'm deleting from the code first, so terraform doesn't want to delete them from AWS too. I can pin the provider to the old version for a while but that's obviously not a long-term solution. All of this sucks. The change they've made is ~technically correct~ but it was not causing any issues whatsoever in practice. Why the hell would you stick this in a 0.01 point release and not sit on it until the next major version with all your other breaking changes?
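For anyone hitting the same thing, the state-rm workaround described above looks roughly like this (the resource address is made up for illustration; yours will differ):

```
# Remove the duplicate route from state only, so terraform forgets
# about it without touching the real route in AWS
terraform state rm 'module.vpc.aws_route.dupe[1]'

# Then delete the duplicate declaration from the code and confirm
# the plan no longer wants to destroy anything
terraform plan
```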
|
# ? Mar 20, 2024 21:59 |
|
Docjowles posted:Annoyed at the terraform AWS provider devs today. They released a new minor version that "fixes" an issue where you could add the same route to a VPC route table multiple times. Which, yeah, that probably shouldn't be allowed. But in practice it didn't hurt anything, it's not like you ended up with multiple routes in reality. AWS just silently ignored the subsequent attempts to create dupes. Now your terraform apply hard fails on the same code. OpenTofu lets you use a removed block for this use case, no idea why Terraform hasn't added this
|
# ? Mar 20, 2024 22:13 |
|
vanity slug posted:OpenTofu lets you use a removed block for this use case, no idea why Terraform hasn't added this drat that is really nice, thanks for the tip. I'm aware of opentofu but had not been following it closely. Didn't realize they had progressed from being a simple fork for license reasons to actually adding sweet new features.
|
# ? Mar 20, 2024 22:55 |
|
vanity slug posted:OpenTofu lets you use a removed block for this use case, no idea why Terraform hasn't added this How is OpenTofu maturing? Last time I checked they were adding features like crazy.
|
# ? Mar 20, 2024 22:55 |
|
Docjowles posted:drat that is really nice, thanks for the tip. I'm aware of opentofu but had not been following it closely. Didn't realize they had progressed from being a simple fork for license reasons to actually adding sweet new features. Just in time: https://www.techtarget.com/searchitoperations/news/366574475/HashiCorp-stock-rises-users-hearts-fall-on-sale-report Hashicorp is apparently shopping for a buyer to go private. The article doesn't say much more, but one potential buyer, the article speculates without citing evidence, is Broadcom
|
# ? Mar 21, 2024 16:43 |
|
Vulture Culture posted:I agree with the rest of your post, but could you clarify this? K8s RBAC is a problem that leaks sewage all over use cases that rely on partial match. Trying to convey that inside a namespace, the largest divisional classification you need is a deployment for accessory services like redis etc. I wouldn't put all the services together in a flat hierarchy inside a single namespace, but also wouldn't split service-specific accessories out into their own namespace Blinkz0rz posted:So I read something like this and it seems utterly insane. How many nodes do you have in your cluster? What kind of SLAs do you have? How reliable are your infra services (redis, memcache, etc.)? What kind of resources are you throwing at your shared stuff like coredns? How do you not run into noisy neighbor issues all the time? My current job everything runs on like 15 pods across four nodes, so everything works out of the box, batteries included, which is why I wanted to find a reference architecture; it's small enough that not much needs to be modified to work

Previous job was e-commerce and we were running about 140-180 pods 60% of the time and would burst to ~500 based on whatever the marketing people were doing that day, spread across 12-25 nodes. We guaranteed 2 9s uptime during daytime hours and 98% off peak, but I don't think we ever went below 2 9s except for the memcache issue. We ran single-node(!) redis and an HA triplet of memcache; we only had a single memcache outage in two years: someone shipped new code that was heavily reliant on memcache, and during a burst period we exceeded the 10Gb/s network of the node long enough for Amazon to shut it off. No other issues

Also had a couple other services we inherited from a siloed team after that VP rage quit, some kind of custom zendesk plugin for the call center to do customer lookup and order status; the way it was designed, it was a pair of services with a bunch of unnecessary loopback calls that needed some assistance from our group to get working, but otherwise nothing exotic

Everything else lived in either a very sedate management/tooling cluster of ~8 nodes, the dev cluster (which did get noisy from time to time), or the "bombing range". The bombing range had no SLA and you could do whatever you wanted at any time; my solution to fixing that cluster was deleting it in terraform, then adding it back, and if there were complaints, tapping the "no SLA" sign
|
# ? Mar 21, 2024 17:02 |
|
Apparently base Terraform has also had that "removed" feature since January. I have no idea why the hell it didn't come up in my google searching, but it's in tf 1.7. Did they crib it from opentofu or the other way around?
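For reference, the removed block shape in Terraform 1.7+ / OpenTofu looks like this (the resource address is illustrative, not anything from the posts above):

```
removed {
  from = aws_route.dupe

  lifecycle {
    destroy = false # drop it from state, but keep the real route in AWS
  }
}
```

Unlike "terraform state rm", this is declarative, so it goes through plan/apply and code review like everything else.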
|
# ? Mar 21, 2024 17:24 |
|
Hadlock posted:Trying to convey that inside a namespace, the largest divisional classification you need is a deployment for accessory services like redis etc. I wouldn't put all the services together in a flat hierarchy inside a single namespace, but also wouldn't split service-specific accessories out into their own namespace. Oh ok yeah then based on that, definitely not insane. For one small area of our product we have something like 4000-ish application pods alone running on 175 nodes, which I'm aware definitely warps my perspective on what's reasonable in terms of cluster topology and shared dependency resourcing.
|
# ? Mar 21, 2024 21:53 |
|
Ah yeah, for a dozen node cluster that sounds perfectly fine. Scaling to dozens/hundreds of nodes adds a few other dimensions to problems.
|
# ? Mar 21, 2024 22:02 |
|
vanity slug posted:OpenTofu lets you use a removed block for this use case, no idea why Terraform hasn't added this fwiw, this is available in old school Terraform as well (since v1.7): https://developer.hashicorp.com/terraform/language/resources/syntax#removing-resources
|
# ? Mar 22, 2024 01:44 |
|
Has anyone come up with a use for the new S3 directory buckets? I guess they use the S3 Express One Zone storage class. Seems like you could use it sort of like a medium-latency (0-9ms) value store, or for caching file objects for applications
|
# ? Mar 22, 2024 21:22 |
|
Has anyone here worked with Azure Functions and Cosmos DB? I don't know if I'm finally losing it, but I find the concept, or at least the ability to implement a binding, freaking impossible. All I am trying to do is have my function app with an HTTP Trigger query a single row (or document, or whatever Cosmos DB calls it) and increase its value. For whatever reason the current tutorials no longer work, and I don't get how I'm supposed to decipher their documentation. Where does the code go exactly? How do I interpret the below article? Or is it because I am not a dev and don't know enough C#? Azure Cosmos DB trigger and bindings
|
# ? Mar 25, 2024 19:48 |
|
Gucci Loafers posted:Has anyone here worked with Azure Functions and Cosmos DB? I don't know if I'm finally losing it, but I find the concept, or at least the ability to implement a binding, freaking impossible. All I am trying to do is have my function app with an HTTP Trigger query a single row (or document, or whatever Cosmos DB calls it) and increase its value. For whatever reason the current tutorials no longer work, and I don't get how I'm supposed to decipher their documentation. Where does the code go exactly? How do I interpret the below article? Or is it because I am not a dev and don't know enough C#? Every time I've done anything more than 'put object' to cosmosdb in functions, I've created a CosmosClient using the sdk rather than using bindings. I'm not sure how much overhead this adds if you're doing something like durable functions, but it's easier for me than messing with extra inbound and outbound bindings. https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/quickstart-dotnet?pivots=devcontainer-codespace#authenticate-the-client
|
# ? Mar 25, 2024 20:29 |
|
Me: debugs anything involving publicly hosted anything It: unreliable bizarre behavior Me: I'm not sure why, but 60% of the time "empty cache and hard reload" fixes it 20% of the time walking away, doing the dishes, empty cache and reload fixes it The other 20% of the time: actually a configuration problem For whatever reason the old existing cloudfront waf/ip whitelist works perfectly, the apparently identical one built using terraform, absolutely does not work 5+ minute cycle time is murder for testing, too
|
# ? Mar 25, 2024 23:12 |
|
Gucci Loafers posted:Has anyone here worked with Azure Functions and Cosmos DB? I don't know if I'm finally losing it, but I find the concept, or at least the ability to implement a binding, freaking impossible. All I am trying to do is have my function app with an HTTP Trigger query a single row (or document, or whatever Cosmos DB calls it) and increase its value. For whatever reason the current tutorials no longer work, and I don't get how I'm supposed to decipher their documentation. Where does the code go exactly? How do I interpret the below article? Or is it because I am not a dev and don't know enough C#? Should you not use a bus of some sort for this? Distributed writes make me nervous in any "eventually-consistent" datastore.
|
# ? Mar 26, 2024 00:39 |
|
Thread opinions on terminating TLS at the load balancer, or at the pod? We don't have any need to do packet inspection between the LB and the pod Presumably terminating at the pod is best practice Hadlock fucked around with this message at 07:56 on Mar 30, 2024 |
# ? Mar 30, 2024 07:54 |
|
if you have encryption on your pod network then either is okay, but otherwise, terminating at the lb means you will be sending unencrypted traffic on your local network. ime everyone does it though
|
# ? Mar 30, 2024 08:26 |
|
I make my users terminate at the pod. Nothing to do with security, the policy is 100% laziness because I don't wanna manage certs for them. xzzy fucked around with this message at 15:17 on Mar 30, 2024 |
# ? Mar 30, 2024 14:38 |
We terminate at the LB because it’s better for us to manage the cert infrastructure and expose simple bindings to our devs than to expect them to roll their own bullshit across however many hundreds of microservices we run. Everything inside the environment is plain http and nobody has to gently caress around with SSL connections and all that headache when talking to other internal services and troubleshooting them. Probably more secure to keep it centrally managed, standardized, and observable than to keep tabs on every dev team’s cert implementations.
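As a concrete sketch of that LB-termination pattern, a Kubernetes Ingress can hold the cert while the backend stays plain HTTP (the hostname, secret name, and port here are hypothetical; the bombadier name is borrowed from the naming discussion earlier in the thread):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bombadier
spec:
  tls:
  - hosts:
    - bombadier.example.com
    secretName: bombadier-tls   # cert lives here; dev teams never touch it
  rules:
  - host: bombadier.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: bombadier
            port:
              number: 8080      # plain HTTP behind the load balancer
```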
|
|
# ? Mar 30, 2024 15:09 |
|
If you’re handling PII or you’ve got a reliable, well used, integrated, and supported PKI, then you should terminate at the pod. Otherwise it’s easier to terminate at the LB and let your cloud provider deal with certs.
|
# ? Mar 30, 2024 15:40 |
|
We use self signed certs for internal traffic Our root is automatically installed and teams can self service their certificates with terraform/venafi
|
# ? Mar 30, 2024 15:47 |
|
We use istio and for the tls stuff it works great. Scaling istio to our capacity on the other hand...
|
# ? Mar 30, 2024 15:52 |
|
Blinkz0rz posted:We use istio and for the tls stuff it works great. Scaling istio to our capacity on the other hand... Just curious, what scale are you running Istio at? We have individual clusters with 6-8k pods (~tens to low hundreds of clusters total) and even then scaling Istio is fun. Biggest offenders were big global namespaces with a high rate of churn, where updates in the mesh need to be propagated to a lot of other peers. A lot of effort is being put into that overall, and we still need to do more to feel that we’re ahead of potential issues.
|
# ? Mar 30, 2024 18:32 |
|
Some of the developers we newly acquired are trying to force us to use istio because configuring a golang web server to terminate tls with a cert-manager mounted certificate in their pod is too hard. I want to rip out their cowardly hearts and serve them - securely! - over the internet.
|
# ? Mar 30, 2024 18:56 |
Revenge is a dish best served confidential, integral, and available
|
|
# ? Mar 30, 2024 19:55 |
|
kaaj posted:Just curious, what scale you’re running Istio at? We have individual clusters with <6 -8k pods (~tens - low hundreds of clusters total) and even then scaling Istio is fun. Biggest offenders were big global namespaces with high rate of churn, where updates in the mesh need to be propagated to a lot of other peers. Almost exactly that. Supposedly the move away from sidecar proxies will improve memory use and a lot of the startup race conditions but I'm not wholly convinced.
|
# ? Mar 30, 2024 20:14 |
|
We’ve resisted all service meshes and so far no one has had a compelling enough use case to consider one. We’re open to them if someone actually has a valid need for them, but it’s mostly been attempted cargo culting or resume driven development. We don’t have the team size to support it and quite frankly we’ve still got larger problems to solve so we don’t want the distraction.
|
# ? Mar 30, 2024 21:46 |
|
what are service meshes and what problems do istio sidecars solve? (address me as you would a five-year-old) George Wright posted:resume driven development. lol, gotta remember that one
|
# ? Mar 30, 2024 22:45 |
|
The Iron Rose posted:Some of the developers we newly acquired are trying to force us to use istio because configuring a golang web server to terminate tls with a cert-manager mounted certificate in their pod is too hard. Is it bad? The team I’m embedded with is using it for their K8s routing and it seems fine but everything about K8s is kinda awful so it may not stand out from the background suck.
|
# ? Mar 30, 2024 23:12 |
|
Warbird posted:Is it bad? The team I’m embedded with is using it for their K8s routing and it seems fine but everything about K8s is kinda awful so it may not stand out from the background suck. I have no idea, I’ve never used a service mesh before. But I’m pretty sure it’s more work than: code:
|
# ? Mar 30, 2024 23:47 |
|
Blinkz0rz posted:Almost exactly that. Supposedly the move away from sidecar proxies will improve memory use and a lot of the startup race conditions but I'm not wholly convinced. Big win which brought noticeable improvements for us was usage of Sidecars (the CRD, not proxies) to explicitly define the endpoints each workload needs to talk to. We weirdly have a bunch of reasons to mesh (tens of teams, hundreds of microservices and monoliths in the mesh, FedRAMP, sensitive data, all that), so a mesh has its place. But we have a dedicated team owning Istio and I can’t imagine supporting it without a few engineers fully dedicated to that effort. Really hope ambient will make a difference on resource consumption.
|
# ? Mar 31, 2024 01:27 |
|
We tried to run istio and it caused a bunch of outages and upgrade headaches with no tangible benefit. It sounds cool tho
|
# ? Mar 31, 2024 01:55 |
|
Every company I've been at, someone wanted to do istio. Nobody was able to justify the engineering time as we didn't have problems, or, big enough problems to necessitate it. Seems neat, though And yeah resume driven development is a very real thing. We had one guy just go completely off the rails trying to get promoted it made my old boss literally rage quit. He was inventing all kinds of insane poo poo like his own DSL templating system for ECS when we already had Kubernetes in place. He ended up at a bitcoin dump which makes total sense Hadlock fucked around with this message at 10:12 on Apr 1, 2024 |
# ? Apr 1, 2024 10:09 |
|
George Wright posted:If you’re handling PII or you’ve got a reliable, well used, integrated, and supported PKI, then you should terminate at the pod. Otherwise it’s easier to terminate at the LB and let your cloud provider deal with certs.
|
# ? Apr 1, 2024 13:47 |
|
neosloth posted:We tried to run istio and it caused a bunch of outages and upgrade headaches with no tangible benefit. It sounds cool tho I'm again going to say that if your host offers VPC Lattice or something like it, and you aren't either using it or trying to get it adopted, it's out of stubbornness and not because you're looking out for your users
|
# ? Apr 1, 2024 13:49 |
|
At a past company we ran istio. Not sure why, we didn’t get any benefit out of it and it added a layer of complexity to troubleshooting. Probably resume driven development by the previous lead. On a different note, I’ve been trying to get back into terraform after a few years of not using it and was looking into variable precedence. I haven’t got the faintest idea what *.auto.tfvars files do differently than regular .tfvars files, besides taking precedence. A quick google didn’t turn up anything. The hashicorp docs also show them in the precedence list but give no other explanation of when/why you’d use them. Is it purely for setting global vars that should be the same in each tf deployment?
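To the auto.tfvars question: as far as I know the difference is load behavior, not just precedence. terraform.tfvars and anything matching *.auto.tfvars get picked up automatically on plan/apply, while any other .tfvars file has to be passed explicitly (filenames here are illustrative):

```
# loaded automatically, in lexical order:
#   terraform.tfvars, then *.auto.tfvars
terraform apply

# any other name has to be passed by hand:
terraform apply -var-file=prod.tfvars
```

So they're handy for values that should always apply to a root module without anyone having to remember a flag, e.g. a generated defaults.auto.tfvars dropped in by CI.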
|
# ? Apr 1, 2024 14:19 |