The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

22 Eargesplitten posted:

I have an interview on Tuesday that lists ELK as a good-to-have. Is it feasible to learn enough about it to not sound like an idiot discussing it if I start studying it today? I'm not going to try to sound like an expert, I just want to know enough that I can carry on a conversation.

Failing that, what about Puppet or Octopus, those are also on the "nice to have" list. I only know Terraform as far as deployment tools at this point.

do a tutorial or two, say "you've played around with it once or twice," mention something casually offhand in the interview about not forgetting inter-node encryption, and you'll be fine. with luck they'll think you're being modest.

for puppet, remember that it uses an agent and will enforce config changes to prevent drift, which is different from, eg, ansible.

octopus does CI/CD, if you can talk intelligently about what CI/CD is for and what problems it's trying to solve, who cares about the tooling?

Tooling in general is maybe the least important thing about devops/sre. Focus on the concepts and the problems you're trying to solve. When you're interviewing, speak about those concepts. Don't lose sight of the forest for the trees.


and if you haven't yet, read the google SRE handbook.




e: also it's a remote interview - have a set of prewritten notes with a few lines about each technology in their stack. If you're good at public speaking, you can even write out full sentences. In one of my recent interviews, I basically quoted one of Methanar's slack rants about kubernetes and got an offer.

The Iron Rose fucked around with this message at 21:45 on Dec 18, 2020


The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

22 Eargesplitten posted:

Alright, thanks for all the advice. I'll go through tutorials on everything so I can say I played around with it a bit. Would it be a bad thing to say that the reason I haven't done it in my current job is that we have one senior admin dedicated to monitoring and another to Puppet and Jenkins (or whatever the thing we use aside from Puppet is, I can look it up)? I mean those things are true, but it's also because I got bait and switched into being a NOC tech with a SysAdmin title and NOC pay. But I'm not saying that part to a potential employer.

Why would you mention that you haven’t done it in your current job? Much less go into that much detail.

“I’ve only had the opportunity to use [x] a little bit, now here’s all I know about it in a sentence that sounds smart.”

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

New Yorp New Yorp posted:

I’m being pulled into an effort to upgrade a bunch of Terraform from 11 to 12. How bad is it likely to be? I have no idea what their stuff looks like right now so I'm hoping it's not a nightmare to begin with.

it's not great. terraform version management in general is also not great; it's difficult to cleanly upgrade when working with many, many different developers.

I might even go so far as to say terraform isn't great, but it does do many things well and it's better than cloudformation (lol parameters)

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
stupid question but documentation isn't leading me anywhere

my work runs ansible vault. i have a json file that i need to copy to a server that contains SHA256'd password hashes. I don't want to commit that file to our VCS unencrypted. It will live on the server in the clear before being cleaned up in my startup script.

The smart option here is to convert my static json file to a jinja2 template, encrypt the password hashes and store them as variables, reference them in the .j2 and construct the json on the fly.
e.g.
code:
- name: Construct Json on server
  template:
    src: data.json.j2
    dest: /path/to/file.json
I don't really want to do that, because there's 25 of them and it's a bit of a pain.

What I would like to do is encrypt the whole file and decrypt it on the fly as I copy it over. Something like:

code:
- name: Copy json to server
  copy:
    src: /path/to/encrypted/file.json
    dest: /path/to/decrypted/file.json
I'm not entirely sure this is possible at all, judging from the digging I've done so far, probably because it's a stupid-ass idea. but hey here we are.

obviously it's not great for even the hashes to live on the server in the clear, but access is limited, we all make compromises, and this application is dumb.


vvv: welp that's what not reading the docs carefully beforehand gets me

The Iron Rose fucked around with this message at 21:20 on Feb 11, 2021

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

12 rats tied together posted:

Ansible's copy module will accept a decrypt parameter that controls whether or not ansible will automatically decrypt a vaulted file when it is copied to a remote server.

You might need a newer version of ansible if these params aren't supported in yours?

edit: formatting
edit 2: You might also consider, instead of storing the hashes, storing the plaintext passwords in vault. You can copy them as hashes to the server by creating a template, referencing the post-decryption vaulted variable, and throwing it through one of the various password hashing filters.

edit 3: ^^^ no worries :) the ansible docs are laid out really poorly for concerns that cut across multiple domains (modules plus vault, templates plus filters, etc.). This sort of stuff is also very hard to google!
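To make the quoted suggestions concrete, here's a minimal sketch (paths and variable names are illustrative; password_hash needs passlib on the control node, and whether its output format matches what the consuming app expects is app-dependent):
code:
# copy's decrypt param (on by default) auto-decrypts a vaulted src file
- name: Copy vaulted json to server, decrypting on the fly
  copy:
    src: files/encrypted.json          # encrypted with ansible-vault
    dest: /path/to/decrypted/file.json
    decrypt: yes                       # the default; shown for clarity

# or store vaulted *plaintext* passwords and hash them at template time
- name: Render hashes from vaulted plaintext vars
  template:
    src: data.json.j2
    dest: /path/to/file.json
  vars:
    service_password_hash: "{{ vault_service_password | password_hash('sha256') }}"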

so follow up to this;

I'm doing this the smart(er) way, and I'm generating my config file with a jinja template. I'm having some trouble working with my dictionary though.

So, I have a dictionary of dictionaries in vars.yml as follows:
code:
users_dictionary:
  service:
    username: serviceUsername
    password_hash: encryptedString
  otherService:
    username: otherServiceUsername
    password_hash: otherEncryptedString
    tags: administrator
I also have my jinja template, which looks something like this:
code:
{% for user in users_dictionary.values() %}
{
  "hashing_algorithm": "application_password_hashing_sha256",
  "name": "{{ user.username }}",
  "password_hash": "{{ user.password_hash }}",
  "tags": "{{ user.tags|default("") }}"
{% if not loop.last %}
},
{% else %}
}
{% endif %}
{% endfor %}
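As an aside, a sketch of an alternative that sidesteps the loop.last comma juggling: build a list and serialize it with ansible's to_nice_json filter (note this emits a full JSON array, unlike the bare objects above):
code:
{# build a list, then serialize it in one go #}
{% set entries = [] %}
{% for user in users_dictionary.values() %}
{%   set _ = entries.append({
       "hashing_algorithm": "application_password_hashing_sha256",
       "name": user.username,
       "password_hash": user.password_hash,
       "tags": user.tags | default("")
     }) %}
{% endfor %}
{{ entries | to_nice_json }}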

i want to construct my template on the fly, and I've achieved some success with:
code:
---
- hosts: localhost
  vars_files:
    - vars.yml
  tasks:
    - name: test jinja2
      template: src=template.j2 dest=test.conf
However, while the usernames and tags work fine (for users with tags defined and users without), the password hashes do not. Specifically, I get "fatal: [localhost]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'password_hash'"}"

This is confusing me quite a bit, since user.username works fine, and user.tags works fine, and all three key:value pairs are in the same dictionary. Do you have any advice here? I'm pretty sure I'm messing something up but honestly at this point I'm at a loss.

edit: moreover, when printing {{ user }} I can see the password_hash key/value pair just fine!
e.g.
{'username': 'serviceUsername', 'password_hash': 'serviceEncryptedString'}

The Iron Rose fucked around with this message at 00:52 on Feb 17, 2021

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I figured it out!


One "password_hash" was actually "passsword_hash" :v:


Appreciate the assist very much.

The Iron Rose fucked around with this message at 01:03 on Feb 17, 2021

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
How on earth do y'all use terraform version management? I have a few dozen different repos, all of which have required_version = "0.12.18" or something similarly archaic in there, usually whatever version was latest at the time the repo was first built. There is vast institutional opposition to simply using the latest version whenever you make a new PR, mostly because people are (foolishly) scared of state file surgery.

We do have terraform cloud.


also while I'm at it, hot take...


i kinda hate terraform modules. I mean I get it, there's a few very simple ones that I've used before, but I often find it more work to use and grok an existing module rather than create it all greenfield myself.

The Iron Rose fucked around with this message at 22:49 on Feb 24, 2021

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Methanar posted:

I've turned off a million dollars of annual spend worth of wasted EC2 instances this week.

Ridiculous.

we've spent five figures on a completely unnecessary hot standby VM instance that's not used and nobody will let me turn the damn thing off, even though everyone agrees it's unnecessary :negative:

Methanar posted:

Impressive. What's the overhead ratio of running a million small GKEs. Of the control planes and unused CPU of the workers / actually used by the applications.

i mean just running a cluster 24/7 is $800ish a year if i recall right, even before you add in the resource utilization. that adds up pretty quick!

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Crosby B. Alfred posted:

What's the overall ramp time you think for someone going from zero to competent for these? A few months? Six months?

if we're talking about time actively studying, terraform is extremely simple conceptually and you can safely put it on your resume with a few hours of practice. it's a fancy api wrapper combined with a graphing engine, it's not rocket science.

containers/docker/k8s... it's linux on linux all the way down. so it'll just take a lot longer till you know enough about the OS to understand what's happening. a mix of day to day use combined with theory will probably help you out here.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
As a woman I find certifications extremely helpful because I still get people explaining basic shit like the various KMS flavours to me, so it’s a nice way to say “I’m not just here to look pretty”. At my most recent job, certifications were explicitly the reason why I was promoted from IT to a devops-y role.

People say certifications don’t matter like they say titles don’t matter. They might not matter to *you*. They matter for me.

And in general I really like them because it’s a nice self contained way to learn a new set of knowledge. But I hella overstudy for certs in any event. Last one I did was a GCP networking cert, studied for a month, took 20 pages of notes and finished the exam in less than 10 minutes. So your mileage may vary.

I will say as someone who does lots of hiring I don’t tend to look at them in much depth. But I do look at them.



E: I will say resumes also don’t matter that much and ideally you should build up your network to the point the resume doesn’t matter. But that’s an ideal point and even then you still need to have *some* piece of paper nobody will ever look at. So I’m sympathetic to both sides of the argument, though my personal experience has made me value certs a fair bit.

The Iron Rose fucked around with this message at 14:24 on Aug 29, 2021

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

xzzy posted:

I tried real hard to get up to speed on traefik a couple years ago because I needed a magic proxy to make a bunch of silly web page containers accessible. They promised auto discovery of said backends, but I could never get it to work. It's very possible I am a giant idiot though.

So I went back to a static nginx that I have to configure proxy pass rules for manually.
Same - I tried right when they were rolling out traefik 2 and the documentation was very poor, and service discovery was not fun.

I’d probably not have nearly as much trouble today having done two years of cloud bullshit, but I too would stick to nginx unless I had a very compelling reason otherwise.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
you are entirely correct, and should be standing your ground in this fight

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hadlock posted:

Terraform versioning modules vs monorepo,

and go

Not entirely sure what you mean by versioning modules. I usually have the calling module handle the version, not the child module.

Monorepo is fine so long as you don’t have the same state file for everything. We have a big monorepo at work and it takes like 8 minutes just to run the terraform plan. I’m fine with everything living in the same repo. Less so with everything sharing state.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Also, why use a separate release branch from main? We do basically the same thing minus that aspect so I’m curious to know what some of the gains and drawbacks are.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Will taking one of those certs explain to me why I still need to write pre-stop sleep commands to get zero-downtime webapp deployments that don’t throw 500s?

Is that still the recommended advice? I recall several articles and one kubecon presentation about the matter and remain irked that it’s necessary.

Context: https://www.youtube.com/watch?v=0o5C12kzEDI

also

https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304


Like I can read and watch these and get why it’s needed. But still! Feels bad man
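For reference, the pattern those links land on boils down to something like this (a sketch; names and durations are illustrative, and the image needs a sleep binary):
code:
spec:
  terminationGracePeriodSeconds: 45
  containers:
    - name: webapp
      image: example/webapp:1.0
      lifecycle:
        preStop:
          exec:
            # keep serving while endpoint removal propagates to LBs/kube-proxy
            command: ["sleep", "15"]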

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Wicaeed posted:

We use LaunchDarkly, it seems to work well.

Question Time: What's the most sane way to deploy Helm charts these days?

I'm building a Rancher cluster for an IT-Ops team to run some hosted apps (Atlassian Jira/Confluence/Stash) and some day also run monitoring tools like Prometheus as well, however nobody on this team wants to use the existing Platform-team owned CI/CD environment that has historically been used with K8s thus far.

I'm thinking of exploring GitHub Actions, GitLab Runners and maybe even BitBucket Agents to see if any of these will meet their requirements:

* Should be able to be hosted internally (ie, on a private, internet-connected subnet in our Datacenter) as a VM
* Ideally, only the Runner itself would need to be hosted on-prem. The management plane can live completely in the cloud w/o issue.

I implore you to use the existing CI/CD environment unless it’s truly awful. Will probably save you - and your security team - a lot of trouble and get you a lot more support.

As far as deployment goes, I’ve used both helm and ansible to manage releases, and while helm is definitely *better*, especially now that tiller is dead, I’m still kinda meh on it. As far as orchestrating releases goes, you can build your own test/apply logic really easily into your runners, or use something like helmfile to do the heavy lifting for you. Honestly though, I’m open to some alternatives here too, Helm has been thoroughly so-so so far. I think it’s because I just don’t love the extra layer of abstraction between me and the deployment manifests, I’d rather set envvars and write HPAs myself. I guess the utility goes up when using third party services, but still.

Haven’t used bitbucket, but gitlab or GitHub will certainly work with self hosted runners and a non-hosted VCS. I don’t mind the gitlab k8s executor, but we don’t do anything very complicated with it. You should probably use whichever option the rest of your developers are already using. Frankly I’m surprised you even have the option.

The Iron Rose fucked around with this message at 06:50 on Jun 28, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hadlock posted:

Nobody wants to maintain Jenkins anymore at our place. All it does is run terraform a couple times a day, everything else has been transferred away from it

Is terraform cloud worth looking into? I put in a sales request a couple weeks ago and just recently got a very tepid email response from hashicorp

it's expensive for what it is, but it's a great product. licenses get pricy and they charge SSO tax tho.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I need some help on this one goons, because I don't like any of my options here.

Background:
I have an internal go based API service called "foo-api" that interacts with a postgres database running in cloudSQL. The service assumes only one database exists and reads $DB_HOST from an envvar. The main database has one read replica which is currently not used by foo-api. Each database instance (including replicas) costs ~$10k/mo across all environment tiers, so I'm loathe to spin up more read replicas. This service is running in a GKE cluster and other services interact with it through an internal load balancer. The service object manages the GKE internal load balancer through an annotation. DNS resolves `foo.example.com` to the load balancer's IP. We have the "gce" and "nginx" (the kubernetes community one, not the F5 nginx one) ingress controllers installed in our clusters.

Problem:
We have a frontend used by clients in AWS that sends requests to foo-api. Due to client growth and usage patterns, foo-api spends a bunch of time waiting to get a connection from the connection pool. Increasing the pool size helps, but we don't want to overload the DB with only requests from foo-api, since other services interact with it as well. We've a limit of 1000 connections to the DB, of which 300 are allocated to foo-api.

Request:

The developers of foo-api came to me asking if I could help them load balance between the main DB and the read replica, only for GET requests, with session affinity so as to account for the varying performance characteristics and replication lag (i.e. we don't need session affinity if we balance between multiple read replicas). The developers are reluctant to implement this in the service themselves, since the service assumes only one database and they don't want to rewrite their DB connection logic. I can push back on all of that if I have to, but I don't know if services should be aware of the database infrastructure configuration in the first place. This feels like something that should be handled outside the app service.

Solution:

This is a bitch to implement and this is what I have so far. I stole some of this from the following two github issues:
https://github.com/kubernetes/ingress-nginx/issues/187
and
https://github.com/nginxinc/kubernetes-ingress/issues/490

code:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # add an annotation indicating the issuer to use.
    cert-manager.io/cluster-issuer: mycluster-cert-manager-cluster-issuer
    kubernetes.io/ingress.class: nginx
    #### ---NGINX SERVER CONFIG FOR FOO-API.EXAMPLE.COM
    # Server snippet adds custom configuration to all the servers in the nginx configuration
    # location block lives within server blocks, and are used to define how Nginx should handle requests for different resources and URIs    
    # Based on request method we set the target destination to an arbitrary path prefix
    # we then store the variable $original_uri based on the normalized uri nginx is currently processing
    # finally, we rewrite the original uri to read or write.
    # The request then flows to the backend based on the path, and the uri is subsequently rewritten to the original uri
    nginx.ingress.kubernetes.io/server-snippet: |
      location /api/v1 {
        if ( $request_method = GET) {
          set $target_destination '/_read';
        }
        if ( $request_method != GET) {
          set $target_destination '/_write';
        }
        set $original_uri $uri;
        rewrite ^ $target_destination last;
      }
    #### ---FOOAPI-INGRESS CONFIG
    # Configuration snippets modify headers for specific ingresses
    # see: https://kubernetes.github.io/ingress-nginx/examples/customization/configuration-snippets/
    # internal specifies that a given location can only be used by internal requests
    # see: https://nginx.org/en/docs/http/ngx_http_core_module.html#internal
    # rewrite says that if the regular expression matches a request URI, URI is changed as specified in the replacement string
    # in this case rewrite matches the original uri with the caret (^), and replaces it with $originaluri, which we stored above.
    nginx.ingress.kubernetes.io/configuration-snippet: |
      internal;
      rewrite ^ $original_uri break;
  name: fooapi-ingress
  namespace: default
# The path will be overwritten by the original uri
spec:
  rules:
  - host: foo-api.example.com
    http:
      paths:
      - pathType: ImplementationSpecific
        path: /_read
        backend:
          service:
            name: foo-api-ro
            port:
              number: 80
      - pathType: ImplementationSpecific
        path: /_write
        backend:
          service:
            name: foo-api
            port:
              number: 80
  tls: # < placing a host in the TLS config will determine what ends up in the cert's subjectAltNames
  - hosts:
    - foo-api.example.com
    secretName: fooapi-ingress-cert # < cert-manager will store the created certificate in this secret.
Obviously I hate most of this and it makes me feel like this cannot possibly be the right approach. You'll notice a complete lack of load balancing between services as well.

To solve that I wrote this ridiculous fucking thing.
code:
    # featuring shitty weighted custom load balancing
    nginx.ingress.kubernetes.io/server-snippet: |
      location /api/v1 {
        set $original_uri $uri;
        content_by_lua_block {
          -- ngx.var.request_method is a string, so compare against "GET"
          -- rather than the numeric ngx.HTTP_GET constant
          local reqType = ngx.var.request_method
          local res
          if reqType == "GET" then
            -- send roughly two thirds of GETs to the read service
            local randNum = math.random(1, 100)
            if randNum < 66 then
              res = ngx.location.capture("/_read")
            else
              res = ngx.location.capture("/_write")
            end
          else
            res = ngx.location.capture("/_write")
          end
          ngx.say(res.body)
        }
      }
I hate this even more somehow. there's probably a footgun there with non-seeded rng. Also, even if I added session affinity using session cookies, it wouldn't be aware of the rewrite logic and I'd send requests to separate services (and thus different pods) anyways.

At this point, I think I'm left with the following options:
1. Do the above mess, no LUA, spin up a secondary read replica and use DNS load balancing between replicas.
---- Pro: doesn't require devs to do any code changes. Con: costs 10k/mo, feels really ugly.
2. Do the above but with the lua block
---- Pro: no need to spin up a second read replica, doesn't require any code changes, has some dumb weighting. Con: feels REALLY ugly, I'm now implementing my own load balancing logic, and session affinity won't work.
3. Switch to the other nginx ingress controller, which supports a custom resource that does all the above logic natively: https://docs.nginx.com/nginx-ingress-controller/configuration/virtualserver-and-virtualserverroute-resources/#match
---- Pro: supported first class in the product. Con: I now need to migrate ingress controllers across dozens of kubernetes clusters. We don't do anything especially exciting with our other ingress controllers but two nginx ingress controllers in our environment feels like a bad idea. We'd need to buy NGINX Plus to get sticky sessions.
4. Have foo-api be aware of the primary/replica distinction and route GET requests to the appropriate DB accordingly
---- Pro: I don't need to do any work. Con: Reflects badly on my team due to political sensitivities. And shouldn't load balancing be handled outside of the go API service's business logic anyways?

The Iron Rose fucked around with this message at 21:42 on Jul 22, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Thanks folks. pgbouncer seems like a possible solution to do this outside the app, and making the service aware of its database needs is preferable by far. Appreciate the advice!

The Iron Rose fucked around with this message at 23:45 on Jul 22, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

jaegerx posted:

I’m pretty sure you can use istio for this exact use case.

I could but the fact that customers are banging on the drums here is a limiting factor. There’s lots of approaches here with service meshes though. There was a whole ‘nother solution I briefly entertained of using canary traffic splitting to do the same thing before I realized I was trying to fit a square peg into a round hole.

Which frankly is what’s happening in general, and there’s no way I’d want to actually implement the lua plugin.

Anyways this thread remains the best grey tech thread, I continue to learn a phenomenal amount from reading what folks post here. Never even heard of CQRS before and it’s fascinating. And 2000s era tech read replicas may be, but it’s quite new and exciting round these here parts.

E: With regards to splitting traffic between read and write services, it’s really a terrible idea and the difficulty of implementation reflects that. While a fun afternoon’s research and testing, this is not the model we’re going to use. With the requirement to split traffic across multiple services with session affinity gone, the problem set becomes much simpler. More importantly, the service becomes easier to comprehend and maintain, we reduce reliability and quality problems, and we preserve flexibility in how we design our application going forwards.

The Iron Rose fucked around with this message at 21:41 on Jul 23, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
What are everyone’s favourite API gateway/authentication services?

We’ve a customer facing webapp and a bunch of API services in AWS/GCP/Azure, some of which are internal and some of which I want to be customer facing. Multi cloud so I can’t use the native cloud provider offerings.

Currently evaluating Kong (which highkey sucks in my tests so far), and Cloudflare Workers/Access/API Gateway, but very very open to other alternatives.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

i am a moron posted:

I work with a team that freely embarrasses themselves but we all do it so no one can judge. Someone did bring up squashing the other day but we decided it’s funnier this way.

Also keeping a long running project is a good way to learn by not pinning your provider version and just riding out major changes. I don’t ever pin versions unless the current is bugged for something I’ve deployed or we pin it once we realize a rework on a new version of the provider will take way longer than the urgency of whatever change/addition we’re making

this is the way


Hughmoris posted:

Rookie question about IaC:

As I learn this tech, what is good/best practice for building up a project with terraform and testing as I go? Do I just iterate on the main.tf file and build on as I go?

Ex: Do I build out my TF resources for S3 buckets and then apply/verify they work? Then edit the file and add on my permissions and update my stack to verify those work? Then add on my Lambda etc??

never use public terraform modules. private modules maintained by your org are fine.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I would avoid doing helm in terraform.

I still like ansible better.
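For the ansible route, a minimal sketch using the kubernetes.core collection's helm module (chart and names are illustrative):
code:
- name: Deploy ingress-nginx via ansible instead of the terraform helm provider
  kubernetes.core.helm:
    name: ingress-nginx
    chart_ref: ingress-nginx/ingress-nginx
    release_namespace: ingress-nginx
    values_files:
      - values/ingress-nginx.yaml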

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
These are all rough numbers - I’m getting a bit far afield of my non-compsci background so I figured I’d ask for some advice.

I need to distribute a ~55mb bloom filter encoded as a hex string to between several hundred and several thousand client cloud SIEM environments (think Azure Sentinel, Splunk Cloud, MDE, etc) over the public internet (unless azure lets me send data to a different tenant without going over the internet or peering VPCs, which I feel like should be a thing). Call it 100 million database rows, using MariaDB’s RocksDB storage engine (which I just learned exists today) to produce the filter. This updates whenever the table updates, but I don’t expect rows to change very often, so each iteration of the filter would I *think* be fairly similar to the previous. I need to distribute this to all the client environments at least once a minute. The client then runs the same hash function against an object in their environment to see if it’s (possibly) in our database or not.

So now the question is how do I efficiently distribute a fairly large blob file that changes semi-frequently. I figure we need to host the blob at edge and put it behind a CDN, but I’m not sure how much caching gets me if the source changes regularly. I don’t want the client to have to download a 50mb file every time either, so we need a client side cache. Append only doesn’t work because sometimes rows will change, and so will the computed hash, and thus so will the bloom filter.

My immediate thought was to divide my filter output into segments and check if each segment is identical in the client cache, which lets me balance compute vs network costs. And then I realized that at this point I lack the formal background to know if someone’s already solved this problem, and as it seems such a fundamental problem set, I’m sure someone already has.


E: using a more efficient probabilistic filter like a cuckoo or XOR filter seems to be the way to go here. More space efficient and we can distribute buckets rather than the whole data object. Bigger downside is outside of redis, there’s not a lot of data stores that implement the more efficient algorithms for you.

E2: I did some quick math on exactly how much volume the naive approach here would take and suddenly updating client environments once a day is okay lmao

The Iron Rose fucked around with this message at 03:58 on Sep 20, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Methanar posted:

Distribute by bittorrent. Make all of the clients send the blob to other clients.

Call it originator-bandwidth-optimized distributed dynamic edge computing

:laffo:


minato posted:

Distributing a 50MB blob that might change anywhere inside it is tricky. Distributing 50MB of lots of little pieces of data that don't change very often is solved in certain situations. Like, if you could decompose the bloom filter blob into a data structure that fit into many reasonably-sized DB rows, then "all" you'd need to do is set up DB replicas. DBs are pretty good at pushing out changes to read-only replicas, and also recovering from the inevitable problems you'll get when the network falls over.

Edit: Torrents would be very efficient but I think they assume the data is static; every time you modified the blob you'd need a new torrent. If you could break the bloom filter up into smaller pieces, it could work.

I think there’s really no way around splitting the bloom filter into smaller segments if you want to do this sanely. Much more efficient to decompose the filter into 50 different 1mb files and do comparisons on the segments. Would drastically reduce data volume especially if you don’t expect the majority of those 1mb substrings to change frequently. Then on the client side cache the segment, set a lengthy ttl, hash the substring and compare the local segment hash against the published hash of the remote segment to see if we need to invalidate. Do all of that async from queries that engage with the filter, and either concat on query time or pre-compute and have some other means of refreshing the data.

Ultimately I think this would significantly reduce the amount of data client environments need to download, allow us to update the bloom filter frequently, and also allow us to make much better use of CDN caching at edge to minimize our egress costs.
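To make that concrete, a sketch of the kind of segment manifest the client could poll first (field names invented for illustration):
code:
# the client fetches this small manifest, then downloads only the
# segments whose hash differs from its cached copy
filter_version: "2022-09-20T03:58:00Z"
segment_size_bytes: 1048576
segments:
  - index: 0
    sha256: "9f86d081884c7d65..."   # truncated for the example
  - index: 1
    sha256: "60303ae22b998861..."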

The Iron Rose fucked around with this message at 05:47 on Sep 20, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hadlock posted:

When you spin up the k8s cluster, you'll also bootstrap nginx or traefik via helm charts, and then each custom helm chart will have an ingress controller that does work on nginx or traefik. You'll also probably want to install cert manager to handle ssl certs using let's encrypt. Your ingress controllers will inform how the load balancer per cluster is configured automatically

I think EKS only works with ELB not ALB? Maybe that has changed

So yeah your terraform will look like this

Spin up cluster
Install nginx
Install cert manager
Install flux/argocd

And then create your branch and references to the helm chart and let your CI/CD install the helm chart to the correct namespaces

Someday nginx will support let's encrypt out of the box

Edit: installing helm charts via terraform is primarily for bootstrapping the cluster; nginx, cert manager etc. You should not be installing/deleting helm/hadlockco/myapp via terraform; that should be handled by your gitops branches and CI/CD

I don’t know if I would want to manage the helm charts for nginx/cert-manager via terraform at all. I think I’d rather have a data structure accessible in a K/V store with all your clusters, and then have a repo with all your managed applications that can deploy to all those clusters. Use something like helmfile for declarative specifications of your chart releases and use your CI/CD of choice to deploy the Helm releases.

That way you can deploy config changes to all your clusters at once rather than go cluster by cluster, *and* you don’t need to deal with the misery that’s managing k8s resources via terraform.


If I can recommend nothing else, you want to manage as few kubernetes resources with terraform as possible. Helm is much better suited for that work. In EKS you’ll largely be provisioning things like load balancers using annotations on service or ingress objects, so you wouldn’t manage or define them in terraform at all. They’re aws managed and governed by your k8s templates, which are governed by helm.
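A sketch of what that K/V data structure might look like (fields invented for illustration):
code:
# cluster inventory consumed by CI to fan one helmfile apply out across
# every cluster, instead of walking terraform states one by one
clusters:
  - name: prod-us-east1
    kube_context: gke_myproject_us-east1_prod
    environment: prod
  - name: dev-us-east1
    kube_context: gke_myproject_us-east1_dev
    environment: dev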

The Iron Rose fucked around with this message at 19:55 on Oct 24, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Sylink posted:

Thanks for the suggestions, I agree that managing k8s with terraform is weird as fuck and I don't like it. So I have been arriving at the conclusion that terraform should just do the top level resources, and then helm etc for everything else.

I did try the aws alb controller, it worked, but I never got external-dns to work properly with it. Maybe there is a simpler way to update a Route53 record in response to an ingress?

You can survive easily enough by managing DNS separately all together. Have a terraform repo with your R53 records, expose it to the entire engineering org, and set A records to resolve to your ingress controller’s service IP using an internal/external load balancer IP as needed. The ingress controller will then route to the backing service based on the host header, which is defined in each ingress object.

It’s a separate action altogether. External DNS is a perfectly valid way to do it, but I just haven’t used it personally (we use our own self managed DNS because of very silly reasons).

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hadlock posted:

I have never run into a situation where I unilaterally changed all my nginx instances at once. We always used dev clusters to test changes before promoting to prod. The few times we didn't it was the CTO and his lieutenant cowboying changes into prod and almost always broke something

I'm sure you could configure nginx to read secrets/config from KMS or something? If that was actually a concern?

Not super worried about secrets in dev clusters. And secrets on prod clusters? If you have access to the prod cluster, you probably have the ability to read/modify those secrets via (choose your favorite roundabout method). Limiting access to production via the normal methods has always passed security audit. Not everyone needs the beyondcorp security model

Do you not want to update the version of your helm deployments of nginx or cert manager? APIs get deprecated, config changes need to be made, replicas need to be added, annotations adjusted…

You wouldn’t change all of them at once, obviously test in dev first, or have release rings, whatever. But I’d much rather set yourself up for scaling now than have to go and update one terraform state per cluster every time you want to make a change, which also allows you to have a consistent config across your environment.

This is specifically for *your* managed services you want to put on all clusters though. I would include the ingress controller, security tooling, etc here, but ingress objects should obviously be managed on an environment by environment if not cluster by cluster basis. Same with cert manager since you probably want to limit access for providing domain certs to specific clusters.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Memory requests and memory limits should generally also be identical: https://home.robusta.dev/blog/kubernetes-memory-limit
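i.e. something like this in your pod spec (a sketch; values are illustrative, and per that article you can omit the CPU limit entirely):
code:
resources:
  requests:
    memory: 512Mi
    cpu: 250m
  limits:
    memory: 512Mi   # match the request to avoid surprise OOM kills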

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
This is by and large fascinating to read, and I’m largely following it, which is also fun! I have a few questions though, bearing in mind I don’t have any experience with datacenter networking.

By and large you’ve described a flexible and redundant network model, with metallb giving you consistent per connection load balancing so within the context of a connection you always get routed to the same pod.

Just to be clear, I thought that metallb when operating in BGP mode advertises a /32 route per service with a LoadBalancer type. Your diagram of the flow is slightly unclear: when you route from VPC -> POP -> DC A 10.0.0.1/32 -> k8s service in DC A…Z, does 10.0.0.1/32 represent the downstream k8s service’s endpoint, or are you doing service to service routing with custom endpoint slices?

I’m assuming the former because you would have complained about it by now otherwise but wanted to be clear.

Several Qs:
- Do you have a separate address pool per DC or do you share a single address pool across clusters and datacenters? I didn’t see anything in metallb’s docs about cross cluster address pooling, but I’ve only briefly scanned the docs.
- you mentioned that you can get to the same service via multiple POPs and multiple DCs. I gather you’re relying on your DC’s internetworking here? You’re advertising to just the local router, so this advertisement would need to be propagated to your POPs and DC peer routers, yes?
- Can any node access any service (assuming the appropriate policy) in another DC by going DC -> DC, or DC -> POP -> DC, or do you have to go through AWS and back through the whole POP -> DC -> LB chain? I can’t imagine the latter.
- Do you run with local or cluster traffic policies at scale? I’d normally prefer the impaired performance of cross-node kube proxying rather than needing to be significantly more conscious of which nodes my pods get allocated to but maybe that changed at your scale.
- do you have services (whether represented by k8s services or not) that run in multiple datacenters at once?
- all the above works for L4 routing. What do you do when you need the semantics and application aware logic of an ingress? Advertising the ingress controller as a service and routing to a separate DC-internal service for your app?
—— Could you go into more depth about the performance limitations of ingress-nginx you’ve experienced?
- if I’m in AWS region A, ECMP is great for load balancing… but I still probably want to go to the POP associated with the DC in region A for lowest latency. The POPs use BGP path selections to route to their local DCs, but how do your VPCs route to their closest POPs? Or is each VPC direct connected to exactly and only one POP?
- For making the BGPAdvertisement of the service to your local DC resilient, are you relying solely on running multiple metallb BGP speaker instances within a single cluster?
- You have many k8s clusters across all your DCs. Do any clusters span multiple DCs?

I have probably a dozen more questions but those are the big ones I can think of at 2am I think.
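For anyone following along, the metallb objects in question look roughly like this (a sketch with illustrative values, using the metallb >= 0.13 CRDs):
code:
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: dc-a-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.0.0.0/24
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: dc-a-advert
  namespace: metallb-system
spec:
  ipAddressPools:
    - dc-a-pool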

The Iron Rose fucked around with this message at 08:44 on Dec 12, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I continue to learn valuable things from you Methanar. I greatly appreciate you sharing these details and insights.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
Helmfile isn’t bad for orchestrating multiple releases in a cluster: https://github.com/helmfile/helmfile

Hooks in nicely with your secrets management tool of choice as well. Use with any CI provider you want - GitHub, gitlab, jenkins; anything that runs a script on a commit or merge will work here. Haven’t used Argo CD but I’ve heard good things.
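A minimal helmfile.yaml sketch (chart, version, and paths are illustrative):
code:
repositories:
  - name: ingress-nginx
    url: https://kubernetes.github.io/ingress-nginx

releases:
  - name: ingress-nginx
    namespace: ingress-nginx
    chart: ingress-nginx/ingress-nginx
    version: 4.7.1
    values:
      - values/ingress-nginx.yaml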

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
It’s awful especially when you have OPA rules to enforce naming patterns and whatnot. I had to kill the few single-resource modules I found at my current job - thankfully there were only a handful.

It adds a totally unnecessary layer of abstraction, maintenance overhead, and ties you to specific provider version semantics.

Modules should be created very rarely and used sparingly.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

Hadlock posted:

:five:

We had our main cloud services in AWS but every project (GCP services namespace) in GCP needed an XPN connection to the AWS VPN, and needed all sorts of very specific VPN subnet settings etc. You could create a new project and k8s cluster using vanilla tf but there's no way you would figure out how to do all the nuanced config, so instead you would just use hadlock_gcpproject.module and hadlock_k8scluster.module and skip three hours of reading documentation and trial and error. Also when VPN settings changed you got that for free, rather than having to debug weird networking problems because joebob didn't get the memo to update the terraform VPN, or whatever

We do the same thing for project (auto budgeting and audit logging) and VPC creation (auto vpn + dns peering + private services connect) and it is exactly what you want from a module because it’s a tightly coupled set of resources your developers shouldn’t need to care about. Ours has a dozen or two resources and is easily invoked and extended from the module call.

What we do not want is “create this VM, you get these tags and this naming scheme.” The maintenance is not worth the extra layer of abstraction; use OPA rules or SCPs or whatever the fuck - a terraform module isn’t the right way to solve it. The terraform AWS provider has an amazing default_tags feature that rules though.

In practice devs just say “give me a subnet” and we create it for them, but it still makes our lives easier.

Also death to terraform, long live Pulumi

The Iron Rose fucked around with this message at 22:41 on Dec 21, 2022

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I didn’t think like this previously, but then I worked for several SaaS companies and now find it eminently reasonable to remind enterprise customers who want enterprise features to pay for an enterprise license.

Security is great and all but making software costs $$$. I’d still like to see SSO more widespread at cheaper license tiers though!

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
In other news I did a huge revamp of our Gitlab CI pipeline templates for deploying shit to kube and it’s beautiful. About a thousand fewer lines of code, going from 14 templates to 2, support for multi cloud, arbitrary #s of clusters (and by extension arbitrary #s of environments), passing in arbitrary arguments, arbitrary flow control, multiple helmfiles…

Opinionated design is nice and all but not when you sacrifice too much to achieve it! A little bit of flexibility deferred to the user goes a long long way.
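The rough shape, for the curious - a sketch with invented job and variable names, not our actual templates:
code:
# one hidden base job, extended per target, instead of one template per
# cloud/cluster/environment
.deploy:
  image: example/helmfile-runner:latest
  script:
    - helmfile --environment "$ENVIRONMENT" --file "$HELMFILE" apply $EXTRA_ARGS

deploy-prod-gcp:
  extends: .deploy
  variables:
    ENVIRONMENT: prod
    HELMFILE: helmfiles/gcp.yaml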

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
When I last used cloudformation in 2018/2019 it sucked a lot more but these days I have such contempt for terraform and module related footguns I’m pretty open to alternatives

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
So I’m at an organization that has whiplashed between the two extremes ad nauseam over the past few years. We went from having a devops person embedded in each dev team, which produced tons of technology sprawl because nobody coordinated. Lots of poorly configured, poorly supported environments with the attendant waste of endlessly rebuilding a better mousetrap. Too much focus on getting an app running without thinking about things like spot instances, managing logging, HA and scaling, etc.

Then we went to a centralized infrastructure team, which means we think about all those things but now act as a bottleneck for all dev work. This leads to much tossing of hand grenades over cubicle walls. People on all sides are bad at communicating and worse at understanding the broader environment. Also it’s staffed largely with sysadmins who don’t really understand dev work, and devs who are unfairly contemptuous of the sysadmins, with bad blood as a result.

There’s the much beloved platform model which in theory works but basically means shitting out CI templates and o11y libraries, which is fine I guess but then dev teams have to actually seek out and use and iterate on those templates and libraries to drive improvements, which won’t happen unless they’re forced to or aware of those templates existing in the first place.

Management can’t help too much beyond really broad diktats like “use kubernetes”, which while obviously extremely limited in a lot of ways at least gets everyone on a shared lingua franca.

Is the solution to just hire SWEs for your platform teams so you get people speaking the same language? There’s only so much devolving of monitoring/observability you can really expect to get into individual teams of 5-10 developers. It’s not reasonable to ask that every dev becomes an expert on monitoring, cloud, and kubernetes - and let’s not even talk about networking and security on top of everything else.

Ultimately I feel the solution here has to be one that doesn’t rely so heavily on the skill levels of individual developers or on a rapid and responsive infra/devops/platform/sre/etc team (aka “devops bullshit”). This is a big part of what the big three cloud providers sell - but even if VPCs obviate the need for network engineers, you still need someone who knows what a route table is and who can configure peering, and that probably won’t be someone working on a feature that’s two weeks overdue.

What I am slowly experimenting with is the idea of a devops bullshit team that provides the basic building blocks and support for that within their area of expertise. Basic CI templates, APM libraries, secrets management, and so on. To the extent devs need more than that as they inevitably will, that’s something they can build out communally or internally. A dedicated team provides a floor, upon which teams can build scaffolding to suit their own needs.

Then again I also just had to shoot down people on my team saying “people shouldn’t use cloud functions it should all be k8s!” so I don’t exactly want to presume the floor is very high either.

Tl;dr: the cloud is hard and people are bad at it. What is a modern theory of devops that can be applied from the perspective of an engineering organization at scale, now that we’ve seen a decade+ of people trying and failing to do devops bullshit at all sorts of organizations big and small?

The Iron Rose fucked around with this message at 19:57 on Feb 14, 2023

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I think that’s a fine idea and solves some problems, but it doesn’t really scale to solve the others.

Let me reframe what I’m trying to get at here.

The cloud has democratized many of the services that used to require dedicated teams or in house software. As a result, more and more work can be handled by developers without the need to venture externally.

The problem is developers still aren’t good enough at doing this to be as effective as we want them to be: logging, monitoring, scaling, HA, o11y, database administration, effective use of compute, secure design, and so on.

The real question is whether them not being good enough at it actually matters, and I’m not sure that it does.


If we do think it matters, some approaches here are:
- central infrastructure team
— default state, kinda awful in lots of ways. Bottlenecks, poor cross-functional relationships, treated as a cost centre, and scales poorly.
- create building blocks such that dev teams can create good enough designs without the need to involve cross functional business units?
— this is the platform approach, and sucks because you probably aren’t building better building blocks than BigCloud is.
- stick someone who knows to cloud in every team
— this requires someone who knows how to cloud, has a pretty high bus factor risk, and results in chaotic design patterns that are a support challenge
- accept that devs aren’t good enough at this, but so long as the business keeps running, “not good enough” is still fine; make infra a support org.
— I almost like this but it doesn’t solve for when the business needs better results than a devolved approach provides.
— are you really getting $X million in value from this approach versus what the actual cloud support contracts provide?
— this also doesn’t really solve for o11y and security, both of which require heavy development/admin work running your monitoring/security infrastructure. Elastic doesn’t babysit itself and not every company can use a SaaS offering here. You might devolve compute, but if security and logging still result in cross-functional friction, have you really solved the problem, or have you just solved infrastructure/compute sui generis?

The Iron Rose fucked around with this message at 20:39 on Feb 14, 2023


The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:

12 rats tied together posted:

the other side of this strategy's coin is that you just get rid of people who want to work in the support org but just want to build stuff and don't care about tickets.

the tickets are from the customers and the customers are the entire point of having a team and not just paying extra for premium support. if you don't want to be in the "have no tickets" pipeline, congrats, you've been promoted to internal customer, where you can work on features that generate revenue for the company instead of generating animosity from your coworkers

Missed this when writing my reply, but I actually agree with this to an extent, which is that I almost think a business is better served by having no central devops/infra team at all. I’m not sure an internal support org is worth it either, but I’m also not sure the business is served by infrastructure existing at all.

Maybe just fold it + networking into security. Shared services feels like a bitch to manage in general.

The Iron Rose fucked around with this message at 20:40 on Feb 14, 2023
