|
xzzy posted:You talking about the availability of the gateway api? I'll definitely play around with that next but I assume it's gonna be years before ingress is gone. Yeah, the gateway api Ingress controller sounds cooler IMO
|
# ? Nov 9, 2023 01:35 |
|
The Iron Rose posted:Before I do though, am I reinventing the wheel here?
|
# ? Nov 9, 2023 02:24 |
|
it brings me no pleasure to tell you that this is a problem easily solved by ansible
|
# ? Nov 9, 2023 04:25 |
|
12 rats tied together posted:it brings me no pleasure to tell you lies
|
# ? Nov 9, 2023 05:08 |
|
cilium has Gateway API now in 1.14
|
# ? Nov 10, 2023 02:20 |
|
12 rats tied together posted:it brings me no pleasure to tell you that this is a problem easily solved by ansible Elaborate on this, if you don’t mind? I’ve not used ansible regularly in a few years beyond running AWX for a few teams, so I’m out of the loop. As far as triggering control plane or node upgrades goes, it looks like I can do that with the various community modules for the various cloud providers, which is all well and good. My big concern is flow control and parallelism without wasting a ton of compute. My team leans more ops than dev sadly, so ansible is a plus from a maintenance perspective. We run a few AWX on k8s instances though, so that’s an easy option for running the plays. The Iron Rose fucked around with this message at 21:12 on Nov 13, 2023 |
# ? Nov 13, 2023 09:52 |
|
This is going to be a tall post, sorry in advance. Ansible is good at flow control and parallelism. The exact way you handle it depends on your requirements, but the way I would get started modeling this as an ansible problem is to create a group structure that is "k8s" -> each cluster -> a fake host named after the cluster.
You can organize clusters into groups by shifting around these virtual.clustername hosts, for example, by using the "group_by" module or any of the other ansible playbook strategies. You can treat each virtual host as if it were a "thing" -- for example, you can create a group called "k8_1.22" that holds a group_var for target_version=1.22. If you want cluster2.east to be 1.22, put it in that group. Per-cluster data should go in the group named after the cluster, per-version data goes in the group named after the version, and so on. You can run `ansible-inventory [...] --graph --vars` to get a visual representation of how ansible mapped your configuration data onto each virtual host --> cluster. Since you mentioned AWX, you technically have some extra options and features, but an approach like this would run from your laptop which is a good place to start while you figure out the tasks and their order, and then when you're done you can schedule it from AWX and limit it to 1 playbook in progress.
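To make the group layout concrete, here is a rough sketch of that structure in plain Python. This is a model of the membership and group_var merging only, not real ansible inventory syntax, and the cluster names and region vars are hypothetical; the "k8_1.22" group name follows the post.

```python
# Sketch of the post's layout: a "k8s" parent group, one group per cluster,
# and a fake "virtual.<clustername>" host inside each cluster group.
# A version group like "k8_1.22" carries target_version as a group_var;
# moving a virtual host into it changes which version that cluster targets.
groups = {
    "k8s": {"children": ["cluster1.west", "cluster2.east"], "vars": {}},
    "cluster1.west": {"hosts": ["virtual.cluster1.west"], "vars": {"region": "us-west-2"}},
    "cluster2.east": {"hosts": ["virtual.cluster2.east"], "vars": {"region": "us-east-1"}},
    "k8_1.22": {"hosts": ["virtual.cluster2.east"], "vars": {"target_version": "1.22"}},
}

def resolve_vars(host: str) -> dict:
    """Merge group_vars from every group that contains the host (later wins),
    roughly what `ansible-inventory --graph --vars` lets you inspect."""
    merged: dict = {}
    for group in groups.values():
        if host in group.get("hosts", []):
            merged.update(group.get("vars", {}))
    return merged
```

Putting cluster2.east's virtual host into "k8_1.22" is all it takes to schedule it for a 1.22 upgrade; per-cluster data stays in the cluster's own group.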
|
# ? Nov 13, 2023 20:32 |
|
New question! I have a need to implement deduplication in cribl, our log processing pipeline. Cribl can interact natively with redis. I need deduplication because fluentbit, our log collector, will just up and loving die silently out of the blue. Replacing it will take weeks. The bandaid fix is to restart the pod every hour… but despite configuring fluentbit’s tail plugin with a SQLite DB to store log offsets, we still get a large amount of duplicates every restart. Elastic rejects them, but the rejection burns compute and can cause backpressure. Here’s the proposed design:
1. filter: true
   a. Perform a redis get, where the key is a uniquely identifying set of fields (“${field1}:${field2}”). If the key exists in redis, store the value of that key in __dedupe_key (fields with a __ prefix are ephemeral and not sent to the logging destination)
2. filter: if __dedupe_key == ‘field1:field2’
   a. drop the log
3. filter: if ! __dedupe_key == ‘field1:field2’
   a. This event is not a duplicate. Use set to store in redis the key field1:field2.
   b. both values and keys need to be unique, so may as well set the value to field1:field2 also
I intend to use _time (in seconds since Unix epoch, includes milliseconds) and kubernetes.pod_id. Debating a hash of the “message” field (stdout) but those can get lengthy, and I think the first two are probably enough unless I have the same pod emit multiple logs at the same millisecond. With an eviction policy of allkeys-lru, a TTL of 2 hours, and 3M events/hour I should only need ~510MB for redis, which is pretty trivial. Is this a reasonable approach to take here?
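As a sanity check of that decision flow, here is a minimal sketch in Python. In Cribl this would be filter expressions plus the native Redis function, not Python; `FakeRedis` is a stand-in so the sketch runs without a server, and it ignores TTL and eviction.

```python
TTL_SECONDS = 2 * 60 * 60  # 2h TTL from the proposed design

def is_duplicate(event: dict, store, ttl: int = TTL_SECONDS) -> bool:
    """Return True if this event was already seen.

    `store` is anything with redis-like get/setex. Key and value are both
    the "<_time>:<pod_id>" pair, matching the design above.
    """
    key = f"{event['_time']}:{event['kubernetes.pod_id']}"
    if store.get(key) == key:       # steps 1+2: key already in redis -> drop
        return True
    store.setex(key, ttl, key)      # step 3: first sighting, record it
    return False

class FakeRedis:
    """In-memory stand-in for a redis client (ignores the TTL argument)."""
    def __init__(self):
        self._d = {}
    def get(self, k):
        return self._d.get(k)
    def setex(self, k, ttl, v):
        self._d[k] = v
```

With redis-py the same calls (`get`, `setex`) exist on a real client, so the logic carries over directly.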
|
# ? Nov 15, 2023 21:20 |
|
I would examine why the gently caress you need to deduplicate your logs. Bandaid fix: YES
|
# ? Nov 16, 2023 03:52 |
|
Ok I get to do system architecture for our prod system. It's a baby django app + postgres + redis thing; you've seen it a billion times before. They're currently limping along on elastic beanstalk that somebody set up from a barebones tutorial four years ago, but they're ready to graduate. They don't need the scalability of kubernetes, but the tooling and standardization of it all is quite nice, and it's conveniently also the system I'm most familiar with. This system is pretty mature, but they're going to add more developers and it will grow over time; not expecting more than about 500 simultaneous users. It's pretty db heavy on the read side, and I expect we'll move from 6 services out to 12 by next year, but I doubt we'll have more than two dozen services over three years. Thinking:
1. terraform to spin up kubernetes, hosted on github, executed with github actions
2. two EKS kubernetes clusters to start, prod and develop (don't need a separate qa cluster)
3. flux2 or argoCD for deployments
4. prometheus/grafana/loki for monitoring and alerting, logs
Everything else besides the terraform/gha/k8s can be swapped out as we graduate beyond that. I would really like a terraform->github actions system that supports more, like atlantis, but serverless. Looks like there are a couple of attempts at making atlantis serverless, but nothing has really gained traction. I wouldn't mind going to terraform cloud, but I can't justify the budget this early in the project and it would require review by people external to my team (slow). Thinking I'll go with argoCD as it has better UI/GUI observability.
I've used flux in the past, but argoCD is more flashy, and flux is a fair amount of work with no visual payoff to say "here's the thing I spent three days on; it works flawlessly, you'll never see it, but it does like 30% of the work day-to-day." If/when we need it, upgrade from prometheus/grafana to something like newrelic or datadog once I can justify the expense (right now they have nothing). What would you do differently / how would you do it completely differently?
|
# ? Nov 16, 2023 04:27 |
|
never build a QA cluster, namespace that poo poo. haven’t used either flux or argoCD for deployment orchestration - though I’ve gained a lot of value out of Argo Workflows. We use helmfile and gitlab pipelines (entirely analogous to GHA); it’s not flashy at all, but it works and the rollbacks are nice. Add jaeger and an opentelemetry collector deployed via the otel operator so you can get traces from your app too. It’s like 10 lines of code to get auto instrumentation across your entire service and all its dependencies, including instrumentation libraries for Django, redis, requests, etc. easy win, the graphs look cool, and telemetry is the future. Kube is great for the ecosystem and the shared lingua franca of it all. It’s probably not your most cost efficient option, which is probably also fine. I can’t live without cert-manager and ssl termination at the ingress controller level, and it’s cheaper than creating a million NLB/ALBs via your load balancer controller and terminating SSL there. I agree log deduplication is insane, but I’m about to head into change freeze hell so I’ve gotta do something to stop the bleeding here. Bug fixes - even hilariously stupid bug fixes like the one above - are fine. Swapping our logging agent to logstash, not so much. The Iron Rose fucked around with this message at 09:56 on Nov 16, 2023 |
# ? Nov 16, 2023 09:45 |
|
Eh - have cribl toss anything older than x minutes onto a message queue/blob store, and process that in an orderly fashion. If it’s older than that, you’ve already lost any advantage in reduced latency for event processing, so you’re now solely in the business of cheaply guaranteeing eventual fidelity and can take your time. Business impacts may occur, but if new events incur latency as a result of not doing that, the outcome is worse.
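The age cutoff itself is a one-liner; a sketch of the routing predicate in Python (the 15-minute threshold is an arbitrary example, and in Cribl this would live in a route's filter expression rather than code):

```python
LATE_THRESHOLD_SECONDS = 15 * 60  # example cutoff; tune to your latency SLO

def route(event_time: float, now: float, threshold: float = LATE_THRESHOLD_SECONDS) -> str:
    """Send fresh events down the hot path; shunt stale ones to a queue or
    blob store where they can be deduped and replayed at leisure."""
    return "slow_path" if (now - event_time) > threshold else "hot_path"
```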
Junkiebev fucked around with this message at 01:21 on Nov 18, 2023 |
# ? Nov 18, 2023 01:17 |
|
And I hope to Christ your restarts are scheduled +/- random minutes and not “at the top of every hour”, because lol if that’s happening
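For the jitter point, the idea is just base schedule plus a random offset; a sketch, with example numbers:

```python
import random

def jittered_delay(base_seconds: float, jitter_seconds: float) -> float:
    """Seconds to wait before a scheduled restart: base +/- up to jitter,
    so a fleet doesn't restart in lockstep at the top of every hour."""
    return base_seconds + random.uniform(-jitter_seconds, jitter_seconds)
```

e.g. `jittered_delay(3600, 900)` restarts somewhere between 45 and 75 minutes after the last one; a `sleep $((RANDOM % 1800))` in front of the restart command gets a similar effect in plain cron.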
|
# ? Nov 18, 2023 01:28 |
|
Anyone have any familiarity with using Jupyter Notebooks for operational runbook work? Netflix has been talking about it for years and I'm about to pitch something similar on my team, curious if anyone has experience with it.
|
# ? Nov 25, 2023 20:08 |
|
I found that it doesn't actually solve any problems in a meaningful way for me and instead introduces a ton of new ones, like managing the entire notebook ecosystem. For my immediate needs I have a stdout plugin for ansible that hijacks the terminal and replaces it with a textualize app. While I was figuring out how to do this, I noticed that textualize has textualize-web, which lets you run apps in a browser in addition to the terminal. I'd probably start there instead of jupyterhub if I was building something from scratch for operational work specifically.
|
# ? Nov 25, 2023 20:21 |
|
12 rats tied together posted:I found that it doesn't actually solve any problems in a meaningful way for me and instead introduces a ton of new ones like managing the entire notebook ecosystem. Yeah, the trick is that this is for a larger team, so some standardization is probably appropriate. We will have to figure out how to manage the notebook ecosystem for sure, but our team is large enough that's probably not impossible, and I think the benefit of tightly coupling the documentation to the actions is pretty meaningful. We'd also be using JupyterHub or something hosted, instead of just 'alright folks, go make some notebooks', so that helps with a lot of the weird python overhead of environments/etc. Textualize is neat, and I'm a fan, but it isn't really an analogue to Jupyter Notebooks other than 'quick graphical UI', and that's not really the important part for me about the proposed setup. You could totally do a pseudonotebook setup with cells of rich text/etc, it just seems like an awful lot of overhead. For some context, we have some runbooks that are an awful lot of 'grab a bunch of data and review it, then do something based on that output'. Long term, we want to turn a lot of this into fully automated solutions using stuff like SQS and Step Functions/etc, but there's a huge gap between '100% paper docs' and 'fully automated solutions', and I see this as sort of a middle ground that allows for a lot of iterative improvement on the process.
|
# ? Nov 25, 2023 21:57 |
|
Yeah, I get it, I just didn't find that editing python code in cells and running the cells really lent itself well to what I was looking for, which was "psychological safety for ansible playbooks" (analogous to step functions in this context). I don't know what kind of jobs you're running but my plans here are basically textual+ansible-playbook for interactive tasks that require human attention, the stdout plugin just shunts events from ansible's task queue into the textual app which displays their status in a nice UI + relevant graphs -- "You're running this job, which is doing this to the cluster, so this graph line should go down", etc. Because it happens at the task queue level it also lets us pause jobs, interactively retry individual tasks or groups of tasks, etc. Theoretically we can even live edit the discovered data but it's probably best not to build that feature. For reactive tasks the plan is ansible-rulebook subscribed to AWS SQS, most likely. It's either SQS or Kafka but internally for us Kafka has weird uptime problems / lack of support, which is not something I want attached to the system that runs cluster-wide jobs with potentially destructive powers. The exciting thing about rulebook is that it uses the exact same tasks format as playbook so we never really have to migrate from one system to another. We have jobs xyz and we think it's safe to run them automatically, we just write a rulebook that collects preconditions and fires them off. The jobs are janitoring workflows for a distributed database, so it is usually the case that we make some change and want to ensure that the cluster is coping with it correctly and not descending into a death spiral/cascading failure. Needing to alt tab between job interface and database health (in various forms) has a real attention span cost. Textual lets us simply display the graph in the cli alongside the job, along with descriptions of what should be happening, hotkeys to pause, rollback, post to slack, etc. 
Simply type a prometheus query and hit enter, instead of remembering that in order to run a prometheus query in grafana you have to open Explore, and to do that you need to right click -> open in new tab, because grafana hijacks cmd+click for some reason, and so on.
|
# ? Nov 25, 2023 23:20 |
|
12 rats tied together posted:Yeah, I get it, I just didn't find that editing python code in cells and running the cells really lent itself well to what I was looking for, which was "psychological safety for ansible playbooks" (analogous to step functions in this context). I expect that if we had a singular interface to our infrastructure like Ansible that probably would be more useful, but we don't (for a bunch of reasons beyond my immediate scope, unfortunately). The editing cell stuff is mostly fluff for us as well, and I plan on locking down runbook cells to read-only for most sections, but it has a few benefits - namely, getting folks not as used to Python a bit more visibility, and some ability for incremental changes. You can probably (successfully) argue that if I'm not using the live edit feature of JupyterHub, I'm missing the point, but if there's an alternative that doesn't involve me rolling my own entire solution, I'd use that instead. So I guess that's a good question - is there anything like that? The other ancillary benefit is having a section (or second deployment) that is both editable and shareable, so you can collaborate on deeper dive investigations or other issues that require a lot more going off script or otherwise mucking around.
|
# ? Nov 26, 2023 02:22 |
|
I worked for a startup that probably does exactly what you want, but I'm not sure how active development is right now.
|
# ? Nov 26, 2023 08:35 |
|
Sagacity posted:I worked for a startup that probably does exactly what you want, but I'm not sure how active development is right now. Looking into it, this does a lot that we already have solutions for internally, with agents/etc. I'm not looking for something that solves that much, because we have a lot of robust internal solutions that aren't public offerings/etc, so we're not using the G Cloud / AWS stuff that all this sort of software ties into. There's apparently a beta for coding providers so we could maybe write a shim, but it's in Rust, and nobody on my team is going to learn Rust any time soon. For a smaller shop that directly used stuff all on AWS/etc, I bet this would be pretty cool though.
|
# ? Nov 26, 2023 10:42 |
|
I'm looking for some way to communicate a simple project roadmap to other internal teams and am hoping someone here might have a recommendation. I don't need a full rear end project management tool, which seems to be all I can find in google. I was thinking some sort of site generator tool that takes json or yml. I would like something that will display 3-4 projects side by side with milestones, current status, and expected FY and quarter. I don't need to communicate any more information than that.
|
# ? Dec 1, 2023 19:44 |
|
The Fool posted:I'm looking for some way to communicate a simple project roadmap to other internal teams and am hoping someone here might have a recommendation. Would something like Trello work? Or some other kanban?
|
# ? Dec 1, 2023 21:13 |
|
I mean, they already have access to our ADO boards. I'm trying to do something more high level / "customer facing"
|
# ? Dec 1, 2023 22:27 |
|
Depending on how much data and how frequently it's updated, maybe find a nice Excel/Powerpoint/whatever template and update it manually. Or a static page with something like Highcharts or Chart.js: https://www.highcharts.com/docs/chart-and-series-types/x-range-series
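If you go the static-page route, most of the prep work is turning "FY + quarter" into the start/length pairs an x-range style series wants. A hedged sketch of that mapping (it assumes fiscal year == calendar year; shift the origin if yours starts mid-year):

```python
def quarter_span(fy: int, quarter: int) -> tuple:
    """Map (fiscal year, quarter 1-4) to a (start_month, length_in_months)
    pair, with months counted absolutely (year * 12), for an x-range chart."""
    if not 1 <= quarter <= 4:
        raise ValueError("quarter must be 1-4")
    return (fy * 12 + (quarter - 1) * 3, 3)
```

Each project then becomes a row of these spans, fed to whatever renders the chart.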
|
# ? Dec 2, 2023 05:53 |
|
Business is looking for something which does stuff like “disable smbv1 client connectivity on all endpoints”, and I’m looking for something like “change state and track diffs at scale, where the only prerequisite is network connectivity and a Linux kernel.” So: green-field solution for fleet management of (mostly ephemeral) Linux nodes w/ GitOps-driven config as code in multiple cloud providers () with a pathway towards something enterprisesque for support. (Open)Salt-Stack seems… old and feature-gated for some reason?? I don’t want to discard it just because it isn’t the FOTM. Uyuni is salt-backed and has RBAC+SSO and a rest API, but it’s weird and German (SUSE). Puppet/Chef are expensive to transition to enterprise support, right? Foreman seems promising but immature at first glance. Ansible Tower ($$$)/AWX are… fine, I suppose? Who ya got? Junkiebev fucked around with this message at 03:32 on Dec 13, 2023 |
# ? Dec 13, 2023 03:15 |
|
ansible or chef; all of the other options are dumb
|
# ? Dec 13, 2023 05:23 |
|
I like foreman and puppet for provisioning and config, but keeping them running is an FTE. Puppet free is fine once it's running. Foreman works, but it always feels 80% done and never really seems to get beyond that. But there's no better pxeboot project out there.
|
# ? Dec 13, 2023 05:32 |
|
I'm a Puppet user and once I wrapped my head around it, I got to like it well enough. Enterprise at scale was indeed expensive - something north of $100 a node for a few, maybe $75 per node at scale, and that was a few years back. At that time, Ansible Tower was in the same ballpark. Puppet's owned by Perforce, Ansible by RedHat/IBM, so while Ansible probably has a more secure future, I wouldn't feel great about paying for either.
|
# ? Dec 13, 2023 05:39 |
|
Salt sucks, master/agent protocols suck, ansible is the way
|
# ? Dec 13, 2023 06:35 |
|
Cloudhouse has a product called Guardian that does exactly this. They hardly market it at all, but it's about 12 years old at this point and has been stable for about 7 of those years. NYSE uses it extensively. I dunno what their licensing model is like today, but in like 2016 it was about $15 per node (server) per month. I deployed it at another company back in like 2015. It has native support for generating salt/chef/puppet config code to fix stuff that's out of compliance. I can't imagine deploying ansible as a greenfield solution in TYOOL almost 2024. It goes in the bin along with PHP, labeled "neat, easy to use technology that is long past its prime"
|
# ? Dec 13, 2023 11:24 |
|
In 2024 should I be using services or ingress with external-dns? Nginx seems to support services in addition to ingress and whatever "nginx virtual server" is. The Amazon load balancer controller also supports services. In 2019 it was still all about ingresses. Did the world move on and I just didn't notice? Also: what's everyone's preferred way to handle secrets in terraform in AWS? Looks like one lone developer is maintaining the wildly popular sops terraform provider
|
# ? Dec 13, 2023 11:29 |
|
Hadlock posted:I can't imagine deploying ansible as a greenfield solution in TYOOL almost 2024. It goes in the bin along with PHP labeled "neat easy to use technology that is long past it's prime" So what's the new sexy alternative? Puppet/chef?
|
# ? Dec 13, 2023 13:14 |
|
fluppet posted:So what's the new sexy alternative? Puppet/chef? I’d choose ansible over chef/puppet all day every day.
|
# ? Dec 13, 2023 13:19 |
|
“new and sexy” would not have described puppet or chef a decade ago, much less now
|
# ? Dec 13, 2023 13:28 |
|
I guess GitOps (Argo or flux) is the current hot and sexy, but that only works if you’re deploying apps to k8s. You’ll still need something to provision the clusters like Terraform or Ansible.
|
# ? Dec 13, 2023 13:50 |
|
LochNessMonster posted:I’d choose ansible over chef/puppet all day every day. Absolutely, ruby sucks. Also Salt sucks too. Ansible is the least annoying of the four, however for this: Junkiebev posted:Business looking for something which does stuff like “disable smbv1 client connectivity on all endpoints”, and I’m looking for something like “Change State and track diffs at scale and the only requisite should be network connectivity and a Linux kernel.” ...I'd do immutable infrastructure and not worry about a config management tool, unless I really need one in the build pipeline.
|
# ? Dec 13, 2023 13:51 |
|
I've never really "gotten" ansible. Is it a scale thing? Because it feels very slow to me when I try to Do A Thing on a few hundred servers. The inventory setup is kind of annoying too; building groups of systems is clunky and running on a subset of them is worse. I do like the available modules though, the built-ins feel a lot more fleshed out than puppet's, so it's more convenient to do quick one-off tasks on a few machines. So someone explain to me why it's awesome.
|
# ? Dec 13, 2023 14:00 |
|
xzzy posted:I've never really "gotten" ansible. Is it a scale thing? Because it feels very slow to me when I try to Do A Thing on a few hundred servers. The inventory setup is kind of annoying too, building groups of systems is annoying and running on a subsection of them is worse. I have the same sentiments with regards to speed and inventory building. Then there’s the secrets part of it as well. It’s just less sucky than the alternatives, I guess.
|
# ? Dec 13, 2023 14:07 |
|
Hadlock posted:I can't imagine deploying ansible as a greenfield solution in TYOOL almost 2024. It goes in the bin along with PHP labeled "neat easy to use technology that is long past it's prime" Doesn't do much for the reporting needs in this situation, though. Configuration management sucks, fleet management sucks, there are no good answers. The right tool is a matter of figuring out which deficiencies are least important to you personally.
|
# ? Dec 13, 2023 16:31 |
|
Ansible is great because it’s just Python and SSH, and it’s hard to gently caress those up too much. I would definitely use it for config management over infrastructure provisioning, however. Terraform is fine for most long lived infrastructure components. Gitops is great and the way to go for everything where you’re not managing VMs, which happily in my work is most of the time. I think I’d go mad working at an org where you need to regularly spin up and manage VMs to serve your workloads. I don’t use Argo or Flux though. GitHub Actions/Gitlab Pipelines have so far been more than enough for all my needs, including deploying to 50+ kube clusters at a time. The Iron Rose fucked around with this message at 17:05 on Dec 13, 2023 |
# ? Dec 13, 2023 17:03 |