|
Cancelbot posted:RabbitMQ surely is consistent though? even if it misses out the whole "available" and "partition tolerant" parts of the Holy poo poo that sounds awful. I used to operate multiple clusters running 3.6.5 which was before they distributed the stats db, so under heavy load the node where the stats db landed would continue to eat up memory until it OOMed and the cluster partitioned. At one point we had a runbook for memory usage on rabbit nodes that basically linked to this page: https://www.rabbitmq.com/management.html#stats-db and told the person on-call to keep trying to reset the stats db until it actually took. That was the worst on-call experience of my entire career. Of course even though upgrading to a later version would have fixed it, the engineering team decided to migrate all of their queues to SQS which meant we finally got to murder rabbit. That was the best on-call experience of my entire career.
|
# ¿ May 14, 2019 00:50 |
|
uncurable mlady posted:I’m curious if anyone itt has looked at Honeycomb/LightStep/Omnition. my team is currently having an absolutely awful time with our logging solution and are considering moving to honeycomb. i wasn't with the team when they did the demo but from what i heard it was great tech that was just a little too pricey to consider switching to
|
# ¿ Sep 28, 2019 00:19 |
|
For those folks who have Jenkins or something else apply their Terraform changes, how do you handle cases where the apply fails because Terraform can't work out the correct order to create things, or if you get rate limited, or if your IAM role's (you're using roles, right?) temporary creds expire, or if you have a resource limit that you hit during the apply, or really any number of other things that might cause a plan to succeed but an apply to fail? Our platform team has been a bit shy about the idea of automating applies but I'd love to be able to do it if someone has a good answer for how to recover TF from a bad state that a machine put it in.
|
# ¿ Nov 17, 2019 01:37 |
|
CMYK BLYAT! posted:Kubernetes good. Kubernetes users bad. Organizations bad. If you as an operations engineer can't provide easy to use tools for product engineers to release and run their code then you've failed in your job. Kubernetes is a layer of abstraction that makes deployment and operations by users more complex than it needs to be and provides so many moving parts that touching it in the wrong way can cause cascading issues in places a user couldn't possibly imagine. Golden image worked perfectly and was extremely simple, easy to reason about failure cases, and scaled nicely. I have no idea why people needed to completely reinvent the wheel.
|
# ¿ Dec 14, 2019 20:19 |
|
CMYK BLYAT! posted:I'd love for things to be simpler, I really would. Sadly, we don't have infinite time and resources to try and abstract away every possible decision that's necessary when deploying complex, rapidly-evolving software into every conceivable (and often badly-designed) network architecture. This is sort of the crux of my argument against Kubernetes. 99% of applications deployed in k8s don't require the complexity that k8s brings. It's resume-driven development for ops teams and it loving sucks to be on the product eng side when dealing with it.
|
# ¿ Dec 14, 2019 21:16 |
|
12 rats tied together posted:Cloudformation is a fantastic service that everyone who works in AWS should know, even if you don't use it actively, simply because the Cloudformation resource reference doubles as the best API documentation available for the platform. Cloudformation is awful and no one will ever convince me otherwise
|
# ¿ Feb 5, 2020 23:05 |
|
Osmosisch posted:AWS interfaces There's your first problem. Don't use the AWS console to build or set up anything. It's just not worth the pain and frustration.
|
# ¿ Feb 12, 2020 18:56 |
|
Osmosisch posted:Oh no i meant their overloaded web interface. Yeah, their web interface is called "The Console". Methanar posted:My company built and maintains their own internal version of the aws web console with flask. lmao this is awful
|
# ¿ Feb 13, 2020 01:26 |
|
New Yorp New Yorp posted:As awful as having all cloud infrastructure maintained by a separate team time-shifted by about 10 hours, driven by service now tickets? Nope that's way worse!
|
# ¿ Feb 13, 2020 01:43 |
|
Hadlock posted:I put up an ops manager job listing on the jobs thread. If you do Linux + containers, and have not terrible opinions, hit me up. Full time remote. Hadlock posted:Pay: low to mid 100s Lmao, try $180k+ if you want a decent candidate
|
# ¿ Feb 22, 2020 16:27 |
|
Necronomicon posted:So I've got an AWS/boto3/Rancher conundrum I'm hoping y'all can help out with. I've been trying to put together a python script to do the following: Use IAM roles instead of users and keys
|
# ¿ Feb 25, 2020 23:57 |
|
Nomnom Cookie posted:kiam is crap and kube2iam is worse Use EKS and the OIDC provider https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html
|
# ¿ Feb 26, 2020 00:08 |
|
Nomnom Cookie posted:Migrating to EKS is our perpetual good to do but other stuff is higher priority project, so we use api keys. In that case, while kube2iam isn't great, it's heaps better than static creds.
|
# ¿ Feb 26, 2020 00:25 |
|
Nomnom Cookie posted:It’s not though! It’s racy and abandoned. A thing that works is always better than a thing that doesn’t work. Static creds are probably one of the most dangerous things to have floating in your environment. At least tell me you're using policy conditions to bind them to specific ec2 instances rather than usable by anyone from anywhere.
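A rough sketch of what a policy condition like that could look like, for anyone following along. It assumes the usual approximation of "bound to specific instances": a deny statement keyed on `aws:SourceIp`, so the keys only work from addresses you control (instance Elastic IPs, a NAT gateway, and so on). The function name and the example CIDR are made up for illustration.

```python
import json


def deny_unless_from(cidrs):
    """Build an IAM policy document that denies every action unless the
    request originates from one of the given CIDR ranges.

    Assumption: the allowed ranges are the public addresses of the EC2
    instances (or NAT gateway) that are supposed to hold the static keys.
    """
    cidrs = list(cidrs)
    if not cidrs:
        raise ValueError("need at least one allowed CIDR")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideKnownAddresses",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # NotIpAddress matches when the caller's IP is outside every range.
                "Condition": {"NotIpAddress": {"aws:SourceIp": cidrs}},
            }
        ],
    }


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only address, not a real host.
    print(json.dumps(deny_unless_from(["203.0.113.10/32"]), indent=2))
```

Attached to the user that owns the keys, a leaked key is then useless from anywhere else. A blanket deny like this can also block calls that AWS services make on your behalf, so treat it as a starting point rather than a finished policy.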
|
# ¿ Feb 26, 2020 12:46 |
|
Nomnom Cookie posted:You’re not wrong, but if forced to choose between spraying api keys everywhere or deploying kube2iam I’d take the Yikes
|
# ¿ Feb 26, 2020 14:40 |
|
I've heard bad things about kube2iam but we ran it in prod for a year and a half in 5 regions with only one, transient issue I can remember that went away when we did a rolling restart. Not to say it's not a dumpster fire but when faced with that or key spray I'll gladly take the dumpster fire any day of the week.
|
# ¿ Feb 26, 2020 23:37 |
|
What's the current hotness for managing Jenkins job definitions in code? Is it still pipelines with Jenkinsfiles in the project repo or something else?
|
# ¿ Mar 2, 2020 14:57 |
|
CyberPingu posted:Ive got that in already, its just how to pass that into the below so that it reads likle code:
|
# ¿ Mar 5, 2020 14:14 |
|
CyberPingu posted:You cant add a variable into a variable though. Can you use join to compose the whole string?
|
# ¿ Mar 5, 2020 14:48 |
|
The troll answer is to make you nervous so you vendor them yourself. Real answer is probably no good reason other than some internal CNCF agreement
|
# ¿ Apr 16, 2020 22:55 |
|
We use job DSL for the precise reason Gyshall mentioned: it lets our teams manage things. Of course it really means that product teams complain to the platform team every time something goes wrong because why bother reading logs or trying to understand how systems work?
|
# ¿ Jun 5, 2020 17:08 |
|
Our aws misconfiguration situation is wild. We sell a product that does it well, recently acquired a company that does a great job at it, and I wrote a tool that does it decently before the first 2 things existed. Please don't dox me, ok?
|
# ¿ Aug 15, 2020 15:42 |
|
Datagrip or intellij with the DB plugin
|
# ¿ Aug 28, 2020 15:41 |
|
Use okta to federate AWS account access and then use AWS auth for wks Ez pz
|
# ¿ Sep 14, 2020 01:58 |
|
we're using the alb ingress controller and it's quite needs suiting might not apply for you tho
|
# ¿ Oct 16, 2020 00:25 |
|
Not a comprehensive list but I follow Seth Vargo (@sethvargo), Charity Majors (@mipsytipsy), Corey Quinn (@QuinnyPig), Mitchell Hashimoto (@mitchellh), and @SimpsonsOps and find them to be pretty good
|
# ¿ Oct 29, 2020 22:46 |
|
PCjr sidecar posted:Kelsey Hightower, Liz Fong Jones, Erowid Recruiter Oooh yeah forgot about Liz
|
# ¿ Oct 30, 2020 01:45 |
|
Methanar posted:the only acceptable package management system is git clone Ok Rob Pike
|
# ¿ Nov 7, 2020 13:42 |
|
We use Datadog for metrics and monitoring and it's fine I guess. Lot fewer things to configure and keep running versus most other comparable metrics platforms. If I had to solve logging I'd go with a managed elastic setup 'cause right now we dogfood our siem product's log management tool and it's not great for application logging.
|
# ¿ Nov 14, 2020 16:56 |
|
12 rats tied together posted:The best scaffolding tool for terraform is ansible. It doesn't come up very often during searches because if you're going to use terraform you've likely already encountered and chosen not to use ansible, and if you were already using ansible you didn't need terraform in the first place. The integration is there for you and available though, and I've enjoyed using it a lot at past roles when dealing with people who are ideologically biased against ansible for some reason (usually a misunderstanding of YAML). Do you work at red hat or something? I can't think of a single post you've made that hasn't pushed ansible as a catch-all solution for automation. It's not a bad tool but if someone asks how to scaffold a terraform project the answer isn't some entirely unrelated tool jfc
|
# ¿ Dec 12, 2020 15:12 |
|
12 rats tied together posted:I do not and would not work for red hat but I have been using ansible at work for the past 7 years. As a couple other posters have mentioned, it's not an entirely unrelated tool, it being pitched as config management is purely a post-acquisition piece of marketing that you shouldn't take too seriously. Red hat wants to sell Tower licenses and support and to do that they're pitching it the best way they know how. It's absolutely a config management tool that was written as an alternative to chef and puppet and saying anything else is just revisionist history. quote:I think you'd be surprised, there is no better way to orchestrate cloudformation that I've come across since like 2015 or so. Cloudformation by itself is definitely an awful tool but using ansible to drive is a top tier workflow in the "declarative cloud provider api" space. Cdk is far better than templating cloudformation yaml or json via ansible and jinja
|
# ¿ Dec 12, 2020 19:04 |
|
12 rats tied together posted:You can learn and then string together 5 or 6 different tools for this or you can just pip install ansible and get to work. I hope this helps illustrate why I bring it up in every single IT post I make. You learned how to use a pocket knife and now, to you, every problem can be solved with one regardless of whether there's a more appropriate tool.
|
# ¿ Dec 12, 2020 19:30 |
|
The Fool posted:Ok, my question was pretty vague and I think some of you were searching for an xy problem. I got what you were asking. Our infra team generally delegates ownership to dev teams so most of the team TF repos are kinda choose your own adventure wrt organization. However, at the account management level (i.e. for organization config or idp config), they use cookiecutter to generate a well-defined structure. I bet you could do something similar as a way to let teams bootstrap their setups.
|
# ¿ Dec 12, 2020 21:12 |
|
i haven't done much complex stuff in ansible but couldn't you curl the es health endpoint as a blocking operation until it reports ready and then continue to the next task? we did something similar with our chef cookbooks around ensuring consul availability. i think we had to ultimately write a little bit of ruby to do it but it wasn't particularly painful. if percentage waits until playbook completion for a given node you should be fine as long as your percentage groups are small enough that you're not negatively impacting the cluster.
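Roughly what i mean by a blocking health check, sketched in python instead of curl so it's easy to test. It assumes the standard `_cluster/health` endpoint, which returns JSON with a `status` of green, yellow or red; the host, port, timings and function names are placeholders.

```python
import json
import time
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the 'status' field from an Elasticsearch cluster health response."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp).get("status")


def wait_for_green(url, fetch=fetch_status, interval=5.0, max_wait=600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Block until the cluster reports green, or give up after max_wait seconds.

    Returns True if green was seen, False on timeout.
    """
    deadline = clock() + max_wait
    while True:
        try:
            if fetch(url) == "green":
                return True
        except (OSError, ValueError):
            # Connection refused or a half-written response: the node is
            # probably still restarting, so keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Assumed local node; adjust host and port for a real cluster.
    ok = wait_for_green("http://localhost:9200/_cluster/health")
    raise SystemExit(0 if ok else 1)
```

Run as a script it exits non-zero on timeout, so whatever is driving the restart can stop instead of moving on to the next node.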
|
# ¿ Dec 13, 2020 15:54 |
|
LochNessMonster posted:The problem is that task 1 restarts the service and task 2 is the health check. No matter which solution I’m trying, ansible keeps running task 1 on all nodes before doing the health check (which it should do after each node). this feels like something that might require a semaphore somewhere outside of the process. another possible option, if this is in the cloud, would be to set it up as a proper asg with a status check on the es health endpoint and write some quick bash to terminate each old node, wait until the asg capacity is back to full, then terminate the next one, etc.
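The asg version could look something like this. It's python rather than bash, and it's only a sketch: it assumes a boto3 `autoscaling` client is passed in, that the group's health check already reflects es health, and that you know which instance ids are the old ones. `roll_group` and `in_service_ids` are names i made up.

```python
import time


def in_service_ids(asg_client, group_name):
    """Return (set of healthy in-service instance ids, desired capacity)."""
    resp = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    group = resp["AutoScalingGroups"][0]
    ids = {
        inst["InstanceId"]
        for inst in group["Instances"]
        if inst["LifecycleState"] == "InService"
        and inst["HealthStatus"] == "Healthy"
    }
    return ids, group["DesiredCapacity"]


def roll_group(asg_client, group_name, old_instance_ids, poll=15.0,
               sleep=time.sleep):
    """Terminate old instances one at a time, waiting for the group to
    return to full healthy capacity before touching the next one."""
    replaced = []
    for instance_id in old_instance_ids:
        # Keep desired capacity unchanged so the group launches a replacement.
        asg_client.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            ShouldDecrementDesiredCapacity=False,
        )
        while True:
            sleep(poll)
            healthy, desired = in_service_ids(asg_client, group_name)
            # Done once the old node is gone and capacity is back to full.
            if instance_id not in healthy and len(healthy) >= desired:
                break
        replaced.append(instance_id)
    return replaced
```

In real use you'd call it with `boto3.client("autoscaling")` and probably add an overall timeout so a replacement that never comes up doesn't hang the loop forever.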
|
# ¿ Dec 13, 2020 16:35 |
|
LochNessMonster posted:It’s fully on prem unfortunately and working with ansible is considered black magic around here. Honestly it might make sense to just write something quick and dirty that spins until the cluster is healthy, acquires a lock somewhere (DB, consul, etcd, redis, whatever), restarts the service, then releases the lock. Run that on every node at the same time. It'll take a while but at least you won't have to coordinate everything manually.
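Something along these lines, as a minimal sketch. The three pieces are passed in as callables because the details depend on your setup: `is_healthy` would hit the cluster health endpoint, `restart` would bounce the service, and `lock` is anything with a non-blocking `acquire` and a `release`. A real shared lock in consul, etcd or redis would need a small adapter to fit that shape; `threading.Lock` only stands in for it here. The second health check under the lock is my addition, to cover another node restarting between the first check and the acquire.

```python
import threading
import time


def restart_when_safe(is_healthy, lock, restart, poll=5.0, sleep=time.sleep):
    """Spin until the cluster is healthy and the lock is free, then restart.

    is_healthy: callable returning True when the cluster is fully healthy.
    lock:       object with acquire(blocking=False) -> bool and release().
    restart:    callable that restarts the local service.
    """
    while True:
        if is_healthy() and lock.acquire(blocking=False):
            try:
                # Re-check now that we hold the lock, in case another node
                # restarted in the gap between the check and the acquire.
                if is_healthy():
                    restart()
                    return
            finally:
                lock.release()
        sleep(poll)


if __name__ == "__main__":
    # Stand-ins for illustration only: always healthy, in-process lock.
    restart_when_safe(
        is_healthy=lambda: True,
        lock=threading.Lock(),
        restart=lambda: print("restarting service"),
        poll=0,
    )
```

Every node runs the same loop at the same time; whoever gets the lock while the cluster is healthy goes next, and everyone else keeps spinning.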
|
# ¿ Dec 13, 2020 20:33 |
|
Gyshall posted:Imagine wanting to use interpolation in providers/versions. Yeah, dynamic versioning is a pretty awful smell that you're doing something fundamentally at odds with what terraform is designed to do. 12 rats tied together posted:e: It's also possible I'm doing this wrong because this is way harder to test, but it seems like the "conditionally null parameter key and value" problem is still here: The established pattern since forever is to have a boolean variable at the module level and then define a resource's count based on the variable's value. Of course it gets complicated with dependent or linked resources but that's kind of what you'd expect with a declarative graph.
|
# ¿ Feb 4, 2021 14:28 |
|
Man you have a lot of opinions about how to use a tool you admit that you've only used minimally
|
# ¿ Feb 7, 2021 21:46 |
|
Nm then, I thought I'd read a post from you where you said you hadn't used terraform in years.
|
# ¿ Feb 7, 2021 22:15 |
|
Methanar posted:spinnaker sux It's overengineered nonsense but it shouldn't surprise you given that it's Asgard's successor.
|
# ¿ Feb 14, 2021 20:06 |