we use a combination of elk and the logging software we sell (dogfooding is good) for logging, and datadog for monitoring. i think a small part still has some sensu + grafana for monitoring physical assets or something, idk
# Mar 2, 2019 18:48
# May 14, 2024 11:57
my stepdads beer posted:
> elk or graylog are cool up until the point you have to learn about maintaining an elasticsearch cluster

yeah, ama about maintaining an elk stack that processes a few tb of logs a day. it loving sucks
# Mar 3, 2019 00:36
Arcsech posted:
> what are the biggest problems you hit?

tbh it was a combination of the logstash indexers using up too much memory and the process getting killed by the oom killer, extremely chatty applications generating a gently caress-ton of logs and overwhelming the cluster, along with the way that elasticsearch handles clustering.

the indexers dying was an easier problem to solve: at first we would just scale the autoscaling group down to 1 and let it scale itself back up (scaling was based on cpu usage). eventually (after i left the team) they did something to the indexer configuration involving a master node which made things quite a bit more stable. i can ask someone about the fix if you're curious.

for chatty applications we would aggressively tear down autoscaling groups if we determined that they were overwhelming the logging cluster. this didn't happen much, but the few times it did, i'll be honest and say that it was super satisfying to tell a dev i was killing their app until they reduced the number of logs it generated.

in terms of elasticsearch clustering, we run the cluster in ec2, so when an instance is terminated or loses connectivity or the cluster loses a node for any other reason, the entire cluster dedicates itself to moving the replicas of the shards that were on the terminated instances elsewhere so that the replication strategy can be maintained. this causes the cluster to go red, which means that it won't process new logs until the pending tasks (shards being moved and replicated) complete. i'm sure there's a setting to tune or something, but we were never able to figure out a way to tell es to only attempt to execute a small set of tasks while still ingesting data.

the good news is that none of that actually caused long term ingestion issues or data loss; instead the logstash indexer queues would keep backing up until the es cluster went green and then logs would eventually catch up.
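(editor's aside: the indexer oom/backpressure problem above is part of why newer logstash versions can buffer events to disk instead of holding everything in heap. a minimal sketch, assuming logstash 5.4+ with persistent queues; the size and path values are illustrative, not what this team ran:)

```yaml
# logstash.yml -- persistent (on-disk) queue so a stalled/red es
# cluster backs up onto disk instead of blowing out the indexer heap
queue.type: persisted
queue.max_bytes: 8gb                 # cap disk usage for the queue
path.queue: /var/lib/logstash/queue  # where queued events are stored
```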
it wasn't great when logs went 15-30 minutes behind while teams were deploying new services and relied on logs being available to ensure service health, but we got through it. tbh we did a pretty good job of building out the automation and monitoring around how we deployed elasticsearch. during my time with that team i remember a few really lovely on-call rotations where most of my time was spent trying to figure out how to get a red cluster to go green quicker, but in terms of stability things were pretty good. i don't think we ever ended up actually losing data
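(editor's aside: the "setting to tune" the poster was looking for does exist in later elasticsearch versions: you can delay shard reallocation after a node drops, so a transient ec2 blip doesn't kick off a cluster-wide reshuffle, and throttle concurrent recoveries. a sketch, assuming a modern es on the default port; the specific values are arbitrary:)

```shell
# delay shard reallocation for 5 minutes after a node leaves,
# giving a bounced ec2 instance time to rejoin before shards move
curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}'

# limit how many shard recoveries a node runs at once,
# so recovery work doesn't starve ingestion
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}'
```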
# Mar 7, 2019 01:35
iirc it was a very old version somewhere in the 1.7 area
# Mar 7, 2019 23:58
lol @ elastic https://aws.amazon.com/blogs/opensource/keeping-open-source-open-open-distro-for-elasticsearch/
# Mar 13, 2019 13:37
supabump posted:
> breaking my vow of lurking to share my cool monitoring story

this is why hystrix loving owns
# Mar 16, 2019 22:50
Progressive JPEG posted:
> datadog gets very sad if you have lots of tags/cardinality

they had a bug for a long time where their unique identifier was the hostname, which is fine except we reuse ip addresses (and consequently hostnames) as instances come up and down, so we'd get all sorts of misattributed events and metrics until we figured out what was going on
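(editor's aside: one workaround for this class of problem is pinning an explicit hostname in the agent config instead of letting the agent guess one from the os. a minimal sketch, assuming the datadog agent's datadog.yaml; the hostname and tag values are made-up examples:)

```yaml
# datadog.yaml -- force a stable, unique identity per instance
# rather than relying on a hostname that gets reused
hostname: web-frontend-i-0abc123def456
tags:
  - instance_id:i-0abc123def456   # hypothetical tag for disambiguation
```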
# Jul 8, 2019 11:19
i think there were other conditions that caused it to become the identifier, but it was a big old pain in the butt, let me tell you
# Jul 8, 2019 13:47