we use a combination of elk and the logging software we sell (dogfooding is good) for logging, and datadog for monitoring. i think a small part still has some sensu + grafana for monitoring physical assets or something, idk
# Mar 2, 2019 18:48
# May 14, 2024 11:57
my stepdads beer posted:
> elk or graylog are cool up until the point you have to learn about maintaining an elasticsearch cluster

yeah, ama about maintaining an elk stack that processes a few tb of logs a day. it loving sucks
# Mar 3, 2019 00:36
Arcsech posted:
> what are the biggest problems you hit?

tbh it was a combination of the logstash indexers using up too much memory and the process getting killed by the oom killer, extremely chatty applications generating a gently caress-ton of logs and overwhelming the cluster, along with the way that elasticsearch handles clustering.

the indexers dying was an easier problem to solve: at first we would just scale the autoscaling group down to 1 and let it scale itself back up (scaling was based on cpu usage). eventually (after i left the team) they did something to the indexer configuration involving a master node which made things quite a bit more stable. i can ask someone about the fix if you're curious.

for chatty applications we would aggressively tear down autoscaling groups if we determined that they were overwhelming the logging cluster. this didn't happen much, but the few times it did, i'll be honest and say that it was super satisfying to tell a dev i was killing their app until they reduced the number of logs it generated.

in terms of elasticsearch clustering, we run the cluster in ec2, so when an instance is terminated or loses connectivity or the cluster loses a node for any other reason, the entire cluster dedicates itself to moving the replicas of the shards that were on the terminated instances elsewhere so that the replication strategy can be maintained. this causes the cluster to go red, which means that it won't process new logs until the pending tasks (shards being moved and replicated) complete. i'm sure there's a setting to tune or something, but we were never able to figure out a way to tell es to only attempt to execute a small set of tasks while still ingesting data.

the good news is that none of that actually caused long term ingestion issues or data loss; instead the logstash indexer queues would keep backing up until the es cluster went green and then logs would eventually catch up.
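(editor's aside: the indexer oom/backpressure problem above is part of why newer logstash versions can buffer events to disk instead of holding everything in heap. a minimal sketch, assuming logstash 5.4+ with persistent queues; the size and path values are illustrative, not what this team ran:)

```yaml
# logstash.yml -- persistent (on-disk) queue so a stalled/red es
# cluster backs up onto disk instead of blowing out the indexer heap
queue.type: persisted
queue.max_bytes: 8gb                 # cap disk usage for the queue
path.queue: /var/lib/logstash/queue  # where queued events are stored
```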
it wasn't great when logs went 15-30 minutes behind while teams were deploying new services and relied on logs being available to ensure service health, but we got through it. tbh we did a pretty good job of building out the automation and monitoring around how we deployed elasticsearch. during my time with that team i remember a few really lovely on-call rotations where most of my time was spent trying to figure out how to get a red cluster to go green quicker, but in terms of stability things were pretty good. i don't think we ever ended up actually losing data
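(editor's aside: the "setting to tune" the poster was looking for does exist in later elasticsearch versions: you can delay shard reallocation after a node drops, so a transient ec2 blip doesn't kick off a cluster-wide reshuffle, and throttle concurrent recoveries. a sketch, assuming a modern es on the default port; the specific values are arbitrary:)

```shell
# delay shard reallocation for 5 minutes after a node leaves,
# giving a bounced ec2 instance time to rejoin before shards move
curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}'

# limit how many shard recoveries a node runs at once,
# so recovery work doesn't starve ingestion
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}'
```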
# Mar 7, 2019 01:35
iirc it was a very old version somewhere in the 1.7 area
# Mar 7, 2019 23:58
lol @ elastic https://aws.amazon.com/blogs/opensource/keeping-open-source-open-open-distro-for-elasticsearch/
# Mar 13, 2019 13:37
supabump posted:
> breaking my vow of lurking to share my cool monitoring story

this is why hystrix loving owns
# Mar 16, 2019 22:50
Progressive JPEG posted:
> datadog gets very sad if you have lots of tags/cardinality

they had a bug for a long time where their unique identifier was the hostname, which is fine except we reuse ip addresses (and consequently hostnames) as instances come up and down, so we'd get all sorts of misattributed events and metrics until we figured out what was going on
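(editor's aside: one workaround for this class of problem is pinning an explicit hostname in the agent config instead of letting the agent guess one from the os. a minimal sketch, assuming the datadog agent's datadog.yaml; the hostname and tag values are made-up examples:)

```yaml
# datadog.yaml -- force a stable, unique identity per instance
# rather than relying on a hostname that gets reused
hostname: web-frontend-i-0abc123def456
tags:
  - instance_id:i-0abc123def456   # hypothetical tag for disambiguation
```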
# Jul 8, 2019 11:19
i think there were other conditions that caused it to become the identifier, but it was a big old pain in the butt, let me tell you
# Jul 8, 2019 13:47