Arcsech
Aug 5, 2008

Blinkz0rz posted:

yeah ama about maintaining an elk stack that processes a few tb of logs a day

it loving sucks

what are the biggest problems you hit?

Arcsech
Aug 5, 2008

Blinkz0rz posted:

tbh it was a combination of the logstash indexers using up too much memory and the process getting killed by the oom killer, extremely chatty applications generating a gently caress-ton of logs and overwhelming the cluster, along with the way that elasticsearch handles clustering.

the indexers dying was an easier problem to solve: at first we would just scale the autoscaling group down to 1 and let it scale itself back up (scaling was based on cpu usage). eventually (after i left the team) they did something to the indexer configuration involving a master node which made things quite a bit more stable. i can ask someone about the fix if you're curious.

for chatty applications we would aggressively tear down autoscaling groups if we determined that they were overwhelming the logging cluster. this didn't happen much but the few times it did i'll be honest and say that it was super satisfying to tell a dev i was killing their app until they reduced the number of logs it generated.

in terms of elasticsearch clustering, we run the cluster in ec2, so when an instance is terminated or loses connectivity or the cluster loses a node for any other reason, the entire cluster dedicates itself to moving the replicas of the shards that were on the terminated instances elsewhere so that the replication strategy can be maintained. this causes the cluster to go red, which means it won't process new logs until the pending tasks (shards being moved and replicated) complete. i'm sure there's a setting to tune or something but we were never able to figure out a way to tell es to only attempt to execute a small set of tasks while still ingesting data.

the good news is that none of that actually caused long term ingestion issues or data loss; instead the logstash indexer queues would keep backing up until the es cluster went green and then logs would eventually catch up. it wasn't great when logs went 15-30 minutes behind while teams were deploying new services and relying on logs being available to ensure service health, but we got through it

tbh we did a pretty good job of building out the automation and monitoring around how we deployed elasticsearch. during my time with that team i remember a few really lovely on-call rotations where most of my time was spent trying to figure out how to get a red cluster to go green quicker but in terms of stability things were pretty good. i don't think we ever ended up actually losing data

thanks, this is very interesting

it surprises me that the loss of a single node would cause a cluster to go red and stop ingesting, unless you had an index with no replicas (don't do that) and actually had legit data loss. or if your discovery/minimum_master_nodes setup was off, i guess.
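for reference, the knobs i'd usually poke at for that are delayed allocation and recovery throttling, so a single node bouncing doesn't kick off a full shard shuffle that starves ingestion. rough sketch against the REST API with python requests — the url and index pattern are made up, and whether these settings are available depends on your version:

```python
import requests

ES = "http://localhost:9200"  # made-up address, point at your own cluster

# see how bad things are: status, unassigned shards, pending cluster tasks
health = requests.get(f"{ES}/_cluster/health").json()
print(health["status"], health["unassigned_shards"], health["number_of_pending_tasks"])

# give a departed node a grace period to come back before its shards get
# reallocated, so a restart or network blip doesn't trigger a recovery storm
requests.put(
    f"{ES}/logstash-*/_settings",  # hypothetical index pattern
    json={"settings": {"index.unassigned.node_left.delayed_timeout": "5m"}},
)

# throttle how aggressively recoveries run so indexing isn't crowded out
requests.put(
    f"{ES}/_cluster/settings",
    json={
        "transient": {
            "cluster.routing.allocation.node_concurrent_recoveries": 2,
            "indices.recovery.max_bytes_per_sec": "100mb",
        }
    },
)
```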

e: what major version, if you remember?

Arcsech
Aug 5, 2008

Blinkz0rz posted:

iirc it was a very old version somewhere in the 1.7 area

gotcha

it is really hard to overstate how much better elasticsearch has gotten over the past couple years, and even better soon when the new consensus algo ships (no more minimum_master_nodes, among other things)

edit: full disclosure i guess, i have a vested interest in elasticsearch

Arcsech fucked around with this message at 21:21 on Mar 8, 2019

Arcsech
Aug 5, 2008

jeffery posted:

what the hell is a minimum_master_node?

you have to tell elasticsearch how many master-eligible nodes it needs to see to form a quorum and elect a master, typically 50%+1 of the master-eligible nodes. that setting is called minimum_master_nodes; if you set it too low your cluster can get split brain, and if you set it too high your cluster won't be able to tolerate as many node failures as it should

as of Soon it will figure this out for itself instead of you having to tell it (e: to be clear, you still have to tell it what nodes are in the cluster at startup, but it will keep itself up to date after that when you add/decommission nodes and you don’t have to remember to update the quorum in addition to the initial nodes list)

this is good because that setting is the biggest pain in the rear end and having an incorrect minimum_master_nodes is a great way to make your cluster take a dump big time
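if anyone wants the actual arithmetic: it's just a majority of the master-eligible nodes, floor(n/2)+1. tiny sketch below — the host is made up, and the runtime-settings bit only applies to pre-7 zen discovery clusters as far as i recall:

```python
import requests

def quorum(master_eligible: int) -> int:
    """Majority of master-eligible nodes: floor(n/2) + 1."""
    return master_eligible // 2 + 1

print(quorum(3), quorum(5))  # -> 2 3

# pre-7.x zen discovery: discovery.zen.minimum_master_nodes has to match this
# value; it normally lives in elasticsearch.yml, and iirc it can also be
# pushed at runtime via the cluster settings API:
requests.put(
    "http://localhost:9200/_cluster/settings",  # made-up address
    json={"persistent": {"discovery.zen.minimum_master_nodes": quorum(3)}},
)

# 7.x+ drops the setting entirely; you list the initial masters once
# (cluster.initial_master_nodes in elasticsearch.yml) and the cluster keeps
# its voting configuration up to date on its own after that.
```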

Arcsech fucked around with this message at 05:13 on Mar 11, 2019

Arcsech
Aug 5, 2008

uncurable mlady posted:

that aws piece is some top notch concern trolling

yep

what's especially funny is that this is mostly repackaging/forks of existing oss projects: security=searchguard, sql=NLPchina/elasticsearch-sql, performance analyzer=perf, but they're playing it up like they did all the work instead of just slapping a new logo on and maybe making some light modifications

in fairness i think the alerting thing might be new amazon-written code, or maybe i just haven't found where they took it from yet lol

Arcsech
Aug 5, 2008

crazysim posted:

didn't searchguard also have that open core, enterprise features thing? i think amazon implemented the enterprise features atop the open core too. they got owned too if that's the case.

amazon's security "advanced modules" are literally just searchguard's "enterprise modules" with the license changed

right down to the TODOs and commented out code

ex:
FieldReadCallback.java from the amazon repo w/ apache license header
FieldReadCallback.java from the searchguard enterprise repo w/proprietary license header

i dunno if they did some kinda deal with searchguard or what but it's definitely the same code

e: lmao https://github.com/floragunncom/search-guard-enterprise-modules/issues/35

Arcsech fucked around with this message at 22:42 on Mar 13, 2019
