pram
Jun 10, 2001
'durrr kafka works fine on my laptop in docker'

jeffery
Jan 1, 2013

Arcsech posted:

gotcha

it is really hard to overstate how much better elasticsearch has gotten over the past couple years, and even better soon when the new consensus algo ships (no more minimum_master_nodes, among other things)

edit: full disclosure i guess, i have a vested interest in elasticsearch

what the hell is a minimum_master_node?

jeffery
Jan 1, 2013

pram posted:

lol no. it isnt. youve never used it for anything serious stfu. for example


1) kafka doesnt rebalance topics, ever. if a node is down thats it. the replica is just gone. it doesnt 'migrate' because this is 1998
2) kafka doesnt rebalance storage, ever. if you use JBOD it will just randomly put segments wherever it feels like. if a disk is full it just breaks
3) topic compaction impacts the entire cluster performance if its big enough. nothing you can do about it
4) will randomly break and require a full restart if it lags on the zookeeper state
https://issues.apache.org/jira/browse/KAFKA-2729
5) will effortlessly end up with two cluster controllers if one has degraded performance
6) will spend literal hours 'recovering' on a hard restart (kill) if you have compacted segments
7) replicating data to a replaced node will impact the entire cluster performance, hammering the socket server. and this cant be prevented BECAUSE
8) if you throttle performance it impacts the replica manager AND producers
9) leader rebalancing can still temporarily break producers


and more!

some of these have since been fixed or substantially improved in later releases

jeffery
Jan 1, 2013

lancemantis posted:

tbh a lot of software people consider magical scaling wizardry is a nightmare and I'm convinced the people bringing it in flee before the consequences hit or have never used it beyond toy projects

what are the alternatives to elastic scaling?

Arcsech
Aug 5, 2008

jeffery posted:

what the hell is a minimum_master_node?

you have to tell elasticsearch how many master-eligible nodes it needs to have a quorum to elect a master, typically 50%+1 of master-eligible nodes. that setting is called minimum_master_nodes, and if you set it too low your cluster can get split brain, while if you set it too high your cluster won't be able to tolerate as many node failures as it should

as of Soon it will figure this out for itself instead of you having to tell it (e: to be clear, you still have to tell it what nodes are in the cluster at startup, but it will keep itself up to date after that when you add/decommission nodes and you don’t have to remember to update the quorum in addition to the initial nodes list)

this is good because that setting is the biggest pain in the rear end and having an incorrect minimum_master_nodes is a great way to make your cluster take a dump big time
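
for the curious, the arithmetic is just majority-of-master-eligible. a rough sketch in python (the node count, the address, and leaning on the 6.x-era dynamic cluster settings API are assumptions on my part):

code:
# minimal sketch: compute the quorum and push it as a dynamic cluster setting
import requests

MASTER_ELIGIBLE = 3                   # hypothetical number of master-eligible nodes
quorum = MASTER_ELIGIBLE // 2 + 1     # 50% + 1 -> 2

# too low (1) and a partitioned cluster can elect two masters (split brain);
# too high (3) and losing a single node means no master can be elected at all
resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {"discovery.zen.minimum_master_nodes": quorum}},
)
resp.raise_for_status()
print(resp.json())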

Arcsech fucked around with this message at 05:13 on Mar 11, 2019

Silver Alicorn
Mar 30, 2008

𝓪 𝓻𝓮𝓭 𝓹𝓪𝓷𝓭𝓪 𝓲𝓼 𝓪 𝓬𝓾𝓻𝓲𝓸𝓾𝓼 𝓼𝓸𝓻𝓽 𝓸𝓯 𝓬𝓻𝓮𝓪𝓽𝓾𝓻𝓮
I still don’t know what any of this bull poo poo is

cowboy beepboop
Feb 24, 2001

the prom / grafana guys are making a log thing now

https://grafana.com/loki

no full text search though, also it only works with k8s atm

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS
lol @ elastic

https://aws.amazon.com/blogs/opensource/keeping-open-source-open-open-distro-for-elasticsearch/

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

my stepdads beer posted:

the prom / grafana guys are making a log thing now

https://grafana.com/loki

no full text search though, also it only works with k8s atm

lol y tho



lol y tho

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
that aws piece is some top notch concern trolling

akadajet
Sep 14, 2003


lmao

Arcsech
Aug 5, 2008

uncurable mlady posted:

that aws piece is some top notch concern trolling

yep

whats especially funny is that this is mostly repackaging/forks of existing oss projects: security=searchguard, sql=NLPchina/elasticsearch-sql, performance analyzer=perf, but they're playing it up like they did all the work instead of just slapping a new logo on and maybe some light modifications

in fairness i think the alerting thing might be new amazon-written code, or maybe i just haven't found where they took it from yet lol

crazysim
May 23, 2004
I AM SOOOOO GAY
didn't searchguard also have that open core, enterprise features thing? i think amazon implemented the enterprise features atop the open core too. they got owned too if that's the case.

Arcsech
Aug 5, 2008

crazysim posted:

didn't searchguard also have that open core, enterprise features thing? i think amazon implemented the enterprise features atop the open core too. they got owned too if that's the case.

amazons security "advanced modules" are literally just searchguard's "enterprise modules" with the license changed

right down to the TODOs and commented out code

ex:
FieldReadCallback.java from the amazon repo w/ apache license header
FieldReadCallback.java from the searchguard enterprise repo w/proprietary license header

i dunno if they did some kinda deal with searchguard or what but it's definitely the same code

e: lmao https://github.com/floragunncom/search-guard-enterprise-modules/issues/35

Arcsech fucked around with this message at 22:42 on Mar 13, 2019

cowboy beepboop
Feb 24, 2001

uncurable mlady posted:

lol y tho


lol y tho

i assume they got sick of waking up to CLUSTER: RED

pram
Jun 10, 2001
software just wants to be free - jeff bezos, free software advocate

Elos
Jan 8, 2009

current monitoring job status: i have 15 different unsee dashboards open and there's anything from 10 to 700 warnings/alerts bouncing around in each of them. i'm somehow supposed to keep an eye on all of these with only the two monitors i have

these are monitoring a bunch of datacenters that are a hellish mix of mesos-marathon microservice container stuff and poo poo running straight on the metal, located around the world. i'm connected to them through a collection of ssh tunnels and vpns that sometimes just decide to stop working.

there's a lot of gently caress A CONTAINER IS DOWN!!! alerts where you go check it and everything is fine. a bunch of alerts that once fired will hang around in the dashboard for 6 hours because ??? i'm never confident i'll catch the real problems with all this useless noise. there's grafana too but a lot of the stuff isnt configured right so you have to go massage some prometheus queries by hand to get the graphs you need

documentation is of course nonexistent and/or poo poo. configuring the monitoring is some other team's job and a lot of the time all i can do is open a ticket and hope. when poo poo hits the fan i have only a vague idea who's responsible for what and who the hell i'm supposed to call, so i get to wake up my team leader at four in the morning so he can figure it out

welp thats my story, back to lurking

Progressive JPEG
Feb 19, 2003

please watch these broken dashboards/alerts, and absolutely never under any circumstances fix any problems you find

maybe ask your manager why they aren't empowering you to maintain the things you depend on to do your job

supabump
Feb 8, 2014

breaking my vow of lurking to share my cool monitoring story

i work at one of your favorite tech monoliths and we're doing this big new enterprise service with like 20 different components frankensteined together. our monitoring was supposedly very good though. All 20 of these pos components were logging everything effectively, dashboards were all set up, etc.

for an entire 2 weeks, we had a live site issue where component A was DoSing component B for a customer with a shitload of data. wasn't obvious to the customer that anything was amiss, but our service was functionally useless to them for those two weeks, and we didn't have a loving clue

moral of the story is that none of your lovely monitoring does anything if nobody bothers to set up alerts (and that nobody looks at monitoring dashboards unprompted once the demo is over)

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

supabump posted:

breaking my vow of lurking to share my cool monitoring story

i work at one of your favorite tech monoliths and we're doing this big new enterprise service with like 20 different components frankensteined together. our monitoring was supposedly very good though. All 20 of these pos components were logging everything effectively, dashboards were all set up, etc.

for an entire 2 weeks, we had a live site issue where component A was DoSing component B for a customer with a shitload of data. wasn't obvious to the customer that anything was amiss, but our service was functionally useless to them for those two weeks, and we didn't have a loving clue

moral of the story is that none of your lovely monitoring does anything if nobody bothers to set up alerts (and that nobody looks at monitoring dashboards unprompted once the demo is over)

this is why hystrix loving owns

abigserve
Sep 13, 2009

this is a better avatar than what I had before

supabump posted:

breaking my vow of lurking to share my cool monitoring story

i work at one of your favorite tech monoliths and we're doing this big new enterprise service with like 20 different components frankensteined together. our monitoring was supposedly very good though. All 20 of these pos components were logging everything effectively, dashboards were all set up, etc.

for an entire 2 weeks, we had a live site issue where component A was DoSing component B for a customer with a shitload of data. wasn't obvious to the customer that anything was amiss, but our service was functionally useless to them for those two weeks, and we didn't have a loving clue

moral of the story is that none of your lovely monitoring does anything if nobody bothers to set up alerts (and that nobody looks at monitoring dashboards unprompted once the demo is over)

counterpoint: too many alerts


Elos posted:

current monitoring job status: i have 15 different unsee dashboards open and there's anything from 10 to 700 warnings/alerts bouncing around in each of them. i'm somehow supposed to keep an eye on all of these with only the two monitors i have

these are monitoring a bunch of datacenters that are a hellish mix of mesos-marathon microservice container stuff and poo poo running straight on the metal, located around the world. i'm connected to them through a collection of ssh tunnels and vpns that sometimes just decide to stop working.

there's a lot of gently caress A CONTAINER IS DOWN!!! alerts where you go check it and everything is fine. a bunch of alerts that once fired will hang around in the dashboard for 6 hours because ??? i'm never confident i'll catch the real problems with all this useless noise. there's grafana too but a lot of the stuff isnt configured right so you have to go massage some prometheus queries by hand to get the graphs you need

documentation is of course nonexistent and/or poo poo. configuring the monitoring is some other team's job and a lot of the time all i can do is open a ticket and hope. when poo poo hits the fan i have only a vague idea who's responsible for what and who the hell i'm supposed to call, so i get to wake up my team leader at four in the morning so he can figure it out

welp thats my story, back to lurking


Is all monitoring useless? The answer may surprise you!! (yes).

The only workflow I've found to work in the monitoring world is to only alert on things you won't otherwise be alerted to naturally - an example would be a core router being down; I don't need an email or SMS to tell me that my network is hosed. Instead, focus your alerts on the insidious smaller issues you may not notice for a very long time, like errors on an interface or high IO wait on a database server.

Then you need at least a once-a-week housecleaning of all the alerts. We do it at our team meeting and it takes about 10 minutes because we treat it as a religious affair - and if an alert has been red AND unacknowledged for more than a week, it gets removed from monitoring as it's clearly not important.
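
To make the first part concrete, this is roughly the shape of it - a sketch in python against the prometheus HTTP API (the server address and the 20% iowait threshold are made up; the metric names are node_exporter's):

code:
# check for the quiet failures: nic errors and db iowait, the stuff
# nobody notices "naturally" until it's been broken for weeks
import requests

PROM = "http://prometheus.example.internal:9090"   # hypothetical address

QUERIES = {
    # interfaces taking receive errors over the last 10m
    "nic_errors": 'sum by (instance) (irate(node_network_receive_errs_total[10m])) > 0',
    # hosts spending >20% of cpu time in iowait (threshold is a guess)
    "high_iowait": 'avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[10m])) > 0.2',
}

for name, expr in QUERIES.items():
    data = requests.get(f"{PROM}/api/v1/query", params={"query": expr}).json()
    for sample in data["data"]["result"]:
        instance = sample["metric"].get("instance", "?")
        print(f"{name}: {instance} = {sample['value'][1]}")
In real life you'd put those expressions into alerting rules rather than a script, but the point stands: page on the things you'd never spot from your desk.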

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
anyway here's some poo poo i've been working on behind the scenes for the past several months

https://twitter.com/opentracing/status/1111389502889574400?s=20

Sylink
Apr 17, 2004

Prometheus owns, if anyone has questions we use it all the time.

CRIP EATIN BREAD
Jun 24, 2002

Hey stop worrying bout my acting bitch, and worry about your WACK ass music. In the mean time... Eat a hot bowl of Dicks! Ice T



Soiled Meat
opentracing is cool

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

CRIP EATIN BREAD posted:

opentracing is cool

thanks, I hope we don’t completely gently caress it up with the merger!!

cowboy beepboop
Feb 24, 2001

Sylink posted:

Prometheus owns, if anyone has questions we use it all the time.

what do you find useful to monitor
do you install node exporter on every vm
use it for alerting?

Sylink
Apr 17, 2004

Mostly node exporter for basic system metrics. Disable all the collectors you dont need - one thing to do in prometheus is to not collect metrics you dont use or want, it saves on space and tedium. By default a ton of poo poo is enabled in the node exporter, and it will catch all kinds of useless disk device metrics depending on your setup (looking at u kubernetes)

We also use autodiscovery and we run kubernetes clusters, so there are a ton of useful kubernetes prometheus tie-ins; the prometheus operator is great.

Anyway, you get all your collectors and exporters up, then the fun begins. Now we build dashboards in grafana and they cover each level of our application, so we have a cluster view, a node view, and an application view to drill down through levels and into problems.

The dashboards are also committed to code if they are important.

The real skill is creating dashboards that are useful and understanding what each metric really is. A lot of prom stuff is counters which will trick you as they just go up over time, so you have to remember to take irates and so on.

Depending on your app, its very easy to make custom Prometheus exporters for scraping custom metrics. Do this where possible. Everything you can turn into a simple number metric is less digging through bullshit logs.
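
E.g. a bare-bones exporter with the official python client (the metric names and port here are made up, adjust to taste):

code:
# pip install prometheus_client
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# "simple number metrics" instead of grepping logs
QUEUE_DEPTH = Gauge("myapp_queue_depth", "Jobs currently waiting")        # hypothetical metric
JOBS_DONE = Counter("myapp_jobs_processed_total", "Jobs processed")       # counter: query with irate()

if __name__ == "__main__":
    start_http_server(8000)   # prometheus scrapes http://host:8000/metrics
    while True:
        QUEUE_DEPTH.set(random.randint(0, 50))   # stand-in for reading the real queue
        JOBS_DONE.inc()
        time.sleep(5)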

Logs loving suck and should be your last resort as much as possible imo. To the point where you go to the log just to confirm, in explicit text form, what you already know from the metrics.


And alerts are so easy to make and tweak, since they are just PromQL and the same queries you are using in the dashboards but with some condition attached, so you can play around in the Prom dash looking at data, find the right metrics, and copy paste that into an alert pretty much.

animist
Aug 28, 2018
so does OpenTracing do everything prometheus does, or should i be running some combination of OpenTracing + logging + metric collection? like is there some tracer i can plug into OpenTracing to make it pretend to be prometheus, or do i need to do that separately

also, how long does e.g. Jaeger keep traces around? are they created on-demand, or built up from some sort of in-memory thing, or... whatever

basically i'm a little confused about the practical difference between logs vs metrics vs spans/tags vs traces

keep in mind i don't really know what i'm talking about

animist fucked around with this message at 03:41 on Apr 30, 2019

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

animist posted:

so does OpenTracing do everything prometheus does, or should i be running some combination of OpenTracing + logging + metric collection? like is there some tracer i can plug into OpenTracing to make it pretend to be prometheus, or do i need to do that separately

also, how long does e.g. Jaeger keep traces around? are they created on-demand, or built up from some sort of in-memory thing, or... whatever

basically i'm a little confused about the practical difference between logs vs metrics vs spans/tags vs traces

keep in mind i don't really know what i'm talking about

they're different things, although we'd like to condense it to one library for instrumentation (this is the point of the OpenTracing/opencensus merger). I can post some more tomorrow about what it looks like today
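
but as a quick taste of the tracing side, here's a minimal python sketch with the opentracing API and the jaeger client (the service and operation names are made up):

code:
# pip install jaeger-client
import time

from jaeger_client import Config

config = Config(
    config={"sampler": {"type": "const", "param": 1}, "logging": True},
    service_name="checkout",          # hypothetical service name
    validate=True,
)
tracer = config.initialize_tracer()   # an opentracing-compatible tracer

# a trace is a tree of spans; tags are searchable key/values on a span,
# span logs are timestamped events inside it - distinct from your app
# logs and from prometheus-style metrics
with tracer.start_active_span("charge-card") as scope:
    scope.span.set_tag("customer.id", "c-42")
    scope.span.log_kv({"event": "gateway.retry", "attempt": 2})
    time.sleep(0.1)                   # stand-in for real work

time.sleep(2)      # give the reporter a moment to flush
tracer.close()
as for how long jaeger keeps traces: it just hands spans to whatever storage backend you point it at (in-memory, cassandra, elasticsearch), so retention is that backend's problem, not jaeger's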

cowboy beepboop
Feb 24, 2001


ty that was very helpful

Sylink
Apr 17, 2004

No problem, I could talk for days about metrics and data collection.

My only complaint about prometheus is that it's documented poorly with regard to actual usage, so you have to dig around. But this site has pretty much every link you'll find on Google: https://www.robustperception.io

Sylink
Apr 17, 2004

Does opentracing have anything for PHP that doesn't require editing your code project like the way New Relic works?

I work mainly with performance on PHP apps (they suck) and New Relic traces/transaction data is extremely useful, and easy to use since we dont control the codebases. I.e. the New Relic PHP module just drops in and does magic so we dont have to deal with editing anyone's code.

I've seen alternatives, but none are as easy and painless.

But New Relic pricing sucks if you have a lot of services you want to use it on.

suffix
Jul 27, 2013

Wheeee!
love to waste time on workarounds because prometheus devs have a stick up their rear end about some kind of ideological purity that requires hobbling the __name__ label

Schadenboner
Aug 15, 2011

by Shine
We do suricata at work. It’s poo poo, mlmp?

:shrug:

enotnert
Jun 10, 2005

Only women bleed
all I know about monitoring is I still maintain an out of date horrible thing that is now called "SMArTS/ViPR SRM" but we can't get the updates for some reason so I'm running "watch4net"

Also I just spent 3 weeks dealing with using TL1 vs SNMP cause reasons.

distortion park
Apr 25, 2011


Gotta agree with pram that kafka sucks poo poo and breaks constantly in frustrating ways. The tooling is also garbage.

ELK also seems to crash and lose data a lot but that might just be our infrastructure team, idk.

pram
Jun 10, 2001
kafka is so very bad i cant even

pram
Jun 10, 2001
im reminded of this beauty of an error, for example. kafka partitions would literally just get corrupted and the broken log file would prevent the broker from STARTING

https://issues.apache.org/jira/browse/KAFKA-3919

the only way you could fix it was to go delete the actual file sitting on the disk. if it affected multiple brokers (because of unclean leader election) hope you like data loss

and of course during all this you're totally offline so its a huge outage. epic and ftw

suffix
Jul 27, 2013

Wheeee!
kafka works fine if you don't touch it

but lol that pulsar and logdevice were open sourced and yet kafka is "good enough" so no one even bothers

Cerberus911
Dec 26, 2005
Guarding the damned since '05
Prometheus is cool and good.

What do people do about their historical metrics? I have my current deployment set to retain metrics for 3 months and throw away anything older.

Any recommendations for storing the older metrics? They would be very rarely queried. Is Thanos the way to go or are there other better solutions out there?
