what the fuck is prometheus anyway? a thread about monitoring

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > YOSPOS > what the fuck is prometheus anyway? a thread about monitoring

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

what's this got to do with cloudflare ray

# ¿ Feb 22, 2019 19:28

Adbot: ADBOT LOVES YOU

# ¿ May 14, 2024 14:13

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

How can you make sure your monitoring is working tho

# ¿ Feb 23, 2019 17:02

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

uncurable mlady posted:

the SLA SLO SLI le lo

what the gently caress is a SLI, SLA, or a SLO? they're some hot-poo poo terms everyone and their brother loves throwing around because they also read the loving SRE book (or, more likely, read a blog by someone who read the SRE book) but doesn't really understand that well and probably hasn't implemented them at all for their service or product. im gonna explain why, unlike most bullshit that seeps out from the toxic hellstew of a walking antitrust lawsuit named google, they're actually useful concepts.

SLA me daddy
stands for 'service level agreement'. this is probably something you've heard of, or possibly had to implement yourself. all of the five nines jokes go here. SLAs are mostly something that you get in trouble for breaking. that trouble might be financial (cash penalties or discounts for breaking a SLA) or organizational (your team gets reorg'ed into deadend project dungeons because you keep loving poo poo up for everyone else). SLAs aren't really for you, person reading this - they're for people that consume your service.

a good SLA is going to be realistic, but it can be crafted in a way that gives you a lot of latitude on how to actually fix your poo poo or make it better. you can (and should) be pretty specific with SLAs - maybe there's critical consumers that have them, but others don't, this gives you a knob to turn for performance tuning or request throttling. also it's important to be able to quantify 'legitimate' traffic or requests in your SLA. maybe a consumer fucks up and starts sending malformed queries to your service, which you handle and return an error for. should those count against overall 'uptime'? probably not, and they certainly shouldn't count against the overall performance of your service.

SLO down
a SLO is a 'service level objective'. if the SLA is what you're promising, the SLO is what you want. this might seem to be a subtle difference, but it's not in a lot of ways. think about it this way. you release a new service, cool! it's got some neat features, but it's not your main priority - and even if it is, you don't want to get paged in the middle of the night when something breaks. maybe after a while you'd like it to have high availability, but you didn't really put in the extra work at the beginning to handle redundancy, multi-az, poo poo like that. after all, isn't the point of all this agile poo poo to work iteratively? so you send out some emails and slack about your new API that fidgets foozles or quuz's quuxes and some people start using it, you start getting feedback, poo poo's going great. in your mind, you still have that goal of it being available during business hours, but it's actually much more reliable - it's up all the time.

but it starts to get popular. more people start relying on your service to quuz their quuxes. some dev in another team finds your service and decides to use it instead of writing their own, and now you're serving a lot more requests than you thought you'd be. poo poo goes wrong, things get caught crashlooping, and suddenly you're getting texts in the middle of the night because a public feature is down because your poo poo broke. bad scene.

how would an SLO have fixed this? well, they're published, for one. would other rando dev have started relying on your service if they knew you had a 48 hour turnaround on even looking at issues opened against it? two, they give you a 'error budget' (or 'downtime budget') in order to gently caress poo poo up in order to make it better. SLOs are basically an internal contract that gives you leverage to actually run services with actual users (so that theoretically, you can improve on them by getting feedback) without making you rip your hair out and/or want to murder everyone you work with (for a certain specific subset of reasons you'd want to murder your coworkers).

SLI but not the graphics card kind
so i've got my sla and my slo, what's a SLI? it's a 'service level indicator' and it's basically where monitoring comes into this mess. SLIs are the measurements you use to ensure you're hitting your SLO/SLA. this is probably a separate post on its own, but creating good SLIs can be difficult, and creating bad SLIs can be very easy. a good example is up in this post, when I was talking about SLAs - should I count 'bad requests' as part of my SLA? do I care that i'm successfully completing requests, or do I care that I can accept them at all? it's a lot more nuanced than simply "PINGDOM SAYS YOU'RE UP SO WE GOOD".

this is a very good post op

# ¿ Feb 24, 2019 07:06

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

abigserve posted:

my friend you never want to look into doing network or security monitoring

big lol that you decided to @mononcqc with this

# ¿ Feb 24, 2019 18:44

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

lancemantis posted:

anyways I think thats it for now

i appreciate you are posts

# ¿ Feb 27, 2019 20:03

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

my stepdads beer posted:

my prometheus keeps corrupting its data because it's on an nfs share. that's fine because i only use it for some pretty graphs sometimes. prom really suffers from not having good examples for what I think are common scenarios.

anyway we use cacti for all our network stuff because of inertia. prom+snmp-exporter+grafana was tedious as hell. nagios for our alerting and it's OK but kind of a pain re: config files.

for "tracing" we use senty for catching our dumb php apps various issues and fatals

nagios sucks cocks in hell

# ¿ Mar 2, 2019 07:10