Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

what's this got to do with cloudflare ray

Captain Foo
May 11, 2004

How can you make sure your monitoring is working tho

Captain Foo
May 11, 2004

uncurable mlady posted:

the SLA SLO SLI le lo

what the gently caress is a SLI, SLA, or a SLO? they're some hot-poo poo terms everyone and their brother loves throwing around because they also read the loving SRE book (or, more likely, read a blog by someone who read the SRE book) but don't really understand that well and probably haven't implemented at all for their service or product. im gonna explain why, unlike most bullshit that seeps out from the toxic hellstew of a walking antitrust lawsuit named google, they're actually useful concepts.

SLA me daddy
stands for 'service level agreement'. this is probably something you've heard of, or possibly had to implement yourself. all of the five nines jokes go here. SLAs are mostly something that you get in trouble for breaking. that trouble might be financial (cash penalties or discounts for breaking a SLA) or organizational (your team gets reorg'ed into deadend project dungeons because you keep loving poo poo up for everyone else). SLAs aren't really for you, person reading this - they're for people that consume your service.

a good SLA is going to be realistic, but it can be crafted in a way that gives you a lot of latitude on how to actually fix your poo poo or make it better. you can (and should) be pretty specific with SLAs - maybe there's critical consumers that have them but others don't, which gives you a knob to turn for performance tuning or request throttling. also it's important to be able to quantify 'legitimate' traffic or requests in your SLA. maybe a consumer fucks up and starts sending malformed queries to your service, which you handle and return an error for. should those count against overall 'uptime'? probably not, and they certainly shouldn't count against the overall performance of your service.

SLO down
a SLO is a 'service level objective'. if the SLA is what you're promising, the SLO is what you want. this might seem to be a subtle difference, but it's not in a lot of ways. think about it this way. you release a new service, cool! it's got some neat features, but it's not your main priority - and even if it is, you don't want to get paged in the middle of the night when something breaks. maybe after a while you'd like it to have high availability, but you didn't really put in the extra work at the beginning to handle redundancy, multi-az, poo poo like that. after all, isn't the point of all this agile poo poo to work iteratively? so you send out some emails and slack about your new API that fidgets foozles or quuz's quuxes and some people start using it, you start getting feedback, poo poo's going great. in your mind, you still have that goal of it being available during business hours, but it's actually much more reliable - it's up all the time.

but it starts to get popular. more people start relying on your service to quuz their quuxes. some dev in another team finds your service and decides to use it instead of writing their own, and now you're serving a lot more requests than you thought you'd be. poo poo goes wrong, things get caught crashlooping, and suddenly you're getting texts in the middle of the night because a public feature is down because your poo poo broke. bad scene.

how would an SLO have fixed this? well, they're published, for one. would other rando dev have started relying on your service if they knew you had a 48 hour turnaround on even looking at issues opened against it? two, they give you an 'error budget' (or 'downtime budget') that lets you gently caress poo poo up in order to make it better. SLOs are basically an internal contract that gives you leverage to actually run services with actual users (so that theoretically, you can improve on them by getting feedback) without making you rip your hair out and/or want to murder everyone you work with (for a certain specific subset of reasons you'd want to murder your coworkers).
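to put a number on the 'error budget' idea, here's a rough sketch of the arithmetic, not anything from the SRE book verbatim. the function names, the 30 day window, and the 99.9% target are all made up for illustration; the only real claim is that the budget is the fraction of the window your SLO lets you be down.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    window_days is an assumed rolling window, not a standard value.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after the downtime already spent (negative means blown)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # hypothetical service: 99.9% SLO, 12 minutes of downtime so far this window
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 12.0), 1))
```

if the remaining number goes negative, that's the signal to stop shipping risky changes and spend time on reliability instead.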

SLI but not the graphics card kind
so i've got my sla and my slo, what's a SLI? it's a 'service level indicator' and it's basically where monitoring comes into this mess. SLIs are the measurements you use to ensure you're hitting your SLO/SLA. this is probably a separate post on its own, but creating good SLIs can be difficult, and creating bad SLIs can be very easy. a good example is up in this post, when I was talking about SLAs - should I count 'bad requests' as part of my SLA? do I care that i'm successfully completing requests, or do I care that I can accept them at all? it's a lot more nuanced than simply "PINGDOM SAYS YOU'RE UP SO WE GOOD".
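the 'do bad requests count' question can be sketched as two versions of the same availability SLI. this is a minimal illustration under assumptions: the `Request` type, the use of HTTP status codes, and treating 4xx as "the consumer's fault" are all choices made up for the example, not something the post prescribes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Request:
    status: int  # HTTP status code the service returned (assumed protocol)


def availability_sli(requests: Iterable[Request],
                     count_client_errors: bool = False) -> Optional[float]:
    """Fraction of eligible requests that succeeded (status below 400).

    With count_client_errors=False, 4xx responses (e.g. malformed queries)
    are dropped from the denominator, so they don't count against you.
    With count_client_errors=True, they stay in and count as failures.
    Returns None when there are no eligible requests to measure.
    """
    reqs = list(requests)
    if count_client_errors:
        eligible = reqs
    else:
        eligible = [r for r in reqs if not 400 <= r.status < 500]
    if not eligible:
        return None
    good = sum(1 for r in eligible if r.status < 400)
    return good / len(eligible)


if __name__ == "__main__":
    # hypothetical traffic: 90 OK, 8 malformed queries, 2 server errors
    traffic = ([Request(200)] * 90) + ([Request(400)] * 8) + ([Request(500)] * 2)
    print(availability_sli(traffic))                            # 4xx excluded
    print(availability_sli(traffic, count_client_errors=True))  # 4xx as failures
```

same traffic, two different numbers, which is why the definition of 'legitimate' belongs in the SLA/SLO text and not just in somebody's dashboard query.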

this is a very good post op

Captain Foo
May 11, 2004

abigserve posted:

my friend you never want to look into doing network or security monitoring

big lol that you decided to @mononcqc with this

Captain Foo
May 11, 2004

lancemantis posted:

anyways I think thats it for now

i appreciate you are posts

Captain Foo
May 11, 2004

my stepdads beer posted:

my prometheus keeps corrupting its data because it's on an nfs share. that's fine because i only use it for some pretty graphs sometimes. prom really suffers from not having good examples for what I think are common scenarios.

anyway we use cacti for all our network stuff because of inertia. prom+snmp-exporter+grafana was tedious as hell. nagios for our alerting and it's OK but kind of a pain re: config files.

for "tracing" we use sentry for catching our dumb php apps' various issues and fatals

nagios sucks cocks in hell

Captain Foo
May 11, 2004

pram posted:

also the latest version of opsview is shiiiiiitttttt

we use observium, op

Captain Foo
May 11, 2004

Blinkz0rz posted:

yeah ama about maintaining an elk stack that processes a few tb of logs a day

it loving sucks

having done this with a much smaller stack, all i can say is :gonk:

Captain Foo
May 11, 2004

Blinkz0rz posted:

they had a bug for a long time where their unique identifier was the hostname

lol what the gently caress

Captain Foo
May 11, 2004

kitten emergency posted:

who up observing they apps

who needs they prometheussy metriced
