kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

pictured: docker run -p 9090:9090 prom/prometheus

sup fucks, monitoring poo poo sucks. that's why you can pay new relic or datadog like forty billion dollars a year to slow down all your poo poo and create pretty dashboards for all those tvs the ops people hung around the office. this thread is about monitoring and logging and tracing and poo poo.

a brief history of monitoring
originally no one really gave a poo poo about a lot of this because you were using tomcat or iis or something, and the metrics or logs you created in your application got written to whatever the system-provided sink for metrics and logs was. if you were a real cheap bastard, you'd just dump some poo poo into a text file and grep it. this was fine because we hadn't yet created a massive ecosystem of 'distributed computing' companies that exist to tell your vp of eng how you're all incompetent hellfuckers and they should pay a few hundred thousand bucks a year to use their special service mesh jerkoff mcgee middleware to do it right with docker and kubernetes. so we decided to break our applications apart into a million bifurcated services and pieces and now we don't know what the gently caress anything is doing when poo poo breaks, so here comes your friendly open source community to build some more goddamned tools to dig your way out of this nightmare you're trapped in.

logs
logs are probably the thing folks are the most familiar with, and maybe the only thing you'll get when your dumb fart app has a problem. even the most remedial plang toucher will have the brilliant insight to print something to the console when an exceptional case occurs. logs basically underpin the entire monitoring ecosystem in some way because fundamentally they're a way for the people writing the software to tell their future selves about what's happening in the code. it's pretty common to see people write a lot of logging statements in their code, then use some sort of runtime flag to determine which ones get printed to the console or a file, so you aren't writing poo poo like credit card numbers to a plain text file in prod (lol). there's a lot of problems with logs though, especially in the exciting world of distributed computing. trying to search a bunch of log files or console output across a hundred docker containers sucks. trying to correlate logs that pass through multiple services sucks, especially since little things like "the time on all my computers is not perfectly synchronized and so the timestamps on the logs are off a bit" exist. moreover, a lot of individual log statements don't really mean much by themselves, but do in aggregate. so how do we aggregate the information contained in log files?
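
a minimal sketch of the "runtime flag decides what gets printed" idea, using python's stdlib logging. LOG_LEVEL, the "checkout" logger name, and the charge() function are all made up for illustration, not from any real app:

```python
import logging
import os

# LOG_LEVEL is a hypothetical environment variable name (an assumption, not a standard)
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}
level = LEVELS.get(os.environ.get("LOG_LEVEL", "WARNING").upper(), logging.WARNING)
logging.basicConfig(level=level)

log = logging.getLogger("checkout")  # made-up service name


def charge(card_number: str, amount_cents: int) -> bool:
    # debug detail is only emitted when LOG_LEVEL=DEBUG, and even then
    # only the last four digits are logged, never the full card number
    log.debug("charging card ending %s for %d cents", card_number[-4:], amount_cents)
    if amount_cents <= 0:
        log.error("refusing charge with non-positive amount: %d", amount_cents)
        return False
    return True
```

in prod you'd leave LOG_LEVEL unset (or at WARNING) so the debug lines never hit disk, and flip it to DEBUG when you're chasing something.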

metrics
software sucks and crashes and has errors and poo poo. sometimes that's because you forgot to unfuckulate a pointer or the internet happened and some frames dropped or whatever, but poo poo breaks. you probably logged an error when that happened. if you know what you're looking for, that log message can be important and useful, but a lot of times these errors only really make sense in aggregate. there's also all sorts of implicit things happening when you run your software, like the amount of memory being consumed on a host or the CPU usage, whatever. in general, these sort of measurements are called 'metrics' and there's 2 main types, counters and gauges. a counter is what it sounds like - number starts at 0, goes up from there. this is something like "total amount of requests since this application started". gauges can go either way (like ur mom lol), they can increase or decrease. this is something like "number of concurrent requests right now". there's also poo poo like histograms which let you put counts in buckets and summaries which are basically histograms that calculate a quantile over a sliding time window. metrics are cool, you can generate them by parsing your log files or emitting counts/gauges directly from your application code to some sort of collector, you can use agent processes that collect application metrics from the runtime and also from the host, all sorts of wildass poo poo.
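
here's a rough sketch of the three shapes described above, in plain python. the class names and bucket layout are assumptions for illustration and not how any particular metrics library does it (real ones add labels, thread safety, and usually cumulative buckets):

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """starts at 0 and only goes up, e.g. total requests since start."""
    value: int = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


@dataclass
class Gauge:
    """can go either way, e.g. concurrent requests right now."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


@dataclass
class Histogram:
    """counts observations per bucket; bounds are sorted inclusive upper limits."""
    bounds: list
    counts: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # one slot per bound, plus a final overflow slot for anything larger
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
```

so a request handler would inc() the counter once per request, inc()/dec() the gauge on entry and exit, and observe() the latency into the histogram.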

traces
metrics and logs both have one big problem - they're either too general (metrics) or too specific (logs) to really get an idea about specific stuff going on in your application. you can get an idea about the overall state of requests with your metrics, and you can try to identify highly specific failures by looking at logs, but a lot of times you want to understand what's happening at a level somewhere between the two - for instance, if i want to look at a single request as it goes from the browser, through whatever ingress/load balancing bullshit i have, into a variety of backend services all the way to the DB and back. traces - specifically, distributed traces - are the answer to this. traces are made up of 'spans'. a span is really just a record of how long it took to do a thing, plus a span context that has some identifiers, and a bunch of information you can shove in like tags and logs and crap. each service emits its spans to some sort of collector which can then reassemble the disparate spans into a single trace based on the identifiers in the span context.
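
a toy sketch of spans and the collector-side reassembly described above. the field names (trace_id, span_id, parent_id) and both helper functions are assumptions picked for illustration; real tracers have their own span context formats:

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float              # seconds, on whatever clock the service has
    end: float
    tags: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end - self.start


def new_span(name, start, end, parent=None, **tags) -> Span:
    # a child inherits the trace id from its parent; a root span mints a new one
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, name, start, end, tags)


def assemble(spans) -> dict:
    # what a collector does, roughly: group by trace id, order by start time
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    for group in traces.values():
        group.sort(key=lambda s: s.start)
    return dict(traces)
```

spans can show up at the collector in any order and from any service; the identifiers are what let it stitch them back together.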

conclusion
cool now you know what some of this poo poo is, so go ahead and ask questions or shitpost or whatever about monitoring, tracing, etc. share your stories about horrible home-brewed monitoring setups, or how your company got rooked by new relic. i know there's a few of us here who work with this on the regular and would probably be happy to answer any questions about this poo poo. personally, i work on distributed tracing and contribute to a tracing framework so i can field questions on that.

thanks

Sham bam bamina!
Nov 6, 2012

ƨtupid cat
pity :gas:

Butcher
Nov 29, 2004
TODO: Place Title Here
just wanted to say i appreciate the entry of unfuckulate into common parlance, thanks

Tankakern
Jul 25, 2007

preemptive 5

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

Butcher posted:

just wanted to say i appreciate the entry of unfuckulate into common parlance, thanks

yw

DONT THREAD ON ME
Oct 1, 2002

by Nyc_Tattoo
Floss Finder
hey this is a good post.

so i'm pretty good at gathering metrics but i'm bad at analyzing them. i know how to use some of the analytical functions in graphite but many are voodoo and i frequently have a hard time interpreting supposedly useful graphs.

anyone have a good primer? something like this but maybe a little quicker and vetted for me.

DONT THREAD ON ME fucked around with this message at 18:33 on Feb 22, 2019

Silver Alicorn
Mar 30, 2008

𝓪 𝓻𝓮𝓭 𝓹𝓪𝓷𝓭𝓪 𝓲𝓼 𝓪 𝓬𝓾𝓻𝓲𝓸𝓾𝓼 𝓼𝓸𝓻𝓽 𝓸𝓯 𝓬𝓻𝓮𝓪𝓽𝓾𝓻𝓮
https://gfycat.com/sillydesertedborzoi-aliens-prometheus-engineer-god

Silver Alicorn
Mar 30, 2008

𝓪 𝓻𝓮𝓭 𝓹𝓪𝓷𝓭𝓪 𝓲𝓼 𝓪 𝓬𝓾𝓻𝓲𝓸𝓾𝓼 𝓼𝓸𝓻𝓽 𝓸𝓯 𝓬𝓻𝓮𝓪𝓽𝓾𝓻𝓮
the answer to “why did the robot do all that stuff? is he a secret rear end in a top hat?” is yes, because his father never truly loved him and made that very apparent despite his obviously surpassing humanity

Silver Alicorn
Mar 30, 2008

𝓪 𝓻𝓮𝓭 𝓹𝓪𝓷𝓭𝓪 𝓲𝓼 𝓪 𝓬𝓾𝓻𝓲𝓸𝓾𝓼 𝓼𝓸𝓻𝓽 𝓸𝓯 𝓬𝓻𝓮𝓪𝓽𝓾𝓻𝓮
I liked Prometheus

Moo Cowabunga
Jun 15, 2009

[Office Worker.




monitor deez nutz

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

what's this got to do with cloudflare ray

graph
Nov 22, 2006

aaag peanuts
i like prtg better than solarwinds, OP

Linux Pirate
Apr 21, 2012


Silver Alicorn posted:

the answer to “why did the robot do all that stuff? is he a secret rear end in a top hat?” is yes, because his father never truly loved him and made that very apparent despite his obviously surpassing humanity

also alien shares the same universe as blade runner. the robots are just not happy.

MononcQc
May 29, 2007

I have strong opinions on logging and monitoring and metrics.

Here's one: if your log line is a string you hosed up. Get some structured logging going.
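
A minimal sketch of one way to get there with only the Python standard library. `JsonFormatter`, `make_logger`, and the `fields` key passed through `extra` are names invented for this example, not part of any logging library:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object instead of a free-form string."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # "fields" is a convention assumed here: callers pass
        # extra={"fields": {...}} and logging copies it onto the record
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str, stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `make_logger("api").info("request_failed", extra={"fields": {"status": 502, "upstream": "billing"}})` then emits a line you can filter by key instead of regexing a sentence.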

Zlodo
Nov 25, 2006
if your log is a string you have digestion problems

Tankakern
Jul 25, 2007

where does journald come into all this

Progressive JPEG
Feb 19, 2003

avoid counters imo

Feisty-Cadaver
Jun 1, 2000
The worms crawl in,
The worms crawl out.
if you don’t have metrics you don’t have alerts which means you don’t get paged if poo poo goes sideways so do that imo

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

DONT THREAD ON ME posted:

hey this is a good post.

so i'm pretty good at gathering metrics but i'm bad at analyzing them. i know how to use some of the analytical functions in graphite but many are voodoo and i frequently have a hard time interpreting supposedly useful graphs.

anyone have a good primer? something like this but maybe a little quicker and vetted for me.

the oreilly monitoring with graphite book is p dece from what i understand, but yeah, poo poo's tricky to understand without having some stats background.

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
i saw a talk by one of the authors of nanolog (https://github.com/PlatformLab/NanoLog) which is insanely badass if you want some ridiculously fast logging.

https://www.usenix.org/sites/default/files/conference/protected-files/atc18_slides_yang.pdf

Agile Vector
May 21, 2007

scrum bored



uncurable mlady posted:


pictured: docker run -p 9090:9090 prom/prometheus

q: is mercury peepin that action the user?

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

Agile Vector posted:

q: is mercury peepin that action the user?

mercury is your boss, thirsty for SLI/SLOs

post hole digger
Mar 21, 2011

I use Graylog. pretty good.

Salt Fish
Sep 11, 2003

Cybernetic Crumb
Can you guys help me? I can see in new relic that my cluster failed 3 ping checks in december, I need an RCA for each of them. I'm marking this request the highest urgency thank you.

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

How can you make sure your monitoring is working tho

Jonny 290
May 5, 2005



[ASK] me about OS/2 Warp

Captain Foo posted:

How can you make sure your monitoring is working tho

pingdom lookin' at the mrtg page and mrtg monitoring pingdom of course

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

Jonny 290 posted:

pingdom lookin' at the mrtg page and mrtg monitoring pingdom of course

:wom:

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
the new hotness these days is SLO/SLI tho, i should write a post about that

r u ready to WALK
Sep 29, 2001

https://www.cacti.net is extremely underrated for generic snmp poo poo

the server i set up 10 years ago at work still works pretty much maintenance free

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

i really enjoyed when honeycomb did a demo with their tool on the real world example of a service outage on their platform to discover that one of their servers ran out of disk space, something that a loving nagios check would have picked up 20 years ago

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

prometheus is pretty cool but shoehorning everything into a pull model annoys me

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

r u ready to WALK posted:

https://www.cacti.net is extremely underrated for generic snmp poo poo

the server i set up 10 years ago at work still works pretty much maintenance free

it’s ok if what you need to do is turn snmp (or snmplikes) into browser-viewable rrds on one host

also make sure you stay on top of the security updates or keep it inaccessible from the web

in a well actually
Jan 26, 2011

dude, you gotta end it on the rhyme

tracing seems to be something you write into your web app

anybody doing non-http tracing or collecting trace data from apps you don’t write yourself that don’t have native tracing support?

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

PCjr sidecar posted:

tracing seems to be something you write into your web app

anybody doing non-http tracing or collecting trace data from apps you don’t write yourself that don’t have native tracing support?

i wouldn't say it's just in your web app - our entire application is instrumented from webapp all the way down to the db. that said, it's kind of a massive pain in the rear end right now to get trace data from resources you don't directly manage because not everyone uses opentracing and even if they did, wire formats are very tracer dependent. that said, w3c is working on a tracecontext/tracedata specification that's intended to address this problem by standardizing headers and wire formats for context so you could have a situation where you're using some sort of managed ingress proxy or w/e and it'd be able to create spans as part of a trace that started on a client, etc. could also see the same thing at a managed db where the database service on the provider side is able to pick up traces incoming from the application and emit spans that you'd collect.

are you using tracing now? something home-brewed, or opentracing/opencensus?

DONT THREAD ON ME
Oct 1, 2002

by Nyc_Tattoo
Floss Finder
distributed tracing blew my mind kinda

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
distributed tracing owns bones.

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
the SLA SLO SLI le lo

what the gently caress is a SLI, SLA, or a SLO? they're some hot-poo poo terms everyone and their brother loves throwing around because they also read the loving SRE book (or, more likely, read a blog by someone who read the SRE book), but most of them don't really understand the terms that well and probably haven't implemented them at all for their service or product. im gonna explain why, unlike most bullshit that seeps out from the toxic hellstew of a walking antitrust lawsuit named google, they're actually useful concepts.

SLA me daddy
stands for 'service level agreement'. this is probably something you've heard of, or possibly had to implement yourself. all of the five nines jokes go here. SLAs are mostly something that you get in trouble for breaking. that trouble might be financial (cash penalties or discounts for breaking a SLA) or organizational (your team gets reorg'ed into deadend project dungeons because you keep loving poo poo up for everyone else). SLAs aren't really for you, person reading this - they're for people that consume your service.

a good SLA is going to be realistic, but it can be crafted in a way that gives you a lot of latitude on how to actually fix your poo poo or make it better. you can (and should) be pretty specific with SLAs - maybe there's critical consumers that have them, but others don't, this gives you a knob to turn for performance tuning or request throttling. also it's important to be able to quantify 'legitimate' traffic or requests in your SLA. maybe a consumer fucks up and starts sending malformed queries to your service, which you handle and return an error for. should those count against overall 'uptime'? probably not, and they certainly shouldn't count against the overall performance of your service.

SLO down
a SLO is a 'service level objective'. if the SLA is what you're promising, the SLO is what you want. this might seem to be a subtle difference, but it's not in a lot of ways. think about it this way. you release a new service, cool! it's got some neat features, but it's not your main priority - and even if it is, you don't want to get paged in the middle of the night when something breaks. maybe after a while you'd like it to have high availability, but you didn't really put in the extra work at the beginning to handle redundancy, multi-az, poo poo like that. after all, isn't the point of all this agile poo poo to work iteratively? so you send out some emails and slack about your new API that fidgets foozles or quuz's quuxes and some people start using it, you start getting feedback, poo poo's going great. in your mind, you still have that goal of it being available during business hours, but it's actually much more reliable - it's up all the time.

but it starts to get popular. more people start relying on your service to quuz their quuxes. some dev in another team finds your service and decides to use it instead of writing their own, and now you're serving a lot more requests than you thought you'd be. poo poo goes wrong, things get caught crashlooping, and suddenly you're getting texts in the middle of the night because a public feature is down because your poo poo broke. bad scene.

how would an SLO have fixed this? well, they're published, for one. would other rando dev have started relying on your service if they knew you had a 48 hour turnaround on even looking at issues opened against it? two, they give you an 'error budget' (or 'downtime budget'): room to gently caress poo poo up in the course of making it better. SLOs are basically an internal contract that gives you leverage to actually run services with actual users (so that theoretically, you can improve on them by getting feedback) without making you rip your hair out and/or want to murder everyone you work with (for a certain specific subset of reasons you'd want to murder your coworkers).
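
the error budget is just arithmetic on the SLO, so here's a quick sketch. the 30-day window and the function names are assumptions for the example, pick whatever window your SLO is actually written against:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """minutes of allowed unavailability for an availability SLO like 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """negative means you blew the budget and should stop shipping risky stuff."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

a 99.9% SLO over 30 days works out to about 43.2 minutes of budget. a 99% SLO gives you 432. that difference is the knob: the looser the objective, the more room you have to break things on purpose.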

SLI but not the graphics card kind
so i've got my sla and my slo, what's a SLI? it's a 'service level indicator' and it's basically where monitoring comes into this mess. SLIs are the measurements you use to ensure you're hitting your SLO/SLA. this is probably a separate post on its own, but creating good SLIs can be difficult, and creating bad SLIs can be very easy. a good example is up in this post, when I was talking about SLAs - should I count 'bad requests' as part of my SLA? do I care that i'm successfully completing requests, or do I care that I can accept them at all? it's a lot more nuanced than simply "PINGDOM SAYS YOU'RE UP SO WE GOOD".
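
to make the 'bad requests' question concrete, here's a sketch of one possible availability SLI. the Request shape and the rule "4xx is the caller's fault, 5xx is yours" are assumptions for illustration; your own definition of a legit request will differ:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    status: int        # http status code returned to the caller
    latency_ms: float


def availability_sli(requests, count_client_errors: bool = False) -> Optional[float]:
    """fraction of eligible requests that succeeded, or None if nothing was eligible."""
    if count_client_errors:
        eligible = list(requests)
    else:
        # malformed queries from a consumer (4xx) don't count for or against you
        eligible = [r for r in requests if not 400 <= r.status < 500]
    if not eligible:
        return None
    good = sum(1 for r in eligible if r.status < 400)
    return good / len(eligible)
```

same traffic, two different numbers depending on which definition you pick, which is why the definition belongs in the SLO and not in somebody's head.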

psiox
Oct 15, 2001

Babylon 5 Street Team
articles like this are usually too fluffy but i got a lot out of it re: distributed tracing

https://nickcraver.com/blog/2018/11/29/stack-overflow-how-we-do-monitoring/

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

uncurable mlady posted:

the SLA SLO SLI le lo

what the gently caress is a SLI, SLA, or a SLO? they're some hot-poo poo terms everyone and their brother loves throwing around because they also read the loving SRE book (or, more likely, read a blog by someone who read the SRE book), but most of them don't really understand the terms that well and probably haven't implemented them at all for their service or product. im gonna explain why, unlike most bullshit that seeps out from the toxic hellstew of a walking antitrust lawsuit named google, they're actually useful concepts.

SLA me daddy
stands for 'service level agreement'. this is probably something you've heard of, or possibly had to implement yourself. all of the five nines jokes go here. SLAs are mostly something that you get in trouble for breaking. that trouble might be financial (cash penalties or discounts for breaking a SLA) or organizational (your team gets reorg'ed into deadend project dungeons because you keep loving poo poo up for everyone else). SLAs aren't really for you, person reading this - they're for people that consume your service.

a good SLA is going to be realistic, but it can be crafted in a way that gives you a lot of latitude on how to actually fix your poo poo or make it better. you can (and should) be pretty specific with SLAs - maybe there's critical consumers that have them, but others don't, this gives you a knob to turn for performance tuning or request throttling. also it's important to be able to quantify 'legitimate' traffic or requests in your SLA. maybe a consumer fucks up and starts sending malformed queries to your service, which you handle and return an error for. should those count against overall 'uptime'? probably not, and they certainly shouldn't count against the overall performance of your service.

SLO down
a SLO is a 'service level objective'. if the SLA is what you're promising, the SLO is what you want. this might seem to be a subtle difference, but it's not in a lot of ways. think about it this way. you release a new service, cool! it's got some neat features, but it's not your main priority - and even if it is, you don't want to get paged in the middle of the night when something breaks. maybe after a while you'd like it to have high availability, but you didn't really put in the extra work at the beginning to handle redundancy, multi-az, poo poo like that. after all, isn't the point of all this agile poo poo to work iteratively? so you send out some emails and slack about your new API that fidgets foozles or quuz's quuxes and some people start using it, you start getting feedback, poo poo's going great. in your mind, you still have that goal of it being available during business hours, but it's actually much more reliable - it's up all the time.

but it starts to get popular. more people start relying on your service to quuz their quuxes. some dev in another team finds your service and decides to use it instead of writing their own, and now you're serving a lot more requests than you thought you'd be. poo poo goes wrong, things get caught crashlooping, and suddenly you're getting texts in the middle of the night because a public feature is down because your poo poo broke. bad scene.

how would an SLO have fixed this? well, they're published, for one. would other rando dev have started relying on your service if they knew you had a 48 hour turnaround on even looking at issues opened against it? two, they give you an 'error budget' (or 'downtime budget'): room to gently caress poo poo up in the course of making it better. SLOs are basically an internal contract that gives you leverage to actually run services with actual users (so that theoretically, you can improve on them by getting feedback) without making you rip your hair out and/or want to murder everyone you work with (for a certain specific subset of reasons you'd want to murder your coworkers).

SLI but not the graphics card kind
so i've got my sla and my slo, what's a SLI? it's a 'service level indicator' and it's basically where monitoring comes into this mess. SLIs are the measurements you use to ensure you're hitting your SLO/SLA. this is probably a separate post on its own, but creating good SLIs can be difficult, and creating bad SLIs can be very easy. a good example is up in this post, when I was talking about SLAs - should I count 'bad requests' as part of my SLA? do I care that i'm successfully completing requests, or do I care that I can accept them at all? it's a lot more nuanced than simply "PINGDOM SAYS YOU'RE UP SO WE GOOD".

this is a very good post op

abigserve
Sep 13, 2009

this is a better avatar than what I had before

MononcQc posted:

I have strong opinions on logging and monitoring and metrics.

Here's one: if your log line is a string you hosed up. Get some structured logging going.

my friend you never want to look into doing network or security monitoring
