kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

pictured: docker run -p 9090:9090 prom/prometheus

sup fucks, monitoring poo poo sucks. that's why you can pay new relic or datadog like forty billion dollars a year to slow down all your poo poo and create pretty dashboards for all those tvs the ops people hung around the office. this thread is about monitoring and logging and tracing and poo poo.

a brief history of monitoring
originally no one really gave a poo poo about a lot of this, because you were running tomcat or iis or something: you'd create metrics or logs in your application and write them to whatever the system-provided sink was. if you were a real cheap bastard, you'd just dump some poo poo into a text file and grep it. this was fine because we hadn't yet created a massive ecosystem of 'distributed computing' companies that exist to tell your vp of eng how you're all incompetent hellfuckers and that they should pay a few hundred thousand bucks a year to use their special service mesh jerkoff mcgee middleware to do it right with docker and kubernetes. so we broke our applications apart into a million bifurcated services and pieces, and now we don't know what the gently caress anything is doing when poo poo breaks, so here comes your friendly open source community to build some more goddamned tools to dig your way out of this nightmare you're trapped in.

logs
logs are probably the thing folks are most familiar with, and maybe the only thing you'll get when your dumb fart app has a problem. even the most remedial plang toucher will have the brilliant insight to print something to the console when an exceptional case occurs. logs basically underpin the entire monitoring ecosystem in some way, because fundamentally they're how the people writing the software tell their future selves what's happening in the code. it's pretty common to write a lot of logging statements in your code, then use some sort of runtime flag to determine which ones get printed to the console or a file, so you aren't writing poo poo like credit card numbers to a plain text file in prod (lol). there are a lot of problems with logs though, especially in the exciting world of distributed computing. trying to search a bunch of log files or console output across a hundred docker containers sucks. trying to correlate logs that pass through multiple services sucks, especially since little things like "the time on all my computers is not perfectly synchronized, so the timestamps on the logs are off a bit" exist. moreover, a lot of individual log statements don't really mean much by themselves, but do in aggregate. so how do we aggregate the information contained in log files?
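that runtime-flag pattern is like three lines in most languages. here's a minimal python sketch (the "checkout" logger name, env var, and charge function are all made up for illustration):

```python
import logging
import os

# log level comes from a runtime flag (an env var here), so debug-level
# noise and anything sensitive stays out of prod output without code changes
logging.basicConfig(
    level=os.environ.get("LOG_LEVEL", "WARNING"),
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("checkout")

def charge(order_id):
    # only emitted when LOG_LEVEL=DEBUG
    log.debug("starting charge for order=%s", order_id)
    try:
        pass  # talk to the payment gateway here
    except Exception:
        # logs the message plus the full traceback
        log.exception("charge failed for order=%s", order_id)
```

run it with `LOG_LEVEL=DEBUG` in dev and the default in prod, and the same code gives you chatty or quiet output.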

metrics
software sucks and crashes and has errors and poo poo. sometimes that's because you forgot to unfuckulate a pointer or the internet happened and some frames dropped or whatever, but poo poo breaks. you probably logged an error when that happened. if you know what you're looking for, that log message can be important and useful, but a lot of the time these errors only really make sense in aggregate. there are also all sorts of implicit things happening when you run your software, like the amount of memory being consumed on a host or the CPU usage, whatever. in general, these sorts of measurements are called 'metrics' and there are two main types, counters and gauges. a counter is what it sounds like - the number starts at 0 and only goes up from there. this is something like "total number of requests since this application started". gauges can go either way (like ur mom lol) - they can increase or decrease. this is something like "number of concurrent requests right now". there's also poo poo like histograms, which let you put counts in buckets, and summaries, which are basically histograms that calculate a quantile over a sliding time window. metrics are cool - you can generate them by parsing your log files, emit counts/gauges directly from your application code to some sort of collector, use agent processes that collect metrics from the runtime and from the host, all sorts of wildass poo poo.
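to make the shapes concrete, here's a toy python sketch of those metric types - deliberately not a real client library (prometheus_client and friends add labels, exposition formats, thread safety, all that):

```python
class Counter:
    """monotonic: starts at 0, only ever goes up."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        if n < 0:
            raise ValueError("counters never go down")
        self.value += n

class Gauge:
    """can go either way: in-flight requests, memory in use, etc."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n
    def dec(self, n=1):
        self.value -= n

class Histogram:
    """counts observations into cumulative buckets by upper bound."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, float("inf"))):
        self.buckets = {b: 0 for b in buckets}
        self.total = 0.0
    def observe(self, v):
        self.total += v
        # an observation lands in every bucket whose bound it fits under
        for bound in self.buckets:
            if v <= bound:
                self.buckets[bound] += 1
```

the cumulative-bucket thing is how prometheus-style histograms work, and it's what lets the backend estimate quantiles later without shipping every raw observation.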

traces
metrics and logs both have one big problem - they're too general (metrics) or too specific (logs) to really tell you about specific stuff going on in your application. you can get an idea about the overall state of requests from your metrics, and you can try to identify highly specific failures by looking at logs, but a lot of the time you want to understand what's happening at a level somewhere between the two - for instance, if i want to look at a single request as it goes from the browser, through whatever ingress/load balancing bullshit i have, into a variety of backend services, all the way to the DB and back. traces - specifically, distributed traces - are the answer to this. traces are composed of 'spans', each of which is really just a record of how long it took to do a thing, plus a span context that has some identifiers, and a bunch of information you can shove in like tags and logs and crap. each service emits its spans to some sort of collector, which can then reassemble the disparate spans into a single trace based on the identifiers in the span context.
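that reassembly step is less magic than it sounds. a rough python sketch (field names are my own, real tracers carry more context than this):

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# a span is just "how long a thing took" plus the context ids a
# collector uses to stitch spans from different services back together
@dataclass
class Span:
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this operation
    parent_id: Optional[str]  # the span that caused this one; None at the root
    name: str
    duration_ms: float
    tags: dict = field(default_factory=dict)

def assemble(spans):
    """group disparate spans by trace_id, the way a collector backend does."""
    traces = defaultdict(list)
    for s in spans:
        traces[s.trace_id].append(s)
    return dict(traces)
```

the parent_id is what lets a UI draw the familiar waterfall: root span at the top, children nested under whatever span caused them.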

conclusion
cool now you know what some of this poo poo is, so go ahead and ask questions or shitpost or whatever about monitoring, tracing, etc. share your stories about horrible home-brewed monitoring setups, or how your company got rooked by new relic. i know there's a few of us here who work with this on the regular and would probably be happy to answer any questions about this poo poo. personally, i work on distributed tracing and contribute to a tracing framework so i can field questions on that.

thanks


kitten emergency

Butcher posted:

just wanted to say i appreciate the entry of unfuckulate into common parlance, thanks

yw

kitten emergency

DONT THREAD ON ME posted:

hey this is a good post.

so i'm pretty good at gathering metrics but i'm bad at analyzing them. i know how to use some of the analytical functions in graphite but many are voodoo and i frequently have a hard time interpreting supposedly useful graphs.

anyone have a good primer? something like this but maybe a little quicker and vetted for me.

the oreilly monitoring with graphite book is p dece from what i understand, but yeah, poo poo's tricky to understand without having some stats background.

kitten emergency
i saw a talk by one of the authors of nanolog (https://github.com/PlatformLab/NanoLog) which is insanely badass if you want some ridiculously fast logging.

https://www.usenix.org/sites/default/files/conference/protected-files/atc18_slides_yang.pdf

kitten emergency

Agile Vector posted:

q: is mercury peepin that action the user?

mercury is your boss, thirsty for SLI/SLOs

kitten emergency

Jonny 290 posted:

pingdom lookin' at the mrtg page and mrtg monitoring pingdom of course

:wom:

kitten emergency
the new hotness these days is SLO/SLI tho, i should write a post about that

kitten emergency

PCjr sidecar posted:

tracing seems to be something you write into your web app

anybody doing non-http tracing or collecting trace data from apps you don’t write yourself and doesn’t have native tracing support?

i wouldn't say it's just in your web app - our entire application is instrumented from the webapp all the way down to the db. that said, it's kind of a massive pain in the rear end right now to get trace data from resources you don't directly manage, because not everyone uses opentracing, and even if they did, wire formats are very tracer-dependent. the w3c is working on a tracecontext/tracedata specification that's intended to address this problem by standardizing headers and wire formats for context, so you could have a situation where you're using some sort of managed ingress proxy or w/e and it'd be able to create spans as part of a trace that started on a client, etc. could also see the same thing at a managed db, where the database service on the provider side is able to pick up incoming traces from the application and emit spans that you'd collect.
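for the curious, the standardized header in that w3c work is `traceparent`: version, a 32-hex-char trace id, a 16-hex-char parent span id, and 2 hex chars of flags. a rough sketch of producing and consuming it (ids taken from the spec's own examples; real parsers validate more than this):

```python
def format_traceparent(trace_id, span_id, sampled=True):
    # version "00", then trace id, parent span id, and flags (01 = sampled)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": flags == "01"}
```

the whole point is that any proxy, managed service, or tracer that understands this one header can join a trace it didn't start.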

are you using tracing now? something home-brewed, or opentracing/opencensus?

kitten emergency
distributed tracing owns bones.

kitten emergency
the SLA SLO SLI le lo

what the gently caress is a SLI, SLA, or a SLO? they're some hot-poo poo terms everyone and their brother loves throwing around because they read the loving SRE book (or, more likely, read a blog by someone who read the SRE book) but don't really understand it that well and probably haven't implemented any of it for their service or product. im gonna explain why, unlike most bullshit that seeps out from the toxic hellstew of a walking antitrust lawsuit named google, they're actually useful concepts.

SLA me daddy
stands for 'service level agreement'. this is probably something you've heard of, or possibly had to implement yourself. all of the five nines jokes go here. SLAs are mostly something that you get in trouble for breaking. that trouble might be financial (cash penalties or discounts for breaking a SLA) or organizational (your team gets reorg'ed into dead-end project dungeons because you keep loving poo poo up for everyone else). SLAs aren't really for you, person reading this - they're for people that consume your service.

a good SLA is going to be realistic, but it can be crafted in a way that gives you a lot of latitude on how to actually fix your poo poo or make it better. you can (and should) be pretty specific with SLAs - maybe there are critical consumers that have them while others don't, which gives you a knob to turn for performance tuning or request throttling. it's also important to be able to quantify 'legitimate' traffic or requests in your SLA. maybe a consumer fucks up and starts sending malformed queries to your service, which you handle and return an error for. should those count against overall 'uptime'? probably not, and they certainly shouldn't count against the overall performance of your service.

SLO down
a SLO is a 'service level objective'. if the SLA is what you're promising, the SLO is what you want. this might seem like a subtle difference, but it matters in a lot of ways. think about it this way: you release a new service, cool! it's got some neat features, but it's not your main priority - and even if it is, you don't want to get paged in the middle of the night when something breaks. maybe after a while you'd like it to have high availability, but you didn't really put in the extra work at the beginning to handle redundancy, multi-az, poo poo like that. after all, isn't the point of all this agile poo poo to work iteratively? so you send out some emails and slacks about your new API that fidgets foozles or quuz's quuxes and some people start using it, you start getting feedback, poo poo's going great. in your mind, you still have that goal of it being available during business hours, but in practice it's much more reliable - it's up all the time.

but it starts to get popular. more people start relying on your service to quuz their quuxes. some dev in another team finds your service and decides to use it instead of writing their own, and now you're serving a lot more requests than you thought you'd be. poo poo goes wrong, things get caught crashlooping, and suddenly you're getting texts in the middle of the night because a public feature is down because your poo poo broke. bad scene.

how would an SLO have fixed this? well, for one, they're published. would that rando dev have started relying on your service if they knew you had a 48 hour turnaround on even looking at issues opened against it? two, they give you an 'error budget' (or 'downtime budget') you can spend loving poo poo up in order to make it better. SLOs are basically an internal contract that gives you leverage to actually run services with actual users (so that, theoretically, you can improve them by getting feedback) without making you rip your hair out and/or want to murder everyone you work with (for a certain specific subset of reasons you'd want to murder your coworkers).
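the error budget math is dead simple, which is part of why it works as a conversation-ender. a sketch:

```python
def error_budget_minutes(slo, window_days=30):
    """minutes per window you're allowed to be broken and still hit the SLO."""
    return (1 - slo) * window_days * 24 * 60

# a 99.9% SLO over 30 days leaves about 43 minutes of downtime budget;
# burn through it and the usual deal is you stop shipping features
# and spend the time on reliability instead
```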

SLI but not the graphics card kind
so i've got my SLA and my SLO, what's a SLI? it's a 'service level indicator', and it's basically where monitoring comes into this mess. SLIs are the measurements you use to ensure you're hitting your SLO/SLA. this is probably a separate post on its own, but creating good SLIs can be difficult, and creating bad SLIs can be very easy. a good example is up in this post, when i was talking about SLAs - should i count 'bad requests' against my SLA? do i care that i'm successfully completing requests, or do i care that i can accept them at all? it's a lot more nuanced than simply "PINGDOM SAYS YOU'RE UP SO WE GOOD".
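as a sketch of that nuance (hypothetical function over a stream of http status codes): an availability SLI that doesn't punish you when callers send you garbage might look like this:

```python
def availability_sli(status_codes):
    """success ratio, not counting caller-fault 4xx as 'we were down'."""
    # malformed/bad requests (4xx) are the caller's problem, so they
    # don't count for or against us
    eligible = [s for s in status_codes if not (400 <= s < 500)]
    if not eligible:
        return 1.0  # no eligible traffic: nothing to hold against us
    good = sum(1 for s in eligible if s < 500)
    return good / len(eligible)
```

swap the filter and you get a very different SLI from the same data, which is exactly why "what counts as a good event" is the hard part, not the division.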

kitten emergency
influxdb more like refluxdb because it gives you heartburn trying to run it

kitten emergency

CRIP EATIN BREAD posted:

i fired it up and its been running since, seems braindead simple?

i actually dont know, i've never hosed with it. buddy of mine at work was griping about it being hard to use, maybe he's just bad

kitten emergency

lancemantis posted:

i could probably do some kind of experience based long-post on it because it sucked rear end

you should!

kitten emergency

lancemantis posted:

another fun anecdote: a kind of career recognition theft

so one particularly high-value operations-related large organizational effort decided to develop their own kind of monitoring front end; a super fancy set of dashboards and navigation tools to help provide visibility into all kinds of things about their sprawling mess of applications; the operations center was going to be expected to use this as their kind of source of record for application status in this particular area

they developed this whole thing with a team of their own that was easily 10 times or more larger than mine

and in the end it had no system to provide the data to back it -- it was basically just a fancy UI and I think a graph database to help organize the displayed information

they then of course expected us to magically integrate and provide the exact kind of data they wanted to back it, and used their much heavier organizational hitting power to bully us into it; it was a nightmare and I don't think it ever got finished and was probably a total waste of money

lol this owns in a really lovely way

kitten emergency

my bitter bi rival posted:

ive spent all morning loving with this crap. whats a better free alternative

better and free dont really go together in this world

kitten emergency
zookeeper is so loving cursed

kitten emergency

my stepdads beer posted:

the prom / grafana guys are making a log thing now

https://grafana.com/loki

no full text search though, also it only works with k8s atm

lol y tho

kitten emergency
that aws piece is some top notch concern trolling

kitten emergency
anyway here's some poo poo i've been working on behind the scenes for the past several months

https://twitter.com/opentracing/status/1111389502889574400?s=20

kitten emergency

CRIP EATIN BREAD posted:

opentracing is cool

thanks, I hope we don’t completely gently caress it up with the merger!!

kitten emergency

animist posted:

so does OpenTracing do everything prometheus does, or should i be running some combination of OpenTracing + logging + metric collection? like is there some tracer i can plug into OpenTracing to make it pretend to be prometheus, or do i need to do that separately

also, how long does e.g. Jaeger keep traces around? are they created on-demand, or built up from some sort of in-memory thing, or... whatever

basically i'm a little confused about the practical difference between logs vs metrics vs spans/tags vs traces

keep in mind i don't really know what i'm talking about

they’re different things, although we’d like to condense it to one library for instrumentation (this is the point of the OpenTracing/opencensus merger). i can post some more tomorrow about what it looks like today

kitten emergency

CRIP EATIN BREAD posted:

we use the opentracing api for everything and use jaeger as our backend which feeds it into elasticsearch

its cool when someone used a constant sampler (samples every trace) and we were producing 10gb of traces each day in our test environment.

pro-tip: use a probabilistic sampler or something that says "sample 5% of traces" instead.

alternately consider a fine saas tracing product that performs tail sampling
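for the curious, a head-based probabilistic sampler is only a few lines. hashing the trace id (instead of rolling a die per span) means every service reaches the same decision for a given trace, so you keep whole traces instead of fragments. a sketch:

```python
import hashlib

def sample(trace_id, rate=0.05):
    """deterministic head sampling: the same trace_id always gets the
    same keep/drop decision, so services agree without coordinating."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    # map the hash onto [0, 10000) and keep the bottom rate-sized slice
    return (h % 10_000) < rate * 10_000
```

tail sampling (deciding after the whole trace arrives, e.g. "keep everything with an error") needs a buffering collector, which is why it tends to live in saas products.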

kitten emergency
what do dogs know about cost accounting anyway

kitten emergency
my understanding is that they basically have just a massive loving Kafka cluster and boy I’d love to hear some stories from their SREs

kitten emergency

Pardot posted:

i want to send structured logs/events at some service that I won't have to janitor. the events will mostly be an entity with one or more uuids, and then some info like state changes, maybe some timestamps or durations, idk. Nothing automated needs to consume this, just sometimes a person will do a search on the uuids to figure out what happened. I only need like 7 maybe 14 days of retention and the volume of incoming events won't be so high. Probably like 1gb/day, for sure less than 100. Only like 5 people will need access.

what service should i use?

honestly? honeycombs free tier might work out well for you

edit - assuming you want to share a password, I think they have a user limit on the free account

kitten emergency
who up observing they apps


kitten emergency

Hed posted:

is there a good explainer to opentelemetry or is it one of those "if you have to ask it doesn't solve one of your usecases".

playing around with building a new distributed application and trying to figure out if I should read more since we've seemed to change how we yeet info for monitoring, logging, and coordination

well you’re in luck

I wrote a book about it and it’ll be out next month! https://learningopentelemetry.com

but im also happy to just explain whatever. the basic concept of otel is that all of your telemetry signals should be correlated through hard context (trace ids, etc) as well as soft context (standardized metadata)
