|
pictured: docker run -p 9090:9090 prom/prometheus

sup fucks, monitoring poo poo sucks. that's why you can pay new relic or datadog like forty billion dollars a year to slow down all your poo poo and create pretty dashboards for all those tvs the ops people hung around the office. this thread is about monitoring and logging and tracing and poo poo.

a brief history of monitoring

originally no one really gave a poo poo about a lot of this because you were using tomcat or iis or something, so you'd create metrics or logs in your application that would write to whatever the system-provided sink for metrics and logs was. if you were a real cheap bastard, you'd just dump some poo poo into a text file and grep it. this was fine because we hadn't created a massive ecosystem of 'distributed computing' companies that exist to tell your vp of eng how you're all incompetent hellfuckers and they should pay a few hundred thousand bucks a year to use their special service mesh jerkoff mcgee middleware to do it right with docker and kubernetes. so we decided to break our applications apart into a million bifurcated services and pieces, and now we don't know what the gently caress anything is doing when poo poo breaks, so here comes your friendly open source community to build some more goddamned tools to dig your way out of this nightmare you're trapped in.

logs

logs are probably the thing folks are the most familiar with, and maybe the only thing you'll get when your dumb fart app has a problem. even the most remedial plang toucher will have the brilliant insight to print something to the console when an exceptional case occurs. logs basically underpin the entire monitoring ecosystem in some way, because fundamentally they're a way for the people writing the software to tell their future selves what's happening in the code.
it's pretty common to see people write a lot of logging statements in their code, then use some sort of runtime flag to determine which ones get printed to the console or a file, so you aren't writing poo poo like credit card numbers to a plain text file in prod (lol).

there are a lot of problems with logs though, especially in the exciting world of distributed computing. trying to search a bunch of log files or console output across a hundred docker containers sucks. trying to correlate logs that pass through multiple services sucks, especially since little things like "the time on all my computers is not perfectly synchronized, so the timestamps on the logs are off a bit" exist. moreover, a lot of individual log statements don't really mean much by themselves, but do in aggregate. so how do we aggregate the information contained in log files?

metrics

software sucks and crashes and has errors and poo poo. sometimes that's because you forgot to unfuckulate a pointer or the internet happened and some frames dropped or whatever, but poo poo breaks. you probably logged an error when that happened. if you know what you're looking for, that log message can be important and useful, but a lot of times these errors only really make sense in aggregate. there's also all sorts of implicit things happening when you run your software, like the amount of memory being consumed on a host or the CPU usage, whatever. in general, these sorts of measurements are called 'metrics', and there are two main types: counters and gauges. a counter is what it sounds like - a number that starts at 0 and goes up from there. this is something like "total number of requests since this application started". gauges can go either way (like ur mom lol) - they can increase or decrease. this is something like "number of concurrent requests right now".
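a toy sketch of the two in python, with everything hand-rolled - real metrics libraries do all this for you, this is just to show the semantics:

```python
import threading

class Counter:
    """monotonic: starts at 0, only ever goes up ("total requests since start")."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, n=1):
        if n < 0:
            raise ValueError("counters never go down")
        with self._lock:
            self._value += n

    @property
    def value(self):
        return self._value


class Gauge(Counter):
    """like a counter, but it can also go down ("concurrent requests right now")."""
    def dec(self, n=1):
        with self._lock:
            self._value -= n


requests_total = Counter()
in_flight = Gauge()

def handle_request():
    requests_total.inc()
    in_flight.inc()
    try:
        pass  # ... actually serve the request ...
    finally:
        in_flight.dec()  # always comes back down, even if the request blew up

for _ in range(3):
    handle_request()
```

after those three requests the counter sits at 3 forever, while the gauge is back at 0 - that difference is the whole distinction.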
there's also poo poo like histograms, which let you put counts in buckets, and summaries, which are basically histograms that calculate a quantile over a sliding time window. metrics are cool - you can generate them by parsing your log files, or emit counts/gauges directly from your application code to some sort of collector, or use agent processes that collect application metrics from the runtime and also from the host, all sorts of wildass poo poo.

traces

metrics and logs both have one big problem - they're too general (metrics) or too specific (logs) to really get an idea about specific stuff going on in your application. you can get an idea about the overall state of requests with your metrics, and you can try to identify highly specific failures by looking at logs, but a lot of times you want to understand what's happening at a level somewhere between the two - for instance, if i want to look at a single request as it goes from the browser, through whatever ingress/load balancing bullshit i have, into a variety of backend services, all the way to the DB and back. traces - specifically, distributed traces - are the answer to this. traces are composed of 'spans', which are really just records of how long it took to do a thing, plus a span context that has some identifiers, and a bunch of information you can shove in like tags and logs and crap. each service emits its spans to some sort of collector, which can then reassemble the disparate spans into a single trace based on the identifiers in the span context.

conclusion

cool, now you know what some of this poo poo is, so go ahead and ask questions or shitpost or whatever about monitoring, tracing, etc. share your stories about horrible home-brewed monitoring setups, or how your company got rooked by new relic. i know there's a few of us here who work with this on the regular and would probably be happy to answer any questions about this poo poo.
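to make the traces part concrete, here's a toy version in plain python - every name here is invented, and a real tracer does way more (context propagation, sampling, etc), but the shape is the same: spans carry identifiers, and a collector groups them back into traces:

```python
import time
import uuid
from collections import defaultdict

def start_span(trace_id, parent_id, name):
    # the span context is just the identifiers that travel with the request
    return {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
        "start": time.time(),
        "end": None,
        "tags": {},
    }

def finish_span(span, collector):
    span["end"] = time.time()
    collector.append(span)  # each service ships its finished spans somewhere central

collector = []
trace_id = uuid.uuid4().hex

# pretend three services touched one request: browser-facing app -> backend -> db
root = start_span(trace_id, None, "GET /checkout")
backend = start_span(trace_id, root["span_id"], "inventory.reserve")
db = start_span(trace_id, backend["span_id"], "SELECT stock")
for span in (db, backend, root):
    finish_span(span, collector)

# the collector reassembles disparate spans into traces keyed on trace_id
traces = defaultdict(list)
for span in collector:
    traces[span["trace_id"]].append(span)
```

the important bit: no single service ever sees the whole trace - the collector is the only thing that does.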
personally, i work on distributed tracing and contribute to a tracing framework so i can field questions on that. thanks
|
# Feb 22, 2019 18:00
|
|
# May 14, 2024 00:48
|
Butcher posted: just wanted to say i appreciate the entry of unfuckulate into common parlance, thanks

yw
|
# Feb 22, 2019 18:24
|
DONT THREAD ON ME posted: hey this is a good post.

the oreilly monitoring with graphite book is p dece from what i understand, but yeah, poo poo's tricky to understand without having some stats background.
|
# Feb 23, 2019 16:06
|
i saw a talk by one of the authors of nanolog (https://github.com/PlatformLab/NanoLog) which is insanely badass if you want some ridiculously fast logging. https://www.usenix.org/sites/default/files/conference/protected-files/atc18_slides_yang.pdf
|
# Feb 23, 2019 16:09
|
Agile Vector posted: q: is mercury peepin that action the user?

mercury is your boss, thirsty for SLI/SLOs
|
# Feb 23, 2019 16:36
|
Jonny 290 posted: pingdom lookin' at the mrtg page and mrtg monitoring pingdom

of course
|
# Feb 23, 2019 17:12
|
the new hotness these days is SLO/SLI tho, i should write a post about that
|
# Feb 23, 2019 17:14
|
PCjr sidecar posted: tracing seems to be something you write into your web app

i wouldn't say it's just in your web app - our entire application is instrumented from the webapp all the way down to the db. that said, it's kind of a massive pain in the rear end right now to get trace data from resources you don't directly manage, because not everyone uses opentracing, and even if they did, wire formats are very tracer-dependent.

that said, w3c is working on a tracecontext/tracedata specification that's intended to address this problem by standardizing headers and wire formats for context. so you could have a situation where you're using some sort of managed ingress proxy or w/e and it'd be able to create spans as part of a trace that started on a client, etc. you could also see the same thing at a managed db, where the database service on the provider side is able to pick up traces incoming from the application and emit spans that you'd collect.

are you using tracing now? something home-brewed, or opentracing/opencensus?
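for the curious, the proposed traceparent header is simple enough to sketch in a few lines - this is based on my reading of the draft spec, so treat the details as provisional (and a real implementation also rejects things like all-zero ids, which i'm skipping here):

```python
import re
import secrets

def make_traceparent(trace_id=None, parent_id=None, sampled=True):
    # version 00 layout: <version>-<16-byte trace-id>-<8-byte parent-id>-<flags>,
    # all lowercase hex, dash-separated
    trace_id = trace_id or secrets.token_hex(16)
    parent_id = parent_id or secrets.token_hex(8)
    return "00-{}-{}-{}".format(trace_id, parent_id, "01" if sampled else "00")

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    match = _TRACEPARENT.match(header)
    if match is None:
        return None  # malformed header: start a fresh trace rather than propagate garbage
    return {
        "trace_id": match.group(1),
        "parent_id": match.group(2),
        "sampled": match.group(3) == "01",
    }

incoming = parse_traceparent(make_traceparent())
```

the point of standardizing this is exactly the managed-ingress scenario above: any proxy can parse the header, mint its own span with the same trace_id, and pass it along.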
|
# Feb 24, 2019 05:02
|
distributed tracing owns bones.
|
# Feb 24, 2019 05:45
|
the SLA SLO SLI le lo

what the gently caress is a SLI, SLA, or a SLO? they're some hot-poo poo terms everyone and their brother loves throwing around because they also read the loving SRE book (or, more likely, read a blog by someone who read the SRE book) but don't really understand them that well and probably haven't implemented them at all for their service or product. im gonna explain why, unlike most bullshit that seeps out from the toxic hellstew of a walking antitrust lawsuit named google, they're actually useful concepts.

SLA me daddy

SLA stands for 'service level agreement'. this is probably something you've heard of, or possibly had to implement yourself. all of the five nines jokes go here. SLAs are mostly something that you get in trouble for breaking. that trouble might be financial (cash penalties or discounts for breaking a SLA) or organizational (your team gets reorg'ed into deadend project dungeons because you keep loving poo poo up for everyone else). SLAs aren't really for you, person reading this - they're for the people that consume your service. a good SLA is going to be realistic, but it can be crafted in a way that gives you a lot of latitude on how to actually fix your poo poo or make it better. you can (and should) be pretty specific with SLAs - maybe there are critical consumers that have them while others don't, which gives you a knob to turn for performance tuning or request throttling. it's also important to be able to quantify 'legitimate' traffic or requests in your SLA. maybe a consumer fucks up and starts sending malformed queries to your service, which you handle and return an error for. should those count against overall 'uptime'? probably not, and they certainly shouldn't count against the overall performance of your service.

SLO down

a SLO is a 'service level objective'. if the SLA is what you're promising, the SLO is what you want. this might seem like a subtle difference, but it's not. think about it this way: you release a new service, cool! it's got some neat features, but it's not your main priority - and even if it is, you don't want to get paged in the middle of the night when something breaks. maybe after a while you'd like it to have high availability, but you didn't really put in the extra work at the beginning to handle redundancy, multi-az, poo poo like that. after all, isn't the point of all this agile poo poo to work iteratively? so you send out some emails and slacks about your new API that fidgets foozles or quuz's quuxes, some people start using it, you start getting feedback, poo poo's going great. in your mind, you still have that goal of it being available during business hours, but in practice it's much more reliable - it's up all the time. then it starts to get popular. more people start relying on your service to quuz their quuxes. some dev in another team finds your service and decides to use it instead of writing their own, and now you're serving a lot more requests than you thought you'd be. poo poo goes wrong, things get caught crashlooping, and suddenly you're getting texts in the middle of the night because a public feature is down because your poo poo broke. bad scene.

how would an SLO have fixed this? well, they're published, for one. would that rando dev have started relying on your service if they knew you had a 48 hour turnaround on even looking at issues opened against it? two, they give you an 'error budget' (or 'downtime budget') to gently caress poo poo up with in order to make things better. SLOs are basically an internal contract that gives you leverage to actually run services with actual users (so that, theoretically, you can improve on them by getting feedback) without making you rip your hair out and/or want to murder everyone you work with (for a certain specific subset of reasons you'd want to murder your coworkers).

SLI but not the graphics card kind

so i've got my SLA and my SLO, what's a SLI? it's a 'service level indicator' and it's basically where monitoring comes into this mess. SLIs are the measurements you use to ensure you're hitting your SLO/SLA. this is probably a separate post on its own, but creating good SLIs can be difficult, and creating bad SLIs can be very easy. a good example is up in this post, when i was talking about SLAs - should i count 'bad requests' as part of my SLA? do i care that i'm successfully completing requests, or do i care that i can accept them at all? it's a lot more nuanced than simply "PINGDOM SAYS YOU'RE UP SO WE GOOD".
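here's the arithmetic version of that nuance, as a hypothetical sketch - the numbers and the choice to exclude malformed requests from the denominator are both things you'd have to define for your own service:

```python
def availability_sli(total, failed, malformed):
    # exclude malformed client requests from the denominator - a consumer
    # sending garbage shouldn't burn your availability numbers
    legit = total - malformed
    if legit <= 0:
        return 1.0
    return (legit - failed) / legit

def error_budget_remaining(slo, sli):
    budget = 1.0 - slo   # e.g. a 99.9% SLO leaves 0.1% of requests to burn
    burned = 1.0 - sli
    return budget - burned

# 1M requests this window, 400 genuine server-side failures,
# 50k malformed queries from that one consumer who fucked up
sli = availability_sli(total=1_000_000, failed=400, malformed=50_000)
remaining = error_budget_remaining(slo=0.999, sli=sli)
```

run something like that against your SLO and you can see whether you've still got budget left to burn on a risky deploy this month - that's the leverage the SLO section was talking about.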
|
# Feb 24, 2019 06:21
|
influxdb more like refluxdb because it gives you heartburn trying to run it
|
# Feb 25, 2019 19:51
|
CRIP EATIN BREAD posted: i fired it up and its been running since, seems braindead simple?

i actually dont know, i've never hosed with it. buddy of mine at work was griping about it being hard to use, maybe he's just bad
|
# Feb 25, 2019 22:28
|
lancemantis posted: i could probably do some kind of experience based long-post on it because it sucked rear end

you should!
|
# Feb 26, 2019 14:42
|
lancemantis posted: another fun anecdote: a kind of career recognition theft

lol this owns in a really lovely way
|
# Feb 27, 2019 03:26
|
my bitter bi rival posted: ive spent all morning loving with this crap. whats a better free alternative

better and free dont really go together in this world
|
# Mar 6, 2019 13:33
|
zookeeper is so loving cursed
|
# Mar 9, 2019 05:58
|
my stepdads beer posted: the prom / grafana guys are making a log thing now

lol y tho

Blinkz0rz posted: lol @ elastic

lol y tho
|
# Mar 13, 2019 14:01
|
that aws piece is some top notch concern trolling
|
# Mar 13, 2019 14:04
|
anyway here's some poo poo i've been working on behind the scenes for the past several months https://twitter.com/opentracing/status/1111389502889574400?s=20
|
# Mar 28, 2019 23:14
|
CRIP EATIN BREAD posted: opentracing is cool

thanks, I hope we don’t completely gently caress it up with the merger!!
|
# Apr 29, 2019 01:15
|
animist posted: so does OpenTracing do everything prometheus does, or should i be running some combination of OpenTracing + logging + metric collection? like is there some tracer i can plug into OpenTracing to make it pretend to be prometheus, or do i need to do that separately

they’re different things, although we’d like to condense it to one library for instrumentation (this is the point of the OpenTracing/opencensus merger). I can post some more about what it looks like today tomorrow
|
# Apr 30, 2019 05:52
|
CRIP EATIN BREAD posted: we use the opentracing api for everything and use jaeger as our backend which feeds it into elasticsearch

alternatively, consider a fine saas tracing product that performs tail sampling
|
# Jul 10, 2019 13:49
|
what do dogs know about cost accounting anyway
|
# Jul 29, 2020 05:06
|
my understanding is that they basically have just a massive loving Kafka cluster and boy I’d love to hear some stories from their SREs
|
# Jul 29, 2020 05:07
|
Pardot posted: i want to send structured logs/events at some service that I wont have to janitor. the events will mostly be an entity with one or more uuids, and then some info like state changes, maybe some timestamps or durations, idk. Nothing automated needs to consume this, just sometimes a person will do a search on the uuids to figure out what happened. I only need like 7 maybe 14 days of retention and the volume of incoming events wont be so high. Probably like 1gb/day, for sure less than 100. Only like 5 people will need access.

honestly? honeycomb's free tier might work out well for you

edit - assuming you want to share a password, I think they have a user limit on the free account
|
# Aug 3, 2020 12:53
|
who up observing they apps
|
# Feb 10, 2024 15:58
|
|
Hed posted: is there a good explainer to opentelemetry or is it one of those "if you have to ask it doesn't solve one of your usecases".

well you’re in luck, I wrote a book about it and it’ll be out next month! https://learningopentelemetry.com

but im also happy to just explain whatever. the basic concept of otel is that all of your telemetry signals should be correlated through hard context (trace ids, etc) as well as soft context (standardized metadata)
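a tiny illustration of the hard context part - this isn't the otel api, just a stdlib logging sketch of what "stamp every signal with the trace id and shared metadata" means in practice (all the names here are invented):

```python
import logging

class TraceContextFilter(logging.Filter):
    """stamp every log record with the active trace id (hard context) and
    shared resource attributes like the service name (soft context)."""

    def __init__(self, get_trace_id, resource):
        super().__init__()
        self._get_trace_id = get_trace_id
        self._resource = resource

    def filter(self, record):
        record.trace_id = self._get_trace_id() or "none"
        record.service = self._resource["service.name"]
        return True  # never drop the record, just annotate it

# in real life this would come off the active span, not a module-level variable
current_trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

log = logging.getLogger("checkout")
log.addFilter(TraceContextFilter(lambda: current_trace_id, {"service.name": "checkout-svc"}))

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(service)s trace=%(trace_id)s %(levelname)s %(message)s"))
log.addHandler(handler)

log.warning("payment retried")  # prints something like: checkout-svc trace=4bf9... WARNING payment retried
```

once the same trace id rides on the log line, the span, and the metric exemplar, your backend can pivot between all three - that's the correlation pitch.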
|
# Feb 12, 2024 23:40