|
pictured: docker run -p 9090:9090 prom/prometheus

sup fucks, monitoring poo poo sucks. that's why you can pay new relic or datadog like forty billion dollars a year to slow down all your poo poo and create pretty dashboards for all those tvs the ops people hung around the office. this thread is about monitoring and logging and tracing and poo poo.

a brief history of monitoring

originally no one really gave a poo poo about a lot of this, because you were using tomcat or iis or something, so you'd create metrics or logs in your application and write them to whatever the system-provided sink for metrics and logs was. if you were a real cheap bastard, you'd just dump some poo poo into a text file and grep it. this was fine because we hadn't yet created a massive ecosystem of 'distributed computing' companies that exist to tell your vp of eng how you're all incompetent hellfuckers and they should pay a few hundred thousand bucks a year to use their special service mesh jerkoff mcgee middleware to do it right with docker and kubernetes. so we decided to break our applications apart into a million bifurcated services and pieces, and now we don't know what the gently caress anything is doing when poo poo breaks, so here comes your friendly open source community to build some more goddamned tools to dig your way out of this nightmare you're trapped in.

logs

logs are probably the thing folks are most familiar with, and maybe the only thing you'll get when your dumb fart app has a problem. even the most remedial plang toucher will have the brilliant insight to print something to the console when an exceptional case occurs. logs basically underpin the entire monitoring ecosystem in some way, because fundamentally they're how the people writing the software tell their future selves what's happening in the code.
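that "print something when an exceptional case occurs" move, in its least remedial form, is a leveled logger. here's a minimal sketch with python's stdlib logging module (the service and function names are made up for illustration):

```python
import logging
import sys

# a leveled logger: statements below the configured threshold are dropped,
# so you can leave debug logging in the code and flip it on at runtime
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,  # flip to logging.DEBUG when you're actually debugging
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("quux_service")


def quuz_the_quuxes(n: int) -> int:
    log.debug("quuzing %d quuxes", n)  # suppressed while level is INFO
    try:
        return 100 // n
    except ZeroDivisionError:
        log.error("tried to quuz zero quuxes")  # the classic 'exceptional case' log
        raise
```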
it's pretty common to see people write a lot of logging statements in their code, then use some sort of runtime flag to determine which ones get printed to the console or a file, so you aren't writing poo poo like credit card numbers to a plain text file in prod (lol).

there are a lot of problems with logs though, especially in the exciting world of distributed computing. trying to search a bunch of log files or console output across a hundred docker containers sucks. trying to correlate logs that pass through multiple services sucks, especially since little things like "the time on all my computers is not perfectly synchronized, so the timestamps on the logs are off a bit" exist. moreover, a lot of individual log statements don't really mean much by themselves, but do in aggregate. so how do we aggregate the information contained in log files?

metrics

software sucks and crashes and has errors and poo poo. sometimes that's because you forgot to unfuckulate a pointer or the internet happened and some frames dropped or whatever, but poo poo breaks. you probably logged an error when that happened. if you know what you're looking for, that log message can be important and useful, but a lot of times these errors only really make sense in aggregate. there are also all sorts of implicit things happening when you run your software, like the amount of memory being consumed on a host, or the CPU usage, whatever. in general, these sorts of measurements are called 'metrics', and there are two main types: counters and gauges. a counter is what it sounds like - a number that starts at 0 and only goes up from there. this is something like "total number of requests since this application started". gauges can go either way (like ur mom lol), they can increase or decrease. this is something like "number of concurrent requests right now".
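here's the counter/gauge distinction as code - a toy sketch of the semantics only. real metric clients (prometheus's, statsd's, etc.) work roughly like this, plus labels, registries, and thread safety:

```python
class Counter:
    """monotonic: starts at 0, only goes up. e.g. total requests served."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up; use a gauge")
        self.value += amount


class Gauge:
    """can go either way. e.g. concurrent requests right now."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


requests_total = Counter()  # "total requests since this application started"
in_flight = Gauge()         # "number of concurrent requests right now"


def handle_request() -> None:
    requests_total.inc()
    in_flight.inc()
    try:
        pass  # ... actually do the work ...
    finally:
        in_flight.dec()
```

note the counter refuses to go down - that property is what lets a collector compute rates over it without caring about missed scrapes.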
there's also poo poo like histograms, which let you put counts in buckets, and summaries, which are basically histograms that calculate a quantile over a sliding time window. metrics are cool - you can generate them by parsing your log files, emit counts/gauges directly from your application code to some sort of collector, or use agent processes that collect application metrics from the runtime and also from the host, all sorts of wildass poo poo.

traces

metrics and logs both have one big problem - they're too general (metrics) and too specific (logs) to really get an idea about specific stuff going on in your application. you can get an idea about the overall state of requests with your metrics, and you can try to identify highly specific failures by looking at logs, but a lot of times you want to understand what's happening at a level somewhere between the two - for instance, if i want to look at a single request as it goes from the browser, through whatever ingress/load balancing bullshit i have, into a variety of backend services, all the way to the DB and back. traces - specifically, distributed traces - are the answer to this. traces are made up of 'spans', and a span is really just a record of how long it took to do a thing, plus a span context that holds some identifiers, plus a bunch of information you can shove in like tags and logs and crap. each service emits its spans to some sort of collector, which can then reassemble the disparate spans into a single trace based on the identifiers in the span context.

conclusion

cool, now you know what some of this poo poo is, so go ahead and ask questions or shitpost or whatever about monitoring, tracing, etc. share your stories about horrible home-brewed monitoring setups, or how your company got rooked by new relic. i know there's a few of us here who work with this stuff on the regular and would probably be happy to answer questions about this poo poo.
personally, i work on distributed tracing and contribute to a tracing framework so i can field questions on that. thanks
|
# ? Feb 22, 2019 18:00 |
|
|
# ? May 28, 2024 16:33 |
|
pity
|
# ? Feb 22, 2019 18:04 |
|
just wanted to say i appreciate the entry of unfuckulate into common parlance, thanks
|
# ? Feb 22, 2019 18:07 |
|
preemptive 5
|
# ? Feb 22, 2019 18:11 |
|
Butcher posted:just wanted to say i appreciate the entry of unfuckulate into common parlance, thanks yw
|
# ? Feb 22, 2019 18:24 |
|
hey this is a good post. so i'm pretty good at gathering metrics but i'm bad at analyzing them. i know how to use some of the analytical functions in graphite but many are voodoo and i frequently have a hard time interpreting supposedly useful graphs. anyone have a good primer? something like this but maybe a little quicker and vetted for me.

DONT THREAD ON ME fucked around with this message at 18:33 on Feb 22, 2019 |
# ? Feb 22, 2019 18:30 |
|
https://gfycat.com/sillydesertedborzoi-aliens-prometheus-engineer-god
|
# ? Feb 22, 2019 19:22 |
|
the answer to “why did the robot do all that stuff? is he a secret rear end in a top hat?” is yes, because his father never truly loved him and made that very apparent despite his obviously surpassing humanity
|
# ? Feb 22, 2019 19:23 |
|
I liked Prometheus
|
# ? Feb 22, 2019 19:24 |
|
monitor deez nutz
|
# ? Feb 22, 2019 19:25 |
|
what's this got to do with cloudflare ray
|
# ? Feb 22, 2019 19:28 |
|
i like prtg better than solarwinds, OP
|
# ? Feb 22, 2019 19:36 |
|
Silver Alicorn posted:the answer to “why did the robot do all that stuff? is he a secret rear end in a top hat?” is yes, because his father never truly loved him and made that very apparent despite his obviously surpassing humanity also alien shares the same universe as blade runner. the robots are just not happy.
|
# ? Feb 22, 2019 20:07 |
|
I have strong opinions on logging and monitoring and metrics. Here's one: if your log line is a string you hosed up. Get some structured logging going.
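in the spirit of the above: a minimal structured logger sketch - emit one json object per event instead of a formatted string, so your log pipeline can filter on fields instead of regexing prose. stdlib only, and the field names are arbitrary:

```python
import json
import sys
import time


def log_event(event: str, stream=sys.stdout, **fields) -> None:
    """emit one machine-parseable json object per log line."""
    record = {"ts": time.time(), "event": event, **fields}
    stream.write(json.dumps(record, sort_keys=True) + "\n")


# instead of: log.error(f"payment {pid} failed for user {uid} after {ms}ms")
# you write:
#   log_event("payment.failed", payment_id=pid, user_id=uid, duration_ms=ms)
# and downstream you query duration_ms > 500 instead of parsing strings
```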
|
# ? Feb 22, 2019 20:09 |
|
if your log is a string you have digestion problems
|
# ? Feb 22, 2019 23:56 |
|
where does journald come into all this
|
# ? Feb 23, 2019 11:15 |
|
avoid counters imo
|
# ? Feb 23, 2019 13:42 |
|
if you don’t have metrics you don’t have alerts which means you don’t get paged if poo poo goes sideways so do that imo
|
# ? Feb 23, 2019 14:33 |
|
DONT THREAD ON ME posted:hey this is a good post. the oreilly monitoring with graphite book is p dece from what i understand, but yeah, poo poo's tricky to understand without having some stats background.
|
# ? Feb 23, 2019 16:06 |
|
i saw a talk by one of the authors of nanolog (https://github.com/PlatformLab/NanoLog) which is insanely badass if you want some ridiculously fast logging. https://www.usenix.org/sites/default/files/conference/protected-files/atc18_slides_yang.pdf
|
# ? Feb 23, 2019 16:09 |
|
uncurable mlady posted:
q: is mercury peepin that action the user?
|
# ? Feb 23, 2019 16:21 |
|
Agile Vector posted:q: is mercury peepin that action the user? mercury is your boss, thirsty for SLI/SLOs
|
# ? Feb 23, 2019 16:36 |
|
I use Graylog. pretty good.
|
# ? Feb 23, 2019 16:37 |
|
Can you guys help me? I can see in new relic that my cluster failed 3 ping checks in december, I need an RCA for each of them. I'm marking this request the highest urgency thank you.
|
# ? Feb 23, 2019 16:42 |
|
How can you make sure your monitoring is working tho
|
# ? Feb 23, 2019 17:02 |
|
Captain Foo posted:How can you make sure your monitoring is working tho pingdom lookin' at the mrtg page and mrtg monitoring pingdom of course
|
# ? Feb 23, 2019 17:03 |
|
Jonny 290 posted:pingdom lookin' at the mrtg page and mrtg monitoring pingdom of course
|
# ? Feb 23, 2019 17:12 |
|
the new hotness these days is SLO/SLI tho, i should write a post about that
|
# ? Feb 23, 2019 17:14 |
|
https://www.cacti.net is extremely underrated for generic snmp poo poo. the server i set up 10 years ago at work still works pretty much maintenance free
|
# ? Feb 23, 2019 18:00 |
|
i really enjoyed when honeycomb did a demo with their tool on the real world example of a service outage on their platform to discover that one of their servers ran out of disk space, something that a loving nagios check would have picked up 20 years ago
|
# ? Feb 23, 2019 20:27 |
|
prometheus is pretty cool but shoehorning everything into a pull model annoys me
|
# ? Feb 23, 2019 20:29 |
|
r u ready to WALK posted:https://www.cacti.net is extremely underrated for generic snmp poo poo

it’s ok if what you need to do is turn snmp (or snmp-likes) into browser-viewable rrds on one host. also make sure you stay on top of the security updates, or keep it inaccessible from the web
|
# ? Feb 23, 2019 20:42 |
|
tracing seems to be something you write into your web app. anybody doing non-http tracing, or collecting trace data from apps you don’t write yourself that don't have native tracing support?
|
# ? Feb 23, 2019 20:56 |
|
PCjr sidecar posted:tracing seems to be something you write into your web app

i wouldn't say it's just in your web app - our entire application is instrumented from the webapp all the way down to the db. that said, it's kind of a massive pain in the rear end right now to get trace data from resources you don't directly manage, because not everyone uses opentracing, and even if they did, wire formats are very tracer-dependent.

that said, the w3c is working on a tracecontext/tracedata specification that's intended to address this problem by standardizing headers and wire formats for context. so you could have a situation where you're using some sort of managed ingress proxy or w/e and it'd be able to create spans as part of a trace that started on a client, etc. you could also see the same thing at a managed db, where the database service on the provider side is able to pick up traces incoming from the application and emit spans that you'd collect.

are you using tracing now? something home-brewed, or opentracing/opencensus?
|
# ? Feb 24, 2019 05:02 |
|
distributed tracing blew my mind kinda
|
# ? Feb 24, 2019 05:36 |
|
distributed tracing owns bones.
|
# ? Feb 24, 2019 05:45 |
|
the SLA SLO SLI le lo

what the gently caress is a SLI, SLA, or a SLO? they're some hot-poo poo terms everyone and their brother loves throwing around because they also read the loving SRE book (or, more likely, read a blog by someone who read the SRE book) but don't really understand it that well and probably haven't implemented any of it for their service or product. im gonna explain why, unlike most bullshit that seeps out from the toxic hellstew of a walking antitrust lawsuit named google, they're actually useful concepts.

SLA me daddy

SLA stands for 'service level agreement'. this is probably something you've heard of, or possibly had to implement yourself. all of the five nines jokes go here. SLAs are mostly something that you get in trouble for breaking. that trouble might be financial (cash penalties or discounts for breaking a SLA) or organizational (your team gets reorg'ed into deadend project dungeons because you keep loving poo poo up for everyone else). SLAs aren't really for you, person reading this - they're for the people that consume your service. a good SLA is going to be realistic, but it can be crafted in a way that gives you a lot of latitude on how to actually fix your poo poo or make it better. you can (and should) be pretty specific with SLAs - maybe there are critical consumers that have them while others don't; this gives you a knob to turn for performance tuning or request throttling. it's also important to be able to quantify 'legitimate' traffic or requests in your SLA. maybe a consumer fucks up and starts sending malformed queries to your service, which you handle and return an error for. should those count against overall 'uptime'? probably not, and they certainly shouldn't count against the overall performance of your service.

SLO down

a SLO is a 'service level objective'. if the SLA is what you're promising, the SLO is what you want. this might seem like a subtle difference, but in a lot of ways it's not. think about it this way.
you release a new service, cool! it's got some neat features, but it's not your main priority - and even if it is, you don't want to get paged in the middle of the night when something breaks. maybe after a while you'd like it to have high availability, but you didn't really put in the extra work at the beginning to handle redundancy, multi-az, poo poo like that. after all, isn't the point of all this agile poo poo to work iteratively? so you send out some emails and slacks about your new API that fidgets foozles or quuz's quuxes, some people start using it, you start getting feedback, poo poo's going great. in your mind, you still have that goal of it being available during business hours, but in practice it's much more reliable - it's up all the time.

but it starts to get popular. more people start relying on your service to quuz their quuxes. some dev on another team finds your service and decides to use it instead of writing their own, and now you're serving a lot more requests than you thought you'd be. poo poo goes wrong, things get caught crashlooping, and suddenly you're getting texts in the middle of the night because a public feature is down because your poo poo broke. bad scene.

how would an SLO have fixed this? well, for one, they're published. would that rando dev have started relying on your service if they knew you had a 48 hour turnaround on even looking at issues opened against it? two, they give you an 'error budget' (or 'downtime budget') so you can gently caress poo poo up in order to make it better. SLOs are basically an internal contract that gives you leverage to actually run services with actual users (so that, theoretically, you can improve them by getting feedback) without making you rip your hair out and/or want to murder everyone you work with (for a certain specific subset of reasons you'd want to murder your coworkers).

SLI but not the graphics card kind

so i've got my SLA and my SLO, what's a SLI?
it's a 'service level indicator', and it's basically where monitoring comes into this mess. SLIs are the measurements you use to ensure you're hitting your SLO/SLA. this is probably a separate post on its own, but creating good SLIs can be difficult, and creating bad SLIs can be very easy. a good example is up in this post, when i was talking about SLAs - should i count 'bad requests' against my SLA? do i care that i'm successfully completing requests, or do i care that i can accept them at all? it's a lot more nuanced than simply "PINGDOM SAYS YOU'RE UP SO WE GOOD".
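that 'should bad requests count' question is just arithmetic once you write it down. a sketch of an availability SLI that excludes malformed client requests, plus the error budget it leaves you (all the numbers are invented):

```python
def availability_sli(total: int, server_errors: int, malformed: int) -> float:
    """fraction of *valid* requests served successfully.
    malformed client requests don't count against you either way."""
    valid = total - malformed
    if valid <= 0:
        return 1.0
    return (valid - server_errors) / valid


def error_budget_left(sli: float, slo_target: float) -> float:
    """fraction of the allowed-failure budget still unspent this period."""
    allowed = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    burned = 1.0 - sli
    return 1.0 - burned / allowed


# say: 1,000,000 requests this month, 400 5xxs, 50,000 garbage client queries
sli = availability_sli(1_000_000, 400, 50_000)
```

with those made-up numbers you met a 99.9% SLO with budget to spare - which is exactly the leverage the SLO post above is talking about: budget left over is downtime you're allowed to spend on risky deploys.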
|
# ? Feb 24, 2019 06:21 |
|
articles like this are usually too fluffy but i got a lot out of it re: distributed tracing https://nickcraver.com/blog/2018/11/29/stack-overflow-how-we-do-monitoring/
|
# ? Feb 24, 2019 06:24 |
|
uncurable mlady posted:the SLA SLO SLI le lo this is a very good post op
|
# ? Feb 24, 2019 07:06 |
|
|
|
MononcQc posted:I have strong opinions on logging and monitoring and metrics. my friend you never want to look into doing network or security monitoring
|
# ? Feb 24, 2019 10:20 |