abigserve
Sep 13, 2009

this is a better avatar than what I had before

MononcQc posted:

I have strong opinions on logging and monitoring and metrics.

Here's one: if your log line is a string you hosed up. Get some structured logging going.

my friend you never want to look into doing network or security monitoring
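
(for anyone following along, "structured logging" just means emitting key/value pairs instead of a prose sentence, so downstream tools can query fields instead of grepping. a rough python sketch, field names made up:)

code:
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("app")

# string log line: only useful via full-text search
log.info("could not load user 1234 because an exception happened")

# structured log line: every field is individually queryable downstream
log.info(json.dumps({
    "ts": time.time(),
    "event": "user_load_failed",
    "user_id": 1234,
    "error": "DatabaseTimeout",
}))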


abigserve
Sep 13, 2009

this is a better avatar than what I had before

Captain Foo posted:

big lol that you decided to @mononcqc with this

if he has any idea how to do it i'm all ears, i'll send it straight to our infosec team who are currently building a hadoop stack to try and deal with it

abigserve
Sep 13, 2009

this is a better avatar than what I had before

MononcQc posted:

Captain Foo mentioned this because I used to work on the routing team at Heroku and maintained part of their logging stack for a while, and I now work at a security company helping them set up some IoT stuff for data acquisition, so it does make for a funny overlap. I have, however, not worked in infosec directly.

I don't know what exactly your team's doing, but going for hadoop for infosec and networking makes me think that they're trying to straight up do analytics on network traces or at least network metadata (connection endpoints, protocols/certs, payload sizes, etc.) -- so it'd be interesting to figure out what they're actually trying to accomplish. If it's a dragnet thing it's different from actual logging since you would probably have the ability to control some stuff there?

Most network software logs at least tend to have a semblance of structure, so they're not as bad a case as "<Timestamp> could not load user <id> because an exception happened", which essentially requires a full-text search to do anything with.

You pretty much nailed it - essentially they are trying to build an analytics stack that brings in a bunch of different data sources to assist in root cause analysis and hunting sessions.

The issue is, little to none of the data is structured, and the structures that are in place are often inconsistent. To illustrate with an example: say you want a consistent view of source IP addresses accessing any web page in the enterprise. You need to be able to parse the following different log formats:

- apache
- nginx
- iis
- netscaler
- firewall connections
- netflow records
- EDR

it's pretty hard. And that's after you ingest all of those data sets into a sane bucket that actually allows you to parse it in the first place.
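
(to make that concrete, the normalization step is basically "map every format onto one schema and keep the raw line around" - a rough python sketch, regexes heavily simplified and field names made up:)

code:
import json
import re

# two of the many formats, both mapped to the same schema
PATTERNS = {
    "apache": re.compile(r'^(?P<src_ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)"'),
    "firewall": re.compile(r'src=(?P<src_ip>\S+) dst=(?P<dst_ip>\S+) proto=(?P<proto>\S+)'),
}

def normalize(line, source):
    """Map a raw log line from a known source into one common schema."""
    pattern = PATTERNS.get(source)
    m = pattern.search(line) if pattern else None
    if not m:
        # keep unparseable lines, but tagged, so they can be cleaned up later
        return {"source": source, "parsed": False, "raw": line}
    return {"source": source, "parsed": True, "src_ip": m.group("src_ip"), "raw": line}

print(json.dumps(normalize('10.1.2.3 - - [13/Sep/2019:10:00:00 +0000] "GET / HTTP/1.1" 200', "apache")))
print(json.dumps(normalize("src=10.1.2.3 dst=192.0.2.7 proto=tcp", "firewall")))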

abigserve
Sep 13, 2009

this is a better avatar than what I had before

MononcQc posted:

Right. So the two patterns there are essentially what they're doing and just hadooping the hell out of it, or otherwise treating individual logs as a stream that you have to process into a more standard format. Currently they're likely forwarding logs from specific servers to remote instances by reading them from disk first and then shoving them over a socket (or some syslog-like agent); the cheapest way to go would be to do that stream processing at the source as the agent reads up the logs from the disk and before it forwards them.

This requires deploying the agent to all instances (or as a sidecar), but from that point on, all the data is under a more standardized format, and you can shove it in hadoop or splunk or whatever.

E: do forward unknown/unformattable logs but annotated as such within a structure so that they can iteratively clean poo poo up

You're spot on - their initial shot was to do stream processing on all the logs using Splunk, but licensing on that side and the administrative overhead of maintaining correct splunk forwarder/stream configurations drove them to a dump-in-a-lake model.

Real talk, it's a loving mess and I don't envy whoever has to untangle it all into something that resembles a sane data structure.
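
(the "do it at the source" model he's describing is roughly: the agent tails the file, normalizes each line before it leaves the box, and ships it on. a sketch of that, with a made-up collector endpoint and a stubbed-out parser:)

code:
import json
import socket
import time

def normalize(line, source):
    # stand-in for real parsing; unknown/unparseable lines still get
    # forwarded, just tagged as raw, so they can be cleaned up iteratively
    return {"source": source, "parsed": False, "raw": line}

def follow(path):
    """Tail a log file, yielding new lines as they are written."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip("\n")

# made-up collector address; in reality this is syslog/kafka/splunk/whatever
sock = socket.create_connection(("log-collector.example.internal", 5170))
for raw in follow("/var/log/nginx/access.log"):
    event = normalize(raw, "nginx")  # normalize at the source, before forwarding
    sock.sendall((json.dumps(event) + "\n").encode())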

abigserve
Sep 13, 2009

this is a better avatar than what I had before
I was recently tasked with migrating several of our ancient monitoring servers to a newer distro, along with a bunch of one-off scripterinos that hook into it.

So I spent the week migrating all of the stuff into version control, automating the quite complex builds and documenting as much as possible. I spent a couple of days trying to discover all the dependencies by going through the servers but I had a fairly good picture leading in of the service and the dependencies it had. After all, we were responsible for the service and I knew what the team did and didn't use.

loving wrongo, turns out the entire thing forms the foundation of a web of other scripts, webpages, and alarms for an entirely different team, none of which is documented or in source control, and which I literally had no idea existed.

The worst part is it's no (technical) person's fault either. The person they have looking after it is an exceptionally good programmer and an all-around nice guy, but he's only seconded for a day per week, and the second he walks in the door they are piling him with work that needs to be out the door TODAY ASAP

It's a cold splash of water. I've been dealing almost exclusively in new builds or drop-in replacements; it feels like I've been living in a zen garden and now I'm back in the weeds.

tldr: gently caress monitoring


abigserve
Sep 13, 2009

this is a better avatar than what I had before

supabump posted:

breaking my vow of lurking to share my cool monitoring story

i work at one of your favorite tech monoliths and we're doing this big new enterprise service with like 20 different components frankensteined together. our monitoring was supposedly very good though. All 20 of these pos components were logging everything effectively, dashboards were all set up, etc.

for an entire 2 weeks, we had a live site issue where component A was DoSing component B for a customer with a shitload of data. wasn't obvious to the customer that anything was amiss, but our service was functionally useless to them for those two weeks, and we didn't have a loving clue

moral of the story is that none of your lovely monitoring does anything if nobody bothers to set up alerts (and that nobody looks at monitoring dashboards unprompted once the demo is over)

counterpoint: too many alerts


Elos posted:

current monitoring job status: i have 15 different unsee dashboards open and there's anything from 10 to 700 warnings/alerts bouncing around in each of them. i'm somehow supposed to keep an eye on all of these with only the two monitors i have

these are monitoring a bunch of datacenters that are a hellish mix of mesos-marathon microservice container stuff and poo poo running straight on the metal, located around the world. i'm connected to them through a collection of ssh-tunnels over connections and vpns that sometimes just decide to stop working.

there's a lot of gently caress A CONTAINER IS DOWN!!! alerts and then you go check and everything is fine. bunch of alerts that once fired will hang around in the dashboard for 6 hours because ??? i'm never confident i'll catch the real problems with all this useless noise. there's grafana too but a lot of the stuff isn't configured right so you have to go massage some prometheus queries by hand to get the graphs you need

documentation is of course nonexistent and/or poo poo. configuring the monitoring is some other team's job and a lot of the time all i can do is open a ticket and hope. when poo poo hits the fan i have only a vague idea who's responsible for what and who the hell i'm supposed to call, so i get to wake up my team leader at four in the morning so he can figure it out

welp thats my story, back to lurking


Is all monitoring useless? The answer may surprise you!! (yes).

The only workflow I've found to work in the monitoring world is to only alert on things you won't otherwise be alerted to naturally - an example would be a core router being down: I don't need an email or SMS to tell me that my network is hosed. Instead, focus your alerts on insidious smaller issues you may not notice for a very long time, like errors on an interface or high IO wait on a database server.

Then, you need at least a once-a-week housecleaning of all the alerts. We do it at our team meeting and it takes about 10 minutes because it's a religious affair - and if an alert is red AND unacknowledged for more than a week, it gets removed from monitoring as it's clearly not important.
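
(in code terms the housecleaning rule is dead simple - a sketch, assuming whatever alert state your monitoring system exposes, names made up:)

code:
from datetime import datetime, timedelta, timezone

WEEK = timedelta(days=7)

def alerts_to_remove(alerts, now=None):
    """Weekly housecleaning: anything red AND unacknowledged for over a
    week clearly isn't important enough to keep in monitoring."""
    now = now or datetime.now(timezone.utc)
    return [a for a in alerts
            if a["state"] == "red"
            and not a["acknowledged"]
            and now - a["last_changed"] > WEEK]

# made-up alert records for illustration
alerts = [
    {"name": "core-router-down", "state": "red", "acknowledged": True,
     "last_changed": datetime.now(timezone.utc) - timedelta(days=2)},
    {"name": "eth3-crc-errors", "state": "red", "acknowledged": False,
     "last_changed": datetime.now(timezone.utc) - timedelta(days=10)},
]
for alert in alerts_to_remove(alerts):
    print("remove from monitoring:", alert["name"])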
