MononcQc
May 29, 2007

I have strong opinions on logging and monitoring and metrics.

Here's one: if your log line is a string you hosed up. Get some structured logging going.
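A minimal sketch of what I mean, using only Python's standard library. The field names (event, user_id, reason) and the "fields" convention are made up for illustration, not any particular library's API:

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object instead of a free-form string."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Hypothetical convention: anything passed as extra={"fields": {...}}
        # is merged in as top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream):
    """Build a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("structured-demo")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buf = io.StringIO()
log = make_logger(buf)
# The event name is a stable key; the variable parts are separate fields.
log.error("user_load_failed", extra={"fields": {"user_id": 42, "reason": "timeout"}})
line = json.loads(buf.getvalue())
```

The point is that you can now filter on user_id or reason directly instead of regexing a sentence.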

MononcQc
May 29, 2007

uncurable mlady posted:

i wouldn't say it's just in your web app - our entire application is instrumented from webapp all the way down to the db. that said, it's kind of a massive pain in the rear end right now to get trace data from resources you don't directly manage because not everyone uses opentracing and even if they did, wire formats are very tracer dependent.

I have big opinions about what to trace and log, and where to put the probes -- about 15 pages worth of opinions -- that I put in one place at https://ferd.ca/operable-software.html

Mostly, if I have to TL;DR my views, it's that:
  • you have to be aware that usage of the system forces and creates diverging mental models for all users and operators
  • just tracking what was problematic in the past is a losing proposition that will result in unmaintainable messes that don't offer any insights
  • debugging is often not done by just flat out understanding the internals of the whole system, but through trying to understand your interactions with a given set of underlying components and abstractions. Digging a layer below requires coming up with a whole new mental model.
  • making "everything visible" is a stupid idea because it asks of people interacting with the system to know everything that may or may not be relevant no matter their expertise level
  • you therefore want to locate probes for debugging/tracing/logging/metrics at a layer below the one you're planning to interact with. For example, probes in your app should be there so users or support can figure out if they're configuring/using the app right. Probes in the framework (say middleware in a server) are for developers to figure out if the app they wrote is behaving right, and so on.
  • Building on a stack of abstractions that lack observability features forces you to cope by reinventing them at your own layer and is generally a nightmare
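To make the "probes in the framework" point concrete, here's a rough sketch of a WSGI middleware that emits one structured event per request, so the developer of the wrapped app can see how it behaves without instrumenting it themselves. The names probe_middleware and emit, and the event fields, are assumptions for the example:

```python
import time


def probe_middleware(app, emit):
    """Wrap a WSGI app and call emit(dict) once per handled request."""

    def wrapped(environ, start_response):
        started = time.monotonic()
        seen = {}

        def recording_start_response(status, headers, exc_info=None):
            # Remember the status line the app chose, then pass it through.
            seen["status"] = status
            return start_response(status, headers, exc_info)

        try:
            return app(environ, recording_start_response)
        finally:
            # Note: this only times the call into the app, not the
            # iteration of the response body by the server.
            emit({
                "event": "request_handled",
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "status": seen.get("status"),
                "duration_ms": round((time.monotonic() - started) * 1000, 3),
            })

    return wrapped
```

The probe lives one layer below the app code, which is the layer the app's developer needs to see into.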

And so the idea is to think in terms of "operator experience" the same way we would do "user experience", and figure out patterns in which to lay out information that speaks to the different types and levels of expertise of users and operators. If you don't have that, you have a lot of data, but it's not necessarily going to be useful at all.

MononcQc
May 29, 2007

abigserve posted:

if he has any idea how to do it i'm all ears, i'll send it straight to our infosec team who are currently building a hadoop stack to try and deal with it

Captain Foo mentioned this because I used to work for the routing team at Heroku and maintained part of their logging stack for a while, and I now work at a security company, helping them set up some IoT stuff for data acquisition, so it does make for a funny overlap. I have however not worked in infosec directly.

I don't know what exactly your team's doing, but going for hadoop for infosec and networking makes me think that they're trying to straight up do analytics on network traces or at least network metadata (connection endpoints, protocols/certs, payload sizes, etc.) -- so it'd be interesting to figure out what they're actually trying to accomplish. If it's a dragnet thing it's different from actual logging since you would probably have the ability to control some stuff there?

Most network software logs at least tend to have a semblance of structure, so they're not as bad of a case as "<Timestamp> could not load user <id> because an exception happened", which essentially requires a full-text search to do anything with.

MononcQc
May 29, 2007

Right. So the two patterns there are essentially what they're doing and just hadooping the hell out of it, or otherwise treating individual logs as a stream that you have to process into a more standard format. Currently they're likely forwarding logs from specific servers to remote instances by reading them from disk first and then shoving them over a socket (or some syslog-like agent); the cheapest way to go would be to do that stream processing at the source as the agent reads up the logs from the disk and before it forwards them.

This requires deploying the agent to all instances (or as a sidecar), but from that point on, all the data is under a more standardized format, and you can shove it in hadoop or splunk or whatever.

E: do forward unknown/unformattable logs, but annotated as such within a structure, so that they can iteratively clean poo poo up
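A rough sketch of that agent-side step, assuming a made-up line format of "<timestamp> <LEVEL> <message>". The regex, the field names, and the send callback are all hypothetical; a real agent would carry one parser per log source:

```python
import json
import re

# Hypothetical input format: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def normalize(raw_line, source):
    """Turn one raw log line into a dict in the standardized format."""
    line = raw_line.rstrip("\n")
    match = LINE_RE.match(line)
    if match:
        return {"source": source, "parsed": True, **match.groupdict()}
    # Unknown format: forward it anyway, wrapped in the same structure and
    # flagged, so unparsed lines can be found and cleaned up over time.
    return {"source": source, "parsed": False, "raw": line}


def forward(lines, source, send):
    """Normalize each line as it is read and hand the JSON to `send`."""
    for raw in lines:
        send(json.dumps(normalize(raw, source), sort_keys=True))
```

Here `lines` would be whatever the agent tails off the disk and `send` whatever writes to the socket. Querying for parsed == false then gives you the cleanup backlog.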
