

What is the "first principles" argument that observability decomposes into logs, metrics, and tracing? I see this dogma accepted everywhere, but I'm curious where it actually comes from.


yannyu|2 years ago

First you had logs. Everyone uses logs because they're easy. Logs are great, but suddenly you're spending a crapton of time or money storing and ingesting terabytes or petabytes of them. Even worse, for many of these logs you don't actually care about 99% of the log line and simply want a single number, such as CPU utilization, the value of the shopping cart, or latency.

So, someone says, "let's make something smaller and more portable than logs. We need to track numerical data over time more easily, so that we can see pretty charts of when these values are outside of where they should be." This ends up being metrics and a time-series database (TSDB), built to handle not arbitrary lines of text but instead meant to parse out metadata and append numerical data to existing time-series based on that metadata.
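To make the "parse out metadata and append numerical data" step concrete, here is a minimal sketch in Python. The log line format, the `parse_line` helper, and the `TinyTSDB` class are all made up for illustration; a real TSDB adds compression, retention, and a query language on top of this basic idea.

```python
from collections import defaultdict

# Hypothetical log format, assumed for illustration only:
#   "<unix_ts> host=<host> metric=<name> value=<float> <free text...>"
def parse_line(line):
    """Keep the metadata and the number, throw away the rest of the line."""
    ts, *fields = line.split()
    kv = dict(f.split("=", 1) for f in fields if "=" in f)
    labels = (kv["metric"], kv["host"])  # metadata that identifies the series
    return labels, int(ts), float(kv["value"])

class TinyTSDB:
    """Toy in-memory store: one list of (timestamp, value) per label set."""
    def __init__(self):
        self.series = defaultdict(list)

    def append(self, labels, ts, value):
        self.series[labels].append((ts, value))

db = TinyTSDB()
log_lines = [
    "1700000000 host=web1 metric=cpu_util value=0.42 request handled ok",
    "1700000060 host=web1 metric=cpu_util value=0.97 request handled slowly",
    "1700000000 host=web2 metric=cpu_util value=0.10 idle",
]
for line in log_lines:
    db.append(*parse_line(line))

# Each series is now a few numbers instead of full lines of text.
web1_cpu = db.series[("cpu_util", "web1")]
```

The point of the sketch is the storage shape: the free text is gone, and each new sample is just a timestamp and a float appended to an existing series.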

Between metrics and logs, you end up with a good idea of what's going on with your infrastructure, but logs are still too verbose to understand what's happening with your applications past a certain point. If you have an application crashing repeatedly, or if you've got applications running slowly, metrics and logs can't really help you there. So companies built out Application Performance Monitoring, meant to tap directly into the processes running on the box and spit out all sorts of interesting runtime metrics and events about not just the applications, but the specific methods and calls those applications are utilizing within their stack/code.
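As a rough sketch of what "tapping into the process" means at the method level, the Python below wraps a function and records how long each call takes. The `instrument` decorator, `call_stats`, and `price_cart` are hypothetical names; actual APM agents generally hook in without you editing the code, but the data they collect has roughly this shape.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function list of call durations in seconds (hypothetical store).
call_stats = defaultdict(list)

def instrument(fn):
    """Record the wall-clock duration of every call to fn."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Runs even if fn raises, so failing calls are timed too.
            call_stats[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@instrument
def price_cart(items):
    return sum(items)

price_cart([1, 2, 3])
price_cart([4])
```

After those two calls, `call_stats["price_cart"]` holds two durations, which is the kind of per-method data that plain infrastructure metrics don't give you.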

Initially, this works great if you're running these APM tools on a single box within monolithic stacks, but as the world moved toward Cloud Service Providers and containerized/ephemeral infrastructure, APM stopped being as effective. When a transaction starts to go through multiple machines and microservices, APM deployed on those boxes individually can't give you the context of how these disparate calls relate to a holistic transaction.

So someone says, "hey, what if we include transaction IDs in these service calls, so that we can post-hoc stitch together these individual transaction lines into a whole transaction, end-to-end?" Which is how you end up with the concept of spans and traces: taking what worked well with Application Performance Monitoring and generalizing it to the microservice architectures that are more common today.
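The post-hoc stitching can be sketched in a few lines of Python. The span records and field names below (`trace_id`, `span_id`, `parent_id`, `start_ms`, `end_ms`) are assumptions for illustration, loosely modeled on how tracing systems usually describe spans, and the `stitch` and `render` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical spans, reported independently (and out of order) by each service.
spans = [
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "cart", "start_ms": 5, "end_ms": 40},
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "gateway", "start_ms": 0, "end_ms": 50},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "db", "start_ms": 10, "end_ms": 35},
    {"trace_id": "t2", "span_id": "x", "parent_id": None, "service": "gateway", "start_ms": 3, "end_ms": 9},
]

def stitch(spans):
    """Group spans by the transaction ID that was passed along with each call."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return traces

def render(trace):
    """Rebuild the call tree of one trace from parent links, as indented lines."""
    children = defaultdict(list)
    for span in trace:
        children[span["parent_id"]].append(span)

    lines = []
    def walk(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            duration = span["end_ms"] - span["start_ms"]
            lines.append("  " * depth + f"{span['service']} {duration}ms")
            walk(span["span_id"], depth + 1)
    walk(None, 0)  # root spans are the ones with no parent
    return lines

traces = stitch(spans)
t1_view = render(traces["t1"])
# t1_view == ["gateway 50ms", "  cart 35ms", "    db 25ms"]
```

No single service ever saw the whole request; the shared ID is what lets you reassemble the end-to-end picture afterwards.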