One thing about logging and tracing is the inevitable cost (in real money).
I love observability probably more than most. And my initial reaction to this article is the obvious: why not both?
In fact, I tend to think more in terms of "events" when writing both logs and tracing code. How that event is notified, stored, transmitted, etc. is in some ways divorced from the activity. I don't care if it is going to stdout, or over udp to an aggregator, or turning into trace statements, or ending up in Kafka, etc.
But inevitably I bump up against cost. For even medium sized systems, the amount of data I would like to track gets quite expensive. For example, many tracing services charge for the tags you add to traces. So doing `trace.String("key", value)` becomes something I think about from a cost perspective. I worked at a place that had a $250k/year New Relic bill and we were avoiding any kind of custom attributes. Just getting APM metrics for servers and databases was enough to get to that cost.
Logs are cheap, easy, reliable, and don't lock me into an expensive service to start. I mean, maybe you end up integrating Splunk or perhaps self-hosting Kibana, but you can get 90% of the benefits just by dumping the logs into CloudWatch or even S3 at a much lower price.
FWIW part of the reason you're seeing that is that, at least traditionally, APM companies rebranding as Observability companies stuffed trace data into metrics data stores, which becomes prohibitively expensive to query with custom tags/attributes/fields. Newer tools/companies have a different approach that makes cost far more predictable and generally lower.
Luckily, some of the larger incumbents are also moving away from this model, especially as OpenTelemetry is making tracing more widespread as a baseline of sorts for data. And you can definitely bet they're hearing about it from their customers right now, and they want to keep their customers.
Cost is still a concern but it's getting addressed as well. Right now every vendor has different approaches (e.g., the one I work for has a robust sampling proxy you can use), but that too is going the way of standardization. OTel is defining how to propagate sampling metadata in signals so that downstream tools can use the metadata about population representativeness to show accurate counts for things and so on.
I have made use of tracing, metrics, and logging all together and find each has its own place, as well as synergies from being able to work with all three together.
Cost is a real issue, and not just in terms of how much the vendor costs you. When tracing becomes a noticeable fraction of CPU or memory usage relative to the application, it's time to rethink doing 100% sampling. In practice, if you are sampling thousands of requests per second, you're very unlikely to actually look through each one of those thousands (thousands of req/s may not be a lot for some sites, but it is already exceeding human-scale without tooling). In order to keep accurate, useful statistics with sampling, you end up using metrics to store trace metrics prior to sampling.
> In fact, I tend to think more in terms of "events" when writing both logs and tracing code.
They are events[1]. For my text editor, KeenWrite, events can be logged either to the console when run from the command-line or displayed in a dialog when running in GUI mode. By changing "logger.log()" statements to "event.publish()" statements, a number of practical benefits are realized, including:
* Decoupled logging implementation from the system (swap one line of code to change loggers).
* Publish events on a message bus (e.g., D-Bus) to allow extending system functionality without modifying the existing code base.
* Standard logging format, which can be machine parsed, to help trace in-field production problems.
* Ability to assign unique identifiers to each event, allowing for publication of problem/solution documentation based on those IDs (possibly even seeding LLMs these days).
[1]: https://dave.autonoma.ca/blog/2022/01/08/logging-code-smell/
Observability costs feel high when everything’s working fine. When something snaps and everything is down and you need to know why in a hurry… those observability premiums you’ve been paying all along can pay off fast.
As other posters have mentioned, the incumbent companies rebranding to Observability are definitely expensive, because they charge the same way they do for logs and/or metrics: per entry and per unique dimension (metrics especially).
Honeycomb at least charges per event, which in this case means per span - however they don't charge per span attribute, and each span can be pretty large (100kb / 2000 attributes).
I run all my personal services in their free tier, which has plenty of capacity, and that's before I do any sampling.
How does one break into the industry though? I worked on a tangentially related project, and the problem was that sales were done by corporate salesmen rather than on technical merit. The companies buying the product didn't care because the people involved were making "deals". The company selling the product didn't care about making it better because it was selling, and having high AWS bills sounds like they were doing something (even though they were burning money).
Logs for the long term, traces for short-term debugging and analysis, is a fine compromise.
> I mean, maybe you end up integrating Splunk or perhaps self-hosting Kibana
I think this is the issue. Both Splunk and OpenSearch (even self-hosted OpenSearch) get really pricey as well, especially with large volumes of log data. CloudWatch can also get ludicrously expensive: it charges something like $0.50 per GB ingested (!) and another $0.03 per GB to store. I've seen situations at a previous employer where someone accidentally deployed a lambda function with debug logging and ran up a few thousand dollars in CloudWatch bills overnight.
You should look at Coralogix (disclaimer: I work there). We've built a platform that allows you to store your observability data in S3 and query it through our infrastructure. It can be dramatically more cost-effective than other providers in this space.
> Log Levels are meaningless. Is a log line debug, info, warning, error, fatal, or some other shade in between?
I partly agree and disagree. In terms of severity, there are only three levels:
– info: not a problem
– warning: potential problem
– error: actual problem (operational failure)
Other levels like “debug” are not about severity, but about level of detail.
In addition, something that is an error in a subcomponent may only be a warning or even just an info on the level of the superordinate component. Thus the severity has to be interpreted relative to the source component.
That becomes an issue if severity is only interpreted globally: either the level will be wrong at the global level, or subcomponents have to know the global context they are running in to choose the severity appropriate for that context, which creates undesirable dependencies on global context. The developer of a lower-level subcomponent would have to know the exact context in which that component is used in order to choose the appropriate log level. And what if the component is used in different contexts entailing different severities?
So one might conclude that the severity indication is useless after all, but IMO one should rather conclude that severity needs to be interpreted relative to the component. This also means that a lower-level error may have to be logged again in the higher-level context if it’s still an error there, so that it doesn’t get ignored if e.g. monitoring only looks at errors on the higher-level context.
Differences between “fatal” and “error” are really nesting differences between components/contexts. An error is always fatal on the level where it originates.
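To make that concrete, here is a minimal Go sketch (the cache/database fallback and all names are hypothetical): the same failed lookup is an error inside the cache component, but only a warning one level up, where a fallback exists.

```go
package main

import (
	"errors"
	"log/slog"
)

var errTimeout = errors.New("cache: timeout") // stand-in failure

// In the cache's own context, a failed lookup is an operational failure,
// so the cache logs it as an error.
func cacheGet(key string) (string, error) {
	slog.Error("cache lookup failed", "key", key, "err", errTimeout)
	return "", errTimeout
}

func dbGet(key string) string { return "value-from-db" } // hypothetical fallback

// One level up, the same event is only a warning: the handler can fall
// back to the database and still serve the request.
func handler(key string) string {
	v, err := cacheGet(key)
	if err != nil {
		slog.Warn("cache unavailable, falling back to database", "key", key, "err", err)
		return dbGet(key)
	}
	return v
}

func main() { _ = handler("user:42") }
```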
The OP is wrong: log levels are very valuable if you leverage them.
Here's a classic problem as an illustration:
The storage cost of your logs is really prohibitive. You would like to cut some of your logs from storage but cannot lower retention below some threshold (say, two weeks). For this example, assume that tracing is also enabled and every log line has a traceId.
A good answer is to run a compaction job that inspects each trace: if it contains an error, preserve it; remove X% of all other traces.
Log levels make the ergonomics for this excellent and it can save millions of dollars a year at sufficient scale.
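As a sketch of what that compaction decision might look like, assuming each trace's traceId and error status have already been computed (names and storage layout here are hypothetical):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keepTrace decides whether a whole trace survives compaction: traces
// containing an error are always preserved; everything else is kept at
// keepFraction (0.05 keeps roughly 5%).
func keepTrace(traceID string, hasError bool, keepFraction float64) bool {
	if hasError {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return float64(h.Sum32())/float64(^uint32(0)) < keepFraction
}

func main() {
	fmt.Println(keepTrace("4bf92f3577b34da6", true, 0.05))  // true: has an error
	fmt.Println(keepTrace("4bf92f3577b34da6", false, 0.05)) // kept ~5% of the time
}
```

Hashing the traceId, rather than rolling a die per log line, keeps every line of a kept trace together and makes the decision stable across compaction runs.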
> In addition, something that is an error in a subcomponent may only be a warning or even just an info on the level of the superordinate component.
Or, keep it simple.
- error means someone is alerted urgently to look at the problem
- warning means someone should be looking into it eventually, with a view to reclassifying as info/debug or resolving it.
IMO many people don't care much about their logs, until the shit hits the fan. Only then, in production, do they realise just how much harder their overly verbose (or inadequate) logging is making things.
The simple filter of "all errors send an alert" can go a long way to encouraging a bit of ownership and correctness on logging.
Info is things like "processing X". Debug is things like "variable is Y" or "made it to this point".
And then "error" as: "things are not okay, a developer is going to need to intervene". Errors then split roughly between "must be fixed sometime" and "must be fixed now/ASAP".
Tracing is poor at very long-lived traces and at stream processing, and most tracing implementations are too heavy to run in computationally bound tasks beyond a very coarse level. Logging is nice in that it has no context and no overhead, is generally very cheap to compose and emit, and, done in a structured way with a transaction id included, gives you most of what tracing does without all the other baggage.
That said for the spaces where tracing works well, it works unreasonably well.
I think OpenTelemetry has solved the stream processing issue with span links[1]: treating each unit of work as an individual trace, but being able to combine them and see a causal relationship. Slack published a blog about it pretty recently[2].
[1] https://opentelemetry.io/docs/concepts/signals/traces/#span-...
[2] https://slack.engineering/tracing-notifications/
When I worked at ScoutAPM, that list is basically the exact set of areas we had trouble supporting. We didn't do full-on tracing in the OpenTracing way, but the agent was pretty similar, with spans (mostly automatically inserted) and annotations on those spans carrying timing, parentage, and extra info (like the SQL query a span represented in ActiveRecord).
The really hard things, which we had reasonable answers for, but never quite perfect:
* Rails websockets (actioncable)
* very long running background jobs (we stopped collecting at some limit, to prevent unbounded memory)
* Profiling code: we used a modified version of Stackprof to do sampling instead of exact profiling. That worked surprisingly well at finding hotspots, with low overhead.
All sorts of other tricks came along too. I should go look at that codebase again to remind me. That'd be good for my resume.... :)
https://github.com/scoutapp/scout_apm_ruby
Hmmm, for long-lived processes and stream processing we use tracing just fine. What we do is set a cutoff of 60 seconds, where each chunk is its own trace. But our backend queries trace data directly, so we can still analyze the aggregate, long-term behavior and then dig into a particular 60-second chunk if it's problematic.
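A sketch of that chunking against the OpenTelemetry Go API; the 60-second cutoff comes from the comment above, while the channel-based loop, the names, and linking each chunk to its predecessor are my assumptions about one way to wire it:

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// processStream traces a long-lived consumer as a chain of ~60-second
// chunk traces. Each chunk is its own root trace, linked to the previous
// chunk so a backend can stitch the long-term behavior back together.
func processStream(ctx context.Context, msgs <-chan []byte) {
	tracer := otel.Tracer("stream")
	var prev trace.SpanContext
	for {
		opts := []trace.SpanStartOption{trace.WithNewRoot()}
		if prev.IsValid() {
			opts = append(opts, trace.WithLinks(trace.Link{SpanContext: prev}))
		}
		chunkCtx, span := tracer.Start(ctx, "chunk", opts...)
		cutoff := time.Now().Add(60 * time.Second)
		for time.Now().Before(cutoff) {
			msg, ok := <-msgs
			if !ok {
				span.End()
				return
			}
			handle(chunkCtx, msg) // per-message child spans hang off the chunk
		}
		prev = span.SpanContext()
		span.End() // close this chunk; the next iteration starts a fresh trace
	}
}

func handle(ctx context.Context, msg []byte) { /* per-message work */ }

func main() {
	ch := make(chan []byte, 1)
	ch <- []byte("hello")
	close(ch)
	processStream(context.Background(), ch)
}
```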
I like a log to read like a book if it's the result of a task taking a finite time, such as an installation, a compilation, or the loading of a browser page. Users are going to look into it for clues about what happened, and they (a) aren't always the people who wrote the tools and (b) don't have access to the source code or any special log analytics/querying tools.
That’s when you want a log and that’s what the big traditional log frameworks were designed to handle.
A web backend/service is basically the opposite. End users don’t have access to the log, those who analyze it can cross reference with system internals like source code or db state and the log is basically infinite. In that situation a structured log and querying obviously wins.
It’s honestly not even clear that these systems are that closely related.
It's a good distinction to make: logging for client-based systems is essentially UI design.
For a web app serving lots of concurrent users, the logs are essentially unreadable without tools, so you may as well optimise them for tool-based consumption.
> If you’re writing log statements, you’re doing it wrong.
I too use this bait statement.
Then I follow it up with (the short version):
1) Rewrite your log statements so that they're machine readable
2) Prove they're machine-readable by having the downstream services read them instead of the REST call you would have otherwise sent.
3) Switch out log4j for Kafka, which will handle the persistence & multiplexing for you.
Voila, you got yourself a reactive, event-driven system with accurate "logs".
If you're like me and you read the article thinking "I like the result but I hate polluting my business code with all that tracing code", well now you can create an independent reader of your kafka events which just focuses on turning events into traces.
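A sketch of steps 1 and 3, with an invented event schema and the segmentio/kafka-go client as one arbitrary producer choice (any Kafka producer works; broker address and topic are assumptions):

```go
package main

import (
	"context"
	"encoding/json"
	"time"

	kafka "github.com/segmentio/kafka-go"
)

// Event is a machine-readable "log line": downstream services consume it
// instead of being called over REST. The schema is hypothetical.
type Event struct {
	Name    string         `json:"name"`
	TraceID string         `json:"trace_id"`
	At      time.Time      `json:"at"`
	Attrs   map[string]any `json:"attrs,omitempty"`
}

func main() {
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"), // assumption: local broker
		Topic: "app.events",
	}
	defer w.Close()

	e := Event{
		Name:    "order.shipped",
		TraceID: "4bf92f3577b34da6",
		At:      time.Now().UTC(),
		Attrs:   map[string]any{"order_id": 1234, "carrier": "dhl"},
	}
	b, _ := json.Marshal(e)

	// Where a logger would have called log.Info(...), we publish instead.
	err := w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte(e.TraceID), // keeps a trace's events in one partition
		Value: b,
	})
	if err != nil {
		panic(err)
	}
}
```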
> 3) Switch out log4j for Kafka, which will handle the persistence & multiplexing for you.
I don't think this is a reasonable statement. There are already a few logging agents that support structured logging without dragging in heavyweight dependencies such as Kafka. Bringing up Kafka sounds like a case of a solution looking for a problem.
Who on Earth does that? Logs are almost always written to stderr, in part to prevent other problems the author is talking about (e.g. mixing with the output generated by the application).
I don't understand why this has to be either/or... If you store the trace output somewhere, you get a log (let's call it an "un-annotated" log, since a trace won't have the human-readable message part). Tracing is great when examining the application interactively, but if you use the same exact tool and save the results for later, you get logs, with all the same problems the author ascribes to logs.
Loads of people, and it drives me around the twist too (especially when there's inevitably custom parsing to separate the log messages from the output). But it happens, probably well correlated with people who use more GUI tools. Not that there's anything wrong with that; I just think the more you use a CLI, the more you're probably aware of this being an issue, and of other lesser best practices that might make life easier, like newline and tab separation.
Like, I develop CLI apps, so what else would go to stdout that you suppose will interfere?
As a historical critic of Rust-mania (and if I’m honest, kind of an asshole about it too many times, fail), I’ve recently bumped into stuff like tokio-tracing, eyre, tokio-console, and some others.
And while my historical gripes are largely still the status quo: stack traces in multi-threaded, evented/async code that actually show real line numbers? Span-based tracing that makes concurrent introspection possible by default?
I’m in. I apologize for everything bad I ever said and don’t care whatever other annoying thing.
That’s the whole show. Unless it deletes my hard drive I don’t really care about anything else by comparison.
Traces are just distributed "logs" (in the data structure sense; data ordered only by its appearance in something) where you also pass around the tiniest bit of correlation context between apps. Traces are structured, timestamped, and can be indexed into much more debug-friendly structures like a call tree. But you could just as easily ignore all the data and print them out in streaming sorted order without any correlation.
Sometimes I imagine a universe where:
- we collectively realized that logs, events, traces, metrics, and errors are actually all just logs
- we agreed on a single format that encapsulated all that information in a structured manner
- we built firehose/stream processing tooling to provide modern o11y creature comforts
I can't tell if that universe is better than this one, or worse.
Honestly it sounds like you're pitching opentelemetry/otlp but where you only trace and leave all the other bits for later inside your opentelemetry collector, which can turn traces into metrics or traces into logs.
That's more or less the model Honeycomb uses. Every signal type is just a structured event. Reality is a bit messier, though. In particular, metrics are the oddball in this world and required a lot of work to make economical.
This is a great article because everyone should understand the similarity between logging and tracing. One thing worth pondering though is the differences in cost. If I am not planning to centrally collect and index informational logs, free-form text logging is extremely cheap. Even a complex log line with formatted strings and numbers can be emitted in < 1µs on modern machines. If you are handling something like 100s or 1000s of requests per second per core, which is pretty respectable, putting a handful of informational log statements in the critical path won't hurt anyone.
Off-the-shelf tracing libraries on the other hand are pretty expensive. You have one additional mandatory read of the system clock, to establish the span duration, plus you are still paying for a clock read on every span event, if you use span events. Every span has a PRNG call, too. Distributed tracing is worthless if you don't send the spans somewhere, so you have to budget for encoding your span into json, msgpack, protobuf, or whatever. It's a completely different ball game in terms of efficiency.
I will agree that conceptually logging can be much cheaper than tracing ever can, but in practice any semi-serious attempt at structured logging ends up looking very, very close to tracing. In fact I'd go so far as to say that the two are effectively interchangeable at a point. What you do with that information, whether you index it or build a graph, is up to you -- and that is where the cost creeps in.
Adding timestamps and UUIDs and an encoding is par for the course in logging these days, I don't think that is the right angle to criticize efficiency.
Tracing can be very cheap if you "simply" (and I'm glossing over a lot here) search for all messages in a liberal window matching each "span start" message and index the result sets. Offering a way to view results as a tree is just a bonus.
Of course, in practice this ends up meaning something completely different, and far costlier. Why that is I cannot fathom.
Structured logging has actually existed for years, e.g. in Java: https://github.com/logfellow/logstash-logback-encoder
I was recently musing about the 2 different types of logs:
1. application logs, emitted multiple times per request, which serve as breadcrumbs
2. request logs, emitted once per request, which include latencies, counters, and metadata about the request and response
The application logs were useless to me except during development. However the request logs I could run aggregations on which made them far more useful for answering questions. What the author explains very well is that the problem with application logs is they aren't very human-readable which is where visualizing a request with tracing shines. If you don't have tracing, creating request logs will get you most of the way there, it's certainly better than application logs. https://speedrun.nobackspacecrew.com/blog/2023/09/08/logging...
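A minimal sketch of such a once-per-request log as Go middleware, using only the standard library; the field set is illustrative, not prescriptive:

```go
package main

import (
	"log/slog"
	"net/http"
	"time"
)

// requestLog wraps a handler and emits ONE structured line per request,
// carrying latency and metadata, instead of breadcrumbs along the way.
func requestLog(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rw := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rw, r)
		slog.Info("request",
			"method", r.Method,
			"path", r.URL.Path,
			"status", rw.status,
			"duration_ms", time.Since(start).Milliseconds(),
			"user_agent", r.UserAgent(),
		)
	})
}

// statusRecorder captures the status code written by the inner handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", requestLog(ok)) // error ignored in sketch
}
```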
Minor nitpick, but I wish this post started with defining what we mean by logging vs tracing, since some people use these interchangeably. The reader instead has to infer this from the criticisms of logging.
I've never encountered this confusion anywhere, so I wouldn't ever think to dispel it. Which isn't to say that I disagree with the more general point that defining your terms is a good thing.
In any case, the post itself (which is not long) illustrates and marks out many of the differences.
I agree. I'm working with code that uses 'verbose "message"' for level 1 verbosity logs and 'trace "message"' for level 2 verbosity. Makes sense in its world, but it's not the same meaning as how cloud-devops-observability culture uses those words.
There are logging libraries that include syntactically scoped timers, such as mulog (https://github.com/BrunoBonacci/mulog). While a great library, we preferred timbre (https://github.com/taoensso/timbre) and rolled our own logging timer macro that interoperates with it. More convenient to have such niceties in a Lisp of course. Since we also have OpenTelemetry available, it would also be easy to wrap traces around code form boundaries as well. Thanks OP for the idea!
One big failing of OpenTelemetry's traces in particular is that attaching structured data to them is difficult. Most structured logs can be JSON, which, for all of its faults, most things can be serialized to. OpenTelemetry's attributes on traces are much more limited; they don't even support a null/None value! I wish they just accepted JSON-like data; it'd make it much easier to always use traces.
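A sketch of the usual workaround, using the go.opentelemetry.io/otel/attribute package: flatten the JSON-ish data yourself and decide explicitly what to do with the nulls OTel can't represent (here: drop them). Names and the flattening scheme are my own choices.

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/otel/attribute"
)

// jsonAttrs flattens a decoded-JSON map into OTel span attributes.
// Nulls are skipped (OTel has no null value) and anything non-primitive
// is stringified; both are lossy compromises.
func jsonAttrs(prefix string, m map[string]any) []attribute.KeyValue {
	var out []attribute.KeyValue
	for k, v := range m {
		key := prefix + "." + k
		switch t := v.(type) {
		case nil:
			continue // unrepresentable: drop it
		case string:
			out = append(out, attribute.String(key, t))
		case bool:
			out = append(out, attribute.Bool(key, t))
		case float64: // encoding/json decodes all numbers to float64
			out = append(out, attribute.Float64(key, t))
		case map[string]any:
			out = append(out, jsonAttrs(key, t)...) // recurse into objects
		default:
			out = append(out, attribute.String(key, fmt.Sprint(t)))
		}
	}
	return out
}

func main() {
	attrs := jsonAttrs("payload", map[string]any{"user": "u1", "age": 42.0, "nick": nil})
	fmt.Println(len(attrs)) // 2: the nil value was dropped
}
```

Used as `span.SetAttributes(jsonAttrs("payload", data)...)` on an active span.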
Tracing is much more actionable, but barely usable without a platform, which makes local development dependent on a third party.
Also, it requires passing context around, or having a way to get the context back, in every function that needs it, which can be daunting.
On my side I have opted for mixed structured/text: a generic message that can be easily understood while glancing over the logs, with a data object attached for more details.
Someone got me excited about tracing and I started tweaking our stats API to optionally add tracing. Retrofitted it into a mature app, then immediately discovered that all of the data was being dropped because AWS only likes very tiny traces. Depth or fanout or both break it rather quickly.
And OpenTelemetry has a very questionable implementation. For a nested trace, events fire when a span closes, meaning children reference a parent ID before that parent is ever seen in the stream. That can't be good for processing. It would be better to have a leading-edge event as well (which also helps when an error throws and the parent never gets reported).
Kind of a bummer. Needs work.
Nit to the author: 'rapala' seems like a mistranslation. It is the brand name of a company that makes fishing lures, as far as I can tell. It is not the Finnish word for "to bait", and is therefore only used to refer to that particular brand.
I'm not sure what the purpose of the text in parenthesis is here, but 'houkutella' would be the most apt translation in this case.
What's most incredible to me is how close tracing feels in spirit to event sourcing.
Here's this log of every frame of compute going on, plus data or metadata about each frame... but AFAIK we have yet to start using the same stream of computation for business processes as we do for its excellent observability.
As a matter of fact, at a previous job we used traces as a data source for event sourcing. One use case: we tracked usage of certain features in API calls via traces, and a batch job, run at whatever frequency, aggregated which users were using which features. While it was far from real time because of the sheer amount of data, it was so simple to implement that we had dozens of use cases built like that.
Does this naive approach work for anyone who wants a log that can be read like a trace:
1. At the start of a request, generate a globally unique traceId
2. Pass this traceId through the whole call stack.
3. Whenever logging, log the traceId as a parameter
Now you have a log with many of the pluses of a trace. The only additional cost is storing the traceId on every message.
If you want to read a trace, search through your logs for "traceId: xyz123". If you use plain text storage you can grep. If you use some indexed storage, search for the key-value pair.
This way, you can retrieve something that looks like a trace from a log.
This does not solve all the issues named in the article. However, it is a decent tradeoff that I've used successfully in the past. Call it "poor man's tracing".
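A minimal Go sketch of that approach, using only the standard library; the context key and helper names are made up:

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"log/slog"
)

type ctxKey struct{}

// newTraceID generates a globally unique id at the start of a request.
func newTraceID() string {
	b := make([]byte, 16)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// logger returns a logger that stamps every message with the traceId
// carried in ctx, so the whole call stack logs it for free.
func logger(ctx context.Context) *slog.Logger {
	id, _ := ctx.Value(ctxKey{}).(string)
	return slog.Default().With("traceId", id)
}

func handleRequest(ctx context.Context) {
	ctx = context.WithValue(ctx, ctxKey{}, newTraceID())
	logger(ctx).Info("request started")
	chargeCard(ctx)
}

func chargeCard(ctx context.Context) {
	// Deep in the call stack: no extra parameters beyond ctx.
	logger(ctx).Info("charging card", "amount_cents", 1299)
}

func main() { handleRequest(context.Background()) }
```

Reading a "trace" back out is then `grep 'traceId=<id>' app.log` with the default text handler, or a key-value search in indexed storage.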
Yes, but if you're going to this effort, why not move to tracing instead?
A migration path I could see might be:
- replace current logging lib with otel logging (sending to same output)
- setup tracing
- replace logging with tracing over time (I prefer moving the most painful areas of code first)
How would a hobbyist programmer get started with tracing for a simple web app? Where do the traces end up and how do I query it? Can tracing be used in a development environment?
Context: the last thing I wrote used Deno and Deno Deploy.
OpenTelemetry has a service you can run that will collect the telemetry data, and you can export it to something like Prometheus, which can store it and let you query it. Example here: https://github.com/open-telemetry/opentelemetry-collector-co...
Typically in dev environments trace spans are just emitted to stdout, like logs. I sometimes turn that off too, though, because it gets noisy.
It drives me insane that the standardized tracing libraries have you only report closed spans. What if it crashes? What if it stalls? Why should I keep open spans in memory when I can just write an end span event?
I have a side project that I run in Kubernetes with a postgres database and a few Go/Nodejs apps. Recommend me a lightweight otel backend that isn't going to blow out my cloud costs.
While there are better tools for alerting, metrics, or aggregations, tracing helps a lot in debugging and troubleshooting.
I think the author's point is that tracing is a better implementation of both logs and metrics, and I think it's a valid point.
* metrics are pre-aggregated into timeseries data, which makes cardinality expensive. You could also aggregate a value from a trace statement.
* Logs are hand crafted and unique, and are usually improved by adding structured attributes. Structured attributes are better as traces because you can have execution context and well defined attributes that provide better detail.
Traces can be aggregated or sampled to provide all of the information available from logs, but in a more flexible way.
* Certain traces can be retained at 100%. This is equivalent to logs.
* Certain trace attributes can be converted to timeseries data. This is equivalent to metrics.
* Certain traces can be sampled and/or queried with streaming infrastructure. This is a way to observe data with high cardinality without hitting the high cost.
I'm fundamentally uncomfortable with sending all my data to a third party.
The cool thing about logs is that they're just a text file and don't need to be sent over the internet to someone else. But yes, I've encountered some problems just using text logs and I'd like to solve them.
Is there an OpenTelemetry solution that is capable of being self-hosted (and preferably open source) that anyone recommends?
I really enjoyed the content- it's a great article.
Note to author: all but the last code block have a very odd mixture of rather large font sizes (at least on mobile) which vary line to line and make them pretty difficult to read.
Also the link to "Observability Driven Development." was a blank slide deck AFAICT
It's all statically rendered HTML, and I don't see anything weird in the HTML either. Do you have a screenshot and some device info so I can look a bit more? Thanks.
This should not require code at the application level; it should be implemented at the tooling level.
Are you saying every single variable and function call should be logged every time the code runs? In a dream world, sure. While we are at it, let's make it possible to freeze all the state from the production system and let me add breakpoints to rewind in time. In the real world, someone has to decide what is noise and what is content.
Unless you are talking about profilers, which measure execution time and memory only; traces are a lot more than that.
Annotating the code with logs and traces is a UX activity, not for the end users, but for the ops-team. They don't have knowledge of the internals of the code. Logs should be written in the context of levers that ops have control over.
Take the example from the OP: the number of cache hits. It's something ops can control by configuring the cache size, and something they can observe and correlate with request time and network bandwidth. It would require an immensely sophisticated debugger to make all these correlations automatically.
Perhaps really performance-critical stuff could have a "notrace" annotation.
There are several projects that leverage eBPF for automatic instrumentation[1].
How accurate and useful these are vs. doing this manually will depend on the use case, but I reckon the automatic approach gets you most of the way there, and you can add the missing traces yourself, so if nothing else it saves a lot of work.
[1]: https://ebpf.io/applications/
Yes, OTel has autoinstrumentation libraries for some languages that can pick up a fair amount by default. Though it's unlikely that would ever be sufficient, it's a nice start.
For Java: https://opentelemetry.io/docs/instrumentation/java/automatic...
Sure, and a lot of tools will do this in one way or another. Either instrument code directly or provide annotations/macros to trace a specific method (something like tokio-tracing in the Rust ecosystem).
However, tracing literally every method call would probably be prohibitively expensive, so typically you have either:
1. Instrumentation which "understands" common frameworks/libraries and knows what to instrument (e.g. request handlers in web frameworks)
2. Full opt-in. They make it easy to add a trace for a method invocation with a simple annotation but nothing gets instrumented by default
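In Go, which has no annotations, option 2 usually reduces to a one-line wrapper you opt into per call site; a sketch with the OpenTelemetry API (the helper name is hypothetical):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// traced is the opt-in "annotation": only calls you explicitly wrap get
// a span; nothing else in the program is instrumented.
func traced(ctx context.Context, name string, fn func(context.Context) error) error {
	ctx, span := otel.Tracer("app").Start(ctx, name)
	defer span.End()
	err := fn(ctx)
	if err != nil {
		span.RecordError(err)
	}
	return err
}

func main() {
	_ = traced(context.Background(), "reindex-users", func(ctx context.Context) error {
		return nil // the work you explicitly opted in to tracing
	})
}
```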
Yes, and OTel has instrumentation libraries which do this.
However, no automatic instrumentation can do everything for you; it can't know all the interesting properties or things to add as attributes. But adding tracing automatically to SQL clients, web frameworks, etc. is very valuable.
The person writing this came to know something they didn't know earlier and decided to convert their light-bulb moment into a blog post. Not bad, but it fails to recognize that logs are the generalisation of the very thing they are talking about.