The article doesn't mention cardinality, and it's a key concept for understanding how these systems differ.
When I joined a company that used StatsD for metrics and ELK for logging, I was told the first rule of stats: you will not put an order ID, customer ID or the like in the stat name. It is OK to have a value with a small set of values (e.g. beta: true|false), but the order ID varies too much and will swamp the StatsD server.
Formally: order ID is high cardinality. A boolean is low cardinality. Cardinality is the "number of elements of the set" - for a boolean it's 2, for an "order state" enumeration it's a few, and for an order ID it's unbounded; realistically, it's "how many orders might we have in any 2-week window / other stat retention period".
A stats system like StatsD that keeps continuous averages, max, min, p90 etc. cannot do this if there are lots of high-cardinality factors. There is only so much combined cardinality that it can handle.
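A rough way to see why: a stats server has to keep a separate time series (with its own running avg/max/min/p90) for every unique combination of tag values, so the series count is the product of the per-tag cardinalities. A minimal sketch with made-up tag sets:

```python
# Each unique combination of tag values becomes its own time series,
# so the series count is the product of the per-tag cardinalities.
def series_count(tag_values):
    n = 1
    for values in tag_values.values():
        n *= len(values)
    return n

low = {"beta": ["true", "false"], "region": ["eu", "us"]}
high = {"beta": ["true", "false"], "order_id": [f"o{i}" for i in range(100_000)]}

print(series_count(low))   # 4 series: trivial
print(series_count(high))  # 200000 series: swamps the stats server
```

Adding one more high-cardinality tag multiplies, not adds, so two unbounded IDs together blow up far faster than either alone.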
No, Prometheus will not save you here (see https://prometheus.io/docs/practices/naming/):
Whereas a logging system such as an ELK stack stores each individual log entry in detail, and the more relevant unique IDs you attach to it, the more informative it is. Cardinality is not a problem. But pre-calculated stats across these fields are limited, and ad-hoc queries might be expensive to run.
You could, I suppose, start with a rich log entry and transform it into a metric by discarding some parts. But not the reverse.
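A sketch of that one-way transformation, with hypothetical log entries: drop the high-cardinality order_id, keep the low-cardinality beta flag, and aggregate. Nothing here can recover the discarded IDs, which is why the reverse direction is impossible.

```python
from collections import Counter, defaultdict

# Hypothetical structured log entries: rich, high-cardinality fields included.
logs = [
    {"event": "order_placed", "order_id": "o-1001", "beta": True,  "latency_ms": 120},
    {"event": "order_placed", "order_id": "o-1002", "beta": False, "latency_ms": 95},
    {"event": "order_placed", "order_id": "o-1003", "beta": True,  "latency_ms": 230},
]

# Deriving metrics: discard order_id, keep only the low-cardinality beta flag.
counts = Counter(entry["beta"] for entry in logs)
latency_sum = defaultdict(int)
for entry in logs:
    latency_sum[entry["beta"]] += entry["latency_ms"]

print(counts[True], counts[False])        # 2 1
print(latency_sum[True] / counts[True])   # 175.0 (mean latency for beta=True)
```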
> CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
Working in the ad-tech industry, we're tracking creative id, placement id, domain and more for each impression. We're lucky if each timeseries has 10 impressions at the hourly granularity.
Although I agree in principle, one other aspect that becomes critical in high-volume systems is efficiency.
Logging is generally a text stream and in production should only be warning+ levels (imo), so the impact on performance is negligible and really only matters when something is going wrong.
I want metrics all of the time, potentially 10 or 20 metrics per request/action, so a high-performance, low-overhead network method for sending those metrics with low latency is critical.
We need both systems and both should be treated as tier 1 systems within an organisation. I don't think pushing metrics into a log stream is a scalable architecture.
Being a text stream, you'd also have to reparse that text stream, and hope that its format never changes. Which is at odds with how I usually use logs (as places to dump information for humans).
The author is saying the same thing. Although he didn't mention efficiency in his article (which is an excellent remark, btw), he is saying you can usually go from logs to metrics but not the other way around, and that we often need both.
I'm not sure how much this discussion is happening now, but I do remember having these conversations more often a few years ago when tools like Prometheus were coming up.
This is my take and shouldn't be taken as gospel, but I've observed that metrics are perfect for isolating and identifying an issue, while logs attempt to explain the behaviour seen. Having a layered dashboard set up top to bottom, showing the throughput, error rates and latencies that express the health of each layer, from the exposed service or API all the way down to the subsystems and hardware supporting them, is a good place to start. I'd write more but don't want to make this post overly long. There are several useful articles online on the different methodologies and approaches to achieve this.
There isn't much conversation, though, about the knowledge an organisation has around incident analysis workflows: how past incident resolutions are captured and integrated with dashboards in monitoring systems, and how they are shared and reused as SREs and engineers work with log analytics. I've seen the same lessons having to be relearned when new engineers join a team. Checklists are a good example (there are more ways) for directing incident analysis in large complex systems, but how many times do you see this being supported by the monitoring system? The focus is always on displaying pretty charts in a grid, when more value could be gleaned from the same dashboard content presented as a rich living document, with charts and integrated checklists, in something resembling a guide to resolving an issue. There certainly is enough data to achieve this.
> I'd write more but don't want to make this post overly long. There are several useful articles online on the different methodologies and approaches to achieve this.
Do you have some of these articles in mind?
I've thought about this a lot too and agree to a good extent, however:
> 2. You can derive arbitrary metrics from log streams
Not quite. There isn't really a way (that I'd consider sensible) to get periodic, system-level measurements such as system load, memory usage, I/O stats from logs.
I'm also far less anxious about throwing away metrics beyond a certain age than I would be with logs, and to a degree that allows me to worry a bit less about the volume of data I collect through metrics.
Doubling up on metrics/logs can also be key in detecting that one of those systems isn't working properly for a particular set of hosts.
>There isn't really a way (that I'd consider sensible) to get periodic, system-level measurements such as system load, memory usage, I/O stats from logs.
Why not Metric Beats (https://www.elastic.co/products/beats/metricbeat)? The log structure is generally numbers, but it is still fundamentally different from pure metrics, IMO.
My biggest problem with using ES for metrics is that, with the exception of Timelion, there is no way to do cross-metric calculations (db queries/s over requests/s => db queries/request).
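For what it's worth, the calculation itself is just a pointwise division of two time-aligned series; the pain is that the storage layer won't do it for you. A toy sketch with made-up samples:

```python
# Two hypothetical per-second series keyed by the same timestamps.
db_queries_per_s = {0: 300.0, 1: 450.0, 2: 600.0}
requests_per_s   = {0: 100.0, 1: 150.0, 2: 150.0}

# Pointwise division over the shared timestamps gives db queries per request.
queries_per_request = {
    t: db_queries_per_s[t] / requests_per_s[t]
    for t in sorted(db_queries_per_s.keys() & requests_per_s.keys())
    if requests_per_s[t] > 0
}
print(queries_per_request)  # {0: 3.0, 1: 3.0, 2: 4.0}
```

In practice the two series also need to be resampled onto a common time grid first; this sketch assumes they already share timestamps.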
While logs and metrics have common parts, they are quite different beasts. It is quite expensive to produce and store logs for frequently changing metrics such as request rate or response latency on a highly loaded system with millions of rps.
I'd recommend:
- Storing logs in high-performance systems such as cLoki [1], which give enough flexibility to generate arbitrary metrics from high log volumes in real time.
- Storing metrics in a high-performance time series database optimized for high volumes of data points (trillions) and a high number of time series (aka high cardinality). An example of such a TSDB is VictoriaMetrics [2].
- Implementing dashboards with time-based correlation between metrics and logs, based on shared attributes (labels) such as datacenter, cluster, instance, job, subsystem, etc. [3]
[1] https://github.com/lmangani/cLoki
[2] https://github.com/VictoriaMetrics/VictoriaMetrics/
[3] https://grafana.com/blog/2019/05/06/how-loki-correlates-metr...
The dichotomy is real because nobody asks devs to output properly structured logs, or to work with ops teams to associate particular logs with particular states. Because of that, metrics are the de facto universal measure of the health of a service. Getting lots of 500s? May be time to flip a circuit breaker. Getting lots of logs of an unknown error? Time to... send an email to the devs, ask what this means, and wait around for an investigation (meanwhile, metrics have identified the likely culprit).
For many of the services I run, I haven't looked at logs in months, because the metrics tell the story. If service is degraded, I can usually correlate it to another downed service, a network failure, or a recent change. No logs needed.
Correlating metrics with logs is great, and proper distributed tracing is a revelation. But if you had to collect and measure just one thing, it's metrics.
There's a wide variety of time series data that is interesting to track: metrics, application telemetry, KPIs, auditing events, log messages, etc. This is rich structured data; the richer it is, the more useful it gets. Many companies use different solutions for each of those, and that creates a lot of complexity and devops overhead. I've seen a lot of projects where corners get cut and there is just the bare minimum of data. With microservices and serverless, the problem is even more acute, because logging into individual servers and grepping through logs like a caveman is just not practical anymore. Running blind is downright irresponsible and sadly all too common.
When something happens, it's actually interesting to cross-examine all of these different things. When you do, having metadata to drill down and break down is essential, and it's also essential to have software that can handle that without falling over just because you had a spike in usage. Having everything in one place, annotated with the right metadata, is key. Another thing that is key is retention policies to keep the data volume in check. Done properly, you will be generating many GB of data per day. Left unchecked, this will kill whatever infrastructure you have in no time. And while historical data is sometimes interesting, it's usually the very recent past that is most interesting. So you should expect to throw away the vast majority of data you collect after a short while.
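The retention idea is simple enough to sketch: keep only points younger than the window, on some assumed two-week cutoff (real systems do this per-tier, e.g. raw data for days, downsampled rollups for months).

```python
import time

RETENTION_SECONDS = 14 * 24 * 3600   # assumed two-week retention window

def prune(points, now=None):
    """Drop (timestamp, value) points older than the retention window."""
    now = time.time() if now is None else now
    cutoff = now - RETENTION_SECONDS
    return [(ts, v) for ts, v in points if ts >= cutoff]

now = 1_700_000_000
points = [
    (now - 20 * 24 * 3600, 1.0),  # 20 days old: pruned
    (now - 3 * 24 * 3600, 2.0),   # 3 days old: kept
    (now, 3.0),                   # fresh: kept
]
print(prune(points, now))  # keeps only the two recent points
```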
The dichotomy is real and a reflection of Dev vs Ops dichotomy. DevOps made Dev and Ops collaborate but didn't blend the roles & skills. Ops appreciate logs but require consistent metrics to identify and root-cause the problem. Dev appreciate metrics but require logs to debug and fix the problem. Opinions on what is more important are informed by role and experience; the author makes it clear that as a team, we need both.
> For many of the services I run, I haven't looked at logs in months, because the metrics tell the story. If service is degraded, I can usually correlate it to another downed service, a network failure, or a recent change. No logs needed.
Good point. It echoes Brendan Gregg, the author of the USE Method, who commented:
> "The USE Method is based on three metric types and a strategy for approaching a complex system. I find it solves about 80% of server issues with 5% of the effort." (http://www.brendangregg.com/usemethod.html)
Solving 80% of issues with 5% of the effort is commendable; the remaining 20% of issues go to developers, where the other 95% of the effort is spent debugging and fixing the problem, primarily by reasoning about the logs.
So:
- "Which of metrics or logs is more important?" is relative and moot.
- "Can metrics be extracted from logs?" Yes. "Is it practical?" It depends: likely NO for DIY. The fact that ELK doesn't do it particularly well doesn't mean that other products can't or don't.
Manipulating/aggregating metrics is a giant pain: you have to understand what exactly the implementors thought `max` or whatever means and try to match it up to what you need.
One thing the article doesn't address is that logs have to be accurate, while metrics only have to be consistent, not vice versa. If your performance metrics are always 10% out, they're still perfectly usable (as long as they're always 10% high or low) to alert on, or to compare to metrics from a known-good historical period.
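A quick arithmetic check of why a consistent bias washes out when you compare against a baseline built from the same biased metric (made-up numbers):

```python
true_latency = [100, 102, 98, 250]           # last sample is a real regression
biased = [x * 1.10 for x in true_latency]    # metrics consistently 10% high

# Compare the latest sample against a baseline of the earlier samples.
true_ratio = true_latency[-1] / (sum(true_latency[:3]) / 3)
biased_ratio = biased[-1] / (sum(biased[:3]) / 3)

# The 10% bias cancels out: an alert on "latency > 2x baseline" fires either way.
print(round(true_ratio, 6), round(biased_ratio, 6))  # 2.5 2.5
```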
> In practice, teams continue to ignore this problem and instead rely on aggregate time-series that are essentially impossible to interpret correctly, such as “maximum of 99th percentile latency by server.”
I get that we can't know the P99 across all servers. But let's say we're talking about latency; the max of the per-server P99s will be an upper bound on the P99 across all servers. Why isn't this true or useful?
How about this: you have a load balancer that sends much less traffic to servers that are running slowly (but still a small amount of traffic, so that it recognises when the machine recovers). One backend server has a hardware problem; requests that usually take milliseconds now take seconds; the 99th percentile on that machine is much higher than anywhere else. But it's actually serving a tiny tiny fraction of traffic. Is the value useful in indicating the performance of the service as a whole?
An example: your server is handling requests to two endpoints. Endpoint A responds in ~100ms, B in ~200ms. You start caching some responses to A higher up. Now your P99 across all servers is higher even though you improved the performance. (Potentially setting off alarms on previous thresholds)
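That caching scenario can be simulated with a few lines and made-up latency numbers; the fast samples disappearing from the server's population pushes the measured P99 up even though users got strictly faster responses.

```python
import math

def p99(samples):
    """Nearest-rank 99th percentile."""
    s = sorted(samples)
    return s[math.ceil(0.99 * len(s)) - 1]

# Hypothetical workload: endpoint A is fast and frequent, B is slower with a spread.
endpoint_a = [100] * 900             # 900 requests at ~100ms
endpoint_b = list(range(200, 300))   # 100 requests at 200-299ms

print(p99(endpoint_a + endpoint_b))  # 289 before caching

# Cache 99% of A's responses upstream: the fast samples vanish from the server.
cached_a = [100] * 9
print(p99(cached_a + endpoint_b))    # 298 after caching: P99 got *worse*
```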
While it doesn't solve the alerts, for viewing data like that, heatmaps are amazing. But you need the raw numbers (or predefined bands for aggregation, though that limits flexibility).
I will go a step further by stating that metrics, logs and traces are very similar and should be treated as such, in a single platform. Leveraging these 3 sources of data in a micro-services world is more than needed for troubleshooting, documentation and monitoring.
Right now I'm using Prometheus (metrics) + Jaeger (traces) + Fluentd & ClickHouse (logs) + Grafana to render all of that. It's not that easy to correlate data but I'm getting there (with tricky queries in Grafana panels and custom Grafana sources). A PoC about displaying traces in a nice way: https://github.com/alexvaut/OpenTracingDiagram.
This is not about the logarithmic vs linear scales, as I first thought, but about analytics and aggregation of events.
This is probably the gist of it:
> many interesting kinds of metric are very hard to aggregate and re-aggregate correctly [...] The best you can hope for is that your metric collection supports the recording of histograms [...]
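The histogram suggestion can be made concrete: if every server buckets its latencies into the same (assumed) bucket bounds, the bucket counts sum correctly across servers, and an approximate quantile can be read off the merged histogram — no percentile ever needs to be re-aggregated. A minimal sketch:

```python
import bisect

BUCKETS = [50, 100, 200, 400, 800]  # shared upper bounds in ms (assumed config)

def to_histogram(samples):
    counts = [0] * (len(BUCKETS) + 1)  # last slot is the overflow bucket
    for x in samples:
        counts[bisect.bisect_left(BUCKETS, x)] += 1
    return counts

def merge(h1, h2):
    # Bucket counts from different servers simply add, unlike percentiles.
    return [a + b for a, b in zip(h1, h2)]

def approx_quantile(hist, q):
    """Upper bound of the bucket containing the q-quantile."""
    target = q * sum(hist)
    running = 0
    for bound, count in zip(BUCKETS + [float("inf")], hist):
        running += count
        if running >= target:
            return bound
    return float("inf")

server1 = to_histogram([40, 90, 90, 150])
server2 = to_histogram([90, 300, 700, 700])
combined = merge(server1, server2)
print(approx_quantile(combined, 0.5))  # 100: a mergeable (bucket-resolution) median
```

The trade-off is resolution: the answer is only as precise as the bucket bounds, which all emitters must agree on in advance.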
On the funny side, I had to whitelist ajax.googleapis in uMatrix to get rid of the white page. Fitting for the website's name.
Metrics tell you if the system is behaving well overall and can usually identify periods or trouble or trends.
Logs give you more detail when you want it, often because of a report from a user or an indication from your metrics. Metrics usually won't tell you why, unless you've specifically put in a metric that covers the exact cause (sometimes to protect against regressions).
Meh. We are implementing tracing between several core services in our pipeline of work. It is valuable, but it will in no way replace our logging. We use logging of event streams and can attach all kinds of information to a log that can be hard to get onto a given span in a trace.