Grafana Tempo, a scalable distributed tracing system

[+] RedShift1|5 years ago|reply

I really wish Grafana would spend more time on fixing the 2.7k bug reports that are open for their core product...

[+] ATsch|5 years ago|reply

I disagree with the idea that they can't work on both, but I have to agree that this is my #1 wish for Grafana. The frontend is filled with an incredible amount of ux papercuts, performance issues and silent breakage that make me resent having to edit dashboards at all. It's a perpetual state of being scared my next click will trigger "a script is not responding" or randomly break something I won't know how to fix.

It feels like Grafana Labs is stepping into the same "if it doesn't help sales tick checkboxes it's not getting development time" trap so many B2B companies hit as they grow and it's a shame. I wish there was something else I could switch to.

[+] Sodman|5 years ago|reply

There are certainly some real bugs in there, but saying there are 2.7k "bug reports" is a little disingenuous. These are "GitHub Issues", many of which are feature requests, and many more are new users trying out this free open source software for the first time. Many of these "bugs" are from users asking why their setup isn't working, and then failing to produce any kind of technical debug info when asked by the maintainers.

[+] netingle|5 years ago|reply

PRs welcome!

[+] cbsmith|5 years ago|reply

Oh dear... this seems like a move in the exact opposite direction of what one would hope for. It's basically shoving in structured data into what is at best semi-structured log data, storing all of it, and then using that for things that a tracing system would be better suited to.

I can't help but feel they've learned the wrong lessons from their challenges with tracing.

[+] annanay|5 years ago|reply

The semi structured nature of logs works to the advantage of Tempo, because as developers we have the flexibility to log _anything_, high cardinality values like cust-id, request latency, gobble-de-gook .. the equivalent of span tags. Instead of indexing these as tags, we get advance search features through a powerful query language landing in Loki (LogQLv2).

[+] tikkabhuna|5 years ago|reply

Slightly off-topic, but how are users of Grafana and other monitoring tools justifying the investment? And I'm not necessarily talking about monetary amount, its also people cost.

What features are you looking for and how do you rank them?

[+] lawrjone|5 years ago|reply

We run Prometheus with 60d retention (2TB), and an Elasticsearch cluster for logs (also 60d, about 30TB cluster) with Kibana on top.

Prometheus is by far the cheapest, both for infrastructure spend and for human cost. The only time we spend really working on Prometheus is configuring our service discovery, which is the shipping component you’d have with any monitoring tool. I estimate we spend about 1 developer day a month un Prometheus upkeep.

Logging is much more fiddly, and required a large up front investment. Now it’s running though, we get a lot of value from our setup. It costs much more though, think about $10k/month in infrastructure spend and about 3 developer days a month to maintain.

Grafana is effortless to run, really dead simple. You can consider this to be ‘free’, both for infra and maintenance.

Hope that gives you a sense of this? I think we come out equivalent to managed services for cost when you account for infra and human time, but have far more flexibility in how we use the tools and develop the skills to properly leverage each product along the way.

[+] syoc|5 years ago|reply

I use a telegraf, influxdb, grafana stack. I find the cost in maintainance and initial setup quite low. Telegraf is super easy, just uncomment the things you want it to collect from the config file. Influxdb is just adding the correct users and a database, never touched it since. Grafana can be a time sink if you want to bikeshed your dashboards, but there are a lot of pre-made ones that handle common usecases.

It's really nice to have some metrics when for instance a service goes down. It's super easy to spot a OOM situation or other vertical scaling issues.

[+] unknown|5 years ago|reply

[deleted]

[+] gjulianm|5 years ago|reply

I run Prometheus + Grafana on a handful of machines we manage, with different software and setups. The cost of setting up a Prometheus instance and monitoring something is not too high, but things around it depend on your actual setup

- Service discovery can get tricky. In our case we manage the physical machines by ourselves so at the beginning we just had to write down static configs, although by now we have automatic discovery (took a few days of development).

- Depending on what you want to monitor, you might need to write your own exporters to Prometheus. However, mtail [1] has been really useful to create metrics from logs without too much work. In any case, you'll have to put time in deploying and configuring those exporters.

- Dashboards and alerts. There are dashboards for a lot of exporters, and there are collections of alerts too [2], but you will need to put time and effort in modifying/creating dashboards and writing down alerts. However, it's a productive effort because it helps in having a better understanding of which metrics are important and how do they relate to the workings of the software you use. Also, PromQL is a pretty nice query language for the purposes of Prometheus.

- Notification integrations. In my case we had to put some time to properly configure a Microsoft Teams integration and a deadman switch channel, but in most cases it will be pretty straightforward.

All in all, you'll need to invest some time in the integrations, but those are things that you need to do in any case. Prometheus itself is pretty easy to set up and maintain, and doesn't stand in your way. No tweaks, no undocumented settings, no bugs. I'm pretty happy in that regard, once you get it running you don't have to worry about it. Storage usage is pretty low even with a high amount of exporters and metrics per node, maybe around 10GB for 60 days of data of a single node? (I'm not sure because Prometheus does some compression and it's not exactly linear with the time or the number of nodes).

And that relatively low investment pays off quickly. The machines we manage use various tools and programs to deal with quite a lot of data at high bandwidths, so performance problems and bugs can be difficult to debug. The Prometheus + Grafana setup has made several times easier the debugging of issues and performance problems, the alerting system helps us prevent outages and we have even discovered issues that were unknown to us. For me, the moment you manage machines with even just a little bit of complexity in the software or setup, it's already worth it to look into monitoring.

1: https://github.com/google/mtail 2: https://awesome-prometheus-alerts.grep.to/

[+] sna1l|5 years ago|reply

Ingestion/storage of the volume is one thing, but what about the additional overhead for every request?

[+] strofcon|5 years ago|reply

Good question! Depends on the instrumentation libraries for sure, but in our Go services we've been using Jaeger protocol for our traces, and "sampling" at 100%, with negligible impact to request / response times.

We spit out some 30k+ spans per second, FWIW. :-)

Edit: Disclaimer, we're not using Tempo.

[+] Axsuul|5 years ago|reply

Is this a good fit for storing every network request we make (so that we can trace all requests made to an API on behalf of a specific customer for debugging) or is Loki better for that?

[+] number101010|5 years ago|reply

it depends on your needs. if a single log line is sufficient to log all the information you need to do your debugging then Loki is a great choice.

if you need to see the full request as it passed through your system then distributed tracing/Tempo is a great fit!

[+] sciurus|5 years ago|reply

Honeycomb also stores traces in S3 and supports searching them via AWS lambda. I wonder if a model like that is useful for tempo, which doesn't seem to support search at all now.

https://www.honeycomb.io/blog/secondary-storage-to-just-stor...

[+] SEJeff|5 years ago|reply

I'd love to see how this compares to Jaeger.

[+] number101010|5 years ago|reply

The headline would be:

Jaeger supports native search, but requires Elastic or Cassandra.

Tempo relies on discovery from logs/exemplars, but puts everything in object storage (s3/gcs).

Tempo is cheaper and easier to operate but lacks native search.

[+] strofcon|5 years ago|reply

I'm hopeful that with it effectively being a KV store pointed at a "simple" object store (S3, GCS, etc), it will be dramatically simpler to manage than Jaeger, and much more performant.

Backing Jaeger with Elasticsearch or Cassandra was a nightmare. :-|

56 comments