top | item 17586185

Stripe’s Veneur: A distributed, fault-tolerant pipeline for observability data

109 points | federicoponzi | 7 years ago | github.com | reply

43 comments

[+] chimeracoder|7 years ago|reply
This was a pleasant surprise to see on Hacker News this morning! I work on the Observability team at Stripe and have been the PM for Veneur (and the rest of our metrics & tracing pipeline work) pretty much since we released it ~2 years ago.

If you're interested in learning more about how Veneur works and why we built it, I gave a talk at Monitorama last year that explains the philosophy behind Veneur[0]. In short, a massive company like Google is able to build its own integrated observability stack in-house, but almost any other smaller company is going to be relying on an array of open-source tools or third-party vendors for different parts of their observability tooling[1]. When using different tools, there are always going to be gaps between them, which leads to incomplete instrumentation and awkward (inter-)operability. By taking control of the pipeline that processes the data, we're able to provide fully integrated views into different aspects of our observability data.

The Monitorama talk is a year old at this point, so it doesn't cover some of the newer things Veneur has helped us to accomplish, but the core philosophy hasn't changed. I've given updated versions of the talk more recently at CraftConf (in May) and DevOpsDaysMSP (last week), but neither of those videos is online yet.

[0] https://vimeo.com/221049715

[1] e.g. ELK/Papertrail/Splunk for logs, Graphite/Datadog/SignalFx for metrics, and maybe a third tool for tracing if you're lucky.

[+] tchaffee|7 years ago|reply
Am I the only one who is always slightly disappointed that neither the README file on Github nor the landing page at the website tells me why I would want to use the software in question? What problem it solves? Why might "a distributed, fault-tolerant observability pipeline" be interesting to programmers or anyone else? It seems like you've already got to be familiar with the problem space to understand what this is and what need it fulfills.

I'm not picking on this package. I see it all the time.

Can someone here explain to me what the use case is for this software?

[+] jmillikin|7 years ago|reply
The use case: you have more than a hundred machines emitting lots of monitoring data, much of which is uninteresting except in aggregate form. Instead of paying to store millions of data points and then computing the aggregates later, Veneur can calculate things like percentiles itself and only forwards those.
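The space saving described above can be sketched in Go (Veneur's implementation language). This is an illustrative toy, not Veneur's actual code: a real system uses mergeable approximate digests (e.g. t-digest) so percentiles stay accurate when combined across many hosts, but the idea of forwarding a few aggregates instead of every raw point is the same:

```go
package main

import (
	"fmt"
	"sort"
)

// aggregate reduces a window of raw samples (assumed non-empty) to a few
// summary values, so only a handful of numbers need to be forwarded
// instead of every data point.
func aggregate(samples []float64) map[string]float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	pct := func(p float64) float64 {
		return sorted[int(p*float64(len(sorted)-1))]
	}
	return map[string]float64{
		"p50": pct(0.50),
		"p99": pct(0.99),
		"max": sorted[len(sorted)-1],
	}
}

func main() {
	samples := make([]float64, 0, 1000)
	for i := 1; i <= 1000; i++ {
		samples = append(samples, float64(i)) // e.g. request latencies in ms
	}
	fmt.Println(aggregate(samples)) // forwards 3 numbers instead of 1000
}
```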

It also has a separate, related purpose as a statsd protocol transport. You run it on the statsd port and it receives standard (or DataDog-extended) UDP traffic, then it forwards metrics via TCP to another backend. This has reliability benefits when operating over a network that might drop UDP packets, such as the public internet.

[+] abathur|7 years ago|reply
I regularly have the same experience.

When I run into one, I optimistically assume I'd know if I needed it (but, unfortunately, it means I'm very unlikely to remember a specific package if I do eventually run into the problem).

If they don't already exist, maybe there's room in the universe for some README best-practices (like those floating around for writing commit messages, user stories, issue reports, etc.) that might nudge more maintainers to include at least one lucid example of a problem it could solve.

[+] SuperKlaus|7 years ago|reply
Thanks for asking this. I came back here after looking at the README in the repo and was still clueless as to why I'd want to use Veneur.
[+] mrkurt|7 years ago|reply
It's not always possible to describe a problem in a way you will understand if you aren't already familiar with it. And that's ok! Reading something like "distributed, fault-tolerant pipeline for observability data" and ¯\_(ツ)_/¯ could be a good response both for you and the people who built the thing. It's definitely ok that you have to dig a little more to wrap your head around it.

In short, it's reasonable to expect that people who see a project already understand the problem space and write for the ones who can say "yep I need this".

This particular project is probably only useful to people who know what observability data means.

[+] roskilli|7 years ago|reply
It’s definitely interesting to see the different systems being built for monitoring across the different tech co’s.

M3 aggregator, Uber’s metrics aggregation tier, is similar, except it has inbuilt replication and leader election on top of etcd to avoid any SPOF during deployments, failed instances, etc. Also, it uses Cormode-Muthukrishnan for estimating percentiles by default, though it supports t-digest too. These days, though, submitting histogram bucket aggregates all the way from the client to the aggregator and then to storage is more popular, since you can estimate percentiles across more dimensions and time windows at query time quite cheaply. You need to choose your buckets carefully though.
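The bucketed-histogram scheme mentioned above can be sketched in Go: clients ship only per-bucket counts, and the query side interpolates a percentile from them. The bucket bounds here are made up for illustration, and (as the comment notes) real accuracy depends on choosing them well:

```go
package main

import "fmt"

// bucket holds a pre-aggregated count of observations at or below an
// upper bound (e.g. latency in ms). Bounds here are illustrative only.
type bucket struct {
	upper float64
	count uint64
}

// percentile estimates the p-th percentile (0 < p <= 1) from cumulative
// bucket counts, linearly interpolating within the bucket that contains
// the target rank.
func percentile(buckets []bucket, p float64) float64 {
	var total uint64
	for _, b := range buckets {
		total += b.count
	}
	rank := p * float64(total)
	var seen uint64
	lower := 0.0
	for _, b := range buckets {
		if float64(seen+b.count) >= rank {
			frac := (rank - float64(seen)) / float64(b.count)
			return lower + frac*(b.upper-lower)
		}
		seen += b.count
		lower = b.upper
	}
	return lower
}

func main() {
	// 1000 observations spread over four latency buckets
	buckets := []bucket{{10, 800}, {50, 150}, {100, 40}, {500, 10}}
	fmt.Printf("p50 estimate: %.2f\n", percentile(buckets, 0.50))
	fmt.Printf("p99 estimate: %.2f\n", percentile(buckets, 0.99))
}
```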

It too is open source, but needs some help to make it plug into other stacks more easily: https://github.com/m3db/m3aggregator

[+] dswalter|7 years ago|reply
It always makes me happy to see approximate algorithms/data structures like hyperloglog being used.
[+] ambicapter|7 years ago|reply
"probabilistic" is probably the word you're looking for, and yes I agree, I'm fascinated by the idea of trading off a little bit of accuracy for massive performance gains.
[+] ebikelaw|7 years ago|reply
When I'm evaluating a system like this, what I want to read about is how it's hardened against client stupidity. For example, someone deploys an application in my datacenter and it emits metrics that have gibberish in their names (consider a common Java bug where a class lacks a toString, so the metric gets barfed out as foo.bar.0xCAFEBABE.baz). How does the system cope with this enormous, hyper-dimensional input?
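One generic mitigation for the problem described above (not a description of how Veneur handles it) is to validate names against a strict pattern and collapse anything address-like before it can explode cardinality, so every misbehaving instance maps to one metric instead of millions:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// a strict shape for acceptable metric names
	validName = regexp.MustCompile(`^[a-zA-Z][a-zA-Z0-9_.]*$`)
	// memory-address-like or long-random-hex segments
	hexJunk = regexp.MustCompile(`0x[0-9a-fA-F]+|[0-9a-f]{8,}`)
)

// sanitize replaces address-like segments with a fixed token and reports
// whether the cleaned name matches the allowed pattern.
func sanitize(name string) (string, bool) {
	cleaned := hexJunk.ReplaceAllString(name, "INVALID")
	return cleaned, validName.MatchString(cleaned)
}

func main() {
	name, ok := sanitize("foo.bar.0xCAFEBABE.baz")
	fmt.Println(name, ok) // foo.bar.INVALID.baz true
}
```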
[+] noncoml|7 years ago|reply
Why is Go so popular in the industry at the moment? What's the decision process for choosing Go?
[+] gphat|7 years ago|reply
Hello! I'm the original author of Veneur @ Stripe and continue to work on it with a host of marvelous teammates.

I can't speak for the industry, but here's why I chose it for this project:

* I hadn't yet written anything in Go and wanted to try it on a side project / experiment

* I knew that my eventual deploy target (if the project turned out useful) would be lots of machines and I wanted to minimize the deployment requirements. Static binaries are good for that.

* I wanted to distribute the work across many cores and felt that Go's channels would be a useful mechanism.

* I benchmarked my initial PoC against some other implementations (Python and JVM/Scala) and found no major reason to not use Go
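The channels-across-cores point in the list above boils down to a familiar Go pattern: fan work out over a channel to one worker goroutine per core, then collect the results. The squaring workload here is a stand-in I chose for illustration, not anything Veneur-specific:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelSumSquares fans the integers 1..n out to one worker per core
// via a channel and sums the squared results. The per-item work is a
// placeholder for real per-metric processing.
func parallelSumSquares(n int) int {
	jobs := make(chan int, n)
	results := make(chan int, n)
	var wg sync.WaitGroup

	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- j * j
			}
		}()
	}

	for i := 1; i <= n; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	close(results)

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(parallelSumSquares(100)) // sum of squares 1..100
}
```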

The contributions of Stephen Jung (https://github.com/tummychow) and Aditya Mukerjee (https://github.com/chimeracoder) elevated it from a glimmer in my eye to a system you can trust across your infrastructure.

So, in summary, it was a confluence of interest and convenience with a strong hint of "if I use this, it needs to be easy to deploy" and here we are 2 years later. :)

[+] ebikelaw|7 years ago|reply
I doubt anyone can answer this for you, but why shouldn't it be? It is a very sensible language and toolchain. When writing source, it is easy to write tests and testable code and to run the tests as part of your build. At build time, it's fast and it produces easy-to-deploy statically-linked applications. At runtime, it's pretty fast and compact ... compared to python, Go is most of the way to C++-level performance.
[+] pinko|7 years ago|reply
Know of anyone using this in production outside Stripe?
[+] chimeracoder|7 years ago|reply
> Know of anyone using this in production outside Stripe?

In addition to Intercom (mentioned in a sibling comment), some other companies off the top of my head who are also using Veneur: Sentry, Bluecore, Axiom, Quantcast, and at least two more that I'm not sure if I have permission to name publicly.

There are more too - since Veneur is an OSS project, we only find out when people submit PRs to the project or happen to contact us for another reason. It makes for a pleasant surprise when we find out!

[+] otterley|7 years ago|reply
We use it at Segment as a replacement for Datadog's Python-based (and dog slow) dogstatsd server. Veneur is not only faster on a per-CPU basis, but scales to multiple CPUs as well.

I consider it absolutely essential for collecting metrics on large instances that run hundreds of busy tasks emitting hundreds or thousands of metrics per second.

[+] galaktor|7 years ago|reply
Intercom use it heavily in production.

Source: am engineer at Intercom.

[+] madspindel|7 years ago|reply
It's from 2016

[+] chimeracoder|7 years ago|reply
> It's from 2016

2016 was the first public release, but the project has grown a lot in that time. You can take a look at the changelog to see what's new, ever since we switched to a six-week release cycle last year: https://github.com/stripe/veneur/blob/master/CHANGELOG.md

Source: I work on the Observability team at Stripe and I am the PM for Veneur.

[+] mhluongo|7 years ago|reply
Last commit on master was 5 days ago, and the project otherwise appears to be maintained. A lot can change in a repo in 2 years.
[+] amelius|7 years ago|reply
What do they mean by "observability data"?

Is this a fancy way of saying "privacy-sensitive user data"?

[+] Jarred|7 years ago|reply
More like diagnostic stats about servers - memory usage, CPU usage and many more things like that.