I would like to know more about the performance of Skyline in practice:
- what are the accuracy and recall like?
- what is CPU consumption like?
Regarding the latter, I had a quick look at the implemented algorithms and they seemed very inefficient: basically, they recompute over the entire series at every change. I think with a bit of work most of the algorithms could be reimplemented incrementally. I also wouldn't use Python for something that is going to be CPU bound. (I await the "We rewrote it in Go and it's 10x faster!" blog post ;-)
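For example, the common stddev-style checks could keep running aggregates instead of rescanning the whole series on every new data point. A minimal sketch in Python using Welford's online algorithm (class and method names here are illustrative, not Skyline's actual API):

```python
# Incremental mean/stddev: O(1) work per new data point instead of
# recomputing over the entire series at every change.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        # Welford's update: fold one new value into the aggregates
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0

    def is_anomaly(self, x, sigmas=3.0):
        # flag points more than `sigmas` standard deviations from the mean
        return self.n > 1 and abs(x - self.mean) > sigmas * self.stddev()
```

The same trick extends to moving averages and least-squares fits by maintaining the relevant running sums.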
Author here. Accuracy is okay - we err on the side of noise, but it does routinely pick up anomalies. It doesn't currently account for seasonal trends, though.
We aim for 100% CPU consumption. Analysis is a very CPU-intensive process, and two parts in particular are expensive: decoding the Redis string from MessagePack to Python, and running the algorithms.
As for the algorithm inefficiencies, pull requests encouraged :)
Rewriting it in Go is a plan for a rainy weekend :) The problem with Go is that it doesn't have as great statistics support as Python does.
DTW is quadratic. While in grad school I worked for a bit with a team interested in doing massive speech recognition using DTW, so they did some work speeding up the algorithm using a technique called Locality-Sensitive Hashing [1], [2], [3]. It might be worth a look to speed up your algorithms.
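For context, the standard dynamic-programming formulation of DTW makes the quadratic cost explicit: comparing two series of lengths n and m fills an n-by-m table, so every pairwise comparison costs O(nm).

```python
# Classic DTW distance via dynamic programming. cost[i][j] holds the
# best alignment cost of a[:i] against b[:j]; filling the table is
# O(n * m), which is why DTW is quadratic for same-length series.

def dtw_distance(a, b):
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats
                                 cost[i][j - 1],      # b[j-1] repeats
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]
```

The LSH work in the cited papers avoids paying this cost for every candidate pair by cheaply pruning most comparisons first.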
Oculus recommended setup (found at https://github.com/etsy/oculus):
* ElasticSearch
  * At least 8GB RAM
  * Quad Core Xeon 5620 CPU or comparable
  * 1GB disk space
  * two ElasticSearch servers in separate clusters
* a cluster of Worker boxes running Resque
  * worker master runs redis
  * additional resque worker boxes (and potentially slaves)
  * At least 12GB RAM
  * Quad Core Xeon 5620 CPU or comparable
  * 1GB disk space
It'd be nice if there were a more established baseline set of server specs to get up and running. While many of us aspire to be at Etsy-level monitoring, we're just not there.
I'll definitely have a look at doing that - the initial specs were designed around the metric volumes we use the tools for, but I realise that might not be practical for smaller workloads :)
I'm not really convinced that this is very useful. I've been in the application monitoring space for a few years now and I'm not sure that watching graphs is something Ops people should be doing.
There should be rules which notify them if something is anomalous, by email, SMS, or logging a problem on an incident management tool. e.g. "Java request foo.bar() on Managed Server 1 is throwing exceptions for 50% of invocations (20 requests, 10 exceptions) in the last 10 minutes. This affects the following services: Customer Login page on foo.bar." possibly even attaching some of the exception messages to the email, if sampled through instrumentation or correlating it back to the log files, automatically.
This type of monitoring is actually useful because Ops understand what is broken and what it affects, and it gives them enough detail to either fix the problem or pass it to someone else; they're not wasting their time looking at graphs waiting for a problem to appear.
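A toy sketch of such a rule in Python (the function name, threshold, and message format are made up for illustration):

```python
# Alert when an operation's error rate over a recent window crosses a
# threshold, instead of waiting for a human to spot it on a graph.

def check_error_rate(requests, exceptions, threshold=0.5):
    """Return an alert message if exceptions/requests >= threshold, else None."""
    if requests == 0:
        return None
    rate = exceptions / requests
    if rate >= threshold:
        return ("foo.bar() is throwing exceptions for {:.0%} of invocations "
                "({} requests, {} exceptions)".format(rate, requests, exceptions))
    return None
```

A real system would route the returned message to email, SMS, or an incident-management tool and attach sampled exception details.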
The thing is that at a given scale (and it comes early, actually), pushes do not scale.
I still use pushes for clear-cut things that require paging, but having graphs of a lot of things and just noticing changes or anomalies in the overall patterns will help spot a lot of issues, including things you haven't yet planned paging for :-)
Very nifty. The automatic selection is a great innovation.
I built a somewhat similar system a while ago on top of statsd/graphite. Mine was not designed for production deployment though, just as a test platform (I was basically using graphite to store and query metric data; not optimal, but that problem was out of scope and it was easy to abuse like that). This tool allowed a user to manually select a set of metrics and create fault classifiers with those metrics.
These classifiers were able to detect not only the presence of faults but also classify what type of faults they were (provided sufficient training data; of course, you could train new classifiers with data collected in production, so training new classifiers becomes an ongoing activity). We were only testing geometric classification, but using any sort of classifier to identify complex fault types seems to be an idea with promise.
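As one concrete instance of geometric classification, here is a minimal nearest-centroid sketch in Python (a simple geometric method, not necessarily the classifier described above): each fault type is represented by the centroid of its training vectors, and a new observation is labelled with the closest centroid.

```python
# Nearest-centroid fault classification over metric feature vectors.

def train_centroids(examples):
    """examples: dict mapping label -> list of equal-length feature vectors."""
    centroids = {}
    for label, vectors in examples.items():
        dim = len(vectors[0])
        centroids[label] = [sum(v[i] for v in vectors) / len(vectors)
                            for i in range(dim)]
    return centroids

def classify(centroids, x):
    """Return the label whose centroid is closest to vector x."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], x))
```

Retraining with data collected in production then amounts to re-running train_centroids with the enlarged example sets.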
Always fun to read these Etsy ops posts. I'm very curious to know what their practical architecture looks like that allows them to capture 250k unique metrics and also run skyline against them all. It seems like each new algorithm would add a ton of processing requirements when you're at that scale.
Also, it seems like this would be really useful with the addition of metric grouping and group-specific algorithms; right now it looks like all 250k metrics pop up in the same anomalous bucket, with the same algorithms applied to every one.
Anyone recommend an easy way to get started with StatsD?
I've tried to configure/install/setup StatsD etc in the past but hit so many problems with dependencies, undocumented software needing to be installed, etc.
Any tutorial or something to get stats being tracked and graphed beautifully would be awesome.
Batsd [1] is a stripped down version of (StatsD + Graphite) that works well in my opinion. You won't have the full graphite functions etc, but it's easier to get started.
Skyline and Oculus both look interesting, and this is definitely a solid direction to be heading in.
However, I wonder if some form of topology knowledge, operations dependency tree or similar could further inform this type of root cause analytics.
Without a declarative-style "here is how things should be" model of adequate accuracy, it seems like the analytics will be stuck at the "these things are strange and happened at once, what does the human think?" level of sophistication.
I'm confused on one point: does the anomaly correlation find other metrics that look "similar", or other metrics that also have anomalies in the same time span? The latter seems like it would be very useful.
You mention elsewhere that statsd lets you do complicated aggregations over time. If you have a moving average of errors over 10 minutes or something, that's potentially not going to show up when you do anomaly correlation, since a spike is smeared across 20 minutes. Do you account for that? It would require knowing which metrics are aggregated across time and by how much, etc, I guess.
Oculus author here - Oculus detects other metrics that look similar, i.e. those that have a similar anomaly or shape in the same time span. It doesn't pick up metrics that have other, "dissimilar" anomalies - that part is left to Skyline.
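One simple way such shape similarity could be implemented, sketched purely as an illustration (this is not necessarily Oculus's internal encoding): tokenize each series into a coarse shape alphabet and compare the token sequences, so two metrics with the same anomaly shape match even if their absolute values differ.

```python
# Encode a series as a sequence of coarse shape tokens, then compare
# token sequences instead of raw values.

def shape_tokens(values, flat_tolerance=0.01):
    tokens = []
    for prev, cur in zip(values, values[1:]):
        if abs(cur - prev) <= flat_tolerance:
            tokens.append("flat")
        elif cur > prev:
            tokens.append("rise")
        else:
            tokens.append("drop")
    return tokens

def looks_similar(a, b):
    # same shape fingerprint => candidates for "similar" metrics
    return shape_tokens(a) == shape_tokens(b)
```

A production system would index these fingerprints for fast lookup rather than comparing every pair.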
Oculus treats all metrics it gets from Skyline equally at the moment, i.e. it doesn't know whether it's looking at an aggregation or a single set of data points; it just takes the data as it's presented. It would be totally possible, however, to add 10- and 20-minute averages (for example) for the same metric into Skyline so that Oculus would treat them separately.
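A sketch of that suggestion in Python (metric names hypothetical): derive 10- and 20-minute moving averages from one base series and submit each as its own metric, so each smoothing level shows up separately downstream.

```python
from collections import deque

def moving_average_series(points, window):
    """points: list of (timestamp, value); window: number of points to average.

    Returns a new (timestamp, value) series of trailing moving averages.
    """
    buf = deque(maxlen=window)  # sliding window of the most recent values
    out = []
    for ts, value in points:
        buf.append(value)
        out.append((ts, sum(buf) / len(buf)))
    return out

# Hypothetical usage: publish each derived series under its own name.
# derived = {
#     "metric.avg_10m": moving_average_series(base, 10),
#     "metric.avg_20m": moving_average_series(base, 20),
# }
```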
[1] http://www.academia.edu/2600658/Indexing_Raw_Acoustic_Featur... [2] http://old-site.clsp.jhu.edu/~ajansen/papers/IS2012a.pdf [3] http://www.cs.jhu.edu/~vandurme/papers/JansenVanDurmeASRU11....
[1] https://github.com/noahhl/batsd
http://www.gibraltarsoftware.com/
Their monitoring solution is also called Loupe.
Granted, it would be extremely useful for post-mortems, but looking at it in real time is a bit like the Library of Babel [1].
[1] http://en.wikipedia.org/wiki/The_Library_of_Babel
Does etsy have 150 engineers? Is that even possible?
It's true the "Is that even possible?" was out of line and I should have tempered it -- but, I am truly surprised.