I would like to know more about the performance of Skyline in practice:
- what are the accuracy and recall like?
- what is CPU consumption like?
Regarding the latter, I had a quick look at the implemented algorithms and they seemed very inefficient: basically, they recompute over the entire series at every change. I think with a bit of work most of the algorithms could be reimplemented incrementally. I also wouldn't use Python for something that is going to be CPU bound. (I await the "We rewrote it in Go and it's 10x faster!" blog post ;-)
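For example, the common stddev-style checks could keep running aggregates instead of rescanning the whole series on every new data point. A minimal sketch in Python using Welford's online algorithm (class and method names here are illustrative, not Skyline's actual API):

```python
# Incremental mean/stddev: O(1) work per new data point instead of
# recomputing over the entire series at every change.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        # Welford's update: fold one new value into the aggregates
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0

    def is_anomaly(self, x, sigmas=3.0):
        # flag points more than `sigmas` standard deviations from the mean
        return self.n > 1 and abs(x - self.mean) > sigmas * self.stddev()
```

The same trick extends to moving averages and least-squares fits by maintaining the relevant running sums.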
Author here. Accuracy is okay - we err on the side of noise, but it does routinely pick up anomalies. It doesn't currently account for seasonal trends, though.
We aim for 100% CPU consumption. Analysis is a very CPU-intensive process, and two parts in particular are expensive: decoding the Redis string from MessagePack to Python, and running the algorithms.
As for the algorithm inefficiencies, pull requests encouraged :)
Rewriting it in Go is a plan for a rainy weekend :) The problem with Go is that it doesn't have as great statistics support as Python does.
DTW is quadratic. While in grad school I worked for a bit with a team interested in doing massive speech recognition using DTW, so they did some work speeding up the algorithm using a technique called Locality-Sensitive Hashing [1], [2], [3]. It might be worth a look to speed up your algorithms.
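For context, the standard dynamic-programming formulation of DTW makes the quadratic cost explicit: comparing two series of lengths n and m fills an n-by-m table, so every pairwise comparison costs O(nm).

```python
# Classic DTW distance via dynamic programming. cost[i][j] holds the
# best alignment cost of a[:i] against b[:j]; filling the table is
# O(n * m), which is why DTW is quadratic for same-length series.

def dtw_distance(a, b):
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats
                                 cost[i][j - 1],      # b[j-1] repeats
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]
```

The LSH work in the cited papers avoids paying this cost for every candidate pair by cheaply pruning most comparisons first.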
Oculus recommended setup (found at https://github.com/etsy/oculus):
* ElasticSearch
  * At least 8GB RAM
  * Quad Core Xeon 5620 CPU or comparable
  * 1GB disk space
  * two ElasticSearch servers in separate clusters
* a cluster of Worker boxes running Resque
  * worker master runs redis
  * additional resque worker boxes (and potentially slaves)
  * At least 12GB RAM
  * Quad Core Xeon 5620 CPU or comparable
  * 1GB disk space
It'd be nice if there were a more established baseline set of server specs to get up and running. While many of us aspire to be at Etsy-level monitoring, we're just not there.
I'll definitely have a look at doing that - the initial specs were designed around the metric volumes we use the tools for, but I realise that might not be practical for smaller workloads :)
I'm not really convinced that this is very useful. I've been in the application monitoring space for a few years now and I'm not sure that watching graphs is something Ops people should be doing.
There should be rules which notify them if something is anomalous, by email, SMS, or logging a problem on an incident management tool. e.g. "Java request foo.bar() on Managed Server 1 is throwing exceptions for 50% of invocations (20 requests, 10 exceptions) in the last 10 minutes. This affects the following services: Customer Login page on foo.bar." possibly even attaching some of the exception messages to the email, if sampled through instrumentation or correlating it back to the log files, automatically.
This type of monitoring is actually useful because Ops understand what is broken and what it affects, and it gives them enough detail to either fix the problem or pass it to someone else; they're not wasting their time looking at graphs waiting for a problem to appear.
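A toy sketch of such a rule in Python (the function name, threshold, and message format are made up for illustration):

```python
# Alert when an operation's error rate over a recent window crosses a
# threshold, instead of waiting for a human to spot it on a graph.

def check_error_rate(requests, exceptions, threshold=0.5):
    """Return an alert message if exceptions/requests >= threshold, else None."""
    if requests == 0:
        return None
    rate = exceptions / requests
    if rate >= threshold:
        return ("foo.bar() is throwing exceptions for {:.0%} of invocations "
                "({} requests, {} exceptions)".format(rate, requests, exceptions))
    return None
```

A real system would route the returned message to email, SMS, or an incident-management tool and attach sampled exception details.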
The thing is that at a given scale (and it comes early, actually), pushes do not scale.
I still use pushes for clear-cut things that require paging, but having graphs of a lot of things and just noticing changes or anomalies in the overall patterns will help spot a lot of issues, including things you haven't yet planned paging for :-)
Very nifty. The automatic selection is a great innovation.
I built a somewhat similar system a while ago on top of statsd/graphite. Mine was not designed for production deployment though, just as a test platform (I was basically using graphite to store and query metric data; not optimal, but that problem was out of scope and it was easy to abuse like that). This tool allowed a user to manually select a set of metrics and create fault classifiers with those metrics.
These classifiers were able to detect not only the presence of faults but also classify what type of faults they were (provided sufficient training data; of course, you could train new classifiers with data collected in production, so training new classifiers becomes an ongoing activity). We were only testing geometric classification, but using any sort of classifier to identify complex fault types seems to be an idea with promise.
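As one concrete instance of geometric classification, here is a minimal nearest-centroid sketch in Python (a simple geometric method, not necessarily the classifier described above): each fault type is represented by the centroid of its training vectors, and a new observation is labelled with the closest centroid.

```python
# Nearest-centroid fault classification over metric feature vectors.

def train_centroids(examples):
    """examples: dict mapping label -> list of equal-length feature vectors."""
    centroids = {}
    for label, vectors in examples.items():
        dim = len(vectors[0])
        centroids[label] = [sum(v[i] for v in vectors) / len(vectors)
                            for i in range(dim)]
    return centroids

def classify(centroids, x):
    """Return the label whose centroid is closest to vector x."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], x))
```

Retraining with data collected in production then amounts to re-running train_centroids with the enlarged example sets.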
Always fun to read these Etsy ops posts. I'm very curious to know what their practical architecture looks like that allows them to capture 250k unique metrics and also run skyline against them all. It seems like each new algorithm would add a ton of processing requirements when you're at that scale.
Also, it seems like this would be really useful with the addition of metric grouping and group-specific algorithms; right now it looks like all 250k metrics pop up in the same anomalous bucket, with the same algorithms applied to every one.
Anyone recommend an easy way to get started with StatsD?
I've tried to configure/install/setup StatsD etc in the past but hit so many problems with dependencies, undocumented software needing to be installed, etc.
Any tutorial or something to get stats being tracked and graphed beautifully would be awesome.
Batsd [1] is a stripped down version of (StatsD + Graphite) that works well in my opinion. You won't have the full graphite functions etc, but it's easier to get started.
Skyline and Oculus both look interesting, and this is definitely a solid direction to be heading in.
However, I wonder if some form of topology knowledge, operations dependency tree or similar could further inform this type of root cause analytics.
Without a declarative-style "here is how things should be" model of adequate accuracy, it seems like the analytics will be stuck at the "these things are strange and happened at once, what does the human think?" level of sophistication.
I'm confused on one point: does the anomaly correlation find other metrics that look "similar", or other metrics that also have anomalies in the same time span? The latter seems like it would be very useful.
You mention elsewhere that statsd lets you do complicated aggregations over time. If you have a moving average of errors over 10 minutes or something, that's potentially not going to show up when you do anomaly correlation, since a spike is smeared across 20 minutes. Do you account for that? It would require knowing which metrics are aggregated across time and by how much, etc, I guess.
Oculus author here - Oculus detects other metrics that look similar, i.e. those that have a similar anomaly or shape in the same time span. It doesn't pick up metrics that have other, "dissimilar" anomalies - that part is left to Skyline.
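One simple way such shape similarity could be implemented, sketched purely as an illustration (this is not necessarily Oculus's internal encoding): tokenize each series into a coarse shape alphabet and compare the token sequences, so two metrics with the same anomaly shape match even if their absolute values differ.

```python
# Encode a series as a sequence of coarse shape tokens, then compare
# token sequences instead of raw values.

def shape_tokens(values, flat_tolerance=0.01):
    tokens = []
    for prev, cur in zip(values, values[1:]):
        if abs(cur - prev) <= flat_tolerance:
            tokens.append("flat")
        elif cur > prev:
            tokens.append("rise")
        else:
            tokens.append("drop")
    return tokens

def looks_similar(a, b):
    # same shape fingerprint => candidates for "similar" metrics
    return shape_tokens(a) == shape_tokens(b)
```

A production system would index these fingerprints for fast lookup rather than comparing every pair.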
Oculus treats all metrics it gets from Skyline equally at the moment, i.e. it doesn't know whether it's looking at an aggregation or a single set of data points; it just takes the data as it's presented. It would be totally possible, however, to add 10- and 20-minute averages (for example) for the same metric into Skyline so that Oculus would treat them separately.
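A sketch of that suggestion in Python (metric names hypothetical): derive 10- and 20-minute moving averages from one base series and submit each as its own metric, so each smoothing level shows up separately downstream.

```python
from collections import deque

def moving_average_series(points, window):
    """points: list of (timestamp, value); window: number of points to average.

    Returns a new (timestamp, value) series of trailing moving averages.
    """
    buf = deque(maxlen=window)  # sliding window of the most recent values
    out = []
    for ts, value in points:
        buf.append(value)
        out.append((ts, sum(buf) / len(buf)))
    return out

# Hypothetical usage: publish each derived series under its own name.
# derived = {
#     "metric.avg_10m": moving_average_series(base, 10),
#     "metric.avg_20m": moving_average_series(base, 20),
# }
```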
[1] http://www.academia.edu/2600658/Indexing_Raw_Acoustic_Featur... [2] http://old-site.clsp.jhu.edu/~ajansen/papers/IS2012a.pdf [3] http://www.cs.jhu.edu/~vandurme/papers/JansenVanDurmeASRU11....
[1] https://github.com/noahhl/batsd
http://www.gibraltarsoftware.com/
Their monitoring solution is also called Loupe.
Granted, it would be extremely useful for post-mortems, but looking at it in real time is a bit like the Library of Babel [1].
[1] http://en.wikipedia.org/wiki/The_Library_of_Babel
Does etsy have 150 engineers? Is that even possible?
It's true the "Is that even possible?" was out of line and I should have tempered it -- but, I am truly surprised.