I spent a month or so evaluating anomaly detection systems, and I can tell you a few things the Twitter post fails to mention:
1. You can get a long way with an ensemble of simple techniques. And it's always better than any single technique.
I wouldn't recommend trying to install Skyline, but re-implementing the ensemble of anomaly classifiers it uses might take you a day or two, and it will get you 90% of the way there (a rough sketch of that kind of ensemble is at the end of this comment).
2. The false positive rate is probably your most important metric.
Detecting anomalies is good, but your ops team already has plenty of alerts to deal with. If you throw false positives at them from a new system, they will hate you and ignore the new system. Most papers report ROC curves as the classification metric; that can be OK too.
3. Don't build something complex when a threshold will do.
If a point anomaly is obvious to a human, you absolutely should not build a complex system to detect it; just use thresholds. It's only when you want to detect anomalies before they cross a threshold that you should start on this kind of task. That leads me to:
4. Almost all anomalies have a temporal component.
If your detector isn't ultimately looking at multiple sources of data and finding patterns that initially look like normal behavior (or odd behavior over time, like a change in frequency), then it's not adding as much value as it could. Slow trends, increased predictability, absent spikes that are still within threshold: those are the kinds of anomalies your simple systems will miss, and catching them adds a lot of value.
Ultimately anything that makes ops' life easier and alerts them sooner to real problems is good. But in anomaly detection it is easy to fool yourself into thinking you need something complex to start out with, and then, once you've built that complex thing, into thinking it is "working" because it finds 95% of the obvious outliers.
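A minimal sketch of the kind of ensemble I mean (my own illustration, loosely in the spirit of Skyline's consensus approach rather than its actual code; the detector choices and thresholds here are made up):

    # Three cheap detectors vote on whether the latest point is anomalous;
    # the consensus decides. Thresholds are illustrative, not tuned.
    import numpy as np

    def stddev_from_average(series, threshold=3.0):
        # last point is more than `threshold` standard deviations from the mean
        return abs(series[-1] - series.mean()) > threshold * series.std()

    def median_absolute_deviation(series, threshold=6.0):
        # last point's deviation from the median, in units of the MAD
        med = np.median(series)
        mad = np.median(np.abs(series - med))
        return mad > 0 and abs(series[-1] - med) / mad > threshold

    def stddev_from_moving_average(series, window=60, threshold=3.0):
        # last point versus a trailing moving average and moving stddev
        tail = series[-window:]
        return abs(series[-1] - tail.mean()) > threshold * tail.std()

    DETECTORS = [stddev_from_average, median_absolute_deviation,
                 stddev_from_moving_average]

    def is_anomalous(series, consensus=2):
        series = np.asarray(series, dtype=float)
        return sum(bool(d(series)) for d in DETECTORS) >= consensus

    # a flat-ish series with a spike at the end should trip the consensus
    normal = np.random.default_rng(0).normal(100, 5, 500)
    print(is_anomalous(np.append(normal, 160.0)))   # True

Skyline's real list is longer (Grubbs' test, least squares, histogram bins, and so on) with a higher consensus requirement, if I remember right, but the structure is the same: cheap detectors vote, the consensus decides.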
Why would you not recommend installing Skyline? I would like to have something that watches metrics for me, and we're already using Graphite for stats.
One of my responsibilities is to do post-outage reporting to estimate how much money we lost. Right now I use Holt-Winters to give a forecast starting just before the outage began. I then estimate the loss to be the difference between the forecast and the data points that fall outside of the forecast's confidence interval.
Is this a statistically valid method? I chose Holt-Winters because I'm inexperienced and it's appealingly intuitive. Should I be looking at anomaly detection methods instead? Would they be able to tell me what "normal" would have been for the duration of the anomaly?
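A minimal sketch of that kind of estimate, using the Holt-Winters implementation in statsmodels (the synthetic series, outage window, daily seasonality, and the crude 2-sigma band are illustrative assumptions of mine, not anything from the thread):

    # Forecast the outage window from pre-outage data, then sum the shortfall
    # where the actuals fall below a crude confidence band.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # stand-in for real hourly order counts: a daily cycle plus noise,
    # with an artificial outage dip injected
    idx = pd.date_range("2014-01-01", periods=24 * 14, freq="H")
    orders = pd.Series(100 + 20 * np.sin(2 * np.pi * idx.hour / 24)
                       + np.random.default_rng(0).normal(0, 3, len(idx)), index=idx)
    outage_start, outage_end = "2014-01-10 14:00", "2014-01-10 19:00"
    orders.loc[outage_start:outage_end] *= 0.4      # simulate the outage

    train = orders.loc[:outage_start].iloc[:-1]     # data up to just before the outage
    fit = ExponentialSmoothing(train, trend="add", seasonal="add",
                               seasonal_periods=24).fit()   # daily cycle on hourly data

    actual = orders.loc[outage_start:outage_end]
    forecast = fit.forecast(len(actual))
    forecast.index = actual.index

    # statsmodels gives no forecast interval here, so use a rough band of
    # 2 standard deviations of the in-sample residuals
    resid_sd = (train - fit.fittedvalues).std()
    outside = actual < (forecast - 2 * resid_sd)

    # loss = forecast minus actuals, counted only where actuals fall outside the band
    print("estimated lost orders:", (forecast - actual)[outside].sum())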
I don't know if it's statistically valid, but I've used methods based on a similar approach to do the kind of calculations you're talking about, in a role similar to what you described.
There are more subtleties that it might be important for you to take into account - primarily:
1) You need to sample fairly extensively before/after the outage to calibrate more accurately against Holt-Winters (the Holt-Winters seasonal projection should capture the trend, but the actual numbers are probably running some slight or significant amount above/below the projections).
2) When taking those samples, it's important to use data points you believe are definitely not impacted by the outage. This is often quite challenging, since outages may span low/peak traffic periods or ramp-up/down periods.
3) Finally, it can be hard to pinpoint the actual start/end of the event (that is, to identify the time samples you want to include in your measurement of the outage cost). The end is particularly tricky, since there is often pent-up pressure from queued operations (from software, or from users itching to complete what they were trying to do) that makes your samples fluctuate. That backfill pressure can be substantial, and it's important not to ignore it when measuring the actual cost of the issue. Say you're a retail site: if you have a 15 minute period with a 50% order drop, but in the first 5 minutes after service is restored the order rate runs 50% above projections, do you count that as 15 minutes of 50% order drop, or 10 minutes? Both are legitimate, but it's important to know what metric you're measuring yourself against so you're as correct/honest as you can be.
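To put toy numbers on that retail example (all values made up):

    # 15 minutes at a 50% order shortfall, then 5 minutes of backfill at 50%
    # above projections once service is restored.
    projected_rate = 100.0                    # orders per minute, per the forecast
    gross_loss = 15 * projected_rate * 0.50   # 750 orders: "15 minutes of 50% drop"
    backfill = 5 * projected_rate * 0.50      # 250 orders recovered after restore
    print(gross_loss)                         # 750.0
    print(gross_loss - backfill)              # 500.0: "10 minutes of 50% drop"

Whether you report 750 or 500 is exactly the judgment call described above.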
An interesting counterpart to Etsy's Skyline project[1], which was intended to read data from the incoming Graphite data streams. The two do seem like they could be complementary.
[1] https://github.com/etsy/skyline/wiki
Does anyone have a clue what "ESD" stands for in this context? The article is too buzzword-heavy to be very meaningful, even to a practitioner (this is surprisingly common in data analysis, where there seems to be a culture of naming things in the most opaque way possible.)
The data they are looking at is essentially a univariate stochastic point process, that is, an arrival process. The most important special case is the Poisson process. There the times between arrivals are independent, identically distributed random variables with exponential distribution with parameter the arrival rate. The number of arrivals in an interval is random with Poisson distribution (compare with the terms of the Taylor series for exp(x)).
See early in
E. Cinlar, 'Introduction to Stochastic Processes'.
There for the Poisson process there is a 'qualitative, axiomatic' definition -- an arrival process with stationary, independent increments. A cute derivation from this just qualitative description results in the details of the Poisson process.
One point about this qualitative approach is that commonly in practice the assumptions are obvious just intuitively.
Another solid approach to a Poisson process is the renewal theorem; there is a careful treatment in W. Feller's now classic volume II.
The theorem says that under mild assumptions a sum of independent renewal processes converges to a Poisson process. Arrivals at Twitter look like a nearly perfect example.
So, basically without any anomalies the Twitter data is a Poisson process.
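A rough simulation of that superposition claim (mine, not from the references above): merge many independent renewal processes whose gaps are deliberately not exponential and check that the merged stream shows the Poisson signatures.

    # Superpose 2000 independent renewal processes with uniform (non-exponential)
    # gaps; the merged arrival stream should look approximately Poisson.
    import numpy as np

    rng = np.random.default_rng(0)
    n_users, horizon, burn_in = 2000, 3000.0, 1000.0

    def one_user(rng, horizon):
        times = np.cumsum(rng.uniform(0.0, 200.0, size=80))   # mean gap 100
        return times[times < horizon]

    merged = np.sort(np.concatenate([one_user(rng, horizon) for _ in range(n_users)]))
    merged = merged[merged > burn_in]              # drop the non-stationary start

    # Poisson signature 1: counts in unit bins have variance close to the mean.
    counts, _ = np.histogram(merged, bins=int(horizon - burn_in),
                             range=(burn_in, horizon))
    print("count mean %.1f, variance %.1f" % (counts.mean(), counts.var()))

    # Poisson signature 2: merged gaps are nearly exponential (CV close to 1).
    gaps = np.diff(merged)
    print("gap coefficient of variation %.2f" % (gaps.std() / gaps.mean()))

The count mean and variance should come out close to each other and the coefficient of variation close to 1, which is the convergence the renewal theorem describes.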
Observation (2).
In part, the anomaly detector is based on the extreme Studentized deviate (ESD) statistical hypothesis test as in
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3...
but this test has a Gaussian assumption, not a Poisson assumption. So, there should be some mention of justifying using a Gaussian assumption.
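For reference, a sketch of the plain generalized ESD test from the NIST page above (my own Python illustration, not Twitter's code; as I understand it, the detector in the OP layers seasonal decomposition and robust estimates on top of this core test):

    # Generalized ESD test per the NIST handbook: repeatedly remove the most
    # extreme point, compare the statistic R_i against the critical value
    # lambda_i, and report the largest i for which R_i > lambda_i.
    import numpy as np
    from scipy import stats

    def generalized_esd(x, max_outliers, alpha=0.05):
        x = np.asarray(x, dtype=float)
        n = len(x)
        mask = np.ones(n, dtype=bool)
        removed, exceeds = [], []
        for i in range(1, max_outliers + 1):
            xs = x[mask]                            # sample with i-1 extremes removed
            dev = np.abs(xs - xs.mean())
            j = int(np.argmax(dev))
            r_i = dev[j] / xs.std(ddof=1)           # test statistic R_i
            idx = int(np.flatnonzero(mask)[j])
            removed.append(idx)
            mask[idx] = False
            p = 1.0 - alpha / (2.0 * (n - i + 1))   # critical value lambda_i
            t = stats.t.ppf(p, df=n - i - 1)
            lam_i = (n - i) * t / np.sqrt((n - i - 1 + t * t) * (n - i + 1))
            exceeds.append(r_i > lam_i)
        n_outliers = max((i for i, hit in enumerate(exceeds, 1) if hit), default=0)
        return removed[:n_outliers]                 # indices declared outliers

    # 200 Gaussian points plus two injected spikes at the end
    data = np.r_[np.random.default_rng(1).normal(0.0, 1.0, 200), 9.0, -8.0]
    print(generalized_esd(data, max_outliers=5))    # expect the two spikes flagged

Note the Gaussian assumption enters through the mean, the standard deviation, and the t-based critical values.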
Point (1).
The work in the OP, that is, an anomaly detector, is basically, nearly inescapably and necessarily, a statistical hypothesis test and, thus, faces the usual issues of false alarm rate (the significance level of the test, the conditional probability of a Type I error, that is, of declaring an anomaly when there is none), detection rate, and the classic Neyman-Pearson result on the best possible test.
In particular, it is important and usually standard to have a means to adjust, control, and know the false alarm rate, but in the OP I saw no mention of false alarm rate, power of the test, etc.
On a server farm bridge or in a network operations center (NOC) with near real time anomaly detection, a false alarm rate that is too high is a serious concern. With realistic detectors, a false alarm rate that is too low means a detection rate that is too low, which is also a concern.
More.
There are some tests that are both multi-dimensional and distribution-free, with false alarm rate known exactly in advance and adjustable in small steps over a wide range. Such tests might be good for monitoring for 'zero day' problems, that is, ones never seen before, in serious server farms and networks.
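A one-dimensional toy version of that idea (a sketch of mine, not anything from the OP): if a new observation is exchangeable with an n-point reference window, then alerting when it exceeds the k-th largest reference value has a false alarm rate of exactly k/(n+1), whatever the underlying distribution is, adjustable in steps of 1/(n+1).

    # Distribution-free alarm via order statistics; the false alarm rate is
    # known exactly in advance under the exchangeability assumption.
    import numpy as np

    def order_statistic_alarm(reference, new_value, k=1):
        # alarm if new_value exceeds the k-th largest of the reference window
        return new_value > np.sort(reference)[-k]

    # empirical check of the advertised false alarm rate, on non-Gaussian data
    rng = np.random.default_rng(0)
    n, k, trials = 99, 5, 100_000
    alarms = 0
    for _ in range(trials):
        sample = rng.lognormal(size=n + 1)          # deliberately skewed data
        if order_statistic_alarm(sample[:n], sample[n], k=k):
            alarms += 1
    print(alarms / trials, "vs predicted", k / (n + 1))   # both near 0.05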
> but this test has a Gaussian assumption, not a Poisson assumption.
In practice the Gaussian assumption might not be too badly abused (at least for Twitter) since the Poisson distribution approaches the Gaussian when lambda is large.
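A quick numerical check of that (the value of lambda here is arbitrary):

    # For large lambda, Poisson(lambda) is close to Normal(lambda, lambda).
    from scipy import stats
    lam = 500
    ks = range(400, 601)
    poisson_pmf = stats.poisson.pmf(ks, lam)
    normal_pdf = stats.norm.pdf(ks, loc=lam, scale=lam ** 0.5)
    print(max(abs(p - q) for p, q in zip(poisson_pmf, normal_pdf)))
    # small everywhere relative to the pmf itself (which peaks around 0.018)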
Etsy's introduction of the Kale stack (which includes Skyline) is a great read - https://codeascraft.com/2013/06/11/introducing-kale/