Complex systems can fail in many ways and I find it useful to divide failures into two classes. The first consists of failures that can be anticipated (e.g. disk failures), will happen again, and can be directly checked for. The second class is failures that are unexpected. This post is about that second class.

The tool for unexpected failures is statistics, hence the name for this post. A statistical anomaly detector hunts for things that seem off and then sends an alert. The power of statistical anomalies is that they can detect novel problems. The downside is that they must be followed up to track down the root cause. All this might seem abstract, but I will give a concrete example below.

To avoid confusion, I find it helps to carefully define some terminology:

**Monitored Signals**: Results of continuous scans that look for trouble.**Disruption**: Behavior indicating that the site is not working as designed. A human will need to investigate to find the root cause.**Anomaly**: Unusual behavior in the monitored signals, suggesting a disruption.**Alert**: Automatic signal that an anomaly has occurred. It is common to send an alert after detecting an anomaly.

The obstacle to building useful statistical detectors is *false positives*: an alert that doesn’t correspond to a disruption. If the anomaly detector has too many false positives, users will turn off alerts or direct them to a spam folder or just generally ignore them. An ignored detector is not very useful.

In this post I will describe a statistical anomaly detector in use at eBay; my paper at the recent USENIX HotCloud 2015 workshop has more details. The monitored signals in our search system are based on the search results from a set of reference queries that are repeatedly issued. A statistical anomaly detector needs a way to encode the results of the monitoring as a set of numbers. We do this by computing *metrics* on the results. For each reference query, we compute about 50 metrics summarizing the items returned by the query. Two examples of metrics are the number of items returned and the median price of the returned items. There are 3000 reference queries and 50 metrics, therefore 150,000 numbers total. Currently, reference queries are issued every 4 hours, or 6 times a day, so there are 900,000 numbers per day. In these days of terabyte databases, this is very small. But the problem of sifting through those numbers to find anomalies, and do it with a low false-positive rate, is challenging.

I’ll outline our approach using diagrams, and refer you to my HotCloud paper for details. First, here is a picture of the monitored signals—that is, the metrics collected:

Each (query, metric) pair is a number that can be tracked over time, resulting in a time series. That’s 150,000 times series, and it’s reasonable to expect that during every 4-hour collection window, at least one of them will appear anomalous just by chance. So alerting on each time series is no good, because it will result in many false positives.

Our approach is to aggregate, and it starts with something very simple: examining each time series and computing the deviation between the most recent value and what you might have expected by extrapolating previous values. I call this the *surprise*, where the bigger the deviation the more the surprise. Here’s a figure illustrating that there is a surprise for each (query, metric, time) triple.

The idea for our anomaly detector is this: at each collection time *T* we expect a few (query,metric, *T*) triples to have a large surprise. We signal an alert if an unusually high number of triples have a large surprise. To make this idea quantitative, fix a metric, look at the surprise for all 3000 queries, and compute the 90th percentile of surprise at time *T*: *S*_{90}(*T*).

This gives a new time series in *T*, one for each metric. Hypothetical time series for the first two metrics are illustrated below.

We’ve gone from 150,000 time series down to 50. Aggregation like this is a very useful technique in anomaly detection.

Our final anomaly detector uses a simple test on this aggregated time series. We define an anomaly to occur when the current value of any of the 50 series is more than 3σ from the median of that series. Here’s an example using eBay data with the metric median sale price. There is clearly an anomaly at time *T*=88.

For more details, see my previously cited HotCloud paper, The Importance of Features for Statistical Anomaly Detection.

Alex BoeseIt sounds to me like instead of taking a regular 3 standard deviations, you’ve found a round about way of taking something that is approximately 5 standard deviations out. I’m sure it’s not precisely the same, but that’s what it feels like.

Perhaps it would be better (but much more actual work – work requiring a team), to determine causalities on lower standard deviations and then weed those out. This would require (among other things) larger data sets, so perhaps something like Twitter trends, newsfeeds, weather patterns, date time behaviors, internet traffic, etc. Then again, quite possibly you’re already doing this too, and I’m just preaching to the choir, as it were.

Have fun with the weeding.

-A

ZachVery interesting!

-Have you noticed problems with cyclic patterns? I would imagine that cyclic data with a short period would often look “surprising” due to a constant change towards/away from the median deviation line?

– I suppose if the metric is consistently “surprising”, its 90th quantile over time will simply be larger and so that metric is just less sensitive to finding subtle disruptions?

– Is there a reason these metrics are only checked every 4 hours? Simply for performance reasons?

@Alex: I could be wrong, but I don’t think it’s quite the same. This is looking for samples whose “surprise” (aka absolute deviation from the median) are three standard deviations away from the median of the 90th percentile rank of all absolute deviations. Pheww…that’s a mouthful 🙂

– Collect all surprise for a metric

– Sort, find the 90th percentile rank of those values

– Find the median of the 90th rank

– Calculate the absolute deviation of all ranks from that median

– Flag any value that is >3 deviations from the median (of the 90th rank)

I *think* this is different, because it’s asking “how much does the 90th percentile vary from point to point” rather than “how much do the samples vary from point to point”. The latter is more sensitive to outliers, since a few big outliers could skew the overall variance but not affect the 90th percentile rank too much.

Intuitively I think this means is that a sample’s 90th percentile will only change if a large number of constituents experience correlated changes in surprise. Individual spikes will generally be suppressed since it won’t affect 90th percentile rank much.

But I very well could be wrong, still noodling over this in my head 🙂

John WillisDeming and Shewhart would be proud.

Loved it…

Thanks

@botchagalupe

David GoldbergPost authorEvery 4 hours: you are right, we do that for performance reasons. There’s also a much simpler check that runs every minute but can only detect major disruptions.

Cyclic patterns: none of our current metrics have a strong periodicity. But I think a small change will handle such a metric. Before computing surprise as the deviation from a linear fit, first subtract any periodic component.

Consistently surprising: if many queries for a metric consistently have large surprise, then as you note, the metric’s 90% quantile will be consistently large. That doesn’t necessarily make it less sensitive. The sensitivity depends on the consistency of the 90th quantile. That number could be very big, but if it’s consistently big, a small change would trigger an alert.

ZachAh, yeah, that makes sense. Thanks for the explanation! Very cool work 🙂

John WillisI have been learning statistics as a hobby and I tried to build a R based prototype based on this idea ( just for fun). When I got down to the last stage I noticed you said this “any of the 50 series is more than 3σ from the median of that series. “. Glancing at your paper I saw you were using MAD not SD. When you say “3σ” from the median are you saying 3 standard deviations from the median or 3 median distributions from the median? As a layman I always thing of 3σ as three standard deviations…

TIA

@botchagalupe

David GoldbergPost authorYes, by 3σ I mean SD (standard deviation). The confusion is because I use both MAD and SD. When I want to find an outlier in data that might already contain outliers, I use MAD, otherwise I use σ. Muninn falls into the first case. But in Atlas, the signal being monitored is the 90th quantile feature, which has a roughly normal distribution, so I used σ.

Cannelle RichterFirst of all thank you for your article. It’s extremely difficult finding people capable of presenting and explaining mathematical theories and algorithms the easiest way as possible. I acknowledge these strength, it could have more.

From my understanding, this statistical anomaly detection aimed to stand out global high-level anomalies from multiple query records for each metric. So, its seems to correlate queries data (correlated anomalies precisely) in order to produce high level of abstraction of occurred anomalies and reduce false positive. It’s like a high level vision of occurred anomalies : you suppose that you want an alert when the most of the query or some queries are disrupted.

So, if I’m right, I have then two questions :

– it supposes that queries are correlated in one way or another, because anomalies occurred relatively at the same time ? Or because one anomaly is bigger as another ?

– in this scheme, we can’t say which query has caused the anomaly. So what is the practical application ? Maybe you have to investigate the 3000 queries for the root cause ?

Thanks.

David GoldbergPost authorA site problem (disruption) might affect a single query or multiple queries. The system I described is designed to catch a disruption that affects multiple queries. For example eBay ranks queries about automobiles differently from other queries, and a new release might introduce a bug (disruption) in the automobile ranking code.

When the anomaly detector triggers an alert, it reports two things. First, the metrics that had a large change in the distribution of surprise values. And second, for each such metric, a sample of the queries with the largest surprise. So if the disruption affected automobile queries, then many of the sample queries would be about automobiles. This would be a clue that could help track down the root cause.

Pingback: T-Digest Algebird | BigSnarf blog

Fernando WasylyszynHello. I read this article and I found it not only very interesting but also very didactic. Although, I have a doubt: in the paper you say that “we quantify the surprise by fitting all the points to a line (shown in red), and computing the (unsigned) deviation of each point from the line”. You fit the points using the moving average technique? I ask you this in particular because I found an article about an implementation of Atlas using ElasticSearch (https://www.elastic.co/blog/implementing-a-statistical-anomaly-detector-part-1) that uses it.

Kind regards

Fernando

David GoldbergPost authorThanks for your question, which points out a detail that I glossed over. In the original implementation of Atlas, I used robust linear regression to fit the data, specifically rlm in the MASS package for R. The surprise is the deviation from the prediction made using the linear fit.

There’s another thing you might wonder about: why do I use a *linear* fit? The answer is that for our application the data was almost always close to linear. If that is not the case for your application, you can use a more complex predictor, perhaps using a time-series forecast method.

Fernando WasylyszynThank you for the reply! I see your point.

Do you think that measuring the surprise of a single point as:

suprise(Xn+1) = abs(Xn+1 – average Xn) ; where average Xn is the average of points X1 … Xn

and then, compare surprise(Xn+1) with the stddev Xn (standard deviation of points X1 … Xn) in order to determine if Xn+1 is an anomaly point. Going to the point, the comparison that I’m using it is:

isAnomaly(Xn+1) = surprise(Xn+1) > 2 * stddev Xn

The use of the factor two is merely experimental from several datasets that I’ve use.

Kind regards

Fernando

Fernando WasylyszynThank you for the reply! I see your point.

Do you think that measuring the surprise of a single point as:

suprise(Xn+1) = abs(Xn+1 – average Xn) ; where average Xn is the average of points X1 … Xn

and then, compare surprise(Xn+1) with the stddev Xn (standard deviation of points X1 … Xn) in order to determine if Xn+1 is an anomaly point. Going to the point, the comparison that I’m using it is:

isAnomaly(Xn+1) = surprise(Xn+1) > 2 * stddev Xn

The use of the factor two is merely experimental from several datasets that I’ve use.

Kind regards

Fernando

Fernando WasylyszynThank you for the reply! I see your point.

Do you think that measuring the surprise of a single point as:

suprise(Xn+1) = abs(Xn+1 – average Xn) ; where average Xn is the average of points X1 … Xn

and then, compare surprise(Xn+1) with the stddev Xn (standard deviation of points X1 … Xn) in order to determine if Xn+1 is an anomaly point. Going to the point, the comparison that I’m using it is:

isAnomaly(Xn+1) = surprise(Xn+1) > 2 * stddev Xn

The use of the factor two is merely experimental from several datasets that I’ve use.

Kind regards

Fernando

Fernando WasylyszynDavid: I try to put here, as the content of a reply, a question about the way of calculate the surprise value, but for some unknown reason for me, I was unable to do it. Have you got any mail where I can reach you in order to ask you this question?

Kind regards.

Fernando

Fernando WasylyszynDo you think that measuring the surprise of a single point as:

suprise(Xn+1) = abs(Xn+1 – average Xn) ; where average Xn is the average of points X1 … Xn

isAnomaly(Xn+1) = surprise(Xn+1) > 2 * stddev Xn

The use of the factor two is merely experimental from several datasets that I’ve use.

Kind regards

Fernando