Control Charts

Last week, while working on customer engagement, I learned a new method of quantifying the behavior of time-series data. The method is called a “Control Chart”, and credit goes to Josh Wills, our director of data science, for pointing it out. I thought I’d share it with my readers, as it’s easy to understand, easy to implement, flexible, and very useful in many situations.

The problem is ages old – you collect measurements over time and want to know when your measurements indicate abnormal behavior. “Abnormal” is not well defined, and that’s on purpose – we want our method to be flexible enough to match what you define as an issue.

For example, let’s say Facebook is interested in tracking the usage trend for each user, catching those whose use is decreasing.

There are a few steps to the Control Chart method:

  1. Collect all relevant data points. In our case, number of minutes of Facebook use per day for each user.
  2. Calculate a baseline – this can be the average use for each user, the average use for similar demographics, or even an adaptive average of the type calculated by Oracle Enterprise Manager, to take into account decreased Facebook use over the weekend.
  3. Calculate “zones” one, two, and three standard deviations around the baseline (see the sketch below).
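
As a quick illustration of steps 2 and 3, here is a minimal Python sketch. It assumes the simplest choices – a plain average for the baseline and the sample standard deviation for the zones – and the function name and structure are mine, not part of any standard library:

```python
import numpy as np

def control_zones(series):
    """Compute a baseline and one/two/three standard-deviation zones
    for a series of measurements (e.g. daily minutes of Facebook use)."""
    series = np.asarray(series, dtype=float)
    baseline = series.mean()   # step 2: the simplest baseline, a plain average
    sd = series.std(ddof=1)    # sample standard deviation
    # step 3: (lower, upper) bounds at 1, 2, and 3 standard deviations
    zones = {k: (baseline - k * sd, baseline + k * sd) for k in (1, 2, 3)}
    return baseline, sd, zones
```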

Those zones can be used to define rules for normal and abnormal behavior of the system. These rules are what make the method valuable.
Examples of rules that define abnormal behavior:

  1. Any point 3 standard deviations above the baseline. This indicates an extreme, sudden increase.
  2. 7 consecutive measurements more than one standard deviation above the baseline. This indicates a sustained increase.
  3. 9 consecutive measurements, each higher than the previous one. This indicates a steady upward trend.
  4. 6 consecutive measurements, each more than two standard deviations away from the baseline and each on the opposite side of the baseline from the previous measurement. This indicates instability of the system.
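
To show how easily these rules translate into code, here is a sketch of all four in Python, reusing the `baseline` and `sd` values from the snippet above. The function names, and the choice to report matches as indexes into the series, are mine:

```python
def rule_extreme_spike(series, baseline, sd):
    """Rule 1: any single point more than 3 standard deviations above the baseline."""
    return [i for i, x in enumerate(series) if x > baseline + 3 * sd]

def rule_sustained_increase(series, baseline, sd, run=7):
    """Rule 2: `run` consecutive points more than one standard deviation above the baseline."""
    streak, hits = 0, []
    for i, x in enumerate(series):
        streak = streak + 1 if x > baseline + sd else 0
        if streak >= run:
            hits.append(i)
    return hits

def rule_upward_trend(series, run=9):
    """Rule 3: `run` consecutive points, each higher than the previous one."""
    streak, hits = 1, []
    for i in range(1, len(series)):
        streak = streak + 1 if series[i] > series[i - 1] else 1
        if streak >= run:
            hits.append(i)
    return hits

def rule_oscillation(series, baseline, sd, run=6):
    """Rule 4: `run` consecutive points more than two standard deviations
    from the baseline, alternating between the two sides of it."""
    streak, prev_side, hits = 0, 0, []
    for i, x in enumerate(series):
        if abs(x - baseline) > 2 * sd:
            side = 1 if x > baseline else -1
            streak = streak + 1 if side == -prev_side else 1
            prev_side = side
        else:
            streak, prev_side = 0, 0
        if streak >= run:
            hits.append(i)
    return hits
```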

There are even sets of standard rules used in various industries, best practices of sorts. The Western Electric rules and the Nelson rules are particularly well known.

Note how flexible the method is – you can use any combination of rules that highlights the abnormalities you are interested in.
Also note that while the traditional use indeed involves charts, the values and rules are very easy to calculate programmatically; visualization can be useful, but it is not mandatory.
If you measure CPU utilization on a few servers, visualizing the chart and actually seeing the server behavior can be useful. If you are Facebook and monitor user behavior, visualizing a time series for every one of millions of users is hopeless. Calculating baselines, standard deviations, and rules for each user is trivial.

Also note how this problem is “embarrassingly parallel”. To calculate behavior for each user, you only need to look at the data for that particular user. A parallel, shared-nothing platform like Hadoop can be used to scale the calculation indefinitely, simply by throwing more servers at the problem. The only limit is the time it takes to calculate the rules for a single user.
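
As a small-scale stand-in for the Hadoop version, here is a sketch of the per-user calculation using Python’s multiprocessing module and the helpers from the earlier snippets; at larger scale, the same per-user function would run as a mapper. The structure and names are mine, and which rules to apply is just an example:

```python
from multiprocessing import Pool

def check_user(item):
    """Evaluate the rules for a single user. Note that this needs only
    that one user's data, which is why the problem parallelizes so well."""
    user_id, series = item
    baseline, sd, _ = control_zones(series)
    flagged = bool(rule_extreme_spike(series, baseline, sd)
                   or rule_sustained_increase(series, baseline, sd))
    return user_id, flagged

def flag_users(usage_by_user, workers=8):
    """usage_by_user maps user id -> list of daily measurements.
    Users are independent, so the work splits cleanly across local
    processes here, or across Hadoop mappers at larger scale."""
    with Pool(workers) as pool:
        return dict(pool.map(check_user, list(usage_by_user.items())))
```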

Naturally, I didn’t dive into some of the complexities of using Control Charts, such as how to select a good baseline, how to calculate the standard deviation (or whether to use another statistic to define the zones), and how many measurements should be examined before a behavior signals a trend. If you think this tool would be useful for you, I encourage you to investigate more deeply.