I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

Autocorrelation and Causation December 7, 2009

Filed under: musing,statistics — prodlife @ 7:48 pm

Everyone talks about how correlation doesn’t imply causation, but no one says what autocorrelation implies.

Maybe it’s because we don’t talk about autocorrelation at all :)

Let’s start talking about autocorrelation by saying what it is:
First of all, autocorrelation is a concept related to time series. A time series is an ordered series of measurements taken at intervals over time. You know how we measure CPU utilization every 10 minutes and then display nice “CPU over time” graphs? That’s a time series.
Autocorrelation is the correlation between points in the same time series, measured at a fixed distance (a lag) from each other. So we can compare every point in the time series to the measurement taken 10 minutes later, 20 minutes later, etc. We may then find out that every point in our graph is strongly correlated with the measurement taken 10 minutes later and with the measurement taken 60 minutes later.
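To make the definition concrete, here is a minimal sketch of the usual sample autocorrelation at a given lag. The function name `autocorrelation` and the synthetic `cpu` series (one reading every 10 minutes, with a made-up hourly peak) are my own assumptions for illustration, not data from a real system.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag (in samples)."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    # Sum of squared deviations over the whole series (the lag-0 term).
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Products of deviations for every pair of points `lag` samples apart.
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance

# Synthetic CPU utilization: 2 days of 10-minute samples, peaking every 6 samples (60 minutes).
cpu = [50 + 30 * math.cos(2 * math.pi * t / 6) for t in range(288)]

lag_1 = autocorrelation(cpu, 1)  # 10 minutes later
lag_3 = autocorrelation(cpu, 3)  # 30 minutes later, half a cycle away
lag_6 = autocorrelation(cpu, 6)  # 60 minutes later, a full cycle away
```

In this made-up series the value at lag 6 comes out close to 1 and the value at lag 3 close to -1, which is the signature of an hourly cycle.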

But does that imply causation?
We normally don’t assume that the current value of the CPU caused the value that the CPU has in 60 minutes. It makes much more sense to assume that there is a third factor that causes the CPU to peak every 60 minutes. This effect is also called seasonality. The weather today is strongly correlated with the weather on Dec 9th 2008. The third factor in this case is the circles our planet makes around the sun.

However, the autocorrelation with the point immediately following the current value often does imply causation of sorts. If you want to make a good guess about the value of IBM stock tomorrow, your best bet is to guess that it will be the same as the value today. Stock values usually have very strong short-term autocorrelation, and we can say that tomorrow’s value is today’s value plus some error. IBM stock prices are normally stable, so the error is normally small. So you can say that tomorrow’s stock value is caused by today’s value. In a similar way, CPU utilization 2 minutes from now can be predicted to be about the same as CPU utilization right now.
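The “tomorrow is today plus some error” idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `prices` series is a simulated random walk (not real IBM data), and the helper names `naive_forecast_errors` and `mean_forecast_errors` are hypothetical. It compares guessing “same as the previous value” with guessing “the average of everything seen so far”.

```python
import random

def naive_forecast_errors(series):
    """Absolute errors when each value is predicted by the value just before it."""
    return [abs(series[t] - series[t - 1]) for t in range(1, len(series))]

def mean_forecast_errors(series):
    """Absolute errors when each value is predicted by the mean of all earlier values."""
    errors = []
    running_sum = series[0]
    for t in range(1, len(series)):
        errors.append(abs(series[t] - running_sum / t))
        running_sum += series[t]
    return errors

# Simulated prices: each value is the previous one plus a small random error.
rng = random.Random(42)
prices = [100.0]
for _ in range(499):
    prices.append(prices[-1] + rng.gauss(0, 1))

naive_mae = sum(naive_forecast_errors(prices)) / (len(prices) - 1)
mean_mae = sum(mean_forecast_errors(prices)) / (len(prices) - 1)
print(f"naive forecast MAE: {naive_mae:.2f}, historical-mean forecast MAE: {mean_mae:.2f}")
```

For a series built this way, the naive forecast should have the smaller average error, because all the predictable information is already in the latest value.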

I’m hesitant to call this “causation”, because although the stock price today does cause the stock price of tomorrow (plus an error!), the “real” cause is that stock prices and CPUs behave in a specific way. On the other hand, we know that they behave in a specific way because we measured the autocorrelation, modeled it and made predictions that work. So in two important uses of causation, understanding the behavior of the thing we measured and making predictions, we can say that we have a cause-and-effect relation, albeit one that is a bit less intuitive than usual.

If you dig the idea of explaining and predicting CPU and other important performance measurements by using only the measure itself without looking for other explaining factors, then you should definitely attend my presentation about time series analysis at RMOUG. I’ll show exactly how we find autocorrelations and how to predict future values and we’ll discuss whether or not this is a useful method.

 

Goodness of Data December 2, 2009

Filed under: Analysis — prodlife @ 4:16 pm

I’m working on my time-series analysis presentation for RMOUG, and one of the topics I may include (or may not, because it is only marginally relevant) is that of data quality.

You cannot do good analysis and get meaningful results if your data is distorted. If you start your analysis with garbage you’ll end up with garbage.

So, before starting your analysis, you have to look at the data and make sure it doesn’t have any obvious problems. One of the favorite ways of doing this is by literally looking at the data. Plotting a graph of the data is the fastest and easiest way to spot issues.

What kinds of issues?

  1. Outliers: It is easy to spot outliers in a graph (especially if you use a box plot), but not all outliers are bad data. It is important to differentiate between extreme but legitimate data and bad measurements. In order to do that, you really need to understand the data you are looking at and the system it describes. There is a world of difference between “Yes, the system did hit 75 load average that morning” and “Oh yeah, that’s the morning when we hit a freaky problem with the /proc system and top reported 75 load average even though the system was not loaded at all”. Some outliers don’t offer a ready explanation – the data says load average was 75, but you have no clue if it was a real issue or not. In those cases my tendency is to err on the side of including the data – if I can’t explain why the data is bad and should be excluded, then I keep it.
  2. Missing data: Missing data isn’t always bad. Systematically missing data is fine – ASH samples active Oracle sessions once a second and only every tenth sample is kept in AWR, so you can say that data in between is missing. But that’s just sampling at regular intervals and all analysis tools deal with this, so you are safe. The bad kind of missing data is the biased missing data – if you monitor your sessions from an external tool that queries v$session, it is likely that when the load is very high, the tool will be unable to connect and query your database. So the data isn’t randomly missing – it is always missing the points of the highest load. Your data set will lack the most important data and, worse, it will show your system load as much lower than it really is. Obviously any analysis based on this data will be hopelessly flawed.
  3. Breaks: By breaks I mean specific points in time where the nature of the data completely changed. Example: You are looking at a month of response-time data, and on Nov 15th someone upgraded the SAN. Before the upgrade the average response time was 8s with a standard deviation of 4s, but after it the average response time was 3s with a standard deviation of 0.01s. It should be obvious that you can’t analyze November as one single time series, because the behavior of the system changed dramatically. Any forecast based on the first half of the month will be completely irrelevant to the second half.
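A box plot flags outliers with a simple rule that is easy to apply without drawing anything. The sketch below uses the common 1.5 × IQR fences. The function name `box_plot_outliers` and the `load_average` numbers are made up for illustration. Note that it only flags candidates: deciding whether a flagged point is bad data still takes knowledge of the system.

```python
import statistics

def box_plot_outliers(series, whisker=1.5):
    """Return (index, value) pairs lying outside the box plot whiskers."""
    # Quartiles as computed by the standard library (default "exclusive" method).
    q1, _median, q3 = statistics.quantiles(series, n=4)
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    return [
        (i, x) for i, x in enumerate(series) if x < low_fence or x > high_fence
    ]

# Hypothetical load average samples with one suspicious spike.
load_average = [1.2, 0.9, 1.5, 1.1, 1.3, 75.0, 1.0, 1.4, 0.8, 1.6]
flagged = box_plot_outliers(load_average)
print(flagged)
```

Here only the 75.0 reading falls outside the fences. Whether it was a real load spike or a /proc glitch is not something the code can answer.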

How do we fix the bad data? Here are a few relatively simple suggestions.

  1. Replace outliers and missing data with an average value. Note that if your data has a strong trend or seasonality (higher load on Monday morning for example), you will need to use local average values because the overall average will be meaningless. The process of replacing missing data (or outliers) with meaningful average data is also known as interpolation. You do this whenever you display your data as a graph with one continuous line instead of a series of dots. You only sampled the system every 10 seconds, but you show the graph as if you also have all the data in between. This is probably the most common and most intuitive way of fixing data.
  2. If the missing data has a bias (data is only missing when the load is very high) then replacing missing data with averages is not a good idea. You know that the missing data was not average. In this case you can replace the missing values with maximum values of the data you did measure.
  3. If the data has breaks in it, analyze each part of the data separately and fit a different model to each part. In forecasts you will probably want to use only the latest model (if you have reasons to believe that this behavior will continue into the future).
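The first two suggestions can be sketched as follows. This is a minimal illustration, assuming missing samples are represented as `None`. The function names, the `window` parameter and the `sessions` values are all hypothetical.

```python
def fill_with_local_average(series, window=2):
    """Replace None entries with the mean of known values at most `window` positions away."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            nearby = series[max(0, i - window): i + window + 1]
            known = [x for x in nearby if x is not None]
            if known:  # leave the gap as-is if nothing nearby was measured
                filled[i] = sum(known) / len(known)
    return filled

def fill_with_maximum(series):
    """Replace None entries with the largest measured value (for biased gaps)."""
    known = [x for x in series if x is not None]
    if not known:
        return list(series)
    peak = max(known)
    return [peak if x is None else x for x in series]

# Hypothetical active-session counts with one missing sample.
sessions = [10.0, 12.0, None, 14.0, 16.0]
local = fill_with_local_average(sessions, window=1)
peak_filled = fill_with_maximum(sessions)
print(local)
print(peak_filled)
```

The local average only looks at nearby points, so it follows trend and seasonality instead of flattening them. The maximum fill is deliberately pessimistic and only makes sense when you know the gaps happened under high load.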
 

 