Goodness of Data

I’m working on my time-series analysis presentation for RMOUG, and one of the topics I may include (or may not, because it is only marginally relevant) is that of data quality.

You cannot do good analysis and get meaningful results if your data is distorted. If you start your analysis with garbage you’ll end up with garbage.

So, before starting your analysis, you have to look at the data and make sure it doesn’t have any obvious problems. One of the favorite ways of doing this is by literally looking at the data. Plotting a graph of the data is the fastest and easiest way to spot issues.

What kinds of issues?

  1. Outliers: It is easy to spot outliers in a graph (especially if you use a box plot), but not all outliers are bad data. It is important to differentiate between extreme but legitimate data and bad measurements. In order to do that, you really need to understand the data you are looking at and the system it describes. There is a world of difference between “Yes, the system did hit 75 load average that morning” and “Oh yeah, thats the morning when we hit a freaky problem with the /proc system and top reported 75 load average even though the system was not loaded at all”. Some outliers don’t offer a ready explanation – the data says load average was 75, but you have no clue if it was a real issue or not. In those cases my tendency is to err on the side of including the data – if I can’t explain why the data is bad and should be excluded, then I keep it.
  2. Missing data: Missing data isn’t always bad. Systematically missing data is fine – ASH samples Oracle sessions every 10 seconds, so you can say that data in between is missing. But thats just random sampling and all analysis tools deal with this, so you are safe. The bad kind of missing data is the biased missing data – if you monitor your sessions from an external tool that queries v$session, it is likely that when the load is very high, the tool will be unable to connect and query your database. So the data isn’t randomly missing – it is always missing the points of the highest load. Your data set will lack the most important data, and worse it will show your system load as much lower than it really is. Obviously any analysis based on this data will be hopelessly flawed.
  3. Breaks: By breaks I mean specific points in time where the nature of the data completely changed. Example: You are looking at monthly response-times data, and at Nov 15th someone upgraded the SAN. Before the upgrade average response time was 8s with standard deviation of 4s, but after the average response time was 3s with standard deviation of 0.01s. It should be obvious that you can’t analyze November as one single time series, because the behavior of the system changed dramatically. Any forecast made based on the first half of the month will be completely irrelevant.

How do we fix the bad data? Here are few relatively simple suggestions.

  1. Replace outliers and missing data with average value. Note that if your data has strong trend or seasonality (higher load on Monday morning for example), you will need to use local average values because overall average will be meaningless. The process of replacing missing data (or outliers) with meaningful average data is also known as interpolation. You do this whenever you display your data as a graph with one continuous line instead of a series of dots. You only sampled the system every 10 seconds, but your show the graph as if you also have all the data in between. This is probably the most common and most intuitive way of fixing data.
  2. If the missing data has a bias (data is only missing when the load is very high) then replacing missing data with averages is not a good idea. You know that the missing data was not average. In this case you can replace the missing values with maximum values of the data you did measure.
  3. If the data has breaks in it, analyze each part of the data separately and fit a different model to each part. In forecasts you will probably want to use only the latest model (if you have reasons to believe that this behavior will continue into the future).


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s