Daylight Saving Time Causes Performance Issues!

Since I’m starting employment with Pythian on Monday, this is the last technical post that is exclusive to this blog. Future technical tips, discoveries and stories will appear on the Pythian blog, and will either be copied here or linked to from here. BTW, I already made my first post on the Pythian blog today, as the editor of the Log Buffer – and I’m not even an employee yet 🙂

So, let’s end this with a bang: a real-life story of how Daylight Saving Time caused one query to take twice as long as before! And it’s not even an April Fools’ joke 🙂

Three days ago the customer called me and said: “Listen, since daylight saving time started, this process that should take two hours is now taking five. My manager gets his morning report late and he is unhappy.”

I told him that I’d be glad to help with his performance issue, but that it couldn’t possibly be related to daylight saving time. Besides, since the process runs every night and on some nights the performance is fine, the problem was probably in the nature of the data, which changes from day to day.

In any case, we started working on the problem with our usual methods: we instrumented the PL/SQL of the process so we could know exactly which part was slowing it down, ran traces, analyzed the trace results, looked at AWR reports, looked at query plans, etc.

Eventually, we found a query, buried within a view, whose explain plan looked completely unreasonable – first of all, it included a bunch of “nested loops” and even a “cartesian merge join” between tables that I knew were quite large. Second, those large tables had an estimated row count of 1.

I checked DBA_TABLES and saw that num_rows was indeed 0 (the optimizer never estimates fewer than one row, which is why the plan showed 1), and last_analyzed was yesterday at 10pm. This didn’t make any sense – I had personally collected statistics for this table yesterday at 3pm, and it had around 1M rows.

So why did it show 0 rows at 10pm?

A quick check with the customer confirmed that the process truncates this table and copies data into it before our slow query runs. I assumed that “truncate table” resets the statistics, and asked the customer to modify the process to analyze the table after all the data is copied in, but before the slow query runs.
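
That fix is a one-liner in the load procedure. A minimal sketch of what it could call once the copy finishes (the schema and table names below are made up for illustration, they are not the customer’s):

-- Gather fresh statistics right after the nightly load completes,
-- so the optimizer sees the real row count instead of an empty table.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'APP_OWNER',    -- hypothetical schema
    tabname          => 'DAILY_FACTS',  -- hypothetical table
    cascade          => TRUE,           -- gather index statistics as well
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
END;
/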

But two questions still bothered me: does truncate really reset statistics? And why did this problem start occurring after daylight saving time?

The answer to the first question is “no”, truncate does not reset statistics:

SQL>  analyze table  sitescope.test1 compute statistics;
Table analyzed.

SQL> select table_name,num_rows from dba_tables where owner='SITESCOPE' and table_name='TEST1';
TABLE_NAME                       NUM_ROWS
------------------------------ ----------
TEST1                                   4

SQL> truncate table sitescope.test1;
Table truncated.
SQL> select table_name,num_rows from dba_tables where owner='SITESCOPE' and table_name='TEST1';
TABLE_NAME                       NUM_ROWS
------------------------------ ----------
TEST1                                   4

The answer to the second question was found using DBA_TAB_STATS_HISTORY.
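
For the curious, the view can be queried along these lines (reusing the test table from above; the statistics retention period limits how far back the history goes):

select table_name, stats_update_time
from   dba_tab_stats_history
where  owner = 'SITESCOPE'
and    table_name = 'TEST1'
order  by stats_update_time;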

It turns out that there is an automated job that collects statistics on this schema every night. Until DST, the stats job ran at 9pm, before the customer’s load process started – so the table still had its data and the collected statistics were accurate. After DST, the stats job effectively started running at 10pm (the system scheduling it obviously doesn’t adjust for DST). At 10pm the table had already been truncated but was not yet reloaded, so statistics were collected on a temporarily empty table.

Tada! Daylight saving time caused a severe performance problem!

P.S. No indexes were rebuilt while attempting to resolve the issue 🙂


Goodness of Data

I’m working on my time-series analysis presentation for RMOUG, and one of the topics I may include (or may not, because it is only marginally relevant) is that of data quality.

You cannot do good analysis and get meaningful results if your data is distorted. If you start your analysis with garbage you’ll end up with garbage.

So, before starting your analysis, you have to look at the data and make sure it doesn’t have any obvious problems. One of my favorite ways of doing this is literally looking at the data: plotting a graph is the fastest and easiest way to spot issues.

What kinds of issues?

  1. Outliers: It is easy to spot outliers in a graph (especially if you use a box plot), but not all outliers are bad data. It is important to differentiate between extreme but legitimate data and bad measurements, and in order to do that, you really need to understand the data you are looking at and the system it describes. There is a world of difference between “Yes, the system did hit a 75 load average that morning” and “Oh yeah, that’s the morning when we hit a freaky problem with /proc and top reported a 75 load average even though the system was not loaded at all”. Some outliers don’t offer a ready explanation – the data says the load average was 75, but you have no clue whether it was a real issue or not. In those cases my tendency is to err on the side of including the data – if I can’t explain why the data is bad and should be excluded, then I keep it. (A simple query for flagging candidate outliers is sketched right after this list.)
  2. Missing data: Missing data isn’t always bad. Systematically missing data is fine – ASH, for example, samples active sessions at a fixed interval (every second, with every tenth sample persisted to AWR), so you could say the data in between is missing. But that is just regular, unbiased sampling; all analysis tools deal with it, so you are safe. The bad kind of missing data is biased missing data – if you monitor your sessions from an external tool that queries v$session, it is likely that when the load is very high, the tool will be unable to connect and query your database. So the data isn’t randomly missing – it is always missing the points of highest load. Your data set will lack the most important data and, worse, it will show your system load as much lower than it really is. Obviously any analysis based on this data will be hopelessly flawed.
  3. Breaks: By breaks I mean specific points in time where the nature of the data completely changed. Example: you are looking at a month of response-time data, and on Nov 15th someone upgraded the SAN. Before the upgrade the average response time was 8s with a standard deviation of 4s; after it, the average was 3s with a standard deviation of 0.01s. It should be obvious that you can’t analyze November as one single time series, because the behavior of the system changed dramatically. Any forecast based on the first half of the month will be completely irrelevant.
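
As a side note on outliers, you don’t have to rely only on eyeballing the graph – the box plot rule can be written as a query that flags candidate points for you. This is just a sketch against a hypothetical LOAD_SAMPLES(sample_time, load_avg) table, marking anything more than 1.5 times the interquartile range beyond the quartiles:

with quartiles as (
  -- first and third quartile of the measured values
  select percentile_cont(0.25) within group (order by load_avg) as q1,
         percentile_cont(0.75) within group (order by load_avg) as q3
  from   load_samples
)
select s.sample_time, s.load_avg
from   load_samples s, quartiles q
where  s.load_avg > q.q3 + 1.5 * (q.q3 - q.q1)  -- suspiciously high
or     s.load_avg < q.q1 - 1.5 * (q.q3 - q.q1); -- suspiciously low

Whether a flagged point is a bad measurement or a genuinely bad morning is still a judgment call – the query only tells you where to look.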

How do we fix the bad data? Here are a few relatively simple suggestions.

  1. Replace outliers and missing data with an average value. Note that if your data has a strong trend or seasonality (higher load on Monday mornings, for example), you will need to use local averages, because the overall average will be meaningless. The process of replacing missing data (or outliers) with meaningful average values is also known as interpolation. You do this whenever you display your data as a graph with one continuous line instead of a series of dots: you only sampled the system every 10 seconds, but you show the graph as if you also had all the data in between. This is probably the most common and most intuitive way of fixing data (see the SQL sketch after this list).
  2. If the missing data has a bias (data is only missing when the load is very high), then replacing it with averages is not a good idea – you know the missing values were not average. In this case you can replace them with the maximum of the data you did measure.
  3. If the data has breaks in it, analyze each part of the data separately and fit a different model to each part. In forecasts you will probably want to use only the latest model (if you have reasons to believe that this behavior will continue into the future).
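
To make the first two suggestions concrete, here is a rough sketch in SQL against the same hypothetical LOAD_SAMPLES table: missing (NULL) measurements are filled with a local average over the neighbouring samples and, for the biased-missing-data case, with the local maximum instead:

select sample_time,
       load_avg,
       -- suggestion 1: fill a gap with the average of nearby samples
       nvl(load_avg,
           avg(load_avg) over (order by sample_time
                               rows between 6 preceding and 6 following)) as local_avg_fill,
       -- suggestion 2: if data goes missing only under heavy load,
       -- be pessimistic and use the local maximum instead
       nvl(load_avg,
           max(load_avg) over (order by sample_time
                               rows between 6 preceding and 6 following)) as local_max_fill
from   load_samples
order  by sample_time;

The window of six samples on each side is arbitrary – pick one narrow enough to respect the trend and seasonality in your own data.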