Since I’m starting employment with Pythian on Monday, this is the last technical post that is exclusive to this blog. Future technical tips, discoveries and stories will appear on the Pythian blog, and either copied here or linked to from here. BTW. I already made my first post in the Pythian blog today, as the editor of the Log Buffer – and I’m not even an employee yet
So, lets end this with a bang. Real life story on how Daylight Saving Time caused one query to take twice as long as before! And its not even an April Fools joke
Three days ago the customer called me and said: Listen, since daylight saving time started, this process that should take two hours is now taking five. My manager gets his morning report late and he is unhappy.
I told him that I’ll be glad to help with his performance issue, but this can’t possibly related to day light saving time. Besides, since the process runs every night but some nights the performance is fine, the problem is probably in the nature of the data that changes from day to day.
In any case, we started working on the problem with our usual methods: Instrumented the PL/SQL of the process so we can know exactly what part is slowing down the process, ran trace, analyzed the results of traces, looks at AWR reports, look at query plans, etc.
Eventually, we found a query, buried within a view, whose explain plan looked completely unreasonable – First of all, it included a bunch of “nested loops” and even “cartesian merge join” between tables that I knew were quite large. Second, those large tables had estimated number of rows of 1.
I checked DBA_TABLES and saw that num_rows was indeed 0, and last analyzed was yesterday at 10pm. This didn’t make any sense – I personally collected statistics for this table yesterday, it had around 1M rows and it was at 3pm.
Why does it show 0 rows at 10pm?
A quick check with the customer confirmed that the process truncates this table and copies data into it before our slow query runs. I assumed that “truncate table” resets the statistics and asked the customer to modify the process to analyze the table after all the data is copied in, but before the slow query runs.
But two questions still bothered me: Does truncate really resets statistics? And why did this problem start occurring after day light saving time?
The answer to the first question is “no”, truncate does not reset statistics:
SQL> analyze table sitescope.test1 compute statistics; Table analyzed. SQL> select table_name,num_rows from dba_tables where owner='SITESCOPE' and table_name='TEST1'; TABLE_NAME NUM_ROWS ------------------------------ ---------- TEST1 4 SQL> truncate table sitescope.test1; Table truncated. SQL> select table_name,num_rows from dba_tables where owner='SITESCOPE' and table_name='TEST1'; TABLE_NAME NUM_ROWS ------------------------------ ---------- TEST1 4
The answer to the second question was found using DBA_TAB_STATS_HISTORY.
Turns out that there is an automated process that collects statistics on this schema every night. Until DST, it ran every night at 9pm, before the process started running – so the table had data and the collected statistics were accurate. After DST, the process started running at 10pm (obviously the system running the process doesn’t know about DST). At 10pm the table was already truncated, but was not loaded yet, so the process collected statistics for a table that was temporarily empty.
Tada! Daylight saving time caused a severe performance problem!
P.S. No indexes were rebuilt while attempting to resolve the issue
Like many other DBAs, I’ll be attending RMOUG training days conference on Feb 17-18 in Denver. I’ll give two presentations in the conference. On the same day, just thinking about it makes me exhausted.
The first presentation is “Everything DBAs need to know about TCP/IP Networks”. Here’s the paper and the slides. I’ll also present this at NoCOUG‘s winter conference in Pleasanton, CA. Maybe you prefer to catch me there.
I still have time to improve the presentations and papers – so comments are very welcome
Everyone talks about how correlation doesn’t imply causation, but no one says what autocorrelation implies.
Maybe its because we don’t talk about autocorrelation at all
Lets start talking about autocorrelation by saying what it is:
First of all, autocorrelation is a concept related to time series. Time series is an ordered series of measurements taken at intervals over time. You know how we measure CPU utilization every 10 minutes and then display nice “CPU over time” graphs? thats a time series.
Autocorrelation is the correlation between points in the time series. So we can compare every point in the time series to the measurement taken 10 minutes later, 20 minutes later, etc. And we can find out that every point in our graph is strongly correlated with the measurement taken 10 minutes later and the point taken 60 minutes later.
But does that imply causation?
We normally don’t assume that the current value of the CPU caused the value that the CPU has in 60 minutes. It makes much more sense to assume that there is a third factor that causes the CPU to peak every 60 minutes. This effect is also called seasonality. The weather today is strongly correlated with the weather on Dec 9th 2008. The third factor in this case is the circles our planet makes around the sun.
However the autocorrelation with the point immediately following the current value, often does imply causation of sorts. If you want to make a good guess about the value of IBM stock tomorrow, your best bet is to guess that it will be the same as the value today. Stock values usually have very strong short-term autocorrelation, and we can say that tomorrow’s value is todays value plus some error. IBM stock prices are normally stable, so the error is normally small. So you can say that today’s stock value is caused by today’s value. In a similar way the CPU in 2 minutes can be predicted to be identical to the CPU right now.
I’m hesitant to call this “causation”, because although the stock price today does cause the stock price of tomorrow (plus an error!), the “real” cause is that stock prices and cpus behave in a specific way. On the other hand, we know that they behave in a specific way because we measured the autocorrelation, modeled it and made predictions that work. So in two important uses of causation, understanding the behavior of the thing we measured and making predictions, we can say that we have a cause-and-effect relation. Albeit a bit less intuitive that usual.
If you dig the idea of explaining and predicting CPU and other important performance measurements by using only the measure itself without looking for other explaining factors, then you should definitely attend my presentation about time series analysis at RMOUG. I’ll show exactly how we find autocorrelations and how to predict future values and we’ll discuss whether or not this is a useful method.