More Troubleshooting Monsters – Correlation and Causation
Posted: July 24, 2009
We have this application that hangs for a few minutes almost every hour. Why? No one knows. But naturally, the database is a suspect.
The developers responsible for this application requested an AWR report of the last hour. No problem. The AWR report shows a significant amount of time spent waiting for locks on a specific table. It happened to be a timesheet table. So the developers say “Aha! We found the cause of the problem – contention on the timesheet table!”
Oh no, I say. You did not find the cause of THE PROBLEM, you found the cause of A PROBLEM. What you are now looking at is our usual Monday morning slowdown. Timesheets are due at 12pm, so everyone is inserting their timesheets at 10am, causing contention and slowdown. I’ll be forever grateful if you fix it, but it is unrelated to THE PROBLEM of the application hangs. Here are a few AWR reports from Tuesday morning, when the application still hangs, but we no longer have any locks.
It took me about 4 conference calls and 3 emails to get the following point across: Even if the locks occurred at every hang, you couldn’t be completely sure they are the cause of the hangs. When the locks occur only once a week and the hangs occur every hour, you can be quite sure that the locks don’t cause the hangs. If they did cause the hangs, they would appear at every hang.
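The same sanity check can be written down as a few lines of Python. This is only a sketch of the reasoning, not something we actually ran: the function name, the five-minute window, and all of the timestamps are made up for illustration, and in practice the times would come from your own monitoring and AWR data.

```python
from datetime import datetime, timedelta


def fraction_of_hangs_with_symptom(hang_times, symptom_times,
                                   window=timedelta(minutes=5)):
    """Return the share of hangs that have the symptom within `window`.

    A suspected cause has to show up at (nearly) every hang. A low
    share rules it out; a high share is still only correlation.
    """
    if not hang_times:
        raise ValueError("need at least one hang to compare against")
    matched = sum(
        1 for hang in hang_times
        if any(abs(hang - seen) <= window for seen in symptom_times)
    )
    return matched / len(hang_times)


# Hypothetical data: hangs at 9, 10 and 11am on Monday and on Tuesday.
monday = datetime(2009, 7, 20)
hangs = [monday + timedelta(days=day, hours=hour)
         for day in (0, 1) for hour in (9, 10, 11)]

# Hypothetical data: lock waits seen only during the Monday timesheet rush.
lock_waits = [monday + timedelta(hours=10, minutes=2)]

coverage = fraction_of_hangs_with_symptom(hangs, lock_waits)
print(f"Lock waits were present at {coverage:.0%} of the hangs")
```

With these invented numbers the lock waits line up with only one hang out of six, so they cannot explain the hourly hangs. Had the share come out at 100%, the locks would have become a suspect worth investigating, not a proven cause.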
Tanel Poder is going to give his Advanced Troubleshooting course at NoCOUG in November (more about this soon!). Some people asked me: how do I know if I’m advanced enough for the course? Well, being able to understand correlation, and therefore avoid jumping to conclusions about what caused an event, would be a great start.