More Troubleshooting Monsters – Correlation and Causation

We have this application that hangs for a few minutes almost every hour. Why? No one knows. But naturally, the database is a suspect.

The developers responsible for this application requested an AWR report of the last hour. No problem. The AWR report shows a significant amount of time spent waiting for locks on a specific table. It happened to be a timesheet table. So the developers say, “Aha! We found the cause of the problem – contention on the timesheet table!”.
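
If you are curious where a finding like that comes from, here is roughly the kind of query you could run against the ASH history yourself instead of waiting for the next canned report. This is a sketch, not the exact query anyone used, and DBA_HIST_ACTIVE_SESS_HISTORY requires a Diagnostics Pack license. It simply pulls the lock-wait samples from the last day and groups them by the object and statement they point at:

    -- Sketch only: which objects and statements do the recent lock waits point at?
    SELECT o.owner,
           o.object_name,
           ash.sql_id,
           COUNT(*) AS wait_samples
    FROM   dba_hist_active_sess_history ash
           JOIN dba_objects o
             ON o.object_id = ash.current_obj#
    WHERE  ash.event LIKE 'enq: TX%'        -- row lock contention waits
    AND    ash.sample_time > SYSDATE - 1
    GROUP  BY o.owner, o.object_name, ash.sql_id
    ORDER  BY wait_samples DESC;

In our case a query like this would have pointed at the timesheet table, which is the same thing the developers read off the AWR report.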

Oh no, I say. You did not find the cause of THE PROBLEM, you found the cause of A PROBLEM. What you are looking at now is our usual Monday morning slowdown. Timesheets are due at 12pm, so everyone is inserting their timesheets at 10am, causing contention and a slowdown. I’ll be forever grateful if you fix it, but it is unrelated to THE PROBLEM of the application hanging every hour. Here are a few AWR reports from Tuesday morning, when the application still hangs but we no longer have any locks.

It took me about 4 conference calls and 3 emails to get the following point across: even if the locks occurred at every hang, you still couldn’t be completely sure they were the cause of the hangs. But when the locks occur only once a week and the hangs occur every hour, you can be quite sure that the locks don’t cause the hangs. If they did, they would appear at every hang.
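
If you prefer a query to a stack of AWR reports, the same correlation argument is easy to make from the ASH history. Again, just a sketch under the same Diagnostics Pack caveat: bucket the samples by hour, put the lock-wait samples next to the totals, and line the result up against the hours when the application actually hung.

    -- Sketch: per hour over the last week, how many ASH samples were enqueue (lock) waits?
    -- If the hangs happen every hour but the lock waits only show up on Monday morning,
    -- the locks are not what is causing the hangs.
    SELECT TRUNC(CAST(sample_time AS DATE), 'HH') AS sample_hour,
           COUNT(*)                               AS total_samples,
           COUNT(CASE WHEN event LIKE 'enq:%'
                      THEN 1 END)                 AS lock_wait_samples
    FROM   dba_hist_active_sess_history
    WHERE  sample_time > SYSDATE - 7
    GROUP  BY TRUNC(CAST(sample_time AS DATE), 'HH')
    ORDER  BY sample_hour;

Hours where the application hung but lock_wait_samples is zero settle the argument on one screen, which is more than I can say for the conference calls.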

Tanel Poder is going to give his Advanced Troubleshooting seminar at NoCoug in November (more about this soon!). Some people asked me how they can tell whether they’re advanced enough for the course. Well, being able to understand correlation, and therefore to avoid jumping to conclusions about what caused an event, would be a great start.

7 Comments on “More Troubleshooting Monsters – Correlation and Causation”

  1. chet says:

    Sounds like a design issue to me, especially if it’s happening off peak.

    Did you ever get a final resolution?

    I remember once trying to figure out why we were getting “cannot serialize transaction” errors (from a dev’s perspective, anyway). I knew the DBA could explain the why…but I believed, though never proved, that it had to do with poor design. There was a trigger on the table in question, and the package code inserted into other tables (which themselves had triggers)…sort of a cascading effect. You should have seen the flow chart…wow.

    Or maybe that was the deadlocks we were receiving at 400 transactions a day?

  2. Noons says:

    Good post, Chen. It’s amazing how easily folks “find the root cause” without checking whether they are actually looking at the right problem!

    Had a recent instance here where our redo logs got exhausted. They are already @1GB.

    Immediate reaction from the developers: “increase redo size”.

    Immediate reaction from our DBAs, after examining what in reality was going on: “you forgot to put a COMMIT in that PL/SQL package loop!”.

    Same root problem: looking for “quick fixes”. ;)

  3. prodlife says:

    For anyone who is still wondering what the root cause of the issue was – we still have no clue, but the application team upgraded their JDBC driver and the problem went away.

  4. Fran says:

    Hi everybody,
    I strongly recommend Tanel’s seminar to anybody with medium-to-strong Oracle knowledge.
    For me, at least, it was by far the best Oracle seminar I’ve ever been to.

    Oh, and by the way, regarding the issue in the post, one of the slides in the course was this basic guideline: “Understand first, then fix”. It says it all :)

    Chen, you have a great blog, keep it up.

    Fran

