Don’t use AWR for wait event analysisPosted: September 5, 2007
Tonight, at midnight, inserts slowed to a crawl at one of our main production databases. This database is used to process about 300 inserts every second, but for about 20 minutes starting at midnight, each insert took almost a minute. The problem resolved by itself, but now the application owner wants answers – what caused the slowdown at midnight?
A co-worker ran AWR and found that the top wait event is Enq: TM-Contention. A lock! This looks promising, but inserts usually don’t lock. Thats one of the most important principles of Oracle – readers don’t block writers and writers don’t block readers. The most locking I’d expect from insert will be a latch contention on a hot block. But co-worker is a very experienced DBA and he knew that TM-Contention can happen from non-indexed foreign key, and he started looking for one.
At this point I arrived at the office, and he updated me on the situation.
I don’t like AWR at all, so I took out my favorite tool – Confio. This is a nice graphical tool that helps with wait event analysis, but you can probably get pretty much the same data by digging into v$active_session_history yourself.
Confio also showed that TM-Contention is the largest wait event, but it also showed the queries waiting (lock table … in exclusive mode) and the machines. The table locked and the machines had nothing to do with the insert issue. A completely different part of the application. So, we can skip this event. I don’t think you can get this kind of information from AWR!
The second and third highest events are “db file sequential read” and “gc current request”. These events are so typical of selects (usually with a bad index), that if I saw them in AWR I’d probably assume they are unrelated, but Confio showed, clear as a day, that all the queries with this wait are inserts. I’m not talking about a fancy “insert into … select”, I’m talking about a plain “insert into … values”.
I still don’t know why a plain insert had to wait on “db file sequential read”, and why it had to do it only at midnight and stopped that 20 minutes later. But at least now I know where to look!
Note how AWR not only lead us in the wrong direction, it also doesn’t give enough information to follow up on the issue – like queries, tables and machines involved.