A year ago I was handling a rather confusing performance issue. It was the usual problem – a query that took forever to run. While checking the issue, I discovered that one of the main tables that this query was querying had amazing size variance at different times it held anything between 3 rows to 3 million rows. This is obviously a design mistake that can indeed cause performance problems. An experienced DBA might have configured a job to collect statistics very frequently, or maybe something about bind variable peeking, or just go to the developers and demand an immediate fix.
I’m not an experienced DBA. I noticed that the indexes were highly fragmented with lots of deleted leaf nodes, and so I decided to rebuild the indexes. Online rebuild, so users will not be affected and parallel – so it will go faster.
It was a disaster. First of all, rebuilding online does affect performance. A lot. Users did notice. The table was effectively locked until we killed the rebuild. Its been a year and Oracle support is still silent about whether this is an expected behavior of online rebuild. Also, since the index was rebuilt with parallel option, many of the queries started running in parallel, which also had horrible effects and we had to fix this issue too.
After we cleaned up the entire mess, I got a stern lecture from Senior DBA: “You can’t solve performance problems by randomly rebuilding indexes. You must throughly understand the problem first”.
Fast forward to this weekend. Customer complains about very slow queries. Quick check on v$sql shows that each execution retrieves about 100 rows, but does about 10 million buffer gets. Ouch. The explain plan looked reasonable and I couldn’t see any cause for the excessive gets. I decide to wait with this to Monday, run a trace and see what is really happening there. I never got to run that trace. Senior DBA got home later that day, saw the emails about that query and immediately decides to rebuild indexes!
Needless to say, rebuilding indexes works great when you really understand the problem. It took 2 hours, caused no additional performance issues, and after it was done, the query required only 1000 buffer gets.
I still have some suspicions that the rebuild didn’t change much and what really caused the difference was the fact that Senior DBA also collected statistics on the entire table and all its indexes immediately after the rebuild, but we can’t really know, and as far as I saw – the plan did change.
Regardless of the exact facts, this reminded me of an AI Koan:
A novice was trying to fix a broken Lisp machine by turning the power off and on. Knight, seeing what the student was doing spoke sternly: “You can not fix a machine by just power-cycling it with no understanding of what is going wrong.” Knight turned the machine off and on. The machine worked.
(Knight is the inventor of the Lisp machine).
You can read more AI Koans here. Highly recommended.