Automated Root Cause Analysis

Posted: February 5, 2010
I’ve run into multiple products that claim to offer automated root cause analysis, so don’t think that I’m ranting against a specific product or vendor. My problem is with the concept itself.
The problem these products are trying to solve: IT staff spend much of their time troubleshooting issues, which essentially means finding the cause of effects they don’t like. What is causing high response times on this report? What is causing the low disk throughput?
If we can somehow automate the task of finding a cause for a problem, we’ll have a much more efficient IT department.
The idea that troubleshooting can be automated is rather seductive. I’d love to have a “What is causing this issue?” button. My problem is with how these vendors go about building one.
Most of them use variations of a very similar technique:
All these vendors already have monitoring software, so they usually know when there is a problem. They also know about many other things that happen at the same time. So if their software detects that response times go up, it can look at disk throughput, DB CPU, swap, load average, number of connections, and so on.
When they see that CPU goes up together with response times – Tada! Root cause found!
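To make the pattern concrete, here is a toy sketch in Python of what this approach boils down to. The metric names, data, and correlation threshold are all made up for illustration; no vendor publishes their actual logic, and this is the flawed idea I’m describing, not a recommendation:

```python
import numpy as np

def naive_root_cause(response_time, metrics, threshold=0.8):
    """Flag any metric whose correlation with response time exceeds
    a threshold as a 'root cause'. This is the vendor logic described
    above: correlation is treated as causation."""
    causes = []
    for name, series in metrics.items():
        r = np.corrcoef(response_time, series)[0, 1]
        if abs(r) > threshold:
            causes.append((name, r))
    return sorted(causes, key=lambda c: -abs(c[1]))

# Hypothetical monitoring samples, one value per minute for an hour.
rng = np.random.default_rng(0)
response_time = rng.normal(200, 20, 60)
metrics = {
    "db_cpu": response_time * 0.5 + rng.normal(0, 5, 60),  # tracks response time
    "swap": rng.normal(100, 10, 60),                        # unrelated noise
}
print(naive_root_cause(response_time, metrics))
```

Run it and "db_cpu" comes out on top – Tada! Root cause found! Except, of course, that nothing here established a cause.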
First problem with this approach: you can’t look at a correlation and declare that you’ve found a cause. Enough said.
Second problem: if you collect this much data (and these systems often have millions of measurements), you will find many correlations by pure chance, in addition to the correlations that do indicate a common issue.
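If you want to see how easily pure chance produces these correlations, here is a quick simulation. Every “metric” below is independent random noise, yet a noticeable number of them still correlate with the equally random “response time” series:

```python
import numpy as np

# 1,000 completely independent random "metrics", sampled over a short
# 20-point window (short windows are typical when a monitoring tool
# zooms in on an incident). Count how many correlate with a random
# "response time" series at |r| > 0.5 purely by chance.
rng = np.random.default_rng(1)
n_samples, n_metrics = 20, 1000
response_time = rng.standard_normal(n_samples)
metrics = rng.standard_normal((n_metrics, n_samples))

correlations = np.array([
    np.corrcoef(response_time, m)[0, 1] for m in metrics
])
spurious = (np.abs(correlations) > 0.5).sum()
print(f"{spurious} of {n_metrics} unrelated metrics look correlated by chance")
```

With millions of measurements instead of a thousand, the number of spurious “findings” scales up accordingly.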
What these vendors do is ignore all the false findings and present the real problems they found at a conference as proof that their method works. But you can’t reduce the rate of false findings without reducing the rate of real findings as well: raise the correlation threshold to cut the noise, and you throw out genuine issues along with it.
Note that I’m not talking about tools like Tanel Poder’s visualization tool: tools that make it easier for the DBA to look at large amounts of data and use our brain’s built-in pattern matcher to find correlations. I support any tool that helps me apply my knowledge to large sets of data at once.
I have a problem with tools that use statistical correlation as a replacement for applying knowledge. It can’t be done.
Here’s the kind of tool I’d like to see:
Suppose your monitoring tool gave you the ability to visually browse, filter, and explore all the data you collect in ways that help you troubleshoot. The tool would remember the things you looked at and the steps you took. After you solve the problem, you could upload the problem description and your debugging process to a website. You could even mark the dead ends of the investigation.
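The data such a tool would need to capture is quite simple. Here is a rough sketch, with all names and the serialization format being hypothetical:

```python
from dataclasses import dataclass, field
import json

@dataclass
class Step:
    action: str            # e.g. "queried v$sesstat"
    finding: str           # what it showed
    dead_end: bool = False # mark investigation paths that led nowhere

@dataclass
class TroubleshootingSession:
    problem: str
    steps: list = field(default_factory=list)

    def record(self, action, finding, dead_end=False):
        self.steps.append(Step(action, finding, dead_end))

    def to_json(self):
        """Serialize for upload to a (hypothetical) sharing site."""
        return json.dumps({
            "problem": self.problem,
            "steps": [vars(s) for s in self.steps],
        }, indent=2)

session = TroubleshootingSession("High response times on nightly report")
session.record("queried v$sesstat", "high 'db file sequential read' waits")
session.record("checked swap", "normal", dead_end=True)
print(session.to_json())
```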
Now you could go to that website and see that for problem X, 90% of the DBAs started by looking at v$sesstat and 10% ran a trace. Maybe you could even have a friend network, so you could see that in this case Fuad looked at disk utilization first while Iggy checked how much redo was written each hour.
If you are not into sharing, you can still browse your own past problems and solutions for ideas that might have slipped your mind.
I think that a troubleshooting tool combined with a “collective wisdom” site could assist experienced DBAs and improve the learning curve for junior DBAs, without pretending to automate away knowledge and experience.