Automated Root Cause Analysis

Posted: February 5, 2010 | Author: prodlife | Filed under: musing | 12 Comments

I’ve run into multiple products that claim to offer automated root cause analysis, so don’t think that I’m ranting against a specific product or vendor. I have a problem with the concept.

The problem these products are trying to solve: IT staff spend much of their time troubleshooting issues, essentially finding the cause of effects they don’t like. What is causing high response times on this report? What is causing the lower disk throughput?

If we can somehow automate the task of finding a cause for a problem, we’ll have a much more efficient IT department.

The idea that troubleshooting can be automated is rather seductive. I’d love to have a “What is causing this issue” button. My problem is with the way those vendors go about solving this issue.

Most of them use variations of a very similar technique:
All these vendors already have monitoring software, so they usually know when there is a problem. They also know of many other things that happen at the same time. So if their software detects that response times go up, it can look at disk throughput, DB CPU, swap, load average, number of connections, etc.
When they see that CPU goes up together with response times – Tada! Root cause found!

First problem with this approach: You can’t look at correlation and declare that you found a cause. Enough said.
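To make this concrete, here is a minimal sketch of a made-up system in which memory pressure drives both CPU and response time. Everything in it (the `simulate` function, the coefficients, the noise levels) is an assumption chosen for illustration, not a measurement from a real database. CPU and response time come out strongly correlated even though, by construction, CPU has no effect on response time at all.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(samples=500, seed=42):
    """Hypothetical system: a hidden driver moves both observed metrics."""
    rng = random.Random(seed)
    cpu, response = [], []
    for _ in range(samples):
        memory_pressure = rng.random()  # hidden driver, between 0 and 1
        # CPU rises with memory pressure (paging work) plus some noise.
        cpu.append(20 + 60 * memory_pressure + rng.gauss(0, 5))
        # Response time also rises with memory pressure; CPU is not an input.
        response.append(100 + 400 * memory_pressure + rng.gauss(0, 30))
    return cpu, response

cpu, response = simulate()
print(f"correlation(cpu, response time) = {pearson(cpu, response):.2f}")
```

A tool that only sees the two observed series would happily report CPU as the root cause here, and throttling CPU would not improve response times in this model.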

Second problem: If you collect so much data (and often these systems have millions of measurements), you will find many correlations by pure chance, in addition to some correlations that do indicate a common issue.
What these vendors do is ignore all the false findings and present, at a conference, the real problems they found as proof that their method works. Also, you can’t reduce the rate of false findings without also reducing the rate at which you find real issues.
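A rough way to see the chance-correlation effect is to generate a response time series and a few thousand metrics that are pure noise, then count how many of them cross a correlation threshold anyway. This is only a sketch under assumed numbers: the metric count, sample size and thresholds are arbitrary, and `count_chance_correlations` is a name I made up for the example.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def count_chance_correlations(num_metrics=2000, samples=30, threshold=0.4, seed=1):
    """Count noise-only metrics whose |correlation| with response time passes the threshold."""
    rng = random.Random(seed)
    response_time = [rng.gauss(0, 1) for _ in range(samples)]
    hits = 0
    for _ in range(num_metrics):
        metric = [rng.gauss(0, 1) for _ in range(samples)]  # unrelated by construction
        if abs(pearson(metric, response_time)) >= threshold:
            hits += 1
    return hits

for threshold in (0.4, 0.6):
    hits = count_chance_correlations(threshold=threshold)
    print(f"threshold {threshold}: {hits} of 2000 unrelated metrics look 'correlated'")
```

Raising the threshold cuts down the false findings, but in a real system it would also hide the weaker correlations that point at genuine problems, which is the trade-off described above.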

Note that I’m not talking about tools like Tanel Poder’s visualization tool: tools that make it easier for the DBA to look at large amounts of data and use our brain’s built-in pattern matcher to find correlations. I support any tool that assists me in applying my knowledge to large sets of data at once.

I have a problem with tools that use statistical correlation as a replacement for applying knowledge. It can’t be done.

Here’s the kind of tool I’d like to see:
Suppose your monitoring tool gives you the ability to visually browse, filter and explore all the data you collect in ways that help you troubleshoot. The tool remembers the things you looked at and the steps you took. After you solve the problem, you can upload the problem description and your debug process to a website. You can even mark the dead ends of the investigation.

Now you can go to that website and see that for problem X, 90% of the DBAs started by looking at v$sesstat and 10% ran trace. Maybe you can even have a friend network, so you can see that in this case Fuad looked at disk utilization first while Iggy checked how much redo is written each hour.
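The aggregation side of such a site could be quite simple. Below is a minimal sketch, assuming each uploaded investigation is stored as a problem name, a DBA name and the ordered list of steps taken. The record layout, the sample data and both function names are hypothetical; no such site or API exists that I know of.

```python
from collections import Counter

# Hypothetical uploaded investigations: (problem, DBA, ordered steps taken).
# The names and steps are made-up sample data, not real measurements.
INVESTIGATIONS = [
    ("slow report", "Fuad", ["disk utilization", "v$sesstat"]),
    ("slow report", "Iggy", ["redo per hour", "v$sesstat"]),
    ("slow report", "dba3", ["v$sesstat", "trace"]),
    ("slow report", "dba4", ["v$sesstat"]),
    ("login storm", "dba3", ["number of connections"]),
]

def first_step_shares(records, problem):
    """Return {first step: share of investigations} for one problem."""
    firsts = Counter(
        steps[0] for prob, _dba, steps in records if prob == problem and steps
    )
    total = sum(firsts.values())
    return {step: count / total for step, count in firsts.most_common()}

def first_steps_of(records, problem, friends):
    """Return {dba: first step} restricted to a set of friends."""
    return {
        dba: steps[0]
        for prob, dba, steps in records
        if prob == problem and dba in friends and steps
    }

shares = first_step_shares(INVESTIGATIONS, "slow report")
for step, share in shares.items():
    print(f"{share:.0%} started with {step}")
print(first_steps_of(INVESTIGATIONS, "slow report", {"Fuad", "Iggy"}))
```

The hard part is not this counting but capturing the steps and dead ends while the DBA works, which is why the recording has to live inside the troubleshooting tool itself.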

If you are not into sharing, you can still browse your own past problems and solutions for ideas that might have slipped your mind.

I think that a troubleshooting tool combined with a “collective wisdom” site can assist experienced DBAs and shorten the learning curve for junior DBAs without pretending to automate away knowledge and experience.

12 Comments on “Automated Root Cause Analysis”

Dr Queues says:

February 5, 2010 at 11:38 pm

Not to mention ‘anti-correlation.’ I have a great real-world example of that, which is too long to describe here.

Another method that you didn’t mention is Bayesian inference and its brethren. I know this technique has been beaten to death in the academic literature, but I’m not sure if any of the vendors you’re referring to use it. I have the impression that it is used in some commercial network mgmt tools.

Also, “collective wisdom” == knowledge base (the AI term). A difference with Bayesian techniques is that they can adapt quickly to new data, or that’s the claim, at least.

aachleon says:

February 5, 2010 at 11:58 pm

As usual…love the ideas.

Tanel Poder says:

February 6, 2010 at 4:33 pm

Hi Chen,

Next time we meet I’ll show you a product beta which does (among other things) that 😉

joel garry says:

February 8, 2010 at 11:28 pm

Problem with such a base is inevitably rating-feedback control. The masses will tell you to rebuild indices every week, and every other myth you don’t want to think of now that you know better. There’s a reason there’re only a few gurus and thousands of conference attendees. MOS even vets their publications, and look at their reputation and what they are doing wrong (including not keeping change histories on the docs, and the knowledgebase interface).

Doing Administration Right is expensive. You never get enough data to automate such a tool, nor enough data points to solve other than the simplest problem.

prodlife says:

February 10, 2010 at 3:49 am

@Joel

I’m not that sure that crowd-sourcing for troubleshooting or tuning advice will be that horrible. Wikipedia seems to survive without too many myths going on. The advice in OTN is often quite solid and in Oracle-L it is excellent. Actually I think that the index rebuilders do better when there is no open discussion, when they can walk to a client, fiddle around, charge big bucks and no one can criticize them.

I’m an optimist 🙂

prodlife says:

February 10, 2010 at 3:50 am

@Tanel

Will you be at RMOUG? I’ll remind you of this promise.

prodlife says:

February 10, 2010 at 3:51 am

@DRQZ

Bayesian inference seems to have a good track record as the basis of expert systems. I’d like to see a product that uses this for root cause analysis.

Of course, the problem with quickly adapting to new data is that you make new mistakes faster 🙂

Log Buffer #178: a Carnival of the Vanities for DBAs | The Pythian Blog says:

February 12, 2010 at 6:38 pm

[…] Shapira offers us her thoughts on diagnostics in Automated Root Cause Analysis, and I have to agree with her that sometimes it is best to be offered good visualization tools […]

Donald K. Burleson says:

February 12, 2010 at 10:13 pm

Good article!

Let’s not forget the debate about correlation and causation.

– In medical research, there are many medicines that treat symptoms, while doctors do not understand how the medicines actually work.

– In statistics, it’s the strength of the correlation that determines its predictive validity, regardless of the root cause of the correlation.

DKB

prodlife says:

February 12, 2010 at 10:54 pm

@DKB

You can certainly use a strong correlation to make predictions. However, predictions are very difficult – especially about the future 🙂

If I see high CPU every time response times are high, I can use this to predict that the next time CPU will be high my response times will suffer. It may be correct if nothing changes in the system.

This still doesn’t imply causality – I don’t know that the high CPU is the cause of the high response times (maybe it’s lack of memory that causes both issues?), I don’t know that reducing CPU utilization will reduce response times and I don’t know that reducing response times will reduce CPU utilization.

In short, I can make predictions in a very limited set of circumstances. Correlation is a very weak predictive tool. Craig Shallahamer explains this in his “Forecasting Performance” book. Databases are essentially queue systems where very small changes can cause large and difficult to predict effects.

BTW, if you are interested in statistical prediction as it applies to databases – I’m giving a presentation about the use of time series analysis at RMOUG.

David Morrison says:

March 1, 2012 at 10:22 pm

Interesting discussion as I was thinking to myself why can’t RCA be automated in a way that makes sense and the hold true industry standards and methodologies of determining Root Cause. That said, the toll must be able to allow for input a storage of date or it must be able to search the cloud for possibilities given criteria by the user. It should also be able to invoke the chosen method such as The 5 Whys, or ishikawa ..afterwards the systems output could be sued to either validate or invalidate its findings based on what it found …the reality is there is nothing that has gone wrong with a server or a system that has not happened before someplace else….if a tool like this existed image the reduction in turnaround times for RCA and the increase is customer SAT…..interesting indeed

David Morrison says:

March 1, 2012 at 10:25 pm

Please excuse my typos ….let me correct:

Interesting discussion as I was thinking to myself why can’t RCA be automated in a way that makes sense and the hold true industry standards and methodologies of determining Root Cause. That said, the tool must be able to allow for input and/or storage of data or it must be able to search the cloud for possibilities given criteria by the user. It should also be able to invoke a chosen method such as The 5 Whys, or Ishikawa. Afterwards the system’s output could be used to either validate or invalidate its findings. The reality is there is nothing that has gone wrong with a server or a system that has not happened before someplace else. If a tool like this existed, imagine the reduction in turnaround times for RCA and the increase in customer SAT. Interesting indeed.
