Measuring Performance – The Problem
Posted: April 30, 2008
And as always, he is absolutely right. My managers fell for one of the oldest operations management errors – confusing what can be easily measured with what they actually need to know.
They wanted to know whether our users were unhappy with the application response times. What they could easily see was server CPU, so to simplify their lives they decided that “high cpu = unhappy users”. That is true often enough to make the simplification tempting, but not often enough to make it a serious operational strategy.
Measuring application performance is incredibly complex. Applications are complex, performance is difficult to define and even more difficult to define is “good performance”. I’ve been working in the field of application performance for 8 years now. I spent the first two years of my career developing application monitoring tools, and another 6 years using them – first as operations manager and now as a DBA, and I can easily claim that this is still an open problem. Tons of tools, each and every one of them is doing it wrong.
Two problems I see with the tools I’ve looked at (besides managers taking cpu averages too seriously):
1. Narrow view: Network monitors show you the network. System monitors show you cpu, io and memory. DB Monitors show you how the database wastes its time. End User Monitors show you slow web pages. So you end up with at least 10 different tools (How many do you have in your IT organization? Do you even know?), and whenever one of them indicates something unusual, you have to juggle and correlate all of them to get a reasonably accurate idea of what is going on. Usually they show conflicting information that you never manage to fully correlate and explain. In any case, tons of intuition and guesswork are involved.
Those different tools will be owned by different teams. So you are not correlating network monitors, storage monitors, db monitors and application monitors. You are correlating your understanding of the situation as you see it from db monitors with what you understand from the network manager, who is trying to understand his network monitors at the same time. Throw in the storage manager, application owners, angry customers, flustered managers and a few consultants, and you no longer have rational correlation of facts, you have politics and finger-pointing.
Uncorrelated data means that understanding the scope of an issue is very difficult. The DB server is at 98% cpu. Does it cause an application problem? If so, where? Which users are affected? Or maybe the application monitoring tool shows a slow login. Is it caused by the web server? The DB? LDAP? Maybe it’s the load balancer? Or a switch? Are other applications affected by the same issue? Is it a recurring issue?
2. Too much data, too little meaning: Often a problem of application layer monitors. Actually, this is not exactly a tool problem – application owners usually insist on having incredible (=absurd) amounts of data about their application performance and having no idea what to do with it later.
Here the tools diverge into two kinds, virtual users and real users. Each one of these tools has its own way of generating too much data.
Virtual User based tools require an application expert who will write a script to emulate a user visiting his website and performing various common operations. Then you run this script from various locations and ISPs, usually every 5-15 minutes. The tool will collect information regarding response times for each file accessed by the script, often broken down into various network components (server time, network time, ssl, dns, etc).
Take all this data, multiply by 30 locations, running 12 times an hour, 24×7, and of course you want to keep this for something like 18 months, right? These tools also have the additional issue that you may be monitoring the wrong thing, or that a badly written script may completely skew your results.
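To put a rough number on that multiplication, here is a small back-of-the-envelope sketch. The 30 locations, 12 runs an hour and 18 months come from the paragraph above; treating 18 months as about 540 days, and the figures of 40 files per script run and 5 timing components per file, are purely my assumptions for illustration.

```python
def script_runs(locations, runs_per_hour, days):
    """Total virtual-user script executions retained over the period."""
    return locations * runs_per_hour * 24 * days


def retained_measurements(runs, files_per_run, components_per_file):
    """Individual timing values stored, one per file per network component."""
    return runs * files_per_run * components_per_file


# From the text: 30 locations, 12 runs an hour, 24x7, kept for ~18 months.
DAYS_RETAINED = 18 * 30  # assumption: a month is roughly 30 days

runs = script_runs(locations=30, runs_per_hour=12, days=DAYS_RETAINED)

# Assumed for illustration only: 40 files per run, 5 components per file
# (for example server time, network time, ssl, dns, total).
measurements = retained_measurements(runs, files_per_run=40, components_per_file=5)

print(f"{runs:,} script runs, {measurements:,} timing values")
```

Under those assumptions that is a few million script runs and close to a billion individual timing values, for a single monitored script.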
Real User tools will sit at your data center and capture pretty much the same information, but about the real users that are actually visiting your site. If your site is worth monitoring, this will generate much more information than Virtual User tools (because you should have more real users than what monitoring will generate), and this information will also be more “real” by definition. It will be more difficult to analyze, because virtual users lend themselves to nice time-series type analysis. But let’s face it, no one does time-series on monitors. We have averages, and on good days – percentiles.
Now you’ve got 50G of monitoring data. Do you really need all this? Do you know what to do with it? Is it meaningful to you? Thursday night, 5am GMT, you see performance degradation for users (real or virtual) accessing your search page from Japan. What do you do?
Do you have meaningful trends? Can your tool show you the slow-down that occurs 10 minutes after the hour, every hour? Can it show you that it is getting worse? Do you know if it is something you should worry about and investigate? Application owners rarely have the ability to mine their own data. Do you have a tool that does it for you?
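As a small illustration of why percentiles are the "good days", here is a sketch with made-up numbers: 90 page loads at 0.2 seconds and 10 at 6 seconds. The data and the nearest-rank `percentile` helper are hypothetical, not taken from any particular monitoring tool.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up response times in seconds: most users are fine, one in ten is not.
samples = [0.2] * 90 + [6.0] * 10

mean = sum(samples) / len(samples)
p50 = percentile(samples, 50)
p95 = percentile(samples, 95)

print(f"mean={mean:.2f}s  median={p50:.2f}s  p95={p95:.2f}s")
```

The average comes out under a second and looks harmless, while the 95th percentile shows that some users wait six seconds. Neither number says when those slow loads happened, which is the time-series part nobody does.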
I’ve seen operators get an alert about a performance issue every day at 3pm. It lasted for 10 minutes, so each time it was dismissed as a “momentary anomaly”, and no one noticed that it had happened at the same time every day for over a month, starting right after a new patchset was installed. It is easy to drown in too much data, especially data that you don’t immediately know how to process. After all, no one knows how to debug performance issues that last for 10 minutes. You need to know that it is repeating the same way every day in order to debug it.
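The kind of mining I wish the tools did for me is not complicated. Here is a minimal sketch, assuming you can export alert timestamps from your monitoring tool: it buckets alerts by time of day and reports the buckets that fired on many different days. The function name, the 10-minute bucket size, the 5-day threshold and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def recurring_slots(alert_times, bucket_minutes=10, min_days=5):
    """Return {"HH:MM": number_of_distinct_days} for every time-of-day
    bucket in which alerts fired on at least min_days different dates."""
    days_by_slot = defaultdict(set)
    for ts in alert_times:
        minute_of_day = ts.hour * 60 + ts.minute
        slot = minute_of_day // bucket_minutes * bucket_minutes
        days_by_slot[slot].add(ts.date())
    return {
        f"{slot // 60:02d}:{slot % 60:02d}": len(days)
        for slot, days in sorted(days_by_slot.items())
        if len(days) >= min_days
    }


# Hypothetical export: one alert shortly after 3pm on every day of April,
# plus a single unrelated alert in the middle of the night.
alerts = [datetime(2008, 4, day, 15, 2) for day in range(1, 31)]
alerts.append(datetime(2008, 4, 12, 3, 47))

recurring = recurring_slots(alerts)
print(recurring)
```

With this sample data the 3pm bucket shows up on 30 distinct days and the one-off night alert is filtered out. The same idea with minute-of-hour buckets instead of minute-of-day would catch a slow-down that happens 10 minutes after every hour.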
If you have a performance monitoring tool that you are truly happy with, please comment and let us know! In general, I’m very interested in how you do your monitoring.
My friend Moshe works for BHive, a company that happens to be developing a new monitoring tool that promises to take some of the pain (and irrelevant data) out of the equation. He arranged a demo of their solution for me, and while it is still very, very new and incomplete, it does take monitoring in an interesting direction. I’ll write a separate post about that.
Exciting posts appeared on my RSS Reader yesterday:
Dom Brooks at OraStory upgraded a production server to 11g and lived to tell the tale. It’s the first production upgrade I’ve heard of and apparently it went well. Good news!
Greg Rahn at Structured Data wrote about data warehouse scalability. This post is so good that I have no superlatives. Just go read it. Even if you don’t have a data warehouse. It’s a must.