Monitoring Performance With BHivePosted: May 7, 2008
After we established that looking at the CPU is not a productive way to monitoring performance, and after I wrote a long long rant about how every application level response time monitor is useless, I want to talk a bit about an application response time tool I’ve seen that takes an interesting approach to the problem – BHive‘s Conductor.
Just to make sure we are on the same page: Application level response time monitors are the tools that attempt to check the application response times as it is experienced by its users. As such it monitors the entire system, not a particular component of it.
The first cool thing about the Conductor is that it works by monitoring your network traffic. Why is this so important?
- Its agentless. You install one box on your network, and that should be it.
- It monitors your real users, probably by examining packets flying back and forth. This means that you don’t have to write scripts to emulate users, and worry about getting the scripts wrong. I’m a big fan of virtual user scripts, because you get very consistent data set that is easier to analyze. But even I have to admit that it is very difficult to get those scripts right.
As an additional bonus – you will not get alerted about performance issues that happen at 3am when the big batch jobs are running but users are asleep.
- It attempts to model your business processes (login, search, bid, buy, wish list, etc) from the network information it catches. So the monitor can alert about response time for bidding vs. searching, which is the language that Operations and Business are speaking.
This worked great in the demo, and it could save tons of time if it always works that well. I’m still slightly skeptical, because I’ve seen so many products fail when they attempt to do this. Especially when some critical information about what a page will do is embedded in environment variables sent to server in HTTP POST.
- They also automatically map all your servers, so after the monitor ran for a while you see a nice map of which application server is serving which business processes and which DB is used by which application server, and it even maps the queries that each DB is running. I suspect that the map is one more thing that is nicer in a demo than in a real IT, where there are just too many things to look at.
- The latter point means that they should have good information if the performance issue is DB related or not. Which may mean that the DB will not get automatically blamed for everything.
- The ability to relate DB slowdowns with real business processes and real users can lead to incredible things. Imagine instructing the helpdesk (or even configuring alerts) to beep you only if more than 10 users are affected, or that they can call you at 3am for slow “buy” but slow “search” can wait until morning.
All these are quite cool, and certainly make life easier, but they are not exactly revolutionary. What makes the Conductor so interesting is that BHive does not intend to leave it as a passive monitor.
The idea is that there are some things you can do once performance issue is detected: You can modify load balancer settings or if you are using VMWare you can move resources from a server that is doing fine to one that is in trouble, maybe even rebuild indexes ;). So why not have the monitoring system do this for you when it discovers that there is an issue?
Of course, I don’t mind if the Conductor modifies our Load Balancer settings twice an hour, but I’d rather die before I let it rebuilt indexes. I think that our network admin feels exactly the reverse. There are definitely an issue with automating tasks there were previously manual (and quite rare). I suggested that instead of actually doing the modification, the Conductor should attach a script with instructions to each alert – so the admin can review the actions and decide if he wants to run this or not. Oracle’s ADDM does it successfully.
One direction that I’d really love to see this product taking is more integration with external data sources. BHive is talking about integration with VMWare, but I want to see much more. I want to see the CPU on the machines where slowness is detected. If the DB is oracle, please show me wait events, correlated with the slow queries that the monitor shows. Maybe take AWR snapshot while you are at it and send me the result of the report.
In my opinion one of the most important things to remember while developing a monitoring product is that you can’t really give good value if you cover just one aspect of the system. You need to integrate, you need to correlate, and you need to give me the fullest picture possible. You can’t give good value if you play vendor power games and pretend that your product is the only one I’ll ever use. Remember, I already have 10 terrific monitoring products. It is time to give us the big picture. I hope BHive will be able to do this. They certainly seem to be going the right way.
(Disclaimer: My best friend works for BHive. I did not get paid to write this review – I wrote this because I found the product exciting. I hope my readers will find this review somewhat useful, but I’m sure we are all grownups capable of making our own monitoring decisions. Specifically, for reasons beyond my control (i.e. corporate policy), I’m not using this product myself. )