Monitoring Performance With BHive

Having established that looking at the CPU is not a productive way to monitor performance, and having written a long rant about how every application-level response time monitor is useless, I want to talk a bit about an application response time tool I’ve seen that takes an interesting approach to the problem: BHive’s Conductor.

Just to make sure we are on the same page: application-level response time monitors are tools that attempt to measure an application’s response time as its users experience it. As such, they monitor the entire system, not a particular component of it.

The first cool thing about the Conductor is that it works by monitoring your network traffic. Why is this so important?

  • It’s agentless. You install one box on your network, and that should be it.
  • It monitors your real users, probably by examining packets flying back and forth. This means that you don’t have to write scripts to emulate users and worry about getting the scripts wrong. I’m a big fan of virtual user scripts, because you get a very consistent data set that is easier to analyze. But even I have to admit that it is very difficult to get those scripts right.
    As an additional bonus – you will not get alerted about performance issues that happen at 3am when the big batch jobs are running but users are asleep.
  • It attempts to model your business processes (login, search, bid, buy, wish list, etc) from the network information it catches. So the monitor can alert about response time for bidding vs. searching, which is the language that Operations and Business are speaking.
    This worked great in the demo, and it could save tons of time if it always works that well. I’m still slightly skeptical, because I’ve seen so many products fail when they attempt to do this, especially when some critical information about what a page will do is embedded in variables sent to the server in an HTTP POST.
  • It also automatically maps all your servers, so after the monitor has run for a while you see a nice map of which application server is serving which business processes and which DB is used by which application server. It even maps the queries that each DB is running. I suspect that the map is one more thing that is nicer in a demo than in a real IT environment, where there are just too many things to look at.
  • The server map means the monitor should have good information about whether a performance issue is DB related or not, which may mean that the DB will not get automatically blamed for everything.
  • The ability to relate DB slowdowns with real business processes and real users can lead to incredible things. Imagine instructing the helpdesk (or even configuring alerts) to beep you only if more than 10 users are affected, or that they can call you at 3am for slow “buy” but slow “search” can wait until morning.
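
To make that last point concrete, here is a minimal Python sketch of such an alert policy. Everything in it is my own illustration and an assumption: the `SlowdownEvent` record, the `route_alert` function, the night hours and the thresholds are hypothetical, not anything the Conductor exposes. It also combines the two rules (user count and business process) into one policy, which is only one of several reasonable readings.

```python
from dataclasses import dataclass

# Hypothetical policy: business processes that justify a page at night.
PAGE_AT_NIGHT = {"buy", "bid"}
# From the text: alert only if more than 10 users are affected.
MIN_AFFECTED_USERS = 10


@dataclass
class SlowdownEvent:
    process: str         # business process name, e.g. "buy" or "search"
    affected_users: int  # distinct real users who saw slow responses
    hour: int            # local hour of day, 0-23


def route_alert(event: SlowdownEvent) -> str:
    """Return "page", "ticket" or "ignore" for a detected slowdown."""
    if event.affected_users <= MIN_AFFECTED_USERS:
        return "ignore"
    # Assumed definition of "night": before 07:00 or from 22:00 onwards.
    is_night = event.hour < 7 or event.hour >= 22
    if is_night and event.process not in PAGE_AT_NIGHT:
        return "ticket"  # slow "search" can wait until morning
    return "page"
```

With this policy, a slow “buy” hitting 25 users at 3am pages the DBA, while a slow “search” at the same hour only opens a ticket.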

All these are quite cool, and certainly make life easier, but they are not exactly revolutionary. What makes the Conductor so interesting is that BHive does not intend to leave it as a passive monitor.

The idea is that there are some things you can do once a performance issue is detected: you can modify load balancer settings, or if you are using VMWare you can move resources from a server that is doing fine to one that is in trouble, or maybe even rebuild indexes ;). So why not have the monitoring system do this for you when it discovers that there is an issue?

Of course, I don’t mind if the Conductor modifies our load balancer settings twice an hour, but I’d rather die than let it rebuild indexes. I think that our network admin feels exactly the reverse. There is definitely an issue with automating tasks that were previously manual (and quite rare). I suggested that instead of actually making the modification, the Conductor should attach a script with instructions to each alert, so the admin can review the actions and decide whether to run them or not. Oracle’s ADDM does this successfully.
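
Here is a rough Python sketch of what I mean by “attach a script to the alert”. The `Alert` class and both helper functions are hypothetical names of my own, and the proposed commands are free-text placeholders rather than real load balancer or database syntax. The only point is that nothing gets executed until a human approves it.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    message: str
    # Proposed remediation steps; nothing here is executed automatically.
    proposed_script: list[str] = field(default_factory=list)
    approved: bool = False  # set to True by an admin after review


def render_for_review(alert: Alert) -> str:
    """Format the alert and its proposed steps for a human reviewer."""
    lines = [f"ALERT: {alert.message}", "Proposed actions (not yet run):"]
    lines += [f"  {i}. {cmd}" for i, cmd in enumerate(alert.proposed_script, 1)]
    return "\n".join(lines)


def actions_to_run(alert: Alert) -> list[str]:
    """Return the steps to execute: none until an admin has approved."""
    return list(alert.proposed_script) if alert.approved else []


alert = Alert("slow 'buy' on one app server",
              ["<placeholder: shift load balancer weight away from that server>"])
print(render_for_review(alert))
```

Until someone sets `approved` to `True`, `actions_to_run(alert)` returns an empty list, so the monitor can suggest as aggressively as it likes without touching anything.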

One direction that I’d really love to see this product take is more integration with external data sources. BHive is talking about integration with VMWare, but I want to see much more. I want to see the CPU on the machines where slowness is detected. If the DB is Oracle, please show me wait events, correlated with the slow queries that the monitor shows. Maybe take an AWR snapshot while you are at it and send me the resulting report.
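
The correlation I’m asking for does not have to be fancy. As a minimal sketch, assuming the monitor can hand me the start and end time of a slowdown and that some other tool has collected `(timestamp, cpu_percent)` samples sorted by time, lining the two up takes a few lines of Python. The function name and the data layout are my own assumptions, not any product’s API.

```python
from bisect import bisect_left, bisect_right


def cpu_during(samples, start, end):
    """Return CPU readings whose timestamp falls within [start, end].

    samples: list of (timestamp, cpu_percent) tuples sorted by timestamp.
    """
    times = [t for t, _ in samples]
    lo = bisect_left(times, start)   # first sample at or after start
    hi = bisect_right(times, end)    # one past the last sample at or before end
    return [cpu for _, cpu in samples[lo:hi]]


# Made-up samples, one per minute (timestamps in seconds).
samples = [(0, 10.0), (60, 20.0), (120, 95.0), (180, 90.0), (240, 15.0)]
busy = cpu_during(samples, 100, 200)
```

If the readings that come back for the slow window are much higher than the ones around it, that is at least a hint about where to look next. The same lookup would work for wait events or any other time-stamped metric.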

In my opinion one of the most important things to remember while developing a monitoring product is that you can’t really give good value if you cover just one aspect of the system. You need to integrate, you need to correlate, and you need to give me the fullest picture possible. You can’t give good value if you play vendor power games and pretend that your product is the only one I’ll ever use. Remember, I already have 10 terrific monitoring products. It is time to give us the big picture. I hope BHive will be able to do this. They certainly seem to be going the right way.

(Disclaimer: My best friend works for BHive. I did not get paid to write this review – I wrote this because I found the product exciting. I hope my readers will find this review somewhat useful, but I’m sure we are all grownups capable of making our own monitoring decisions. Specifically, for reasons beyond my control (i.e. corporate policy), I’m not using this product myself. )

2 Comments on “Monitoring Performance With BHive”

  1. prodlife says:

    Another cool use of real user monitoring: Alert me only if you see performance issues for users in a specific IP range that belong to my biggest customer / executive LAN.

    Kind of a Quality of Service tool.

  2. Ashi Sheth says:

    We were considering B-Hive at our university, however, we couldn’t justify the licensing costs associated with the solution. From the information we were provided, it is licensed based on number of host servers being watched (as opposed to applications or some other more lenient model). For a non-profit university, having B-Hive licensed based on the 4 application servers, 3 web servers and 2 database servers supporting the single app we care about to this level of monitoring, the cost was just unreasonable. If I was a business where time has the same level of impact on revenue, I could justify the cost. Here – I can’t (yet).
