I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

Monitoring Performance With BHive May 7, 2008

Filed under: monitoring — prodlife @ 4:10 am

After we established that looking at the CPU is not a productive way to monitor performance, and after I wrote a long, long rant about how every application level response time monitor is useless, I want to talk a bit about an application response time tool I’ve seen that takes an interesting approach to the problem – BHive’s Conductor.

Just to make sure we are on the same page: application level response time monitors are tools that attempt to check application response times as they are experienced by users. As such, they monitor the entire system, not a particular component of it.

The first cool thing about the Conductor is that it works by monitoring your network traffic. Why is this so important?

  • It’s agentless. You install one box on your network, and that should be it.
  • It monitors your real users, probably by examining packets flying back and forth. This means that you don’t have to write scripts to emulate users and worry about getting the scripts wrong. I’m a big fan of virtual user scripts, because you get a very consistent data set that is easier to analyze. But even I have to admit that it is very difficult to get those scripts right.
    As an additional bonus – you will not get alerted about performance issues that happen at 3am when the big batch jobs are running but users are asleep.
  • It attempts to model your business processes (login, search, bid, buy, wish list, etc) from the network information it catches. So the monitor can alert about response time for bidding vs. searching, which is the language that Operations and Business are speaking.
    This worked great in the demo, and it could save tons of time if it always works that well. I’m still slightly skeptical, because I’ve seen so many products fail when they attempt to do this, especially when some critical information about what a page will do is embedded in environment variables sent to the server in an HTTP POST.
  • They also automatically map all your servers, so after the monitor has run for a while you see a nice map of which application server is serving which business processes and which DB is used by which application server, and it even maps the queries that each DB is running. I suspect that the map is one more thing that is nicer in a demo than in a real IT environment, where there are just too many things to look at.
  • The previous point means that they should have good information on whether a performance issue is DB related or not, which may mean that the DB will not get automatically blamed for everything.
  • The ability to relate DB slowdowns with real business processes and real users can lead to incredible things. Imagine instructing the helpdesk (or even configuring alerts) to beep you only if more than 10 users are affected, or that they can call you at 3am for slow “buy” but slow “search” can wait until morning.
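
To make that last bullet concrete, here is a rough Python sketch of such a paging rule. Everything in it is my own assumption for illustration: the names `SlowdownEvent` and `should_page`, the night window, and the idea that only “buy” justifies a 3am call. It is not how the Conductor is configured.

```python
from dataclasses import dataclass
from datetime import time

# Hypothetical policy values, chosen only to mirror the example in the text.
CRITICAL_PROCESSES = {"buy"}   # worth a 3am call
MIN_AFFECTED_USERS = 10        # page only if MORE than this many users are affected
NIGHT_START, NIGHT_END = time(22, 0), time(7, 0)


@dataclass
class SlowdownEvent:
    process: str         # business process, e.g. "buy" or "search"
    affected_users: int  # real users who saw the slow response
    at: time             # local time of detection


def is_night(t: time) -> bool:
    """True between NIGHT_START and NIGHT_END (the window wraps midnight)."""
    return t >= NIGHT_START or t < NIGHT_END


def should_page(event: SlowdownEvent) -> bool:
    """Decide whether to wake the DBA or leave the alert for the morning."""
    if event.affected_users <= MIN_AFFECTED_USERS:
        return False
    if is_night(event.at) and event.process not in CRITICAL_PROCESSES:
        return False
    return True
```

With these values a slow “buy” affecting 25 users at 3am pages, while a slow “search” at the same hour waits until morning.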

All these are quite cool, and certainly make life easier, but they are not exactly revolutionary. What makes the Conductor so interesting is that BHive does not intend to leave it as a passive monitor.

The idea is that there are some things you can do once a performance issue is detected: you can modify load balancer settings, or if you are using VMware you can move resources from a server that is doing fine to one that is in trouble, maybe even rebuild indexes ;) . So why not have the monitoring system do this for you when it discovers that there is an issue?

Of course, I don’t mind if the Conductor modifies our load balancer settings twice an hour, but I’d rather die than let it rebuild indexes. I think that our network admin feels exactly the reverse. There is definitely an issue with automating tasks that were previously manual (and quite rare). I suggested that instead of actually making the modification, the Conductor should attach a script with instructions to each alert – so the admin can review the actions and decide whether to run it or not. Oracle’s ADDM does this successfully.

One direction that I’d really love to see this product take is more integration with external data sources. BHive is talking about integration with VMware, but I want to see much more. I want to see the CPU on the machines where slowness is detected. If the DB is Oracle, please show me wait events, correlated with the slow queries that the monitor shows. Maybe take an AWR snapshot while you are at it and send me the resulting report.

In my opinion one of the most important things to remember while developing a monitoring product is that you can’t really give good value if you cover just one aspect of the system. You need to integrate, you need to correlate, and you need to give me the fullest picture possible. You can’t give good value if you play vendor power games and pretend that your product is the only one I’ll ever use. Remember, I already have 10 terrific monitoring products. It is time to give us the big picture. I hope BHive will be able to do this. They certainly seem to be going the right way.

(Disclaimer: My best friend works for BHive. I did not get paid to write this review – I wrote this because I found the product exciting. I hope my readers will find this review somewhat useful, but I’m sure we are all grownups capable of making our own monitoring decisions. Specifically, for reasons beyond my control (i.e. corporate policy), I’m not using this product myself. )

 

Measuring Performance – The Problem April 30, 2008

Filed under: monitoring,musing — prodlife @ 6:27 am

After I published “Yet Another Nice Myth”, my friend MosheZ sent me an angry email:
“STOP looking at top. You need actual application data if you want to validate application responsiveness.”

And as always, he is absolutely right. My managers fell for one of the oldest operations management errors – confusing what can be easily measured with what they actually need to know.

They wanted to know if our users are unhappy with the application response times. What they could easily see was server CPU, and to simplify their lives, they decided that “high cpu = unhappy users”. This is true often enough to make the simplification tempting, but not often enough to make it a serious operational strategy.

Measuring application performance is incredibly complex. Applications are complex, performance is difficult to define and even more difficult to define is “good performance”. I’ve been working in the field of application performance for 8 years now.  I spent the first two years of my career developing application monitoring tools, and another 6 years using them – first as operations manager and now as a DBA, and I can easily claim that this is still an open problem. Tons of tools, each and every one of them is doing it wrong.

Two problems I see with the tools I’ve looked at (besides managers taking cpu averages too seriously):

1.  Narrow view: Network monitors show you the network. System monitors show you cpu, io and memory. DB Monitors show you how the database wastes its time. End User Monitors show you slow web pages.  So you end up with at least 10 different tools (How many do you have in your IT organization? Do you even know?), and whenever one of them indicates something unusual, you have to juggle and correlate all of them to get a reasonably accurate idea of what is going on. Usually they show conflicting information that you never manage to fully correlate and explain. In any case, tons of intuition and guesswork are involved.

Those different tools will be owned by different teams. So you are not correlating network monitors, storage monitors, db monitors and application monitors. You are correlating your understanding of the situation as you see it from db monitors with what you understand from the network manager who is trying to understand his network monitors simultaneously. Throw in the storage manager, application owners, angry customers, flustered managers and few consultants, and you no longer have rational correlation of facts, you have politics and finger-pointing.

Uncorrelated data means that understanding the scope of issues is very difficult. The DB server is at 98% cpu: does it cause an application problem? If so, where? Which users are affected? Or maybe the application monitoring tool shows slow login – is it caused by the web server? The DB? LDAP? Maybe it’s the load balancer? Or a switch? Are there other applications affected by the same issue? Is it a recurring issue?

2. Too much data, too little meaning: Often a problem of application layer monitors. Actually, this is not exactly a tool problem – application owners usually insist on having incredible (=absurd) amounts of data about their application performance, and then have no idea what to do with it.

Here the tools diverge into two kinds, virtual users and real users. Each one of these tools has its own way of generating too much data.

Virtual User based tools require an application expert who will write a script to emulate a user visiting the website and performing various common operations. Then you run this script from various locations and ISPs, usually every 5-15 minutes, and the tool collects response times for each file accessed by the script, often broken down into various network components (server time, network time, ssl, dns, etc).
Take all this data, multiply by 30 locations, running 12 times an hour, 24×7, and of course you want to keep this for something like 18 months, right? These tools also have the additional issue that you may be monitoring the wrong thing, or that a badly written script may completely skew your results.
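
Just to put a number on “all this data”, here is the arithmetic as a small Python sketch. The 30 locations, 12 runs an hour and 18 months come from the paragraph above; the 20 files per script and 5 timing components per file are numbers I made up for illustration, and months are approximated as 30 days.

```python
LOCATIONS = 30            # from the text
RUNS_PER_HOUR = 12        # i.e. every 5 minutes, from the text
RETENTION_DAYS = 18 * 30  # 18 months, approximated as 30-day months

# Assumptions for illustration only: the text gives no per-script numbers.
FILES_PER_SCRIPT = 20
COMPONENTS_PER_FILE = 5   # e.g. server time, network time, ssl, dns, total

runs_per_day = LOCATIONS * RUNS_PER_HOUR * 24
total_runs = runs_per_day * RETENTION_DAYS
total_measurements = total_runs * FILES_PER_SCRIPT * COMPONENTS_PER_FILE

print(f"{runs_per_day:,} script runs per day")
print(f"{total_runs:,} script runs kept")
print(f"{total_measurements:,} individual timings kept")
```

Under those assumptions that is 8,640 runs a day, about 4.7 million runs over the retention period, and roughly 467 million individual timings, for a single scripted scenario.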

Real User tools will sit at your data center and capture pretty much the same information, but about the real users that are actually visiting your site. If your site is worth monitoring, this will generate much more information than Virtual User tools (because you should have more real users than what monitoring will generate). This information will also be more “real” by definition. It will be more difficult to analyze, because virtual users lend themselves to nice time-series type analysis. But let’s face it, no one does time-series on monitors. We have averages, and on good days – percentiles.
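
Why the good days are the ones with percentiles: a small Python sketch with made-up response times, where 95 requests take 0.2 seconds and 5 take 10 seconds. The `percentile` helper is my own nearest-rank implementation, written out so the rule is visible; the data is invented.

```python
from statistics import mean, median


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Invented data: most users are fine, a few are suffering badly.
response_times = [0.2] * 95 + [10.0] * 5

print("mean  :", round(mean(response_times), 2))
print("median:", median(response_times))
print("p95   :", percentile(response_times, 95))
print("p99   :", percentile(response_times, 99))
```

The mean comes out at about 0.69 seconds, a number that no actual user experienced. The median and the 95th percentile are both 0.2, and only the 99th percentile shows the 10 second requests.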

Now you’ve got 50G of monitoring data. Do you really need all this? Do you know what to do with it? Is it meaningful to you? Thursday night, 5am GMT, you see performance degradation for users (real or virtual) accessing your search page from Japan. What do you do?

Do you have meaningful trends? Can your tool show you the slow-down that occurs 10 minutes after the hour, every hour? Can it show you that it is getting worse? Do you know if it is something you should worry about and investigate? Application owners rarely have the ability to mine their own data. Do you have a tool that does it for you?

I’ve seen operators get alerts about a performance issue every day at 3pm. It lasted for 10 minutes, so it was dismissed as a “momentary anomaly”, and no one noticed that it happened at the same time every day for over a month, starting right after a new patchset was installed. It is easy to drown in too much data, especially data that you don’t immediately know how to process. After all, no one knows how to debug performance issues that last for 10 minutes. You need to know that it is repeating the same way every day in order to debug it.
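
Spotting that kind of repetition does not take fancy data mining. Here is a minimal Python sketch, assuming you can export alerts as (monitor name, timestamp) pairs; the function name, the 10 minute bucket and the 5 day threshold are all my own choices, not taken from any tool.

```python
from collections import defaultdict
from datetime import datetime


def recurring_alerts(alerts, bucket_minutes=10, min_days=5):
    """Find alerts that keep firing at the same time of day.

    alerts: iterable of (monitor_name, datetime) pairs.
    Returns {(monitor_name, "HH:MM"): number_of_distinct_days} for every
    time-of-day bucket that fired on at least min_days different days.
    """
    days_seen = defaultdict(set)
    for monitor, ts in alerts:
        bucket_start = ts.minute - ts.minute % bucket_minutes
        slot = f"{ts.hour:02d}:{bucket_start:02d}"
        days_seen[(monitor, slot)].add(ts.date())
    return {key: len(days) for key, days in days_seen.items()
            if len(days) >= min_days}


# Invented history: a slowdown at 15:02 every day in April, plus one stray alert.
history = [("response_time", datetime(2008, 4, day, 15, 2)) for day in range(1, 31)]
history.append(("response_time", datetime(2008, 4, 3, 9, 47)))

print(recurring_alerts(history))
```

The 3pm slowdown is reported as firing on 30 distinct days, while the stray morning alert is dropped. Fixed buckets are crude (alerts at 14:59 and 15:01 land in different buckets), but even this much would have caught the month-long pattern.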

If you have a performance monitoring tool that you are truly happy with, please comment and let us know! In general, I’m very interested in how you do your monitoring.

My friend Moshe works for BHive, a company that happens to have developed a new monitoring tool that promises to take some of the pain (and irrelevant data) out of the equation. He arranged a demo of their solution for me, and while it is still very, very new and incomplete, it does take monitoring in an interesting direction. I’ll write a separate post about that.

****

Exciting posts appeared on my RSS Reader yesterday:

Dom Brooks at OraStory upgraded a production server to 11g and lived to tell the tale. It’s the first production upgrade I’ve heard of, and apparently it went well. Good news!

Greg Rahn at Structured Data wrote about data warehouse scalability. This post is so good that I have no superlatives. Just go read it. Even if you don’t have a data warehouse. It’s a must.

 

Do you have a checklist? September 13, 2007

Filed under: maintenance,monitoring,musing,rants — prodlife @ 1:53 am

Too often, when I search the net for some information I need, I find something completely unrelated that catches my interest. Normal people may bookmark the interesting article for later review and continue searching for the original piece of information. But I am not normal people, I have N.A.D.D – so I’ll read the interesting article, think about it for a bit, then write a blog entry about it, discuss it with friends, and a few hours later remember that I still need to find that original bit.

This time, the interesting article is “Oracle DBA Checklist”. It’s a nice list of things DBAs do daily, weekly, monthly, etc. At first glance it looked like the kind of thing that every DBA should have, and I was starting to wonder if I’m a bad DBA because I don’t have one.
At second glance, I figured out that I hate this list and everything it stands for. No DBA should have such a list. The correct name of the list is “Monitors to set up when configuring an Oracle system”.

The first problem is the entire idea of having the DBA go through a checklist every day, or even every week. This is an unbelievable waste of time. Everything that has to be done at such frequency should be automated during the first week on the job. While the list author does mention that “some sites may wish to automate this”, this is not good enough. It should be phrased as “Automate the entire checklist or find another job”.

The second problem is that of frequency. They suggest checking that the instances are up once a day. I don’t know about their implementation, but we can’t afford a day of downtime on our databases. Our monitors check for instance availability every 10 minutes. Verifying resources should also be done more frequently – disk space problems can occur very fast and crash your instance, and high CPU can cause severe performance problems. You want to catch these sooner rather than later.
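
As an illustration of how little it takes to automate one of these checks, here is a minimal Python sketch of a disk space check built on the standard library’s `shutil.disk_usage`. The function names and the 90% limit are my own assumptions, and scheduling it every 10 minutes is left to cron or whatever scheduler you already use; this is not the monitor we actually run.

```python
import shutil


def used_percent(total_bytes, used_bytes):
    """Percentage of the volume that is in use."""
    if total_bytes == 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes


def check_disk(path, max_used_pct=90.0):
    """Return an alert message if the volume holding path is too full, else None."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free bytes
    pct = used_percent(usage.total, usage.used)
    if pct > max_used_pct:
        return f"{path} is {pct:.1f}% full (limit {max_used_pct:.0f}%)"
    return None


if __name__ == "__main__":
    message = check_disk("/")
    if message:
        print("ALERT:", message)
```

A script like this, run from a scheduler and wired to send mail, replaces one line of the daily checklist for good.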

However, there were a few gems in that document – I loved the suggestion to read manuals for an hour a day. So I decided to write my own mini-checklist, despite my hatred for anything that involves routine and repetition:

Daily:

  1. Go through mailbox and dashboard to verify that no alerts were sent during the night from our monitors and scheduled jobs.
  2. Read DBA manuals
  3. Go over the blog aggregator for interesting news and posts

Monthly:

  1. Review reports of CPU, Disk, network, memory and response times for trends, future scalability and ideas for improvement.

The rest of the day can be dedicated to the usual disasters, urgent requests and firefighting. Oh, and maybe some system improvements, in our spare time.

 

Tips for good monitoring August 10, 2007

Filed under: monitoring — prodlife @ 2:35 am

Monitors are essential for maintaining a healthy production environment. You don’t want to be in a situation where things went wrong and you don’t know about it, and ideally you want to know that things are going in the wrong direction before catastrophe happens.

Monitors are tricky. It is very easy to have too many monitors, the wrong monitors, annoying monitors, monitors that everyone ignores, etc. All these mistakes result in lots of work and very little value. So, I want to give a few tips on how to get monitors right.

1) Start with the basics. Databases have 5 essential processes that must be up, they need enough disk space, CPU and memory. Usually you’ll want your listener to be around too. Make sure you monitor the essentials on all your production DBs before you start monitoring fancy things like response times.

2) Monitors should be actionable (I’m talking about alerting monitors, not the ones that are purely for data collection). There is nothing more annoying than monitors that nobody knows how to handle. We used to have a monitor that alerted us about some kind of CRS error. We knew these errors were bad, but Oracle hadn’t gotten back to us yet on what we should do when they happen, or even what the impact is, so when we got alerts about these errors the only thing we could do was ignore them. The problem was that people got used to ignoring them, and when we did learn what to do, it was very difficult to train everyone to start handling them. When you write the monitor, also write a procedure on what should be done when the monitor fires – add diskspace, kill the session that takes the most CPU, start the listener, etc. Make sure that it is very easy to find the instructions – a link in the alert is a good idea. Remember – if there is no action to take, there shouldn’t be an alert.

3) It should be reasonably easy to add a monitor that does what you want, to make sure a new server is also monitored, to configure alerts for monitors and to modify the monitors. There are a bunch of excellent monitoring tools, so don’t settle for one that is a pain.

4) Monitors shouldn’t drive people out of their minds. We used to have a diskspace monitor that sent an alert every minute. Since even the fastest DBA can take a few minutes to add diskspace to a volume, we felt that we were constantly bombarded by diskspace issues. Management also thought that diskspace was a huge problem. It wasn’t a big deal – but it was very noisy.
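
One common cure for this kind of noise is to throttle repeat alerts for the same problem. Here is a minimal Python sketch; the class name `AlertThrottle`, the alert key and the 30 minute interval are my own assumptions, not a feature of any particular monitoring product.

```python
from datetime import datetime, timedelta


class AlertThrottle:
    """Let the first alert for a key through, then stay quiet for min_interval."""

    def __init__(self, min_interval=timedelta(minutes=30)):
        self.min_interval = min_interval
        self._last_sent = {}  # alert key -> time the last alert went out

    def should_send(self, key, now):
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False  # same problem, alerted recently: suppress
        self._last_sent[key] = now
        return True


# Simulate a monitor that finds the same full volume every minute for an hour.
throttle = AlertThrottle()
start = datetime(2007, 8, 10, 2, 0)
sent = [minute for minute in range(60)
        if throttle.should_send("diskspace:/u01", start + timedelta(minutes=minute))]
print(sent)
```

Instead of 60 alerts in the hour, the DBA gets two: one when the problem starts and a reminder 30 minutes later if it is still there.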

5) Don’t monitor things that are irrelevant. We are plagued with dozens of monitors that are no longer relevant – sometimes they were added to handle an application problem that was resolved in a later version but no one knows about it, or maybe after an incident management demanded some action and we decided to add a monitor to calm them down, or maybe someone added a monitor to show off his new skills. We have a monitor that alerts us when a transaction has been open for over an hour. Why? Who cares? No one. But in SQL Server long transactions are a big deal, and we used to be an SQL Server shop, so the monitor remains. Same goes for the monitor on the number of open sessions.

6) Try to avoid very broad monitors. We have a monitor for errors in the alert log. Generally it is a good idea, but the alert log can have many different errors with different impacts and priorities. Putting everything in one monitor means less work defining monitors, but it also means that you may get alerted for issues that really don’t interest you, like a fellow DBA who made a mistake in a stored procedure.
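
A first step away from the one-big-monitor approach is to route alert log errors by error code. The Python sketch below is only an illustration: the routing table is an assumption you would tune for your own site, and the function name is mine. The ORA codes in it are real, but which of them deserve a page is a judgment call.

```python
import re

ORA_CODE = re.compile(r"ORA-(\d{5})")

# Illustrative routing table (an assumption, tune it for your own site).
ROUTES = {
    "00600": "page",    # internal error
    "04031": "page",    # unable to allocate shared memory
    "01555": "ticket",  # snapshot too old
    "06512": "ignore",  # PL/SQL call stack line, the real error is reported separately
}
DEFAULT_ROUTE = "ticket"  # unknown errors get looked at, but not at 3am


def route_alert_log_line(line):
    """Return "page", "ticket" or "ignore" for a line with ORA- errors, else None."""
    codes = ORA_CODE.findall(line)
    if not codes:
        return None
    routes = [ROUTES.get(code, DEFAULT_ROUTE) for code in codes]
    # If a line carries several codes, the most severe route wins.
    for level in ("page", "ticket", "ignore"):
        if level in routes:
            return level


print(route_alert_log_line("ORA-00600: internal error code, arguments: [...]"))
print(route_alert_log_line("ORA-06512: at line 3"))
```

Each route can then become its own monitor with its own procedure, which also takes care of tip 2.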

 

 