Tips for good monitoring

Monitors are essential for maintaining a healthy production environment. You don’t want to be in a situation where things have gone wrong and you don’t know about it, and ideally you want to know that things are heading in the wrong direction before a catastrophe happens.

Monitors are tricky. It is very easy to end up with too many monitors, the wrong monitors, annoying monitors, monitors that everyone ignores, etc. All these mistakes result in lots of work and very little value. So, I want to give a few tips on how to get monitors right.

1) Start with the basics. Databases have 5 essential processes that must be up, and they need enough disk space, CPU and memory. Usually you’ll want your listener to be around too. Make sure you monitor the essentials on all your production DBs before you start monitoring fancy things like response times.
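As an illustration of how small an "essentials" check can be, here is a minimal sketch of a disk space monitor in Python. The function names and the 80% and 90% thresholds are my own assumptions for the example, not values from any particular tool; pick numbers that fit your volumes.

```python
import shutil


def classify_usage(used_fraction, warn=0.80, crit=0.90):
    """Map a used-space fraction (0.0 to 1.0) to OK, WARNING or CRITICAL.

    The default thresholds are illustrative assumptions.
    """
    if used_fraction >= crit:
        return "CRITICAL"
    if used_fraction >= warn:
        return "WARNING"
    return "OK"


def check_disk(path, warn=0.80, crit=0.90):
    """Return (status, used_fraction) for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total
    return classify_usage(used_fraction, warn, crit), used_fraction


if __name__ == "__main__":
    status, used = check_disk("/")
    print(f"/ is {used:.0%} full: {status}")
```

The same shape (measure, compare to a threshold, return a status) works for CPU, memory, and "is the process up" checks.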

2) Monitors should be actionable (I’m talking about alerting monitors, not the ones that are purely for data collection). There is nothing more annoying than a monitor that nobody knows how to handle. We used to have a monitor that alerted us about some kind of CRS error. We knew these errors were bad, but Oracle still hadn’t gotten back to us on what we should do when they happened, or even what the impact was, so when we got alerts about these errors the only thing we could do was ignore them. The problem was that people got used to ignoring them, and when we did learn what to do, it was very difficult to train everyone to start handling them. When you write the monitor, also write a procedure for what should be done when the monitor fires: add disk space, kill the session that takes the most CPU, start the listener, etc. Make sure that the instructions are very easy to find; a link in the alert is a good idea. Remember: if there is no action to take, there shouldn’t be an alert.
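One way to enforce the "no action, no alert" rule is to make it impossible to define an alert without a procedure. The sketch below is a hypothetical design in Python: `AlertDefinition`, `format_alert` and the wiki URL are names I made up for illustration, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    """An alerting monitor plus the procedure for handling it."""
    name: str
    action: str       # one-line summary of what the on-call person should do
    runbook_url: str  # link to the full written procedure

    def __post_init__(self):
        # Refuse to create an alert that nobody would know how to handle.
        if not self.action.strip() or not self.runbook_url.strip():
            raise ValueError(
                f"alert {self.name!r} has no action or runbook; do not alert on it"
            )


def format_alert(alert, detail):
    """Build the alert text, always including the link to the procedure."""
    return (
        f"[{alert.name}] {detail}\n"
        f"Action: {alert.action}\n"
        f"Procedure: {alert.runbook_url}"
    )


if __name__ == "__main__":
    listener_down = AlertDefinition(
        name="listener_down",
        action="Start the listener",
        runbook_url="https://wiki.example.com/runbooks/listener-down",  # hypothetical
    )
    print(format_alert(listener_down, "listener is not responding on db01"))
```

With this in place, a monitor like our old CRS error check could still collect data, but it could not page anyone until someone had written down what to do about it.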

3) It should be reasonably easy to add a monitor that does what you want, to make sure a new server is also monitored, to configure alerts for monitors and to modify the monitors. There are a bunch of excellent monitoring tools, so don’t settle for one that is a pain.

4) Monitors shouldn’t drive people out of their minds. We used to have a disk space monitor that sent an alert every minute. Since even the fastest DBA can take a few minutes to add disk space to a volume, we felt that we were constantly bombarded by disk space issues. Management also thought that disk space was a huge problem. It wasn’t a big deal, but it was very noisy.
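A simple fix for the every-minute problem is to keep checking every minute but suppress repeat notifications for the same issue during a cooldown period. Here is a minimal sketch in Python; the `AlertThrottle` class, the key format and the 30-minute default are assumptions for the example.

```python
class AlertThrottle:
    """Suppress repeat notifications for the same issue within a cooldown window."""

    def __init__(self, cooldown_seconds=1800):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # issue key -> time the last notification was sent

    def should_send(self, key, now):
        """Return True if a notification for `key` should go out at time `now`.

        `now` is a timestamp in seconds, for example from time.time().
        """
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # already notified recently, stay quiet
        self._last_sent[key] = now
        return True

    def resolve(self, key):
        """Forget an issue once it is fixed, so a recurrence alerts immediately."""
        self._last_sent.pop(key, None)


if __name__ == "__main__":
    throttle = AlertThrottle(cooldown_seconds=600)
    for minute in range(12):
        if throttle.should_send("disk:/u01", now=minute * 60):
            print(f"minute {minute}: send alert")
```

In the usage example the check runs twelve times, once per simulated minute, but only two alerts go out (at minute 0 and minute 10), which gives the DBA time to actually add the disk space.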

5) Don’t monitor things that are irrelevant. We are plagued with dozens of monitors that are no longer relevant. Sometimes they were added to handle an application problem that was resolved in a later version, but no one knows about it. Or maybe after an incident management demanded some action and we decided to add a monitor to calm them down, or maybe someone added a monitor to show off his new skills. We have a monitor that alerts us when a transaction has been open for over an hour. Why? Who cares? No one. But in SQL Server long transactions are a big deal, we used to be an SQL Server shop, and the monitor remains. The same goes for the monitor on the number of open sessions.

6) Try to avoid very broad monitors. We have a monitor for errors in the alert log. Generally it is a good idea, but the alert log can have many different errors with different impacts and priorities. Putting everything in one monitor means less work defining monitors, but it also means that you may get alerted for issues that really don’t interest you, like a fellow DBA who made a mistake in a stored procedure.
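A middle ground between one catch-all monitor and a monitor per error is to route alert log lines by error code. The Python sketch below shows the idea. The routing table is purely an illustrative assumption (which codes deserve a page is a decision for your own team), and the `route_line` name and the three route labels are made up for the example.

```python
import re

# Oracle error codes look like ORA-00600: "ORA-" followed by five digits.
ORA_CODE = re.compile(r"ORA-(\d{5})")

# Hypothetical routing table; adjust to your own priorities.
ROUTES = {
    "00600": "page",    # internal error
    "04031": "page",    # unable to allocate shared memory
    "01555": "ticket",  # snapshot too old
    "00060": "log",     # deadlock detected
}
DEFAULT_ROUTE = "ticket"  # unknown errors get looked at, but nobody is woken up


def route_line(line):
    """Return the route for an alert log line, or None if it has no ORA- code."""
    match = ORA_CODE.search(line)
    if match is None:
        return None
    return ROUTES.get(match.group(1), DEFAULT_ROUTE)


if __name__ == "__main__":
    sample = [
        "ORA-00600: internal error code, arguments: [...]",
        "ORA-01555: snapshot too old",
        "Thread 1 advanced to log sequence 42",
    ]
    for line in sample:
        print(route_line(line), "<-", line)
```

Each route can then be its own monitor with its own procedure and its own alerting schedule.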

4 Comments on “Tips for good monitoring”

  1. Again for Oracle after 10g this topic completely changed, one tool to rule them all 🙂

    http://www.oracle.com/technology/products/oem/index.html

    AWR and ADDM are my best friends, fantastic root cause analysis after years of experience, and as a result you pay extra for this kit 😦

    Best regards.

  2. Totally agree with #1. First watch the CPU and I/O rate on the db server. They indicate whether the server is busy or not. If something is slow, and the CPU and I/O are idle, then you can fire up your esoteric diagnostic tool set…

  3. Raj Jamadagni says:

    H.Tonguç Yılmaz,

    But AWR and ADDM cost a lot of money and not every company is willing to pay for them. Hence sometimes simple scripts are the best.

    BTW I too like AWR and ADDM.

    Raj

  4. prodlife says:

    We use AWR for analyzing possible causes of a performance problem after we already determined there is an issue using other tools.

    The reason for this is mostly historical – we had a very good set of working monitors long before Oracle released AWR, and we need a very good reason to change a system that worked so well for many years.

