Tips for good monitoring
Posted: August 10, 2007
Monitors are essential for maintaining a healthy production environment. You don’t want to be in a situation where things have gone wrong and you don’t know about it, and ideally you want to know that things are heading in the wrong direction before a catastrophe happens.
Monitors are tricky. It is very easy to have too many monitors, the wrong monitors, annoying monitors, monitors that everyone ignores, etc. All these mistakes result in lots of work and very little value. So, I want to give a few tips on how to get monitors right.
1) Start with the basics. Databases have 5 essential processes that must be up, and they need enough disk space, CPU and memory. Usually you’ll want your listener to be around too. Make sure you monitor the essentials on all your production DBs before you start monitoring fancy things like response times.
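To make the "basics first" idea concrete, here is a minimal Python sketch of two such checks: free disk space and required processes. It is only an illustration. The names `CheckResult`, `check_disk` and `check_processes` are mine, the process names in the usage note are placeholders, and how you collect the set of running processes depends on your platform, so it is passed in as a parameter.

```python
import shutil
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str


def check_disk(path: str, min_free_fraction: float) -> CheckResult:
    """Fail when the volume holding `path` has less free space than required."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return CheckResult(
        name=f"disk:{path}",
        ok=free_fraction >= min_free_fraction,
        detail=f"{free_fraction:.1%} free",
    )


def check_processes(required: Iterable[str], running: Iterable[str]) -> CheckResult:
    """Fail when any required process name is absent from the running set."""
    missing = sorted(set(required) - set(running))
    return CheckResult(
        name="processes",
        ok=not missing,
        detail="all running" if not missing else "missing: " + ", ".join(missing),
    )
```

A call such as `check_processes({"proc_a", "proc_b"}, {"proc_a"})` reports `missing: proc_b`. Once checks like these are running on every production DB, the fancier monitors can follow.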
2) Monitors should be actionable (I’m talking about alerting monitors, not the ones that are purely for data collection). There is nothing more annoying than monitors that nobody knows how to handle. We used to have a monitor that alerted us about some kind of CRS error. We knew these errors were bad, but Oracle still hadn’t gotten back to us on what we should do when they happened or even what the impact was, so when we got alerts about these errors the only thing we could do was ignore them. The problem was that people got used to ignoring them, and when we did learn what to do, it was very difficult to train everyone to start handling them. When you write the monitor, also write a procedure for what should be done when the monitor fires – add disk space, kill the session that takes the most CPU, start the listener, etc. Make sure that it is very easy to find the instructions – a link in the alert is a good idea. Remember: if there is no action to take, there shouldn’t be an alert.
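One way to enforce "no action, no alert" is to make it impossible to define an alert without its procedure. The sketch below is hypothetical: the `Alert` class, its fields, and the wiki URL in the usage note are all made up for illustration and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    monitor: str
    message: str
    runbook_url: str  # link to the written procedure for this monitor

    def __post_init__(self) -> None:
        # Refuse to build an alert that nobody would know how to handle.
        if not self.runbook_url.strip():
            raise ValueError(
                f"monitor {self.monitor!r} has no runbook; do not alert on it"
            )

    def render(self) -> str:
        """Alert text with the instructions one click away."""
        return f"[{self.monitor}] {self.message}\nWhat to do: {self.runbook_url}"
```

For example, `Alert("listener", "Listener is down on db01", "https://wiki.example.com/runbooks/listener-down").render()` produces a two-line message ending with the link, while passing an empty `runbook_url` raises `ValueError` at definition time.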
3) It should be reasonably easy to add a monitor that does what you want, to make sure a new server is also monitored, to configure alerts for monitors and to modify the monitors. There are a bunch of excellent monitoring tools, so don’t settle for one that is a pain.
4) Monitors shouldn’t drive people out of their minds. We used to have a disk space monitor that sent an alert every minute. Since even the fastest DBA can take a few minutes to add disk space to a volume, we felt that we were constantly being bombarded by disk space issues. Management also thought that disk space was a huge problem. It wasn’t a big deal – but it was very noisy.
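A simple cure for the every-minute alert is a cooldown: once an alert has gone out for a given problem, stay quiet about that same problem for a while. The following is a minimal sketch of that idea, not how any specific tool does it. `AlertThrottle`, the alert key format and the cooldown value are assumptions, and the current time is passed in explicitly to keep the example easy to test.

```python
from typing import Dict


class AlertThrottle:
    """Suppress repeats of the same alert until a cooldown has elapsed."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}

    def should_send(self, key: str, now: float) -> bool:
        """Return True and record the send time if `key` is not in cooldown."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[key] = now
        return True
```

With `AlertThrottle(cooldown_seconds=600)`, a disk space check that fails every minute sends one alert and then stays silent for ten minutes, which gives the DBA time to actually add the space. Alerts with a different key are not affected.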
5) Don’t monitor things that are irrelevant. We are plagued with dozens of monitors that are no longer relevant – sometimes they were added to handle an application problem that was resolved in a later version but no one knows about it, or maybe after an incident, management demanded some action and we decided to add a monitor to calm them down, or maybe someone added a monitor to show off his new skills. We have a monitor that alerts us when a transaction has been open for over an hour. Why? Who cares? No one. But in SQL Server long transactions are a big deal, and we used to be an SQL Server shop, so the monitor remains. The same goes for the monitor on the number of open sessions.
6) Try to avoid very broad monitors. We have a monitor for errors in the alert log. Generally it is a good idea, but the alert log can contain many different errors with different impacts and priorities. Putting everything in one monitor means less work defining monitors, but it also means that you may get alerted for issues that really don’t interest you, like a fellow DBA who made a mistake in a stored procedure.
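Splitting a broad alert log monitor can be as simple as routing each error code to its own priority. Here is a rough Python sketch. The routing table is purely illustrative: which codes deserve a page, a ticket or nothing at all is a decision for your own environment, and `route_alert_log_line` is a name I made up.

```python
import re
from typing import Optional

# Oracle error codes look like ORA-00600 (five digits).
ORA_CODE = re.compile(r"\bORA-(\d{5})\b")

# Illustrative routing only; adjust per environment.
ROUTES = {
    "ORA-00600": "page",    # internal error
    "ORA-04031": "page",    # unable to allocate shared memory
    "ORA-01555": "ticket",  # snapshot too old
    "ORA-00060": "ignore",  # deadlock detected
}
DEFAULT_ROUTE = "ticket"  # unknown codes get looked at, but wake nobody up


def route_alert_log_line(line: str) -> Optional[str]:
    """Return the route for the first ORA- code in the line, or None if there is none."""
    match = ORA_CODE.search(line)
    if match is None:
        return None
    code = f"ORA-{match.group(1)}"
    return ROUTES.get(code, DEFAULT_ROUTE)
```

Each route can then become its own narrow monitor with its own procedure, instead of one monitor that fires for everything in the log.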