Change Management
Posted: August 13, 2007
“Change is not merely necessary to life — It is life.” ~ Alvin Toffler
Changes to production systems are a significant part of my job. In any given week there is a patch that should be applied, a customer who wants to restore his database from an older backup, an SGA that needs to be resized, a RAC node to drop, and many more.
All DBAs know, but hate admitting, that the number one cause of unscheduled downtime is maintenance that went wrong. Yes, we cause most of the incidents in the system. Random crashes are not as common as we prefer to believe. This is actually good news, because if we cause the incidents, it may be possible to cause fewer of them.
We try to cause fewer incidents during changes by having a change control process in place. Our process includes a change request form, filled in online. The form contains details such as the change procedure (e.g. build an index), the verification procedure (check the execution plan for the query and see that it uses the index), the risks (the table will be locked while we build the index, so users may be impacted) and the rollback procedure (drop the index). Just filling in this form and thinking about the process and its risks has probably prevented a few incidents, and there is the added bonus that you don’t need to figure out the rollback while you are in the middle of a catastrophe. The next step is that the responsible manager reviews the change and approves it, hopefully catching any mistake missed by the DBA. In many cases, the manager will ask the DBA to try the procedure on a test system before the change is approved.
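To make the form concrete, here is a minimal sketch of how such a request could be modeled. The class name, field names and readiness rule are my own assumptions for illustration, not the layout of our actual form.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChangeRequest:
    """Hypothetical change request record; all names are illustrative."""
    procedure: str                     # what will be done
    verification: str                  # how to confirm the change worked
    risks: List[str] = field(default_factory=list)
    rollback: str = ""                 # how to undo the change
    approved_by: Optional[str] = None  # set once a manager signs off

    def missing_fields(self) -> List[str]:
        """Return the names of required sections that are still empty."""
        required = {
            "procedure": self.procedure,
            "verification": self.verification,
            "risks": self.risks,
            "rollback": self.rollback,
        }
        return [name for name, value in required.items() if not value]

    def ready_for_production(self) -> bool:
        """Ready only when every section is filled in and someone approved it."""
        return not self.missing_fields() and self.approved_by is not None


# The index example from the paragraph above, not yet approved.
request = ChangeRequest(
    procedure="Build index",
    verification="Check the execution plan for the query and see that it uses the index",
    risks=["Table is locked while the index is built; users may be impacted"],
    rollback="Drop the index",
)
```

The point of `missing_fields` is the same as the point of the form: a request with no rollback procedure written down should not get as far as the approval step.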
As you can imagine, this is lots of overhead work, and it can often be annoying. The important question is: Does it work?
It does in most cases. Writing down the procedure, having someone else go over it and trying it in advance on a test system often catches problems that would have caused a severe incident on production.
When doesn’t it work? There are two cases where the process doesn’t really help:
1. Small and well-known changes. There are some tasks that are really short and easy and that we have already done hundreds of times. In this case, the change itself can take 5 minutes but the procedure can take anywhere from an hour to a few days (if we need to set up a test system from scratch).
2. Random issues. Even the most carefully planned process doesn’t help when something truly random happens. An upgrade from 10.1 to 10.2 works great in staging but hangs in production, or a patch installation that worked great when tested fails on production. Often, even Oracle support won’t be able to tell you why the process failed just when you really needed it to work. Call it Murphy’s law.
Another nice side effect of managing all changes in a change control system is that if you notice a performance issue that started on July 27, you can use the system to check what changed on that day. It doesn’t guarantee a solution, but it could be a lead.
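A minimal sketch of that lookup, assuming the change log can be exported as a list of dated records. The record layout, the function name and the sample entries are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ChangeRecord:
    """Hypothetical change log entry; field names are illustrative."""
    applied_on: date
    system: str
    summary: str


def changes_on(log: List[ChangeRecord], day: date) -> List[ChangeRecord]:
    """Return every change that was applied on the given day."""
    return [change for change in log if change.applied_on == day]


# Sample entries, invented for illustration only.
log = [
    ChangeRecord(date(2007, 7, 26), "db01", "Resize SGA"),
    ChangeRecord(date(2007, 7, 27), "db01", "Build index"),
    ChangeRecord(date(2007, 7, 27), "db02", "Apply patch"),
]

# The performance issue started on July 27, so list what changed that day.
suspects = changes_on(log, date(2007, 7, 27))
for change in suspects:
    print(change.system, change.summary)
```

In practice you would probably widen the search to a day or two before the symptom appeared, since the effect of a change is not always visible immediately.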
All in all, I strongly recommend having a good change control process which includes a review of each change by a second pair of eyes. The benefits are clearly worth the overhead.