Change Management

“Change is not merely necessary to life — It is life.” ~ Alvin Toffler

Changes to production systems are a significant part of my job. In any given week there is a patch that should be applied, a customer who wants to restore a database from an older backup, an SGA that needs to be resized, a RAC node to drop, and many more.

All DBAs know, but hate admitting, that the number one cause of unscheduled downtime is maintenance that went wrong. Yes, we cause most of the incidents in the system. Random crashes are not as common as we prefer to believe. This is actually good news, because if we cause the incidents, it may be possible to cause fewer of them.

We try to cause fewer incidents during changes by having a change control process in place. Our process includes a change request form, filled in online. The form contains details such as the change procedure (e.g. build an index), the verification procedure (check the execution plan for the query and see that it uses the index), risks (the table will be locked while we build the index, so users may be impacted) and the rollback procedure (drop the index). Just filling in this form and thinking about the process and its risks has probably prevented a few incidents, and there is the added bonus that you don’t need to figure out the rollback while you are in the middle of a catastrophe. The next step is that the responsible manager reviews the change and approves it, hopefully catching any mistake missed by the DBA. In many cases, the manager will ask the DBA to try the procedure on a test system before the change is approved.
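
To make the form concrete, here is a minimal sketch in Python of what such a request might look like as a data structure. The class name, the field names and the readiness rule are assumptions made for illustration, not a description of any real change control tool. The example values are the index example from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChangeRequest:
    """Hypothetical model of the change request form described above."""
    title: str
    procedure: str      # what will be done, e.g. "build index"
    verification: str   # how to confirm the change worked
    risks: str          # what could go wrong and who is impacted
    rollback: str       # how to undo the change
    approved_by: Optional[str] = None  # reviewing manager, once approved
    tested_on: List[str] = field(default_factory=list)  # test systems used

    def missing_sections(self) -> List[str]:
        """Return the names of required form sections left blank."""
        required = ("procedure", "verification", "risks", "rollback")
        return [name for name in required if not getattr(self, name).strip()]

    def ready_for_production(self) -> bool:
        """Assumed rule: every section filled in and a manager has approved."""
        return not self.missing_sections() and self.approved_by is not None


# The index example from the text, with the rollback section forgotten.
request = ChangeRequest(
    title="Add index for slow query",
    procedure="Build index",
    verification="Check execution plan for query and see that it uses index",
    risks="Table will be locked when we build index, users may be impacted",
    rollback="",
)

print(request.missing_sections())      # lists the blank sections
print(request.ready_for_production())  # stays False until complete and approved
```

The point of the sketch is only that a request with a blank rollback section, or without an approver, is never considered ready, which is the same discipline the form enforces on paper.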

As you can imagine, this is a lot of overhead, and it can often be annoying. The important question is: Does it work?
It does in most cases. Writing down the procedure, having someone else go over it and trying it in advance on a test system often catches problems that would have caused a severe incident on production.

When doesn’t it work? There are two cases where the process doesn’t really help:
1. Small and well-known changes. There are some tasks that are really short and easy and that we have done hundreds of times already. In this case, the change itself can take 5 minutes but the procedure can take from an hour to a few days (if we need to set up a test system from scratch).
2. Random issues. Even the most carefully planned process doesn’t help when something truly random happens. An upgrade from 10.1 to 10.2 works great in staging but hangs in production, or a patch installation works great when tested but fails on production. Often, even Oracle support won’t be able to tell you why the process failed just when you really needed it to work. Call it Murphy’s law.

Another nice side effect of managing all changes in a change control system is that if you notice a performance issue that started on July 27, you can use the system to check what changed on that day. It doesn’t guarantee a solution, but it could be a lead.
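
As a rough illustration of that lookup, the sketch below filters a list of change records by date. Everything in it is assumed for the example: the `ChangeRecord` fields, the `changes_on` helper, the system names and the sample entries are invented, and the year is arbitrary because only the day matters here.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class ChangeRecord(NamedTuple):
    """Hypothetical entry in a change control log."""
    applied_on: date
    system: str
    summary: str


def changes_on(log: List[ChangeRecord], day: date,
               system: Optional[str] = None) -> List[ChangeRecord]:
    """Return the changes applied on `day`, optionally for one system only."""
    return [
        record for record in log
        if record.applied_on == day
        and (system is None or record.system == system)
    ]


# Invented sample data; the year is chosen arbitrarily for the example.
log = [
    ChangeRecord(date(2007, 7, 20), "billing-db", "Applied patch"),
    ChangeRecord(date(2007, 7, 27), "billing-db", "Resized SGA"),
    ChangeRecord(date(2007, 7, 27), "reports-db", "Built index"),
]

# Performance issue on billing-db started on July 27: what changed that day?
for record in changes_on(log, date(2007, 7, 27), system="billing-db"):
    print(record.applied_on, record.system, record.summary)
```

A real change control system would more likely answer this with a search or a query, but the idea is the same: the date of the symptom becomes the filter on the list of changes.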

All in all, I strongly recommend having a good change control process that includes a review of each change by a second pair of eyes. The benefits are clearly worth the overhead.


3 Comments on “Change Management”

  1. Great article.

    I’d recommend doing some ITIL reading – in particular your ‘small, well known changes’ issue is dealt with in that framework by setting up the test and procedure once, and then just raising a standard change for the subsequent repeats, standard changes not requiring the same level of test and acceptance precisely because they are small and well-known.

  2. prodlife says:

    Thanks.

    Actually, I did an ITIL certification course (naturally, a year after we had to put all these procedures in place…), and we definitely have these standard procedures.
    The problem usually occurs when there is a change that I believe is similar enough to a standard change so it shouldn’t be retested, but I can’t convince my team lead that it is similar enough.
    A good example is that recently we had to move tons of databases to new storage. In my opinion, we could test the move once and then move all servers. My managers decided we should test the move once per application, even though the change is entirely in the database and the application doesn’t care which storage the DB is using.
    We definitely prefer to err on the side of too much testing, but sometimes it is taken to extremes.

  3. […] 19th, 2007 I’ve written in the past about change management processes, and if you’ve read my old post you know that I’m a huge fan of change management. […]

