Our new Netapp is cursed

We had another spectacular failure during routine maintenance.

Remember the last time a routine maintenance went very bad? Well, a few days ago we had to move another machine to the new Netapp.

During the last month we moved dozens of machines to the new Netapp, so we had lots of practice and knew the drill: stop the database, unmount the volumes, edit fstab to point at the new filer, mount the volumes, start the database.
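The "edit fstab" step of that drill can be sketched in a few lines of Python. This is only an illustration, not the script we actually used: the function name `repoint_fstab` and the filer names `oldfiler` and `newfiler` are made up, and it assumes the export paths are identical on both filers.

```python
def repoint_fstab(fstab_text: str, old_filer: str, new_filer: str) -> str:
    """Return fstab text with NFS entries moved from old_filer to new_filer."""
    out = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        fields = stripped.split()
        # Leave blank lines, comments and non-NFS entries untouched.
        if (not stripped or stripped.startswith("#") or len(fields) < 3
                or not fields[2].startswith("nfs")):
            out.append(line)
            continue
        # The first field of an NFS entry has the form host:/export/path.
        host, sep, export = fields[0].partition(":")
        if sep and host == old_filer:
            line = line.replace(fields[0], f"{new_filer}:{export}", 1)
        out.append(line)
    return "\n".join(out) + ("\n" if fstab_text.endswith("\n") else "")


if __name__ == "__main__":
    sample = "oldfiler:/vol/oradata /u01/oracle/oradata nfs rw,hard 0 0\n"
    print(repoint_fstab(sample, "oldfiler", "newfiler"), end="")
```

Working on the text rather than on /etc/fstab directly makes it easy to diff the result against the original before anything is written back.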

No problem. Database stopped, volumes unmounted, fstab edited, volumes moun… wait. Mount backgrounding? No route to host?
A few seconds later we found out that we couldn’t reach any host. Call the network manager. Reconfigure interface, ifdown, ifup, rinse, repeat. Half an hour later it was clear that we were going nowhere. The network manager was pretty sure there was a hardware issue with the network card on the machine. What a coincidence! Hardware failure right in the middle of a maintenance.

No problem though. A good DBA is always prepared. We took the standby machine, connected it to the new Netapp, and the mount works!
Time to “startup” the database…
ORA-00202: controlfile: '/u01/oracle/oradata/PROD/control02.ctl'
ORA-27086: skgfglk: unable to lock file - already in use

What did I do to deserve it?

We shut down the original machine, to make extra sure it is not locking anything. We unmount and mount, we offline the volume and online it again, we reboot the machine, and we even revert to a snapshot of the database taken just before the move. Nope, the file is still locked. Even though no process is locking it, Oracle refuses to use the control file. When we change the Oracle configuration to use another control file, Oracle claims that the other file is also locked.
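One way to check a claim like "no process is locking it" from the client side is to ask for the lock yourself. The sketch below is not what we ran that night. It assumes the lock Oracle fails to get is an ordinary POSIX advisory lock, which on an NFS mount is normally forwarded to the filer's lock manager, and the helper name `can_lock` is made up. It needs read/write permission on the file and should only be pointed at a control file while the database is down.

```python
import errno
import fcntl
import os


def can_lock(path: str) -> bool:
    """Try a non-blocking exclusive POSIX lock on path, then release it."""
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                # Another holder (possibly a stale lock kept by the
                # server) is in the way.
                return False
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, "lockable" if can_lock(name) else "LOCKED")
```

If this reports a file as locked while every instance is down, the lock is most likely being held on the server side, which is where the filer commands mentioned in the comments come in.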

Defeated, we connect the standby machine to the old Netapp, and finally the database opens.

Now I have to write a report about this and suggest ways to do the move better next time. What can I say? The same procedure worked successfully a million times in the past. The network card was clearly very bad luck, but what about the lock? It could be said that if we had better expertise with NFS maybe we would have known how to solve it, but the experts at Netapp and Redhat don’t have any answers about what happened either. Is it just our fate to suffer failures from time to time which we can’t explain or prevent?


7 Comments on “Our new Netapp is cursed”

  1. Freek says:

    About the lock on the controlfile.
    I had this problem once when one of our clients tried to upgrade a 2-node RAC database while one of the instances was still online.

    After bringing down the second node, we could not start the instance with the cluster_database parameter set to false.

    Eventually I found that the Netapp was still keeping an NFS lock on one of the controlfiles, even when all instances were down.

    On Metalink there is also a note (4289172 – startup upgrade fails with ora-205 with 10.1.0.4) that explains the same NFS locking problem. In that note the problem was caused by the hosts file on the server listing the long name of the Netapp first and then the short name, while the output of the hostname command on the Netapp showed the short name.

    To find and resolve the locking issue I had to use the following commands:

    lock status -f

    priv set advanced
    sm_mon -l clientname
    priv set

    lock status -f

    regards

    Freek

  2. prodlife says:

    Freek,

    Thanks for the suggestions.
    I’ll send them to our storage manager as well. I have some experience with Netapp, but somehow I’ve never heard of those commands, and they sound like just the thing for NFS locking issues.

    Thanks,
    Chen Shapira

  3. kevinclosson says:

    Just out of curiosity, is your ORACLE_HOME also on the filer, or do you install Oracle onto an Ext3 Home?

  4. prodlife says:

    Kevin,

    We actually have both kinds of systems.

    I like placing ORACLE_HOME on the filer, because it allows for neat tricks like taking a snapshot before applying a patch and duplicating an installation to a new system quickly.

    Other DBAs feel uncomfortable with placing ORACLE_HOME on the Netapp (mostly because it is usually associated with the shared-home type of RAC installation, although it doesn’t have to be) and they install systems with ORACLE_HOME on the local disk.

    After two years of running both systems, I’m pretty sure it doesn’t really matter.

  5. kevinclosson says:

    Prodlife,

    I like Oracle Home on NAS too. I was just wondering. So, when you were thrashing to get this instance started but locks were mysteriously held, did you happen to notice whether there was an lk${ORACLE_SID} file in the dbs directory under ORACLE_HOME on the NAS?

  6. prodlife says:

    Kevin,

    I didn’t even know enough to look for this file; it’s the first time I’ve heard of it.

    I think it was not there, because one of the steps we took during the attempts to start the database on the new filer was to revert to a snapshot taken before the move, which didn’t help. I find it unlikely that the snapshot had that lock file, but I can’t know for sure.

