Our new Netapp is cursed
Posted: July 13, 2007
We had another spectacular failure during routine maintenance.
Remember the last time a routine maintenance went very bad? Well, a few days ago we had to move another machine to the new Netapp.
During the last month we moved dozens of machines to the new Netapp, so we had lots of practice and knew the drill: stop the database, unmount volumes, edit fstab with new filer, mount volumes, start database.
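The "edit fstab with new filer" step is the one most prone to typos, so here is a minimal Python sketch of how it could be scripted. It is only an illustration: the function name `repoint_fstab` and the filer hostnames in the usage note are hypothetical, and it assumes the volumes are ordinary `nfs` or `nfs4` entries whose device field looks like `host:/export`.

```python
def repoint_fstab(fstab_text: str, old_filer: str, new_filer: str) -> str:
    """Return fstab_text with NFS entries on old_filer pointed at new_filer.

    Hypothetical helper: it only rewrites text and never touches /etc/fstab.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave comments, blank lines, and non-NFS entries untouched.
        is_target = (
            len(fields) >= 3
            and not line.lstrip().startswith("#")
            and fields[2] in ("nfs", "nfs4")
            and fields[0].startswith(old_filer + ":")
        )
        if is_target:
            # Swap only the host part of "host:/export"; keep the rest verbatim.
            line = line.replace(old_filer + ":", new_filer + ":", 1)
        out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if fstab_text.endswith("\n") else "")
```

You would read the current fstab into a string, call something like `repoint_fstab(text, "oldfiler", "newfiler")`, and diff the result against the original before writing anything back. The database stop and start and the unmount and mount steps stay manual on purpose.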
No problem. Database stopped, volumes unmounted, fstab edited, volumes moun… wait. Mount backgrounding? No route to host?
A few seconds later we found out that we couldn’t reach any host. We called the network manager. Reconfigure interface, ifdown, ifup, rinse, repeat. Half an hour later it was clear that we were getting nowhere. The network manager was pretty sure there was a hardware issue with the network card on the machine. What a coincidence! Hardware failure right in the middle of a maintenance.
No problem though. A good DBA is always prepared. We took the standby machine, connected it to the new Netapp, and the mount worked!
Time to “startup” the database…
ORA-00202: controlfile: '/u01/oracle/oradata/PROD/control02.ctl'
ORA-27086: skgfglk: unable to lock file - already in use
What did I do to deserve it?
We shut down the original machine, to make extra sure it was not locking anything. We unmounted and mounted, we took the volume offline and brought it online again, we rebooted the machine, and we even reverted to a snapshot of the database taken just before the move. Nope, the file was still locked. Even though no process was locking it, Oracle refused to use the control file. When we changed the Oracle configuration to use another control file, Oracle claimed that the other file was locked too.
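One way to check the "file is locked" claim from outside Oracle is to try taking a lock on the file yourself. The Python sketch below is only a probe under stated assumptions: `can_lock` is a hypothetical helper, it uses a non-blocking POSIX advisory lock through `fcntl.lockf`, and whether that matches exactly what Oracle does internally is an assumption on my part. Run something like this only while the database is down.

```python
import errno
import fcntl
import os


def can_lock(path: str) -> bool:
    """Return True if an exclusive advisory lock on path can be taken right now.

    Hypothetical probe: the lock is released again before returning.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Non-blocking request: fail immediately instead of waiting.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            # EACCES or EAGAIN means some other holder has the lock.
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return False
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

If a probe like this fails on the NFS mount while no client has the file open, that would point at a stale lock on the filer side or at the lock service rather than at the database, which is at least something concrete to hand to the vendor.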
Defeated, we connected the standby machine to the old Netapp, and finally the database opened.
Now I have to write a report about this and suggest ways to do the move better next time. What can I say? The same procedure had worked a million times in the past. The network card was clearly just bad luck, but what about the lock? You could argue that with more NFS expertise we might have known how to solve it, but the experts at Netapp and Red Hat don’t have any answers about what happened either. Is it just our fate to suffer failures from time to time that we can neither explain nor prevent?