Unusual IO activity on shared clusterware home

Sometimes a problem exists in a system for years, but only becomes apparent when you prepare for a big change. This war story begins when our storage admin decided to replace our Netapp disks with new disks, twice as large. It is a cheap way to increase disk space and IO wait times.

While assessing the impact of this change, he found out that the volumes holding the shared oracle home for our RAC clusters were doing 6000 IO operations per second (IOPS). The data and redo volumes never exceeded 2000 IOPS, so 6000 is quite significant, especially on disks that should be practically idle.

Initial debugging showed that almost all the IO was neither reads nor writes, but operations like “get attribute” and “access”. At this point I discovered that there is almost no way to get detailed information about IO activity on NFS mounts. I could not see which processes generated the activity, nor which files or directories it touched.
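The closest thing I found to visibility was nfsstat, which at least breaks client traffic down by operation type, though still nothing per-process or per-file. A rough sketch, assuming the standard nfs-utils client tools are installed:

```shell
# Per-operation RPC counts on the NFS client; with attribute caching
# disabled, getattr and access dominate the output.
nfsstat -c

# Sample twice and diff to see which counters are actually growing.
nfsstat -c > /tmp/nfs.before
sleep 60
nfsstat -c > /tmp/nfs.after
diff /tmp/nfs.before /tmp/nfs.after
```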

Time to get advice from the experts on Oracle-L. Vasu Balla of Pythian provided the solution:

“Oracle recommends using the noac or
actimeo=0 options when mounting NFS for datafiles, voting disk and OCR. Noac
means “no attribute cache”: none of the file attributes are cached in
the filesystem cache, which is very much needed for RAC. If you put your
shared oracle home in that same mountpoint which is mounted noac, every
access to a file in the oracle home requires a physical IO at the Netapp. So
I recommend moving all software directories (db oracle home, asm oracle
home, crs oracle home etc.) to an NFS mount which is not mounted with noac
or actimeo=0.”
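In fstab terms, the problem looked roughly like this (filer name and paths are invented for illustration; actimeo=0 behaves like noac for this purpose):

```shell
# /etc/fstab - the problematic layout: datafiles AND the shared
# oracle home behind the same attribute-cache-disabled mount.
filer1:/vol/rac_shared  /u01/app/oracle  nfs  rw,hard,intr,actimeo=0  0 0
```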

What a wonderful explanation. I now understand the issue and know what to do to solve it. It took me about 3 minutes to test this solution on our staging environment, and it worked like a charm.

Unfortunately, both Netapp and Oracle insisted that a shared oracle home on Netapp must be mounted with actimeo=0, and that if this was causing me trouble, I should move to a local home instead of a shared one. Only after very long discussions with two experts from Oracle did I get unofficial confirmation that the official documentation is probably wrong and that mounting an oracle home with actimeo=0 is a bad idea.

To my surprise, my boss agreed to go ahead with the unofficial but working solution and change NFS mounts to remove “actimeo=0”.

So, we schedule downtime on our production RACs, we change the mount options, and… nothing happens. At all. 6000 IOPS before and after the change. If I hadn’t been so shocked, I might have noticed my professional credibility taking a hit right there.

Why didn’t it work on production? For weeks I had no answer, until our network admin mentioned that I could use rpcdebug to get more insight into the issue. It turns out that NFS runs over RPC, and that Linux has flags for debugging RPC. By throwing magic numbers into /proc/sys/sunrpc/nfs_debug I could get NFS trace messages thrown into /var/log/messages. Now we were getting somewhere.
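For anyone trying this at home: the rpcdebug utility sets the same bitmask you can echo into /proc/sys/sunrpc/nfs_debug, with names instead of magic numbers. A sketch (run as root; exact flag behavior varies by kernel version):

```shell
# Enable all NFS client debug flags; trace messages land in syslog,
# typically /var/log/messages.
rpcdebug -m nfs -s all

# The magic-number alternative: write the bitmask directly.
echo 65535 > /proc/sys/sunrpc/nfs_debug

# The logging is voluminous - clear the flags when done.
rpcdebug -m nfs -c all
```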

Except that it didn’t get me very far. I could see which devices NFS accessed, but I already knew that. I could see that our production server made many, many calls to “getattr”, while our staging system didn’t. To complete my tests I decided to turn off attribute caching on staging again and compare the logs, just to see what it looks like when both systems are in the same state.

A strange difference caught my eye: the staging system had messages saying “NFS: Refresh_inode” which did not exist in production. A tiny difference, but maybe it had an impact? What does refreshing the inode mean? Time to go to lxr.linux.no and look at the Linux kernel code for clues. I just needed to recall which version to look at.

When the lightbulb went off it nearly blinded me. The staging system was running Linux 2.4.27; production was running 2.6.9. I was the one who pushed for the upgrade. I had said, “There are many NFS improvements in the new kernel versions.”

From here it was easy to find the change. In 2.4 the code for getting file attributes from the server looked like this:

 static inline int
 nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
 {
         if (time_before(jiffies, NFS_READTIME(inode) + NFS_ATTRTIMEO(inode)))
                 return NFS_STALE(inode) ? -ESTALE : 0;
         return __nfs_revalidate_inode(server, inode);
 }

Which basically means: fetch fresh attributes from the server only if the cached ones have timed out.

In 2.6 the code changed and the following check was added:

 /* We may force a getattr if the user cares about atime */
        if (need_atime)
                err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
        else
                err = nfs_revalidate_inode(NFS_SERVER(inode), inode);

Which means that if the user cares about atime, the last time the file was accessed, we skip the cache timeout check and force a getattr from the server. Another IO operation, even if the cache had not timed out.

Luckily, the fix is also easy. Just add “noatime” to the NFS mount options, to let the kernel know that we don’t care about access times, so it can go back to using the attribute cache.
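The corrected mount for the software home, again with an invented filer and path: attribute caching on, atime off.

```shell
# /etc/fstab - oracle home volume: no actimeo=0 or noac, plus noatime
# so reads no longer force a getattr round-trip to the filer.
filer1:/vol/orahome  /u01/app/oracle  nfs  rw,hard,intr,noatime  0 0
```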

So easy once you know what to look for!


DBA hallucinations

Every once in a while something happens that makes me question not only my skills as a DBA, but my sanity as well. Today’s event was one of the stranger ones that happened to me.

A coworker asked me to export a schema and import it into the same database under a different name. A very common request, not something that is likely to go wrong. I run the expdp command; it finishes without warnings. Time to run the impdp command with remap_schema to load the data into a new schema; again it finishes with no warnings. I log on to the database and change the password for the new user; again it completes successfully.
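The commands were along these lines (schema names, directory object and file names invented for the example):

```shell
# Export the source schema with Data Pump.
expdp system schemas=TEST_USER directory=DATA_PUMP_DIR \
      dumpfile=test_user.dmp logfile=test_user_exp.log

# Import it into the same database under a new name.
impdp system remap_schema=TEST_USER:TEST_USER_COPY \
      directory=DATA_PUMP_DIR dumpfile=test_user.dmp logfile=test_user_imp.log
```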

Great, I call the co-worker and I tell him his new schema is ready. Two minutes later he is back, and complains that he gets an error logging in. Puzzled, I try logging in as that user myself:

SQL> conn test_user_copy/secret
ORA-01017: invalid username/password; logon denied

Even more puzzled, I check dba_users for the user I just created. It is not there. Blinking, I check again. Still not there. I know that I successfully ran “alter user… identified by….” on that user not two minutes ago. It is a large schema; dropping it would take significantly longer than that, even if someone were so inclined, which I doubt.

I recheck the import log, and it still shows that the new user was created and the data imported successfully. Resigned, I rerun the import command. It succeeds, which means the user really was no longer there. I alter the password again, verify that the user still exists, and notify the coworker that he can try again. He confirms that everything is fine, so I guess the user didn’t disappear this time.

Puzzled, I consulted a senior DBA about my problem. He suggested looking at the alert log, as exports and imports also write a rather cryptic line there. I checked. The alert log clearly showed only one import: the second one.

The senior DBA kindly suggested that I was imagining things when I thought I was running the first import, and maybe I should go home early today and get some rest.

I really have no explanation for this event. I’m not even planning on contacting Oracle Support with this story, it is just too strange.


While searching for something vaguely NFS-related on Google, I found out that someone at Netapp has a blog. I was unable to find out what he does at Netapp besides blogging, but I found his posts interesting. I happen to love working with Netapp, both the appliance and the vendor, and the blog really echoes the competent and friendly vibe I get when dealing with the company.

MySQL on Netapp

One of the databases we need to support is a MySQL database, installed on a Windows server. The DBA team collectively has about 3 days of experience with MySQL. The system is technically production, but it is used only by internal users. It is a rather large database (200G) and gets tons of traffic, because the application does lots of automatic data collection and reporting. In addition, it is an ultra-sensitive application, which will crash if we stop the database for more than a minute.

But the worst thing is that the developers can’t keep their hands off the database. So every day brings a new surprise: a new version, a new storage system, the data files moved to a new directory, all our backup scripts deleted, etc.

I placed the MySQL data files on a Netapp LUN (connected to Windows via iSCSI) and I wrote a very nice script that locked the tables, flushed them to disk, took a Netapp snapshot and unlocked the tables. I was rather proud of this script, because it was really a good use of Netapp and allowed us to take backups without upsetting our sensitive application.
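The heart of the script looked something like this (filer, volume and snapshot names invented; the point is that the lock is held by the same session that triggers the snapshot). This is a Unix-flavored sketch of what ran via the mysql client:

```shell
mysql --user=backup --password="$MYSQL_PW" <<'EOF'
-- Block writes and flush MyISAM table data to disk.
FLUSH TABLES WITH READ LOCK;
-- Shell out to the filer to snapshot the volume while the lock is held.
\! rsh filer1 snap create mysql_vol nightly
-- Release the lock; the application pause is only a few seconds.
UNLOCK TABLES;
EOF
```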

Well, a few weeks ago we noticed that snapshots had stopped on the relevant volume. Further investigation revealed that the backup scripts were gone. Well, I’m used to that, and I immediately uploaded a new copy of the scripts.

Only today I noticed that at some point our developers had changed the MySQL storage engine from MyISAM to InnoDB. InnoDB tables are not affected by the lock and flush commands, which means that we have no valid backups.

I looked into InnoDB backup possibilities a bit, and there seem to be two solutions, neither of which is really compatible with Netapp snapshot capabilities, which means that they will be really, really slow and take up too much storage. I’m still checking our options with Netapp support, but we may need to resort to cold backups.

In the meantime my opinion of MySQL just got a bit more negative. It appears to give DBAs (or developers) two choices for storage:
1. MyISAM – which is quick and can be easily backed up, but doesn’t support transactions, ACID or point-in-time recovery, so it can barely be called a real DB.
2. InnoDB – which can be called a real DB, except that it is very difficult to back up.

I love Open Source, but I prefer my databases to be Oracle.

Our new Netapp is cursed

We had another spectacular failure during routine maintenance.

Remember the last time routine maintenance went very bad? Well, a few days ago we had to move another machine to the new Netapp.

During the last month we moved dozens of machines to the new Netapp, so we had lots of practice and knew the drill: stop the database, unmount volumes, edit fstab with new filer, mount volumes, start database.

No problem. Database stopped, volumes unmounted, fstab edited, volumes moun… wait. Mount backgrounding? No route to host?
A few seconds later we found out that we couldn’t reach any host. Call the network manager. Reconfigure the interface, ifdown, ifup, rinse, repeat. Half an hour later it was clear we were getting nowhere. The network manager was pretty sure there was a hardware issue with the network card on the machine. What a coincidence! A hardware failure right in the middle of a maintenance.

No problem though. A good DBA is always prepared. We took the standby machine, connected it to the new Netapp, and mount works!
Time to “startup” the database…
ORA-00202: controlfile: ‘/u01/oracle/oradata/PROD/control02.ctl’
ORA-27086: skgfglk: unable to lock file – already in use

What did I do to deserve it?

We shut down the original machine, to make extra sure it was not locking anything. We unmounted and mounted, we offlined the volume and onlined it again, we rebooted the machine, and we even reverted to a snapshot of the database taken just before the move. Nope, the file was still locked. Even though no process was locking it, Oracle refused to use the control file. When we changed the Oracle configuration to use another control file, Oracle claimed that the other file was also locked.

Finally, defeated, we connect the standby machine to the old Netapp, and finally the database opens.

Now I have to write a report about this and suggest ways to do the move better next time. What can I say? The same procedure worked successfully a million times in the past. The network card was clearly very bad luck, but what about the lock? It could be said that with better NFS expertise maybe we would have known how to solve it, but the experts at Netapp and Redhat don’t have any answers about what happened either. Is it just our fate to suffer failures from time to time that we can’t explain or prevent?