I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

Margin of Error November 19, 2008

Filed under: nerdism — prodlife @ 5:43 am

A few weeks ago, I was at a friendly dinner party, discussing the upcoming elections, and specifically the results of recent voter surveys. One of the participants in the discussion said “I never pay attention to the error margins, since they apply to both candidates”.

I think he meant that if a specific poll said that 52% of the sampled voters preferred Obama and 46% preferred McCain, and the poll has a margin of error of 3%, then perhaps the “real” numbers are 55% for Obama and 49% for McCain, or maybe 49% Obama and 43% McCain, but it doesn’t really matter since the difference between them is constant.

This is, of course, very false, for three important reasons:

  1. The margin of error is 3%, which means that a result of 49% for Obama and 49% for McCain cannot be ruled out. It is possible that Obama has no lead at all. It is important to understand that a 0% difference between the candidates is still consistent with the poll, just like the 6% difference the poll result actually shows. This poll alone gives no statistical way to rule out either scenario.
  2. A 3% margin of error actually means that there is a 95% chance that the “real” result is within 3% of the reported result (where “real” means what the results would be if the entire adult population had been polled with complete accuracy). Remember that around election time, many polls are published. About 5% of them have a bigger error than they report. How big? We have no idea.
  3. The reported margin of error is correct assuming that the sampling was perfect. Which means that no one refused to answer questions, no one lied, the questions were not worded or ordered in a way that caused bias, the selection of the sample was not biased, etc, etc. All these factors are likely to cause errors much larger than the theoretical sampling error, and what’s worse – we have no idea how big they can be and in which direction.
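
For the curious, here is a rough sketch of where a number like 3% comes from. It uses the textbook normal approximation for a simple random sample, and the sample size of 1067 is my own assumption (the poll in the story did not come with one). It also shows that the margin of error on the lead between two candidates from the same sample is roughly twice the margin on a single share.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% confidence level


def margin_of_error(p, n, z=Z_95):
    """Sampling margin of error for one share p from a simple random sample of n."""
    return z * math.sqrt(p * (1 - p) / n)


def lead_margin_of_error(p1, p2, n, z=Z_95):
    """Margin of error for the lead p1 - p2 when both shares come from the same sample."""
    # Variance of the difference of two multinomial proportions.
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return z * math.sqrt(variance)


if __name__ == "__main__":
    n = 1067  # assumed sample size, chosen because it gives roughly a 3% margin
    print(f"margin on one share: {margin_of_error(0.5, n):.3f}")
    print(f"margin on the lead:  {lead_margin_of_error(0.52, 0.46, n):.3f}")
```

Under these assumptions the 6 point lead carries a margin of nearly 6 points of its own, which is why a tie cannot be ruled out. None of this covers the non-sampling errors from the third point.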

If you are really interested in the subject and not afraid of some mathematical notation, Terence Tao has a much deeper analysis of the subject.

 

Unusual IO activity on shared clusterware home November 15, 2008

Filed under: Linux, hardcore, nerdism, netapp, nfs — prodlife @ 3:09 am

Sometimes problems exist in a system for years, but only become apparent when you prepare for a big change. This war story begins when our storage admin decided to replace our Netapp disks with new disks, twice as large. It is a cheap way to increase disk space and IO wait times.

While assessing the impact of this change, he found out that the volumes holding the shared oracle home for our RAC clusters were serving 6000 IO operations per second (IOPS). The data and redo volumes never exceeded 2000 IOPS, so 6000 is quite significant, especially on disks that should be practically idle.

Initial debugging showed that almost all the IO was neither read nor write, but things like “get attribute” and “access”. At this point I discovered that there is almost no way to get any information about IO activity on NFS mounts. I could not see which processes generated this activity, nor which files or directories it touched.

Time to get advice from the experts on Oracle-L. Vasu Balla of Pythian provided the solution:

“Oracle recommends using noac or
actimeo=0 options when mounting nfs for Datafiles, Voting Disk and OCR. Noac
means “no attribute cache” means none of the file attributes are cached in
the filesystem cache, which is very much needed for RAC. If you put your
shared oracle home also in that mountpoint which is mounted noac, every
access to a file in the oracle home requires a physical IO at the netapp. So
I recommend moving all software directories ( db oracle home, asm oracle
home and crs oracle home etc ) to a nfs mount which is not mounted with noac
or actimeo=0.”

What a wonderful explanation. I now understand the issue and know what to do to solve it. It took me about 3 minutes to test this solution on our staging environment, and it worked like a charm.

Unfortunately, both Netapp and Oracle insisted that shared oracle home on Netapp must be mounted with actimeo=0, and that if this is causing me trouble, I should move to local home instead of shared. Only after very long discussions with two experts from Oracle did I get an unofficial confirmation that the official documentation is probably wrong and that mounting oracle home with actimeo=0 is a bad idea.

To my surprise, my boss agreed to go ahead with the unofficial but working solution and change NFS mounts to remove “actimeo=0”.
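
Before a change like this it helps to know exactly which mounts carry the option. Below is a minimal sketch that scans fstab-style text for NFS mounts with attribute caching turned off. The filer name and the paths in the sample are made up for illustration, and the function name is mine. It reads the options as written in the file, so it will not catch options passed by hand on the mount command line.

```python
def uncached_nfs_mounts(fstab_text):
    """Return mount points of NFS entries that use noac or actimeo=0."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if not fs_type.startswith("nfs"):
            continue
        opts = set(options.split(","))
        if "noac" in opts or "actimeo=0" in opts:
            flagged.append(mount_point)
    return flagged


# Hypothetical entries: the filer name and paths are invented for this example.
SAMPLE_FSTAB = """
filer1:/vol/oradata  /u02/oradata     nfs   rw,bg,hard,tcp,actimeo=0  0 0
filer1:/vol/orahome  /u01/app/oracle  nfs   rw,bg,hard,tcp,actimeo=0  0 0
filer1:/vol/backup   /backup          nfs   rw,bg,hard,tcp            0 0
/dev/sda1            /                ext3  defaults                  1 1
"""

if __name__ == "__main__":
    for mount_point in uncached_nfs_mounts(SAMPLE_FSTAB):
        print(mount_point)
```

In this made-up sample the oracle home shows up next to the datafile volume, which is exactly the situation Vasu described: the datafiles need the option, the software directories do not.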

So, we schedule downtime on our production RACs, and we change the mount options, and… Nothing happens. At all. 6000 IOPS before and after the change. If I hadn’t been so shocked, I might have noticed my professional credibility taking a hit there.

Why didn’t it work on production? For weeks I had no answer. Until our network admin mentioned that I could use rpcdebug to get more insight into the issue. Turns out that NFS runs over RPC, and that Linux has flags for debugging RPC. By throwing magic numbers into /proc/sys/sunrpc/nfs_debug I could get NFS trace messages thrown into /var/log/messages. Now we are getting somewhere.

Except that it didn’t get me very far. I could see which devices NFS accesses, but I already knew that. I could see that our prod server had many many calls to “getattr”, while our staging system didn’t. To complete my tests I decided to turn off the attribute caching on staging again and compare the logs. Just to see what it looks like when both systems are in the same state.

A strange difference caught my eye: the staging systems had messages saying “NFS: Refresh_inode” which did not exist in production. Tiny difference, but maybe it has an impact? What does refresh inode mean? Time to go to lxr.linux.no and look at the Linux kernel code for clues. I just need to recall which version to look at.

When the lightbulb went off it nearly blinded me. The staging system runs Linux 2.4.27, while production is running 2.6.9. I was the one who pushed for the upgrade. I said “There are many NFS improvements in the new kernel versions.”

From here it was easy to find the change. In 2.4 the code for getting file attributes from the server looked like this:

 static inline int
 nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
 {
         if (time_before(jiffies, NFS_READTIME(inode)+NFS_ATTRTIMEO(inode)))
                 return NFS_STALE(inode) ? -ESTALE : 0;
         return __nfs_revalidate_inode(server, inode);
 }

Which basically means – get the new attributes if the cache has timed out.

In 2.6 the code changed and the following check was added:

         /* We may force a getattr if the user cares about atime */
         if (need_atime)
                 err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
         else
                 err = nfs_revalidate_inode(NFS_SERVER(inode), inode);

Which means that if the user cares about atime, the last time the file was accessed, we skip the cache time check and force a get attribute from the server. Another IO operation. Even if the cache did not time out.

Luckily, the fix is also easy. Just add “noatime” to the NFS mount, to let the kernel know that we don’t care about file access times, and therefore it can go back and use the cache.
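
To convince myself of the logic, here is a toy model of the two code paths in Python. It is not kernel code: the class and function names are invented, and I simplify need_atime to “the mount does not have noatime”, which is my assumption about how the flag is derived. It only counts how many attribute lookups would reach the server.

```python
from dataclasses import dataclass


@dataclass
class CachedInode:
    read_time: float     # when attributes were last fetched from the server
    attr_timeout: float  # how long cached attributes stay valid, in seconds


class ToyNfsClient:
    """Counts getattr calls that would go to the server (toy model only)."""

    def __init__(self, noatime):
        self.noatime = noatime
        self.getattr_calls = 0

    def _fetch_from_server(self, inode, now):
        self.getattr_calls += 1
        inode.read_time = now

    def revalidate(self, inode, now):
        # 2.4 style: ask the server only when the cached attributes timed out.
        if now < inode.read_time + inode.attr_timeout:
            return
        self._fetch_from_server(inode, now)

    def getattr(self, inode, now):
        # 2.6 style: caring about atime bypasses the cache check entirely.
        need_atime = not self.noatime  # simplifying assumption
        if need_atime:
            self._fetch_from_server(inode, now)
        else:
            self.revalidate(inode, now)


def count_getattrs(noatime, calls=100, interval=0.1, attr_timeout=60.0):
    """Simulate repeated attribute lookups on one file and count server trips."""
    client = ToyNfsClient(noatime)
    inode = CachedInode(read_time=0.0, attr_timeout=attr_timeout)
    for i in range(calls):
        client.getattr(inode, now=i * interval)
    return client.getattr_calls


if __name__ == "__main__":
    print("atime, cache on:   ", count_getattrs(noatime=False))
    print("noatime, cache on: ", count_getattrs(noatime=True))
    print("noatime, actimeo=0:", count_getattrs(noatime=True, attr_timeout=0.0))
```

In the model, every lookup goes to the server when atime matters, no matter how long the cache timeout is. With noatime the cache does its job, unless the timeout is zero, which is the actimeo=0 case we started with.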

So easy once you know what to look for!

 

Advert – NoCoug Fall Conference November 12, 2008

Filed under: advert, nocoug — prodlife @ 4:23 am

NoCoug, the Northern California Oracle Users Group, will hold its fall conference on Thursday, November 13. You can read the full details on our website.

If you are in the area, you really don’t want to miss this. Jonathan Lewis will give a keynote and a session about partitioning, Dan Tow will explain tuning for recent data and Jeremiah Wilton will give two sessions. In short, you’ll have trouble choosing which session to attend. Just the way a good conference should be.

I volunteered to be a track lead, so you can find me in the Tassajara room, fixing the projector, giving announcements and introducing speakers. Drop by and say hello :)