Cloning Oracle Home from RAC to Stand-Alone

This post originally appeared over at Pythian. There are also some very smart comments over there that you shouldn’t miss, go take a look!

This should have been the easiest task on my todo list: Install Oracle EE standalone on a new Linux RHEL 5 server, later to be used as a standby for a production RAC system. This means 2 lines of “runinstall -silent …”, less than 5 minutes of DBA work and maybe 20 minutes of waiting. I did not expect to spend over 5 hours doing this.

Problems started when I discovered that I don’t have the patchset and another patch that exists on production and should be installed on the standby. I had to wait for my Metalink credentials to be approved for this customer CSI before I could download the patches for them.

“Why don’t you just clone the software from production?” asked a helpful colleague.

Sounds like a great suggestion. I cloned Oracle software before and it is a simple process: tar $ORACLE_HOME, copy the tar file to the new server, untar, run the cloning script which will register the new home with the inventory, and you are done!

In theory, at least.

Here is what actually happened:

  1. Tar, copy, untar, script
  2. Ran OPatch to verify that the new oracle home is in the inventory and that I see the correct version and patches.
  3. OPatch is showing two nodes. Oops. I didn’t realize oracle home has information about the cluster – didn’t Oracle move the inventory elsewhere? Spend an hour looking for the cause of this.
  4. Found that the two nodes are mentioned in  $ORACLE_HOME/inventory/ContentsXML/oraclehomeproperties.xml
  5. Removed this file.
  6. Deattached Oracle Home to clean inventory without deleting the software.
  7. Ran the clone script again
  8. Yay! OPatch looks good now.
  9. Decided to create test database  to be extra sure everything is fine
  10. NETCA failed with linking error. Spent an hour figuring out why. Cursed a bit.
  11. Had to install libc-devel, 32 bit version. Too bad RDA didn’t catch this.
  12. Created test database, but SQLPLUS now fails with linking error.  More cursing. Wondered what I did to deserve this.
  13. was missing so I had to install the 64 bit version of libaio. Too bad RDA was silent about this as well.
  14. Couldn’t start the database because the database couldn’t find the cluster. Why was it even looking for a cluster? Spent an hour figuring out why. Ah, because I copied the software from a RAC server and it was linked as RAC database.
  15. Relinked everything with RAC_OFF option.
  16. Finally things are working. Too bad it is 8pm already.

What I should have done: (I’m not sure if it is supported by Oracle, but at least it works)

  1. Double check that we have all RPMs.
  2. Tar, copy, untar
  3. remove $ORACLE_HOME/inventory/ContentsXML/oraclehomeproperties.xml
  4. run clone/bin/ ORACLE_HOME=/appl/oracle/product/10.2.0/db_1 ORACLE_HOME_NAME=OraDb10g_home1
  5. Relink as non-RAC:  make -f rac_off
  6. Verify with OPatch.
  7. Create test DB:
    netca /silent /responsefile ~/netca.rsp
    dbca -silent -createDatabase -templateName General_Purpose.dbc -gdbName TST -sid TST -SysPassword xxx -SystemPassword xxxx -emConfiguration NONE -datafileDestination /appl/oracle/oradata  -storageType FS -characterSet WE8ISO8859P1 -nationalcharacterSet AL16UTF16 -memoryPercentage 40
  8. Go for a nice afternoon ride.

I hope that I’m not the only DBA who always have to find the most difficult way to accomplish a task, and that this post will be useful to others. Perhaps the best piece of advice I can offer is to avoid this type of cloning in the first place.

Lessons From OOW09 #1 – Shell Script Tips

During OpenWorld I went to a session about shell scripting. The speaker, Ray Smith, was excellent. Clear, got the pace right, educating and entertaining.

His presentation was based on the book “The Art of Unix Programming” by one Eric Raymond. He recommended reading it, and I may end up doing that.

The idea is that shell scripts should obey two important rules:

  1. Shell scripts must work
  2. Shell scripts must keep working (even when Oracle takes BDUMP away).

Hard to object to that 🙂

Here’s some of his advice on how to achieve these goals (He had many more tips, these are just the ones I found non-trivial and potentially useful. My comments in italics.)

  1. Document dead ends, the things you tried and did not work, so that the next person to maintain the code won’t try them again.
  2. Document the script purpose in the script header, as well as the input arguments
  3. Be kind – try to make the script easy to read. Use indentation. Its 2009, I’m horrified that “please indent” is still a relevant tip.
  4. Clean up temporary files you will use before trying to use them:

    function CleanUpFiles {
    [ $LOGFILE ] && rm -rf ${LOGFILE}
    [ $SPOOLFILE ] && rm -rf ${SPOOLFILE}
  5. Revisit old scripts. Even if they work. Technology changes. This one is very controversial – do we really need to keep chasing the latest technology?
  6. Be nice to the users by working with them – verify before taking actions and keep user informed of what the script is doing at any time. OPatch is a great example.
  7. Error messages should explain errors and advise how to fix them
  8. Same script can work interactively or in cron by using: if [ tty -s ] …
  9. When sending email notifying of success or failure, be complete. Say which host, which job, what happened, how to troubleshoot, when is the next run (or what is the schedule).
  10. Dialog/Zenity – tools that let you easily create cool dialog screens
  11. Never hardcode passwords, hostname, DB name, path. Use ORATAB, command line arguments or parameter files.I felt like clapping here. This is so obvious, yet we are now running a major project to modify all scripts to be like that.
  12. Be consistent – try to use same scripts whenever possible and limit editing permissions
  13. Use version control for your scripts. Getting our team to use version control was one of my major projects this year.
  14. Subversion has HTTP access, so the internal knowledge base can point at the scripts. Wish I knew that last year.
  15. Use deployment control tool like CFEngine. I should definitely check this one out.
  16. Use getopts for parameters. Getopts looked to complicated when I first checked it out, but I should give it another try.
  17. Create everything you need every time you need it. Don’t fail just because a directory does not exist. Output what you just did.
  18. You can have common data files with things like hosts list or DB lists that are collected automatically on regular basis and that you can then reference in your scripts.
  19. You can put comments and descriptions in ORATAB

Group By in Shell

Storage manager dropped by my cube to say that redo log archives for one of our DBs grew from 40G to 200G in the last day. Running “du -Sh” few times showed that the files are no longer getting written as fast, so the issue is probably over. But what was it? And to start, when was it? Once I figure out when the issue occure, I may be able to find the offending DML in ASH.

So, I need to do something like:

select day,hour,sum(bytes)
from (ls -l)
group by day,hour
order by day,hour

I guess I could have used the output of “ls -l”  as an external table and actually run this query. But I’m actually trying to improve my shell scripting skills. There must be some way to do it!

I asked on Twitter, and Kris Rice (of SQLDeveloper fame) pointed me in the right direction – use AWK.

Here’s the result:

ls -l | sed ‘s/:/ /g’ | awk ‘{sizes[$7$8] += int($5)} END {for (hour in sizes)
{ printf(“At %s %d bytes were inserted\n”,hour,sizes[hour])}}’ | sort

I run ls. Use sed to seperate the time into fields (Laurent Schneider will kill me, because I’m mixing sed and awk, but I’m not it his level yet…). After the split, the fifth field contains the file size, the seventh contains the date and the eighth is the hour. So I  have an associative array called “sizes”, indexed on the date and hour, and each bucket contains the sum of the file sizes in this hour.

Then I loop on the array and print the results, and sorted them to make it more readable.

A long post for a short script, but I want to make sure I don’t forget it 😉

Anyway, the best part by far was everyone who tried to help my desperate call in Twitter/Facebook. It was unexpected and very very nice. About 5 people sent tips, ideas and even solutions! I only imagine that doing my work is more fun than doing their own work, but I appreciate the help a lot.

Unusual IO activity on shared clusterware home

Sometimes problem exist in a system for years, but only become apparent when you prepare for a big change. This war story begins when our storage admin decided to replace our Netapp disks with new disks, twice as large. It is a cheap way to increase disk space and IO wait times.

While assessing the impact of this change, he found out that the volumes where we put shared oracle home for our RAC clusters have 6000 IO operations per second (IOPS). The data and redo volumes never exceeded 2000 IOPS, so 6000 is quite significant, especially on disks that should be practically idle.

First debug showed that almost all the IO was neither read nor write, but things like “get attribute” and “access”. At this point I discovered that there is almost no way to get any information about IO activity on NFS mounts. I could not see which processes do this activity, nor on which files or directories it was done.

Time to get advice from the experts on Oracle-L. Vasu Balla of Pythian provided the solution:

“Oracle recommends using noac or
actime=o options when mounting nfs for Datafiles, Voting Disk and OCR. Noac
means “no attribute cache” means none of the file attributes are cached in
the filesystem cache, which  is very much needed for RAC. If you put your
shared oracle home also in that mountpoint which is mounted noac, every
access to a file in the oracle home requires a physical IO at the netapp. So
I recommend moving all software directories ( db oracle home, asm oracle
home and crs oracle home etc ) to a nfs mount which is not mounted with noac
or actime=o.”

What a wonderful explanation. I now understand the issue and know what to do to solve it. I took me about 3 minutes to test this solution on our staging environment, and it worked like charm.

Unfortunately, both Netapp and Oracle insisted that shared oracle home on Netapp must be mounted with actimeo=0, and that if this is causing me trouble, I should move to local home instead of shared. Only after very long discussions with two experts from Oracle I got a non-official confirmation that the official documentation is probably wrong and that mounting oracle home with actimeo=0 is a bad idea.

To my surprise, my boss agreed to go ahead with the unofficial but working solution and change NFS mounts to remove “actimeo=0”.

So, we schedule downtime on our production RACs, and we change the mount options, and… Nothing happens. At all. 6000 IOPS before and after the change. If I wasn’t so shocked, I might have noticed my professional credibility taking a hit there.

Why didn’t it work on production? For weeks I had no answer. Until our network admin mentioned that I could use rpcdebug to get more insight about the issue. Turns out that NFS is RPC, and that Linux has flags for debugging RPC. By throwing magic numbers into /proc/sys/sunrpc/nfs_debug I could get NFS trace messages throwin into /var/log/messages. Now we are getting somewhere.

Except that it didn’t get me very far. I could see which devices NFS access, but I already knew that. I could see that our prod server had many many calls to “getattr”, while our staging system didn’t. To complete my tests I decided to turn off the attribute caching on staging again and compare the logs. Just to see what it looks like when both systems are in the same state.

Strange difference caught my eye: The staging systems had messages saying “NFS: Refresh_inode” which did not exist in production. Tiny difference, but maybe it has an impact? What does refresh inode mean? Time to go to and look at the Linux kernel code for clues. I just need to recall which version to look at.

When the lightbulb went off it nearly blinded me. Staging system has Linux 2.4.27, production is running 2.6.9. I was the one who pushed for the upgrade. I said “There are many NFS improvements in the new kernel versions.”

From here it was easy to find the change. In 2.4 the code for getting file attributes from the server looked like this:

 static inline int
 nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
         if (time_before(jiffies, NFS_READTIME(inode)+NFS_ATTRTIMEO(inode)))
                return NFS_STALE(inode) ? -ESTALE : 0;
         return __nfs_revalidate_inode(server, inode);

Which basically means – get the new attributes if the cache has timed out.

In 2.6 the code changed and the following check was added:

/* We may force a getattr if the user cares about atime */
       if (need_atime)
                err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
                err = nfs_revalidate_inode(NFS_SERVER(inode), inode);

Which means that if the user needs to know the last time the attribute changed, we skip the cache time check and force a get attribute from the server. Another IO operations. Even if the cache did not time out.

Luckily, the fix is also easy. Just add “noatime” to the nfs mount, to let the kernel know that we don’t care about the last time attributes changed, and therefore it can go back and use the cache.

So easy once you know what to look for!

Yet Another Nice Myth

A while ago I explained Nice, what it can do and what it can’t. From this point, whenever someone misunderstood nice, I just pointed at my post.

But yesterday I ran into a misunderstanding that was not covered in the first post:

Boss: Server X is at 100% cpu. Can you renice process 4096?
Me: Sure thing boss. *clickety-click* Done
Boss: Why is cpu still at 100%?

Renice will not lower CPU utilization because:
1. It does not kill processes. After renicing the process will have lower priority, but it will still need the same amount of CPU it did before, and so do all the other processes in the system. At any given moment, our nice process will either wait for CPU or use CPU. Either way, we are still at 100% CPU.
2. You can’t lower the priority to a point that the CPU will prefer to sit idle instead of running that process. Which is exactly what you ask for if you expect renice to affect CPU utilization.

The Linux OS will not let the CPU run on IDLE if someone wants it. It makes absolutely no sense to make processes wait while the CPU is not doing anything.

Boss: Sysadmin types always say that 100% CPU is not necessarily a bad thing, but we are DBAs and we know that if a machine is at 100% CPU it means trouble.

I hope we can all agree that what we want to prevent is processes having to wait for CPU. Ideally, every process that needs CPU will get it immediately. In this sense, 100% CPU is indeed a bad thing, because either processes are already waiting for CPU (look at load average!) or it can happen any second. But it is not because 100% cpu is the problem, it is because waiting processes are the problem.

Suppose you have a classic OLTP system – lots of tiny, fast, transactions. Usually the system runs at around 60% CPU on its single CPU server, but this morning someone ran a huge cpu-intensive report on this system. You know that this report runs for 60 minutes on an empty system. He begs you not to kill the process, but he doesn’t mind waiting. Now you have two options:

1. Let this process run as usual. Since it is a large report, it could be scheduled for more than its fair share of CPU. Taking maybe 80% of cpu time, leaving 20% for our tiny transactions. Process will finish in 75 minutes. On average each tiny transaction will take three times longer than usual. 75 of 100% cpu, bad performance and complaints.
2. Renice the process. Now it will get 40% of CPU time. Tiny transactions get 60% (which is all they need). Report takes 150 minutes, tiny transactions take as usual. 150 minutes of 100% cpu, no performance issues.

Obviously the second scenario is much better. Even if you spend twice as much time in 100% CPU. So maybe 100% cpu is not always evil, and renice is useful even when it doesn’t lower the cpu utilization.

Boss: But look at Top! Things look terrible! How do we know that everything is OK?

It is important to remember that viewing the CPU world through “top” tends to skew our perspective a bit. Top gives CPU utilization aggregated according to its refresh rate – 3 seconds by default. Now during these 3 seconds, 1.2 seconds were used by huge report (40%!) and 2.8 seconds used by 20 processes pooling all those tiny 2 ms transactions. Each process will show close to 0.05% CPU. It looks scary – big bad report taking so much CPU leaving almost nothing for other processes. But each time a transaction wanted to run it got 100% of the CPU. It just finished very very fast, and the big report stayed.

How do we REALLY know that everything is OK?
I trust Oracle on this and go to Active Session History tables (or EM report), find the transaction processing sessions, check out the wait events and make sure CPU is reasonable.

Boss: But what do we do about CPU?

So, 100% CPU is not good, but we can’t use renice to solve this. What can we do?
Killing the offending process is a crude but effective solution. Rumor says SQL Profiles that will allow you to alter the execution plan of the query, maybe making it easier on the CPU. I never tried it. Of course, the ultimate solution is to optimize the report to use less CPU, and in my experience it is easier to convince developers to optimize after you killed the report several times. In worst case, buying more CPU power is always an option.

Nice Processes Finish Last

I’ve ran into lots of myths and misconceptions about the Linux “nice” command. Lots of people seem to view it as a magic wand that can be waved at misbehaving processes – they are not sure what it really does but maybe it will make the process nicer?

Indeed there is so much confusion around it, that it is incredibly difficult to find good information about it. Especially since the behavior of nice changes between kernel releases and information that was great two years ago, may not be true today.

Fortunately, Linux is open source and one can always read the source code of the relevant kernel version to know exactly what Nice is doing today. I mean it – the relevant code is all in one file (kernel/sched.c) and is very readable.

Since I’ve already did some reading on the 2.4.31 kernel, I’ll summarize here the behavior of nice in that version and then shoot down specific myths. To explain what Nice does I’ll have to start by a short overview of how Linux Scheduler works. Should be entirely painless and it makes things much clearer later on.

The important part of the scheduler is the schedule() function. This runs every once in a while (I’m not getting into that, except to say that since the scheduler takes CPU, we definitely don’t want it to run too often!), stops the process that is taking CPU now and gives the CPU to a new process. The main question is – which process is going to go next. That’s where the scheduling algorithm comes in.

Scheduling algorithm in Linux works in “epochs” – when the epoch starts, each process in the run queue is allocated certain number of CPU ticks that it can use this epoch. When a process has used all its ticks it needs to give the CPU to another process. If it waits for IO, it doesn’t use any ticks and another process can run at that time, but once it finishes waiting it can use the remaining ticks during the same epoch. When all the processes use all their ticks, the epoch is over and everyone gets more ticks.

So, which process runs next? The algorithm is implemented in goodness() function, which returns a number for each process. The higher the goodness of the process, the more likely it is to run next. Goodness is highest for special real time processes, which always run first. Higher for the process currently using the CPU (because it has things on cache and you don’t want to mess up the cache), high for processes with many ticks left as opposed to those with only few ticks, and 0 for processes that are out of ticks.

Now we get to Nice. Nice impacts the number of ticks a process gets from the CPU. We assign ticks in batches that amount to 50ms, and therefore contain different amounts of ticks for different CPUs, but the basic formula in 2.4 is this:

#define NICE_TO_TICKS(nice)     (TICK_SCALE(20-(nice))+1

TICK_SCALE is where we match the number of ticks to the CPU speed, but as you can see there is an inverse relation between nice value and the number of ticks, where nice 20 gets you the least ticks and -20 gets you the most, and it is a linear relation – when you move from nice 10 to nice 11 it is the same as moving from -5 to -4. This is no longer true in 2.6 where the slope is different for positive and negative values.

To summarize – the nicer you are, the less CPU you get and the less often you will be scheduled. Nice processes will finish last.

What Nice does not do:

  • Many people (including Linux kernel experts!) believe that “nice 20” means your process will only run if no one else wants the CPU. Not exactly. Every process has to use up all their tick counts before the scheduler starts a new epoch (unless the process is sleeping), so while your nice process will probably run last at each epoch, other processes will still wait for it to finish before the new epoch starts and they get a new tick quota.
  • Nice will not have much effect on a low utilization system. Your CPU is at 0% now. You start a heavy process (maybe export) with nice 20, what will happen? It will take its usual 70% CPU. Because as long as there are no queues this nice process will be rescheduled again and again. What’s the rest of the CPU doing? running the scheduler, which is now working much harder since it has to reschedule this process again and again with a tiny quota each time.
  • Top shows 100% cpu utilization, and you see that 80% of it is taken by a single process. Should you nice it? Not always. Top will show IOWait time as part of the CPU time, but processes don’t use ticks while they wait for IO, so if the process is IO bound, it will show a very similar behavior regardless of nice. The behavior is to use very little CPU, wait long time for IO, and show on Top as a very busy process.
  • Export takes tons of CPU and I don’t want other processes to wait for it, I don’t mind if export takes a bit longer. Should I nice? Probably not. For one thing, export is IO bound, which makes it insensitive to nice anyway. What is more important – export doesn’t do most of the work. What is doing the work is Oracle’s server process that is opened to work with export, and you actually have to “renice” the oracle child process if you hope for anything at all to happen. What you can nice with pretty good effect it the “zip” process that you run after the export.
  • And if we mentioned the child processes, keep in mind that when a process forks the child process gets the same “nice” value as the parent, but from that point on you can renice each of them independently.

Recommended articles from Metalink

Another DBA team in the company hired a consultant to review and comment on their RAC installations, and they shared his report with us. Some of the recommendations were rather mysterious and involved changing kernel parameters I was not used to changing, so I decided to look around metalink to see if these parameters and their effects are documented somewhere.

I didn’t find what I was searching for. I have lots of trouble getting the metalink search engine to find anything useful for me. Oracle can probably get rid of many support calls by making a deal with google and using them for the metalink search. But while I was looking, I ran across two cool articles.

Note:265262.1 – Things to Know About Linux
A very cute document summarizing everything a DBA ever wanted to know about Linux, or almost. Links to installation guides, useful prerequisites and parameters, how to find out important information (like whether you are running 64 bit or 32), how to relink the kernel and even debugging tools.
A must read, must bookmark, print and hang besides the desk. This one is a keeper.

Note:169706.1 – Oracle Database on AIX,HP-UX,Linux,MacOSX,Solaris,Tru64 Operating Systems Installation and Configuration Requirements Quick Reference (8.0.5 to 10.2)
They are cheating with the name – the document also include 11g.
A useful and concise summary of the installation pre-reqs on unix platforms. Nice if you are doing many installations on different platforms and are tired of looking up a specific detail in a huge installation manual.