Streams on RAC

We’ve had a RAC system as a Streams source for 18 months now, but just today I configured a RAC system as a Streams target.

It was somewhat of an anticlimax, since there was absolutely nothing interesting to do.

We do downstream capture, so I had to place the archive logs on a shared drive. Every place where I used the SID before, I now used the DB name.

Capture and Apply processes both started on the same node. When I stopped that node, I saw the other node modifying a service called SYS${streams user}.{streams queue name}.{target db name}, changing it to run on the remaining node.

Then the capture and apply processes started on the remaining node and everything continued as usual.

As I said, no big deal. I just wanted to let everyone know that it is no big deal.


Modifying RAC Parameters

A few commands that proved useful today, and that I’ll probably want to refer to in the future:

  1. Change the CSS misscount (the heartbeat timeout that triggers node eviction) to 3 minutes: crsctl set css misscount 180
    This is not really recommended, due to the increased risk of split-brain issues.
  2. Changing VIP timeouts is a longer story…
    1. As oracle:
      srvctl stop instance -d <dbname> -i <instname>
      srvctl stop nodeapps -n <hostname>
    2. As root:
      crs_stat -p ora.<hostname>.vip > /tmp/ora.<hostname>.vip.cap
      crs_profile -update ora.<hostname>.vip -dir /tmp -o ci=120,st=120
      (ci is check_interval, st is script_timeout, both in seconds)
      crs_register ora.<hostname>.vip -dir /tmp -u
    3. Verify:
      crs_stat -p ora.<hostname>.vip | grep CHECK_INTERVAL
      crs_stat -p ora.<hostname>.vip | grep SCRIPT_TIMEOUT
    4. As oracle again:
      srvctl start nodeapps -n <hostname>
      srvctl start instance -d <dbname> -i <instname>
  3. When the machine is moved to a new segment:
    oifcfg delif -global eth0/<old subnet>
    oifcfg setif -global eth0/<new subnet>:<public|cluster_interconnect>
  4. When you have a new VIP:
    srvctl modify nodeapps -n <hostname> -A <new_vip>/<netmask>/<interface>
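The verification step is just grepping attribute=value pairs out of the resource profile. A minimal stand-alone sketch, using sample crs_stat -p style output (the resource name and values here are made up):

```shell
# Sample crs_stat -p style output; on a real cluster you would run
# crs_stat -p ora.<hostname>.vip instead. Values are hypothetical.
profile='NAME=ora.node1.vip
TYPE=application
CHECK_INTERVAL=120
SCRIPT_TIMEOUT=120'

# Pull out just the two timeout-related attributes.
vip_params=$(echo "$profile" | grep -E 'CHECK_INTERVAL|SCRIPT_TIMEOUT')
echo "$vip_params"
```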

I’m not one to insist on a friendly interface, and certainly not on a GUI, but I could appreciate a bit more consistency.

Troubleshooting Broken Clusterware

I spent most of the day figuring out why one node on a 2-node RAC crashed during the night.

Here are the steps I found useful when debugging cluster issues:

  1. Check DB alert log on all nodes
  2. Check clusterware logs on all nodes. They are found in $CRS_HOME/log. The useful ones are the alert log, the crsd log and the cssd log.
  3. Check write permissions to voting disk. From all nodes. As Oracle and as root.
  4. Check the network interfaces, both by looking at ifconfig on all nodes and by pinging every node from every other node using all its names and interfaces (public, private, VIP).
  5. Verify SSH the same way.
  6. Check that both nodes run the same OS version and same DB and clusterware versions (including patches).
  7. Stop and start the clusterware on each node separately, and then on both nodes together.
  8. Reboot both nodes.
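Step 3 (voting disk access) can be scripted. A sketch using a stand-in temp file so it is safe to run anywhere; on a real cluster you would point VOTEDISK at the actual voting disk and run it as both oracle and root on every node:

```shell
# Stand-in path; replace with the real voting disk on a cluster node.
VOTEDISK=${VOTEDISK:-/tmp/votedisk_test}
touch "$VOTEDISK"

# Read the first block and check write permission (without writing anything).
if dd if="$VOTEDISK" of=/dev/null bs=512 count=1 2>/dev/null && [ -w "$VOTEDISK" ]; then
  result="voting disk readable and writable"
else
  result="voting disk access problem"
fi
echo "$result"
```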

In my case, the interconnect disappeared during the night. Simply no ping on the internal interface. Maybe someone stole our network cards?

The symptoms were:

  1. After each restart, the first node up worked fine, while the second one failed to join the cluster.
  2. In crsd log, it looked like this:
    2008-07-09 16:11:17.539: [ CSSCLNT][2541575744]clsssInitNative: connect failed, rc 9
    2008-07-09 16:11:17.539: [  CRSRTI][2541575744]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
  3. In ocssd log:
    [    CSSD]2008-07-09 19:43:41.769 [1220598112] >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
    [    CSSD]2008-07-09 19:43:42.753 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14465) LATS(11886024) Disk lastSeqNo(14465)
    [    CSSD]2008-07-09 19:43:43.755 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14466) LATS(11887024) Disk lastSeqNo(14466)
    [    CSSD]2008-07-09 19:43:44.758 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14467) LATS(11888034) Disk lastSeqNo(14467)
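Those heartbeat messages are easy to pick out and count with grep. A sketch that feeds lines from this incident through a heredoc instead of reading $CRS_HOME/log/<hostname>/cssd/ocssd.log:

```shell
# Count how many times CSS reported the peer node down via the disk heartbeat.
downs=$(grep -c 'clssnmReadDskHeartbeat: node(1) is down' <<'EOF'
[    CSSD]2008-07-09 19:43:42.753 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14465) LATS(11886024) Disk lastSeqNo(14465)
[    CSSD]2008-07-09 19:43:43.755 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14466) LATS(11887024) Disk lastSeqNo(14466)
EOF
)
echo "node-down heartbeat messages: $downs"
```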

Just in case you run into the same issue 🙂

ORACLE_HOME – to share or not to share

When setting up a RAC system, one of the questions that tend to come up is whether to have one ORACLE_HOME per node on local disk, or to have one ORACLE_HOME for both instances on the shared storage. These approaches are sometimes called “private home” vs. “shared home”.

Oracle has an amazing white paper on the topic, which I’ve been reading for the last two days.

We have both types of systems here (because we have two teams of DBAs with somewhat conflicting procedures), and I’ve worked on both. As with many questions, there are pros and cons to each approach, and you decide based on your priorities. The paper (highly recommended!) covers almost all the pros and cons in a lot of detail. However, I still want to give a short summary here, based on my experience.

Why use private home?

  1. Easier rolling upgrades. If you need to patch your DB, you can patch first one node and then the other, allowing for zero-downtime patches. You can have rolling upgrades on a shared home as well, but it is a longer procedure and is not supported automatically by OPatch.
  2. You can’t lose your entire cluster to a careless delete. Never underestimate the impact of human error.
  3. With a shared home, mistakes made in ORACLE_HOME impact the entire cluster. With private homes, mistakes impact just one node.
  4. Add node / delete node procedures are somewhat simpler (but take longer) with private homes. Especially when deleting a node, you don’t run the risk of deleting the shared Oracle home by mistake. Very scary!
  5. Oracle demands that at least the oraInventory be local.
  6. Oracle recommends local home. Sometimes, that’s a good enough reason.

Why use shared home?

  1. Quicker installs and upgrades, because there is no need to copy all the files twice over the interconnect.
  2. If your shared storage is advanced enough to support snapshots, you can take a snapshot of ORACLE_HOME before applying a patch and simply restore the snapshot if the patching goes wrong.
  3. You can really easily migrate nodes or entire DBs from server to server that way.
  4. No version compatibility issues between different nodes in the cluster. You know that all your nodes are always running exactly the same version, patches, etc.
  5. You don’t have to ssh from server to server while trying to track issues in alert logs and dump files.
  6. Shared storage tends to be more stable, with checksums, striping, mirroring and other nice features. With ORACLE_HOME on this storage, you are less likely to lose a node to media failure.

Rethinking RAC

I’ve been working with RAC since my first day as a DBA. My first task was to install a RAC server (it took well over a week), and since then I’ve installed dozens of RAC servers, more than anyone I know, and I spend 90% of my time maintaining them.

I’ve had lots and lots of trouble with RAC, but at the end of the day I love RAC, with all of its marvelous complexity. I love the ability to do rolling maintenance quite easily and without any downtime for our customers, and I love the technology and the amazing ideas behind it.

Which is why everyone was very surprised when, in a large DBA and management meeting, I suggested replacing RAC with Data Guard for HA. The reason for the meeting was cost-cutting. The objective was clear: reduce the cost per customer. We do (mostly) shared hosting, so to reduce the cost per customer we need to either reduce the cost of a single system or put more customers on one system. Both options are viable, and I’ll also write a post about how to max out an existing system, but it seemed to me that replacing RAC with other HA alternatives would be a very immediate way of cutting costs without significantly lowering availability.

Mogens Nørgaard wrote an amazing article a while back: “You Probably Don’t Need RAC”. I’ve spent the days since the last meeting reading it again and again, trying to prepare a rock-solid case that our system can host more customers, with more availability and lower costs, without RAC. I’ve also found a good post by the storage guy on the same subject; he is feeling the pain with 11g RAC. The real irony is that at the same time I’m claiming we don’t need RAC at all, I’m still proceeding with our 11g clusterware tests, because a DBA should always be prepared.

A year later Mogens wrote another good article, this time about how difficult it is to get his message accepted; he also mentions that the discussion is very emotional and non-technical. I hope my experience won’t be as bad as his, but I think it may be worse. On top of the usual difficulties of getting people to participate in a serious technical discussion (it is a lot of work to prepare a serious technical case, and much easier to resort to rhetoric), the entire team that made the RAC decision three years ago is still around, and saying “we made a wrong decision and stuck with it for three years” is very difficult at the best of times. Then there is Oracle sales, who will get involved sooner or later; the difference between our RAC and non-RAC cost is very high (especially since we are talking about many servers), and I can’t see Oracle accepting the sudden loss of revenue without a fight.

Maybe Oracle is fighting back already? A while back Kevin Closson wrote a number of articles about RAC: its performance, high availability, maintenance, etc. I remember they were very good, but I made the mistake of not saving a copy, assuming they would always be there. Unfortunately, many good articles are no longer available.

I’m doing Log Buffer again this Friday! Don’t forget to visit for the hottest DB blog posts of the week.

Common mistakes in RAC installation

This was supposed to be my OpenWorld Unconference session, which I didn’t give partially due to shyness and partially because I preferred to spend my time listening and learning.

I’m probably the world’s expert on failed RAC installations. I started my career as a DBA by spending four days with a consultant failing to install RAC in our test environment. In the three years that have passed since that fateful week, I’ve probably failed at installing RAC over fifty times (I’ve succeeded quite a few times too), so I’m well qualified to tell everyone how to fail installing RAC.

So, how do you completely screw up your RAC installation?

  1. Don’t use the installation guide. That’s a common mistake made by both beginners and experts. If you don’t follow your RAC installation guide closely, your RAC installation will fail. The installation is simply too complicated to do from memory or by hunches. That is the most important thing to remember; the rest of this post just lists common consequences of not following the installation guide. Also remember to match the version of the installation guide to the version of RAC you are actually installing, because some things change over time.
  2. Your nodes don’t see each other. Huge mistake. Your nodes should be able to connect to each other by name, IP and fully qualified domain name, through both the public and interconnect IPs. Verify with pings. Also make sure your host name is spelled the same everywhere – some parts of the installation are case sensitive.
  3. Don’t verify that all your RPMs are installed before beginning the installation. Unfortunately, this is a very easy mistake to make, because the RPM list in the installation guide is somewhat incomplete. There are Metalink articles that attempt to correct the mistakes, so look for them. Keep in mind that, at least in 10g, the prerequisite check didn’t cover all the required RPMs, so if you mess up this step you will end up with a rather random error during the installation.
  4. Ask your network manager to configure the VIP in Linux before you install your clusterware. Don’t. Just ask him for an IP – Oracle has the VIPCA utility, which will configure and manage the VIP for you. If Linux already controls the VIP, the RAC installation will fail.
  5. Configure SSH incorrectly. SSH configuration is a somewhat tricky part. Remember that your nodes should be able to ssh to each other as user oracle without ssh asking for a password or printing anything extra; ssh remotenode date should just give the date.
  6. Different times on different nodes. All nodes should show the exact same date and time.
  7. Bad permissions on shared storage. Verify that root on all nodes has write access to the voting disk.
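Mistake 6 (clock drift) is easy to check with a little arithmetic. A sketch where the two epoch timestamps stand in for the output of ssh <node> date +%s on each node (the example values and the 5-second threshold are my own choices):

```shell
# Compare two epoch timestamps and flag drift above a threshold (seconds).
check_drift() {
  local t1=$1 t2=$2 max=${3:-5}
  local diff=$(( t1 > t2 ? t1 - t2 : t2 - t1 ))
  if [ "$diff" -le "$max" ]; then
    echo "clocks OK (drift ${diff}s)"
  else
    echo "clock drift ${diff}s exceeds ${max}s"
  fi
}

# In real life: check_drift "$(ssh node1 date +%s)" "$(ssh node2 date +%s)"
check_drift 1215600000 1215600002
check_drift 1215600000 1215600300
```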

That’s what I recall right now. I’m sure there are lots more.

Sniffing the network for fun and profit

Every DBA should know something about the network (and the OS, and storage, and application development – good DBAs are multi-disciplinary). Probably everyone knows how to debug connectivity issues with ping and traceroute, how to diagnose problems with the DNS, and how clients connect to the listener.

DBAs with RAC need to know much more about their network – they have virtual IPs, an interconnect, and usually a storage network too. RAC systems can crash due to network latency problems, and the DBA may need to diagnose this. RAC also has the whole load-balancing and failover mechanism, where network addresses can move around and you need the LOCAL_LISTENER and REMOTE_LISTENER parameters configured.

However, it is very rare that things get as complicated as they did this week:
On Thursday we configured a server with LOCAL_LISTENER and REMOTE_LISTENER values.
Almost immediately a customer started experiencing occasional connection failures. Naturally, we (the DBAs) didn’t hear about it until it was escalated to an emergency call on Saturday afternoon. I had a date on Saturday night and didn’t want to be late due to a long debugging session, so I rolled back the change, asked the customer to send me his tnsnames.ora, hosts file and screenshots of the failure, and told them I’d look into the issue on Monday.

Monday morning arrived sooner than I would have preferred. The tnsnames.ora that was supposed to connect to our database actually contained an address I did not recognize. A quick meeting with the local network manager revealed that these guys have a VPN, they connect through a NAT, and they also have a special firewall configuration. Remember I said that every DBA should know networks? Well, I didn’t mean NATs and VPNs. So I don’t know enough about networks, and the network manager doesn’t understand listeners and RAC, but we had to find a solution together.
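A quick way to see which addresses a tnsnames.ora really points at is to grep out the HOST entries. A sketch with a made-up entry in place of the real file:

```shell
# Extract every HOST= address from a tnsnames.ora-style entry.
# The entry below is hypothetical; in practice you would run the same
# grep/sed pipeline against the customer's actual tnsnames.ora.
hosts=$(grep -o 'HOST *= *[^)]*' <<'EOF' | sed 's/HOST *= *//'
MYDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 10.1.2.3)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = mydb))
  )
EOF
)
echo "$hosts"
```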

It appeared that after I configured the LOCAL_LISTENER values, when the listener attempted to redirect their connection to another server, it sent them an address (or maybe an IP?) that their client couldn’t connect to, and the connection therefore failed. But why did everything work before we configured LOCAL_LISTENER? According to the documentation we would still send addresses the client can’t connect to, just the local addresses instead of the VIPs. The network administrator had a theory that maybe the NAT translated the local address to something the client understands as it was sent back, but this seems really far-fetched.

This is where the sniffer comes into the picture. When you have a complex setup, you have to know exactly what the behavior is: who initiates the connection, what the reply is, where the redirects go, and who closes the connection and how. The sniffer will listen to the network and give you the complete picture.

I use Ethereal, which is graphical and relatively clear and friendly, but still very powerful.

In this case, Ethereal was especially useful – using the customer’s network capture, we could easily see that we had been debugging the wrong issue all along. The tnsnames.ora file he sent us belonged to a different machine, one that didn’t experience any problem. The machine that experienced the issue connected to a different IP, which no one really knew about until now. We are still not sure how it is related.
What we do know is that if you have a complicated network configuration that has changed several times, in very inconsistent ways, and that no one documented – a sniffer is your only friend.

RAC tricks – rolling patch with a shared home

We had to apply yet another opatch, but we are only allowed 4 hours of downtime per system per month and we had already used our monthly budget, so we needed a way to apply the patch without any downtime.

On some of our systems, this is not an issue. They are RAC systems where each node has its own $ORACLE_HOME on its own server. We take one node down, apply the patch, and start the node; then we stop the other node, apply the patch, and start it. Patch installed on both nodes, no downtime for our customers. Win-win.

But what do we do about our other systems? The ones which share a single $ORACLE_HOME on a filer? Where we need to take both nodes down for applying the patch?

A co-worker came up with a brilliant idea:

Stop one node. Use the filer’s power to duplicate $ORACLE_HOME. Connect the node to the new home – you just make the change in /etc/fstab, and the database will never notice the difference.
Apply the patch in the new home. Start the database in the new home. Now stop the second node and connect it to the new home as well. Start that node in the new home. We have a patched DB with no downtime on a shared-home system! We even have a built-in rollback – connect one node after the other back to the old home, where we didn’t apply the patch. In my experience, opatch rollbacks don’t always work, so having a sure rollback plan is a great bonus.
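The fstab switch is the only change each node sees. A sketch of the idea (filer, volume and mount-point names are all made up): the mount point, and therefore $ORACLE_HOME, stays the same, while the source volume changes to the patched clone.

```shell
# Original fstab line mounting the shared home (hypothetical names).
old='filer1:/vol/orahome  /u01/app/oracle  nfs  rw,hard,nointr  0 0'

# Point the same mount point at the cloned, patched volume.
new=$(echo "$old" | sed 's#/vol/orahome#/vol/orahome_patched#')
echo "$new"
```

To roll back, you would edit the line back to the old volume and remount; the unpatched home is still sitting there untouched.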

We tested it today in a staging environment and it seems to work well. Now we just need to convince management that we should do it in production. It looks like a great solution, but in my experience management hates approving any plan that does not appear in the Oracle manuals. For all their talk of innovation and “thinking outside the box”, they are a very conservative bunch. I can understand the extreme risk aversion of IT management, but if you never do anything new, you can never improve, and that’s also risky.