Adventures Installing 11g Clusterware
Posted: March 31, 2009
Let me start by announcing that 11g clusterware is easy to install. Really. Straightforward, simple, no issues at all. A crazy Dane can do it with both hands tied behind his back. The adventures I’ll describe below are 100% my own fault and have nothing to do with the quality of the product, which is excellent.
Before describing my adventures, I want to talk a bit about mountain biking. Mountain biking is a fine hobby, but riding on rocks and roots requires some skill. As a beginner, you find yourself riding very slowly and walking a lot around difficult sections. It is frustrating, but this way, you rarely crash at all. Experts, of course, ride very fast and rarely walk at all, and they still rarely crash. On the way from beginner to expert, there is a time when you gain some confidence in your skills, so you start riding faster. Unfortunately, this confidence often arrives before you actually have the skills you need to ride fast. The result is about 6 to 12 months of frequent crashes – until skills improve and confidence is reduced to the point that they match again.
I think I just hit this dangerous stage in DBAing. I now have some confidence in my understanding of how Oracle works, so I do not constantly refer to the docs. Which means that I make more mistakes than I did as a newbie.
Back to 11g clusterware:
Installation went fine. About 2 minutes of “next-next-next install” and 5 minutes of waiting for the install to finish. It is nearly identical to 10g installation, except that the automatic configuration of the VIP actually works, even if your public IP happened to be 192.168.X.X, so no need to run VIPCA manually after the install. Nice.
But then I discovered that I installed clusterware in the wrong directory. Not a big deal, but I dislike non-standard installations, and since it was so easy to install, I decided to take another 15 minutes to uninstall and install it again.
How do I uninstall?
I assumed that you uninstall clusterware just like you uninstall the database software – just run the installer UI, select the right product and click on uninstall. Why bother checking the docs when you can make convenient assumptions?
Click-click-click and the product should be uninstalled. There was some error message about files it could not remove. I decided to ignore it – the new installation will be in a different directory, and I can always remove extra files later.
When I tried to install it again, the installer complained that VIP is taken.
Strange. Didn’t I uninstall the clusterware? I ran crs_stat to check, and was somewhat worried that it actually worked, returning all resources with status “unknown”.
I decided that I needed to reboot the nodes. At least this should get rid of the VIP.
10 minutes later I found out that the nodes can’t stop rebooting. They start, and 30 seconds later they crash again. Those of you who have some experience with clusterware can already guess what was wrong. /etc/init.d/init.crs, the script that starts clusterware on boot, was still there, attempting to start a partially uninstalled cluster, and failing. I did not even bother checking the logs, but I assume they’d show either that the voting disk (VD) is no longer there or that the interconnect is not configured, which leads each node to decide on a split brain and crash.
Over and over again. Thanks RedHat for interactive boot, which allowed me to stop this madness.
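In hindsight, a quick check before rebooting would have caught this. Here is a minimal sketch, assuming the only thing we look for is the /etc/init.d/init.crs script mentioned above. The function name and the optional root-prefix argument are made up for illustration (the prefix lets you try it against a scratch directory), and a real cleanup likely leaves other traces that this does not look for.

```shell
#!/bin/sh
# Sketch: report whether the clusterware boot script is still in place.
# check_leftover_init is a hypothetical helper, not an Oracle tool.
check_leftover_init() {
    root="${1:-}"                        # optional prefix; empty means the real /
    script="$root/etc/init.d/init.crs"   # the path named in the post
    if [ -e "$script" ]; then
        echo "leftover: $script"
        return 1
    fi
    echo "clean: $script not present"
    return 0
}
```

Calling `check_leftover_init` with no arguments checks the live system; a non-zero return means the script is still there and the node will try to start clusterware on the next boot.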
When the servers came back up, at least VIP was gone. So I decided to try another install. This time it ran all the way until the point it attempted to configure the notification services. This failed in a rather unhelpful fashion. The log error just said “configuration failed”. Thanks.
I decided to go for extreme cleanup, and simply delete every related file I could find on the servers – in /etc, $ORACLE_BASE, $CRS_HOME, VD, OCR. Everything I could think of.
Attempting to install again. Again Notification Services fail. At least I know enough not to ignore this error. My redeeming virtue, I guess.
When all else fails, read the docs. Which was not as easy as you would believe. I kind of fell out of practice with the documentation, and 11g did move things around a bit. I could not find the RAC installation guide. Looking under “Grid”, I found RAC administration guide and Clusterware administration guide. Both contained advice on how to remove a node from the cluster, but nothing about how to remove the entire cluster.
Searching for “clusterware uninstall” led to Overview of Deinstallation Process, which seemed promising. It contains this good advice: “Refer to Oracle Clusterware Installation Guide for your platform for Oracle Clusterware deinstallation procedures.”, but it did not link anywhere. I did find the installation guide, under “Installation” (duh), and it did contain uninstall instructions. I’m still a bit annoyed that searching for “uninstall clusterware” did not come up with this document.
Following the documentation turned out to be the best idea I’d had that day. It reminded me that I should run rootdelete.sh, then rootdeinstall.sh, and only then run ./runInstaller -deinstall -removeallfiles.
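Since the order is the whole point, here is a small dry-run sketch of that sequence. It only prints the commands unless DRY_RUN=0 is set. The CRS_HOME placeholder, the assumption that both root scripts live under $CRS_HOME/install, and the run_step and deinstall_plan helper names are mine, not taken from the guide, so check the installation guide for your platform before running anything for real.

```shell
#!/bin/sh
# Dry-run sketch of the deinstall order: rootdelete.sh, rootdeinstall.sh,
# and only then the installer.
# ASSUMPTIONS: the CRS_HOME value is a placeholder, and the location
# $CRS_HOME/install for the two root scripts is not stated in the post.
CRS_HOME="${CRS_HOME:-/path/to/crs_home}"   # hypothetical placeholder
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print only, 0 = execute

run_step() {
    # Always print the command; execute it only when DRY_RUN=0,
    # and stop at the first failure so later steps never run out of order.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || { echo "step failed: $*" >&2; exit 1; }
    fi
}

deinstall_plan() {
    run_step "$CRS_HOME/install/rootdelete.sh"
    run_step "$CRS_HOME/install/rootdeinstall.sh"
    run_step ./runInstaller -deinstall -removeallfiles
}

deinstall_plan
```

Leaving dry-run on by default is deliberate: after the day described here, printing first and executing second seems like the safer habit.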
Since I caused significant manual damage prior to following the documentation, I was not surprised by a long list of complaints that each of these scripts had for me.
But after following the uninstall documentation, I was finally able to install clusterware 11g, successfully, in the right directory.
All this, 5 hours after I decided on a small 15-minute solution. It was time to go home.
BTW, now that I think of it, it is quite possible that in 10.2 it was impossible (or at least undocumented) to uninstall clusterware on Linux. I cannot find the instructions in the 10.2 documentation at all (the OpenVMS docs do contain uninstall instructions). Our internal procedure was always just “reimage the servers”.