Adventures Installing 11g Clusterware

Let me start by announcing that 11g clusterware is easy to install. Really. Straightforward, simple, no issues at all. A crazy Dane can do it with both hands tied behind his back. The adventures I’ll describe below are 100% my own fault and have nothing to do with the quality of the product, which is excellent.

Before describing my adventures, I want to talk a bit about mountain biking. Mountain biking is a fine hobby, but riding on rocks and roots requires some skill. As a beginner, you find yourself riding very slowly and walking a lot around difficult sections. It is frustrating, but this way, you rarely crash at all. Experts, of course, ride very fast and rarely walk at all, and they still rarely crash. On the way from beginner to expert, there is a time when you gain some confidence in your skills, so you start riding faster. Unfortunately, this confidence often arrives before you actually have the skills you need to ride fast. The result is about 6 to 12 months of frequent crashes – until skills improve and confidence is reduced to the point that they match again.

I think I just hit this dangerous stage in DBAing. I now have some confidence in my understanding of how Oracle works, so I do not constantly refer to the docs. Which means that I make more mistakes than I did as a newbie.

Back to 11g clusterware:

Installation went fine. About 2 minutes of “next-next-next install” and 5 minutes of waiting for the install to finish. It is nearly identical to 10g installation, except that the automatic configuration of the VIP actually works, even if your public IP happened to be 192.168.X.X, so no need to run VIPCA manually after the install. Nice.

But then I discovered that I installed clusterware in the wrong directory. Not a big deal, but I dislike non-standard installations, and since it was so easy to install, I decided to take another 15 minutes to uninstall and install it again.

How do I uninstall?

I assumed that you uninstall clusterware just like you uninstall the database software – just run the installer UI, select the right product and click on uninstall. Why bother checking the docs when you can make convenient assumptions?

Click-click-click and the product should be uninstalled. There was some error message about files it could not remove. I decided to ignore it – the new installation will be in a different directory, and I can always remove extra files later.

When I tried to install it again, the installer complained that VIP is taken.

Strange. Didn’t I uninstall the clusterware? I ran crs_stat to check, and was somewhat worried that it actually worked, returning all resources with status “unknown”.

I decided that I needed to reboot the nodes. At least this should get rid of the VIP.

10 minutes later I found out that the nodes can’t stop rebooting. They start, and 30 seconds later they crash again. Those of you who have some experience with clusterware can already guess what was wrong. /etc/init.d/init.crs – the script that starts clusterware on boot was still there, attempting to start a partially uninstalled cluster, and failing. I did not even bother checking the logs, but I assume they’d show either that the VD is no longer there or that the interconnect is not configured, which leads each node to decide on a split brain and crash.
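A quick look for leftover boot hooks before rebooting would catch this kind of loop. Here is a minimal sketch, assuming the 10g/11.1-style Linux layout where clusterware keeps init.* scripts in /etc/init.d and respawn entries in /etc/inittab. Only init.crs is named above; the other file names and the find_crs_leftovers helper are assumptions, so adjust them for your platform.

```shell
#!/bin/sh
# List clusterware boot hooks that may survive a partial uninstall.
# find_crs_leftovers is a hypothetical helper. Its optional argument is a
# root prefix (default: the live system) so it can be tried on a copy of /etc.
find_crs_leftovers() {
    root="${1:-}"
    # init.crs is the script mentioned in the post; the other names are
    # assumed from a typical 10g/11.1 Linux install.
    for name in init.crs init.crsd init.cssd init.evmd; do
        if [ -e "$root/etc/init.d/$name" ]; then
            echo "leftover script: /etc/init.d/$name"
        fi
    done
    # Respawn entries in inittab would restart the daemons on every boot.
    if [ -f "$root/etc/inittab" ]; then
        grep -n -E 'init\.(crsd|cssd|evmd)' "$root/etc/inittab" |
            sed 's/^/leftover inittab entry: line /'
    fi
    return 0
}
```

Calling find_crs_leftovers with no arguments only reads files and prints what it finds; it deletes nothing. Any output after an "uninstall" is a hint that the nodes will still try to start the stack at boot.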

Over and over again. Thanks RedHat for interactive boot, which allowed me to stop this madness.

When the servers came back up, at least VIP was gone. So I decided to try another install. This time it ran all the way until the point it attempted to configure the notification services. This failed in a rather unhelpful fashion. The log error just said “configuration failed”. Thanks.

I decided to go for extreme cleanup, and simply delete every related file I could find on the servers – in /etc, $ORACLE_BASE, $CRS_HOME, VD, OCR. Everything I could think of.

Attempting to install again. Again Notification Services fail. At least I know enough not to ignore this error. My redeeming virtue, I guess.

When all else fails, read the docs. Which was not as easy as you would believe. I kind of fell out of practice with the documentation, and 11g did move things around a bit. I could not find the RAC installation guide. Looking under “Grid”, I found RAC administration guide and Clusterware administration guide. Both contained advice on how to remove a node from the cluster, but nothing about how to remove the entire cluster.

Searching for “clusterware uninstall” led to Overview of Deinstallation Process, which seemed promising. It contains this good advice: “Refer to Oracle Clusterware Installation Guide for your platform for Oracle Clusterware deinstallation procedures.”, but it did not link anywhere. I did find the installation guide, under “Installation” (duh), and it did contain uninstall instructions. I’m still a bit annoyed that searching for “uninstall clusterware” did not come up with this document.

Following the documentation turned out to be the best idea I’d had that day. It reminded me that I should run rootdelete.sh, then rootdeinstall.sh, and only then run ./runInstaller -deinstall -removeallfiles.
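Since the order matters, it can help to write the sequence down as a script and review it before running anything. Here is a minimal sketch that only prints the three steps. The script locations under the clusterware home (install/ and oui/bin/) and the print_deinstall_plan helper are assumptions, not something the steps above spell out. rootdelete.sh also takes per-node arguments that are left out here; check the installation guide for those.

```shell
#!/bin/sh
# Print (do not run) the clusterware deinstall steps in the documented order.
# print_deinstall_plan is a hypothetical helper; the directory layout under
# the clusterware home is an assumption, so verify it against your install.
print_deinstall_plan() {
    crs_home="${1:?usage: print_deinstall_plan CRS_HOME}"
    # Step 1: run as root.
    echo "$crs_home/install/rootdelete.sh"
    # Step 2: run as root, after step 1.
    echo "$crs_home/install/rootdeinstall.sh"
    # Step 3: only now let the installer remove the software itself.
    echo "cd $crs_home/oui/bin && ./runInstaller -deinstall -removeallfiles"
}
```

For example, print_deinstall_plan /u01/crs prints the three commands for a clusterware home in /u01/crs (a made-up path). Nothing is executed, so the plan can be checked against the installation guide first.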

Since I caused significant manual damage prior to following the documentation, I was not surprised by a long list of complaints that each of these scripts had for me.

But after following the uninstall documentation, I was finally able to install clusterware 11g, successfully, in the right directory.

All this was 5 hours after I decided on a small 15-minute fix. It was time to go home.

BTW. Now that I think of it, it is quite possible that in 10.2, it was impossible (or at least undocumented) to uninstall clusterware on Linux. I cannot find the instructions in 10.2 documentation at all (The OpenVMS docs do contain uninstall instructions). Our internal procedure was always just “reimage the servers”.

9 Comments on “Adventures Installing 11g Clusterware”

  1. chris_c says:

    It’s possible to deinstall/delete the clusterware on 10.2; Metalink note ID 239998.1 gives the details. Of course, searching for “deinstall clusterware” doesn’t find it, as it’s really a cleanup process for failed installs rather than a normal deinstall. If you have a clean image, it might be faster to just re-image and start from scratch.

  2. sq says:

    On our dev systems we do this all the time. We actually wrote a script to remove it off the system. We typically are removing all oracle software from a server when we need to uninstall clusterware.

    /etc/init.d/init.crs stop || /u01/crs/oracle/product/10.2.0/crs/bin/crsctl stop crs
    rm -rf /etc/ora*
    rm -rf /u01/app
    rm -rf /u01/crs
    rm -rf /etc/init.d/init.*

  3. prodlife says:

    @sq

    And you leave OCR/VD just as is?

    I’ll try it next time 🙂

  4. adeel says:

    Nice article but I could not understand well becoz I’ have just start to study RAC. Today I came to know you are a oracle ACE great. How did you get this position in such little time. Plz give me some guide line what should I do to become just like you successfully plz you must give me guide line. Miss I have download cluster ware for windows 64 bit but I think I have download a wrong software becoz at the time of Installtion it did not ask specify cluster. I have prepared two system for testing plz give me link from where I can download cluster ware software for window 2003 64bit 11g

  5. aachleon says:

    Thanks for sharing your mistakes! 🙂 I could see myself doing this…

  6. Paul says:

    You are being very humble by blaming yourself. That’s exactly the kind of customer attitude that bad product designers and lazy innovators the world over rely on!

    “…just run the installer UI, select the right product and click on uninstall. Why bother checking the docs when you can make convenient assumptions?”

    .. hmm, the smoking gun? I would like a rewrite:

    “…just run the installer UI, select the right product and click on uninstall. Why bother checking the docs when the installer is not going to make any assumptions that you have?”

    I wish..

  7. prodlife says:

    Hey Paul,

    You work with Oracle long enough and you don’t really expect the installer to protect you from mistakes or even guide you through the entire install procedure.

    Reading the installation document CAREFULLY before installing is still mandatory, especially with RAC. As an experienced DBA, I should have known that.

  8. Paul says:

    Hi prodlife. You are right of course. I’ve worked with and at Oracle long enough to unfortunately have to agree. All part of the plan-do-check that is critical when operating on production systems. It is a pity that still so much effort must go into checking fundamental low-level details … kind of goes against the other production system mantra of “simplify/automate!”

    I did say “..I wish” 😉

  9. Jaya says:

    The beauty of the attitude is in the words of explanation. True, de-installation wasn’t easy for us either. The problems we faced led us to google-search and go thro’ this narration. Thankyou.

