Troubleshooting Broken Clusterware

I spent most of the day figuring out why one node on a 2-node RAC crashed during the night.

Here are the steps I found useful when debugging cluster issues:

  1. Check the DB alert log on all nodes.
  2. Check the clusterware logs on all nodes. They are found in $CRS_HOME/log. The most useful ones are the clusterware alert log, the crsd log and the cssd log.
  3. Check write permissions to the voting disk, from all nodes, both as oracle and as root.
  4. Check the network interfaces, both by looking at ifconfig on all nodes and by pinging every node from every other node using all its names and interfaces (public, private, VIP).
  5. Verify SSH the same way.
  6. Check that both nodes run the same OS version and same DB and clusterware versions (including patches).
  7. Stop and start the clusterware on each node separately, and then on both nodes together (see the command sketch after this list).
  8. Reboot both nodes.
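
For reference, here is a minimal sketch of the commands behind steps 2, 3, 4 and 7, assuming Linux and a 10.2-style clusterware layout; the node names (rac2), device path (/dev/raw/raw1) and interface names are placeholders that will differ on your system:

    # Step 2: clusterware logs (10.2-style layout under $CRS_HOME/log/<hostname>)
    less $CRS_HOME/log/`hostname -s`/alert`hostname -s`.log
    less $CRS_HOME/log/`hostname -s`/crsd/crsd.log
    less $CRS_HOME/log/`hostname -s`/cssd/ocssd.log

    # Step 3: locate the voting disk(s) and sanity-check access from every node,
    # as oracle and as root (read-only checks; never dd onto a voting disk)
    crsctl query css votedisk
    ls -l /dev/raw/raw1                              # placeholder voting device
    dd if=/dev/raw/raw1 of=/dev/null bs=4k count=1   # read test only

    # Step 4: interfaces and connectivity, from every node to every other node
    /sbin/ifconfig -a
    ping -c 3 rac2        # public name
    ping -c 3 rac2-priv   # private/interconnect name
    ping -c 3 rac2-vip    # VIP

    # Step 7: stop and start the clusterware on one node (as root)
    crsctl stop crs
    crsctl start crs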

In my case, the interconnect disappeared during the night. Simply no ping on the internal interface. Maybe someone stole our network cards?

The symptoms were:

  1. After each restart, the first node up worked fine, while the second one failed to join the cluster.
  2. In the crsd log, it looked like this:
    2008-07-09 16:11:17.539: [ CSSCLNT][2541575744]clsssInitNative: connect failed, rc 9
    2008-07-09 16:11:17.539: [  CRSRTI][2541575744]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
  3. In the ocssd log:
    [    CSSD]2008-07-09 19:43:41.769 [1220598112] >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
    [    CSSD]2008-07-09 19:43:42.753 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14465) LATS(11886024) Disk lastSeqNo(14465)
    [    CSSD]2008-07-09 19:43:43.755 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14466) LATS(11887024) Disk lastSeqNo(14466)
    [    CSSD]2008-07-09 19:43:44.758 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14467) LATS(11888034) Disk lastSeqNo(14467)

Just in case you run into the same issue 🙂


5 Comments on “Troubleshooting Broken Clusterware”

  1. Freek says:

    Chen,

    When you have a failing interconnect, you should see missed-ping messages in the cssd.log file (on both nodes). After a certain number of lost pings (depending on the platform), each node will attempt to take a lock on the voting disk(s) to decide which node will have to reboot.

    You could monitor the cssd.log file to check if you have regularly missed pings (without reaching the threshold), which would indicate an unstable interconnect.

    regards

    Freek

  2. prodlife says:

    Freek,
    Thanks for the advice. Monitoring cssd.log for lost pings is a great idea (a quick sketch of such a check is below).

    In this specific case, it turns out that when the interconnect is completely missing after the machines are restarted, but both nodes still have access to the voting disks, the errors are a bit different.
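
    A minimal sketch of what that check could look like on Linux, assuming a 10.2-style log location (the "heartbeat fatal" warnings are what cssd writes when network heartbeats start going missing):

    # count and show the most recent network-heartbeat warnings in ocssd.log;
    # a steady trickle of these without an eviction points at a flaky interconnect
    grep -c "heartbeat fatal" $CRS_HOME/log/`hostname -s`/cssd/ocssd.log
    grep "heartbeat fatal" $CRS_HOME/log/`hostname -s`/cssd/ocssd.log | tail -20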

  3. Freek says:

    “interconnect is completely missing”
    euh… have you looked under the table? 😉

    Normally the initial errors (before the restart) should be the same:

    [ CSSD]2007-03-05 12:19:50.538 [14] >TRACE: clssgmClientConnectMsg: Connect from con(100bfd420) proc(100bfc260) pid() proto(10:2:1:1)
    [ CSSD]2007-03-05 12:20:32.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 50% heartbeat fatal, eviction in 14.215 seconds
    [ CSSD]2007-03-05 12:20:39.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 75% heartbeat fatal, eviction in 7.215 seconds
    [ CSSD]2007-03-05 12:20:40.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 75% heartbeat fatal, eviction in 6.215 seconds
    [ CSSD]2007-03-05 12:20:43.840 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:43.840 [18] >TRACE: clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)
    [ CSSD]2007-03-05 12:20:44.840 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:44.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 90% heartbeat fatal, eviction in 2.215 seconds
    [ CSSD]2007-03-05 12:20:45.840 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:45.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 90% heartbeat fatal, eviction in 1.215 seconds
    [ CSSD]2007-03-05 12:20:46.840 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:46.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 90% heartbeat fatal, eviction in 0.215 seconds
    [ CSSD]2007-03-05 12:20:47.060 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:47.060 [18] >TRACE: clssnmPollingThread: Eviction started for node sxsolb1 (2), flags 0x000d, state 3, wt4c 0

  4. OBJESALIESS says:

    Thanks for the post


