Troubleshooting Broken Clusterware

Posted: July 10, 2008
I spent most of the day figuring out why one node on a 2-node RAC crashed during the night.
Here are the steps I found useful when debugging cluster issues:
- Check DB alert log on all nodes
- Check the clusterware logs on all nodes. They are found in $CRS_HOME/log. The most useful ones are the alert log, the crsd log and the cssd log.
- Check write permissions to the voting disk, from all nodes, both as oracle and as root.
- Check the network interfaces: look at ifconfig on all nodes, and ping every node from every other node using all of its names and interfaces (public, private, VIP).
- Verify SSH connectivity the same way.
- Check that both nodes run the same OS version and same DB and clusterware versions (including patches).
- Stop and start clusterware on each node separately, and then on both nodes together.
- Reboot both nodes.
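The connectivity part of this checklist is easy to script. Here is a minimal sketch; the hostnames (node1, node1-priv, node1-vip, ...) and the voting disk path are placeholders for your own setup, not values from this cluster:

```shell
# Rough sketch of the connectivity checks above.
# All hostnames and the voting disk path are placeholders -- substitute your own.
NODES="node1 node1-priv node1-vip node2 node2-priv node2-vip"
VOTING_DISK=/dev/raw/raw1

# Ping every node on every one of its names (public, private, VIP).
for n in $NODES; do
    if ping -c 1 -W 2 "$n" >/dev/null 2>&1; then
        echo "OK   ping $n"
    else
        echo "FAIL ping $n"
    fi
done

# Passwordless SSH should work to every name as well.
for n in $NODES; do
    if ssh -o BatchMode=yes -o ConnectTimeout=2 "$n" true >/dev/null 2>&1; then
        echo "OK   ssh $n"
    else
        echo "FAIL ssh $n"
    fi
done

# The voting disk must be readable and writable from this node.
if [ -r "$VOTING_DISK" ] && [ -w "$VOTING_DISK" ]; then
    echo "OK   voting disk $VOTING_DISK"
else
    echo "FAIL voting disk $VOTING_DISK"
fi

# As root, overall clusterware health can also be checked with: crsctl check crs
```

Run it from every node, not just one: a one-way network or permission problem looks fine from the healthy side.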
In my case, the interconnect disappeared during the night. Simply no ping on the internal interface. Maybe someone stole our network cards?
The symptoms were:
- After each restart, the first node up worked fine, while the second one failed to connect to the cluster.
- In the crsd log, it looked like this:
2008-07-09 16:11:17.539: [ CSSCLNT]clsssInitNative: connect failed, rc 9
2008-07-09 16:11:17.539: [ CRSRTI]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
- In the ocssd log:
[ CSSD]2008-07-09 19:43:41.769  >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
[ CSSD]2008-07-09 19:43:42.753  >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14465) LATS(11886024) Disk lastSeqNo(14465)
[ CSSD]2008-07-09 19:43:43.755  >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14466) LATS(11887024) Disk lastSeqNo(14466)
[ CSSD]2008-07-09 19:43:44.758  >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14467) LATS(11888034) Disk lastSeqNo(14467)
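Notice that wrtcnt keeps incrementing in those lines: the "down" node is still writing its disk heartbeat to the voting disk, so it is alive but unreachable over the network, which is why the takeover was "aborted due to ALIVE node on Disk". You can pull the counter out of the log to see this at a glance. A small sketch, using a sample snippet in place of your real ocssd log:

```shell
# If wrtcnt advances between clssnmReadDskHeartbeat lines, the peer node is
# alive on disk and the problem is the interconnect, not the node itself.
# /tmp/ocssd_sample.log stands in for the real ocssd log under $CRS_HOME/log.
cat > /tmp/ocssd_sample.log <<'EOF'
[    CSSD]2008-07-09 19:43:42.753  >TRACE:   clssnmReadDskHeartbeat: node(1) is down.  rcfg(2) wrtcnt(14465) LATS(11886024) Disk lastSeqNo(14465)
[    CSSD]2008-07-09 19:43:43.755  >TRACE:   clssnmReadDskHeartbeat: node(1) is down.  rcfg(2) wrtcnt(14466) LATS(11887024) Disk lastSeqNo(14466)
EOF

# Extract the disk heartbeat write counter from each line.
grep -o 'wrtcnt([0-9]*)' /tmp/ocssd_sample.log
# prints:
# wrtcnt(14465)
# wrtcnt(14466)
```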
Just in case you run into the same issue 🙂