Troubleshooting Broken Clusterware

I spent most of the day figuring out why one node on a 2-node RAC crashed during the night.

Here are the steps I found useful when debugging cluster issues:

  1. Check the DB alert log on all nodes.
  2. Check the clusterware logs on all nodes. They are found in $CRS_HOME/log. The most useful ones are the clusterware alert log, the crsd log and the cssd log.
  3. Check write permissions to the voting disk, from all nodes, both as oracle and as root.
  4. Check the network interfaces, both by looking at ifconfig on all nodes and by pinging every node from every other node using all its names and interfaces (public, private, VIP).
  5. Verify SSH the same way.
  6. Check that both nodes run the same OS version and same DB and clusterware versions (including patches).
  7. Stop and start the clusterware on each node separately, and then on both nodes together (see the command sketch after this list).
  8. Reboot both nodes.
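
For reference, here is a minimal sketch of the commands behind steps 2, 3, 4 and 7, assuming Linux and a 10.2-style clusterware layout; the node names (rac2), device path (/dev/raw/raw1) and interface names are placeholders that will differ on your system:

    # Step 2: clusterware logs (10.2-style layout under $CRS_HOME/log/<hostname>)
    less $CRS_HOME/log/`hostname -s`/alert`hostname -s`.log
    less $CRS_HOME/log/`hostname -s`/crsd/crsd.log
    less $CRS_HOME/log/`hostname -s`/cssd/ocssd.log

    # Step 3: locate the voting disk(s) and sanity-check access from every node,
    # as oracle and as root (read-only checks; never dd onto a voting disk)
    crsctl query css votedisk
    ls -l /dev/raw/raw1                              # placeholder voting device
    dd if=/dev/raw/raw1 of=/dev/null bs=4k count=1   # read test only

    # Step 4: interfaces and connectivity, from every node to every other node
    /sbin/ifconfig -a
    ping -c 3 rac2        # public name
    ping -c 3 rac2-priv   # private/interconnect name
    ping -c 3 rac2-vip    # VIP

    # Step 7: stop and start the clusterware on one node (as root)
    crsctl stop crs
    crsctl start crs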

In my case, the interconnect disappeared during the night. Simply no ping on the internal interface. Maybe someone stole our network cards?

The symptoms were:

  1. After each restart, the first node up worked fine, while the second one failed to join the cluster.
  2. In the crsd log, it looked like this:
    2008-07-09 16:11:17.539: [ CSSCLNT][2541575744]clsssInitNative: connect failed, rc 9
    2008-07-09 16:11:17.539: [  CRSRTI][2541575744]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
  3. In the ocssd log:
    [    CSSD]2008-07-09 19:43:41.769 [1220598112] >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
    [    CSSD]2008-07-09 19:43:42.753 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14465) LATS(11886024) Disk lastSeqNo(14465)
    [    CSSD]2008-07-09 19:43:43.755 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14466) LATS(11887024) Disk lastSeqNo(14466)
    [    CSSD]2008-07-09 19:43:44.758 [1115699552] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14467) LATS(11888034) Disk lastSeqNo(14467)

Just in case you run into the same issue 🙂


5 Comments on “Troubleshooting Broken Clusterware”

  1. Freek says:

    Chen,

    When you have a failing interconnect, you should see missed-ping messages in the cssd.log file (on both nodes). After a certain number of lost pings (depending on the platform), each node will attempt to take a lock on the voting disk(s) to decide which node will have to reboot.

    You could monitor the cssd.log file to check if you have regularly missed pings (without reaching the threshold), which would indicate an unstable interconnect.

    regards

    Freek

  2. prodlife says:

    Freek,
    Thanks for the advice. Monitoring cssd.log for lost pings is a great idea (a quick sketch of such a check is below).

    In this specific case, it turns out that when the interconnect is completely missing after the machines are restarted, but both nodes still have access to the voting disks, the errors are a bit different.
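
    A minimal sketch of what that check could look like on Linux, assuming a 10.2-style log location (the "heartbeat fatal" warnings are what cssd writes when network heartbeats start going missing):

    # count and show the most recent network-heartbeat warnings in ocssd.log;
    # a steady trickle of these without an eviction points at a flaky interconnect
    grep -c "heartbeat fatal" $CRS_HOME/log/`hostname -s`/cssd/ocssd.log
    grep "heartbeat fatal" $CRS_HOME/log/`hostname -s`/cssd/ocssd.log | tail -20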

  3. Freek says:

    “interconnect is completely missing”
    euh… have you looked under the table? 😉

    Normally the initial errors (before the restart) should be the same:

    [ CSSD]2007-03-05 12:19:50.538 [14] >TRACE: clssgmClientConnectMsg: Connect from con(100bfd420) proc(100bfc260) pid() proto(10:2:1:1)
    [ CSSD]2007-03-05 12:20:32.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 50% heartbeat fatal, eviction in 14.215 seconds
    [ CSSD]2007-03-05 12:20:39.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 75% heartbeat fatal, eviction in 7.215 seconds
    [ CSSD]2007-03-05 12:20:40.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 75% heartbeat fatal, eviction in 6.215 seconds
    [ CSSD]2007-03-05 12:20:43.840 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:43.840 [18] >TRACE: clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)
    [ CSSD]2007-03-05 12:20:44.840 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:44.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 90% heartbeat fatal, eviction in 2.215 seconds
    [ CSSD]2007-03-05 12:20:45.840 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:45.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 90% heartbeat fatal, eviction in 1.215 seconds
    [ CSSD]2007-03-05 12:20:46.840 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:46.840 [18] >WARNING: clssnmPollingThread: node sxsolb1 (2) at 90% heartbeat fatal, eviction in 0.215 seconds
    [ CSSD]2007-03-05 12:20:47.060 [18] >TRACE: clssnmPollingThread: node sxsolb1 (2) is impending reconfig
    [ CSSD]2007-03-05 12:20:47.060 [18] >TRACE: clssnmPollingThread: Eviction started for node sxsolb1 (2), flags 0x000d, state 3, wt4c 0

  4. OBJESALIESS says:

    Thanks for the post


