[Linux-HA] Network fails on primary node but secondary does not take over

Karl Pálsson karl.palsson at tern.is
Fri Oct 5 07:50:20 MDT 2007


Hi,

I have two nodes in a Heartbeat cluster. Each uses eth0 for normal traffic and eth1 for heartbeat. To simulate a network failure on the primary node, I unplug the network cable from eth0, and I expect Heartbeat to fail over to the secondary node. This does not happen: the primary stays primary and the secondary stays ... secondary. The network router (which Heartbeat is configured to ping) is on the same network as eth0.

/var/log/messages contains:
Oct  5 13:14:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct  5 13:15:17 amhs-1 ntpd[2232]: synchronized to LOCAL(0), stratum 10
Oct  5 13:15:17 amhs-1 ntpd[2232]: kernel time sync enabled 0001
Oct  5 13:16:11 amhs-1 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Down
Oct  5 13:16:16 amhs-1 kernel: drbd0: PingAck did not arrive in time.
Oct  5 13:16:16 amhs-1 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct  5 13:16:16 amhs-1 kernel: drbd0: Creating new current UUID
Oct  5 13:16:16 amhs-1 kernel: drbd0: asender terminated
Oct  5 13:16:16 amhs-1 kernel: drbd0: short read expecting header on sock: r=-512
Oct  5 13:16:16 amhs-1 kernel: drbd0: tl_clear()
Oct  5 13:16:16 amhs-1 kernel: drbd0: Connection closed
Oct  5 13:16:16 amhs-1 kernel: drbd0: Writing meta data super block now.
Oct  5 13:16:16 amhs-1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Oct  5 13:16:16 amhs-1 kernel: drbd0: receiver terminated
Oct  5 13:16:16 amhs-1 kernel: drbd0: receiver (re)started
Oct  5 13:16:16 amhs-1 kernel: drbd0: conn( Unconnected -> WFConnection )
Oct  5 13:16:21 amhs-1 heartbeat: [2337]: WARN: node 10.10.10.8: is dead
Oct  5 13:16:21 amhs-1 heartbeat: [2337]: info: Link 10.10.10.8:10.10.10.8 dead.
Oct  5 13:16:21 amhs-1 crmd: [2510]: notice: crmd_ha_status_callback: Status update: Node 10.10.10.8 now has status [dead]
Oct  5 13:16:21 amhs-1 crmd: [2510]: WARN: get_uuid: Could not calculate UUID for 10.10.10.8
Oct  5 13:16:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct  5 13:18:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct  5 13:20:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
Oct  5 13:22:28 amhs-1 cib: [2506]: info: cib_stats: Processed 71 operations (422.00us average, 0% utilization) in the last 10min
Oct  5 13:22:50 amhs-1 lrmd: [2507]: WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
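
As a rough check on the timing in the log above: the ping node 10.10.10.8 is declared dead at 13:16:21, after eth0 loses link at 13:16:11, which looks consistent with the deadtime setting in ha.cf. So Heartbeat does seem to notice that the router is gone, but no failover follows. A minimal Python sketch of that arithmetic, using two lines copied from the log (the helper names and the year are my own assumptions, since syslog timestamps carry no year):

```python
from datetime import datetime

# Two lines copied verbatim from the log above.
LINK_DOWN = "Oct  5 13:16:11 amhs-1 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Down"
PING_DEAD = "Oct  5 13:16:21 amhs-1 heartbeat: [2337]: WARN: node 10.10.10.8: is dead"

DEADTIME = 10  # seconds, from "deadtime 10" in ha.cf


def syslog_time(line, year=2007):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp (year is assumed)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def seconds_between(first, second):
    """Whole seconds elapsed from the first log line to the second."""
    return int((syslog_time(second) - syslog_time(first)).total_seconds())


elapsed = seconds_between(LINK_DOWN, PING_DEAD)
print(f"ping node declared dead {elapsed}s after link down (deadtime is {DEADTIME}s)")
```

This only compares log timestamps; it says nothing about why the resources are not moved afterwards.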



/etc/ha.d/ha.cf contains:
keepalive 1                          # Seconds between heartbeats
deadtime 10                          # Seconds of silence before a host is declared dead
warntime 5                           # Seconds before issuing a "late heartbeat" warning
initdead 40                          # Dead time used at first startup (initdead)
udpport 694                          # UDP port number to use
auto_failback off                    # Remain on the node until that node fails
#watchdog /dev/watchdog               # If it does not beat for a minute the machine will reboot
node amhs-1.tern.is                  # Host, member of the cluster, must match uname -n
node amhs-2.tern.is                  # Host, member of the cluster, must match uname -n
bcast eth1                           # Broadcast heartbeats on eth1 interface
ping 10.10.10.8                      # Ping our router to monitor ethernet connectivity
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
use_logd yes
crm yes                              # Enable version 2 functionality, supporting clusters with > 2 nodes

"ps ax" reveals that dopd is running.

The Heartbeat version is 2.1.2.

The OS is CentOS release 5.

cib.xml is attached.

-- 
Best regards / Bestu kveðjur
Karl Palsson

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cib.xml
Type: text/xml
Size: 5651 bytes
Desc: cib.xml
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20071005/68854ecd/cib-0001.bin

