[Linux-HA] After restart of primary node secondary is shut down

Guochun Shi gshi at ncsa.uiuc.edu
Tue Mar 22 09:41:59 MST 2005


At 03:06 PM 3/21/2005 +0100, you wrote:
>Hi,
>
>I've no idea why the first node shuts down the second in this configuration
>(auto_failback is off) after restarting heartbeat facilities:
>
>keepalive 2
>deadtime 45
>warntime 10
>initdead 120
>
>stonith_host itaibi09 external foo /etc/ha.d/stonith.ibmrsa -h rsa-itaibi12 -c "power off"
>stonith_host itaibi12 external foo /etc/ha.d/stonith.ibmrsa -h rsa-itaibi09 -c "power off"
>
># What interfaces to broadcast heartbeats over?
>ucast eth0 10.250.22.118
>ucast eth1 192.168.22.2
>udpport 694
>auto_failback off
>
># Tell what machines are in the cluster
>node itaibi09
>node itaibi12
>
># external ping address to test network
># (10.250.22.104 is the ip address of one of the nodes of the hp cluster)
>ping 10.250.22.1
>respawn hacluster /usr/lib64/heartbeat/ipfail
>
>
>Failover itaibi09 -> itaibi12 works:
>[...]
>Mar 21 14:54:37 itaibi09 heartbeat[10906]: info: Core process 10913 exited. 2 remaining
>Mar 21 14:54:37 itaibi09 heartbeat[10906]: info: Core process 10914 exited. 1 remaining
>Mar 21 14:54:37 itaibi09 heartbeat[10906]: info: Heartbeat shutdown complete.
>[...]
>
>And on the second node: why do I get these messages, "Link itaibi09/eth0 now
>has status dead"?
>
>[...]
>Mar 21 14:54:37 itaibi12 heartbeat: info: /usr/lib64/heartbeat/mach_down: nice_failback: foreign resources acquired
>Mar 21 14:54:37 itaibi12 heartbeat[3758]: info: mach_down takeover complete.
>Mar 21 14:54:37 itaibi12 heartbeat: info: mach_down takeover complete for node itaibi09.
>Mar 21 14:55:21 itaibi12 heartbeat[3758]: WARN: node itaibi09: is dead
>Mar 21 14:55:21 itaibi12 heartbeat[3758]: info: Dead node itaibi09 gave up resources.
>Mar 21 14:55:21 itaibi12 heartbeat[3758]: info: Link itaibi09:eth0 dead.
>Mar 21 14:55:21 itaibi12 heartbeat[3758]: info: Link itaibi09:eth1 dead.
>Mar 21 14:55:21 itaibi12 ipfail[3855]: info: Link Status update: Link itaibi09/eth0 now has status dead
>Mar 21 14:55:21 itaibi12 ipfail[3855]: debug: Found ping node 10.250.22.1!
>Mar 21 14:55:21 itaibi12 ipfail[3855]: info: Asking other side for ping node count.
>[...]
>
>Now starting heartbeat on itaibi09 is successful:
>
>Mar 21 15:00:48 itaibi09 heartbeat[12411]: info: pid 12411 locked in memory.
>Mar 21 15:00:48 itaibi09 heartbeat[12410]: info: pid 12410 locked in memory.
>Mar 21 15:00:48 itaibi09 heartbeat[12408]: info: pid 12408 locked in memory.
>Mar 21 15:00:48 itaibi09 heartbeat[12408]: info: Local status now set to: 'up'
>Mar 21 15:00:48 itaibi09 heartbeat[12413]: info: pid 12413 locked in memory.
>Mar 21 15:00:48 itaibi09 heartbeat[12416]: info: pid 12416 locked in memory.
>Mar 21 15:00:48 itaibi09 heartbeat[12415]: info: pid 12415 locked in memory.
>Mar 21 15:00:48 itaibi09 heartbeat[12414]: info: pid 12414 locked in memory.
>Mar 21 15:00:48 itaibi09 heartbeat[12408]: info: Link 10.250.22.1:10.250.22.1 up.
>Mar 21 15:00:48 itaibi09 heartbeat[12408]: info: Status update for node 10.250.22.1: status ping
>Mar 21 15:00:49 itaibi09 heartbeat[12412]: info: pid 12412 locked in memory.
>
>But then (after the initdead interval of 120 seconds):
>
>[...]
>Mar 21 15:02:48 itaibi09 heartbeat[12408]: WARN: node itaibi12: is dead

This indicates that itaibi09 cannot hear any heartbeats from itaibi12. Before
itaibi09 takes over all resources, it therefore needs to shoot (STONITH)
itaibi12 to make sure the other node is not still holding any of them.
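
The timing in the quoted log fits this picture. As a rough sketch (the helper
name initdead_expiry is made up for illustration, and the year 2005 is taken
from the date of this post because syslog stamps carry none), adding the
configured initdead of 120 seconds to the startup time on itaibi09 gives the
moment of the "is dead" warning:

```python
from datetime import datetime, timedelta

INITDEAD_SECONDS = 120  # "initdead 120" from the quoted ha.cf
LOG_YEAR = 2005         # assumption: syslog stamps omit the year


def initdead_expiry(start_stamp, initdead=INITDEAD_SECONDS):
    """Return the syslog-style time at which the initdead window ends."""
    started = datetime.strptime(f"{LOG_YEAR} {start_stamp}", "%Y %b %d %H:%M:%S")
    return (started + timedelta(seconds=initdead)).strftime("%b %d %H:%M:%S")


# Heartbeat on itaibi09 reported "Local status now set to: 'up'" at 15:00:48.
print(initdead_expiry("Mar 21 15:00:48"))  # Mar 21 15:02:48
```

So itaibi09 apparently received nothing from itaibi12 during the whole
initdead window, even though itaibi12 was up and holding the resources.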

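Why doesn't itaibi09 hear from itaibi12? Most of the time it is because of
your firewall settings: check that UDP port 694 (your udpport) is allowed
between the two nodes on both eth0 and eth1.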
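
One way to test the firewall theory is a small UDP probe between the two
nodes. The following is only a sketch, not part of heartbeat: the function
names are made up, and it assumes heartbeat is stopped on the listening node
(so the port is free) and that you run the listener as root, because 694 is a
privileged port.

```python
import socket


def open_listener(port, host="0.0.0.0"):
    """Bind a UDP socket on the given port (0 lets the OS pick one)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def wait_for_probe(sock, timeout=10.0):
    """Wait for one datagram; return (sender_ip, payload) or None on timeout."""
    sock.settimeout(timeout)
    try:
        data, peer = sock.recvfrom(1024)
    except socket.timeout:
        return None
    return peer[0], data


def send_probe(host, port, payload=b"probe"):
    """Send a single UDP datagram to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

On itaibi09 you would call wait_for_probe(open_listener(694)), and on
itaibi12 send_probe() with the address of itaibi09 on the link you want to
test, then repeat in the other direction and on the second interface. If the
listener times out while the ping node is reachable, a packet filter dropping
UDP 694 is a likely suspect.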

-Guochun



