[Linux-HA] After restart of primary node secondary is shut down
Alan Robertson
alanr at unix.sh
Tue Mar 22 11:33:20 MST 2005
Peter Weiss wrote:
> Guochun Shi <gshi at ncsa.uiuc.edu> writes:
>
>
>>At 03:06 PM 3/21/2005 +0100, you wrote:
>>
>>>Hi,
>>>
>>>I've no idea why the first node shuts down the second in this configuration
>>>(auto_failback is off) after restarting heartbeat facilities:
>>>
>>>keepalive 2
>>>deadtime 45
>>>warntime 10
>>>initdead 120
>>>
>>>stonith_host itaibi09 external foo /etc/ha.d/stonith.ibmrsa -h rsa-itaibi12 -c "power off"
>>>stonith_host itaibi12 external foo /etc/ha.d/stonith.ibmrsa -h rsa-itaibi09 -c "power off"
>>>
>>># What interfaces to broadcast heartbeats over?
>>>ucast eth0 10.250.22.118
>>>ucast eth1 192.168.22.2
>>>udpport 694
>>>auto_failback off
>>>
>>># Tell what machines are in the cluster
>>>node itaibi09
>>>node itaibi12
>>>
>>># external ping address to test network
>>># (10.250.22.104 is the ip address of one of the nodes of the hp cluster)
>>>ping 10.250.22.1
>>>respawn hacluster /usr/lib64/heartbeat/ipfail
>>>
>>>
>>>Failover itaibi09 -> itaibi12 works:
>>>[...]
>>>Mar 21 14:54:37 itaibi09 heartbeat[10906]: info: Core process 10913 exited. 2 remaining
>>>Mar 21 14:54:37 itaibi09 heartbeat[10906]: info: Core process 10914 exited. 1 remaining
>>>Mar 21 14:54:37 itaibi09 heartbeat[10906]: info: Heartbeat shutdown complete.
>>>[...]
>>>
>>>And on the second node: why do I get these messages saying that link
>>>itaibi09/eth0 now has status dead?
>>>
>>>[...]
>>>Mar 21 14:54:37 itaibi12 heartbeat: info: /usr/lib64/heartbeat/mach_down: nice_failback: foreign resources acquired
>>>Mar 21 14:54:37 itaibi12 heartbeat[3758]: info: mach_down takeover complete.
>>>Mar 21 14:54:37 itaibi12 heartbeat: info: mach_down takeover complete for node itaibi09.
>>>Mar 21 14:55:21 itaibi12 heartbeat[3758]: WARN: node itaibi09: is dead
>>>Mar 21 14:55:21 itaibi12 heartbeat[3758]: info: Dead node itaibi09 gave up resources.
>>>Mar 21 14:55:21 itaibi12 heartbeat[3758]: info: Link itaibi09:eth0 dead.
>>>Mar 21 14:55:21 itaibi12 heartbeat[3758]: info: Link itaibi09:eth1 dead.
>>>Mar 21 14:55:21 itaibi12 ipfail[3855]: info: Link Status update: Link itaibi09/eth0 now has status dead
>>>Mar 21 14:55:21 itaibi12 ipfail[3855]: debug: Found ping node 10.250.22.1!
>>>Mar 21 14:55:21 itaibi12 ipfail[3855]: info: Asking other side for ping node count.
>>>[...]
>>>
>>>Now starting heartbeat on itaibi09 is successful:
>>>
>>>Mar 21 15:00:48 itaibi09 heartbeat[12411]: info: pid 12411 locked in memory.
>>>Mar 21 15:00:48 itaibi09 heartbeat[12410]: info: pid 12410 locked in memory.
>>>Mar 21 15:00:48 itaibi09 heartbeat[12408]: info: pid 12408 locked in memory.
>>>Mar 21 15:00:48 itaibi09 heartbeat[12408]: info: Local status now set to: 'up'
>>>Mar 21 15:00:48 itaibi09 heartbeat[12413]: info: pid 12413 locked in memory.
>>>Mar 21 15:00:48 itaibi09 heartbeat[12416]: info: pid 12416 locked in memory.
>>>Mar 21 15:00:48 itaibi09 heartbeat[12415]: info: pid 12415 locked in memory.
>>>Mar 21 15:00:48 itaibi09 heartbeat[12414]: info: pid 12414 locked in memory.
>>>Mar 21 15:00:48 itaibi09 heartbeat[12408]: info: Link 10.250.22.1:10.250.22.1 up.
>>>Mar 21 15:00:48 itaibi09 heartbeat[12408]: info: Status update for node 10.250.22.1: status ping
>>>Mar 21 15:00:49 itaibi09 heartbeat[12412]: info: pid 12412 locked in memory.
>>>
>>>But then (after the initdead interval has passed):
>>>
>>>[...]
>>>Mar 21 15:02:48 itaibi09 heartbeat[12408]: WARN: node itaibi12: is dead
>>
>>This indicates that node itaibi09 cannot hear from itaibi12. Therefore, before it
>>(itaibi09) takes over all resources, it needs to shoot itaibi12 to make sure the
>>other node does not hold any resources.
>>
>>Why doesn't itaibi09 hear from itaibi12? Most of the time it's because of your
>>firewall settings.
>>
>
>
> Nope, there isn't any. It's all internal and I can ping the interfaces
> during the initdead interval. I see no messages in the logs.

Ability to ping does NOT imply the absence of a firewall. Firewalls can do
lots of different things, and pings are often allowed even though most
ports are blocked by default.

It's either a firewall, or too low an initdead setting.

Your logs have no messages in them at all? It doesn't look like that from
your email above. What messages do you think your logs don't have?

And, in these log excerpts, there have been no STONITH attempts. Why do
you think it tried to STONITH anything?

Don't trim your logs. Most people leave out important things, apparently
including you.

Given your long deadtime, I suspect either your ucast addresses or
interfaces are wrong, or you have a firewall enabled.

Your ucast directives look wrong. If you want to heartbeat over two
interfaces, then normally you have 4 ucast directives, one for each (node,
interface) pair.
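
As a rough sketch only: I'm assuming the same ha.cf is used on both nodes,
and the bracketed addresses are placeholders, since I can't tell from your
mail which address belongs to which node. As far as I recall, a ucast
directive that points at the local machine is effectively ignored, which is
what lets the file be identical on both sides:

```
# one ucast line per (node, interface) pair; the <...> values are placeholders
ucast eth0 <eth0 address of itaibi09>
ucast eth0 <eth0 address of itaibi12>
ucast eth1 <eth1 address of itaibi09>
ucast eth1 <eth1 address of itaibi12>
```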

I'd suggest trying bcast to start with. This eliminates a number of
possible errors in ucast configuration. If bcast works, then something was
wrong with your ucast directives.
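
A minimal sketch of such a test setup, assuming both eth0 and eth1 sit on
broadcast-capable networks shared by the two nodes. Use it in place of the
ucast lines, on both nodes, and keep the rest of your ha.cf as it is:

```
# heartbeat via broadcast on both interfaces instead of ucast
bcast eth0 eth1
udpport 694
```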
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me claim
from you at all times your undisguised opinions." - William Wilberforce