[Linux-HA] Unexplained failovers

Alan Robertson alanr at unix.sh
Mon Jul 11 09:00:39 MDT 2005


Joe Kemp wrote:
> I have a server that seems to be failing over with no cause.  Everything 
> was fine for about a year.  I rebooted the server a week ago, and now 
> it fails over every day or two.  There does not appear to be anything 
> wrong with the box.  Do the log entries below indicate that it was 
> unable to ping the servers in the ping group group1?  If so, how long 
> would it have been trying to ping them?  If deadtime is set to 30 seconds, 
> why did it fail over 4 seconds after the warning was written to the log 
> file?  I had a ping running every second and it did not fail when the 
> box failed over.  Any ideas?
> 
> HA-LOG server01
> 
> heartbeat: 2005/07/10_00:30:12 WARN: node group1: is dead
> heartbeat: 2005/07/10_00:30:12 info: Link group1:group1 dead.
> heartbeat: 2005/07/10_00:30:12 info: Running /etc/ha.d/rc.d/status status
> heartbeat: 2005/07/10_00:30:16 info: server01 wants to go standby [all]
> heartbeat: 2005/07/10_00:30:16 info: standby: server02 can take our all resources
> heartbeat: 2005/07/10_00:30:16 info: give up all HA resources (standby).
> heartbeat: 2005/07/10_00:30:16 info: Releasing resource group: server02 192.168.1.80 runjabber
> heartbeat: 2005/07/10_00:30:16 info: Running /etc/init.d/runjabber  stop
> heartbeat: 2005/07/10_00:30:18 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.80 stop
> 
> HA-DEBUG server01
> 
> heartbeat: 2005/07/10_00:30:12 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
> heartbeat: 2005/07/10_00:30:17 debug: Starting /etc/init.d/runjabber  stop
> 
> HARESOURCES
> 
> server02 192.168.1.80 runjabber
> 
> HA.CF
> 
> keepalive 2
> deadtime 30
> warntime 10
> baud    19200
> serial  /dev/ttyS0      # Linux
> bcast   eth0            # Linux
> auto_failback off
> node    server01
> node    server02
> ping_group group1 192.168.1.10 192.168.1.11 192.168.1.12 192.168.1.13


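Yes, those entries do indicate that heartbeat could not ping that group.  By 
the time the group is marked dead, 'deadtime' has already passed without 
hearing a ping response from any node in the ping group.  The 4 seconds in 
your log is only the gap between the group being declared dead (00:30:12) 
and server01 going standby (00:30:16), not how long the pings went unanswered.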
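
To make that rule concrete, here is a rough sketch in Python.  It is not 
heartbeat's actual code: the function name group_is_dead and its arguments 
are made up for illustration, and the exact comparison used at the 30-second 
boundary is an assumption.

```python
def group_is_dead(last_reply_times, now, deadtime):
    """Illustrative only: a ping group counts as dead when no member
    has replied within the last `deadtime` seconds.

    last_reply_times maps each member address to the time (in seconds)
    of its most recent ping reply; now is the current time in the same
    units.  One reply from ANY member keeps the whole group alive.
    """
    if not last_reply_times:
        # No member has ever replied, so nothing is keeping the group alive.
        return True
    newest_reply = max(last_reply_times.values())
    # Assumption: the group is dead once the silence reaches deadtime.
    return (now - newest_reply) >= deadtime


# Hypothetical reply times for two of the four members of group1.
replies = {"192.168.1.10": 100.0, "192.168.1.11": 95.0}
print(group_is_dead(replies, now=120.0, deadtime=30))  # silence of 20s
print(group_is_dead(replies, now=131.0, deadtime=30))  # silence of 31s
```

The point of the sketch is that the clock is reset by the most recent reply 
from any member, so all four addresses have to be silent for the whole 
deadtime before the "is dead" warning is written.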

Why it didn't hear ping replies from any of those nodes, I can't tell you...
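
One thing that may help narrow it down is to work out from the log which 
interval the replies must have been missing in, and then compare that 
interval with the output of your own once-a-second ping.  A small Python 
sketch follows, assuming the timestamp format shown in your HA-LOG lines and 
the deadtime of 30 from your ha.cf.  parse_ts and silence_window are 
hypothetical helper names, not part of heartbeat.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the quoted log lines,
# for example "2005/07/10_00:30:12".
TS_FORMAT = "%Y/%m/%d_%H:%M:%S"


def parse_ts(line):
    """Return the timestamp of a line shaped like 'heartbeat: <ts> ...'."""
    return datetime.strptime(line.split()[1], TS_FORMAT)


def silence_window(dead_line, deadtime_seconds):
    """Return (start, end) of the interval in which no ping reply was
    heard, assuming the group is declared dead exactly deadtime_seconds
    after the last reply."""
    declared_dead = parse_ts(dead_line)
    return declared_dead - timedelta(seconds=deadtime_seconds), declared_dead


dead = "heartbeat: 2005/07/10_00:30:12 WARN: node group1: is dead"
standby = "heartbeat: 2005/07/10_00:30:16 info: server01 wants to go standby [all]"

start, end = silence_window(dead, 30)
print("check the ping log between", start, "and", end)
print("seconds from dead to standby:",
      (parse_ts(standby) - parse_ts(dead)).total_seconds())
```

With the lines from your log this gives a window of 00:29:42 to 00:30:12 on 
2005/07/10, and a 4-second gap between the "is dead" warning and the standby 
request.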

-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce
