[Linux-HA] heartbeat takeover occurring twice
Brad Barnett
lists at l8r.net
Wed May 23 05:36:15 MDT 2007
No ideas on this, anyone? ;(
--------------
Hello,
I am using DRBD 0.7 (master + slave config) + heartbeat + Debian etch.
I had been using the same setup with sarge, without issue, for about a year
and a half.
Anyhow, after my upgrade to etch, and a few minor scripting changes, I
noticed that my boxes were not failing over correctly to the slave when
the master was rebooted. Everything works fine if I just pull the plug,
but during a controlled reboot of the master, the slave had problems.
On further investigation, I noticed that the slave was attempting a
takeover twice. Once when the master box started the reboot process (and
in doing so, Debian scripts informed the slave of the reboot, and the
takeover started). Then, when the reboot happened, heartbeat on the slave
noticed the main box was gone, and started a second takeover attempt.
Some logs of interest are attached...
Now, I know that all scripts must be written so that multiple takeover
attempts will not cause problems, and I've complied with that.
My slave box now takes over fine, even if a double takeover
attempt happens on it.
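
To illustrate what "safe under multiple takeover attempts" means, here is a minimal sketch of an idempotent start guard. It is not taken from my actual resource scripts (those are shell scripts under /etc/ha.d/resource.d/); the names ensure_started and FakeMount are hypothetical and exist only for this example.

```python
def ensure_started(is_running, start):
    """Start a resource only if it is not already running.

    Returns True if start() was invoked, False if the call was a no-op.
    Both arguments are caller-supplied callables (hypothetical here).
    """
    if is_running():
        return False
    start()
    return True


class FakeMount:
    """Hypothetical stand-in for a resource such as a mounted filesystem."""

    def __init__(self):
        self.mounted = False
        self.mount_calls = 0

    def is_mounted(self):
        return self.mounted

    def mount(self):
        self.mount_calls += 1
        self.mounted = True


if __name__ == "__main__":
    fs = FakeMount()
    # Two takeover attempts in a row: only the first one does real work.
    ensure_started(fs.is_mounted, fs.mount)
    ensure_started(fs.is_mounted, fs.mount)
    print("mount invoked", fs.mount_calls, "time(s)")
```

The point of the guard is that the second "start" becomes a harmless no-op instead of repeating an action such as a mount or a daemon restart.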
Something odd happens when using the version of heartbeat in etch (1.2.5).
I've seen my network interface (that heartbeat uses to communicate
between the boxes, with a crossover cable) drop packets all over the place
after the second takeover. It happened repeatedly, and I have not yet
been able to reproduce this behaviour with Debian's oldstable 1.2.3.
Also using the version of heartbeat in etch, I've seen it skip a step or two
on a release. The logs show it missing a step (although not the logs
attached...).
Anyhow, any ideas from my logs as to why the second takeover happens? As well,
any ideas at all about heartbeat borking the interface? It doesn't really
make any sense as to how heartbeat could cause the problem, but there it
is. Pings are dropped all over the place, and heartbeat can no longer
effectively communicate on that interface...
-------------- next part --------------
heartbeat: 2007/05/05_18:55:14 info: Received shutdown notice from 'masterbox.domain'.
heartbeat: 2007/05/05_18:55:14 info: Resources being acquired from masterbox.domain.
heartbeat: 2007/05/05_18:55:14 info: acquire all HA resources (standby).
heartbeat: 2007/05/05_18:55:14 info: Acquiring resource group: slavebox.domain drbddisk::r0 Filesystem::/dev/drbd0::/mnt/nfsraid::ext3::noatime killnfsd sleep::2 nfs-common nfs-kernel-server mysql sleep::6 IPaddr::x.x.x.45/24/eth1 IPaddr::y.y.y.45/24/eth1
heartbeat: 2007/05/05_18:55:14 info: Running /etc/ha.d/resource.d/drbddisk r0 start
heartbeat: 2007/05/05_18:55:14 info: Local Resource acquisition completed.
heartbeat: 2007/05/05_18:55:14 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfsraid ext3 noatime start
heartbeat: 2007/05/05_18:55:24 info: Running /etc/ha.d/resource.d/killnfsd start
heartbeat: 2007/05/05_18:55:25 WARN: node masterbox.domain: is dead
heartbeat: 2007/05/05_18:55:25 info: Dead node masterbox.domain gave up resources.
heartbeat: 2007/05/05_18:55:25 info: Link masterbox.domain:eth0 dead.
heartbeat: 2007/05/05_18:55:35 info: Running /etc/ha.d/resource.d/sleep 2 start
heartbeat: 2007/05/05_18:55:37 info: Running /etc/ha.d/resource.d/nfs-common start
heartbeat: 2007/05/05_18:55:37 info: Running /etc/ha.d/resource.d/nfs-kernel-server start
heartbeat: 2007/05/05_18:55:38 info: Running /etc/ha.d/resource.d/mysql start
heartbeat: 2007/05/05_18:55:45 info: Running /etc/ha.d/resource.d/sleep 6 start
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/resource.d/IPaddr x.x.x.45/24/eth1 start
heartbeat: 2007/05/05_18:55:52 info: /sbin/ifconfig eth1:0 x.x.x.45 netmask 255.255.255.0 broadcast x.x.x.255
heartbeat: 2007/05/05_18:55:52 info: Sending Gratuitous Arp for x.x.x.45 on eth1:0 [eth1]
heartbeat: 2007/05/05_18:55:52 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-x.x.x.45 eth1 x.x.x.45 auto x.x.x.45 ffffffffffff
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/resource.d/IPaddr y.y.y.45/24/eth1 start
heartbeat: 2007/05/05_18:55:52 info: /sbin/ifconfig eth1:2 y.y.y.45 netmask 255.255.255.0 broadcast y.y.y.255
heartbeat: 2007/05/05_18:55:52 info: Sending Gratuitous Arp for y.y.y.45 on eth1:2 [eth1]
heartbeat: 2007/05/05_18:55:52 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-y.y.y.45 eth1 y.y.y.45 auto y.y.y.45 ffffffffffff
heartbeat: 2007/05/05_18:55:52 info: all HA resource acquisition completed (standby).
heartbeat: 2007/05/05_18:55:52 ERROR: Ignored standby message 'done' from slavebox.domain in state 0
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/05/05_18:55:52 info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
heartbeat: 2007/05/05_18:55:52 info: mach_down takeover complete.
heartbeat: 2007/05/05_18:55:52 info: mach_down takeover complete for node masterbox.domain.
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
heartbeat: 2007/05/05_18:55:52 received ip-request-resp drbddisk::r0 OK yes
heartbeat: 2007/05/05_18:55:52 info: Acquiring resource group: slavebox.domain drbddisk::r0 Filesystem::/dev/drbd0::/mnt/nfsraid::ext3::noatime killnfsd sleep::2 nfs-common nfs-kernel-server mysql sleep::6 IPaddr::x.x.x.45/24/eth1 IPaddr::y.y.y.45/24/eth1
heartbeat: 2007/05/05_18:56:01 info: Running /etc/ha.d/resource.d/killnfsd start
heartbeat: 2007/05/05_18:56:12 info: Running /etc/ha.d/resource.d/sleep 2 start
heartbeat: 2007/05/05_18:56:14 info: Running /etc/ha.d/resource.d/nfs-common start
heartbeat: 2007/05/05_18:56:14 info: Running /etc/ha.d/resource.d/nfs-kernel-server start
heartbeat: 2007/05/05_18:56:14 info: Running /etc/ha.d/resource.d/mysql start
heartbeat: 2007/05/05_18:56:21 info: Running /etc/ha.d/resource.d/sleep 6 start
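
The double takeover can be spotted mechanically by looking for repeated "Acquiring resource group" entries. The following is a rough sketch that assumes the log line layout shown above ("heartbeat: DATE_TIME message"); the function name find_takeovers and the SAMPLE list are made up for this example, and the resource lists in the sample lines are shortened.

```python
import re
from datetime import datetime

# Matches the log prefix seen in the attached log, e.g.
# "heartbeat: 2007/05/05_18:55:14 info: Acquiring resource group: ..."
LINE_RE = re.compile(r"^heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) (.*)$")


def find_takeovers(lines):
    """Return the timestamp of every 'Acquiring resource group' entry."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "Acquiring resource group:" in match.group(2):
            stamps.append(datetime.strptime(match.group(1), "%Y/%m/%d_%H:%M:%S"))
    return stamps


# A few lines copied from the attached log (resource lists shortened).
SAMPLE = [
    "heartbeat: 2007/05/05_18:55:14 info: Received shutdown notice from 'masterbox.domain'.",
    "heartbeat: 2007/05/05_18:55:14 info: Acquiring resource group: slavebox.domain drbddisk::r0",
    "heartbeat: 2007/05/05_18:55:25 WARN: node masterbox.domain: is dead",
    "heartbeat: 2007/05/05_18:55:52 info: mach_down takeover complete.",
    "heartbeat: 2007/05/05_18:55:52 info: Acquiring resource group: slavebox.domain drbddisk::r0",
]

if __name__ == "__main__":
    takeovers = find_takeovers(SAMPLE)
    print("acquisition attempts:", len(takeovers))
    if len(takeovers) > 1:
        gap = (takeovers[1] - takeovers[0]).total_seconds()
        print("seconds between first and second attempt:", gap)
```

Run against a full log (for example by passing open(path) as lines), more than one timestamp within a single failover suggests the double takeover described above.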