[Linux-HA] failover trouble with DRBD after Stonith reset of active server

Dave Dykstra dwdha at drdykstra.us
Thu Jul 14 11:43:34 MDT 2005


In the failover my production server experienced this morning, the standby
server initially failed to take over because starting DRBD right after it
had Stonithed the active server failed.  I'm quite sure this is a case
of the same problem I had previously reported to DRBD mailing list but
which has so far gotten not received a satisfactory answer.   I now think
the answer may be primarily have to come from heartbeat, not DRBD.

The problem comes in when as far as DRBD is concerned the other node
is active right up until the time the Stonith happens: DRBD refuses to
allow the standby node to become active until its own timeout period has
elapsed.  In the past I had only noticed that when doing a kill -9 of the
active heartbeat process on the active server, in order to test Stonith.
I wasn't too bothered by not having an answer for that because I reasoned
that maybe it wasn't too likely that heartbeat would die before DRBD in
practice, but this morning I experienced that kind of situation; DRBD
uses the point-to-point link between the two servers, and the standby
DRBD noticed no problem until the active server was Stonithed.

Heartbeat on the standby server did eventually recover by restarting
itself, then noticing that its peer wasn't alive yet it attempted to take
over again and this time enough time had elapsed for DRBD to not complain.
It sure seems that there should be a more controlled recovery than that
though.  What would it take for heartbeat to be able to tell DRBD that
it knows for sure that the other side is dead so DRBD should go ahead
and take over?  I had first suggested on the DRBD list that DRBD send a
flurry of packets in a short time to try to determine whether or not the
other side is still up, but they didn't go for that and I don't blame them.
Somehow the information that heartbeat knows, that the other side is
really dead, needs to be passed to DRBD.

On the other hand, should DRBD always be trusting that heartbeat would
never tell it to start unless the other side is really stopped?  Maybe all
that really needs to change is /etc/ha.d/resource.d/drbddisk, to pass
the drbdsetup --do-what-I-say option on the command to become primary.
That would send the ball back to DRBD because drbddisk is part of that
package.  What's the opinion of the linux-ha team?

Below is the full log entries for the failed & retried takeover.  Note that
there's no error message from drbddisk saved in the log, but I'm 98%
sure that's what the problem was because that's how DRBD behaves.

- Dave

On the active server:
Jul 14 06:52:41 swfs1 heartbeat: [4161]: WARN: Late heartbeat: Node swfs1: interval 2190 ms
Jul 14 06:52:41 swfs1 heartbeat: [4161]: WARN: Late heartbeat: Node swfs1: interval 3410 ms
Jul 14 06:52:45 swfs1 heartbeat: [4161]: WARN: Late heartbeat: Node swfs1: interval 2030 ms
Jul 14 06:52:46 swfs1 heartbeat: [4161]: WARN: node swfs2: is dead
Jul 14 06:52:47 swfs1 heartbeat: [4161]: info: Dead node swfs2 gave up resources.
Jul 14 06:52:49 swfs1 heartbeat: [4161]: WARN: node 172.18.1.1: is dead
Jul 14 06:52:50 swfs1 heartbeat: [4161]: info: Link swfs2:eth0 dead.
Jul 14 06:52:52 swfs1 ipfail: [4280]: info: Status update: Node swfs2 now has status dead
Jul 14 06:52:52 swfs1 ipfail: [4280]: WARN:  message delayed from enqueue to recv 3020 ms (enqueue-time=32179984, peer pid=4161)
Jul 14 06:52:52 swfs1 heartbeat: [4161]: info: Link swfs2:eth1 dead.
Jul 14 06:56:26 swfs1 syslogd 1.4.1#17: restart.

On the standby-becoming-active server:
Jul 14 06:52:33 swfs2 heartbeat: [3453]: WARN: Late heartbeat: Node swfs1: interval 4570 ms
Jul 14 06:52:40 swfs2 heartbeat: [3453]: WARN: Late heartbeat: Node swfs1: interval 3050 ms
Jul 14 06:52:50 swfs2 heartbeat: [3453]: WARN: node swfs1: is dead
Jul 14 06:52:50 swfs2 ipfail: [3572]: info: Status update: Node swfs1 now has status dead
Jul 14 06:52:50 swfs2 ipfail: [3572]: info: NS: We are still alive!
Jul 14 06:52:50 swfs2 heartbeat: [7947]: info: Resetting node swfs1 with [APCSmart]
Jul 14 06:52:52 swfs2 heartbeat: [3453]: WARN: Deadtime value may be too small.
Jul 14 06:52:52 swfs2 heartbeat: [3453]: info: See documentation for information on tuning deadtime.
Jul 14 06:52:52 swfs2 heartbeat: [3453]: WARN: Late heartbeat: Node swfs1: interval 11970 ms
Jul 14 06:52:52 swfs2 heartbeat: [3453]: info: Status update for node swfs1: status active
Jul 14 06:52:52 swfs2 ipfail: [3572]: info: Status update: Node swfs1 now has status active
Jul 14 06:52:52 swfs2 harc[7948]: info: Running /etc/ha.d/rc.d/status status
Jul 14 06:52:53 swfs2 heartbeat: [7947]: info: node swfs1 now reset.
Jul 14 06:52:53 swfs2 heartbeat: [3453]: info: Exiting STONITH swfs1 process 7947 returned rc 0.
Jul 14 06:52:53 swfs2 heartbeat: [3453]: info: Resources being acquired from swfs1.
Jul 14 06:52:53 swfs2 harc[7958]: info: Running /etc/ha.d/rc.d/status status
Jul 14 06:52:53 swfs2 heartbeat: [7959]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys swfs2] to acquire.
Jul 14 06:52:53 swfs2 mach_down[7978]: info: Taking over resource group drbddisk::home
Jul 14 06:52:53 swfs2 ResourceManager[7998]: info: Acquiring resource group: swfs1 drbddisk::home FilesystemNokill::/dev/drbd0::/mnt/home::ext3 FilesystemMaint::/mnt/home drbddisk::mirror FilesystemNokill::/dev/drbd1::/mnt/mirror::ext3::ro FilesystemMaint::/mnt/mirror Restart::ntp-server::down IPaddr2::192.168.1.3 ypserv:: StartStop::bind9::/usr/sbin/named IPaddr2::172.18.1.3 StartStop::nfs-common::/sbin/rpc.statd StartStop::nfs-kernel-server::/usr/sbin/rpc.mountd StartStop::samba::/usr/sbin/nmbd IPaddr2::172.18.30.21 IPaddr2::172.18.30.18 IPsrcaddr::172.18.30.21 FilesystemMaint::/nfs/swfs/home FilesystemMaint::/nfs/swfs/mirror Restart::ssh::up Restart::ntp-server::up StatusChange::
Jul 14 06:52:53 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/drbddisk home start
Jul 14 06:52:54 swfs2 heartbeat: [3453]: WARN: Shutdown delayed until current resource activity finishes.
Jul 14 06:52:54 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Jul 14 06:52:58 swfs2 ResourceManager[7998]: CRIT: Giving up resources due to failure of drbddisk::home
Jul 14 06:52:58 swfs2 ResourceManager[7998]: info: Releasing resource group: swfs1 drbddisk::home FilesystemNokill::/dev/drbd0::/mnt/home::ext3 FilesystemMaint::/mnt/home drbddisk::mirror FilesystemNokill::/dev/drbd1::/mnt/mirror::ext3::ro FilesystemMaint::/mnt/mirror Restart::ntp-server::down IPaddr2::192.168.1.3 ypserv:: StartStop::bind9::/usr/sbin/named IPaddr2::172.18.1.3 StartStop::nfs-common::/sbin/rpc.statd StartStop::nfs-kernel-server::/usr/sbin/rpc.mountd StartStop::samba::/usr/sbin/nmbd IPaddr2::172.18.30.21 IPaddr2::172.18.30.18 IPsrcaddr::172.18.30.21 FilesystemMaint::/nfs/swfs/home FilesystemMaint::/nfs/swfs/mirror Restart::ssh::up Restart::ntp-server::up StatusChange::
Jul 14 06:52:58 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/StatusChange  stop
Jul 14 06:52:58 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/Restart ntp-server up stop
Jul 14 06:52:58 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/Restart ssh up stop
Jul 14 06:52:58 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/FilesystemMaint /nfs/swfs/mirror stop
Jul 14 06:52:58 swfs2 FilesystemMaint[8270]: /nfs/swfs/mirror not mounted, nothing to do
Jul 14 06:52:58 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/FilesystemMaint /nfs/swfs/home stop
Jul 14 06:52:58 swfs2 FilesystemMaint[8304]: /nfs/swfs/home not mounted, nothing to do
Jul 14 06:52:58 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/IPsrcaddr 172.18.30.21 stop
Jul 14 06:52:58 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/IPaddr2 172.18.30.18 stop
Jul 14 06:52:59 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/IPaddr2 172.18.30.21 stop
Jul 14 06:52:59 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/StartStop samba /usr/sbin/nmbd stop
Jul 14 06:53:01 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/StartStop nfs-kernel-server /usr/sbin/rpc.mountd stop
Jul 14 06:53:01 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/StartStop nfs-common /sbin/rpc.statd stop
Jul 14 06:53:01 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/IPaddr2 172.18.1.3 stop
Jul 14 06:53:01 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/StartStop bind9 /usr/sbin/named stop
Jul 14 06:53:01 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/ypserv  stop
Jul 14 06:53:01 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/IPaddr2 192.168.1.3 stop
Jul 14 06:53:01 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/Restart ntp-server down stop
Jul 14 06:53:02 swfs2 kernel: drbd0: drbd0_asender [4992]: cstate Connected --> NetworkFailure
Jul 14 06:53:02 swfs2 kernel: drbd0: asender terminated
Jul 14 06:53:02 swfs2 kernel: drbd0: drbd0_receiver [3355]: cstate NetworkFailure --> BrokenPipe
Jul 14 06:53:02 swfs2 kernel: drbd0: worker terminated
Jul 14 06:53:02 swfs2 kernel: drbd0: drbd0_receiver [3355]: cstate BrokenPipe --> Unconnected
Jul 14 06:53:02 swfs2 kernel: drbd0: Connection lost.
Jul 14 06:53:02 swfs2 kernel: drbd0: drbd0_receiver [3355]: cstate Unconnected --> WFConnection
Jul 14 06:53:02 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Jul 14 06:53:03 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/FilesystemMaint /mnt/mirror stop
Jul 14 06:53:03 swfs2 FilesystemMaint[8776]: /mnt/mirror not mounted, nothing to do
Jul 14 06:53:03 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/FilesystemNokill /dev/drbd1 /mnt/mirror ext3 ro stop
Jul 14 06:53:03 swfs2 FilesystemNokill[8810]: WARNING: Filesystem /mnt/mirror not mounted?
Jul 14 06:53:03 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/drbddisk mirror stop
Jul 14 06:53:03 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/FilesystemMaint /mnt/home stop
Jul 14 06:53:03 swfs2 FilesystemMaint[8882]: /mnt/home not mounted, nothing to do
Jul 14 06:53:03 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/FilesystemNokill /dev/drbd0 /mnt/home ext3 stop
Jul 14 06:53:03 swfs2 FilesystemNokill[8916]: WARNING: Filesystem /mnt/home not mounted?
Jul 14 06:53:03 swfs2 ResourceManager[7998]: info: Running /etc/ha.d/resource.d/drbddisk home stop
Jul 14 06:53:03 swfs2 mach_down[7978]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Jul 14 06:53:03 swfs2 heartbeat: [3453]: WARN: node swfs1: is dead
Jul 14 06:53:03 swfs2 heartbeat: [3453]: info: Dead node swfs1 gave up resources.
Jul 14 06:53:03 swfs2 heartbeat: [3453]: info: Link swfs1:eth0 dead.
Jul 14 06:53:03 swfs2 heartbeat: [3453]: info: Link swfs1:eth1 dead.
Jul 14 06:53:03 swfs2 heartbeat: [3453]: info: mach_down takeover complete.
Jul 14 06:53:03 swfs2 mach_down[7978]: info: mach_down takeover complete for node swfs1.
Jul 14 06:53:03 swfs2 ipfail: [3572]: info: Status update: Node swfs1 now has status dead
Jul 14 06:53:03 swfs2 ipfail: [3572]: info: NS: We are still alive!
Jul 14 06:53:03 swfs2 ipfail: [3572]: info: Link Status update: Link swfs1/eth0 now has status dead
Jul 14 06:53:03 swfs2 heartbeat: [3453]: info: Heartbeat shutdown in progress. (3453)
Jul 14 06:53:03 swfs2 heartbeat: [8979]: info: Giving up all HA resources.
Jul 14 06:53:03 swfs2 ipfail: [3572]: info: Asking other side for ping node count.
Jul 14 06:53:03 swfs2 ipfail: [3572]: info: Checking remote count of ping nodes.
Jul 14 06:53:03 swfs2 ipfail: [3572]: info: Link Status update: Link swfs1/eth1 now has status dead
Jul 14 06:53:03 swfs2 ipfail: [3572]: info: Asking other side for ping node count.
Jul 14 06:53:03 swfs2 ipfail: [3572]: info: Checking remote count of ping nodes.
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Releasing resource group: swfs1 drbddisk::home FilesystemNokill::/dev/drbd0::/mnt/home::ext3 FilesystemMaint::/mnt/home drbddisk::mirror FilesystemNokill::/dev/drbd1::/mnt/mirror::ext3::ro FilesystemMaint::/mnt/mirror Restart::ntp-server::down IPaddr2::192.168.1.3 ypserv:: StartStop::bind9::/usr/sbin/named IPaddr2::172.18.1.3 StartStop::nfs-common::/sbin/rpc.statd StartStop::nfs-kernel-server::/usr/sbin/rpc.mountd StartStop::samba::/usr/sbin/nmbd IPaddr2::172.18.30.21 IPaddr2::172.18.30.18 IPsrcaddr::172.18.30.21 FilesystemMaint::/nfs/swfs/home FilesystemMaint::/nfs/swfs/mirror Restart::ssh::up Restart::ntp-server::up StatusChange::
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/StatusChange  stop
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/Restart ntp-server up stop
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/Restart ssh up stop
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/FilesystemMaint /nfs/swfs/mirror stop
Jul 14 06:53:03 swfs2 FilesystemMaint[9105]: /nfs/swfs/mirror not mounted, nothing to do
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/FilesystemMaint /nfs/swfs/home stop
Jul 14 06:53:03 swfs2 FilesystemMaint[9139]: /nfs/swfs/home not mounted, nothing to do
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/IPsrcaddr 172.18.30.21 stop
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/IPaddr2 172.18.30.18 stop
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/IPaddr2 172.18.30.21 stop
Jul 14 06:53:03 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/StartStop samba /usr/sbin/nmbd stop
Jul 14 06:53:05 swfs2 kernel: drbd1: drbd1_asender [4993]: cstate Connected --> NetworkFailure
Jul 14 06:53:05 swfs2 kernel: drbd1: asender terminated
Jul 14 06:53:05 swfs2 kernel: drbd1: drbd1_receiver [3363]: cstate NetworkFailure --> BrokenPipe
Jul 14 06:53:05 swfs2 kernel: drbd1: worker terminated
Jul 14 06:53:05 swfs2 kernel: drbd1: drbd1_receiver [3363]: cstate BrokenPipe --> Unconnected
Jul 14 06:53:05 swfs2 kernel: drbd1: Connection lost.
Jul 14 06:53:05 swfs2 kernel: drbd1: drbd1_receiver [3363]: cstate Unconnected --> WFConnection
Jul 14 06:53:06 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/StartStop nfs-kernel-server /usr/sbin/rpc.mountd stop
Jul 14 06:53:06 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/StartStop nfs-common /sbin/rpc.statd stop
Jul 14 06:53:06 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/IPaddr2 172.18.1.3 stop
Jul 14 06:53:06 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/StartStop bind9 /usr/sbin/named stop
Jul 14 06:53:06 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/ypserv  stop
Jul 14 06:53:06 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/IPaddr2 192.168.1.3 stop
Jul 14 06:53:06 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/Restart ntp-server down stop
Jul 14 06:53:07 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Jul 14 06:53:08 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/FilesystemMaint /mnt/mirror stop
Jul 14 06:53:08 swfs2 FilesystemMaint[9611]: /mnt/mirror not mounted, nothing to do
Jul 14 06:53:08 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/FilesystemNokill /dev/drbd1 /mnt/mirror ext3 ro stop
Jul 14 06:53:08 swfs2 FilesystemNokill[9645]: WARNING: Filesystem /mnt/mirror not mounted?
Jul 14 06:53:08 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/drbddisk mirror stop
Jul 14 06:53:08 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/FilesystemMaint /mnt/home stop
Jul 14 06:53:08 swfs2 FilesystemMaint[9717]: /mnt/home not mounted, nothing to do
Jul 14 06:53:08 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/FilesystemNokill /dev/drbd0 /mnt/home ext3 stop
Jul 14 06:53:08 swfs2 FilesystemNokill[9751]: WARNING: Filesystem /mnt/home not mounted?
Jul 14 06:53:08 swfs2 ResourceManager[8989]: info: Running /etc/ha.d/resource.d/drbddisk home stop
Jul 14 06:53:08 swfs2 heartbeat: [8979]: info: All HA resources relinquished.
Jul 14 06:53:09 swfs2 heartbeat: [3453]: info: killing /usr/lib/heartbeat/ipfail  process group 3572 with signal 15
Jul 14 06:53:10 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: killing HBREAD process 3564 with signal 15
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: killing HBFIFO process 3556 with signal 15
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: killing HBWRITE process 3557 with signal 15
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: killing HBREAD process 3558 with signal 15
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: killing HBWRITE process 3559 with signal 15
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: killing HBREAD process 3560 with signal 15
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: killing HBWRITE process 3561 with signal 15
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: killing HBREAD process 3562 with signal 15
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: killing HBWRITE process 3563 with signal 15
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Core process 3558 exited. 9 remaining
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Core process 3556 exited. 8 remaining
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Core process 3557 exited. 7 remaining
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Core process 3559 exited. 6 remaining
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Core process 3560 exited. 5 remaining
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Core process 3561 exited. 4 remaining
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Core process 3562 exited. 3 remaining
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Core process 3563 exited. 2 remaining
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Core process 3564 exited. 1 remaining
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Heartbeat shutdown complete.
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Heartbeat restart triggered.
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Restarting heartbeat.
Jul 14 06:53:10 swfs2 heartbeat: [3453]: info: Performing heartbeat restart exec.
Jul 14 06:53:25 swfs2 heartbeat: [3453]: WARN: Core dumps could be lost if multiple dumps occur
Jul 14 06:53:25 swfs2 heartbeat: [3453]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Jul 14 06:53:25 swfs2 heartbeat: [3453]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Jul 14 06:53:25 swfs2 heartbeat: [3453]: info: **************************
Jul 14 06:53:25 swfs2 heartbeat: [3453]: info: Configuration validated. Starting heartbeat 1.99.6
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: heartbeat: version 1.99.6
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: Heartbeat generation: 294
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: ucast: bound send socket to device: eth0
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: ucast: bound receive socket to device: eth0
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: ucast: started on port 694 interface eth0 to 172.18.30.26
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: ucast: bound send socket to device: eth0
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: ucast: bound receive socket to device: eth0
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: ucast: started on port 694 interface eth0 to 172.18.30.27
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth1
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: glib: ping heartbeat started.
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: pid 9802 locked in memory.
Jul 14 06:53:25 swfs2 heartbeat: [9802]: info: Local status now set to: 'up'
Jul 14 06:53:26 swfs2 heartbeat: [9804]: info: pid 9804 locked in memory.
Jul 14 06:53:26 swfs2 heartbeat: [9809]: info: pid 9809 locked in memory.
Jul 14 06:53:26 swfs2 heartbeat: [9805]: info: pid 9805 locked in memory.
Jul 14 06:53:26 swfs2 heartbeat: [9806]: info: pid 9806 locked in memory.
Jul 14 06:53:26 swfs2 heartbeat: [9812]: info: pid 9812 locked in memory.
Jul 14 06:53:26 swfs2 heartbeat: [9807]: info: pid 9807 locked in memory.
Jul 14 06:53:26 swfs2 heartbeat: [9811]: info: pid 9811 locked in memory.
Jul 14 06:53:26 swfs2 heartbeat: [9802]: info: Link 172.18.1.1:172.18.1.1 up.
Jul 14 06:53:26 swfs2 heartbeat: [9802]: info: Status update for node 172.18.1.1: status ping
Jul 14 06:53:26 swfs2 heartbeat: [9808]: info: pid 9808 locked in memory.
Jul 14 06:53:26 swfs2 heartbeat: [9810]: info: pid 9810 locked in memory.
Jul 14 06:53:26 swfs2 heartbeat: [9802]: info: Link swfs2:eth1 up.
Jul 14 06:53:45 swfs2 heartbeat: [9802]: WARN: node swfs1: is dead
Jul 14 06:53:45 swfs2 heartbeat: [9802]: info: Local status now set to: 'active'
Jul 14 06:53:45 swfs2 heartbeat: [9802]: info: Starting child client "/usr/lib/heartbeat/ipfail " (17,110)
Jul 14 06:53:45 swfs2 heartbeat: [9815]: info: Starting "/usr/lib/heartbeat/ipfail " as uid 17  gid 110 (pid 9815)
Jul 14 06:53:45 swfs2 heartbeat: [9814]: info: Checking status of STONITH device [APCSmart]
Jul 14 06:53:45 swfs2 heartbeat: [9813]: info: Resetting node swfs1 with [APCSmart]
Jul 14 06:53:45 swfs2 heartbeat: [9802]: WARN: Exiting STONITH swfs1 process 9813 returned rc 1.
Jul 14 06:53:48 swfs2 heartbeat: [9802]: WARN: Exiting STONITH-stat process 9814 returned rc 1.
Jul 14 06:53:48 swfs2 heartbeat: [9802]: info: This may mean that the STONITH device has failed!
Jul 14 06:53:50 swfs2 heartbeat: [9816]: info: Resetting node swfs1 with [APCSmart]
Jul 14 06:53:54 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Jul 14 06:53:54 swfs2 heartbeat: [9816]: info: node swfs1 now reset.
Jul 14 06:53:54 swfs2 heartbeat: [9802]: info: Exiting STONITH swfs1 process 9816 returned rc 0.
Jul 14 06:53:54 swfs2 heartbeat: [9802]: info: Resources being acquired from swfs1.
Jul 14 06:53:54 swfs2 harc[9817]: info: Running /etc/ha.d/rc.d/status status
Jul 14 06:53:54 swfs2 heartbeat: [9818]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys swfs2] to acquire.
Jul 14 06:53:54 swfs2 heartbeat: [9802]: info: Initial resource acquisition complete (T_RESOURCES(us))
Jul 14 06:53:54 swfs2 mach_down[9837]: info: Taking over resource group drbddisk::home
Jul 14 06:53:54 swfs2 ResourceManager[9857]: info: Acquiring resource group: swfs1 drbddisk::home FilesystemNokill::/dev/drbd0::/mnt/home::ext3 FilesystemMaint::/mnt/home drbddisk::mirror FilesystemNokill::/dev/drbd1::/mnt/mirror::ext3::ro FilesystemMaint::/mnt/mirror Restart::ntp-server::down IPaddr2::192.168.1.3 ypserv:: StartStop::bind9::/usr/sbin/named IPaddr2::172.18.1.3 StartStop::nfs-common::/sbin/rpc.statd StartStop::nfs-kernel-server::/usr/sbin/rpc.mountd StartStop::samba::/usr/sbin/nmbd IPaddr2::172.18.30.21 IPaddr2::172.18.30.18 IPsrcaddr::172.18.30.21 FilesystemMaint::/nfs/swfs/home FilesystemMaint::/nfs/swfs/mirror Restart::ssh::up Restart::ntp-server::up StatusChange::
Jul 14 06:53:54 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/drbddisk home start
Jul 14 06:53:54 swfs2 kernel: drbd0: Secondary/Unknown --> Primary/Unknown
Jul 14 06:53:54 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/FilesystemNokill /dev/drbd0 /mnt/home ext3 start
Jul 14 06:54:01 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Jul 14 06:54:04 swfs2 heartbeat: [9802]: info: Local Resource acquisition completed. (none)
Jul 14 06:54:04 swfs2 heartbeat: [9802]: info: local resource transition completed.
Jul 14 06:54:05 swfs2 kernel: kjournald starting.  Commit interval 5 seconds
Jul 14 06:54:05 swfs2 kernel: EXT3 FS on drbd0, internal journal
Jul 14 06:54:05 swfs2 kernel: EXT3-fs: recovery complete.
Jul 14 06:54:05 swfs2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/FilesystemMaint /mnt/home start
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/drbddisk mirror start
Jul 14 06:54:05 swfs2 kernel: drbd1: Secondary/Unknown --> Primary/Unknown
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/FilesystemNokill /dev/drbd1 /mnt/mirror ext3 ro start
Jul 14 06:54:05 swfs2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 14 06:54:05 swfs2 kernel: kjournald starting.  Commit interval 5 seconds
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/FilesystemMaint /mnt/mirror start
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/IPaddr2 192.168.1.3 start
Jul 14 06:54:05 swfs2 IPaddr2[10256]: info: /bin/ip -f inet addr add 192.168.1.3/24 brd 192.168.1.255 dev eth1
Jul 14 06:54:05 swfs2 IPaddr2[10256]: info: /bin/ip link set eth1 up
Jul 14 06:54:05 swfs2 IPaddr2[10256]: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.1.3 eth1 192.168.1.3 auto 192.168.1.3 ffffffffffff
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/ypserv  start
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/StartStop bind9 /usr/sbin/named start
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/IPaddr2 172.18.1.3 start
Jul 14 06:54:05 swfs2 IPaddr2[10460]: info: /bin/ip -f inet addr add 172.18.1.3/18 brd 172.18.63.255 dev eth0
Jul 14 06:54:05 swfs2 IPaddr2[10460]: info: /bin/ip link set eth0 up
Jul 14 06:54:05 swfs2 IPaddr2[10460]: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-172.18.1.3 eth0 172.18.1.3 auto 172.18.1.3 ffffffffffff
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/StartStop nfs-common /sbin/rpc.statd start
Jul 14 06:54:05 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/StartStop nfs-kernel-server /usr/sbin/rpc.mountd start
Jul 14 06:54:06 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/StartStop samba /usr/sbin/nmbd start
Jul 14 06:54:06 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/IPaddr2 172.18.30.21 start
Jul 14 06:54:06 swfs2 IPaddr2[10700]: info: /bin/ip -f inet addr add 172.18.30.21/18 brd 172.18.63.255 dev eth0
Jul 14 06:54:06 swfs2 IPaddr2[10700]: info: /bin/ip link set eth0 up
Jul 14 06:54:06 swfs2 IPaddr2[10700]: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-172.18.30.21 eth0 172.18.30.21 auto 172.18.30.21 ffffffffffff
Jul 14 06:54:06 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/IPaddr2 172.18.30.18 start
Jul 14 06:54:06 swfs2 IPaddr2[10803]: info: /bin/ip -f inet addr add 172.18.30.18/18 brd 172.18.63.255 dev eth0
Jul 14 06:54:06 swfs2 IPaddr2[10803]: info: /bin/ip link set eth0 up
Jul 14 06:54:06 swfs2 IPaddr2[10803]: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-172.18.30.18 eth0 172.18.30.18 auto 172.18.30.18 ffffffffffff
Jul 14 06:54:06 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/IPsrcaddr 172.18.30.21 start
Jul 14 06:54:06 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/FilesystemMaint /nfs/swfs/home start
Jul 14 06:54:06 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/FilesystemMaint /nfs/swfs/mirror start
Jul 14 06:54:07 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/Restart ssh up start
Jul 14 06:54:07 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Jul 14 06:54:09 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/Restart ntp-server up start
Jul 14 06:54:09 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Jul 14 06:54:11 swfs2 ResourceManager[9857]: info: Running /etc/ha.d/resource.d/StatusChange  start
Jul 14 06:54:11 swfs2 mach_down[9837]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Jul 14 06:54:11 swfs2 heartbeat: [9802]: info: mach_down takeover complete.
Jul 14 06:54:11 swfs2 mach_down[9837]: info: mach_down takeover complete for node swfs1.
Jul 14 06:55:56 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Jul 14 06:55:59 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Jul 14 06:56:15 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Jul 14 06:56:17 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Jul 14 06:56:39 swfs2 kernel: drbd0: drbd0_receiver [3355]: cstate WFConnection --> WFReportParams
Jul 14 06:56:39 swfs2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jul 14 06:56:39 swfs2 kernel: drbd0: Connection established.
Jul 14 06:56:39 swfs2 kernel: drbd0: I am(P): 1:00000005:00000003:00000240:00000042:10
Jul 14 06:56:39 swfs2 kernel: drbd0: Peer(S): 1:00000005:00000003:00000240:00000041:11
Jul 14 06:56:39 swfs2 kernel: drbd0: drbd0_receiver [3355]: cstate WFReportParams --> WFBitMapS
Jul 14 06:56:39 swfs2 kernel: drbd1: drbd1_receiver [3363]: cstate WFConnection --> WFReportParams
Jul 14 06:56:39 swfs2 kernel: drbd1: Handshake successful: DRBD Network Protocol version 74
Jul 14 06:56:39 swfs2 kernel: drbd1: Connection established.
Jul 14 06:56:39 swfs2 kernel: drbd1: I am(P): 1:00000002:00000001:0000002a:00000004:10
Jul 14 06:56:39 swfs2 kernel: drbd1: Peer(S): 1:00000002:00000001:0000002a:00000003:11
Jul 14 06:56:39 swfs2 kernel: drbd1: drbd1_receiver [3363]: cstate WFReportParams --> WFBitMapS
Jul 14 06:56:39 swfs2 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Jul 14 06:56:39 swfs2 kernel: drbd1: Primary/Unknown --> Primary/Secondary
Jul 14 06:56:39 swfs2 kernel: drbd0: drbd0_receiver [3355]: cstate WFBitMapS --> SyncSource
Jul 14 06:56:39 swfs2 kernel: drbd0: Resync started as SyncSource (need to sync 530604 KB [132651 bits set]).
Jul 14 06:56:39 swfs2 kernel: drbd1: drbd1_receiver [3363]: cstate WFBitMapS --> SyncSource
Jul 14 06:56:39 swfs2 kernel: drbd1: Resync started as SyncSource (need to sync 520192 KB [130048 bits set]).
Jul 14 06:56:49 swfs2 heartbeat: [9802]: info: Link swfs1:eth0 up.
Jul 14 06:56:49 swfs2 heartbeat: [9802]: info: Status update for node swfs1: status init
Jul 14 06:56:49 swfs2 heartbeat: [9802]: info: Status update for node swfs1: status up
Jul 14 06:56:49 swfs2 ipfail: [9815]: info: Link Status update: Link swfs1/eth0 now has status up
Jul 14 06:56:49 swfs2 ipfail: [9815]: info: Status update: Node swfs1 now has status init
Jul 14 06:56:49 swfs2 ipfail: [9815]: info: Status update: Node swfs1 now has status up
Jul 14 06:56:49 swfs2 harc[11188]: info: Running /etc/ha.d/rc.d/status status
Jul 14 06:56:49 swfs2 heartbeat: [9802]: info: Link swfs1:eth1 up.
Jul 14 06:56:49 swfs2 ipfail: [9815]: info: Link Status update: Link swfs1/eth1 now has status up
Jul 14 06:56:49 swfs2 harc[11198]: info: Running /etc/ha.d/rc.d/status status
Jul 14 06:56:50 swfs2 heartbeat: [9802]: info: Status update for node swfs1: status active
Jul 14 06:56:50 swfs2 heartbeat: [9802]: info: remote resource transition completed.
Jul 14 06:56:50 swfs2 ipfail: [9815]: info: Status update: Node swfs1 now has status active
Jul 14 06:56:50 swfs2 harc[11208]: info: Running /etc/ha.d/rc.d/status status
Jul 14 06:56:50 swfs2 ipfail: [9815]: info: Asking other side for ping node count.
Jul 14 06:56:51 swfs2 ipfail: [9815]: info: No giveup timer to abort.
Jul 14 06:57:10 swfs2 kernel: drbd1: Resync done (total 30 sec; paused 0 sec; 17336 K/sec)
Jul 14 06:57:10 swfs2 kernel: drbd1: drbd1_worker [9330]: cstate SyncSource --> Connected
Jul 14 06:57:25 swfs2 kernel: drbd0: Resync done (total 45 sec; paused 0 sec; 11788 K/sec)
Jul 14 06:57:25 swfs2 kernel: drbd0: drbd0_worker [8749]: cstate SyncSource --> Connected




More information about the Linux-HA mailing list