[Linux-HA] Re: deadtime, warntime, and drbd
Lars Marowsky-Bree
lmb at suse.de
Tue Mar 8 03:34:38 MST 2005
On 2005-03-07T14:03:47, Jason Joines <support at bus.okstate.edu> wrote:
> ########### nodea log start ###########
> Feb 25 12:30:51 nodea kernel: drbd0: PingAck did not arrive in time.
> Feb 25 12:30:51 nodea kernel: drbd0: drbd0_asender [7323]: cstate
> Connected --> NetworkFailure
> Feb 25 12:30:51 nodea kernel: drbd0: asender terminated
> Feb 25 12:30:51 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
> NetworkFailure --> BrokenPipe
> Feb 25 12:30:51 nodea kernel: drbd0: short read expecting header on
> sock: r=-512
> Feb 25 12:30:51 nodea kernel: drbd0: worker terminated
> Feb 25 12:30:51 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
> BrokenPipe --> Unconnected
> Feb 25 12:30:51 nodea kernel: drbd0: Connection lost.
> Feb 25 12:30:51 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
> Unconnected --> WFConnection
> Feb 25 12:30:51 nodea kernel: drbd1: PingAck did not arrive in time.
> Feb 25 12:30:51 nodea kernel: drbd1: drbd1_asender [7436]: cstate
> Connected --> NetworkFailure
> Feb 25 12:30:51 nodea kernel: drbd1: asender terminated
> Feb 25 12:30:51 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> NetworkFailure --> BrokenPipe
> Feb 25 12:30:51 nodea kernel: drbd1: short read receiving data block:
> read 3900 expected 4096
> Feb 25 12:30:51 nodea kernel: drbd1: error receiving Data, l: 4112!
> Feb 25 12:30:51 nodea kernel: drbd1: worker terminated
> Feb 25 12:30:51 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> BrokenPipe --> Unconnected
> Feb 25 12:30:51 nodea kernel: drbd1: Connection lost.
> Feb 25 12:30:51 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> Unconnected --> WFConnection
> Feb 25 12:31:03 nodea heartbeat[878]: WARN: node nodeb: is dead
> Feb 25 12:31:03 nodea heartbeat[878]: info: Dead node nodeb gave up
> resources.
> Feb 25 12:31:03 nodea heartbeat[878]: info: Link nodeb:eth1 dead.
> Feb 25 12:31:04 nodea ipfail[887]: info: Link Status update: Link
> nodeb/eth1 now has status dead
> Feb 25 12:31:04 nodea ipfail[887]: debug: Found ping node 172.18.91.254!
> Feb 25 12:31:04 nodea ipfail[887]: info: Asking other side for ping node
> count.
> Feb 25 12:31:04 nodea ipfail[887]: debug: Message [num_ping] sent.
> Feb 25 12:31:04 nodea ipfail[887]: info: Checking remote count of ping
> nodes.
So far, so good. All components on nodea have detected that nodeb is
gone.
Is there really _nothing_ in the logs on nodea between 12:31:04 and
12:38:00? There should be some messages about STONITH being attempted,
if configured.
(Oh, BTW, this is exactly the kind of scenario you need STONITH for. It
is _NOT_ optional.)
> Feb 25 12:38:00 nodea heartbeat[878]: WARN: Cluster node nodeb returning
> after partition.
> Feb 25 12:38:00 nodea heartbeat[878]: WARN: Deadtime value may be too small.
> Feb 25 12:38:00 nodea heartbeat[878]: info: See documentation for
> information on tuning deadtime.
> Feb 25 12:38:00 nodea heartbeat[878]: WARN: 436 lost packet(s) for
> [nodeb] [358470:358907]
> Feb 25 12:38:00 nodea heartbeat[878]: ERROR: lost a lot of packets!
> Feb 25 12:38:00 nodea heartbeat[878]: info: Link nodeb:eth1 up.
> Feb 25 12:38:00 nodea heartbeat[878]: WARN: Late heartbeat: Node nodeb:
> interval 433430 ms
Ouch! 433 seconds late!
What's going on on nodeb at this time?! That looks like it is having
massive troubles.
What now happens is that heartbeat tries to recover from the split brain
scenario by restarting itself on both sides, which is expected
behaviour.
> Feb 25 12:38:00 nodea heartbeat[878]: info: Status update for node
> nodeb: status active
> Feb 25 12:38:00 nodea ipfail[887]: info: Link Status update: Link
> nodeb/eth1 now has status up
> Feb 25 12:38:00 nodea ipfail[887]: info: Status update: Node nodeb now
> has status active
> Feb 25 12:38:00 nodea heartbeat[16016]: debug: notify_world: setting
> SIGCHLD Handler to SIG_DFL
> Feb 25 12:38:00 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
> Feb 25 12:38:01 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
> WFConnection --> WFReportParams
> Feb 25 12:38:01 nodea kernel: drbd0: Handshake successful: DRBD Network
> Protocol version 74
> Feb 25 12:38:01 nodea kernel: drbd0: incompatible states (both Primary!)
> Feb 25 12:38:01 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
> WFReportParams --> StandAlone
> Feb 25 12:38:01 nodea kernel: drbd0: error receiving ReportParams, l: 72!
> Feb 25 12:38:01 nodea kernel: drbd0: worker terminated
> Feb 25 12:38:01 nodea kernel: drbd0: asender terminated
> Feb 25 12:38:01 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
> StandAlone --> StandAlone
> Feb 25 12:38:01 nodea kernel: drbd0: Connection lost.
> Feb 25 12:38:01 nodea kernel: drbd0: receiver terminated
> Feb 25 12:38:02 nodea heartbeat[878]: WARN: Shutdown delayed until
> current resource activity finishes.
> Feb 25 12:38:02 nodea ipfail[887]: debug: Other side is unstable.
> Feb 25 12:38:04 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> WFConnection --> WFReportParams
> Feb 25 12:38:04 nodea kernel: drbd1: Handshake successful: DRBD Network
> Protocol version 74
> Feb 25 12:38:04 nodea kernel: drbd1: Connection established.
> Feb 25 12:38:04 nodea kernel: drbd1: I am(S):
> 1:00000002:00000001:0000002e:00000006:01
> Feb 25 12:38:04 nodea kernel: drbd1: Peer(P):
> 1:00000002:00000001:0000002f:00000006:10
> Feb 25 12:38:04 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> WFReportParams --> WFBitMapT
> Feb 25 12:38:04 nodea kernel: drbd1: Secondary/Unknown --> Secondary/Primary
> Feb 25 12:38:05 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> WFBitMapT --> SyncTarget
> Feb 25 12:38:05 nodea kernel: drbd1: Resync started as SyncTarget (need
> to sync 60488 KB [15122 bits set]).
> Feb 25 12:38:07 nodea kernel: drbd1: Resync done (total 2 sec; paused 0
> sec; 30244 K/sec)
> Feb 25 12:38:07 nodea kernel: drbd1: drbd1_worker [12364]: cstate
> SyncTarget --> Connected
Okay, drbd was syncing and receiving data in secondary state. So far, so
good.
> Feb 25 12:38:32 nodea heartbeat[878]: WARN: node nodeb: is dead
> Feb 25 12:38:32 nodea heartbeat[878]: info: Dead node nodeb gave up
> resources.
> Feb 25 12:38:32 nodea heartbeat[878]: info: Link nodeb:eth1 dead.
> Feb 25 12:38:33 nodea kernel: drbd1: PingAck did not arrive in time.
> Feb 25 12:38:33 nodea kernel: drbd1: drbd1_asender [16022]: cstate
> Connected --> NetworkFailure
> Feb 25 12:38:33 nodea kernel: drbd1: asender terminated
> Feb 25 12:38:33 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> NetworkFailure --> BrokenPipe
> Feb 25 12:38:33 nodea kernel: drbd1: short read expecting header on
> sock: r=-512
> Feb 25 12:38:33 nodea kernel: drbd1: worker terminated
> Feb 25 12:38:33 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> BrokenPipe --> Unconnected
> Feb 25 12:38:33 nodea kernel: drbd1: Connection lost.
> Feb 25 12:38:33 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> Unconnected --> WFConnection
> Feb 25 12:38:33 nodea ipfail[887]: info: Link Status update: Link
> nodeb/eth1 now has status dead
Oh no! 30 seconds later still in the resync, and nodeb dies again. This
isn't good.
> Feb 25 12:38:33 nodea ipfail[887]: debug: Found ping node 172.18.91.254!
> Feb 25 12:38:33 nodea ipfail[887]: info: Asking other side for ping node
> count.
> Feb 25 12:38:33 nodea ipfail[887]: debug: Message [num_ping] sent.
> Feb 25 12:38:33 nodea ipfail[887]: info: Checking remote count of ping
> nodes.
> Feb 25 12:41:09 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> WFConnection --> WFReportParams
> Feb 25 12:41:09 nodea kernel: drbd1: Handshake successful: DRBD Network
> Protocol version 74
> Feb 25 12:41:09 nodea kernel: drbd1: Connection established.
Okay, here drbd has noticed before heartbeat that nodeb has returned,
~35s after it dieing again. This isn't good at all, you're machines are
quite beyond help. At least from these logs it looks like nodeb got
serious issues...
> Feb 25 12:41:09 nodea kernel: drbd1: I am(S):
> 1:00000002:00000001:0000002f:00000006:01
> Feb 25 12:41:09 nodea kernel: drbd1: Peer(S):
> 1:00000002:00000001:0000002f:00000006:11
> Feb 25 12:41:09 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> WFReportParams --> WFBitMapT
> Feb 25 12:41:09 nodea kernel: drbd1: Secondary/Unknown -->
> Secondary/Secondary
> Feb 25 12:41:09 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
> WFBitMapT --> SyncTarget
> Feb 25 12:41:09 nodea kernel: drbd1: Resync started as SyncTarget (need
> to sync 520192 KB [130048 bits set]).
> Feb 25 12:41:23 nodea kernel: drbd1: Resync done (total 14 sec; paused 0
> sec; 37156 K/sec)
> Feb 25 12:41:23 nodea kernel: drbd1: drbd1_worker [16024]: cstate
> SyncTarget --> Connected
> Feb 25 12:42:15 nodea heartbeat[878]: info: Heartbeat restart on node nodeb
> Feb 25 12:42:15 nodea heartbeat[878]: info: Link nodeb:eth1 up.
> Feb 25 12:42:15 nodea heartbeat[878]: info: Status update for node
> nodeb: status up
> Feb 25 12:42:15 nodea ipfail[887]: info: Link Status update: Link
> nodeb/eth1 now has status up
> Feb 25 12:42:15 nodea ipfail[887]: info: Status update: Node nodeb now
> has status up
> Feb 25 12:42:15 nodea heartbeat[16027]: debug: notify_world: setting
> SIGCHLD Handler to SIG_DFL
> Feb 25 12:42:15 nodea heartbeat[878]: info: Status update for node
> nodeb: status active
> Feb 25 12:42:15 nodea heartbeat[878]: debug: StartNextRemoteRscReq():
> child count 1
> Feb 25 12:42:15 nodea ipfail[887]: info: Status update: Node nodeb now
> has status active
> Feb 25 12:42:15 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
> Feb 25 12:42:15 nodea ipfail[887]: debug: Other side is unstable.
> Feb 25 12:42:15 nodea heartbeat[878]: info: remote resource transition
> completed.
> Feb 25 12:42:15 nodea ipfail[887]: debug: Other side is now stable.
> Feb 25 12:42:15 nodea heartbeat[16031]: debug: notify_world: setting
> SIGCHLD Handler to SIG_DFL
> Feb 25 12:42:15 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
> Feb 25 12:42:15 nodea heartbeat[878]: info: Heartbeat shutdown in
> progress. (878)
Ok, heartbeat is now restarting on nodea too.
> Feb 25 12:42:15 nodea heartbeat[16035]: info: Giving up all HA resources.
> Feb 25 12:42:15 nodea ipfail[887]: debug: Got join message from another
> ipfail client. (nodeb)
> Feb 25 12:42:15 nodea ipfail[887]: debug: Found ping node 172.18.91.254!
> Feb 25 12:42:15 nodea ipfail[887]: info: Asking other side for ping node
> count.
> Feb 25 12:42:15 nodea ipfail[887]: debug: Message [num_ping] sent.
> Feb 25 12:42:15 nodea heartbeat: info: Releasing resource group: nodea
> drbddisk::drbd0
> Filesystem::/dev/drbd0::/home::xfs::rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota
> 172.18.88.102 smb
> Feb 25 12:42:15 nodea heartbeat: info: Running /etc/init.d/smb stop
> Feb 25 12:42:15 nodea heartbeat: debug: Starting /etc/init.d/smb stop
> Feb 25 12:42:15 nodea heartbeat: debug: /etc/init.d/smb stop done. RC=0
> Feb 25 12:42:15 nodea heartbeat: info: Running
> /etc/ha.d/resource.d/IPaddr 172.18.88.102 stop
> Feb 25 12:42:15 nodea heartbeat: debug: Starting
> /etc/ha.d/resource.d/IPaddr 172.18.88.102 stop
> Feb 25 12:42:16 nodea heartbeat: info: /sbin/route -n del -host
> 172.18.88.102
> Feb 25 12:42:16 nodea heartbeat: info: /sbin/ifconfig eth0:0 down
> Feb 25 12:42:16 nodea heartbeat: info: IP Address 172.18.88.102 released
> Feb 25 12:42:16 nodea heartbeat: debug: /etc/ha.d/resource.d/IPaddr
> 172.18.88.102 stop done. RC=0
> Feb 25 12:42:16 nodea heartbeat: info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs
> rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
> Feb 25 12:42:16 nodea heartbeat: debug: Starting
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs
> rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
> Feb 25 12:42:16 nodea heartbeat: ERROR: Couldn't unmount /home
> Feb 25 12:42:16 nodea heartbeat: debug: /etc/ha.d/resource.d/Filesystem
> /dev/drbd0 /home xfs
> rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop done. RC=1
> Feb 25 12:42:16 nodea heartbeat: ERROR: Return code 1 from
> /etc/ha.d/resource.d/Filesystem
It tried to stop all resources, but it can't unmount /home, whyever -
probably because something is still accessing it, maybe some kernel
thread has it still open?
> Feb 25 12:42:16 nodea ipfail[887]: info: No giveup timer to abort.
> Feb 25 12:42:17 nodea heartbeat: info: Retrying failed stop operation
> [Filesystem::/dev/drbd0::/home::xfs::rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota]
> Feb 25 12:42:17 nodea heartbeat: info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs
> rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
> Feb 25 12:42:17 nodea heartbeat: debug: Starting
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs
> rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
> Feb 25 12:42:17 nodea heartbeat: debug: /etc/ha.d/resource.d/Filesystem
> /dev/drbd0 /home xfs
> rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop done. RC=0
Ok, it worked on the second try. Good.
> Feb 25 12:42:17 nodea heartbeat: info: Running
> /etc/ha.d/resource.d/drbddisk drbd0 stop
> Feb 25 12:42:17 nodea heartbeat: debug: Starting
> /etc/ha.d/resource.d/drbddisk drbd0 stop
> Feb 25 12:42:17 nodea kernel: drbd0: Primary/Unknown --> Secondary/Unknown
> Feb 25 12:42:18 nodea heartbeat: debug: /etc/ha.d/resource.d/drbddisk
> drbd0 stop done. RC=0
> Feb 25 12:42:18 nodea heartbeat: info: Releasing resource group: nodeb
> drbddisk::drbd1
> Feb 25 12:42:18 nodea heartbeat: info: Running
> /etc/ha.d/resource.d/drbddisk drbd1 stop
> Feb 25 12:42:18 nodea heartbeat: debug: Starting
> /etc/ha.d/resource.d/drbddisk drbd1 stop
> Feb 25 12:42:18 nodea heartbeat: debug: /etc/ha.d/resource.d/drbddisk
> drbd1 stop done. RC=0
Ho-hum. drbd for _drbd1_ is stopped, but above you're mounting the
filesystem from /dev/drbd0. You sure your haresources is correct?
> Feb 25 12:42:18 nodea heartbeat[16035]: info: killing
> /usr/lib/heartbeat/ipfail process group 887 with signal 15
> Feb 25 12:42:18 nodea heartbeat[16035]: info: killing heartbeat resource
> child process group 16031 with signal 9
> Feb 25 12:42:18 nodea heartbeat[16035]: info: All HA resources relinquished.
> Feb 25 12:42:18 nodea heartbeat[878]: info: killing
> /usr/lib/heartbeat/ipfail process group 887 with signal 15
> Feb 25 12:42:18 nodea heartbeat[878]: info: EOF from client pid 887
> Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBFIFO process 880
> with signal 15
> Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBWRITE process 881
> with signal 15
> Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBREAD process 882
> with signal 15
> Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBWRITE process 883
> with signal 15
> Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBREAD process 884
> with signal 15
> Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBWRITE process 885
> with signal 15
> Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBREAD process 886
> with signal 15
> Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 880 exited. 7
> remaining
> Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 881 exited. 6
> remaining
> Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 882 exited. 5
> remaining
> Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 883 exited. 4
> remaining
> Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 884 exited. 3
> remaining
> Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 885 exited. 2
> remaining
> Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 886 exited. 1
> remaining
> Feb 25 12:42:19 nodea heartbeat[878]: info: Heartbeat shutdown complete.
> Feb 25 12:42:19 nodea heartbeat[878]: info: Heartbeat restart triggered.
> Feb 25 12:42:19 nodea heartbeat[878]: info: Restarting heartbeat.
> Feb 25 12:42:19 nodea heartbeat[878]: info: Performing heartbeat restart
> exec.
okay, here it's coming up again on nodea.
> Feb 25 12:42:20 nodea kernel: drbd1: Secondary/Secondary -->
> Secondary/Primary
(Remote side has activated the drbd device now.)
> Feb 25 12:42:36 nodea heartbeat[878]: info: **************************
> Feb 25 12:42:36 nodea heartbeat[878]: info: Configuration validated.
> Starting heartbeat 1.2.3
> Feb 25 12:42:36 nodea heartbeat[16280]: info: heartbeat: version 1.2.3
> Feb 25 12:42:36 nodea heartbeat[16280]: info: Heartbeat generation: 22
> Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: write socket
> priority set to IPTOS_LOWDELAY on eth0
> Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound send socket
> to device: eth0
> Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound receive
> socket to device: eth0
> Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: started on port 694
> interface eth0 to 172.18.88.93
> Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: write socket
> priority set to IPTOS_LOWDELAY on eth1
> Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound send socket
> to device: eth1
> Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound receive
> socket to device: eth1
> Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: started on port 694
> interface eth1 to 172.17.1.1
> Feb 25 12:42:36 nodea heartbeat[16280]: info: ping heartbeat started.
> Feb 25 12:42:36 nodea heartbeat[16280]: info: pid 16280 locked in memory.
> Feb 25 12:42:36 nodea heartbeat[16280]: info: Local status now set to: 'up'
> Feb 25 12:42:37 nodea heartbeat[16282]: info: pid 16282 locked in memory.
> Feb 25 12:42:37 nodea heartbeat[16286]: info: pid 16286 locked in memory.
> Feb 25 12:42:37 nodea heartbeat[16283]: info: pid 16283 locked in memory.
> Feb 25 12:42:37 nodea heartbeat[16280]: info: Link nodeb:eth1 up.
> Feb 25 12:42:37 nodea heartbeat[16280]: info: Status update for node
> nodeb: status active
> Feb 25 12:42:37 nodea heartbeat[16289]: debug: notify_world: setting
> SIGCHLD Handler to SIG_DFL
> Feb 25 12:42:37 nodea heartbeat[16284]: info: pid 16284 locked in memory.
> Feb 25 12:42:37 nodea heartbeat[16287]: info: pid 16287 locked in memory.
> Feb 25 12:42:37 nodea heartbeat[16288]: info: pid 16288 locked in memory.
> Feb 25 12:42:37 nodea heartbeat[16280]: info: Link
> 172.18.91.254:172.18.91.254 up.
> Feb 25 12:42:37 nodea heartbeat[16280]: info: Status update for node
> 172.18.91.254: status ping
> Feb 25 12:42:37 nodea heartbeat[16280]: info: Local status now set to:
> 'active'
> Feb 25 12:42:37 nodea heartbeat[16280]: info: Starting child client
> "/usr/lib/heartbeat/ipfail" (90,90)
> Feb 25 12:42:37 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
> Feb 25 12:42:37 nodea heartbeat[16285]: info: pid 16285 locked in memory.
> Feb 25 12:42:37 nodea heartbeat[16290]: info: Starting
> "/usr/lib/heartbeat/ipfail" as uid 90 gid 90 (pid 16290)
> Feb 25 12:42:37 nodea heartbeat[16280]: info: remote resource transition
> completed.
> Feb 25 12:42:37 nodea heartbeat[16280]: info: remote resource transition
> completed.
> Feb 25 12:42:37 nodea heartbeat[16280]: info: Local Resource acquisition
> completed. (none)
> Feb 25 12:42:37 nodea heartbeat[16280]: info: Initial resource
> acquisition complete (T_RESOURCES(them))
> Feb 25 12:42:37 nodea ipfail[16290]: debug: PID=16290
> Feb 25 12:42:37 nodea ipfail[16290]: debug: Signing in with heartbeat
> Feb 25 12:42:37 nodea ipfail[16290]: debug: [We are nodea]
> Feb 25 12:42:37 nodea ipfail[16290]: debug: auto_failback -> 0 (off)
> Feb 25 12:42:37 nodea ipfail[16290]: debug: Setting message filter mode
> Feb 25 12:42:37 nodea ipfail[16290]: debug: Starting node walk
> Feb 25 12:42:37 nodea ipfail[16290]: debug: Cluster node: 172.18.91.254:
> status: ping
> Feb 25 12:42:37 nodea ipfail[16290]: debug: Cluster node: nodeb: status:
> active
> Feb 25 12:42:37 nodea ipfail[16290]: debug: [They are nodeb]
> Feb 25 12:42:37 nodea ipfail[16290]: debug: Cluster node: nodea: status:
> active
> Feb 25 12:42:37 nodea ipfail[16290]: debug: Setting message signal
> Feb 25 12:42:37 nodea ipfail[16290]: debug: Waiting for messages...
> Feb 25 12:42:38 nodea ipfail[16290]: debug: Got asked for num_ping.
> Feb 25 12:42:38 nodea ipfail[16290]: debug: Found ping node 172.18.91.254!
> Feb 25 12:42:38 nodea ipfail[16290]: info: Ping node count is balanced.
> Feb 25 12:42:38 nodea ipfail[16290]: debug: Abort message sent.
> ########### nodea log stop ###########
Looks good so far. I don't see a reboot, just a restart, caused by nodeb
seriously misbehaving.
(Or nodea in some way which is beyond help. But I'd bet on nodeb being
the culprit.)
I think nodeb needs help, a lot of it. What's going on on that node, or
rather, the links between nodea and nodeb?
Oh, and you need a STONITH device with that kind of untrustworthy link,
if your data integrity is something you care for.
Sincerely,
Lars Marowsky-Brée <lmb at suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
More information about the Linux-HA
mailing list