[Linux-HA] Re: deadtime, warntime, and drbd

Jason Joines support at bus.okstate.edu
Mon Mar 7 13:03:47 MST 2005


Lars Marowsky-Bree wrote:

>It literally rebooted itself? Are you using the watchdog timer?
>
>Please provide the log messages of the node directly prior to the
>reboot.
>
>
>Sincerely,
>    Lars Marowsky-Brée <lmb at suse.de>
>
########### nodea log start ###########
Feb 25 12:30:51 nodea kernel: drbd0: PingAck did not arrive in time.
Feb 25 12:30:51 nodea kernel: drbd0: drbd0_asender [7323]: cstate 
Connected --> NetworkFailure
Feb 25 12:30:51 nodea kernel: drbd0: asender terminated
Feb 25 12:30:51 nodea kernel: drbd0: drbd0_receiver [7322]: cstate 
NetworkFailure --> BrokenPipe
Feb 25 12:30:51 nodea kernel: drbd0: short read expecting header on 
sock: r=-512
Feb 25 12:30:51 nodea kernel: drbd0: worker terminated
Feb 25 12:30:51 nodea kernel: drbd0: drbd0_receiver [7322]: cstate 
BrokenPipe --> Unconnected
Feb 25 12:30:51 nodea kernel: drbd0: Connection lost.
Feb 25 12:30:51 nodea kernel: drbd0: drbd0_receiver [7322]: cstate 
Unconnected --> WFConnection
Feb 25 12:30:51 nodea kernel: drbd1: PingAck did not arrive in time.
Feb 25 12:30:51 nodea kernel: drbd1: drbd1_asender [7436]: cstate 
Connected --> NetworkFailure
Feb 25 12:30:51 nodea kernel: drbd1: asender terminated
Feb 25 12:30:51 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
NetworkFailure --> BrokenPipe
Feb 25 12:30:51 nodea kernel: drbd1: short read receiving data block: 
read 3900 expected 4096
Feb 25 12:30:51 nodea kernel: drbd1: error receiving Data, l: 4112!
Feb 25 12:30:51 nodea kernel: drbd1: worker terminated
Feb 25 12:30:51 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
BrokenPipe --> Unconnected
Feb 25 12:30:51 nodea kernel: drbd1: Connection lost.
Feb 25 12:30:51 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
Unconnected --> WFConnection
Feb 25 12:31:03 nodea heartbeat[878]: WARN: node nodeb: is dead
Feb 25 12:31:03 nodea heartbeat[878]: info: Dead node nodeb gave up 
resources.
Feb 25 12:31:03 nodea heartbeat[878]: info: Link nodeb:eth1 dead.
Feb 25 12:31:04 nodea ipfail[887]: info: Link Status update: Link 
nodeb/eth1 now has status dead
Feb 25 12:31:04 nodea ipfail[887]: debug: Found ping node 172.18.91.254!
Feb 25 12:31:04 nodea ipfail[887]: info: Asking other side for ping node 
count.
Feb 25 12:31:04 nodea ipfail[887]: debug: Message [num_ping] sent.
Feb 25 12:31:04 nodea ipfail[887]: info: Checking remote count of ping 
nodes.
Feb 25 12:38:00 nodea heartbeat[878]: WARN: Cluster node nodeb returning 
after partition.
Feb 25 12:38:00 nodea heartbeat[878]: WARN: Deadtime value may be too small.
Feb 25 12:38:00 nodea heartbeat[878]: info: See documentation for 
information on tuning deadtime.
Feb 25 12:38:00 nodea heartbeat[878]: WARN: 436 lost packet(s) for 
[nodeb] [358470:358907]
Feb 25 12:38:00 nodea heartbeat[878]: ERROR: lost a lot of packets!
Feb 25 12:38:00 nodea heartbeat[878]: info: Link nodeb:eth1 up.
Feb 25 12:38:00 nodea heartbeat[878]: WARN: Late heartbeat: Node nodeb: 
interval 433430 ms
Feb 25 12:38:00 nodea heartbeat[878]: info: Status update for node 
nodeb: status active
Feb 25 12:38:00 nodea ipfail[887]: info: Link Status update: Link 
nodeb/eth1 now has status up
Feb 25 12:38:00 nodea ipfail[887]: info: Status update: Node nodeb now 
has status active
Feb 25 12:38:00 nodea heartbeat[16016]: debug: notify_world: setting 
SIGCHLD Handler to SIG_DFL
Feb 25 12:38:00 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
Feb 25 12:38:01 nodea kernel: drbd0: drbd0_receiver [7322]: cstate 
WFConnection --> WFReportParams
Feb 25 12:38:01 nodea kernel: drbd0: Handshake successful: DRBD Network 
Protocol version 74
Feb 25 12:38:01 nodea kernel: drbd0: incompatible states (both Primary!)
Feb 25 12:38:01 nodea kernel: drbd0: drbd0_receiver [7322]: cstate 
WFReportParams --> StandAlone
Feb 25 12:38:01 nodea kernel: drbd0: error receiving ReportParams, l: 72!
Feb 25 12:38:01 nodea kernel: drbd0: worker terminated
Feb 25 12:38:01 nodea kernel: drbd0: asender terminated
Feb 25 12:38:01 nodea kernel: drbd0: drbd0_receiver [7322]: cstate 
StandAlone --> StandAlone
Feb 25 12:38:01 nodea kernel: drbd0: Connection lost.
Feb 25 12:38:01 nodea kernel: drbd0: receiver terminated
Feb 25 12:38:02 nodea heartbeat[878]: WARN: Shutdown delayed until 
current resource activity finishes.
Feb 25 12:38:02 nodea ipfail[887]: debug: Other side is unstable.
Feb 25 12:38:04 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
WFConnection --> WFReportParams
Feb 25 12:38:04 nodea kernel: drbd1: Handshake successful: DRBD Network 
Protocol version 74
Feb 25 12:38:04 nodea kernel: drbd1: Connection established.
Feb 25 12:38:04 nodea kernel: drbd1: I am(S): 
1:00000002:00000001:0000002e:00000006:01
Feb 25 12:38:04 nodea kernel: drbd1: Peer(P): 
1:00000002:00000001:0000002f:00000006:10
Feb 25 12:38:04 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
WFReportParams --> WFBitMapT
Feb 25 12:38:04 nodea kernel: drbd1: Secondary/Unknown --> Secondary/Primary
Feb 25 12:38:05 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
WFBitMapT --> SyncTarget
Feb 25 12:38:05 nodea kernel: drbd1: Resync started as SyncTarget (need 
to sync 60488 KB [15122 bits set]).
Feb 25 12:38:07 nodea kernel: drbd1: Resync done (total 2 sec; paused 0 
sec; 30244 K/sec)
Feb 25 12:38:07 nodea kernel: drbd1: drbd1_worker [12364]: cstate 
SyncTarget --> Connected
Feb 25 12:38:32 nodea heartbeat[878]: WARN: node nodeb: is dead
Feb 25 12:38:32 nodea heartbeat[878]: info: Dead node nodeb gave up 
resources.
Feb 25 12:38:32 nodea heartbeat[878]: info: Link nodeb:eth1 dead.
Feb 25 12:38:33 nodea kernel: drbd1: PingAck did not arrive in time.
Feb 25 12:38:33 nodea kernel: drbd1: drbd1_asender [16022]: cstate 
Connected --> NetworkFailure
Feb 25 12:38:33 nodea kernel: drbd1: asender terminated
Feb 25 12:38:33 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
NetworkFailure --> BrokenPipe
Feb 25 12:38:33 nodea kernel: drbd1: short read expecting header on 
sock: r=-512
Feb 25 12:38:33 nodea kernel: drbd1: worker terminated
Feb 25 12:38:33 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
BrokenPipe --> Unconnected
Feb 25 12:38:33 nodea kernel: drbd1: Connection lost.
Feb 25 12:38:33 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
Unconnected --> WFConnection
Feb 25 12:38:33 nodea ipfail[887]: info: Link Status update: Link 
nodeb/eth1 now has status dead
Feb 25 12:38:33 nodea ipfail[887]: debug: Found ping node 172.18.91.254!
Feb 25 12:38:33 nodea ipfail[887]: info: Asking other side for ping node 
count.
Feb 25 12:38:33 nodea ipfail[887]: debug: Message [num_ping] sent.
Feb 25 12:38:33 nodea ipfail[887]: info: Checking remote count of ping 
nodes.
Feb 25 12:41:09 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
WFConnection --> WFReportParams
Feb 25 12:41:09 nodea kernel: drbd1: Handshake successful: DRBD Network 
Protocol version 74
Feb 25 12:41:09 nodea kernel: drbd1: Connection established.
Feb 25 12:41:09 nodea kernel: drbd1: I am(S): 
1:00000002:00000001:0000002f:00000006:01
Feb 25 12:41:09 nodea kernel: drbd1: Peer(S): 
1:00000002:00000001:0000002f:00000006:11
Feb 25 12:41:09 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
WFReportParams --> WFBitMapT
Feb 25 12:41:09 nodea kernel: drbd1: Secondary/Unknown --> 
Secondary/Secondary
Feb 25 12:41:09 nodea kernel: drbd1: drbd1_receiver [7435]: cstate 
WFBitMapT --> SyncTarget
Feb 25 12:41:09 nodea kernel: drbd1: Resync started as SyncTarget (need 
to sync 520192 KB [130048 bits set]).
Feb 25 12:41:23 nodea kernel: drbd1: Resync done (total 14 sec; paused 0 
sec; 37156 K/sec)
Feb 25 12:41:23 nodea kernel: drbd1: drbd1_worker [16024]: cstate 
SyncTarget --> Connected
Feb 25 12:42:15 nodea heartbeat[878]: info: Heartbeat restart on node nodeb
Feb 25 12:42:15 nodea heartbeat[878]: info: Link nodeb:eth1 up.
Feb 25 12:42:15 nodea heartbeat[878]: info: Status update for node 
nodeb: status up
Feb 25 12:42:15 nodea ipfail[887]: info: Link Status update: Link 
nodeb/eth1 now has status up
Feb 25 12:42:15 nodea ipfail[887]: info: Status update: Node nodeb now 
has status up
Feb 25 12:42:15 nodea heartbeat[16027]: debug: notify_world: setting 
SIGCHLD Handler to SIG_DFL
Feb 25 12:42:15 nodea heartbeat[878]: info: Status update for node 
nodeb: status active
Feb 25 12:42:15 nodea heartbeat[878]: debug: StartNextRemoteRscReq(): 
child count 1
Feb 25 12:42:15 nodea ipfail[887]: info: Status update: Node nodeb now 
has status active
Feb 25 12:42:15 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
Feb 25 12:42:15 nodea ipfail[887]: debug: Other side is unstable.
Feb 25 12:42:15 nodea heartbeat[878]: info: remote resource transition 
completed.
Feb 25 12:42:15 nodea ipfail[887]: debug: Other side is now stable.
Feb 25 12:42:15 nodea heartbeat[16031]: debug: notify_world: setting 
SIGCHLD Handler to SIG_DFL
Feb 25 12:42:15 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
Feb 25 12:42:15 nodea heartbeat[878]: info: Heartbeat shutdown in 
progress. (878)
Feb 25 12:42:15 nodea heartbeat[16035]: info: Giving up all HA resources.
Feb 25 12:42:15 nodea ipfail[887]: debug: Got join message from another 
ipfail client. (nodeb)
Feb 25 12:42:15 nodea ipfail[887]: debug: Found ping node 172.18.91.254!
Feb 25 12:42:15 nodea ipfail[887]: info: Asking other side for ping node 
count.
Feb 25 12:42:15 nodea ipfail[887]: debug: Message [num_ping] sent.
Feb 25 12:42:15 nodea heartbeat: info: Releasing resource group: nodea 
drbddisk::drbd0 
Filesystem::/dev/drbd0::/home::xfs::rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota 
172.18.88.102 smb
Feb 25 12:42:15 nodea heartbeat: info: Running /etc/init.d/smb  stop
Feb 25 12:42:15 nodea heartbeat: debug: Starting /etc/init.d/smb  stop
Feb 25 12:42:15 nodea heartbeat: debug: /etc/init.d/smb  stop done. RC=0
Feb 25 12:42:15 nodea heartbeat: info: Running 
/etc/ha.d/resource.d/IPaddr 172.18.88.102 stop
Feb 25 12:42:15 nodea heartbeat: debug: Starting 
/etc/ha.d/resource.d/IPaddr 172.18.88.102 stop
Feb 25 12:42:16 nodea heartbeat: info: /sbin/route -n del -host 
172.18.88.102
Feb 25 12:42:16 nodea heartbeat: info: /sbin/ifconfig eth0:0 down
Feb 25 12:42:16 nodea heartbeat: info: IP Address 172.18.88.102 released
Feb 25 12:42:16 nodea heartbeat: debug: /etc/ha.d/resource.d/IPaddr 
172.18.88.102 stop done. RC=0
Feb 25 12:42:16 nodea heartbeat: info: Running 
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs 
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
Feb 25 12:42:16 nodea heartbeat: debug: Starting 
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs 
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
Feb 25 12:42:16 nodea heartbeat: ERROR: Couldn't unmount /home
Feb 25 12:42:16 nodea heartbeat: debug: /etc/ha.d/resource.d/Filesystem 
/dev/drbd0 /home xfs 
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop done. RC=1
Feb 25 12:42:16 nodea heartbeat: ERROR: Return code 1 from 
/etc/ha.d/resource.d/Filesystem
Feb 25 12:42:16 nodea ipfail[887]: info: No giveup timer to abort.
Feb 25 12:42:17 nodea heartbeat: info: Retrying failed stop operation 
[Filesystem::/dev/drbd0::/home::xfs::rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota]
Feb 25 12:42:17 nodea heartbeat: info: Running 
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs 
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
Feb 25 12:42:17 nodea heartbeat: debug: Starting 
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs 
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
Feb 25 12:42:17 nodea heartbeat: debug: /etc/ha.d/resource.d/Filesystem 
/dev/drbd0 /home xfs 
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop done. RC=0
Feb 25 12:42:17 nodea heartbeat: info: Running 
/etc/ha.d/resource.d/drbddisk drbd0 stop
Feb 25 12:42:17 nodea heartbeat: debug: Starting 
/etc/ha.d/resource.d/drbddisk drbd0 stop
Feb 25 12:42:17 nodea kernel: drbd0: Primary/Unknown --> Secondary/Unknown
Feb 25 12:42:18 nodea heartbeat: debug: /etc/ha.d/resource.d/drbddisk 
drbd0 stop done. RC=0
Feb 25 12:42:18 nodea heartbeat: info: Releasing resource group: nodeb 
drbddisk::drbd1
Feb 25 12:42:18 nodea heartbeat: info: Running 
/etc/ha.d/resource.d/drbddisk drbd1 stop
Feb 25 12:42:18 nodea heartbeat: debug: Starting 
/etc/ha.d/resource.d/drbddisk drbd1 stop
Feb 25 12:42:18 nodea heartbeat: debug: /etc/ha.d/resource.d/drbddisk 
drbd1 stop done. RC=0
Feb 25 12:42:18 nodea heartbeat[16035]: info: killing 
/usr/lib/heartbeat/ipfail process group 887 with signal 15
Feb 25 12:42:18 nodea heartbeat[16035]: info: killing heartbeat resource 
child process group 16031 with signal 9
Feb 25 12:42:18 nodea heartbeat[16035]: info: All HA resources relinquished.
Feb 25 12:42:18 nodea heartbeat[878]: info: killing 
/usr/lib/heartbeat/ipfail process group 887 with signal 15
Feb 25 12:42:18 nodea heartbeat[878]: info: EOF from client pid 887
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBFIFO process 880 
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBWRITE process 881 
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBREAD process 882 
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBWRITE process 883 
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBREAD process 884 
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBWRITE process 885 
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBREAD process 886 
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 880 exited. 7 
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 881 exited. 6 
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 882 exited. 5 
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 883 exited. 4 
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 884 exited. 3 
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 885 exited. 2 
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 886 exited. 1 
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Heartbeat shutdown complete.
Feb 25 12:42:19 nodea heartbeat[878]: info: Heartbeat restart triggered.
Feb 25 12:42:19 nodea heartbeat[878]: info: Restarting heartbeat.
Feb 25 12:42:19 nodea heartbeat[878]: info: Performing heartbeat restart 
exec.
Feb 25 12:42:20 nodea kernel: drbd1: Secondary/Secondary --> 
Secondary/Primary
Feb 25 12:42:36 nodea heartbeat[878]: info: **************************
Feb 25 12:42:36 nodea heartbeat[878]: info: Configuration validated. 
Starting heartbeat 1.2.3
Feb 25 12:42:36 nodea heartbeat[16280]: info: heartbeat: version 1.2.3
Feb 25 12:42:36 nodea heartbeat[16280]: info: Heartbeat generation: 22
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: write socket 
priority set to IPTOS_LOWDELAY on eth0
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound send socket 
to device: eth0
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound receive 
socket to device: eth0
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: started on port 694 
interface eth0 to 172.18.88.93
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: write socket 
priority set to IPTOS_LOWDELAY on eth1
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound send socket 
to device: eth1
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound receive 
socket to device: eth1
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: started on port 694 
interface eth1 to 172.17.1.1
Feb 25 12:42:36 nodea heartbeat[16280]: info: ping heartbeat started.
Feb 25 12:42:36 nodea heartbeat[16280]: info: pid 16280 locked in memory.
Feb 25 12:42:36 nodea heartbeat[16280]: info: Local status now set to: 'up'
Feb 25 12:42:37 nodea heartbeat[16282]: info: pid 16282 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16286]: info: pid 16286 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16283]: info: pid 16283 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Link nodeb:eth1 up.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Status update for node 
nodeb: status active
Feb 25 12:42:37 nodea heartbeat[16289]: debug: notify_world: setting 
SIGCHLD Handler to SIG_DFL
Feb 25 12:42:37 nodea heartbeat[16284]: info: pid 16284 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16287]: info: pid 16287 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16288]: info: pid 16288 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Link 
172.18.91.254:172.18.91.254 up.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Status update for node 
172.18.91.254: status ping
Feb 25 12:42:37 nodea heartbeat[16280]: info: Local status now set to: 
'active'
Feb 25 12:42:37 nodea heartbeat[16280]: info: Starting child client 
"/usr/lib/heartbeat/ipfail" (90,90)
Feb 25 12:42:37 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
Feb 25 12:42:37 nodea heartbeat[16285]: info: pid 16285 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16290]: info: Starting 
"/usr/lib/heartbeat/ipfail" as uid 90  gid 90 (pid 16290)
Feb 25 12:42:37 nodea heartbeat[16280]: info: remote resource transition 
completed.
Feb 25 12:42:37 nodea heartbeat[16280]: info: remote resource transition 
completed.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Local Resource acquisition 
completed. (none)
Feb 25 12:42:37 nodea heartbeat[16280]: info: Initial resource 
acquisition complete (T_RESOURCES(them))
Feb 25 12:42:37 nodea ipfail[16290]: debug: PID=16290
Feb 25 12:42:37 nodea ipfail[16290]: debug: Signing in with heartbeat
Feb 25 12:42:37 nodea ipfail[16290]: debug: [We are nodea]
Feb 25 12:42:37 nodea ipfail[16290]: debug: auto_failback -> 0 (off)
Feb 25 12:42:37 nodea ipfail[16290]: debug: Setting message filter mode
Feb 25 12:42:37 nodea ipfail[16290]: debug: Starting node walk
Feb 25 12:42:37 nodea ipfail[16290]: debug: Cluster node: 172.18.91.254: 
status: ping
Feb 25 12:42:37 nodea ipfail[16290]: debug: Cluster node: nodeb: status: 
active
Feb 25 12:42:37 nodea ipfail[16290]: debug: [They are nodeb]
Feb 25 12:42:37 nodea ipfail[16290]: debug: Cluster node: nodea: status: 
active
Feb 25 12:42:37 nodea ipfail[16290]: debug: Setting message signal
Feb 25 12:42:37 nodea ipfail[16290]: debug: Waiting for messages...
Feb 25 12:42:38 nodea ipfail[16290]: debug: Got asked for num_ping.
Feb 25 12:42:38 nodea ipfail[16290]: debug: Found ping node 172.18.91.254!
Feb 25 12:42:38 nodea ipfail[16290]: info: Ping node count is balanced.
Feb 25 12:42:38 nodea ipfail[16290]: debug: Abort message sent.
########### nodea log stop ###########


More information about the Linux-HA mailing list