[Linux-HA] failure of resync of secondary after successful connection to primary - resulted in data loss

Alex Strachan astrachan at inter-systems.com.au
Fri Sep 21 04:01:27 MDT 2007


Hi All,

 

This error has resulted in data loss and I need to fully understand why and
hopefully stop it from happening again.  Any help in this would be warmly
received.

 

Thanks Alex

 

 

I have experienced a failure for DRBD to reconnect after a system forced
power off/on

due to a hung host.

 

DRBD 0.7.21

Hearbeat 2.1.2

 

DRBD log output

 

  Sep 20 17:58:06 sinfids3a1 drbd: Stopping all DRBD resources

  Sep 20 17:58:06 sinfids3a1 drbd: ioctl(,SET_STATE,) failed:

 

The host - on it's way down.

At this point the filesystem was still active as Heartbeat didn't stop
successfully.

 

Is there a way for drbd to handle this?

 

  Sep 20 17:58:06 sinfids3a1 drbd: Device or resource busy

  Sep 20 17:58:06 sinfids3a1 drbd: Someone has opened the device for RW
access!

  Sep 20 17:58:06 sinfids3a1 drbd: Command '/sbin/drbdsetup /dev/drbd0 down'
terminated with exit code 20

  Sep 20 17:58:06 sinfids3a1 drbd: drbdsetup exited with code 20

  Sep 20 17:58:06 sinfids3a1 drbd: ERROR: Module drbd is in use

  Sep 20 17:58:06 sinfids3a1 drbd: .

  Sep 20 17:58:06 sinfids3a1 rc: Stopping drbd:  succeeded

 

 

The host - on it's way up.

 

  Sep 21 10:11:49 sinfids3a1 drbd: [

  Sep 21 10:11:49 sinfids3a1 drbd: d0

  Sep 21 10:11:49 sinfids3a1 kernel: drbd0: Adjusting my ra_pages to backing
device's (32 -> 1024)

  Sep 21 10:11:49 sinfids3a1 kernel: klogd 1.4.1, ---------- state change
----------

  Sep 21 10:11:49 sinfids3a1 kernel: drbd0: resync bitmap: bits=43350764
words=677356

  Sep 21 10:11:49 sinfids3a1 kernel: drbd0: size = 165 GB (173403056 KB)

  Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy

  Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy

  Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy

  Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: 180 MB marked out-of-sync by on
disk bit-map.

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Found 6 transactions (324 active
extents) in activity log.

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Marked additional 983 MB as
out-of-sync based on AL.

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5217]: cstate
Unconfigured --> StandAlone

  Sep 21 10:11:51 sinfids3a1 drbd: s0

  Sep 21 10:11:51 sinfids3a1 drbd: n0

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5318]: cstate
StandAlone --> Unconnected

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
Unconnected --> WFConnection

  Sep 21 10:11:51 sinfids3a1 drbd: ]

  Sep 21 10:11:51 sinfids3a1 drbd: .

  Sep 21 10:11:51 sinfids3a1 drbd: WARN: stdin/stdout is not a TTY; using
/dev/console

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: using degr_wfc_timeout=10
seconds

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
WFConnection --> WFReportParams

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Handshake successful: DRBD
Network Protocol version 74

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection established.

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: I am(S):
1:00000002:00000001:000000c5:00000033:10

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Peer(P):
1:00000002:00000001:000000c3:00000034:10

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
WFReportParams --> WFBitMapS

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock_sendmsg returned -32

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
WFBitMapS --> BrokenPipe

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short sent ReportBitMap
size=4096 sent=0

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Secondary/Unknown -->
Secondary/Primary  

 

It has detected it's partner but then breaks the connection.  This has
resulted in serious data loss.

 

Is there a way to prevent this?

 

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: meta connection shut down by
peer.

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: asender terminated

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock was shut down by peer

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
BrokenPipe --> BrokenPipe

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short read expecting header on
sock: r=0

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: worker terminated

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
BrokenPipe --> Unconnected

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection lost.

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
Unconnected --> WFConnection

  Sep 21 10:12:01 sinfids3a1 rc: Starting drbd:  succeeded

 

 

 

 

 

 

The scenario:

 

sinfids3a1 running as master

host sinfids3a1 hung

reboot sinfids3a1  

sinfids3a2 running as master

updated filesystem (nagios config)

powered off/on sinfids3a1

[.....]

failed system over to sinfids3a1

updates to the filesystem were LOST !!!

 

 

 

  Sep 20 17:56:50 sinfids3a1 rc: Stopping cmanic:  succeeded

  Sep 20 17:56:50 sinfids3a1 heartbeat: Stopping High-Availability services:

  Sep 20 17:56:50 sinfids3a1 heartbeat: [5398]: info: killing
/usr/lib64/heartbeat/mgmtd -v process group 5533 with signal 15

  Sep 20 17:56:50 sinfids3a1 mgmtd: [5533]: info: mgmtd is shutting down

  Sep 20 17:56:50 sinfids3a1 heartbeat: [5398]: info: killing
/usr/lib64/heartbeat/crmd process group 5532 with signal 15

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: info: crm_shutdown: Requesting
shutdown

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: info: do_shutdown_req: Sending
shutdown request to DC: <null>

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: info: do_shutdown_req: Processing
shutdown locally

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: ERROR: send_ha_message: Not
connected to Heartbeat

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: crm_log_message_adv:
#========= HA[outbound] message start ==========#

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG: Dumping message with
10 fields

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[0] :
[origin=do_election_vote]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[1] : [t=crmd]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[2] : [version=1.0.9]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[3] : [subt=request]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[4] :
[reference=vote-crmd-1190282210-138]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[5] : [crm_task=vote]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[6] : [crm_sys_to=crmd]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[7] :
[crm_sys_from=crmd]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[8] :
[election-owner=3e940b7a-0e41-4df4-9279-71205d5038e8]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[9] : [election-id=15]

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: send_msg_via_ha: Sending
broadcast HA message (ref=vote-crmd-1190282210-138) to crmd@<all> failed.

  Sep 20 17:56:50 sinfids3a1 crmd: [5532]: info: handle_shutdown_request:
Creating shutdown request for sinfids3a1

  Sep 20 17:56:50 sinfids3a1 tengine: [29019]: info: extract_event: Aborting
on shutdown attribute for 3e940b7a-0e41-4df4-9279-71205d5038e8

  Sep 20 17:57:26 sinfids3a1 sinfids3a_apache[6867]: [6959]: INFO: 17:57:26
URL:http://0.0.0.0:80/server-status [2765/2765] -> "-" [1]

  Sep 20 17:57:27 sinfids3a1 lrmd: [5529]: WARN: G_SIG_dispatch: Dispatch
function for SIGCHLD was delayed 1000 ms (> 100 ms) before being called
(GSource: 0x51bdc8)

  Sep 20 17:57:27 sinfids3a1 lrmd: [5529]: info: G_SIG_dispatch: started at
455233814 should have started at 455233714

  Sep 20 17:57:31 sinfids3a1 tengine: [29019]: info: process_graph_event:
Action resource_sinfids3a_fids_proxy_monitor_360000 initiated by a different
transitioner

  Sep 20 17:57:31 sinfids3a1 tengine: [29019]: WARN: update_failcount:
Updating failcount for resource_sinfids3a_fids_proxy on
8cc209ea-9be8-48a7-a1a1-3fdc7b7a64d4 after failed monitor: rc=7

  Sep 20 17:57:32 sinfids3a1 tengine: [29019]: info: extract_event: Aborting
on transient_attributes changes for 8cc209ea-9be8-48a7-a1a1-3fdc7b7a64d4

  Sep 20 17:57:32 sinfids3a1 tengine: [29019]: info: te_update_diff:
Aborting on transient_attributes deletions

  Sep 20 17:57:33 sinfids3a1 tengine: [29019]: info: extract_event: Aborting
on transient_attributes changes for 8cc209ea-9be8-48a7-a1a1-3fdc7b7a64d4

  Sep 20 17:57:33 sinfids3a1 tengine: [29019]: info: te_update_diff:
Aborting on transient_attributes deletions

  Sep 20 17:57:50 sinfids3a1 rc: Stopping heartbeat:  failed

 

 

  Sep 20 17:58:06 sinfids3a1 drbd: Stopping all DRBD resources

  Sep 20 17:58:06 sinfids3a1 drbd: ioctl(,SET_STATE,) failed:

  Sep 20 17:58:06 sinfids3a1 drbd: Device or resource busy

  Sep 20 17:58:06 sinfids3a1 drbd: Someone has opened the device for RW
access!

  Sep 20 17:58:06 sinfids3a1 drbd: Command '/sbin/drbdsetup /dev/drbd0 down'
terminated with exit code 20

  Sep 20 17:58:06 sinfids3a1 drbd: drbdsetup exited with code 20

  Sep 20 17:58:06 sinfids3a1 drbd: ERROR: Module drbd is in use

  Sep 20 17:58:06 sinfids3a1 drbd: .

  Sep 20 17:58:06 sinfids3a1 rc: Stopping drbd:  succeeded

 

 

  Sep 21 10:11:49 sinfids3a1 drbd: [

  Sep 21 10:11:49 sinfids3a1 drbd: d0

  Sep 21 10:11:49 sinfids3a1 kernel: drbd0: Adjusting my ra_pages to backing
device's (32 -> 1024)

  Sep 21 10:11:49 sinfids3a1 kernel: klogd 1.4.1, ---------- state change
----------

  Sep 21 10:11:49 sinfids3a1 kernel: drbd0: resync bitmap: bits=43350764
words=677356

  Sep 21 10:11:49 sinfids3a1 kernel: drbd0: size = 165 GB (173403056 KB)

  Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy

  Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy

  Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy

  Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: 180 MB marked out-of-sync by on
disk bit-map.

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Found 6 transactions (324 active
extents) in activity log.

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Marked additional 983 MB as
out-of-sync based on AL.

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5217]: cstate
Unconfigured --> StandAlone

  Sep 21 10:11:51 sinfids3a1 drbd: s0

  Sep 21 10:11:51 sinfids3a1 drbd: n0

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5318]: cstate
StandAlone --> Unconnected

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
Unconnected --> WFConnection

  Sep 21 10:11:51 sinfids3a1 drbd: ]

  Sep 21 10:11:51 sinfids3a1 drbd: .

  Sep 21 10:11:51 sinfids3a1 drbd: WARN: stdin/stdout is not a TTY; using
/dev/console

  Sep 21 10:11:51 sinfids3a1 kernel: drbd0: using degr_wfc_timeout=10
seconds

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
WFConnection --> WFReportParams

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Handshake successful: DRBD
Network Protocol version 74

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection established.

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: I am(S):
1:00000002:00000001:000000c5:00000033:10

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Peer(P):
1:00000002:00000001:000000c3:00000034:10

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
WFReportParams --> WFBitMapS

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock_sendmsg returned -32

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
WFBitMapS --> BrokenPipe

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short sent ReportBitMap
size=4096 sent=0

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Secondary/Unknown -->
Secondary/Primary  

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: meta connection shut down by
peer.

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: asender terminated

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock was shut down by peer

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
BrokenPipe --> BrokenPipe

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short read expecting header on
sock: r=0

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: worker terminated

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
BrokenPipe --> Unconnected

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection lost.

  Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate
Unconnected --> WFConnection

  Sep 21 10:12:01 sinfids3a1 rc: Starting drbd:  succeeded

 

  Sep 21 10:22:55 sinfids3a1 kernel: drbd0: Secondary/Unknown -->
Primary/Unknown

  Sep 21 10:22:57 sinfids3a1 sinfids3a_Filesystem[7631]: [7643]: INFO:
Running start for /dev/drbd0 on /export

 

 

  Sep 21 10:10:57 sinfids3a2 kernel: bcm5700: eth1 NIC Link is Down

  Sep 21 10:11:00 sinfids3a2 kernel: bcm5700: eth1 NIC Link is Up, 1000 Mbps
full duplex, receive & transmit flow control ON

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: drbd0_receiver [5318]: cstate
WFConnection --> WFReportParams

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Handshake successful: DRBD
Network Protocol version 74

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Connection established.

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: I am(P):
1:00000002:00000001:000000c3:00000034:10

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Peer(S):
1:00000002:00000001:000000c5:00000033:10

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Current Primary shall become
sync TARGET! Aborting to prevent data corruption.

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: drbd0_receiver [5318]: cstate
WFReportParams --> StandAlone

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: error receiving ReportParams, l:
72!

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: asender terminated

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: worker terminated

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: drbd0_receiver [5318]: cstate
StandAlone --> StandAlone

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Connection lost.

  Sep 21 10:11:53 sinfids3a2 kernel: drbd0: receiver terminated

 

 

 




More information about the Linux-HA mailing list