[Linux-HA] HA2 OCF CRM: Manage multiple DRBD Resources

Dominik Klein dk at in-telegence.net
Wed Jul 4 07:04:36 MDT 2007


>> My drbd setup is working. I can manually set each node to be primary for 
>> each resource (while the other is secondary of course). When starting 
>> heartbeat, I make sure every drbd device is either Unconfigured/down or 
>> secondary.
> 
> Just don't start drbd at boot. If it's running anyway, heartbeat should
> probe and find out that the resources are running.

Okay. As you will see, this leads to other problems (which I solved), 
but it does not change my main problem.

So here's what I do:

"1:" means it's done on machine 1
"2:" means it's done on machine 2
"#:" means it's done on both machines
<...> is a comment by me

#: reboot
<wait> :)
#: ls /proc/drbd
ls: cannot access /proc/drbd: No such file or directory
<translated from German: the file does not exist>

<make sure we start off clean>
#: rm /var/lib/heartbeat/crm/*

#: /etc/init.d/heartbeat start
<wait again>
<crm_mon shows 2 online nodes, 0 resources>

1: cibadmin -U -x cib.xml (all target roles = stopped, no instance 
attributes for any node)
<crm_mon shows 2 online nodes, DC=acd-xen03, *4* resources>

1: crm_resource -r ms-r0 -v 'started' -p target_role
1: crm_resource -r fs0 -v 'started' -p target_role
#: cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by root@ACD-xen01, 2007-06-11 14:48:25
<this means: no resources configured>
<crm_mon shows r0 "started" for both nodes -> not good>

1: drbdadm state r0
Unknown/TOO_LARGE
<the OCF script needs to be changed to recognize this state, which 
appears (maybe only with the new drbd8) when just the module is loaded>
<done>
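
The extra state check could look roughly like this. This is only a sketch, 
not the actual change to the script: the function name classify_drbd_state 
and the labels it prints are made up for illustration. The only inputs 
taken from this thread are the strings Unknown/TOO_LARGE, Unconfigured, 
Primary and Secondary.

```shell
#!/bin/sh
# Hypothetical helper: map the output of "drbdadm state <res>" to a
# coarse label. The labels are illustrative, not the OCF script's own.
classify_drbd_state() {
    case "$1" in
        # State seen with drbd 8.0.3 when only the module is loaded
        Unknown/TOO_LARGE|Unconfigured*) echo unconfigured ;;
        Primary/*)                       echo primary ;;
        Secondary/*)                     echo secondary ;;
        *)                               echo unknown ;;
    esac
}

classify_drbd_state "Unknown/TOO_LARGE"     # prints: unconfigured
classify_drbd_state "Secondary/Secondary"   # prints: secondary
```

In a real agent the argument would come from the drbdadm call that the 
script already makes, and "unconfigured" would map to "not running".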

After changing and copying the script, I started over from the reboot 
and repeated the steps up to target_role=started for fs0.
<now crm_mon shows r0:0 on acd-xen03 is master>
<fs0 is mounted on acd-xen03>
<2 online nodes, *4* resources>

Now comes the strange thing:
1: crm_resource -r ms-r1 -v 'started' -p target_role
1: crm_resource -r fs1 -v 'started' -p target_role
<crm_mon shows 2 online nodes, DC still acd-xen03, but *5* resources (+1)>
<one would expect to see the same result as with r0, but:>
<crm_mon shows started on both, no master>
#: cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by root@ACD-xen01, 2007-06-11 14:48:25
...
  1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
         resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
         act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
<confirmed, no master>
<fs1 is not mounted>
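
The "no master" check can also be scripted instead of reading /proc/drbd 
by eye. A minimal sketch, assuming the 8.0-style status line shown above 
("<minor>: cs:... st:<local>/<peer> ..."); the function name 
drbd_local_role is hypothetical, and other drbd versions may format the 
line differently.

```shell
#!/bin/sh
# Hypothetical helper: print the local role of one DRBD minor.
# Reads /proc/drbd-style text on stdin; exits non-zero if the minor
# does not appear (for example, when the resource is not configured).
drbd_local_role() {
    awk -v minor="$1" '
        $1 == (minor ":") {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^st:/) {
                    role = $i
                    sub(/^st:/, "", role)    # "Secondary/Secondary"
                    split(role, parts, "/")  # parts[1] is the local side
                    print parts[1]
                    found = 1
                }
            }
        }
        END { if (!found) exit 1 }
    '
}

# Demo with the status line from above; prints: Secondary
drbd_local_role 1 <<'EOF'
version: 8.0.3 (api:86/proto:86)
 1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
EOF
```

On a live node the input would be /proc/drbd itself, as in 
"drbd_local_role 1 < /proc/drbd".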

<the logs suggest running crm_verify; I stripped the date/time values for readability>
1: crm_verify -LVVVV
info: log_data_element: create_fake_resource: Orphan resource 
<lrm_resource id="r1:1" type="drbd_master_slave" class="ocf" provider="dk">
info: log_data_element: create_fake_resource: Orphan resource 
<lrm_rsc_op id="r1:1_monitor_0" operation="monitor" 
crm-debug-origin="do_update_resource" 
transition_key="7:4:0e0997d6-d70e-4b9e-949b-0c98684f6259" 
transition_magic="0:7;7:4:0e0997d6-d70e-4b9e-949b-0c98684f6259" 
call_id="7" crm_feature_set="1.0.7" rc_code="7" op_status="0" 
interval="0" op_digest="58850437bf287086d1b41caade76bbf1"/>
info: log_data_element: create_fake_resource: Orphan resource 
</lrm_resource>
info: unpack_find_resource: Making sure orphan r1:1/r1:2 is stopped on 
acd-xen01
info: unpack_find_resource: Internally renamed r1:1 on acd-xen01 to r1:2
debug: unpack_rsc_order: r0_before_fs0: ms-r0.promote after fs0.start 
(symmetrical)
debug: unpack_rsc_order: r1_before_fs1: ms-r1.promote after fs1.start 
(symmetrical)
debug: cib_native_signoff: Signing out of the CIB Service

<r1:2 looks suspicious - no idea where this comes from>

>> I get one drbd+fs pair to run (the one using r1 in my config). But when 
>> I try to add another one (the one using r0 in my config), it does not 
>> promote the master and therefore does not mount the fs. The OCF script 
>> hangs and times out at "crm_master -v 75" and as you can see in the 
>> nodes section of the CIB, only the master value for r1 made it to the CIB.
> 
> The process "hangs"? How so? 

The notify action times out (20s).
Here is a log extract:
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7412]: DEBUG: r1 
notify: post for start - counts: active 0 - starting 2 - stopping 0
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7413]: DEBUG: DK 
drbd_start_phase_2 with param "no"
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7415]: DEBUG: r1: 
Calling /sbin/drbdadm -c /etc/drbd.conf state r1
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7419]: DEBUG: r1: 
Exit code 0
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7420]: DEBUG: r1: 
Command output: Secondary/Secondary
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7428]: DEBUG: r1: 
Calling /sbin/drbdadm -c /etc/drbd.conf cstate r1
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7432]: DEBUG: r1: 
Exit code 0
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7433]: DEBUG: r1: 
Command output: Connected
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7434]: DEBUG: r1 
status: Secondary/Secondary local: Secondary remote: Secondary 
connection: Connected
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7435]: DEBUG: DK 
before crm_master -v 75
Jul  4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7436]: DEBUG: r1: 
Calling /usr/sbin/crm_master -v 75
########### notice: +20s
Jul  4 14:46:19 ACD-xen03 lrmd: [7073]: WARN: on_op_timeout_expired: 
TIMEOUT: operation notify[15] on ocf::drbd_master_slave::r1:1 for client 
7076, its parameters: CRM_meta_op_target_rc=[7] 
CRM_meta_notify_operation=[start] CRM_meta_notify_start_resource=[r1:0 
r1:1 ] drbd_resource=[r1] CRM_meta_master_max=[1] CRM_meta_timeout=[200.
Jul  4 14:46:19 ACD-xen03 crmd: [7076]: ERROR: process_lrm_event: LRM 
operation r1:1_notify_0 (15) Timed Out (timeout=20000ms)
Jul  4 14:46:19 ACD-xen03 cib: [7072]: info: cib_diff_notify: Update 
(client: 7076, call:48): 0.1.135 -> 0.1.136 (ok)
Jul  4 14:46:19 ACD-xen03 tengine: [7084]: info: te_update_diff: 
Processing diff (cib_update): 0.1.135 -> 0.1.136
Jul  4 14:46:19 ACD-xen03 tengine: [7084]: info: match_graph_event: 
Action r1:1_post_notify_start_0 (79) confirmed on 
d4506030-b86e-4877-9984-72b7b39e29ca
Jul  4 14:46:19 ACD-xen03 cib: [7439]: info: write_cib_contents: Wrote 
version 0.1.136 of the CIB to disk (digest: 
ba84a2cd700f604ea7aee326cc06e1b6)
Jul  4 14:46:20 ACD-xen03 cib: [7072]: info: cib_diff_notify: Update 
(client: 7087, call:36): 0.1.136 -> 0.1.137 (ok)
Jul  4 14:46:20 ACD-xen03 tengine: [7084]: info: te_update_diff: 
Processing diff (cib_update): 0.1.136 -> 0.1.137
Jul  4 14:46:20 ACD-xen03 tengine: [7084]: info: match_graph_event: 
Action r1:0_post_notify_start_0 (76) confirmed on 
f6ffbaa8-9c5b-4da1-9e93-b50d227ba805
Jul  4 14:46:20 ACD-xen03 crmd: [7076]: info: do_state_transition: 
acd-xen03: State transition S_TRANSITION_ENGINE -> S_IDLE [ 
input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
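
To pull just the timed-out operations out of a longer log, something 
like the following could be used. It is a sketch only: the function name 
list_timed_out_ops is made up, and it assumes the unwrapped crmd line 
format from the extract above (this mail wraps the lines; the log file 
itself does not).

```shell
#!/bin/sh
# Hypothetical helper: read syslog text on stdin and print
# "<operation> <timeout in ms>" for every crmd "Timed Out" message.
list_timed_out_ops() {
    sed -n 's/.*LRM operation \([^ ]*\) ([0-9]*) Timed Out (timeout=\([0-9]*\)ms).*/\1 \2/p'
}

# Demo with the crmd line from above; prints: r1:1_notify_0 20000
list_timed_out_ops <<'EOF'
Jul  4 14:46:19 ACD-xen03 crmd: [7076]: ERROR: process_lrm_event: LRM operation r1:1_notify_0 (15) Timed Out (timeout=20000ms)
Jul  4 14:46:19 ACD-xen03 cib: [7072]: info: cib_diff_notify: Update (client: 7076, call:48): 0.1.135 -> 0.1.136 (ok)
EOF
```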


> Have you tried stracing the crm_master
> process?

No.

> I recall that I had some issues with drbd complaining about resources
> which mentioned nodes which weren't local; I worked around that by
> splitting drbd.conf into several parts and giving each drbd resource its
> own separate configfile using the drbdconf attribute.

Don't know if that would help.

> Well, here's hoping that this change of yours truly is the only one
> needed to fully support drbd8 ;-)

Well, the drbdadm commands issued from the script seem to be the same. As 
you have read earlier, I added some more status strings to look out for, 
but you are right, I do not know for sure whether this is all that needs 
to be changed.

Please note that this behaviour does not depend on which resource, r0 or 
r1, I start first. If I start out with r0, r0 works and r1 faults. If I 
start the other way around with r1, then r0 will fault.

Maybe you can still help me figure this out.

Regards
Dominik

