[Linux-HA] HA2 OCF CRM: Manage multiple DRBD Resources
Dominik Klein
dk at in-telegence.net
Wed Jul 4 07:04:36 MDT 2007
>> My drbd setup is working. I can manually set each node to be primary for
>> each resource (while the other is secondary of course). When starting
>> heartbeat, I make sure every drbd device is either Unconfigured/down or
>> secondary.
>
> Just don't start drbd at boot. If it's running anyway, heartbeat should
> probe and find out that the resources are running.
Okay. As you will see, this leads to other problems (which I solved),
but it does not change my main problem.
So here's what I do:
"1:" means it's done on machine 1
"2:" means it's done on machine 2
"#:" means it's done on both machines
<...> is a comment by me
#: reboot
<wait> :)
#: ls /proc/drbd
ls: cannot access /proc/drbd: No such file or directory
<translated from German: the file does not exist>
<make sure we start off clean>
#: rm /var/lib/heartbeat/crm/*
#: /etc/init.d/heartbeat start
<wait again>
<crm_mon shows 2 online nodes, 0 resources>
1: cibadmin -U -x cib.xml (all target roles = stopped, no instance
attributes for any node)
<crm_mon shows 2 online nodes, DC=acd-xen03, *4* resources>
1: crm_resource -r ms-r0 -v 'started' -p target_role
1: crm_resource -r fs0 -v 'started' -p target_role
#: cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by root at ACD-xen01, 2007-06-11 14:48:25
<this means: no resources configured>
<crm_mon shows r0 "started" for both nodes -> not good>
1: drbdadm state r0
Unknown/TOO_LARGE
<the OCF script needs to be changed to recognize this state (possibly new
in drbd8), which is reported when only the module has been loaded>
<done>
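A minimal sketch of what such a check could look like. This is not the actual change made to the script: the function name classify_drbd_state and the status words it prints are made up for illustration, and treating "Unknown/TOO_LARGE" as "not configured" is an assumption based on the output above.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the real drbd OCF script.
# Takes the text printed by "drbdadm state <resource>" as its first
# argument and prints a coarse status word.
classify_drbd_state() {
    state="$1"

    # States that are assumed to mean "resource not configured".
    # "Unknown/TOO_LARGE" is what drbd 8.0.3 printed above when only
    # the module was loaded.
    case "$state" in
        Unknown/TOO_LARGE|Unconfigured|"")
            echo "stopped"
            return 0
            ;;
    esac

    # Otherwise look at the local role, i.e. the text before the "/".
    case "${state%%/*}" in
        Primary)   echo "master" ;;
        Secondary) echo "slave" ;;
        *)         echo "unknown" ;;
    esac
}
```

For example, classify_drbd_state "$(drbdadm state r0)" would print "stopped" for the output shown above.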
After changing and copying the script, I started over from the reboot
and repeated every step up to target_role=started for fs0.
<now crm_mon shows r0:0 on acd-xen03 is master>
<fs0 is mounted on acd-xen03>
<2 online nodes, *4* resources>
Now comes the strange thing:
1: crm_resource -r ms-r1 -v 'started' -p target_role
1: crm_resource -r fs1 -v 'started' -p target_role
<crm_mon shows 2 online nodes, DC still acd-xen03, but *5* resources (+1)>
<one would expect to see the same result as with r0, but:>
<crm_mon shows started on both, no master>
#: cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by root at ACD-xen01, 2007-06-11 14:48:25
...
1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
<confirmed, no master>
<fs1 is not mounted>
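The same check can be done without reading /proc/drbd by eye. The sketch below is only an illustration: the function name drbd_roles is made up, and it assumes the drbd 8.0 line format shown above, where each device line starts with the minor number followed by "cs:" and carries the roles in an "st:" field.

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/drbd-style text on stdin and prints
# "<minor> <local role>/<peer role>" for every device line.
drbd_roles() {
    awk '/^ *[0-9]+: cs:/ {
        minor = $1
        sub(/:$/, "", minor)          # "1:" -> "1"
        for (i = 2; i <= NF; i++)
            if ($i ~ /^st:/)
                print minor, substr($i, 4)   # drop the "st:" prefix
    }'
}
```

Usage would be "drbd_roles < /proc/drbd"; a resource without a master shows up with "Secondary/Secondary".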
<the logs suggest running crm_verify; I stripped the date/time values for readability>
1: crm_verify -LVVVV
info: log_data_element: create_fake_resource: Orphan resource
<lrm_resource id="r1:1" type="drbd_master_slave" class="ocf" provider="dk">
info: log_data_element: create_fake_resource: Orphan resource
<lrm_rsc_op id="r1:1_monitor_0" operation="monitor"
crm-debug-origin="do_update_resource"
transition_key="7:4:0e0997d6-d70e-4b9e-949b-0c98684f6259"
transition_magic="0:7;7:4:0e0997d6-d70e-4b9e-949b-0c98684f6259"
call_id="7" crm_feature_set="1.0.7" rc_code="7" op_status="0"
interval="0" op_digest="58850437bf287086d1b41caade76bbf1"/>
info: log_data_element: create_fake_resource: Orphan resource
</lrm_resource>
info: unpack_find_resource: Making sure orphan r1:1/r1:2 is stopped on
acd-xen01
info: unpack_find_resource: Internally renamed r1:1 on acd-xen01 to r1:2
debug: unpack_rsc_order: r0_before_fs0: ms-r0.promote after fs0.start
(symmetrical)
debug: unpack_rsc_order: r1_before_fs1: ms-r1.promote after fs1.start
(symmetrical)
debug: cib_native_signoff: Signing out of the CIB Service
<r1:2 looks suspicious - no idea where this comes from>
>> I get one drbd+fs pair to run (the one using r1 in my config). But when
>> I try to add another one (the one using r0 in my config), it does not
>> promote the master and therefore does not mount the fs. The OCF script
>> hangs and times out at "crm_master -v 75" and as you can see in the
>> nodes section of the CIB, only the master value for r1 made it to the CIB.
>
> The process "hangs"? How so?
The notify action times out (20s).
Here is a log extract:
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7412]: DEBUG: r1
notify: post for start - counts: active 0 - starting 2 - stopping 0
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7413]: DEBUG: DK
drbd_start_phase_2 with param "no"
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7415]: DEBUG: r1:
Calling /sbin/drbdadm -c /etc/drbd.conf state r1
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7419]: DEBUG: r1:
Exit code 0
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7420]: DEBUG: r1:
Command output: Secondary/Secondary
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7428]: DEBUG: r1:
Calling /sbin/drbdadm -c /etc/drbd.conf cstate r1
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7432]: DEBUG: r1:
Exit code 0
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7433]: DEBUG: r1:
Command output: Connected
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7434]: DEBUG: r1
status: Secondary/Secondary local: Secondary remote: Secondary
connection: Connected
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7435]: DEBUG: DK
before crm_master -v 75
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7436]: DEBUG: r1:
Calling /usr/sbin/crm_master -v 75
########### notice: +20s
Jul 4 14:46:19 ACD-xen03 lrmd: [7073]: WARN: on_op_timeout_expired:
TIMEOUT: operation notify[15] on ocf::drbd_master_slave::r1:1 for client
7076, its parameters: CRM_meta_op_target_rc=[7]
CRM_meta_notify_operation=[start] CRM_meta_notify_start_resource=[r1:0
r1:1 ] drbd_resource=[r1] CRM_meta_master_max=[1] CRM_meta_timeout=[200.
Jul 4 14:46:19 ACD-xen03 crmd: [7076]: ERROR: process_lrm_event: LRM
operation r1:1_notify_0 (15) Timed Out (timeout=20000ms)
Jul 4 14:46:19 ACD-xen03 cib: [7072]: info: cib_diff_notify: Update
(client: 7076, call:48): 0.1.135 -> 0.1.136 (ok)
Jul 4 14:46:19 ACD-xen03 tengine: [7084]: info: te_update_diff:
Processing diff (cib_update): 0.1.135 -> 0.1.136
Jul 4 14:46:19 ACD-xen03 tengine: [7084]: info: match_graph_event:
Action r1:1_post_notify_start_0 (79) confirmed on
d4506030-b86e-4877-9984-72b7b39e29ca
Jul 4 14:46:19 ACD-xen03 cib: [7439]: info: write_cib_contents: Wrote
version 0.1.136 of the CIB to disk (digest:
ba84a2cd700f604ea7aee326cc06e1b6)
Jul 4 14:46:20 ACD-xen03 cib: [7072]: info: cib_diff_notify: Update
(client: 7087, call:36): 0.1.136 -> 0.1.137 (ok)
Jul 4 14:46:20 ACD-xen03 tengine: [7084]: info: te_update_diff:
Processing diff (cib_update): 0.1.136 -> 0.1.137
Jul 4 14:46:20 ACD-xen03 tengine: [7084]: info: match_graph_event:
Action r1:0_post_notify_start_0 (76) confirmed on
f6ffbaa8-9c5b-4da1-9e93-b50d227ba805
Jul 4 14:46:20 ACD-xen03 crmd: [7076]: info: do_state_transition:
acd-xen03: State transition S_TRANSITION_ENGINE -> S_IDLE [
input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
> Have you tried stracing the crm_master
> process?
No.
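Short of stracing, a wrapper along these lines could at least show how long a call blocks and whether it returns at all. It is only a sketch: the function name timed_call is made up, and it assumes the "timeout" utility from GNU coreutils is installed (it exits with code 124 when the time limit is hit).

```shell
#!/bin/sh
# Hypothetical debugging wrapper, not part of the drbd OCF script.
# Usage: timed_call <seconds> <command> [args...]
# Runs the command with a time limit and logs duration and exit code
# to stderr. Exit code 124 means the limit was reached.
timed_call() {
    limit="$1"
    shift
    start=$(date +%s)
    timeout "$limit" "$@"
    rc=$?
    end=$(date +%s)
    echo "DEBUG: '$*' took $((end - start))s, exit code $rc" >&2
    return "$rc"
}
```

Inside the script, the crm_master call could then be written as "timed_call 10 /usr/sbin/crm_master -v 75", with a limit below the 20s notify timeout so that the log line is written before lrmd gives up on the operation.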
> I recall that I had some issues with drbd complaining about resources
> which mentioned nodes which weren't local; I worked around that by
> splitting drbd.conf into several parts and giving each drbd resource its
> own separate configfile using the drbdconf attribute.
Don't know if that would help.
> Well, here's hoping that this change of yours truly is the only one
> needed to fully support drbd8 ;-)
Well, the drbdadm commands issued from the script seem to be the same. As
you have read earlier, I added some more status strings to look out for,
but you are right, I do not know for sure if this is all that needs to
be changed.
Please note that this behaviour is not dependent on my r0 or r1
resource. If I start out with r0, r0 works and r1 faults. If I start the
other way around with r1, then r0 will fault.
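To repeat the test in either order, the target_role steps from the walk-through can be generated instead of typed. The sketch below only prints the commands (a dry run); the function names are made up, and the naming scheme ms-rN / fsN is taken from the resources used above.

```shell
#!/bin/sh
# Hypothetical dry-run helpers: they print commands, they do not run them.

# Prints the two commands that start one drbd + filesystem pair.
start_pair_cmds() {
    ms="$1"
    fs="$2"
    echo "crm_resource -r $ms -v 'started' -p target_role"
    echo "crm_resource -r $fs -v 'started' -p target_role"
}

# Prints the commands for several drbd resources in the given order,
# e.g. "start_order_cmds r1 r0" starts the r1 pair first.
start_order_cmds() {
    for res in "$@"; do
        # "r0" -> master/slave resource "ms-r0", filesystem "fs0"
        start_pair_cmds "ms-$res" "fs${res#r}"
    done
}
```

Piping the output of "start_order_cmds r1 r0" to sh on machine 1 would run the steps with r1 first; leaving out the pipe just lists them for review.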
Maybe you can still help me figure this out.
Regards
Dominik