[Linux-HA] pengine: increment_clone erros with clones on 10 nodes
Andrew Beekhof
beekhof at gmail.com
Fri Nov 9 09:02:25 MST 2007
Fixed:
http://hg.beekhof.net/lha/crm-dev/rev/b6801b38541b
It will be in the next interim update (in a week or two).
You can also file a support request with SUSE which might mean you can
get an updated build sooner.
Thanks for the report!
On Nov 8, 2007, at 3:05 PM, Iain Arnell wrote:
> Hi,
>
> I've been happily running a cluster of eight SLES10 machines using the
> standard SLES10 service pack 1 heartbeat-2.0.8-0.19 RPMs. But after
> adding 2 more machines, I'm now running into problems with the clone
> resources. (And I get the same behaviour using the latest
> heartbeat-2.1.2-18.1 from opensuse build service).
>
> The simplest testcase I can manage is to prepare a basic cluster of 10
> nodes, add evms to ha.cf,
>
> apiauth evms uid=root gid=haclient
> respawn root /sbin/evmsd
>
> then inject this clone definition into the CIB:
>
> <clone id="evmscloneset" notify="true" globally_unique="false">
> <instance_attributes id="evmscloneset">
> <attributes>
> <nvpair id="evmscloneset-01" name="clone_node_max" value="1"/>
> </attributes>
> </instance_attributes>
> <primitive id="evmsclone" class="ocf" type="EvmsSCC"
> provider="heartbeat"/>
> </clone>
>
> As expected, in no time at all, a bunch of evmsclone resources
> appear and
> start themselves on each machine:
>
> evmsclone:0 (heartbeat::ocf:EvmsSCC): Started xp1tbkec2
> evmsclone:1 (heartbeat::ocf:EvmsSCC): Started xp1tbkec1
> evmsclone:2 (heartbeat::ocf:EvmsSCC): Started xp1tbkeb1
> evmsclone:3 (heartbeat::ocf:EvmsSCC): Started xp1tbkea2
> evmsclone:4 (heartbeat::ocf:EvmsSCC): Started xp1tbkea1
> evmsclone:5 (heartbeat::ocf:EvmsSCC): Started xp1tdbma1
> evmsclone:6 (heartbeat::ocf:EvmsSCC): Started xp1tfrea1
> evmsclone:7 (heartbeat::ocf:EvmsSCC): Started xp1tdbma2
> evmsclone:8 (heartbeat::ocf:EvmsSCC): Started xp1tfrea2
> evmsclone:9 (heartbeat::ocf:EvmsSCC): Started xp1tbkeb2
>
> But, if I now bounce one node (or stop/start heartbeat), as soon as it
> starts to come up again, I get this in the log from the DC
>
> tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action
> evmsclone:2_start_0 (32) confirmed on xp1tbkeb2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:46 info: te_pseudo_action: Pseudo
> action
> 34 fired and confirmed
> tengine[7406]: 2007/11/08_10:04:46 info: te_pseudo_action: Pseudo
> action
> 37 fired and confirmed
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 48: evmsclone:0_post_notify_start_0 on xp1tbkec2
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 52: evmsclone:1_post_notify_start_0 on xp1tbkec1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 56: evmsclone:3_post_notify_start_0 on xp1tbkeb1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 60: evmsclone:4_post_notify_start_0 on xp1tbkea2
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 64: evmsclone:5_post_notify_start_0 on xp1tbkea1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 68: evmsclone:6_post_notify_start_0 on xp1tdbma2
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 72: evmsclone:7_post_notify_start_0 on xp1tdbma1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 76: evmsclone:8_post_notify_start_0 on xp1tfrea2
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 80: evmsclone:9_post_notify_start_0 on xp1tfrea1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
> action 83: evmsclone:2_post_notify_start_0 on xp1tbkeb2
> crmd[7374]: 2007/11/08_10:04:46 info: do_lrm_rsc_op: Performing
> op=evmsclone:9_notify_0 key=80:10:823cef09-7dbd-4819-baa8-
> f91910dd8f35)
> lrmd[7371]: 2007/11/08_10:04:46 info: rsc:evmsclone:9: notify
> crmd[7374]: 2007/11/08_10:04:46 info: process_lrm_event: LRM operation
> evmsclone:9_notify_0 (call=8, rc=0) complete
> tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action
> evmsclone:9_post_notify_start_0 (80) confirmed on xp1tfrea1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action
> evmsclone:1_post_notify_start_0 (52) confirmed on xp1tbkec1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
> evmsclone:4_post_notify_start_0 (60) confirmed on xp1tbkea2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
> evmsclone:0_post_notify_start_0 (48) confirmed on xp1tbkec2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
> evmsclone:8_post_notify_start_0 (76) confirmed on xp1tfrea2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
> evmsclone:3_post_notify_start_0 (56) confirmed on xp1tbkeb1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
> evmsclone:7_post_notify_start_0 (72) confirmed on xp1tdbma1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
> evmsclone:5_post_notify_start_0 (64) confirmed on xp1tbkea1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
> evmsclone:6_post_notify_start_0 (68) confirmed on xp1tdbma2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
> evmsclone:2_post_notify_start_0 (83) confirmed on xp1tbkeb2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: te_pseudo_action: Pseudo
> action
> 38 fired and confirmed
> tengine[7406]: 2007/11/08_10:04:47 info: run_graph: Transition 10:
> (Complete=29, Pending=0, Fired=0, Skipped=0, Incomplete=0)
> crmd[7374]: 2007/11/08_10:04:47 info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_IPC_MESSAGE origin=route_message ]
> crmd[7374]: 2007/11/08_10:04:47 info: do_state_transition: All 10
> cluster
> nodes are eligible to run resources.
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tdbma2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tbkec1 is online
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tbkec2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tfrea1 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource:
> Internally
> renamed evmsclone:0 on xp1tfrea1 to evmsclone:2
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tbkeb1 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource:
> Internally
> renamed evmsclone:0 on xp1tbkeb1 to evmsclone:2
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tfrea2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource:
> Internally
> renamed evmsclone:0 on xp1tfrea2 to evmsclone:2
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tbkeb2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource:
> Internally
> renamed evmsclone:0 on xp1tbkeb2 to evmsclone:2
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tbkea1 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource:
> Internally
> renamed evmsclone:0 on xp1tbkea1 to evmsclone:4
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tbkea2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource:
> Internally
> renamed evmsclone:0 on xp1tbkea2 to evmsclone:7
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
> xp1tdbma1 is online
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
> char: : (9)
>
> With the last line repeated forever (until I kill things and clean
> out the
> CIB and orphaned resources).
>
> I think can sort of see what's going wrong - for whatever reason,
> something wants to increment the clone number in "evmsclone:9",
> finds that
> it's "9", so increments it to "0" and then moves left in the string to
> increment the tens digit. Unfortunately, this doesn't exists, so it
> seems
> to find the colon instead and complains (a lot!). Is there some way
> I can
> get it to use double digits for these clones?, e.g. "evmsclone:00".
>
> As a workaround, I can add clone_nodes="11" to the cloneset
> definition,
> giving me resources evmsclone:0 to evmsclone:10, which seems to
> work, but
> I'm not totally convinced. For some reason, evmsclone:10 runs in
> preference to evmsclone:9, and no amount of stopping/starting seems to
> cause problems (yet).
>
> evmsclone:0 (heartbeat::ocf:EvmsSCC): Started xp1tbkec2
> evmsclone:1 (heartbeat::ocf:EvmsSCC): Started xp1tbkec1
> evmsclone:2 (heartbeat::ocf:EvmsSCC): Started xp1tbkeb1
> evmsclone:3 (heartbeat::ocf:EvmsSCC): Started xp1tbkea2
> evmsclone:4 (heartbeat::ocf:EvmsSCC): Started xp1tbkea1
> evmsclone:5 (heartbeat::ocf:EvmsSCC): Started xp1tdbma1
> evmsclone:6 (heartbeat::ocf:EvmsSCC): Started xp1tfrea1
> evmsclone:7 (heartbeat::ocf:EvmsSCC): Started xp1tdbma2
> evmsclone:8 (heartbeat::ocf:EvmsSCC): Started xp1tfrea2
> evmsclone:9 (heartbeat::ocf:EvmsSCC): Stopped
> evmsclone:10 (heartbeat::ocf:EvmsSCC): Started
> xp1tbkeb2
>
>
> --
> Iain.
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
More information about the Linux-HA
mailing list