[Linux-HA] pengine: increment_clone errors with clones on 10 nodes

Andrew Beekhof beekhof at gmail.com
Fri Nov 9 09:02:25 MST 2007


Fixed:
     http://hg.beekhof.net/lha/crm-dev/rev/b6801b38541b

It will be in the next interim update (in a week or two).
You can also file a support request with SUSE, which might get you an
updated build sooner.

Thanks for the report!

On Nov 8, 2007, at 3:05 PM, Iain Arnell wrote:

> Hi,
>
> I've been happily running a cluster of eight SLES10 machines using the
> standard SLES10 service pack 1 heartbeat-2.0.8-0.19 RPMs.  But after
> adding 2 more machines, I'm now running into problems with the clone
> resources.  (And I get the same behaviour using the latest
> heartbeat-2.1.2-18.1 from opensuse build service).
>
> The simplest testcase I can manage is to prepare a basic cluster of 10
> nodes, add evms to ha.cf,
>
>  apiauth evms uid=root gid=haclient
>  respawn root /sbin/evmsd
>
> then inject this clone definition into the CIB:
>
>   <clone id="evmscloneset" notify="true" globally_unique="false">
>     <instance_attributes id="evmscloneset">
>       <attributes>
>         <nvpair id="evmscloneset-01" name="clone_node_max" value="1"/>
>       </attributes>
>     </instance_attributes>
>     <primitive id="evmsclone" class="ocf" type="EvmsSCC" provider="heartbeat"/>
>   </clone>
>
> As expected, in no time at all, a bunch of evmsclone resources appear
> and start themselves on each machine:
>
>    evmsclone:0 (heartbeat::ocf:EvmsSCC):       Started xp1tbkec2
>    evmsclone:1 (heartbeat::ocf:EvmsSCC):       Started xp1tbkec1
>    evmsclone:2 (heartbeat::ocf:EvmsSCC):       Started xp1tbkeb1
>    evmsclone:3 (heartbeat::ocf:EvmsSCC):       Started xp1tbkea2
>    evmsclone:4 (heartbeat::ocf:EvmsSCC):       Started xp1tbkea1
>    evmsclone:5 (heartbeat::ocf:EvmsSCC):       Started xp1tdbma1
>    evmsclone:6 (heartbeat::ocf:EvmsSCC):       Started xp1tfrea1
>    evmsclone:7 (heartbeat::ocf:EvmsSCC):       Started xp1tdbma2
>    evmsclone:8 (heartbeat::ocf:EvmsSCC):       Started xp1tfrea2
>    evmsclone:9 (heartbeat::ocf:EvmsSCC):       Started xp1tbkeb2
>
> But if I now bounce one node (or stop/start heartbeat), as soon as it
> starts to come up again, I get this in the log on the DC:
>
> tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action evmsclone:2_start_0 (32) confirmed on xp1tbkeb2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:46 info: te_pseudo_action: Pseudo action 34 fired and confirmed
> tengine[7406]: 2007/11/08_10:04:46 info: te_pseudo_action: Pseudo action 37 fired and confirmed
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 48: evmsclone:0_post_notify_start_0 on xp1tbkec2
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 52: evmsclone:1_post_notify_start_0 on xp1tbkec1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 56: evmsclone:3_post_notify_start_0 on xp1tbkeb1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 60: evmsclone:4_post_notify_start_0 on xp1tbkea2
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 64: evmsclone:5_post_notify_start_0 on xp1tbkea1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 68: evmsclone:6_post_notify_start_0 on xp1tdbma2
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 72: evmsclone:7_post_notify_start_0 on xp1tdbma1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 76: evmsclone:8_post_notify_start_0 on xp1tfrea2
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 80: evmsclone:9_post_notify_start_0 on xp1tfrea1
> tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating action 83: evmsclone:2_post_notify_start_0 on xp1tbkeb2
> crmd[7374]: 2007/11/08_10:04:46 info: do_lrm_rsc_op: Performing op=evmsclone:9_notify_0 key=80:10:823cef09-7dbd-4819-baa8-f91910dd8f35)
> lrmd[7371]: 2007/11/08_10:04:46 info: rsc:evmsclone:9: notify
> crmd[7374]: 2007/11/08_10:04:46 info: process_lrm_event: LRM operation evmsclone:9_notify_0 (call=8, rc=0) complete
> tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action evmsclone:9_post_notify_start_0 (80) confirmed on xp1tfrea1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action evmsclone:1_post_notify_start_0 (52) confirmed on xp1tbkec1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action evmsclone:4_post_notify_start_0 (60) confirmed on xp1tbkea2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action evmsclone:0_post_notify_start_0 (48) confirmed on xp1tbkec2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action evmsclone:8_post_notify_start_0 (76) confirmed on xp1tfrea2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action evmsclone:3_post_notify_start_0 (56) confirmed on xp1tbkeb1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action evmsclone:7_post_notify_start_0 (72) confirmed on xp1tdbma1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action evmsclone:5_post_notify_start_0 (64) confirmed on xp1tbkea1 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action evmsclone:6_post_notify_start_0 (68) confirmed on xp1tdbma2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action evmsclone:2_post_notify_start_0 (83) confirmed on xp1tbkeb2 (rc=0)
> tengine[7406]: 2007/11/08_10:04:47 info: te_pseudo_action: Pseudo action 38 fired and confirmed
> tengine[7406]: 2007/11/08_10:04:47 info: run_graph: Transition 10: (Complete=29, Pending=0, Fired=0, Skipped=0, Incomplete=0)
> crmd[7374]: 2007/11/08_10:04:47 info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
> crmd[7374]: 2007/11/08_10:04:47 info: do_state_transition: All 10 cluster nodes are eligible to run resources.
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tdbma2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tbkec1 is online
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tbkec2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tfrea1 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally renamed evmsclone:0 on xp1tfrea1 to evmsclone:2
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tbkeb1 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally renamed evmsclone:0 on xp1tbkeb1 to evmsclone:2
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tfrea2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally renamed evmsclone:0 on xp1tfrea2 to evmsclone:2
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tbkeb2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally renamed evmsclone:0 on xp1tbkeb2 to evmsclone:2
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tbkea1 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally renamed evmsclone:0 on xp1tbkea1 to evmsclone:4
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tbkea2 is online
> pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally renamed evmsclone:0 on xp1tbkea2 to evmsclone:7
> pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node xp1tdbma1 is online
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
> pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected char: : (9)
>
> With the last line repeated forever (until I kill things and clean out
> the CIB and orphaned resources).
>
> I think I can sort of see what's going wrong: for whatever reason,
> something wants to increment the clone number in "evmsclone:9", finds
> that it's "9", so it sets it to "0" and then moves left in the string
> to increment the tens digit.  Unfortunately, that digit doesn't exist,
> so it seems to find the colon instead and complains (a lot!).  Is there
> some way I can get it to use double digits for these clones, e.g.
> "evmsclone:00"?
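
A rough Python sketch of the failure mode described above, for illustration
only. It is not the pengine source (which is C); the function names
increment_naive and increment_clone_id are hypothetical, and the error text
simply imitates the log line. The first function does an in-place digit
carry on a fixed-length string and fails when it reaches the ':' separator;
the second parses the suffix as an integer, so 9 becomes 10.

```python
def increment_naive(name: str) -> str:
    """Increment the trailing digits in place without growing the string.

    Assumed behaviour, modelled on the description in the report: a carry
    out of the leftmost digit walks onto the ':' separator and fails.
    """
    chars = list(name)
    i = len(chars) - 1
    while i >= 0:
        c = chars[i]
        if c in "012345678":
            chars[i] = str(int(c) + 1)  # no carry needed, done
            return "".join(chars)
        if c == "9":
            chars[i] = "0"              # carry into the digit to the left
            i -= 1
            continue
        # Not a digit: there is no room for the carried 1.
        raise ValueError(f"Unexpected char: {c} ({i})")
    raise ValueError("ran out of characters")


def increment_clone_id(name: str) -> str:
    """Parse the numeric suffix as an integer so the name can grow."""
    base, sep, suffix = name.rpartition(":")
    if not sep or not suffix.isdigit():
        raise ValueError(f"not a clone instance name: {name!r}")
    return f"{base}:{int(suffix) + 1}"
```

With these definitions, increment_naive("evmsclone:2") gives "evmsclone:3",
increment_naive("evmsclone:9") raises ValueError, and
increment_clone_id("evmsclone:9") gives "evmsclone:10".
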
>
> As a workaround, I can add clone_nodes="11" to the cloneset definition,
> giving me resources evmsclone:0 to evmsclone:10, which seems to work, but
> I'm not totally convinced.  For some reason, evmsclone:10 runs in
> preference to evmsclone:9, and no amount of stopping/starting seems to
> cause problems (yet).
>
>    evmsclone:0 (heartbeat::ocf:EvmsSCC):       Started xp1tbkec2
>    evmsclone:1 (heartbeat::ocf:EvmsSCC):       Started xp1tbkec1
>    evmsclone:2 (heartbeat::ocf:EvmsSCC):       Started xp1tbkeb1
>    evmsclone:3 (heartbeat::ocf:EvmsSCC):       Started xp1tbkea2
>    evmsclone:4 (heartbeat::ocf:EvmsSCC):       Started xp1tbkea1
>    evmsclone:5 (heartbeat::ocf:EvmsSCC):       Started xp1tdbma1
>    evmsclone:6 (heartbeat::ocf:EvmsSCC):       Started xp1tfrea1
>    evmsclone:7 (heartbeat::ocf:EvmsSCC):       Started xp1tdbma2
>    evmsclone:8 (heartbeat::ocf:EvmsSCC):       Started xp1tfrea2
>    evmsclone:9 (heartbeat::ocf:EvmsSCC):       Stopped
>    evmsclone:10 (heartbeat::ocf:EvmsSCC):      Started xp1tbkeb2
>
>
> -- 
> Iain.


