[Linux-HA] Problem with STONITH and heartbeat 2

Andrew Beekhof beekhof at gmail.com
Sat Oct 8 13:43:27 MDT 2005


On 10/7/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> Hello everybody,
>
> I have a weird problem with heartbeat 2 (crm enabled) and stonith.
>
> I set up a two-node postgresql cluster with two wti_nps STONITH devices.
> When I start heartbeat on both nodes and after that initiate a split
> brain situation by pulling the cluster interconnect cables, everything
> works fine and one host gets STONITHed.
>
> But if I bring heartbeat down on one node, initiate the split brain
> situation and start heartbeat on the node again, it starts the resources
> before it could successfully stonith the other node.

This was recently fixed in CVS.  I had used resource "stop" instead of
"start" for one of the internal constraints.

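For anyone who wants to check their own logs for the same symptom, here is
a minimal sketch. It assumes the message wording seen in the log excerpts
quoted below (other heartbeat versions may word them differently) and
expects unwrapped log lines, one message per line. The names
premature_starts and ignore are made up for this example and are not part
of heartbeat.

```python
import re

# Patterns copied from the log excerpts in this thread (assumption: your
# heartbeat version words these messages the same way).
FENCE_REQUEST = re.compile(r"Executing fencing operation \(\d+\) on (\S+)")
FENCE_OK = re.compile(r"Succeeded to STONITH the node (\S+?):")
RSC_START = re.compile(r"Initiating action \d+: start (\S+) on (\S+)")


def premature_starts(lines, ignore=frozenset()):
    """Return (resource, node) pairs whose start was initiated while a
    fencing request was still unconfirmed.

    `ignore` is a hypothetical convenience: resource names to skip, for
    example the STONITH resources themselves.
    """
    pending = set()  # nodes with a fencing request that has not succeeded yet
    flagged = []
    for line in lines:
        m = FENCE_REQUEST.search(line)
        if m:
            pending.add(m.group(1))
            continue
        m = FENCE_OK.search(line)
        if m:
            pending.discard(m.group(1))
            continue
        m = RSC_START.search(line)
        if m and pending and m.group(1) not in ignore:
            flagged.append((m.group(1), m.group(2)))
    return flagged
```

Feeding it the lines of /var/log/ha-debug and passing the names of the
STONITH resources in ignore should list every resource start that was
initiated before the fencing operation was confirmed.
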
>
> My ha.cf:
> node spock
> node sarek
> bcast eth3
> #bcast bond0
> debugfile /var/log/ha-debug
> debug 1
> #serial /dev/ttyS1
> auto_failback on
> crm yes
>
> My cib.xml is attached.
>
> What I have found out from the logfiles is:
> Before heartbeat starts the resources, it claims that it wants to STONITH
> the other node.
> <snip>
> Oct  7 18:23:34 spock pengine: [31749]: WARN: mask(stages.c:stage6):
> Scheduling Node sarek for STONITH
> Oct  7 18:23:34 spock pengine: [31749]: info: mask(stages.c:stage8):
> Creating transition graph 0.
> ...
> Oct  7 18:23:37 spock tengine: [31748]: info:
> mask(tengine.c:initiate_action): Executing fencing operation (16) on
> sarek
> Oct  7 18:23:38 spock tengine: [31748]: info:
> mask(tengine.c:cib_action_updated): Initiating action 1: start
> kill_sarek on spock
> Oct  7 18:23:38 spock crmd: [31659]: WARN: lrm_get_rsc(653): got a
> return code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> Oct  7 18:23:38 spock tengine: [31748]: info:
> mask(tengine.c:cib_action_updated): Initiating action 4: start
> infobase_rg:infobase_ip on spock
> Oct  7 18:23:38 spock crmd: [31659]: WARN: lrm_get_rsc(653): got a
> return code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> <snip/>
>
> But somehow it doesn't wait for the STONITH operation to complete
> successfully before it starts the resources. One minute later I get these
> messages:
>
> <snip>
> Oct  7 18:24:37 spock stonithd: [31657]: info: Failed to STONITH the
> node sarek: optype=1, op_result=2
> Oct  7 18:24:37 spock stonithd: Cannot open : No such file or directory
> Oct  7 18:24:37 spock tengine: [31748]: info:
> mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=sarek,
> result=2, node_list=
> Oct  7 18:24:37 spock tengine: [31748]: ERROR:
> mask(tengine.c:match_down_event): Stonith of
> 43ee5c7d-87dd-4524-909a-80a98dc07926 failed (2)... aborting transition.
> Oct  7 18:24:37 spock tengine: [31748]: WARN:
> mask(utils.c:send_complete): 0 - Transition status: Aborted by failed
> action: Stonith failed
> Oct  7 18:24:37 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 0 was confirmed
> Oct  7 18:24:37 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 1 was confirmed
> Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 2 was confirmed
> Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 3 was confirmed
> Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 4 was confirmed
> Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 5 was confirmed
> Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 6 was confirmed
> Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 7 was confirmed
> Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 8 was confirmed
> Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 9 was confirmed
> Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> Synapse 10 was executed
> Oct  7 18:24:38 spock tengine: [31748]: WARN:
> mask(utils.c:print_action):       [Action 16] Completed (cannot fail)
> Oct  7 18:24:38 spock tengine: [31748]: WARN:
> mask(utils.c:print_action):               CRM Op: stonith on sarek
> (43ee5c7d-87dd-4524-909a-80a98dc07926)
> Oct  7 18:24:38 spock tengine: [31748]: WARN:
> mask(utils.c:send_complete): 0 - Transition status: Aborted by failed
> action: Fencing op failed
> Oct  7 18:24:38 spock crmd: [31659]: info:
> mask(fsa.c:do_state_transition): State transition S_TRANSITION_ENGINE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE
> origin=do_msg_route ]
> Oct  7 18:24:38 spock tengine: [31748]: info:
> mask(tengine.c:process_trigger): Trigger from action -2 (0 more)
> discarded: Not in transition
> Oct  7 18:24:38 spock crmd: [31659]: info:
> mask(fsa.c:do_state_transition): All 1 cluster nodes are eligable to run
> resources.
> Oct  7 18:24:38 spock tengine: [31748]: info:
> mask(utils.c:send_complete): 0 - Transition status: Confirmed Stopped:
> Last pending action confirmed
> Oct  7 18:24:39 spock pengine: [31749]: info: mask(process_pe_message):
> [generation] <cib admin_epoch="0" have_quorum="true" num_peers="1"
> origin="spock" cib_feature_revision="1" last_written="Fri Oct  7
> 18:23:52 2005" debug_source="finalize_join"
> dc_uuid="d2996479-d6f9-47ef-b123-95776945d5cc" generated="true"
> epoch="7" num_updates="140" ccm_transition="1"/>
> Oct  7 18:24:40 spock pengine: [31749]: WARN:
> mask(unpack.c:param_value): Option default_resource_stickiness not set
> Oct  7 18:24:40 spock pengine: [31749]: info:
> mask(unpack.c:unpack_config): STONITH of failed nodes is enabled
> Oct  7 18:24:40 spock pengine: [31749]: info:
> mask(unpack.c:unpack_config): Cluster is symmetric - resources can run
> anywhere by default
> Oct  7 18:24:40 spock pengine: [31749]: info:
> mask(unpack.c:unpack_config): On loss of CCM Quorum: Freeze resources
> Oct  7 18:24:40 spock pengine: [31749]: info:
> mask(native.c:native_create_actions): Leave resource kill_sarek
> (spock)
> Oct  7 18:24:40 spock pengine: [31749]: WARN:
> mask(native.c:create_recurring_actions):    kill_spock_monitor_5000:
> (<null>) (cancelled : start un-runnable)
> Oct  7 18:24:40 spock pengine: [31749]: info:
> mask(native.c:native_create_actions): Leave resource
> infobase_rg:infobase_ip      (spock)
> Oct  7 18:24:40 spock pengine: [31749]: info:
> mask(native.c:native_create_actions): Leave resource
> telebase_rg:telebase_ip      (spock)
> Oct  7 18:24:40 spock pengine: [31749]: WARN: mask(stages.c:stage6):
> Scheduling Node sarek for STONITH
> Oct  7 18:24:41 spock pengine: [31749]: info: mask(stages.c:stage8):
> Creating transition graph 1.
> Oct  7 18:24:41 spock crmd: [31659]: info:
> mask(fsa.c:do_state_transition): State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=do_msg_route ]
> Oct  7 18:24:41 spock tengine: [31748]: info:
> mask(unpack.c:unpack_graph): Beginning transition 1 : timeout set to
> 120000ms
> Oct  7 18:24:41 spock tengine: [31748]: info:
> mask(unpack.c:unpack_graph): Unpacked 1 actions in 1 synapses
> Oct  7 18:24:41 spock tengine: [31748]: info:
> mask(tengine.c:initiate_transition): Initating transition
> Oct  7 18:24:41 spock tengine: [31748]: info:
> mask(tengine.c:initiate_action): Executing fencing operation (19) on
> sarek
> Oct  7 18:24:45 spock pengine: [31749]: ERROR:
> mask(ipc.c:subsystem_msg_dispatch): pengine took 6200ms to complete
> Oct  7 18:24:59 spock stonithd: [31657]: info: Succeeded to STONITH the
> node sarek: optype=1. whodoit: spock
> Oct  7 18:24:59 spock stonithd: Cannot open : No such file or directory
> <snip/>
>
> I don't know why the STONITH operation fails. If I trace the network
> traffic with Ethereal, I see that the 'monitor' communication works, but I
> can't see an attempt to kill the other node.
>
> I tried using haresources with crm disabled, and then everything worked.
> I also tried to play with the cib.xml parameters. I looked at
> http://www.linux-ha.org/NodeFencing where it says there is a mandatory
> node_fencing="(yes|no)" attribute in the CIB. But it's not mentioned in
> the ClusterResourceManager/DTD1.0, and a grep over the heartbeat
> source code returned no match.
> I would appreciate it if anyone could tell me how I can tell heartbeat
> to make sure that my resources are started only after a
> successful STONITH operation.
>
> Many thanks in advance.
>
> Stefan Peinkofer
>
>
>
> --
> --------------------------------------------------------------------------------
> Stefan Peinkofer
> Zentrum fuer angewandte Kommunikationstechnologien (ZaK)
> Fachhochschule Muenchen, Munich University of Applied Sciences
> URL: http://www.fhm.edu/zak/
> --------------------------------------------------------------------------------
>
>


