[Linux-HA] Problem with STONITH and heartbeat 2

Andrew Beekhof beekhof at gmail.com
Tue Oct 11 06:34:01 MDT 2005


On 10/11/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> Hello,
>
> On Sat, 2005-10-08 at 21:43 +0200, Andrew Beekhof wrote:
> > On 10/7/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > Hello everybody,
> > >
> > > I have a weird problem with heartbeat 2 (crm enabled) and stonith.
> > >
> > > I set up a two-node PostgreSQL cluster with two wti_nps STONITH devices.
> > > When I start heartbeat on both nodes and then force a split-brain
> > > situation by pulling the cluster interconnect cables, everything
> > > works fine and one host gets STONITHed.
> > >
> > > But if I bring heartbeat down on one node, force the split-brain
> > > situation and start heartbeat on that node again, it starts the resources
> > > before it has successfully STONITHed the other node.
> >
> > this was recently fixed in CVS.  I had used resource "stop" instead of
> > "start" for one of the internal constraints.
> >
> Thanks for the fast reply.
> I checked out the current CVS version (using the download-tarball
> function in the web CVS browser). It compiled fine, but I'm still
> experiencing the error.
>
> The BasicSanityCheck returned only one error saying:
> pengine[3325]: 2005/10/11_13:33:53 ERROR:
> mask(ipc.c:subsystem_msg_dispatch): pengine took 6370ms to complete

that one can be ignored for now

>
> To make sure I didn't do anything wrong with my config:
>
> Is it "normal" that the first STONITH attempt in this scenario fails?

before friday, the answer was "probably".
since then, the answer is no.

so it depends on when you last updated from CVS.
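
A rough way to check whether a given build still shows the ordering problem is to scan an unwrapped copy of the log (for example the debugfile named in ha.cf) for resource start actions that tengine initiates while a fencing operation is still outstanding. The following is only a sketch in Python and not part of heartbeat: the function name starts_before_fencing and its ignore parameter are made up for illustration, and the regular expressions simply assume the message formats seen in the log excerpts in this thread.

```python
import re

# Patterns assumed from the log excerpts quoted in this thread.
FENCE_START = re.compile(r"Executing fencing operation \(\d+\) on (\S+)")
FENCE_OK = re.compile(r"Succeeded to STONITH the node (\S+?):")
RSC_START = re.compile(r"Initiating action \d+: start (\S+) on (\S+)")


def starts_before_fencing(lines, ignore=()):
    """Return (resource, node) pairs whose start was initiated while a
    fencing operation was still pending.

    lines  -- iterable of unwrapped log lines (one message per line)
    ignore -- resource names to skip, e.g. the STONITH resources themselves
    """
    pending = set()   # nodes with a fencing operation not yet confirmed
    premature = []
    for line in lines:
        m = FENCE_START.search(line)
        if m:
            pending.add(m.group(1))
            continue
        m = FENCE_OK.search(line)
        if m:
            # Only a success clears the node; a "Failed to STONITH"
            # message leaves it pending.
            pending.discard(m.group(1))
            continue
        m = RSC_START.search(line)
        if m and pending and m.group(1) not in ignore:
            premature.append((m.group(1), m.group(2)))
    return premature
```

Called as starts_before_fencing(open("/var/log/ha-debug"), ignore={"kill_sarek"}), an empty result would suggest that no ordinary resource was started ahead of a confirmed STONITH. The STONITH resource itself is passed in ignore on the assumption that it has to be running before the fencing operation can be carried out.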

> Is it "normal" that stonithd, ccm and lrmd complain:
> Cannot open : No such file or directory?

no

> Is it "normal" that the following message appears:
> Oct 11 13:09:57 sarek crmd: [22363]: WARN: lrm_get_rsc(653): got a
> return code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg. ?

depends on the context

>
> Many thanks in advance.
>
> Best regards,
> Stefan Peinkofer
> > >
> > > My ha.cf:
> > > node spock
> > > node sarek
> > > bcast eth3
> > > #bcast bond0
> > > debugfile /var/log/ha-debug
> > > debug 1
> > > #serial /dev/ttyS1
> > > auto_failback on
> > > crm yes
> > >
> > > My cib.xml is attached.
> > >
> > > What I have found out from the logfiles is:
> > > Before heartbeat starts the resources, it claims that it wants to STONITH
> > > the other node.
> > > <snip>
> > > Oct  7 18:23:34 spock pengine: [31749]: WARN: mask(stages.c:stage6):
> > > Scheduling Node sarek for STONITH
> > > Oct  7 18:23:34 spock pengine: [31749]: info: mask(stages.c:stage8):
> > > Creating transition graph 0.
> > > ...
> > > Oct  7 18:23:37 spock tengine: [31748]: info:
> > > mask(tengine.c:initiate_action): Executing fencing operation (16) on
> > > sarek
> > > Oct  7 18:23:38 spock tengine: [31748]: info:
> > > mask(tengine.c:cib_action_updated): Initiating action 1: start
> > > kill_sarek on spock
> > > Oct  7 18:23:38 spock crmd: [31659]: WARN: lrm_get_rsc(653): got a
> > > return code HA_FAIL from a reply message of getrsc with function
> > > get_ret_from_msg.
> > > Oct  7 18:23:38 spock tengine: [31748]: info:
> > > mask(tengine.c:cib_action_updated): Initiating action 4: start
> > > infobase_rg:infobase_ip on spock
> > > Oct  7 18:23:38 spock crmd: [31659]: WARN: lrm_get_rsc(653): got a
> > > return code HA_FAIL from a reply message of getrsc with function
> > > get_ret_from_msg.
> > > <snip/>
> > >
> > > But somehow it doesn't wait for the STONITH operation to complete
> > > successfully and starts the resources anyway. One minute later I get these
> > > messages:
> > >
> > > <snip>
> > > Oct  7 18:24:37 spock stonithd: [31657]: info: Failed to STONITH the
> > > node sarek: optype=1, op_result=2
> > > Oct  7 18:24:37 spock stonithd: Cannot open : No such file or directory
> > > Oct  7 18:24:37 spock tengine: [31748]: info:
> > > mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=sarek,
> > > result=2, node_list=
> > > Oct  7 18:24:37 spock tengine: [31748]: ERROR:
> > > mask(tengine.c:match_down_event): Stonith of
> > > 43ee5c7d-87dd-4524-909a-80a98dc07926 failed (2)... aborting transition.
> > > Oct  7 18:24:37 spock tengine: [31748]: WARN:
> > > mask(utils.c:send_complete): 0 - Transition status: Aborted by failed
> > > action: Stonith failed
> > > Oct  7 18:24:37 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 0 was confirmed
> > > Oct  7 18:24:37 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 1 was confirmed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 2 was confirmed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 3 was confirmed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 4 was confirmed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 5 was confirmed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 6 was confirmed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 7 was confirmed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 8 was confirmed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 9 was confirmed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > > Synapse 10 was executed
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN:
> > > mask(utils.c:print_action):       [Action 16] Completed (cannot fail)
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN:
> > > mask(utils.c:print_action):               CRM Op: stonith on sarek
> > > (43ee5c7d-87dd-4524-909a-80a98dc07926)
> > > Oct  7 18:24:38 spock tengine: [31748]: WARN:
> > > mask(utils.c:send_complete): 0 - Transition status: Aborted by failed
> > > action: Fencing op failed
> > > Oct  7 18:24:38 spock crmd: [31659]: info:
> > > mask(fsa.c:do_state_transition): State transition S_TRANSITION_ENGINE ->
> > > S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE
> > > origin=do_msg_route ]
> > > Oct  7 18:24:38 spock tengine: [31748]: info:
> > > mask(tengine.c:process_trigger): Trigger from action -2 (0 more)
> > > discarded: Not in transition
> > > Oct  7 18:24:38 spock crmd: [31659]: info:
> > > mask(fsa.c:do_state_transition): All 1 cluster nodes are eligable to run
> > > resources.
> > > Oct  7 18:24:38 spock tengine: [31748]: info:
> > > mask(utils.c:send_complete): 0 - Transition status: Confirmed Stopped:
> > > Last pending action confirmed
> > > Oct  7 18:24:39 spock pengine: [31749]: info: mask(process_pe_message):
> > > [generation] <cib admin_epoch="0" have_quorum="true" num_peers="1"
> > > origin="spock" cib_feature_revision="1" last_written="Fri Oct  7
> > > 18:23:52 2005" debug_source="finalize_join"
> > > dc_uuid="d2996479-d6f9-47ef-b123-95776945d5cc" generated="true"
> > > epoch="7" num_updates="140" ccm_transition="1"/>
> > > Oct  7 18:24:40 spock pengine: [31749]: WARN:
> > > mask(unpack.c:param_value): Option default_resource_stickiness not set
> > > Oct  7 18:24:40 spock pengine: [31749]: info:
> > > mask(unpack.c:unpack_config): STONITH of failed nodes is enabled
> > > Oct  7 18:24:40 spock pengine: [31749]: info:
> > > mask(unpack.c:unpack_config): Cluster is symmetric - resources can run
> > > anywhere by default
> > > Oct  7 18:24:40 spock pengine: [31749]: info:
> > > mask(unpack.c:unpack_config): On loss of CCM Quorum: Freeze resources
> > > Oct  7 18:24:40 spock pengine: [31749]: info:
> > > mask(native.c:native_create_actions): Leave resource kill_sarek
> > > (spock)
> > > Oct  7 18:24:40 spock pengine: [31749]: WARN:
> > > mask(native.c:create_recurring_actions):    kill_spock_monitor_5000:
> > > (<null>) (cancelled : start un-runnable)
> > > Oct  7 18:24:40 spock pengine: [31749]: info:
> > > mask(native.c:native_create_actions): Leave resource
> > > infobase_rg:infobase_ip      (spock)
> > > Oct  7 18:24:40 spock pengine: [31749]: info:
> > > mask(native.c:native_create_actions): Leave resource
> > > telebase_rg:telebase_ip      (spock)
> > > Oct  7 18:24:40 spock pengine: [31749]: WARN: mask(stages.c:stage6):
> > > Scheduling Node sarek for STONITH
> > > Oct  7 18:24:41 spock pengine: [31749]: info: mask(stages.c:stage8):
> > > Creating transition graph 1.
> > > Oct  7 18:24:41 spock crmd: [31659]: info:
> > > mask(fsa.c:do_state_transition): State transition S_POLICY_ENGINE ->
> > > S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> > > origin=do_msg_route ]
> > > Oct  7 18:24:41 spock tengine: [31748]: info:
> > > mask(unpack.c:unpack_graph): Beginning transition 1 : timeout set to
> > > 120000ms
> > > Oct  7 18:24:41 spock tengine: [31748]: info:
> > > mask(unpack.c:unpack_graph): Unpacked 1 actions in 1 synapses
> > > Oct  7 18:24:41 spock tengine: [31748]: info:
> > > mask(tengine.c:initiate_transition): Initating transition
> > > Oct  7 18:24:41 spock tengine: [31748]: info:
> > > mask(tengine.c:initiate_action): Executing fencing operation (19) on
> > > sarek
> > > Oct  7 18:24:45 spock pengine: [31749]: ERROR:
> > > mask(ipc.c:subsystem_msg_dispatch): pengine took 6200ms to complete
> > > Oct  7 18:24:59 spock stonithd: [31657]: info: Succeeded to STONITH the
> > > node sarek: optype=1. whodoit: spock
> > > Oct  7 18:24:59 spock stonithd: Cannot open : No such file or directory
> > > <snip/>
> > >
> > > I don't know why the STONITH operation fails. If I trace the network
> > > traffic with Ethereal, I see that the 'monitor' communication works, but I
> > > can't see any attempt to kill the other node.
> > >
> > > I tried using haresources with crm disabled, and then everything worked.
> > > I also tried playing with the cib.xml parameters. I looked at
> > > http://www.linux-ha.org/NodeFencing where it says there is a mandatory
> > > node_fencing="(yes|no)" attribute in the CIB. But it's not mentioned in
> > > the ClusterResourceManager/DTD1.0, and a grep over the heartbeat
> > > source code returned no match.
> > > I would appreciate it if anyone could tell me how to make heartbeat
> > > ensure that my resources are started only after a successful STONITH
> > > operation.
> > >
> > > Many thanks in advance.
> > >
> > > Stefan Peinkofer
> > >
> > >
> > >
> > > --
> > > --------------------------------------------------------------------------------
> > > Stefan Peinkofer
> > > Zentrum fuer angewandte Kommunikationstechnologien (ZaK)
> > > Fachhochschule Muenchen, Munich University of Applied Sciences
> > > URL: http://www.fhm.edu/zak/
> > > --------------------------------------------------------------------------------
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > Linux-HA mailing list
> > > Linux-HA at lists.linux-ha.org
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
> > >
> > >
> > >
>
>
>
>
>
>


