[Linux-HA] Problem with STONITH and heartbeat 2

Stefan Peinkofer peinkofe at fhm.edu
Tue Oct 11 06:02:46 MDT 2005


Hello,

On Sat, 2005-10-08 at 21:43 +0200, Andrew Beekhof wrote:
> On 10/7/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > Hello everybody,
> >
> > I have a weird problem with heartbeat 2 (crm enabled) and stonith.
> >
> > I set up a two-node postgresql cluster with two wti_nps STONITH devices.
> > When I start heartbeat on both nodes and then initiate a split-brain
> > situation by pulling the cluster interconnect cables, everything
> > works fine and one host gets STONITHed.
> >
> > But if I bring heartbeat down on one node, initiate the split brain
> > situation and start heartbeat on the node again, it starts the resources
> > before it could successfully stonith the other node.
> 
> this was recently fixed in CVS.  I had used resource "stop" instead of
> "start" for one of the internal constraints.
> 
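
As far as I understand the description, the effect of such a mix-up can
be sketched as below. This is only a toy model in Python, not heartbeat
code: the runnable() helper, the ordering dictionaries and the action
names are hypothetical and exist purely to illustrate why a constraint
attached to "stop" instead of "start" lets the resource start before
fencing has completed.

```python
# Toy model (NOT heartbeat code): each action maps to the actions that
# must have completed before it is allowed to run.
def runnable(action, ordering, completed):
    """Return True if every prerequisite of `action` has completed."""
    return all(pre in completed for pre in ordering.get(action, ()))

# Hypothetical buggy ordering: fencing is ordered before "stop",
# so "start" has no prerequisite at all.
buggy = {"stop infobase_ip": ["stonith sarek"]}
# Hypothetical intended ordering: fencing must complete before "start".
fixed = {"start infobase_ip": ["stonith sarek"]}

completed = set()  # fencing has not finished yet

print(runnable("start infobase_ip", buggy, completed))  # True: starts too early
print(runnable("start infobase_ip", fixed, completed))  # False: waits for fencing
completed.add("stonith sarek")
print(runnable("start infobase_ip", fixed, completed))  # True: fencing is done
```
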
Thanks for the fast reply.
I checked out the current CVS version (using the "download tarball"
function in the web CVS browser). It compiled fine, but I'm still
experiencing the error.
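
One way to confirm the symptom from the logs is to look for resource
starts that are initiated while a fencing operation is still pending.
The following is a minimal Python sketch under some assumptions: the
wrapped log lines have been joined back into one line per message, the
message texts match the ones quoted further down in this mail, and the
function name and the "ignore" parameter are my own invention. A failed
STONITH attempt is treated as still pending.

```python
import re

# Patterns taken from the message texts in the quoted logs.
FENCE_BEGIN = re.compile(r"Executing fencing operation \(\d+\) on (\S+)")
FENCE_OK = re.compile(r"Succeeded to STONITH the node (\S+?):")
START = re.compile(r"Initiating action \d+: start (\S+) on (\S+)")

def starts_during_fencing(lines, ignore=()):
    """Return resources whose start was initiated while fencing was pending."""
    pending = set()   # nodes with a fencing operation not yet confirmed
    early = []
    for line in lines:
        m = FENCE_BEGIN.search(line)
        if m:
            pending.add(m.group(1))
            continue
        m = FENCE_OK.search(line)
        if m:
            pending.discard(m.group(1))
            continue
        m = START.search(line)
        if m and pending and m.group(1) not in ignore:
            early.append(m.group(1))
    return early

sample = [
    "Oct  7 18:23:37 spock tengine: [31748]: info: mask(tengine.c:initiate_action): Executing fencing operation (16) on sarek",
    "Oct  7 18:23:38 spock tengine: [31748]: info: mask(tengine.c:cib_action_updated): Initiating action 1: start kill_sarek on spock",
    "Oct  7 18:23:38 spock tengine: [31748]: info: mask(tengine.c:cib_action_updated): Initiating action 4: start infobase_rg:infobase_ip on spock",
    "Oct  7 18:24:59 spock stonithd: [31657]: info: Succeeded to STONITH the node sarek: optype=1. whodoit: spock",
]

# The STONITH resource itself is excluded here; that choice is an assumption.
print(starts_during_fencing(sample, ignore={"kill_sarek"}))
# prints ['infobase_rg:infobase_ip']
```

An empty result would mean that no start was initiated between the
beginning of a fencing operation and its success message.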

The BasicSanityCheck returned only one error saying:
pengine[3325]: 2005/10/11_13:33:53 ERROR:
mask(ipc.c:subsystem_msg_dispatch): pengine took 6370ms to complete

To make sure I didn't do anything wrong with my config:

Is it "normal" that the first stonith attempt in this scenario fails?
Is it "normal" that stonithd, ccm and lrmd complain:
Cannot open : No such file or directory ?
Is it "normal" that the following message appears:
Oct 11 13:09:57 sarek crmd: [22363]: WARN: lrm_get_rsc(653): got a
return code HA_FAIL from a reply message of getrsc with function
get_ret_from_msg. ?

Many thanks in advance.

Best regards,
Stefan Peinkofer
> >
> > My ha.cf:
> > node spock
> > node sarek
> > bcast eth3
> > #bcast bond0
> > debugfile /var/log/ha-debug
> > debug 1
> > #serial /dev/ttyS1
> > auto_failback on
> > crm yes
> >
> > My cib.xml is attached.
> >
> > What I have found out from the logfiles is:
> > Before heartbeat starts the resources, it claims that it wants to
> > stonith the other node.
> > <snip>
> > Oct  7 18:23:34 spock pengine: [31749]: WARN: mask(stages.c:stage6):
> > Scheduling Node sarek for STONITH
> > Oct  7 18:23:34 spock pengine: [31749]: info: mask(stages.c:stage8):
> > Creating transition graph 0.
> > ...
> > Oct  7 18:23:37 spock tengine: [31748]: info:
> > mask(tengine.c:initiate_action): Executing fencing operation (16) on
> > sarek
> > Oct  7 18:23:38 spock tengine: [31748]: info:
> > mask(tengine.c:cib_action_updated): Initiating action 1: start
> > kill_sarek on spock
> > Oct  7 18:23:38 spock crmd: [31659]: WARN: lrm_get_rsc(653): got a
> > return code HA_FAIL from a reply message of getrsc with function
> > get_ret_from_msg.
> > Oct  7 18:23:38 spock tengine: [31748]: info:
> > mask(tengine.c:cib_action_updated): Initiating action 4: start
> > infobase_rg:infobase_ip on spock
> > Oct  7 18:23:38 spock crmd: [31659]: WARN: lrm_get_rsc(653): got a
> > return code HA_FAIL from a reply message of getrsc with function
> > get_ret_from_msg.
> > <snip/>
> >
> > But somehow it doesn't wait for the stonith operation to complete
> > successfully before it starts the resources. One minute later I get
> > these messages:
> >
> > <snip>
> > Oct  7 18:24:37 spock stonithd: [31657]: info: Failed to STONITH the
> > node sarek: optype=1, op_result=2
> > Oct  7 18:24:37 spock stonithd: Cannot open : No such file or directory
> > Oct  7 18:24:37 spock tengine: [31748]: info:
> > mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=sarek,
> > result=2, node_list=
> > Oct  7 18:24:37 spock tengine: [31748]: ERROR:
> > mask(tengine.c:match_down_event): Stonith of
> > 43ee5c7d-87dd-4524-909a-80a98dc07926 failed (2)... aborting transition.
> > Oct  7 18:24:37 spock tengine: [31748]: WARN:
> > mask(utils.c:send_complete): 0 - Transition status: Aborted by failed
> > action: Stonith failed
> > Oct  7 18:24:37 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 0 was confirmed
> > Oct  7 18:24:37 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 1 was confirmed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 2 was confirmed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 3 was confirmed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 4 was confirmed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 5 was confirmed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 6 was confirmed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 7 was confirmed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 8 was confirmed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 9 was confirmed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN: mask(utils.c:print_state):
> > Synapse 10 was executed
> > Oct  7 18:24:38 spock tengine: [31748]: WARN:
> > mask(utils.c:print_action):       [Action 16] Completed (cannot fail)
> > Oct  7 18:24:38 spock tengine: [31748]: WARN:
> > mask(utils.c:print_action):               CRM Op: stonith on sarek
> > (43ee5c7d-87dd-4524-909a-80a98dc07926)
> > Oct  7 18:24:38 spock tengine: [31748]: WARN:
> > mask(utils.c:send_complete): 0 - Transition status: Aborted by failed
> > action: Fencing op failed
> > Oct  7 18:24:38 spock crmd: [31659]: info:
> > mask(fsa.c:do_state_transition): State transition S_TRANSITION_ENGINE ->
> > S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE
> > origin=do_msg_route ]
> > Oct  7 18:24:38 spock tengine: [31748]: info:
> > mask(tengine.c:process_trigger): Trigger from action -2 (0 more)
> > discarded: Not in transition
> > Oct  7 18:24:38 spock crmd: [31659]: info:
> > mask(fsa.c:do_state_transition): All 1 cluster nodes are eligable to run
> > resources.
> > Oct  7 18:24:38 spock tengine: [31748]: info:
> > mask(utils.c:send_complete): 0 - Transition status: Confirmed Stopped:
> > Last pending action confirmed
> > Oct  7 18:24:39 spock pengine: [31749]: info: mask(process_pe_message):
> > [generation] <cib admin_epoch="0" have_quorum="true" num_peers="1"
> > origin="spock" cib_feature_revision="1" last_written="Fri Oct  7
> > 18:23:52 2005" debug_source="finalize_join"
> > dc_uuid="d2996479-d6f9-47ef-b123-95776945d5cc" generated="true"
> > epoch="7" num_updates="140" ccm_transition="1"/>
> > Oct  7 18:24:40 spock pengine: [31749]: WARN:
> > mask(unpack.c:param_value): Option default_resource_stickiness not set
> > Oct  7 18:24:40 spock pengine: [31749]: info:
> > mask(unpack.c:unpack_config): STONITH of failed nodes is enabled
> > Oct  7 18:24:40 spock pengine: [31749]: info:
> > mask(unpack.c:unpack_config): Cluster is symmetric - resources can run
> > anywhere by default
> > Oct  7 18:24:40 spock pengine: [31749]: info:
> > mask(unpack.c:unpack_config): On loss of CCM Quorum: Freeze resources
> > Oct  7 18:24:40 spock pengine: [31749]: info:
> > mask(native.c:native_create_actions): Leave resource kill_sarek
> > (spock)
> > Oct  7 18:24:40 spock pengine: [31749]: WARN:
> > mask(native.c:create_recurring_actions):    kill_spock_monitor_5000:
> > (<null>) (cancelled : start un-runnable)
> > Oct  7 18:24:40 spock pengine: [31749]: info:
> > mask(native.c:native_create_actions): Leave resource
> > infobase_rg:infobase_ip      (spock)
> > Oct  7 18:24:40 spock pengine: [31749]: info:
> > mask(native.c:native_create_actions): Leave resource
> > telebase_rg:telebase_ip      (spock)
> > Oct  7 18:24:40 spock pengine: [31749]: WARN: mask(stages.c:stage6):
> > Scheduling Node sarek for STONITH
> > Oct  7 18:24:41 spock pengine: [31749]: info: mask(stages.c:stage8):
> > Creating transition graph 1.
> > Oct  7 18:24:41 spock crmd: [31659]: info:
> > mask(fsa.c:do_state_transition): State transition S_POLICY_ENGINE ->
> > S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> > origin=do_msg_route ]
> > Oct  7 18:24:41 spock tengine: [31748]: info:
> > mask(unpack.c:unpack_graph): Beginning transition 1 : timeout set to
> > 120000ms
> > Oct  7 18:24:41 spock tengine: [31748]: info:
> > mask(unpack.c:unpack_graph): Unpacked 1 actions in 1 synapses
> > Oct  7 18:24:41 spock tengine: [31748]: info:
> > mask(tengine.c:initiate_transition): Initating transition
> > Oct  7 18:24:41 spock tengine: [31748]: info:
> > mask(tengine.c:initiate_action): Executing fencing operation (19) on
> > sarek
> > Oct  7 18:24:45 spock pengine: [31749]: ERROR:
> > mask(ipc.c:subsystem_msg_dispatch): pengine took 6200ms to complete
> > Oct  7 18:24:59 spock stonithd: [31657]: info: Succeeded to STONITH the
> > node sarek: optype=1. whodoit: spock
> > Oct  7 18:24:59 spock stonithd: Cannot open : No such file or directory
> > <snip/>
> >
> > I don't know why the stonith operation fails. If I trace the network
> > traffic with Ethereal, I see that the 'monitor' communication works, but
> > I can't see an attempt to kill the other node.
> >
> > I tried using haresources with crm disabled, and then everything worked.
> > I also tried playing with the cib.xml parameters. I looked at
> > http://www.linux-ha.org/NodeFencing, where it says there is a mandatory
> > node_fencing="(yes|no)" attribute in the CIB. But it's not mentioned in
> > the ClusterResourceManager/DTD1.0, and a grep over the heartbeat
> > source code returned no match.
> > I would appreciate it if anyone could tell me how I can tell heartbeat
> > to make sure that my resources are started only after a successful
> > stonith operation.
> >
> > Many thanks in advance.
> >
> > Stefan Peinkofer
> >
> >
> >
> > --
> > --------------------------------------------------------------------------------
> > Stefan Peinkofer
> > Zentrum fuer angewandte Kommunikationstechnologien (ZaK)
> > Fachhochschule Muenchen, Munich University of Applied Sciences
> > URL: http://www.fhm.edu/zak/
> > --------------------------------------------------------------------------------
> >
> >
> >
> >
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA at lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
> >
> >