[Linux-HA] Problem with STONITH and heartbeat 2

Andrew Beekhof beekhof at gmail.com
Wed Oct 12 00:27:22 MDT 2005


On 10/11/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> Hello,
>
> On Tue, 2005-10-11 at 14:34 +0200, Andrew Beekhof wrote:
> > On 10/11/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > Hello,
> > >
> > > On Sat, 2005-10-08 at 21:43 +0200, Andrew Beekhof wrote:
> > > > On 10/7/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > > Hello everybody,
> > > > >
> > > > > I have a weird problem with heartbeat 2 (crm enabled) and stonith.
> > > > >
> > > > > I set up a two-node PostgreSQL cluster with two wti_nps STONITH devices.
> > > > > When I start heartbeat on both nodes and after that initiate a split
> > > > > brain situation by pulling the cluster interconnect cables, everything
> > > > > works fine and one host gets STONITHed.
> > > > >
> > > > > But if I bring heartbeat down on one node, initiate the split brain
> > > > > situation and start heartbeat on the node again, it starts the resources
> > > > > before it could successfully stonith the other node.
> > > >
> > > > this was recently fixed in CVS.  I had used resource "stop" instead of
> > > > "start" for one of the internal constraints.
> > > >
> > > Thanks for the fast reply.
> > > I checked out the current CVS version (By using the download tarball
> > > function in the web-cvs browser). It compiled fine but I'm still
> > > experiencing the error.
> > >
> > > The BasicSanityCheck returned only one error saying:
> > > pengine[3325]: 2005/10/11_13:33:53 ERROR:
> > > mask(ipc.c:subsystem_msg_dispatch): pengine took 6370ms to complete
> >
> > that one can be ignored for now
> >
> > >
> > > To make sure I didn't do anything wrong with my config:
> > >
> > > Is it "normal" that the first stonith attempt in this scenario fails?
> >
> > before friday, the answer was "probably".
> > since then, the answer is no.
> >
> > so it depends when you updated from CVS last.
> >
> To be sure, I did a 'cvs co linux-ha' at about 2005-10-11-16:50, but
> unfortunately this version didn't wait for stonith to complete
> successfully either. (Is there a mistake in my cib.xml, which I attached
> to the first mail?) And the first stonith try failed again :(
> But it was a little more verbose; it says:
> Oct 11 18:22:57 spock stonithd: [4454]: ERROR: has_this_callid: scenario
> value error.
> Oct 11 18:22:57 spock stonithd: [4454]: info: Failed to STONITH the node
> sarek: optype=1, op_result=2
> Oct 11 18:22:57 spock tengine: [4537]: info:
> mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=sarek,
> result=2, node_list=
> Oct 11 18:22:57 spock tengine: [4537]: ERROR:
> mask(tengine.c:match_down_event): Stonith of
> 5cc75967-9ace-4c9b-9882-670a2be70256 failed (2)... aborting transition.
> Oct 11 18:22:57 spock tengine: [4537]: WARN:
> mask(utils.c:send_complete): 0 - Transition status: Aborted by failed
> action: Stonith failed

you'll need to provide complete logs for this i think.
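
In case it helps when going through the full log before you attach it, here is a rough sketch (not part of heartbeat, just an illustration) that counts ERROR and WARN lines per daemon. It assumes every line has the same shape as the excerpts quoted above ("Oct 11 18:22:57 spock stonithd: [4454]: ERROR: ..."), unwrapped to one message per line. The names parse_line, problem_lines and summarize are made up for this example, and lines in another format (such as the pengine line from the BasicSanityCheck) are simply skipped. Please still attach the complete file; this is only for finding the fencing-related messages in it.

```python
import re
from collections import Counter

# Matches syslog-style lines shaped like the excerpts in this thread, e.g.
# "Oct 11 18:22:57 spock stonithd: [4454]: ERROR: has_this_callid: ..."
# This layout is an assumption taken from the quoted log, not a format spec.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<daemon>\w+):\s+\[(?P<pid>\d+)\]:\s+"
    r"(?P<level>[A-Za-z]+):\s*(?P<text>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def problem_lines(lines, levels=("ERROR", "WARN")):
    """Keep only the parsed records whose severity is in `levels`."""
    records = []
    for line in lines:
        rec = parse_line(line)
        if rec is not None and rec["level"] in levels:
            records.append(rec)
    return records


def summarize(lines):
    """Count ERROR/WARN records per (daemon, level) pair."""
    return Counter((r["daemon"], r["level"]) for r in problem_lines(lines))
```

To use it, pass any iterable of lines, for example an open file object for whatever log file your ha.cf points at, to summarize() and print the resulting counts.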

> > > Is it "normal" that stonithd, ccm and lrmd complain:
> > > Cannot open : No such file or directory ?
> >
> > no
> OK, after using logd the messages disappeared.
> >
> > > Is it "normal" that the following message appears:
> > > Oct 11 13:09:57 sarek crmd: [22363]: WARN: lrm_get_rsc(653): got a
> > > return code HA_FAIL from a reply message of getrsc with function
> > > get_ret_from_msg. ?
> >
> > depends on the context
> Context:

ok - that's normal then.

> Oct 11 18:22:11 spock tengine: [4537]: info:
> mask(tengine.c:initiate_transition): Initating transition
> Oct 11 18:22:11 spock tengine: [4537]: info:
> mask(tengine.c:cib_action_updated): Initiating action 3: monitor
> kill_sarek on spock
> Oct 11 18:22:11 spock tengine: [4537]: info:
> mask(tengine.c:cib_action_updated): Initiating action 4: monitor
> kill_spock on spock
> Oct 11 18:22:11 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
> code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> Oct 11 18:22:11 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
> code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> Oct 11 18:22:11 spock tengine: [4537]: info:
> mask(tengine.c:cib_action_updated): Initiating action 5: monitor
> infobase_rg:infobase_ip on spock
> Oct 11 18:22:12 spock tengine: [4537]: info:
> mask(tengine.c:cib_action_updated): Initiating action 6: monitor
> telebase_rg:telebase_ip on spock
> Oct 11 18:22:12 spock lrmd: [4455]: notice: lrmd_rsc_new(): No
> lrm_rprovider field in message
> Oct 11 18:22:12 spock crmd: [4456]: info: mask(lrm.c:do_lrm_rsc_op):
> Performing op monitor on kill_sarek
> Oct 11 18:22:12 spock tengine: [4537]: info:
> mask(tengine.c:initiate_action): Executing fencing operation (21) on
> sarek
> Oct 11 18:22:13 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
> code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> Oct 11 18:22:13 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
> code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> Oct 11 18:22:13 spock lrmd: [4455]: notice: lrmd_rsc_new(): No
> lrm_rprovider field in message
> Oct 11 18:22:14 spock crmd: [4456]: info: mask(lrm.c:do_lrm_rsc_op):
> Performing op monitor on kill_spock
> Oct 11 18:22:14 spock crmd: [4456]: info: mask(lrm.c:do_lrm_event):
> Confirmed stopped: kill_sarek
> Oct 11 18:22:15 spock crmd: [4456]: info: mask(lrm.c:send_direct_ack):
> NACK'ing resource op: monitor for kill_sarek
> Oct 11 18:22:15 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
> code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> Oct 11 18:22:15 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
> code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> Oct 11 18:22:15 spock crmd: [4456]: info: mask(lrm.c:do_lrm_rsc_op):
> Performing op monitor on infobase_rg:infobase_ip
> Oct 11 18:22:16 spock crmd: [4456]: info: mask(lrm.c:do_lrm_event):
> Confirmed stopped: kill_spock
> Oct 11 18:22:16 spock crmd: [4456]: info: mask(lrm.c:send_direct_ack):
> NACK'ing resource op: monitor for kill_spock
> Oct 11 18:22:17 spock tengine: [4537]: info:
> mask(tengine.c:match_graph_event): Target rc = 7 (7)
> Oct 11 18:22:17 spock tengine: [4537]: info:
> mask(tengine.c:match_graph_event): Target rc: == 7
> Oct 11 18:22:17 spock tengine: [4537]: info:
> mask(tengine.c:match_graph_event): Action 3 confirmed
> Oct 11 18:22:17 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
> code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> Oct 11 18:22:17 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
> code HA_FAIL from a reply message of getrsc with function
> get_ret_from_msg.
> Oct 11 18:22:18 spock crmd: [4456]: info: mask(lrm.c:do_lrm_rsc_op):
> Performing op monitor on telebase_rg:telebase_ip
>
> Many thanks in advance.
>
> Stefan Peinkofer
