[Linux-HA] Problem with STONITH and heartbeat 2

Stefan Peinkofer peinkofe at fhm.edu
Tue Oct 11 10:43:36 MDT 2005


Hello,

On Tue, 2005-10-11 at 14:34 +0200, Andrew Beekhof wrote:
> On 10/11/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > Hello,
> >
> > On Sat, 2005-10-08 at 21:43 +0200, Andrew Beekhof wrote:
> > > On 10/7/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > Hello everybody,
> > > >
> > > > I have a weird problem with heartbeat 2 (crm enabled) and stonith.
> > > >
> > > > I set up a two-node postgresql cluster with two wti_nps STONITH devices.
> > > > When I start heartbeat on both nodes and after that initiate a split
> > > > brain situation by pulling the cluster interconnect cables, everything
> > > > works fine and one host gets STONITHed.
> > > >
> > > > But if I bring heartbeat down on one node, initiate the split brain
> > > > situation and start heartbeat on the node again, it starts the resources
> > > > before it could successfully stonith the other node.
> > >
> > > this was recently fixed in CVS.  I had used resource "stop" instead of
> > > "start" for one of the internal constraints.
> > >
> > Thanks for the fast reply.
> > I checked out the current CVS version (using the "download tarball"
> > function in the web-cvs browser). It compiled fine, but I'm still
> > experiencing the error.
> >
> > The BasicSanityCheck returned only one error saying:
> > pengine[3325]: 2005/10/11_13:33:53 ERROR:
> > mask(ipc.c:subsystem_msg_dispatch): pengine took 6370ms to complete
> 
> that one can be ignored for now
> 
> >
> > To make sure I didn't do anything wrong with my config:
> >
> > Is it "normal" that the first stonith attempt in this scenario fails?
> 
> before friday, the answer was "probably".
> since then, the answer is no.
> 
> so it depends when you updated from CVS last.
> 
To be sure, I did a 'cvs co linux-ha' at about 2005-10-11 16:50, but
unfortunately this version didn't wait for stonith to complete
successfully either. (Is there a mistake in the cib.xml I attached to
the first mail?) And the first stonith attempt failed again :(
But this time the log was a little more verbose. It says:
Oct 11 18:22:57 spock stonithd: [4454]: ERROR: has_this_callid: scenario
value error.
Oct 11 18:22:57 spock stonithd: [4454]: info: Failed to STONITH the node
sarek: optype=1, op_result=2
Oct 11 18:22:57 spock tengine: [4537]: info:
mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=sarek,
result=2, node_list=
Oct 11 18:22:57 spock tengine: [4537]: ERROR:
mask(tengine.c:match_down_event): Stonith of
5cc75967-9ace-4c9b-9882-670a2be70256 failed (2)... aborting transition.
Oct 11 18:22:57 spock tengine: [4537]: WARN:
mask(utils.c:send_complete): 0 - Transition status: Aborted by failed
action: Stonith failed

> > Is it "normal" that stonithd, ccm and lrmd complain:
> > Cannot open : No such file or directory ?
> 
> no
OK, after I started using logd the messages disappeared.
> 
> > Is it "normal" that the following message appears:
> > Oct 11 13:09:57 sarek crmd: [22363]: WARN: lrm_get_rsc(653): got a
> > return code HA_FAIL from a reply message of getrsc with function
> > get_ret_from_msg. ?
> 
> depends on the context
Context:
Oct 11 18:22:11 spock tengine: [4537]: info:
mask(tengine.c:initiate_transition): Initating transition
Oct 11 18:22:11 spock tengine: [4537]: info:
mask(tengine.c:cib_action_updated): Initiating action 3: monitor
kill_sarek on spock
Oct 11 18:22:11 spock tengine: [4537]: info:
mask(tengine.c:cib_action_updated): Initiating action 4: monitor
kill_spock on spock
Oct 11 18:22:11 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
code HA_FAIL from a reply message of getrsc with function
get_ret_from_msg.
Oct 11 18:22:11 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
code HA_FAIL from a reply message of getrsc with function
get_ret_from_msg.
Oct 11 18:22:11 spock tengine: [4537]: info:
mask(tengine.c:cib_action_updated): Initiating action 5: monitor
infobase_rg:infobase_ip on spock
Oct 11 18:22:12 spock tengine: [4537]: info:
mask(tengine.c:cib_action_updated): Initiating action 6: monitor
telebase_rg:telebase_ip on spock
Oct 11 18:22:12 spock lrmd: [4455]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Oct 11 18:22:12 spock crmd: [4456]: info: mask(lrm.c:do_lrm_rsc_op):
Performing op monitor on kill_sarek
Oct 11 18:22:12 spock tengine: [4537]: info:
mask(tengine.c:initiate_action): Executing fencing operation (21) on
sarek
Oct 11 18:22:13 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
code HA_FAIL from a reply message of getrsc with function
get_ret_from_msg.
Oct 11 18:22:13 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
code HA_FAIL from a reply message of getrsc with function
get_ret_from_msg.
Oct 11 18:22:13 spock lrmd: [4455]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Oct 11 18:22:14 spock crmd: [4456]: info: mask(lrm.c:do_lrm_rsc_op):
Performing op monitor on kill_spock
Oct 11 18:22:14 spock crmd: [4456]: info: mask(lrm.c:do_lrm_event):
Confirmed stopped: kill_sarek
Oct 11 18:22:15 spock crmd: [4456]: info: mask(lrm.c:send_direct_ack):
NACK'ing resource op: monitor for kill_sarek
Oct 11 18:22:15 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
code HA_FAIL from a reply message of getrsc with function
get_ret_from_msg.
Oct 11 18:22:15 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
code HA_FAIL from a reply message of getrsc with function
get_ret_from_msg.
Oct 11 18:22:15 spock crmd: [4456]: info: mask(lrm.c:do_lrm_rsc_op):
Performing op monitor on infobase_rg:infobase_ip
Oct 11 18:22:16 spock crmd: [4456]: info: mask(lrm.c:do_lrm_event):
Confirmed stopped: kill_spock
Oct 11 18:22:16 spock crmd: [4456]: info: mask(lrm.c:send_direct_ack):
NACK'ing resource op: monitor for kill_spock
Oct 11 18:22:17 spock tengine: [4537]: info:
mask(tengine.c:match_graph_event): Target rc = 7 (7)
Oct 11 18:22:17 spock tengine: [4537]: info:
mask(tengine.c:match_graph_event): Target rc: == 7
Oct 11 18:22:17 spock tengine: [4537]: info:
mask(tengine.c:match_graph_event): Action 3 confirmed
Oct 11 18:22:17 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
code HA_FAIL from a reply message of getrsc with function
get_ret_from_msg.
Oct 11 18:22:17 spock crmd: [4456]: WARN: lrm_get_rsc(653): got a return
code HA_FAIL from a reply message of getrsc with function
get_ret_from_msg.
Oct 11 18:22:18 spock crmd: [4456]: info: mask(lrm.c:do_lrm_rsc_op):
Performing op monitor on telebase_rg:telebase_ip
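
Since the mailer wraps the log lines above, it may help to re-join them
before searching for one daemon's messages. The following is only a
minimal sketch in Python, not part of heartbeat: the function names
join_wrapped and by_daemon are made up for illustration, and it assumes
every record starts with a syslog-style "Oct 11 18:22:57 spock " prefix
as in the excerpts above.

```python
import re

# Assumption: a record starts like "Oct 11 18:22:57 spock " (month, day,
# time, host), as in the log excerpts quoted in this mail.
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \S+ ")


def join_wrapped(lines):
    """Merge mail-wrapped continuation lines back into one record each."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if RECORD_START.match(line) or not records:
            records.append(line)        # a new record begins here
        else:
            records[-1] += " " + line   # continuation of the previous record
    return records


def by_daemon(records, daemon):
    """Keep only records logged by the given daemon, e.g. 'stonithd'."""
    # Matches the "stonithd: [4454]" form seen above; the "stonithd[4454]"
    # form is accepted as well.
    pattern = re.compile(rf" {re.escape(daemon)}:? ?\[\d+\]")
    return [r for r in records if pattern.search(r)]


if __name__ == "__main__":
    import sys

    # Usage: python joinlog.py stonithd < ha-log-excerpt.txt
    for record in by_daemon(join_wrapped(sys.stdin), sys.argv[1]):
        print(record)
```

Run with "stonithd" or "tengine" as the argument, this prints each
matching message on a single line, which makes the failed fencing
attempt easier to follow than in the wrapped form.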

Many thanks in advance.

Stefan Peinkofer
