[Linux-HA] Problem with STONITH and heartbeat 2

Andrew Beekhof beekhof at gmail.com
Mon Oct 17 09:14:33 MDT 2005


On 10/17/05, Andrew Beekhof <beekhof at gmail.com> wrote:
> On 10/17/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > Hello Andrew,
> >
> > I wanted to ask whether you have been able to figure anything out from the logfiles I mailed you. Maybe you have some suggestions for what I can try on my side in parallel.
> >
>
> I looked and I think I concluded that IBM China needed to look into it...
> I'll take another look today.

Yes, this definitely requires input from IBM China.

The logs below indicate that the CRM did its part, but the STONITH call itself failed (result=2).

Oct 12 11:47:39 sarek tengine: [23000]: info: mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=spock, result=2, node_list=
Oct 12 11:47:39 sarek tengine: [23000]: ERROR: mask(tengine.c:match_down_event): Stonith of 94e85471-d30d-4b94-aa8e-69c8440361a0 failed (2)... aborting transition.
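
If it helps when going through the complete logs, here is a minimal sketch in Python that pulls the tengine_stonith_callback entries out of a logfile and lists the ones that did not succeed. The names StonithResult, find_stonith_results and failed_results are made up for this illustration and are not part of heartbeat. The pattern is derived only from the log lines quoted in this thread, and treating result=0 as success is an assumption on my part.

```python
import re
from typing import List, NamedTuple


class StonithResult(NamedTuple):
    """One tengine_stonith_callback log entry (hypothetical helper type)."""
    host: str       # node that wrote the log line
    optype: int     # optype= field, copied as-is
    node_name: str  # node that was supposed to be fenced
    result: int     # result= field, copied as-is


# Pattern derived from the log lines quoted in this thread only.
# \s+ also matches newlines, so entries wrapped by a mail client still match.
CALLBACK_RE = re.compile(
    r"(?P<host>\S+)\s+tengine:\s+\[\d+\]:\s+info:\s+"
    r"mask\(callbacks\.c:tengine_stonith_callback\):\s+"
    r"optype=(?P<optype>\d+),\s+node_name=(?P<node>[^,\s]+),\s+"
    r"result=(?P<result>\d+)"
)

# Excerpt from this thread, wrapped the way it arrived by mail.
SAMPLE = """\
Oct 12 11:47:39 sarek tengine: [23000]: info:
mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=spock,
result=2, node_list=
Oct 12 11:47:39 sarek tengine: [23000]: ERROR:
mask(tengine.c:match_down_event): Stonith of
94e85471-d30d-4b94-aa8e-69c8440361a0 failed (2)... aborting
transition.
"""


def find_stonith_results(log_text: str) -> List[StonithResult]:
    """Return every STONITH callback entry found in log_text, in order."""
    return [
        StonithResult(m["host"], int(m["optype"]), m["node"], int(m["result"]))
        for m in CALLBACK_RE.finditer(log_text)
    ]


def failed_results(results: List[StonithResult]) -> List[StonithResult]:
    """Keep entries with a nonzero result (ASSUMPTION: 0 means success)."""
    return [r for r in results if r.result != 0]


if __name__ == "__main__":
    for entry in failed_results(find_stonith_results(SAMPLE)):
        print(f"{entry.host}: STONITH of {entry.node_name} "
              f"returned result={entry.result}")
```

To run it against a real logfile, read the file into a string and pass that to find_stonith_results() instead of SAMPLE. It only matches the callback line format shown here, so other heartbeat versions may need a different pattern.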

>
> > Many thanks in advance.
> > Stefan Peinkofer
> > On Wed, Oct 12, 2005 at 12:06:40PM +0200, Stefan Peinkofer wrote:
> > > On Wed, 2005-10-12 at 08:27 +0200, Andrew Beekhof wrote:
> > > > On 10/11/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > > Hello,
> > > > >
> > > > > On Tue, 2005-10-11 at 14:34 +0200, Andrew Beekhof wrote:
> > > > > > On 10/11/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > On Sat, 2005-10-08 at 21:43 +0200, Andrew Beekhof wrote:
> > > > > > > > On 10/7/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > > > > > > Hello everybody,
> > > > > > > > >
> > > > > > > > > I have a weird problem with heartbeat 2 (crm enabled) and stonith.
> > > > > > > > >
> > > > > > > > > > I set up a two-node postgresql cluster with two wti_nps STONITH devices.
> > > > > > > > > > When I start heartbeat on both nodes and after that initiate a split
> > > > > > > > > > brain situation by pulling the cluster interconnect cables, everything
> > > > > > > > > > works fine and one host gets STONITHed.
> > > > > > > > >
> > > > > > > > > But if I bring heartbeat down on one node, initiate the split brain
> > > > > > > > > situation and start heartbeat on the node again, it starts the resources
> > > > > > > > > before it could successfully stonith the other node.
> > > > > > > >
> > > > > > > > this was recently fixed in CVS.  I had used resource "stop" instead of
> > > > > > > > "start" for one of the internal constraints.
> > > > > > > >
> > > > > > > Thanks for the fast reply.
> > > > > > > > I checked out the current CVS version (by using the download tarball
> > > > > > > > function in the web-cvs browser). It compiled fine but I'm still
> > > > > > > > experiencing the error.
> > > > > > >
> > > > > > > The BasicSanityCheck returned only one error saying:
> > > > > > > > pengine[3325]: 2005/10/11_13:33:53 ERROR: mask(ipc.c:subsystem_msg_dispatch): pengine took 6370ms to complete
> > > > > >
> > > > > > that one can be ignored for now
> > > > > >
> > > > > > >
> > > > > > > > To make sure I didn't do anything wrong with my config:
> > > > > > > >
> > > > > > > > Is it "normal" that the first stonith attempt in this scenario fails?
> > > > > >
> > > > > > before friday, the answer was "probably".
> > > > > > since then, the answer is no.
> > > > > >
> > > > > > so it depends when you updated from CVS last.
> > > > > >
> > > > > > To be sure, I did a 'cvs co linux-ha' at about 2005-10-11-16:50, but
> > > > > > unfortunately this version didn't wait for stonith to complete
> > > > > > successfully either. (Is there a mistake in my cib.xml, which I attached
> > > > > > to the first mail?) And the first stonith try failed again :(
> > > > > > But it was a little bit more verbose; it says:
> > > > > > Oct 11 18:22:57 spock stonithd: [4454]: ERROR: has_this_callid: scenario value error.
> > > > > > Oct 11 18:22:57 spock stonithd: [4454]: info: Failed to STONITH the node sarek: optype=1, op_result=2
> > > > > > Oct 11 18:22:57 spock tengine: [4537]: info: mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=sarek, result=2, node_list=
> > > > > > Oct 11 18:22:57 spock tengine: [4537]: ERROR: mask(tengine.c:match_down_event): Stonith of 5cc75967-9ace-4c9b-9882-670a2be70256 failed (2)... aborting transition.
> > > > > > Oct 11 18:22:57 spock tengine: [4537]: WARN: mask(utils.c:send_complete): 0 - Transition status: Aborted by failed action: Stonith failed
> > > >
> > > > you'll need to provide complete logs for this i think.
> > > >
> > > > OK, I attached the logfile; it contains the log of the whole procedure:
> > > > Starting up heartbeat with ClusterInterconnect with the cib.xml I wrote.
> > > > Stopping heartbeat on sarek. Initiate split brain.
> > > > Starting up heartbeat on sarek again.
> > >
> > > JFYI: I tried something else yesterday. I started heartbeat on both
> > > nodes with ClusterInterconnect. After the resources were running as
> > > > expected, I initiated a split brain situation and disabled the path to
> > > the stonith device on the DC. Then everything worked fine. The other
> > > node stonithed the DC and started the resource of the DC after the DC
> > > was dead. (I think it waited with this until the transition timeout!?)
> > >
> > > Many thanks in advance.
> > >
> > > Stefan Peinkofer
> >
> >
> >
> >
> >
> > > _______________________________________________
> > > Linux-HA mailing list
> > > Linux-HA at lists.linux-ha.org
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
> >
> >
>
>


