[Linux-HA] Problem with STONITH and heartbeat 2

Stefan Peinkofer peinkofe at fhm.edu
Tue Oct 18 10:00:50 MDT 2005


Hello Andrew,

On Tue, 2005-10-18 at 16:51 +0200, Andrew Beekhof wrote:
> On 10/18/05, Andrew Beekhof <beekhof at gmail.com> wrote:
> > On 10/18/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > Hello Andrew,
> > >
> > > On Tue, 2005-10-18 at 14:29 +0200, Andrew Beekhof wrote:
> > > > On 10/18/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > > Hello Andrew,
> > > > >
> > > > > On Mon, 2005-10-17 at 17:14 +0200, Andrew Beekhof wrote:
> > > > > > On 10/17/05, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > > > > > On 10/17/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > > > > > Hello Andrew,
> > > > > > > >
> > > > > > > > I wanted to ask whether you have been able to figure anything out from the logfiles I mailed you yet. Maybe you have some suggestions for what I can try on my side in parallel.
> > > > > > > >
> > > > > > >
> > > > > > > I looked and I think I concluded that IBM China needed to look into it...
> > > > > > > I'll take another look today.
> > > > > >
> > > > > > yes, definitely requires the input from IBM China.
> > > > > >
> > > > > > The logs below indicate that the CRM did its part but the call failed.
> > > > > >
> > > > > > Oct 12 11:47:39 sarek tengine: [23000]: info: mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=spock, result=2, node_list=
> > > > > > Oct 12 11:47:39 sarek tengine: [23000]: ERROR: mask(tengine.c:match_down_event): Stonith of 94e85471-d30d-4b94-aa8e-69c8440361a0 failed (2)... aborting transition.
> > > > > >
> > > > > Maybe you could answer a question for me, so that I can understand
> > > > > this whole issue better.
> > > > >
> > > > > From my point of view it looks as if two independent faults are
> > > > > coming together.
> > > > > Fault 1: The resources get started before a successful stonith
> > > > > has occurred.
> > > >
> > > > As far as the PE knows, the resources aren't running anywhere, so it's
> > > > safe to start them.
> > >
> > > > If you'd like start actions for the resources to wait for STONITHs to
> > > > complete, you'll need to use the "prereq" option for <op
> > > > name="start"...>
> > > >
> > > That is definitely what I want, because this is going to become an
> > > active/active database cluster with two shared FC disks. Having an FC
> > > disk mounted on both nodes at the same time would be deadly.
> > >
> > > The following is in my cib.xml which I supplied with the first mail:
> > > <crm_config>
> > >   <nvpair id="transition_idle_timeout" name="transition_idle_timeout" value="120s"/>
> > >   <nvpair id="symmetric_cluster" name="symmetric_cluster" value="true"/>
> > >   <nvpair id="no_quorum_policy" name="no_quorum_policy" value="freeze"/>
> > >   <nvpair id="stonith_enabled" name="stonith_enabled" value="true"/>
> > > </crm_config>
> > >
> > > According to the cib.xml DTD this implies that start_prereq is fencing.
> > > To make sure the prereq is really fencing I supplied fencing wherever it
> > > was possible in the resource definition (before I wrote my first mail):
> > >
> > > <group id="infobase_rg" on_stopfail="fence" start_prereq="fencing">
> > >   <primitive class="ocf" id="infobase_ip" provider="heartbeat" type="IPaddr" on_stopfail="fence" start_prereq="fencing">
> > >     <operations>
> > >       <op id="1" interval="5s" name="monitor" timeout="5s" on_fail="stop"/>
> > >       <op id="2" name="start" timeout="30s" on_fail="stop" prereq="fencing"/>
> > >       <op id="3" name="stop" timeout="30s" prereq="nothing"/>
> > >     </operations>
> > >     <instance_attributes>
> > >       <attributes>
> > >         <nvpair name="ip" value="10.20.120.205"/>
> > >       </attributes>
> > >     </instance_attributes>
> > >   </primitive>
> > > </group>
> > >
> > > But my problem is that these prereq directives seem to be ignored in
> > > this particular scenario.
> >
> > Apparently I broke this for group resources.
> > I've added a regression test and am trying to make it work now.
> 
> a fix has been committed to CVS
> 
Many thanks. I just tried it out and it works as expected. Now I can
rest easy again :) 
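
For anyone who wants to double-check their own configuration, here is a rough sketch (not part of heartbeat) that scans a CIB resource fragment and lists primitives whose start operation does not explicitly carry prereq="fencing". The element and attribute names are taken from the resource definition quoted above; the function name starts_without_fencing and the trimmed-down fragment are only illustrative assumptions, and the check looks at the explicit attribute only, not at any default the DTD may imply.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the resource definition quoted above (illustrative only).
CIB_FRAGMENT = """
<group id="infobase_rg" on_stopfail="fence" start_prereq="fencing">
  <primitive class="ocf" id="infobase_ip" provider="heartbeat" type="IPaddr">
    <operations>
      <op id="1" interval="5s" name="monitor" timeout="5s" on_fail="stop"/>
      <op id="2" name="start" timeout="30s" on_fail="stop" prereq="fencing"/>
      <op id="3" name="stop" timeout="30s" prereq="nothing"/>
    </operations>
  </primitive>
</group>
"""


def starts_without_fencing(xml_text):
    """Return ids of primitives with no start op that says prereq="fencing".

    Hypothetical helper: it only inspects the explicit prereq attribute on
    <op name="start"> elements and knows nothing about cluster-wide defaults.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for prim in root.iter("primitive"):
        start_ops = [op for op in prim.iter("op") if op.get("name") == "start"]
        if not any(op.get("prereq") == "fencing" for op in start_ops):
            missing.append(prim.get("id"))
    return missing


if __name__ == "__main__":
    # Prints the list of primitive ids that would need a closer look.
    print(starts_without_fencing(CIB_FRAGMENT))
```

This only tells you what is written in the XML. Whether the start really waits for the STONITH is still something to verify on a test cluster, as I did with the CVS fix.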

Btw: Are you guys interested in an OCF resource script for PostgreSQL?
It's currently still beta, but once I'm done with it, it would be a
pleasure for me to donate it to the heartbeat project.

> > Sorry for the inconvenience.
Never mind.

Best regards.

Stefan Peinkofer
