[Linux-HA] Problem with STONITH and heartbeat 2

Andrew Beekhof beekhof at gmail.com
Tue Oct 18 08:51:54 MDT 2005


On 10/18/05, Andrew Beekhof <beekhof at gmail.com> wrote:
> On 10/18/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > Hello Andrew,
> >
> > On Tue, 2005-10-18 at 14:29 +0200, Andrew Beekhof wrote:
> > > On 10/18/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > Hello Andrew,
> > > >
> > > > On Mon, 2005-10-17 at 17:14 +0200, Andrew Beekhof wrote:
> > > > > On 10/17/05, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > > > > On 10/17/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > > > > Hello Andrew,
> > > > > > >
> > > > > > > I wanted to ask whether you have been able to figure anything out from the logfiles I mailed you yet. Maybe you have some suggestions for what I can try on my side in parallel.
> > > > > > >
> > > > > >
> > > > > > I looked and I think I concluded that IBM China needed to look into it...
> > > > > > I'll take another look today.
> > > > >
> > > > > yes, definitely requires the input from IBM China.
> > > > >
> > > > > The logs below indicate that the CRM did its part but the call failed.
> > > > >
> > > > > Oct 12 11:47:39 sarek tengine: [23000]: info:
> > > > > mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=spock,
> > > > > result=2, node_list=
> > > > > Oct 12 11:47:39 sarek tengine: [23000]: ERROR:
> > > > > mask(tengine.c:match_down_event): Stonith of
> > > > > 94e85471-d30d-4b94-aa8e-69c8440361a0 failed (2)... aborting
> > > > > transition.
> > > > >
> > > > maybe you could answer a question for me, so I can understand this
> > > > whole issue better.
> > > >
> > > > From my point of view it looks like two independent faults are
> > > > coming together.
> > > > Fault 1: The resources get started before a successful STONITH
> > > > occurred.
> > >
> > > As far as the PE knows, the resources aren't running anywhere, so it's
> > > safe to start them.
> >
> > > If you'd like start actions for the resources to wait for STONITHs to
> > > complete - you'll need to use the "prereq" option for <op
> > > name="start" ...>
> > >
> > that's exactly what I want, because this should become an
> > active/active database cluster with two shared FC disks. Having an FC
> > disk mounted on both nodes at the same time would be deadly.
> >
> > The following is in my cib.xml which I supplied with the first mail:
> > <crm_config>
> >     <nvpair id="transition_idle_timeout" name="transition_idle_timeout" value="120s"/>
> >     <nvpair id="symmetric_cluster" name="symmetric_cluster" value="true"/>
> >     <nvpair id="no_quorum_policy" name="no_quorum_policy" value="freeze"/>
> >     <nvpair id="stonith_enabled" name="stonith_enabled" value="true"/>
> > </crm_config>
> >
> > According to the cib.xml DTD, this implies that start_prereq is fencing.
> > To make sure the prereq really is fencing, I specified fencing wherever
> > possible in the resource definition (before I wrote my first mail):
> >
> > <group id="infobase_rg" on_stopfail="fence" start_prereq="fencing">
> >     <primitive class="ocf" id="infobase_ip" provider="heartbeat"
> >             type="IPaddr" on_stopfail="fence" start_prereq="fencing">
> >         <operations>
> >             <op id="1" interval="5s" name="monitor" timeout="5s" on_fail="stop"/>
> >             <op id="2" name="start" timeout="30s" on_fail="stop" prereq="fencing"/>
> >             <op id="3" name="stop" timeout="30s" prereq="nothing"/>
> >         </operations>
> >         <instance_attributes>
> >             <attributes>
> >                 <nvpair name="ip" value="10.20.120.205"/>
> >             </attributes>
> >         </instance_attributes>
> >     </primitive>
> > </group>
> >
> > But my problem is that these prereq directives seem to be ignored in
> > this particular scenario.
>
> Apparently I broke this for group resources.
> I've added a regression test and am trying to make it work now.

a fix has been committed to CVS
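For anyone hitting the same issue: the two pieces that have to line up for starts to wait on fencing are the cluster-wide stonith_enabled option and a fencing prerequisite on the start operation of each resource. A minimal illustrative fragment follows - the ids and the example resource are made up, not taken from Stefan's cluster, and only sketch the shape described in the thread above:

```xml
<!-- cluster-wide: fencing must be enabled for prereq="fencing" on ops to have any effect -->
<crm_config>
    <nvpair id="stonith_enabled" name="stonith_enabled" value="true"/>
</crm_config>

<!-- per-resource: the start op should not run until any pending STONITH completes -->
<primitive class="ocf" id="example_rsc" provider="heartbeat" type="IPaddr">
    <operations>
        <op id="example_start" name="start" timeout="30s" prereq="fencing"/>
        <op id="example_stop" name="stop" timeout="30s" prereq="nothing"/>
    </operations>
</primitive>
```

As Stefan's configuration shows, start_prereq="fencing" can also be set on the group and primitive elements themselves; the per-op prereq is the form Andrew points at above. For resources inside a group, this is only honored once the CVS fix mentioned here is applied.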

>
> Sorry for the inconvenience.
>
> >
> > Many thanks in advance.
> >
> > Stefan Peinkofer
> >
> > > Having more than 2 nodes would also prevent this (in a cluster with 2
> > > partitions of 2 nodes each - neither partition will have quorum).
> > >
> > > > Fault 2: The first STONITH attempt fails.
> > > > Is this assumption right, from your point of view?
> > >
> > > yes, it fails.
> > >
> > > on further inspection, the reason for this is that no resources
> > > (particularly the fencing ones) are active yet.
> > > the PE does not, and should not, know which resource will be used to
> > > perform the STONITH, so it cannot avoid the first STONITH failure in
> > > this scenario.
> > >
> > > however, we do make progress and eventually shoot "spock" as seen here:
> > > Oct 12 11:48:16 sarek stonithd: [22925]: info: Succeeded to STONITH
> > > the node spock: optype=1. whodoit: sarek
> > >
> > > but at the time the first STONITH fails, we don't know that - so we log an ERROR.
> > >
> > > btw. you should probably update to avoid the following bug that was
> > > fixed in CVS:
> > >
> > > Oct 12 11:47:44 sarek pengine: [23001]: notice:
> > > mask(unpack.c:unpack_lrm_rsc_state): Forcing restart of
> > > infobase_rg:infobase_ip on sarek, type changed: <null> -> IPaddr
> > > Oct 12 11:47:44 sarek pengine: [23001]: notice:
> > > mask(unpack.c:unpack_lrm_rsc_state): Forcing restart of
> > > infobase_rg:infobase_ip on sarek, class changed: <null> -> ocf
> > > Oct 12 11:47:44 sarek pengine: [23001]: notice:
> > > mask(unpack.c:unpack_lrm_rsc_state): Forcing restart of
> > > infobase_rg:infobase_ip on sarek, provider changed: <null> ->
> > > heartbeat
> > >
> > > The CRM now updates the CIB correctly to prevent this.
> > >
> > > >
> > > > Many thanks in advance.
> > > >
> > > > Stefan Peinkofer
> > > >
> > > >
> > > > _______________________________________________
> > > > Linux-HA mailing list
> > > > Linux-HA at lists.linux-ha.org
> > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > See also: http://linux-ha.org/ReportingProblems
> > > >
> > > >
> >
> >
> >
> >
> >
> >
>


