[Linux-HA] Problem with STONITH and heartbeat 2

Andrew Beekhof beekhof at gmail.com
Tue Oct 18 06:29:33 MDT 2005


On 10/18/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> Hello Andrew,
>
> On Mon, 2005-10-17 at 17:14 +0200, Andrew Beekhof wrote:
> > On 10/17/05, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > On 10/17/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> > > > Hello Andrew,
> > > >
> > > > I wanted to ask whether you have been able to figure anything out from the logfiles I mailed you yet. Maybe you have some suggestions for what I can try on my side in parallel.
> > > >
> > >
> > > I looked and I think I concluded that IBM China needed to look into it...
> > > I'll take another look today.
> >
> > yes, definitely requires the input from IBM China.
> >
> > The logs below indicate that the CRM did its part but the call failed.
> >
> > Oct 12 11:47:39 sarek tengine: [23000]: info:
> > mask(callbacks.c:tengine_stonith_callback): optype=1, node_name=spock,
> > result=2, node_list=
> > Oct 12 11:47:39 sarek tengine: [23000]: ERROR:
> > mask(tengine.c:match_down_event): Stonith of
> > 94e85471-d30d-4b94-aa8e-69c8440361a0 failed (2)... aborting
> > transition.
> >
> Maybe you can answer a question for me, so I can understand this
> whole issue better.
>
> From my point of view, it looks like two independent faults are coming
> together.
> Fault 1: The resources get started before a successful STONITH has
> occurred.

As far as the PE knows, the resources aren't running anywhere, so it's
safe to start them.

If you'd like the start actions for the resources to wait for STONITH
operations to complete, you'll need to use the "prereq" option on the
<op name="start" .../> definition.
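
A minimal sketch of what that might look like in the CIB, based on the
resource names in your logs (the ids are illustrative, and the
surrounding group definition for infobase_rg is omitted for brevity):

  <primitive id="infobase_ip" class="ocf" provider="heartbeat" type="IPaddr">
    <operations>
      <!-- prereq="fencing": do not start this resource until any
           pending STONITH operations have completed -->
      <op id="infobase_ip_start" name="start" timeout="30s" prereq="fencing"/>
    </operations>
  </primitive>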

Having more than 2 nodes would also prevent this (in a cluster with 2
partitions of 2 nodes each, neither partition will have quorum).
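
Relatedly, the behaviour of a partition that has no quorum is governed
by the no_quorum_policy cluster option. A minimal sketch of how that
might be set in the CIB (the nvpair id is made up; check the exact
option name and default for your CRM version):

  <crm_config>
    <cluster_property_set id="cib-bootstrap-options">
      <attributes>
        <!-- "stop": a partition without quorum stops its resources
             rather than starting new ones -->
        <nvpair id="opt-no-quorum-policy" name="no_quorum_policy" value="stop"/>
      </attributes>
    </cluster_property_set>
  </crm_config>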

> Fault 2: The first STONITH attempt fails.
> Is this assumption right, from your point of view?

Yes, it fails.

On further inspection, the reason for this is that no resources
(particularly the fencing ones) are active yet.
The PE does not and should not know which resource will be used to
perform the STONITH, so it cannot avoid the first STONITH failure in
this scenario.

However, we do make progress and eventually shoot "spock", as seen here:
Oct 12 11:48:16 sarek stonithd: [22925]: info: Succeeded to STONITH
the node spock: optype=1. whodoit: sarek

But at the time the first STONITH fails, we don't know that, so we log an ERROR.

BTW, you should probably update to avoid the following bug that was
fixed in CVS:

Oct 12 11:47:44 sarek pengine: [23001]: notice:
mask(unpack.c:unpack_lrm_rsc_state): Forcing restart of
infobase_rg:infobase_ip on sarek, type changed: <null> -> IPaddr
Oct 12 11:47:44 sarek pengine: [23001]: notice:
mask(unpack.c:unpack_lrm_rsc_state): Forcing restart of
infobase_rg:infobase_ip on sarek, class changed: <null> -> ocf
Oct 12 11:47:44 sarek pengine: [23001]: notice:
mask(unpack.c:unpack_lrm_rsc_state): Forcing restart of
infobase_rg:infobase_ip on sarek, provider changed: <null> ->
heartbeat

The CRM now updates the CIB correctly to prevent this.

>
> Many thanks in advance.
>
> Stefan Peinkofer
>


