[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH

Andrew Beekhof beekhof at gmail.com
Wed Oct 26 12:27:30 MDT 2005


On 10/26/05, Stefan Peinkofer <peinkofe at fhm.edu> wrote:
> Hello Alan,
> On Wed, 2005-10-26 at 10:14 -0600, Alan Robertson wrote:
> > Stefan Peinkofer wrote:
> > > Hello everybody,
> > >
> > > unfortunately I have new problems with the heartbeat 2.0.3 cvs version
> > > and stonith.
> > >
> > > I ran a cvs heartbeat which was checked out on 2005-10-18 and
> > > encountered a problem with stonithd which was killed by signal 11.
> > > The effects were that the stonith resources were NOT_ACTIVE and when I
> > > initiated a split brain no node could fence the other off.
> > >
> > > I thought maybe it was already fixed in cvs and checked out a version today
> > > (2005-10-26). But unfortunately this version seems to contain an even
> > > worse problem with stonith.
> > >
> > >
> > > After I started heartbeat on the two nodes and waited until it had come
> > > up completely, I initiated the split brain situation. I had expected
> > > this to work because both stonith resources were active.
> > >
> > > In the logs I saw:
> > > Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> > > Scheduling Node sarek for STONITH
> > > That's what I want :)
> > > But then the following message appeared:
> > > Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> > > cannot add field to ha_msg.
> > >
> > > And no node kills the other. They try it over and over again, but it
> > > always fails with the above message.
> > >
> > > I have attached the complete logfile of the DC. As well as my ha.cf and
> > > the cib.xml.
> > > Note that both nodes have the problem.
> > >
> > > My system: two RHEL 4 Update 2 Kernel 2.6.0-11ELsmp
> > > 2 wti_nps power switches.
> >
> > IIRC, we used to see the signal 11 stuff in our testing a few months ago,
> > but it went away - so we couldn't fix it.
>
> > Can you get us the stack trace from the core dump from this occurrence?
> >
> Sorry, my problem description may be ambiguous. I'm talking about two
> presumably independent problems. Problem 1 is the 'killed by signal 11'
> problem. That was the reason why I updated my heartbeat to a more recent
> cvs version. Unfortunately I didn't keep the logs of this problem.
> (Because I wanted to use the more recent cvs version to provide logs and
> stuff)
> Problem 2 is the problem with 'cannot add field to ha_msg' and it
> appeared with the more recent cvs version. The logs attached are for
> Problem 2. I will be able to provide logs, cores and stuff for Problem 1
> once Problem 2 is fixed (since it takes place before Problem 1 occurs).
> I hope I did a better job this time.

I believe IBM China fixed Problem 1 in CVS a while back - or maybe
this is a different problem with the same symptom.

The Problem 2 ERRORs indicate an internal stonithd problem (rather
than a CRM one).

>
> Many thanks in advance.
> Stefan Peinkofer
> > It's odd that the monitoring of the STONITH objects didn't detect that
> > they weren't running any more.  Guess we'll have to look at the logs
> > more closely.
> >