[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH

Stefan Peinkofer peinkofe at fhm.edu
Wed Oct 26 10:32:19 MDT 2005


Hello Alan,
On Wed, 2005-10-26 at 10:14 -0600, Alan Robertson wrote:
> Stefan Peinkofer wrote:
> > Hello everybody,
> > 
> > Unfortunately, I have new problems with the heartbeat 2.0.3 cvs version
> > and stonith.
> > 
> > I ran a cvs heartbeat which was checked out on 2005-10-18 and
> > encountered a problem with stonithd which was killed by signal 11.
> > The effects were that the stonith resources were NOT_ACTIVE and when I
> > initiated a split brain no node could fence the other off.
> > 
> > I thought maybe it was already fixed in cvs and checked out a version today
> > (2005-10-26). But unfortunately this version seems to contain an even
> > worse problem with stonith.
> > 
> > 
> > After I started heartbeat on the two nodes and waited until it had started
> > up completely, I initiated the split brain situation. I had expected
> > this to work because both stonith resources were active.
> > 
> > In the logs I saw:
> > Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
> > Scheduling Node sarek for STONITH
> > That's what I want :)
> > But then the following message appeared:
> > Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
> > cannot add field to ha_msg.
> > 
> > And no node kills the other. They try it over and over again, but it
> > always fails with the above message.
> > 
> > I have attached the complete logfile of the DC. As well as my ha.cf and
> > the cib.xml.
> > Note that both nodes have the problem.
> > 
> > My system: two RHEL 4 Update 2 Kernel 2.6.0-11ELsmp
> > 2 wti_nps power switches.
> 
> IIRC, we used to see the signal 11 stuff in our testing a few months ago,
> but it went away - so we couldn't fix it.

> Can you get us the stack trace from the core dump from this occurrence?
> 
Sorry, my problem description may be ambiguous. I'm talking about two
presumably independent problems. Problem 1 is the 'killed by signal 11'
problem. That was the reason why I updated my heartbeat to a more recent
cvs version. Unfortunately I didn't keep the logs of this problem
(because I wanted to use the more recent cvs version to provide logs and
stuff).
Problem 2 is the problem with 'cannot add field to ha_msg' and it
appeared with the more recent cvs version. The logs attached are for
Problem 2. I will be able to provide logs, cores and stuff for Problem 1
once Problem 2 is fixed (since it takes place before Problem 1 occurs).
I hope I did a better job this time.

Many thanks in advance.
Stefan Peinkofer
> It's odd that the monitoring of the STONITH objects didn't detect that 
> they weren't running any more.  Guess we'll have to look at the logs 
> more closely.
> 