[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH

Andrew Beekhof beekhof at gmail.com
Mon Oct 31 10:18:23 MST 2005


On 10/28/05, Alan Robertson <alanr at unix.sh> wrote:
> Andrew Beekhof wrote:
> > On 10/28/05, Alan Robertson <alanr at unix.sh> wrote:
> >> Sun Jiang Dong wrote:
> >>>
> >>> Alan Robertson wrote:
> >>>> Stefan Peinkofer wrote:
> >>>>
> >>>>> Hello everybody,
> >>>>>
> >>>>> unfortunately I have new problems with the heartbeat 2.0.3 cvs version
> >>>>> and stonith.
> >>>>>
> >>>>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
> >>>>> encountered a problem with stonithd which was killed by signal 11.
> >>>>> The effects were that the stonith resources were NOT_ACTIVE and when I
> >>>>> initiated a split brain no node could fence the other off.
> >>>>>
> >>>>> I thought maybe it was already fixed in cvs and checked out a version
> >>>>> today (2005-10-26). But unfortunately this version seems to contain an
> >>>>> even worse problem with stonith.
> >>>>>
> >>>>> After I started heartbeat on the two nodes and waited until it had
> >>>>> started up completely, I initiated the split brain situation. I had
> >>>>> expected this to work because both stonith resources were active.
> >>>>>
> >>>>> In the logs I saw:
> >>>>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6): Scheduling Node sarek for STONITH
> >>>>> That's what I want :)
> >>>>> But then the following message appeared:
> >>>>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence: cannot add field to ha_msg.
> >>>>
> >>>> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
> >>>>
> >>>>         if (  (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype) != HA_OK)
> >>>>             ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name) != HA_OK)
> >>>>             ||(op->node_uuid == NULL
> >>>>                || ha_msg_add(request, F_STONITHD_NODE_UUID, op->node_uuid) != HA_OK)
> >>>>             ||(op->private_data == NULL
> >>>>                || ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK)
> >>>>             ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout) != HA_OK) ) {
> >>>>                 stdlib_log(LOG_ERR, "stonithd_node_fence: "
> >>>>                            "cannot add field to ha_msg.");
> >>>>                 ZAPMSG(request);
> >>>>                 return ST_FAIL;
> >>>>         }
> >>>>
> >>>> My guess is that op->node_name or op->optype is NULL.  The code should
> >>>> have validated those.  Since they're critical, and they come from
> >>>> who-knows-where (meaning some doofus user process), they should
> >>>> definitely have been error checked, and there should be a clear
> >>>> message about their errors.
> >>>>
> >>> The cause should be op->private_data == NULL. That condition is not
> >>> reasonable. I'll fix it.
> >>>
> >>>> Things I don't quite understand...
> >>>> UUIDs are normally special portable binary values with their own type
> >>>> in the structure world...  Having this be a string violates the law of
> >>>> least surprise.  If they're not really uuids, then they shouldn't be
> >>>> CALLED uuids.
> >>> There is a long story regarding this, it's required by Andrew.
> >>
> >> If Andrew requires you to call something which isn't a UUID a uuid,
> >> then he screwed up and he should fix it.
> >
> > delightfully tactful as ever.
>
> Untactful, yes.  Delightful, no.  I screwed up.  Again.
>
> > from reading this one would think that it's the first time we've
> > had this discussion.
>
> I wasn't sure it was this same issue, and I had (foolishly) hoped that
> it wasn't really still broken.
>
> The project really does use the concept of a UUID.  It is (and has been
> and will continue to be) inappropriate to misuse terminology and/or use
> it in inconsistent ways.  It creates confusion - because that word
> already means something else.  Confusion violates the principle of least
> surprise.
>
> How would you suggest we go about fixing this?

My basic feeling about it is that requiring a uuid_t (rather than a
char*) doesn't help anyone - so there's nothing to fix :-)

Sure, we could use a uuid_t instead; it's just a call to cl_uuid_parse().

But the first thing that the function is going to (or at least should)
do is unparse it into a char* again so they can log what they're about
to do.

So I just don't see the added value of keeping it in one form vs. another.
But on the other hand, I don't actually care so much... if you're that
keen on a uuid_t then we can use that.
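
To make the round trip concrete, here is a minimal sketch of converting a
UUID string to a 16-byte binary value and back again. It does not use
heartbeat's own library: demo_uuid_t, demo_uuid_parse() and
demo_uuid_unparse() are hypothetical stand-ins written for illustration,
and the 36-character hex-and-dash text form is an assumption about what
the string holds.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a binary UUID type (not heartbeat's own). */
typedef struct { unsigned char bytes[16]; } demo_uuid_t;

/* Convert one hex digit to its value, or return -1 if it is not hex. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Parse "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" into 16 bytes.
 * Returns 0 on success and -1 on any malformed input. */
int demo_uuid_parse(const char *in, demo_uuid_t *uu)
{
    size_t i = 0, b = 0;

    if (in == NULL || uu == NULL || strlen(in) != 36) return -1;
    while (i < 36) {
        int hi, lo;
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (in[i] != '-') return -1;   /* dash expected here */
            i++;
            continue;
        }
        hi = hex_val(in[i]);
        lo = hex_val(in[i + 1]);
        if (hi < 0 || lo < 0) return -1;
        uu->bytes[b++] = (unsigned char)(hi * 16 + lo);
        i += 2;
    }
    return 0;
}

/* Write the lowercase text form; out must hold at least 37 bytes. */
void demo_uuid_unparse(const demo_uuid_t *uu, char *out)
{
    static const char digits[] = "0123456789abcdef";
    size_t i, o = 0;

    assert(uu != NULL && out != NULL);
    for (i = 0; i < 16; i++) {
        if (i == 4 || i == 6 || i == 8 || i == 10) out[o++] = '-';
        out[o++] = digits[uu->bytes[i] >> 4];
        out[o++] = digits[uu->bytes[i] & 0x0f];
    }
    out[o] = '\0';
}
```

A caller that takes the binary form and wants to log it has to call the
unparse step first, which is the extra conversion described above. What
the binary form adds in this sketch is validation at the boundary: a
string that is not a well-formed UUID is rejected by the parse step
instead of being passed along.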

Btw. the stonithd doesn't actually use it for anything internally.

> Would it be of value to have a bugzilla for this?

It's about a two-line change in the TE where it calls stonith.

On the other hand, if you want me using uuid_t EVERYWHERE... that's a
different story.
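
For the stonithd_node_fence condition quoted earlier in the thread, the
problem is that a NULL op->private_data (and likewise a NULL
op->node_uuid) makes the whole test true, so the request fails before
anything is sent. The sketch below shows one way such a check might be
restructured: required fields are validated up front with their own
message, and an optional field counts as a failure only when it is
present and cannot be added. This is not the fix that went into cvs.
struct demo_msg and every demo_* name are hypothetical stand-ins rather
than the ha_msg API, and treating node_uuid and private_data as optional
is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_OK         1
#define DEMO_FAIL       0
#define DEMO_MAX_FIELDS 8

/* Hypothetical, minimal message: a fixed list of name/value pairs. */
struct demo_msg {
    const char *names[DEMO_MAX_FIELDS];
    const char *values[DEMO_MAX_FIELDS];
    int nfields;
};

/* Add a required field; a NULL value or a full message is an error. */
int demo_msg_add(struct demo_msg *m, const char *name, const char *value)
{
    if (m == NULL || name == NULL || value == NULL
        || m->nfields >= DEMO_MAX_FIELDS) {
        return DEMO_FAIL;
    }
    m->names[m->nfields] = name;
    m->values[m->nfields] = value;
    m->nfields++;
    return DEMO_OK;
}

/* Add an optional field; an absent (NULL) value is skipped, not an error. */
int demo_msg_add_optional(struct demo_msg *m, const char *name,
                          const char *value)
{
    if (value == NULL) return DEMO_OK;
    return demo_msg_add(m, name, value);
}

/* Build a fence request: node_name is required, the other two are
 * assumed to be optional for the purposes of this sketch. */
int demo_build_fence_request(struct demo_msg *m, const char *node_name,
                             const char *node_uuid,
                             const char *private_data)
{
    assert(m != NULL);
    if (node_name == NULL) {
        fprintf(stderr, "demo_build_fence_request: node_name is NULL\n");
        return DEMO_FAIL;
    }
    if (demo_msg_add(m, "node", node_name) != DEMO_OK
        || demo_msg_add_optional(m, "node_uuid", node_uuid) != DEMO_OK
        || demo_msg_add_optional(m, "pdata", private_data) != DEMO_OK) {
        fprintf(stderr, "demo_build_fence_request: cannot add field\n");
        return DEMO_FAIL;
    }
    return DEMO_OK;
}
```

With this shape, a request for node "sarek" with no uuid and no private
data succeeds and carries a single field, while a NULL node name is
rejected with a message that says which argument was wrong.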

