[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH

Alan Robertson alanr at unix.sh
Fri Oct 28 08:55:22 MDT 2005

Andrew Beekhof wrote:
> On 10/28/05, Alan Robertson <alanr at unix.sh> wrote:
>> Sun Jiang Dong wrote:
>>> Alan Robertson wrote:
>>>> Stefan Peinkofer wrote:
>>>>> Hello everybody,
>>>>> unfortunately I have new problems with the heartbeat 2.0.3 cvs version
>>>>> and stonith.
>>>>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
>>>>> encountered a problem with stonithd which was killed by signal 11.
>>>>> The effects were that the stonith resources were NOT_ACTIVE and when I
>>>>> initiated a split brain no node could fence the other off.
>>>>> I thought maybe it was already fixed in cvs and checked out a version
>>>>> today (2005-10-26). But unfortunately this version seems to contain an
>>>>> even worse problem with stonith.
>>>>> After I started heartbeat on the two nodes and waited until it had
>>>>> started up completely, I initiated the split brain situation. I had
>>>>> expected this to work because both stonith resources were active.
>>>>> In the logs I saw:
>>>>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
>>>>> Scheduling Node sarek for STONITH
>>>>> That's what I want :)
>>>>> But then the following message appeared:
>>>>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
>>>>> cannot add field to ha_msg.
>>>> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
>>>>         if (  (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype) != HA_OK )
>>>>             ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) != HA_OK)
>>>>             ||(op->node_uuid == NULL
>>>>                || ha_msg_add(request, F_STONITHD_NODE_UUID, op->node_uuid) != HA_OK)
>>>>             ||(op->private_data == NULL
>>>>                || ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK)
>>>>             ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
>>>>                 != HA_OK) ) {
>>>>                 stdlib_log(LOG_ERR, "stonithd_node_fence: "
>>>>                            "cannot add field to ha_msg.");
>>>>                 ZAPMSG(request);
>>>>                 return ST_FAIL;
>>>>         }
>>>> My guess is that op->node_name or op->optype is NULL.  The code should
>>>> have validated those.  Since they're critical, and they come from
>>>> who-knows-where (meaning some doofus user process), they should
>>>> definitely have been error checked, and there should be a clear
>>>> message about their errors.
>>> It should be the op->private_data == NULL check. That condition is not
>>> reasonable: a NULL private_data makes the whole test true, so the
>>> request fails. I'll fix it.
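
A minimal sketch of the kind of fix discussed above: validate the required
fields up front with their own error messages, and treat a NULL optional
field as "nothing to add" rather than as a failure. Everything here is a
hypothetical stand-in (struct msg, msg_add, msg_add_optional,
build_fence_request, the field names); it is not the heartbeat ha_msg API,
and which fields are really optional is an assumption.

```c
#include <stdio.h>

/* Hypothetical stand-ins for illustration; NOT the heartbeat ha_msg API. */
enum { MSG_OK = 0, MSG_FAIL = -1, MAX_FIELDS = 8 };

struct field { const char *name; const char *value; };
struct msg   { struct field fields[MAX_FIELDS]; int nfields; };

struct fence_op {
    const char *node_name;     /* required */
    const char *node_uuid;     /* assumed optional */
    const char *private_data;  /* assumed optional */
};

/* Append one name/value pair; fails on NULL arguments or a full message. */
static int msg_add(struct msg *m, const char *name, const char *value)
{
    if (m == NULL || name == NULL || value == NULL
        || m->nfields >= MAX_FIELDS)
        return MSG_FAIL;
    m->fields[m->nfields].name = name;
    m->fields[m->nfields].value = value;
    m->nfields++;
    return MSG_OK;
}

/* An absent optional value is not an error; it is simply skipped. */
static int msg_add_optional(struct msg *m, const char *name,
                            const char *value)
{
    return value == NULL ? MSG_OK : msg_add(m, name, value);
}

static int build_fence_request(struct msg *req, const struct fence_op *op)
{
    /* Check required inputs first so each failure gets a clear message. */
    if (req == NULL || op == NULL) {
        fprintf(stderr, "build_fence_request: NULL request or operation\n");
        return MSG_FAIL;
    }
    if (op->node_name == NULL || op->node_name[0] == '\0') {
        fprintf(stderr, "build_fence_request: node_name is missing\n");
        return MSG_FAIL;
    }
    if (msg_add(req, "node", op->node_name) != MSG_OK
        || msg_add_optional(req, "node_uuid", op->node_uuid) != MSG_OK
        || msg_add_optional(req, "pdata", op->private_data) != MSG_OK) {
        fprintf(stderr, "build_fence_request: cannot add field to message\n");
        return MSG_FAIL;
    }
    return MSG_OK;
}
```

With this shape, a request that carries only a node name succeeds, while a
request without a node name is rejected with a message that says which
field was missing.
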
>>>> Things I don't quite understand...
>>>> UUIDs are normally special portable binary values with their own type
>>>> in the structure world...  Having this be a string violates the law of
>>>> least surprise.  If they're not really uuids, then they shouldn't be
>>>> CALLED uuids.
>>> There is a long story regarding this, it's required by Andrew.
>> If Andrew requires you to call something which isn't a UUID a uuid,
>> then he screwed up and he should fix it.
> delightfully tactful as ever.

Untactful, yes.  Delightful, no.  I screwed up.  Again.

> from reading this one would think that it's the first time we've
> had this discussion.

I wasn't sure it was this same issue, and I had (foolishly) hoped that 
it wasn't really still broken.

The project really does use the concept of a UUID.  It is (and has been 
and will continue to be) inappropriate to misuse terminology and/or use 
it in inconsistent ways.  It creates confusion - because that word 
already means something else.  Confusion violates the principle of least 
surprise.

How would you suggest we go about fixing this?

Would it be of value to have a bugzilla for this?
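
To make the binary-versus-string distinction concrete, here is a minimal
sketch of a UUID held as a fixed 16-byte value with its own type, plus a
helper that renders it in the usual 8-4-4-4-12 hex text form. The names
example_uuid_t, example_uuid_equal and example_uuid_unparse are
hypothetical and are not the project's own UUID type or functions.

```c
#include <string.h>

/* Hypothetical type for illustration; not heartbeat's own UUID type. */
typedef struct { unsigned char bytes[16]; } example_uuid_t;

/* Binary UUIDs have a fixed size, so comparison is a plain memcmp. */
static int example_uuid_equal(const example_uuid_t *a,
                              const example_uuid_t *b)
{
    return memcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
}

/* Render as 36 hex/dash characters plus a NUL; out must hold 37 bytes. */
static void example_uuid_unparse(const example_uuid_t *u, char out[37])
{
    static const char hex[] = "0123456789abcdef";
    int i, pos = 0;

    for (i = 0; i < 16; i++) {
        /* Dashes separate the 4-2-2-2-6 byte groups. */
        if (i == 4 || i == 6 || i == 8 || i == 10)
            out[pos++] = '-';
        out[pos++] = hex[u->bytes[i] >> 4];
        out[pos++] = hex[u->bytes[i] & 0x0f];
    }
    out[pos] = '\0';
}
```

An arbitrary string identifier fits neither the fixed-size type nor the
text form, which is why giving it the same name invites confusion.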

     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
