[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH

Alan Robertson alanr at unix.sh
Fri Oct 28 07:41:57 MDT 2005

Sun Jiang Dong wrote:
> Alan Robertson wrote:
>> Stefan Peinkofer wrote:
>>> Hello everybody,
>>> unfortunately I have new problems with the heartbeat 2.0.3 cvs version
>>> and stonith.
>>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
>>> encountered a problem with stonithd which was killed by signal 11.
>>> The effects were that the stonith resources were NOT_ACTIVE and when I
>>> initiated a split brain no node could fence the other off.
>>> I thought maybe it was already fixed in cvs and checked out a version today
>>> (2005-10-26). But unfortunately this version seems to contain an even
>>> worse problem with stonith.
>>> After I started heartbeat on the two nodes and waited until it had started
>>> up completely, I initiated the split brain situation. I had expected
>>> this to work because both stonith resources were active.
>>> In the logs I saw:
>>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
>>> Scheduling Node sarek for STONITH
>>> That's what I want :)
>>> But then the following message appeared:
>>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
>>> cannot add field to ha_msg.
>> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
>>         if (  (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype) != HA_OK )
>>             ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) != HA_OK)
>>             ||(op->node_uuid == NULL
>>                || ha_msg_add(request, F_STONITHD_NODE_UUID, op->node_uuid) != HA_OK)
>>             ||(op->private_data == NULL
>>                || ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK)
>>             ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
>>                 != HA_OK) ) {
>>                 stdlib_log(LOG_ERR, "stonithd_node_fence: "
>>                            "cannot add field to ha_msg.");
>>                 ZAPMSG(request);
>>                 return ST_FAIL;
>>         }
>> My guess is that op->node_name or op->optype is NULL.  The code should 
>> have validated those.  Since they're critical, and they come from 
>> who-knows-where (meaning some doofus user process), they should 
>> definitely have been error checked, and there should be a clear 
>> message about their errors.
> The cause should be the op->private_data == NULL check. That condition is
> not reasonable. I'll fix it.
>> Things I don't quite understand...
>> UUIDs are normally special portable binary values with their own type 
>> in the structure world...  Having this be a string violates the law of 
>> least surprise.  If they're not really uuids, then they shouldn't be 
>> CALLED uuids.
> There is a long story behind this; it's required by Andrew.

If Andrew requires you to call something a uuid when it isn't a UUID,
then he screwed up and he should fix it.

A UUID is not simply a random identifier which is forced to be unique
(like he requires for his id= in XML); it's an industry-standard term as
per DCE 1.1, ISO/IEC 11578:1996 and RFC 4122.

So, it is not some string guaranteed to be unique.  In fact, it isn't a
string at all, but a 128-bit binary value.  There are specified ways of
printing UUIDs, but the printed forms are not precisely UUIDs; they are
ASCII representations of UUIDs.

So, if it's not a 128-bit binary value in compliance with DCE 1.1,
ISO/IEC 11578:1996 or RFC 4122, it's not really a UUID.
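
To make the distinction concrete, here is a minimal sketch in C of a
128-bit binary value next to its printed form.  The names (example_uuid_t,
example_uuid_unparse, example_uuid_parse) are made up for illustration and
are not taken from the heartbeat sources; the sketch only handles the usual
8-4-4-4-12 hex layout and does not check version or variant bits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical type for illustration: the UUID itself is 16 raw octets. */
typedef struct {
    unsigned char octets[16];
} example_uuid_t;

/* Length of the printed form: 32 hex digits plus 4 hyphens. */
#define EXAMPLE_UUID_STRLEN 36

/* Write the ASCII representation of u into out, which must have room
 * for EXAMPLE_UUID_STRLEN + 1 bytes. */
void example_uuid_unparse(const example_uuid_t *u, char *out)
{
    size_t i, pos = 0;

    assert(u != NULL && out != NULL);
    for (i = 0; i < 16; i++) {
        /* Hyphens go before octets 4, 6, 8 and 10. */
        if (i == 4 || i == 6 || i == 8 || i == 10) {
            out[pos++] = '-';
        }
        sprintf(out + pos, "%02x", u->octets[i]);
        pos += 2;
    }
    out[pos] = '\0';
}

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int example_hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Convert the ASCII representation back into the binary value.
 * Returns 0 on success, -1 if the text is not in the expected layout. */
int example_uuid_parse(const char *in, example_uuid_t *u)
{
    size_t i, pos = 0;

    if (in == NULL || u == NULL || strlen(in) != EXAMPLE_UUID_STRLEN) {
        return -1;
    }
    for (i = 0; i < 16; i++) {
        int hi, lo;

        if (i == 4 || i == 6 || i == 8 || i == 10) {
            if (in[pos++] != '-') return -1;
        }
        hi = example_hex_value(in[pos]);
        lo = example_hex_value(in[pos + 1]);
        if (hi < 0 || lo < 0) return -1;
        u->octets[i] = (unsigned char)(hi * 16 + lo);
        pos += 2;
    }
    return 0;
}
```

The binary value takes 16 bytes and can be compared with memcmp(); the
printed form takes 36 characters plus a terminator and exists only for
display and interchange.  An arbitrary unique string such as "node-1" is
rejected by example_uuid_parse(), which is the point: it may be a fine
identifier, but it is not a UUID.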

[This URL even contains a sample UUID implementation]

     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
