[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH

Sun Jiang Dong hasjd at cn.ibm.com
Fri Oct 28 04:15:38 MDT 2005



Alan Robertson wrote:
> Stefan Peinkofer wrote:
> 
>> Hello everybody,
>>
>> unfortunately I have new problems with the heartbeat 2.0.3 cvs version
>> and stonith.
>>
>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
>> encountered a problem with stonithd which was killed by signal 11.
>> The effects were that the stonith resources were NOT_ACTIVE and when I
>> initiated a split brain no node could fence the other off.
>>
>> I thought maybe it was already fixed in cvs and checked out a version today
>> (2005-10-26). But unfortunately this version seems to contain an even
>> worse problem with stonith.
>>
>> After I started heartbeat on the two nodes and waited until it had started
>> up completely, I initiated the split brain situation. I had expected this
>> to work because both stonith resources were active.
>>
>> In the logs I saw:
>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
>> Scheduling Node sarek for STONITH
>> That's what I want :)
>> But then the following message appeared:
>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
>> cannot add field to ha_msg.
> 
> 
> This is some kind of an issue in the lib/fencing/stonithd_lib.c file:
> 
>         if (  (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype) != HA_OK )
>             ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) != HA_OK)
>             ||(op->node_uuid == NULL
>                || ha_msg_add(request, F_STONITHD_NODE_UUID, op->node_uuid) != HA_OK)
>             ||(op->private_data == NULL
>                || ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK)
>             ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
>                 != HA_OK) ) {
>                 stdlib_log(LOG_ERR, "stonithd_node_fence: "
>                            "cannot add field to ha_msg.");
>                 ZAPMSG(request);
>                 return ST_FAIL;
>         }
> 
> My guess is that op->node_name or op->optype is NULL.  The code should 
> have validated those.  Since they're critical, and they come from 
> who-knows-where (meaning some doofus user process), they should 
> definitely have been error checked, and there should be a clear message 
> about their errors.
> 

The cause should be op->private_data == NULL. Treating a NULL
private_data as an error in this condition is not reasonable.
I'll fix it.
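
To illustrate the point, here is a minimal sketch (not the actual patch) of
why a NULL private_data makes the quoted condition fire, and one way the
check could be written instead. Every demo_* name below is a hypothetical
stand-in for ha_msg, ha_msg_add() and the stonith op structure, so the
sketch builds without heartbeat. Treating private_data as optional and
node_name as required is an assumption drawn from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for this sketch; not the real heartbeat values. */
enum { DEMO_FAIL = 0, DEMO_OK = 1 };

/* Hypothetical stand-in for the message: just counts the fields added. */
struct demo_msg { int nfields; };

/* Hypothetical stand-in for a fence request; only the fields discussed. */
struct demo_op {
    const char *node_name;     /* assumed required */
    const char *private_data;  /* assumed optional */
};

/* Stand-in for ha_msg_add(): rejects NULL values, otherwise adds a field. */
static int demo_msg_add(struct demo_msg *msg, const char *name,
                        const char *value)
{
    assert(msg != NULL && name != NULL);
    if (value == NULL) {
        return DEMO_FAIL;
    }
    msg->nfields++;
    return DEMO_OK;
}

/* Same shape as the quoted condition: "private_data == NULL ||" makes the
 * whole test true, so a NULL private_data is reported as an add failure. */
static int build_request_as_quoted(struct demo_msg *msg,
                                   const struct demo_op *op)
{
    if (demo_msg_add(msg, "node", op->node_name) != DEMO_OK
        || (op->private_data == NULL
            || demo_msg_add(msg, "pdata", op->private_data) != DEMO_OK)) {
        return DEMO_FAIL;
    }
    return DEMO_OK;
}

/* One possible rewrite: validate the required field with its own message,
 * and add the optional field only when it is present. */
static int build_request_fixed(struct demo_msg *msg,
                               const struct demo_op *op)
{
    if (op == NULL || op->node_name == NULL) {
        fprintf(stderr, "demo: node_name is required\n");
        return DEMO_FAIL;
    }
    if (demo_msg_add(msg, "node", op->node_name) != DEMO_OK
        || (op->private_data != NULL
            && demo_msg_add(msg, "pdata", op->private_data) != DEMO_OK)) {
        fprintf(stderr, "demo: cannot add field to message\n");
        return DEMO_FAIL;
    }
    return DEMO_OK;
}
```

With node_name set to "sarek" and private_data left NULL, the first
function fails even though nothing is wrong with the request, while the
second succeeds and adds only the node field.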

> Things I don't quite understand...
> UUIDs are normally special portable binary values with their own type in 
> the structure world...  Having this be a string violates the law of 
> least surprise.  If they're not really uuids, then they shouldn't be 
> CALLED uuids.
There is a long story behind this; it is required by Andrew.

> 
> Normally private_data is also binary.  If either of these is actually 
> binary, then this would also be wrong.  Having them be strings violates 
> the law of least surprise...  So, as a design element, it's odd to have 
> them not be binary blobs.  Of course, sending the private data as binary 
> would cause its own problems with portability.
Yes.
> 
> But, renaming it to private_string_data or something would alleviate the 
> confusion, and make it clearer.
That makes sense, I'll rename it.
> 
> 

-- 
BRs,

Sun Jiang Dong



