[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH
Alan Robertson
alanr at unix.sh
Mon Oct 31 10:41:56 MST 2005
Andrew Beekhof wrote:
> On 10/28/05, Alan Robertson <alanr at unix.sh> wrote:
>> Andrew Beekhof wrote:
>>> On 10/28/05, Alan Robertson <alanr at unix.sh> wrote:
>>>> Sun Jiang Dong wrote:
>>>>> Alan Robertson wrote:
>>>>>> Stefan Peinkofer wrote:
>>>>>>
>>>>>>> Hello everybody,
>>>>>>>
>>>>>>> Unfortunately I have new problems with the heartbeat 2.0.3 cvs version
>>>>>>> and stonith.
>>>>>>>
>>>>>>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
>>>>>>> encountered a problem with stonithd which was killed by signal 11.
>>>>>>> The effects were that the stonith resources were NOT_ACTIVE and when I
>>>>>>> initiated a split brain no node could fence the other off.
>>>>>>>
>>>>>>> I thought maybe it's already fixed in cvs and checked out a version today
>>>>>>> (2005-10-26). But unfortunately this version seems to contain an even
>>>>>>> worse problem with stonith.
>>>>>>>
>>>>>>> After I started heartbeat on the two nodes and waited until it had
>>>>>>> started up completely, I initiated the split-brain situation. I had
>>>>>>> expected this to work because both stonith resources were active.
>>>>>>>
>>>>>>> In the logs I saw:
>>>>>>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
>>>>>>> Scheduling Node sarek for STONITH
>>>>>>> That's what I want :)
>>>>>>> But then the following message appeared:
>>>>>>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
>>>>>>> cannot add field to ha_msg.
>>>>>> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
>>>>>>
>>>>>>     if ( (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype)
>>>>>>             != HA_OK )
>>>>>>        ||(ha_msg_add(request, F_STONITHD_NODE, op->node_name ) !=
>>>>>>             HA_OK)
>>>>>>        ||(op->node_uuid == NULL
>>>>>>             || ha_msg_add(request, F_STONITHD_NODE_UUID,
>>>>>>                     op->node_uuid) != HA_OK)
>>>>>>        ||(op->private_data == NULL
>>>>>>             || ha_msg_add(request, F_STONITHD_PDATA,
>>>>>>                     op->private_data) != HA_OK)
>>>>>>        ||(ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout)
>>>>>>             != HA_OK) ) {
>>>>>>         stdlib_log(LOG_ERR, "stonithd_node_fence: "
>>>>>>                 "cannot add field to ha_msg.");
>>>>>>         ZAPMSG(request);
>>>>>>         return ST_FAIL;
>>>>>>     }
>>>>>>
>>>>>> My guess is that op->node_name or op->optype is NULL. The code should
>>>>>> have validated those. Since they're critical, and they come from
>>>>>> who-knows-where (meaning some doofus user process), they should
>>>>>> definitely have been error checked, and there should be a clear
>>>>>> message about their errors.
>>>>>>
>>>>> Should be op->private_data == NULL. This condition is not reasonable.
>>>>> I'll fix it.
>>>>>
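A minimal sketch of the kind of up-front validation discussed above, assuming a simplified stand-in for the request structure. The names struct fence_op and fence_op_check are hypothetical and are not part of heartbeat. The sketch only illustrates checking the required fields one by one, reporting which one is bad, and treating private_data as optional rather than failing when it is NULL.

```c
#include <stddef.h>

/* Hypothetical stand-in for the fencing request; NOT the real
 * heartbeat structure. Field names mirror the ones quoted above. */
struct fence_op {
    int         optype;
    const char *node_name;     /* required */
    const char *node_uuid;     /* required, as in the quoted condition */
    const char *private_data;  /* optional: NULL means "nothing to send" */
    int         timeout;       /* assumed to be positive */
};

/* Returns NULL when op is usable, otherwise a message naming the
 * first field that failed validation. */
const char *fence_op_check(const struct fence_op *op)
{
    if (op == NULL)
        return "op is NULL";
    if (op->node_name == NULL || op->node_name[0] == '\0')
        return "node_name is missing";
    if (op->node_uuid == NULL || op->node_uuid[0] == '\0')
        return "node_uuid is missing";
    if (op->timeout <= 0)
        return "timeout must be positive";
    /* private_data is deliberately not checked: it is optional. */
    return NULL;
}
```

A caller could run such a check before building the message and log the returned string, so that a bad argument yields a specific error instead of the generic "cannot add field" one.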
>>>>>> Things I don't quite understand...
>>>>>> UUIDs are normally special portable binary values with their own type
>>>>>> in the structure world... Having this be a string violates the law of
>>>>>> least surprise. If they're not really uuids, then they shouldn't be
>>>>>> CALLED uuids.
>>>>> There is a long story regarding this, it's required by Andrew.
>>>> If Andrew requires you to call something which isn't a UUID a uuid,
>>>> then he screwed up and he should fix it.
>>> delightfully tactful as ever.
>> Untactful, yes. Delightful, no. I screwed up. Again.
>>
>>> from reading this one would think that it's the first time we've
>>> had this discussion.
>> I wasn't sure it was this same issue, and I had (foolishly) hoped that
>> it wasn't really still broken.
>>
>> The project really does use the concept of a UUID. It is (and has been
>> and will continue to be) inappropriate to misuse terminology and/or use
>> it in inconsistent ways. It creates confusion - because that word
>> already means something else. Confusion violates the principle of least
>> surprise.
>>
>> How would you suggest we go about fixing this?
>
> My basic feeling about it is that requiring a uuid_t (rather than a
> char*) doesn't help anyone - so there's nothing to fix :-)
>
> Sure we could use a uuid_t instead, it's just a call to cl_uuid_parse().
>
> But the first thing that the function is going to (or at least should)
> do is unparse it into a char* again so they can log what they're about
> to do.
>
> So I just don't see the added value of keeping it in one form vs. another.
> But on the other hand, I don't actually care so much... if you're that
> keen on a uuid_t then we can use that.
>
> Btw. the stonithd doesn't actually use it for anything internally.
>
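To make the parse/unparse round trip concrete, here is a self-contained sketch. It does not use cl_uuid_parse() or any heartbeat code: demo_uuid_t, demo_uuid_parse and demo_uuid_unparse are hypothetical names, and the only format assumed is the canonical 36-character 8-4-4-4-12 hex layout. It also shows that an arbitrary string handle does not parse as a UUID.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte binary UUID type for this sketch only. */
typedef struct { unsigned char b[16]; } demo_uuid_t;

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" into 16 bytes.
 * Returns 0 on success, -1 if the string is not in that layout. */
int demo_uuid_parse(const char *s, demo_uuid_t *out)
{
    size_t i = 0, n = 0;

    if (s == NULL || out == NULL || strlen(s) != 36)
        return -1;
    while (i < 36) {
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (s[i] != '-')
                return -1;
            i++;
            continue;
        }
        int hi = hexval(s[i]);
        int lo = hexval(s[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out->b[n++] = (unsigned char)(hi * 16 + lo);
        i += 2;
    }
    return 0;
}

/* Write the lowercase canonical string form; out must hold 37 bytes. */
void demo_uuid_unparse(const demo_uuid_t *u, char out[37])
{
    static const char hex[] = "0123456789abcdef";
    size_t i, p = 0;

    for (i = 0; i < 16; i++) {
        if (i == 4 || i == 6 || i == 8 || i == 10)
            out[p++] = '-';
        out[p++] = hex[u->b[i] >> 4];
        out[p++] = hex[u->b[i] & 0x0f];
    }
    out[p] = '\0';
}
```

With these helpers, a string such as "sarek" is rejected by demo_uuid_parse, which is the practical difference between a real UUID and a free-form string identifier.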
>> Would it be of value to have a bugzilla for this?
>
> It's about a 2 line change in the TE where it calls stonith.
>
> On the other hand, if you want me using uuid_t EVERYWHERE... that's a
> different story.
No, no, no.
I just meant - let's not call it a uuid. Call it a charhandle, or
uniquestring, or something like that.
It's simply a nomenclature issue.
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce