[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH

Alan Robertson alanr at unix.sh
Mon Oct 31 10:49:11 MST 2005


peinkofe at fhm.edu wrote:
> Hello Alan,
> On Mon, Oct 31, 2005 at 09:35:58AM -0700, Alan Robertson wrote:
>> peinkofe at fhm.edu wrote:
>>> Hello Alan,
>>> On Mon, Oct 31, 2005 at 08:18:10AM -0700, Alan Robertson wrote:
>>>> peinkofe at fhm.edu wrote:
>>>>> Hello everybody,
>>>>> On Sun, Oct 30, 2005 at 07:29:31PM +0100, peinkofe at fhm.edu wrote:
>>>>>> Yes, I just tried the current cvs version and it works.
>>>>>> (Problem 2 (the "cannot add field to ha_msg" Error) is gone and
>>>>>> Problem 1 seems to be solved as well)
>>>>>>
>>>>> Seems that I was a little bit too optimistic. Problem 1 isn't
>>>>> solved yet. In fact it worked once and failed many times. In the
>>>>> case which worked, a timeout of the monitor op was discovered: 
>>>>> Oct 30 19:01:46 spock lrmd: [4468]: WARN: on_op_timeout_expired:
>>>>> TIMEOUT: operation monitor[15] on stonith::wti_nps::kill_sarek
>>>>> for client 4469, its parameters: timeout=5000
>>>>> ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true
>>>>> password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
>>>>>
>>>>> Oct 30 19:01:51 spock crmd: [4469]: ERROR:
>>>>> mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on
>>>>> kill_sarek Timed Out
>>>>>
>>>>> Then it said that stonithd was killed by signal 11 and respawned
>>>>> it.
>>>>> Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting
>>>>> /usr/lib/heartbeat/stonithd process 4467 killed by signal 11.
>>>>> Oct 30 19:01:55 spock heartbeat: [4447]: ERROR: Exiting
>>>>> /usr/lib/heartbeat/stonithd process 4467 dumped core
>>>> WE NEED THE STACK TRACE FROM THIS CORE DUMP.
>>>>
>>> I'm sorry, I forgot. Attached are some gdb backtraces (I hope that is
>>> what you want, since pstack on Linux does not seem to support core
>>> files).
>>>
>>> To avoid misunderstandings: do you agree that fixing the cause of the
>>> stonithd core dump does not solve the whole problem? I mean, stonithd
>>> recovers through the respawning mechanism, but what makes the
>>> situation worse is that the stonith resources fail to restart and
>>> therefore remain inactive.
>> I agree that there are two problems.
>>
>> IMHO, the more serious of the two is the core dump.  The other wouldn't 
>> be a problem if the stonithd hadn't needed to restart.

> From my humble user's point of view it's the other way round because,
> to overstate it, a user doesn't care that stonithd segfaults as long as
> the cluster does what it's supposed to do.

I understand.  Obviously, I have a different perspective.

> By the way, I personally like
> the approach of accepting that failures occur and adding "self-healing"
> capabilities to recover, if possible.

We obviously agree on that.  Stuff happens.

>> I don't know why the CRM didn't restart the resources when the
>> monitor operation failed.  (At least, I think it failed)

The respawn should usually happen before the monitor fails, unless the 
timing is unlucky.

> I think the CRM at least tried to restart the stonith resources, and one
> time (see the first set of logfiles for this) it even succeeded
> in doing so. Maybe there is a timing "problem": in the case where
> it succeeded, the announcement of the resource restart came after the
> stonithd respawn announcement. In the other cases, where the restart
> didn't succeed, it was exactly the other way round. Many thanks in advance.

OK

So it did succeed sometimes.

-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce

