[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH
alanr at unix.sh
Mon Oct 31 10:49:11 MST 2005
peinkofe at fhm.edu wrote:
> Hello Alan,
> On Mon, Oct 31, 2005 at 09:35:58AM -0700, Alan Robertson wrote:
>> peinkofe at fhm.edu wrote:
>>> Hello Alan, On Mon, Oct 31, 2005 at 08:18:10AM -0700, Alan Robertson
>>> wrote:
>>>> peinkofe at fhm.edu wrote:
>>>>> Hello everybody, On Sun, Oct 30, 2005 at 07:29:31PM +0100,
>>>>> peinkofe at fhm.edu wrote:
>>>>>> Yes, I just tried the current CVS version and it works.
>>>>>> (Problem 2 (the "cannot add field to ha_msg" error) is gone, and
>>>>>> Problem 1 seems to be solved as well.)
>>>>> Seems that I was a little bit too optimistic. Problem 1 isn't
>>>>> solved yet. In fact it worked once and failed many times. In the
>>>>> one case that worked, a timeout of the monitor op was logged:
>>>>> Oct 30 19:01:46 spock lrmd: : WARN: on_op_timeout_expired:
>>>>> TIMEOUT: operation monitor on stonith::wti_nps::kill_sarek
>>>>> for client 4469, its parameters: timeout=5000
>>>>> ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true
>>>>> password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
>>>>> Oct 30 19:01:51 spock crmd: : ERROR:
>>>>> mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on
>>>>> kill_sarek Timed Out
>>>>> Then it said that stonithd was killed by signal 11 and respawned
>>>>> it:
>>>>> Oct 30 19:01:55 spock heartbeat: : ERROR: Exiting
>>>>> /usr/lib/heartbeat/stonithd process 4467 killed by signal 11.
>>>>> Oct 30 19:01:55 spock heartbeat: : ERROR: Exiting
>>>>> /usr/lib/heartbeat/stonithd process 4467 dumped core
>>>> WE NEED THE STACK TRACE FROM THIS CORE DUMP.
>>> I'm sorry, I forgot. Attached are some gdb backtraces (I hope that
>>> is what you want, since pstack on Linux does not seem to support
>>> core files).
>>> To avoid misunderstandings: do you agree that fixing the cause of
>>> the stonithd core dump does not solve the whole problem? I mean,
>>> stonithd recovers through the respawning mechanism, but what makes
>>> the situation worse is that the stonith resources fail to restart
>>> and therefore remain inactive.
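Yes, gdb backtraces from the core file are exactly what I was after.
For the archives, a typical way to get one - the paths below are
examples, adjust them to your installation:

    ulimit -c unlimited               # allow core dumps before reproducing
    gdb /usr/lib/heartbeat/stonithd core
    (gdb) bt full                     # backtrace with local variables
    (gdb) thread apply all bt         # all threads, in case it isn't thread 1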
>> I agree that there are two problems.
>> IMHO, the more serious of the two is the core dump. The other
>> wouldn't be a problem if stonithd hadn't needed to restart.
> From my humble user's point of view it's the other way round;
> to overstate it a bit, a user doesn't care that stonithd segfaults
> as long as the cluster does what it's supposed to do.
I understand. Obviously, I have a different perspective.
> By the way, I personally like the approach of accepting that
> failures occur and adding "self healing" capabilities to recover,
> if possible.
We obviously agree on that. Stuff happens.
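That is exactly what heartbeat's respawn mechanism is for. For
ordinary client daemons it is driven by the respawn directive in
ha.cf; the line below is the stock documentation example, not your
configuration - stonithd itself is respawned internally by heartbeat
when the CRM is enabled:

    # /etc/ha.d/ha.cf - restart the named client whenever it dies
    respawn hacluster /usr/lib/heartbeat/ipfail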
>> I don't know why the CRM didn't restart the resources when the
>> monitor operation failed. (At least, I think it failed)
The respawn should usually happen before the monitor fails - unless
the timing was unlucky.
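One thing that might help narrow down the monitor timeouts: exercise
the wti_nps plugin outside the cluster with the stonith(8) command-line
tool and see how long a status query actually takes (the parameter
syntax below is from memory - 'stonith -t wti_nps -n' prints the exact
parameter names the plugin expects):

    stonith -L                        # list the available plugin types
    stonith -t wti_nps -n             # show the plugin's parameter names
    time stonith -t wti_nps -p "192.168.1.205 XXXXXXX" -S

If the status query routinely takes anywhere near 5 seconds, the
timeout=5000 on your monitor op is simply too tight.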
> I think the CRM at least tried to restart the stonith resources, and
> one time (see the first set of logfiles for this) it even succeeded
> in doing so. Maybe there is a timing "problem", since in the case
> where it succeeded, the announcement of the resource restart came
> after the stonithd respawn announcement. In the other cases, where
> the restart didn't succeed, it was exactly the other way round. Many
> thanks in advance.
So it did succeed sometimes.
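If you can reproduce it, it would help to see the exact ordering on a
good run versus a bad one. Assuming the default heartbeat log file from
ha.cf (adjust the path and patterns to your setup), something like

    grep -E 'stonithd|kill_sarek' /var/log/ha-log

should show whether the restart attempt consistently lands before or
after the stonithd respawn.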
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William