[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH
alanr at unix.sh
Mon Oct 31 09:35:58 MST 2005
peinkofe at fhm.edu wrote:
> Hello Alan,
> On Mon, Oct 31, 2005 at 08:18:10AM -0700, Alan Robertson wrote:
>> peinkofe at fhm.edu wrote:
>>> Hello everybody,
>>> On Sun, Oct 30, 2005 at 07:29:31PM +0100, peinkofe at fhm.edu wrote:
>>>> Yes, I just tried the current cvs version and it works.
>>>> (Problem 2 (the "cannot add field to ha_msg" error) is gone and
>>>> Problem 1 seems to be solved as well)
>>> Seems that I was a little bit too optimistic. Problem 1 isn't
>>> solved yet. In fact it worked once and failed many times. In the
>>> case which worked, a timeout of the monitor op was discovered:
>>> Oct 30 19:01:46 spock lrmd: : WARN: on_op_timeout_expired:
>>> TIMEOUT: operation monitor on stonith::wti_nps::kill_sarek
>>> for client 4469, its parameters: timeout=5000
>>> ipaddr=192.168.1.205 te-target-rc=7 lrm-is-probe=true
>>> password=XXXXXXX crm_feature_set=1.0.3 interval=10000 .
>>> Oct 30 19:01:51 spock crmd: : ERROR:
>>> mask(lrm.c:do_lrm_event): LRM operation (15) monitor_10000 on
>>> kill_sarek Timed Out
>>> Then it said that stonithd was killed by signal 11 and respawned
>>> it.
>>> Oct 30 19:01:55 spock heartbeat: : ERROR: Exiting
>>> /usr/lib/heartbeat/stonithd process 4467 killed by signal 11.
>>> Oct 30 19:01:55 spock heartbeat: : ERROR: Exiting
>>> /usr/lib/heartbeat/stonithd process 4467 dumped core
>> WE NEED THE STACK TRACE FROM THIS CORE DUMP.
> I'm sorry, I forgot. Attached are some gdb backtraces (I hope that is
> what you want, since pstack on Linux does not seem to support core
> files).
> To avoid misunderstandings: do you agree that fixing the cause of the
> stonithd core dump does not solve the whole problem? I mean, stonithd
> recovers through the respawning mechanism, but what makes the
> situation worse is that the stonith resources fail to restart and
> therefore remain inactive.
I agree that there are two problems.
IMHO, the more serious of the two is the core dump. The other wouldn't
be a problem if the stonithd hadn't needed to restart.
I don't know why the CRM didn't restart the resources when the monitor
operation failed. (At least, I think it failed)
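For anyone chasing a similar crash, here is a minimal sketch (not part of heartbeat) of how one might scan the logs for the two messages quoted above and build a gdb command line to pull a backtrace from the resulting core file. The regular expressions are modeled only on the two log lines in this thread, the function names are made up for illustration, and the core file path "core" is an assumption: where the core actually lands depends on the system.

```python
import re

# Patterns modeled on the two heartbeat log lines quoted in this thread.
KILLED = re.compile(r"Exiting (\S+) process (\d+) killed by signal (\d+)")
DUMPED = re.compile(r"Exiting (\S+) process (\d+) dumped core")


def find_crashes(lines):
    """Return {pid: {"binary", "signal", "dumped_core"}} for crashed children."""
    crashes = {}
    for line in lines:
        killed = KILLED.search(line)
        dumped = DUMPED.search(line)
        match = killed or dumped
        if not match:
            continue
        binary, pid = match.group(1), int(match.group(2))
        entry = crashes.setdefault(
            pid, {"binary": binary, "signal": None, "dumped_core": False}
        )
        if killed:
            entry["signal"] = int(killed.group(3))
        else:
            entry["dumped_core"] = True
    return crashes


def gdb_command(binary, core_path):
    """Build a gdb batch invocation that prints a full backtrace from a core."""
    return ["gdb", "-batch", "-ex", "bt full", binary, core_path]


# The two log lines from this thread, rejoined onto single lines.
sample = [
    "Oct 30 19:01:55 spock heartbeat: : ERROR: Exiting "
    "/usr/lib/heartbeat/stonithd process 4467 killed by signal 11.",
    "Oct 30 19:01:55 spock heartbeat: : ERROR: Exiting "
    "/usr/lib/heartbeat/stonithd process 4467 dumped core",
]

for pid, info in find_crashes(sample).items():
    if info["dumped_core"]:
        # "core" is a placeholder path (assumption), not a known location.
        print(pid, info["signal"], " ".join(gdb_command(info["binary"], "core")))
```

The gdb options used here (-batch, -ex) are standard, but the backtrace is only useful if the binary was built with debugging symbols.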
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William