[Linux-HA] About time-out of STONITH.
Dejan Muhamedagic
dejanmm at fastmail.fm
Tue Jun 17 07:34:35 MDT 2008
Hi,
On Tue, Jun 17, 2008 at 11:16:00AM +0900, HIDEO YAMAUCHI wrote:
> Hi,
>
> I tested how STONITH behaves when an operation times out (Heartbeat 2.1.3 with ibmrsa-telnet).
>
> I tested it with the following sequence:
>
> 1) Start Heartbeat on two nodes.
> 2) Hang one node.
> 3) Make STONITH time out (put a sleep in the code, or cut all power to the node).
>
> But, unlike a normal RA, multiple instances of the STONITH RA are started.
>
> I think that the STONITH RA should be started again only after the
> previous instance has been properly killed, as with a normal RA.
And that is what happens. After one stonith reset operation
fails, this time due to the timeout, another one is scheduled,
i.e. another stonith resource instance is started.
> //-------The state of the ps command
> Last login: Tue Jun 17 10:00:54 2008 from 172.30.96.92
> [root@x3650b ~]# ps -ef |grep ibm
> root 4562 1 0 Jun12 ? 00:00:00 /sbin/ibmasm
> root 4823 4562 0 Jun12 ? 00:00:00 /sbin/ibmasm
> root 11913 11912 0 10:23 ? 00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root 11947 11917 0 10:23 pts/1 00:00:00 grep ibm
> [root@x3650b ~]# ps -ef |grep ibm
> root 4562 1 0 Jun12 ? 00:00:00 /sbin/ibmasm
> root 4823 4562 0 Jun12 ? 00:00:00 /sbin/ibmasm
> root 11913 1 0 10:23 ? 00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root 11962 1 0 10:26 ? 00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root 11977 1 0 10:29 ? 00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root 11994 11993 0 10:32 ? 00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
As you can see, the instances are started three minutes apart.
I wonder, though, why the earlier ones remain. The only
explanation I can think of is that the stonith resource forks a
new process, though I don't know ibmrsa-telnet well enough to
confirm that. From the logs:
stonithd[11751]: 2008/06/17_10:41:33 WARN: Managed external_r_stonith-node01_1 process 12047 killed by signal 9 [SIGKILL - Kill, unblockable].
stonithd[11751]: 2008/06/17_10:41:33 WARN: child exits, but not tracked.
and from the process list:
root 12048 0.0 0.0 108736 4228 ? S 10:38 0:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
Process 12047, which was probably the parent of PID 12048, had
been killed due to the timeout. Since it was removed with
signal 9, which cannot be caught, it had no chance to stop its
child, and the child remained. This should probably be changed,
i.e. the process should first be sent a TERM signal so that it
has a chance to notify its children and otherwise do a proper
cleanup.
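To illustrate the idea, here is a minimal sketch in Python. It is
not stonithd code (stonithd is written in C), and the names
run_with_timeout and stop_gracefully are made up for this example.
It assumes a POSIX system and that the child is started in its own
process group, so that the signal also reaches any processes the
child has forked:

```python
import os
import signal
import subprocess


def stop_gracefully(proc, grace=5.0):
    """Send SIGTERM to the child's process group; escalate to SIGKILL
    only if the child has not exited after `grace` seconds."""
    try:
        # proc.pid is also the process group id, because the child
        # was started with start_new_session=True (see below).
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the whole group is already gone
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        try:
            # Last resort: SIGKILL the group, so children die too.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
    return proc.returncode


def run_with_timeout(argv, timeout, grace=5.0):
    """Run argv; if it exceeds `timeout` seconds, stop it gracefully."""
    # A new session gives the child its own process group, so signals
    # sent with killpg() reach its descendants as well.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return stop_gracefully(proc, grace)
```

Signalling the whole group rather than a single PID is a choice
made for this sketch: even when the final SIGKILL is needed, the
forked child does not survive its parent the way PID 12048 did.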
I opened a bugzilla for this issue:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1922
Thanks,
Dejan