[Linux-HA] behavior of lrmd/crmd when lrmd process is killed
Junko IKEDA
ikedaj at intellilink.co.jp
Fri Jun 27 00:59:09 MDT 2008
> > When I checked the following bug using the latest heartbeat-dev and
> > pacemaker-dev,
> > http://developerbugs.linux-foundation.org/show_bug.cgi?id=1924
> >
> > I found some weird behavior.
> >
> > There are these five resources.
> >
> > ============
> > Last updated: Fri Jun 27 13:07:11 2008
> > Current DC: x3650b (db1e4cef-d242-419e-9393-bf5113384744)
> > 2 Nodes configured.
> > 1 Resources configured.
> > ============
> >
> > Node: x3650a (ce2caf3f-c150-4394-916d-3b4b635394d7): online
> > Node: x3650b (db1e4cef-d242-419e-9393-bf5113384744): online
> >
> > Resource Group: grpPostgreSQLDB
> > prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650a
> > prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650a
> > prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650a
> > prmIpPostgreSQLDB (ocf::heartbeat:IPaddr): Started x3650a
> > prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650a
> >
> >
> > When "lrmd" is killed, crmd does not notice the event, possibly because of
> > a glib problem.
> >
> > hb_report-10/x3650a:line 616
> > heartbeat[24311]: 2008/06/27_12:57:55 WARN: Managed
> > /usr/lib64/heartbeat/lrmd -r process 24327 killed by signal 9 [SIGKILL -
> > Kill, unblockable].
> >
> > but if I stop pgsql like this,
> >
> > # su - postgres
> > $ pg_ctl stop
> > waiting for server to shut down.... done
> > server stopped
> >
> > the stalled process resumes.
> >
> > hb_report-10/x3650a:line 657
> > crmd[24330]: 2008/06/27_13:09:36 CRIT: lrm_connection_destroy: LRM
> > Connection failed
> >
> > Heartbeat 2.1.3 did the same.
> > I wonder why the status of Postgres affects this.
>
> This is seriously messed up.
> I wonder if it could be caused by the fact that a process spawned by
> the lrmd is still active.
>
> It might be worth seeing if you can repeat the result with a resource
> based on a simple daemon process ( while(1) { sleep(1); } ).
A simple daemon process showed the same result as pgsql.
See the attached hb_report:
hb_report-simpledaemon/x3650a/ha-log.txt:line 528
heartbeat[4019]: 2008/06/27_15:28:04 WARN: Managed /usr/lib64/heartbeat/lrmd
-r process 4037 killed by signal 9 [SIGKILL - Kill, unblockable].
hb_report-simpledaemon/x3650a/ha-log.txt:line 569
crmd[4040]: 2008/06/27_15:35:45 CRIT: lrm_connection_destroy: LRM Connection
failed
Here is what I did; crmd reported the failure only when simpledaemon was killed:
[root@x3650a ~]# pgrep -lf lrmd
4037 /usr/lib64/heartbeat/lrmd -r
[root@x3650a ~]# kill -9 4037; date
Fri Jun 27 15:28:04 JST 2008
[root@x3650a ~]# pgrep -lf simpledaemon
4088 /root/tmp/bin/simpledaemon
[root@x3650a ~]# kill -9 4088; date
Fri Jun 27 15:35:45 JST 2008
[root@x3650a ~]#
Thanks,
Junko
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb_report-simpledaemon.tar.gz
Type: application/octet-stream
Size: 56407 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20080627/a12ab871/hb_report-simpledaemon.tar-0001.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simpledaemon.c
Type: application/octet-stream
Size: 108 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20080627/a12ab871/simpledaemon-0002.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simpledaemon
Type: application/octet-stream
Size: 4884 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20080627/a12ab871/simpledaemon-0003.obj