[Linux-HA] Problem with restarting (or moving) failed resource

Andrew W. Nosenko andrew.w.nosenko at gmail.com
Thu Oct 4 06:12:02 MDT 2007


Heartbeat-2.1.2
If the resource (the test-daemon process) is killed too frequently,
heartbeat marks the resource as "failed" and tries neither to restart
the process nor to move it to the other node.

If the kill frequency is low enough, the 'test-daemon' process is
restarted on the same node without any problems (heartbeat doesn't try
to move it to the other node, but that seems to be an entirely
different story).

Interestingly, once the cluster is in this state ('test-daemon' is
neither restarted on node 'awn' nor migrated to the second node
'lisiy'), the "victim" 'test-daemon' resource is restarted
automatically on the first node ('awn') if the second node goes away
(heartbeat is shut down cleanly).

The cluster is configured as symmetric, all "stickiness" values are at
their defaults, and the 'test-daemon' resource has a 'monitor'
operation with the default behaviour (no "on_fail" attribute).  If I
set "on_fail" to "restart", the problem doesn't go away; the result is
the same.  The "victim" 'test-daemon' process lives in the group
'test-group' on node "awn" (at the time of this test).

Could this be a race condition in the resource recovery code?

Logs of the full cycle (from start to stop) and "cibadmin -Q" output
are attached.

The point of the last kill (after which 'test-daemon' was not
restarted) can be found in ha-log.awn, at the line:

Oct  4 14:22:53 awn test-daemon[6759]: Signal #15 (Terminated: 15) received.  Terminating...

Attached files:
ha-log.awn  -- log from node 'awn' (the DC, and the node where the "victim" process runs)
ha-log.lisiy -- log from second node
cib.xml -- output of 'cibadmin -Q'

'crm_mon' cut'n'paste follows:

============
Last updated: Thu Oct  4 14:23:15 2007
Current DC: awn (2ac97182-5b64-4edb-a528-ee6d160c326a)
2 Nodes configured.
2 Resources configured.
============

Node: awn (2ac97182-5b64-4edb-a528-ee6d160c326a): online
Node: lisiy.ua3 (9888b89c-94bb-4505-ab34-f84deced5e9d): online

Resource Group: test-group
    test-ip     (heartbeat::ocf:IPaddr):        Started awn
    test-daemon (awn::ocf:test-daemon.ocf):     Started awn FAILED
Clone Set: test-pingd-clone
    test-pingd:0        (heartbeat::ocf:pingd): Started awn
    test-pingd:1        (heartbeat::ocf:pingd): Started lisiy.ua3

Failed actions:
    test-daemon_monitor_5000 (node=awn, call=17, rc=7): complete

-----[ end of crm_mon screen]-----
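
For reference when reading the "Failed actions" entry above: rc=7 is
OCF_NOT_RUNNING in the OCF resource agent return codes, i.e. the
monitor operation found the process not running.  Below is a minimal
Python sketch that splits such a line into its fields.  It assumes the
line layout shown in the crm_mon screen above; the function name
parse_failed_action and the regular expression are illustrative only
and are not part of heartbeat.

```python
import re

# Return codes defined by the OCF resource agent API.
OCF_RC_NAMES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
}

# Assumed layout: <resource>_<operation>_<interval> (node=N, call=C, rc=R): <status>
FAILED_ACTION_RE = re.compile(
    r"^\s*(?P<resource>\S+)_(?P<op>[a-z]+)_(?P<interval>\d+)\s+"
    r"\(node=(?P<node>[^,]+),\s*call=(?P<call>\d+),\s*rc=(?P<rc>\d+)\):"
    r"\s*(?P<status>\w+)"
)


def parse_failed_action(line):
    """Parse one 'Failed actions' line; return a dict, or None if no match."""
    m = FAILED_ACTION_RE.match(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "resource": m.group("resource"),
        "op": m.group("op"),
        "interval": int(m.group("interval")),
        "node": m.group("node"),
        "call": int(m.group("call")),
        "rc": rc,
        # Codes outside the table above are reported as "UNKNOWN".
        "rc_name": OCF_RC_NAMES.get(rc, "UNKNOWN"),
        "status": m.group("status"),
    }


if __name__ == "__main__":
    line = "    test-daemon_monitor_5000 (node=awn, call=17, rc=7): complete"
    print(parse_failed_action(line))
```

Applied to the line from the screen above, this reports resource
'test-daemon', operation 'monitor', node 'awn' and rc_name
OCF_NOT_RUNNING, which matches the kill of the process seen in the log.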

P.S.  Please excuse my English.

-- 
Andrew W. Nosenko <andrew.w.nosenko at gmail.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ha-log.awn.gz
Type: application/x-gzip
Size: 38174 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20071004/00d566b2/ha-log.awn-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ha-log.lisiy.gz
Type: application/x-gzip
Size: 13018 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20071004/00d566b2/ha-log.lisiy-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cib.xml.gz
Type: application/x-gzip
Size: 1871 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20071004/00d566b2/cib.xml-0001.bin

