[Linux-HA] Problem with restarting (or moving) failed resource

Andrew Beekhof beekhof at gmail.com
Fri Oct 5 04:51:19 MDT 2007


On 10/4/07, Andrew W. Nosenko <andrew.w.nosenko at gmail.com> wrote:
> Heartbeat-2.1.2
> If the resource (the test-daemon process) is killed too frequently,
> heartbeat marks it as "failed" and does not try to restart it or
> move it to the other node.
>
> If the kills are infrequent enough, the 'test-daemon' process is
> restarted on the same node without any problems (heartbeat does not
> try to move it to the other node, but that seems like an entirely
> different story).

indeed - http://linux-ha.org/v2/faq/forced_failover

> Interestingly, once in this state ('test-daemon' is neither
> restarted on the 'awn' node nor migrated to the second node
> 'lisiy'), the "victim" 'test-daemon' resource is restarted
> automatically on the first node ('awn') if the second node goes away
> (heartbeat is shut down cleanly).
>
> The cluster is configured as symmetric, all "stickiness" values are default,

which is why it's not being moved automagically

> The 'test-daemon' resource has a 'monitor' operation with the default
> behaviour (no "on_fail" attribute).  Setting "on_fail" to "restart"
> does not make the problem go away; the result is the same.

right, that's the default behaviour

>  The "victim" 'test-daemon'
> process lives in the group 'test-group' on the node "awn" (at the time
> of this test).
>
> Is there some race condition in the resource recovery code?
>
> Logs of the full cycle (from start to stop) and "cibadmin -Q" output
> are attached.

can you attach the following 2 files from awn:
  /var/lib/heartbeat/pengine/pe-warn-304.bz2
  /var/lib/heartbeat/pengine/pe-warn-305.bz2

they contain exactly what the PE (policy engine) was working with at the time
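
For a quick look inside those files before sending them, here is a
minimal sketch in Python (not a heartbeat tool). It assumes each
pe-warn file is a bzip2-compressed XML snapshot, and that operation
results are recorded on lrm_rsc_op elements with an rc_code (or
rc-code) attribute. Those element and attribute names are assumptions
about the snapshot layout, and nonzero_ops is a hypothetical helper
name; adjust both to whatever the files actually contain.

```python
import bz2
import xml.etree.ElementTree as ET


def nonzero_ops(path):
    """List (op id, operation, rc) for every recorded op with a non-zero rc.

    Assumption: results live on <lrm_rsc_op> elements and the return code
    is stored as "rc_code" or "rc-code". A non-zero rc is not always an
    error (a probe may legitimately report "not running").
    """
    # bz2.open gives a file object that ElementTree can parse directly.
    with bz2.open(path, "rb") as fh:
        root = ET.parse(fh).getroot()

    results = []
    for op in root.iter("lrm_rsc_op"):
        rc = op.get("rc_code", op.get("rc-code"))
        if rc is not None and rc != "0":
            results.append((op.get("id"), op.get("operation"), int(rc)))
    return results
```

Calling nonzero_ops("/var/lib/heartbeat/pengine/pe-warn-304.bz2") on awn
would list the operations the snapshot recorded with a non-zero return
code, if the layout assumptions above hold.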

> The point of the last kill (after which 'test-daemon' was not
> restarted) can be found in the ha-log.awn, line:
>
> Oct  4 14:22:53 awn test-daemon[6759]: Signal #15 (Terminated: 15)
> received.  Terminating...
>
> Attached files:
> ha-log.awn  -- log from node 'awn' (DC and node where "victim" process runs)
> ha-log.lisiy -- log from second node
> cib.xml -- output of 'cibadmin -Q'
>
> 'crm_mon' cut'n'paste follows:
>
> ============
> Last updated: Thu Oct  4 14:23:15 2007
> Current DC: awn (2ac97182-5b64-4edb-a528-ee6d160c326a)
> 2 Nodes configured.
> 2 Resources configured.
> ============
>
> Node: awn (2ac97182-5b64-4edb-a528-ee6d160c326a): online
> Node: lisiy.ua3 (9888b89c-94bb-4505-ab34-f84deced5e9d): online
>
> Resource Group: test-group
>     test-ip     (heartbeat::ocf:IPaddr):        Started awn
>     test-daemon (awn::ocf:test-daemon.ocf):     Started awn FAILED
> Clone Set: test-pingd-clone
>     test-pingd:0        (heartbeat::ocf:pingd): Started awn
>     test-pingd:1        (heartbeat::ocf:pingd): Started lisiy.ua3
>
> Failed actions:
>     test-daemon_monitor_5000 (node=awn, call=17, rc=7): complete
>
> -----[ end of crm_mon screen]-----
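
For reading the "Failed actions" entry above: rc=7 is the OCF
"not running" code, i.e. the 5000 monitor found test-daemon stopped,
which matches the kill. A minimal sketch that splits such a line into
its fields follows. The regular expression is an assumption based only
on the one line shown here, and parse_failed_action is a hypothetical
helper, not part of crm_mon.

```python
import re

# Subset of the standard OCF resource agent exit codes.
OCF_RC = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
}

# Assumed layout: <resource>_<operation>_<interval> (node=..., call=..., rc=...): <status>
LINE = re.compile(
    r"^(?P<rsc>.+)_(?P<op>[a-z]+)_(?P<interval>\d+)\s+"
    r"\(node=(?P<node>[^,]+),\s*call=(?P<call>\d+),\s*rc=(?P<rc>\d+)\):\s*"
    r"(?P<status>\w+)"
)


def parse_failed_action(line):
    """Return the fields of one failed-action line, or None if it does not match."""
    # Drop mail quote markers and indentation before matching.
    m = LINE.match(line.lstrip("> \t"))
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "resource": m.group("rsc"),
        "operation": m.group("op"),
        "interval": int(m.group("interval")),
        "node": m.group("node"),
        "call": int(m.group("call")),
        "rc": rc,
        "rc_name": OCF_RC.get(rc, "unknown"),
        "status": m.group("status"),
    }
```

Feeding it the quoted line gives resource 'test-daemon', operation
'monitor', node 'awn' and rc_name 'OCF_NOT_RUNNING'.
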
>
> PS.  Excuse me my English, please.
>
> --
> Andrew W. Nosenko <andrew.w.nosenko at gmail.com>

