[Linux-HA] standby does not take over on multiple power failure
Andrew Beekhof
beekhof at gmail.com
Mon Jun 4 05:20:22 MDT 2007
On 6/4/07, Thomas Åkerblom (HF/EBC) <thomas.akerblom at ericsson.com> wrote:
> Hi Andrew.
> I'm using 2.0.8-0.15, but I have seen the same behavior in 2.0.7.
> In this case ha-9 is DC and also the standby server.
> ha-8 has no power, but the standby server does not take over.
> The logs begin right before I pulled the power cord.
>
> Actually I do know how to get around this problem now, but I also have some new questions.
> If I remove the line:
> <nvpair id="default_resource_failure_stickiness" name="default_resource_failure_stickiness" value="-INFINITY"/>
> from the cib file, the problem disappears.
> I wouldn't expect that parameter to have this effect, rather the opposite.
> Is this a known/expected correlation?
Not so much "correlation" as "that's what it's designed to do".
Setting default_resource_failure_stickiness=-INFINITY means that if
heartbeat finds rscX failed on nodeY, it will never consider
nodeY a valid place to run rscX again... at least not until
the admin "clears" the error by resetting the failcount.
In the future we'll expire the failures after "a period of time", but
that is not yet implemented as the lrm doesn't provide the information
to do so.
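
To make the effect concrete, here is a rough sketch of the placement rule
described above. It is a simplified model and not heartbeat's actual code: the
function names, the base scores and the assumption that the penalty is
failcount multiplied by the failure stickiness are all illustrative. The
scenario (rsc_lim8 having an earlier recorded failure on ha-9) is hypothetical
and only shows how a -INFINITY failure stickiness could leave a resource with
nowhere to run.

```python
import math

NEG_INF = -math.inf  # stands in for the CIB value "-INFINITY"


def node_score(base_score, failcount, failure_stickiness):
    """Illustrative score of one node for one resource.

    Assumption: the penalty is failcount * failure_stickiness.
    The failcount == 0 case is handled separately because
    0 * -inf is NaN in floating point.
    """
    if failcount == 0:
        return base_score
    return base_score + failcount * failure_stickiness


def eligible_nodes(base_scores, failcounts, failure_stickiness, online):
    """Return online nodes whose score is above -INFINITY, best first."""
    scored = []
    for node, base in base_scores.items():
        if node not in online:
            continue  # a node without power cannot run anything
        score = node_score(base, failcounts.get(node, 0), failure_stickiness)
        if score > NEG_INF:
            scored.append((score, node))
    return [node for score, node in sorted(scored, reverse=True)]


# Hypothetical base scores for rsc_lim8 on each node.
BASE = {"ha-8": 100, "ha-9": 50}

if __name__ == "__main__":
    # ha-8 has lost power; ha-9 carries one earlier recorded failure.
    print(eligible_nodes(BASE, {"ha-9": 1}, NEG_INF, online={"ha-9"}))
    # Same situation after the admin has reset the failcount on ha-9.
    print(eligible_nodes(BASE, {}, NEG_INF, online={"ha-9"}))
```

In this model a single recorded failure is enough to exclude the node, which
is why resetting the failcount (or removing the nvpair) makes the node
eligible again.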
> I would like to set that parameter in order to be able to use the failure counters.
> Furthermore I am not able to read and reset the counters using:
>
> crm_failcount -G -U ha-8 -r rsc_lim8
> The result is always 0
>
> crm_failcount -D -U ha-8 -r rsc_lim8
> Error performing operation: The object/attribute does not exist.
Later versions return 0 instead of "The object/attribute does not exist."
Updated packages for most distros/platforms are available at:
http://software.opensuse.org/download/server:/ha-clustering/