[Linux-HA] Re: bug in failcount handling?

Andrew Beekhof beekhof at gmail.com
Wed Oct 31 02:21:54 MDT 2007


On 10/31/07, Alan Robertson <alanr at unix.sh> wrote:
> Serge Dubrouski wrote:
> > On 10/30/07, Alan Robertson <alanr at unix.sh> wrote:
> >> Andrew Beekhof wrote:
> >>> On Oct 30, 2007, at 2:44 AM, Alan Robertson wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I've been working with a customer - trying to get them up and running
> >>>> on version 2.1.2.  I got everything to work except for one thing:
> >>>> They require that their web server fail over on the 3rd failure.  I
> >>>> read the documentation on the failcount stuff on the web site here:
> >>>> http://www.linux-ha.org/v2/faq/forced_failover
> >>>>
> >>>> I think I understood it, and I created a CIB to match.  In the CIB I
> >>>> created, I believe it should fail over on the 3rd failure.  In
> >>>> practice it fails over reliably on the 9th iteration instead.  We had
> >>>> been doing a "killall httpd" to fail the web server.
> >>> 9th is correct.
> >>>
> >>> As has been explained here on the list a number of times, the group's
> >>> stickiness is N * default-resource-stickiness, where N is the number of
> >>> resources in the group.
> >>>
> >>> Including the rsc_location constraint, the group stickiness is therefore:
> >>> 4 * 20 + 1 = 81
> >>> So clearly apache is going to need to fail 9 times (9 *
> >>> default-resource-failure-stickiness = -90) before the group is moved.
> >>>
> >>>
> >>> Of course it all starts getting even more complicated when one starts
> >>> creating rsc_colocation constraints with other groups and primitives.
> >> Can I specify the resource-failure-stickiness of a group either
> >> explicitly or implicitly?
> >>
> >> Since I'm writing this up for the web site, I want to make sure I have
> >> this absolutely clear so I can write it up correctly:
> >>
> >> Do you mean that you sum up the stickiness values for each resource in
> >> the group, or did you really mean that it always uses n * default
> >> stickiness? (I'm asking about both failure stickiness and resource
> >> stickiness.)
> >>
> >> If I have a location constraint for a group of 'p' points, does that
> >> then distribute across the group of 'n' nodes so that we get a group
> >> preference of 'p' * 'n' points?  Or is it just a total of 'p'
> >> points for the group as a whole?
> >>
> >> My current attempt to document this can be found here:
> >>         http://linux-ha.org/v2/faq/forced_failover
> >>
> >
> > I always wondered why this is so complex. Why can't you guys implement
> > one more resource attribute that would simply identify after how many
> > failures the resource has to be moved off the failing node?
>
> As far as I know, that's what everyone wants.  This customer will
> probably drop Heartbeat over this issue.  And I can understand why.
>
> This behavior adds a lot of complexity without adding power - in fact,

Wrong.
It allows a resource to fail on hostX N times, then fail on hostY a
few times more and then try X again.

Until we are able to expire failures, this is the only way to do that.

> it makes the whole solution less useful from my perspective - since the
> normal thing you want to do "fail over when my web server fails 3 times"
> can't be done _at all_ if there are colocation dependencies or groups
> involved.

Bullshit.

> Failures in a group are cumulative - which isn't what this
> customer (or AFAIK anyone) wants.

Huh?

So if the filesystem fails and some time later apache fails, then "the
group" has only failed once?

Uhh...

> Every other HA system on the planet just has a count of how many times
> to fail before migrating the resource.  They do that because it's what's
> needed.  We should do the same.

It's on my to-do list - like everything else.

If it's such a high priority for you, you could try writing some code
instead of bitching and moaning.  You are allegedly part of the
development team, remember.

It would have also been useful had you been involved 4 years ago
during the development of R2 when all these features were being
written and backwards compatibility wasn't a problem.
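
For reference, the arithmetic quoted above (4 * 20 + 1 = 81, and 9
failures at -10 each = -90) can be sketched in a few lines of Python.
This is only an illustration, not the policy engine's actual code: the
function name and parameters are hypothetical, the values 20, 1 and -10
are inferred from the numbers in this thread, and it assumes the group
moves once its score on the current node drops below zero (the
competing node is assumed to score 0).

```python
def failures_before_move(n_resources, stickiness, location_score,
                         failure_stickiness):
    """Count the failures needed before the group's score goes negative.

    Hypothetical helper: models group score as
    n_resources * stickiness + location_score, reduced by
    failure_stickiness (a negative number) on every failure.
    """
    if failure_stickiness >= 0:
        raise ValueError("failure_stickiness must be negative")

    score = n_resources * stickiness + location_score
    failures = 0
    # Assumption: the group stays put while its score is >= 0.
    while score + failures * failure_stickiness >= 0:
        failures += 1
    return failures


# Group of 4 resources, as in the quoted example: 4 * 20 + 1 = 81
group_failures = failures_before_move(4, 20, 1, -10)

# Same values for a single resource outside a group: 1 * 20 + 1 = 21
single_failures = failures_before_move(1, 20, 1, -10)

print(group_failures, single_failures)
```

Under these assumptions the four-resource group needs 9 failures, while
a single resource with the same settings would move on the 3rd.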

