[Linux-HA] Groups vs colocations.... etc
Serge Dubrouski
sergeyfd at gmail.com
Mon Dec 4 07:11:56 MST 2006
On 12/4/06, Andrew Beekhof <beekhof at gmail.com> wrote:
> On 11/28/06, Andreas Kurz <akurz at sms.at> wrote:
> > Serge Dubrouski wrote:
> > > Most clusterware products, at least those that I've worked with
> > > (Veritas VCS, RedHat ClusterSuite, HP ServiceGuard, etc.) consider
> > > resources in a group dependent on each other. Upper resources depend
> > > on lower ones, like a DB depends on the Filesystem with its data files.
> > > That means that if the Filesystem fails, the DB has to be restarted. And
> > > Heartbeat works exactly like this if you have a group with the collocated
> > > property set to "true". Per my understanding that's completely right. If
> > > you don't want that dependency, exclude your NFS filesystem from the group
> > > but add a colocation constraint between that group and the separate NFS
> > > resource. That might help.
> > >
> > > As for stickiness I personally don't like how it's implemented in
> > > Heartbeat, I'd prefer having a simple property
> > > "number_of_fails_before_failover".
>
> which doesn't in any way affect this scenario (groups) because you've
> still got one part of the group trying to stay where it is and the
> other trying to move. at least with the scoring the CRM gets some
> hint as to which part of the group it should take the most notice of.
I wanted to say that I do not like the mechanisms for controlling failover
in Heartbeat. They are too complex, in my personal opinion. No other
product makes the admin do any calculations to know after how many
failures resources will be moved to another node.
>
> there is a comment further down which talks about one resource being
> "buggy"... expecting cluster software to magically compensate for
> inherently broken resources is unrealistic.
>
> really the failure stickiness for such resources should always be zero
> - no matter where it runs, it's going to fail, so trying to move it
> achieves nothing.
>
> > > All those stickiness calculations
> > > are too complex, IMHO.
>
> > > Especially that resources and group already
> > > have "failcount" property implemented.
>
> huh? this sentence doesn't make sense to me.
I meant: if it's already there, why not use it to control a failover?
>
> > Yes, I agree completely with you. It would be much simpler to define how
> > many local restarts a resource is allowed to do before a failover
> > happens. In addition, a parameter for an automatic reset of the fail
> > counter would be handy.
> >
> > With the current implementation IMHO it is not easy to find a working
> > combination of location scores, resource stickiness and failure stickiness.
> >
> > Am I wrong or is it sometimes not possible to get the same restart count on
> > all nodes if you use all scores together
> > (location/stickiness/failure_stickiness)?
> >
> > eg:
> >
> > a group with 5 resources, 2 nodes
> > location constraint: score 1 for node1, score 10 for node2
> > resource stickiness: 10
> > failure stickiness: 5
> >
> > resource failed over to node1 because of an unexpected server hang of
> > node2, node2 up again (I assume the location scores working correctly ;-) )
> >
> > node1: 5*1 + 5*10 = 55
> > node2: 5*(10) = 50
> >
> > ok ... resource stays on node1
> >
> > one resource is buggy, heartbeat starts to stop/start it
> >
> > restart1:
> >
> > node1: 5*1 + 5*10 - 1*5 = 50
> > node2: 5*10 = 50
> >
> > ok ... resource stays on node1
> >
> > restart2:
> > node1: 5*1 + 5*10 - 2*5 = 45
> > node2: 5*10 = 50
> >
> > takeover, after 1 local restart, am I right?
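The arithmetic in this example can be sketched in a few lines of Python. This is only a model of the calculation as it is written in this thread (location score times group size, plus stickiness times group size on the node currently running the group, minus failure stickiness per recorded failure there). It is not the CRM's actual placement algorithm, a tie is assumed to leave the group where it is, and the function and parameter names are made up for illustration.

```python
# Simplified model of the score arithmetic used in this thread.
# NOT the real CRM algorithm; all names here are hypothetical.

def group_score(node, running_on, location, n_resources=5,
                stickiness=10, failure_stickiness=5, failcount=0):
    """Score of the whole group on `node`.

    `failcount` is the number of failures recorded on the node that
    currently runs the group; it only lowers that node's score.
    """
    score = n_resources * location[node]
    if node == running_on:
        score += n_resources * stickiness
        score -= failcount * failure_stickiness
    return score


location = {"node1": 1, "node2": 10}

# Group is running on node1; walk through 0, 1 and 2 failures.
for failures in range(3):
    s1 = group_score("node1", "node1", location, failcount=failures)
    s2 = group_score("node2", "node1", location)
    verdict = "stays on node1" if s1 >= s2 else "moves to node2"
    print(f"failures={failures}: node1={s1} node2={s2} -> {verdict}")
```

With the values from the example this reproduces the 55/50, 50/50 and 45/50 pairs above; whether the real cluster behaves that way is exactly what ptest would have to confirm.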
>
> you tell me - try ptest and see what it does.
>
> > resource group is on node2,
>
> > failcount reset on node1:
>
> the failcount is never reset automatically
>
> >
> > node1: 5*1 = 5
> > node2: 5*10 + 5*10 = 100
> >
> > hmm ... that's a problem, or have I missed something?
>
> why is this a problem?
>
> > that would lead to about 20 local restarts before a failover to node1
> > happens ....
>
> so choose different values
> or don't apply the rsc_location preference to every member of the group
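To see how much the two suggestions change the picture, here is a rough sketch that counts how many failures the group tolerates on its current node before the other node wins. It uses the same simplified arithmetic as Andreas' example, not the CRM's real algorithm; the helper name and the `n_located` parameter (how many group members carry the location preference) are hypothetical, and a tie is assumed to leave the group in place.

```python
# Simplified model, following the arithmetic in this thread only.
# All names are hypothetical; this is not Heartbeat code.

def failures_before_failover(loc_here, loc_there, n_located, n_resources=5,
                             stickiness=10, failure_stickiness=5):
    """Smallest failcount at which the current node scores below the other."""
    if failure_stickiness <= 0:
        raise ValueError("failure_stickiness must be positive in this model")
    here = n_located * loc_here + n_resources * stickiness
    there = n_located * loc_there
    failures = 0
    # A tie keeps the group where it is, so loop while here >= there.
    while here - failures * failure_stickiness >= there:
        failures += 1
    return failures


# Location preference on all 5 members (the example from this thread).
print("on node1:", failures_before_failover(1, 10, n_located=5))
print("on node2:", failures_before_failover(10, 1, n_located=5))

# Location preference on a single member of the group only.
print("on node1:", failures_before_failover(1, 10, n_located=1))
print("on node2:", failures_before_failover(10, 1, n_located=1))
```

Under these assumptions the gap between the two nodes shrinks noticeably when only one member carries the preference, though it does not disappear completely.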
>
> >
> > If I am completely wrong please correct me!
> >
> > Regards,
> > Andreas
> >
> > >
> > > On 11/28/06, Andre van der Vlies <andre at vandervlies.xs4all.nl> wrote:
> > >>
> > >> Andreas Kurz wrote:
> > >> > Andre van der Vlies wrote:
> > >> >> Andrew Beekhof wrote:
> > >> >>>> So, given:
> > >> >>>> IPaddr_1
> > >> >>>> IPaddr_2
> > >> >>>> NFS_1
> > >> >>>> NFS_2
> > >> >>>> PG
> > >> >>>>
> > >> >>>> there's no way I can prevent NFS_2 and PG from being stopped and
> > >> >>>> started if NFS_1 fails, make NFS_1 retry 5 times and if it doesn't
> > >> >>>> succeed the whole group needs to failover... :-/
> > >> >>>
> > >> >>> not in a group.
> > >> >>> but groups are only a syntactic shortcut for a bunch of colocation
> > >> >>> and ordering constraints.
> > >> >>>
> > >> >>> so don't use a group and don't make NFS_2 depend on NFS_1
> > >> >>>
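A small sketch may make the "syntactic shortcut" point concrete. It models a group as a chain of pairwise colocation and ordering constraints and then asks which resources get stopped and started when one member fails. The tuple representation is purely illustrative (it is not CIB syntax), the function names are made up, and the explicit constraint layout at the end is just one assumed way of not making NFS_2 depend on NFS_1.

```python
# Hypothetical model: a group as a chain of pairwise constraints.
# The tuples are illustrative only, not CIB syntax.

def expand_group(members):
    """Expand an ordered group into colocation and ordering constraints."""
    constraints = []
    for earlier, later in zip(members, members[1:]):
        constraints.append(("colocation", later, earlier))  # later runs with earlier
        constraints.append(("order", earlier, later))       # earlier starts first
    return constraints


def restarted_with(constraints, failed):
    """Resources ordered (directly or indirectly) after `failed`."""
    affected, frontier = [], [failed]
    while frontier:
        current = frontier.pop()
        for kind, first, then in constraints:
            # Only ordering constraints propagate a restart in this model.
            if kind == "order" and first == current and then not in affected:
                affected.append(then)
                frontier.append(then)
    return affected


group = ["IPaddr_1", "IPaddr_2", "NFS_1", "NFS_2", "PG"]
print(restarted_with(expand_group(group), "NFS_1"))

# Assumed alternative: explicit constraints where NFS_2 no longer
# depends on NFS_1 (both are ordered after IPaddr_2 instead).
explicit = [
    ("order", "IPaddr_1", "IPaddr_2"),
    ("order", "IPaddr_2", "NFS_1"),
    ("order", "IPaddr_2", "NFS_2"),
    ("order", "NFS_2", "PG"),
]
print(restarted_with(explicit, "NFS_1"))
```

In the group form a failure of NFS_1 drags NFS_2 and PG along; with the explicit layout nothing is ordered after NFS_1, so in this model nothing else is restarted.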
> > >> >>
> > >> >> Sorry, I still don't get it...
> > >> >>
> > >> >> I've got 5 resources.
> > >> >> I make constraints to start them in the right order (1, 2, 3, 4, 5)
> > >> >> I make constraints to get them start on the same node...
> > >> >
> > >> > That's what a group implies, you don't need to make them 'by hand'
> > >> > or if you prefer it that way you can disable all constraints from
> > >> > the group.
> > >> > Then your group is only a naming convention for your convenience.
> > >> >
> > >> >>
> > >> >> As a bonus I can do stuff with the stickiness of a resource. For
> > >> >> instance resource 3 fails and is retried 5 times before it fails
> > >> >> over to another node; which makes all the other resources migrate...
> > >> >>
> > >> >
> > >> > Yes, because of the colocation constraints.
> > >> >
> > >> >> But....
> > >> >> If I put those 5 resources in a group (colocation, order), I can only
> > >> >> use the stickiness of the last resource in the group. None of the
> > >> >> others seems to have any vote in the matter. And if a 'midlist'
> > >> >> resource fails all lower resources are stopped and started....
> > >> >
> > >> > The stickiness, no matter if it's the 'resource_failure_stickiness' or
> > >> > the 'resource_stickiness', is bound to a resource independent from
> > >> > where the resource is defined in the group.
> > >> >
> > >>
> > >> Okay.
> > >>
> > >> > All resources in a group are bound together by the colocation
> > >> > constraints so a failing resource has influence on the whole group and
> > >> > the score of the group. The sum of all scores of all resources in a
> > >> > group decides on which node the whole group has to run. So if you
> > >> > define a failure stickiness every failing resource lowers the group
> > >> > score.
> > >> >
> > >>
> > >> That has been my reasoning too... My experience tells me otherwise
> > >>
> > >> > Because the ordering constraints are per default symmetric they imply
> > >> > also a stop_before and not only the defined start_before constraint,
> > >> > and I think it makes sense most of the time ... but it can also be
> > >> > disabled.
> > >> >
> > >>
> > >> Hmmm.... How do I do this exactly?
> > >>
> > >> > Hope that helps ;-)
> > >> >
> > >>
> > >> A bit. I have been reasoning along the same path. The behaviour of my
> > >> cluster is (very) different from what I expected...
> > >>