[Linux-HA] Groups vs colocations.... etc

Andrew Beekhof beekhof at gmail.com
Thu Dec 7 07:24:11 MST 2006


On 12/7/06, Andrew Beekhof <beekhof at gmail.com> wrote:
> On 12/6/06, Andreas Kurz <akurz at sms.at> wrote:
> > Andrew Beekhof wrote:
> > > On 11/28/06, Andreas Kurz <akurz at sms.at> wrote:
> > >> Serge Dubrouski wrote:
> > >> > Most clusterware products, at least those that I've worked with
> > >> > (Veritas VCS, RedHat ClusterSuite, HP ServiceGuard, etc..) consider
> > >> > resources in a group dependent on each other. Upper resources depend
> > >> > on lower ones. Like DB depends on Filesystem with data files. That
> > >> > means that if Filesystem fails DB has to be restarted. And Heartbeat
> > >> > works exactly like this if you have a group with collocated property
> > >> > set to "true". Per my understanding it's completely right. If you
> > >> > don't want that dependency exclude your NFS filesystem from the group
> > >> > but add collocated constraint between that group and separate NFS
> > >> > resource. That might help.
> > >> >
> > >> > As for stickiness I personally don't like how it's implemented in
> > >> > Heartbeat, I'd prefer having a simple property
> > >> > "number_of_fails_before_failover".
> > >
> > > which doesn't in any way affect this scenario (groups) because you've
> > > still got one part of the group trying to stay where it is and the
> > > other trying to move.  at least with the scoring the CRM gets some
> > > hint as to which part of the group it should take the most notice of.
> > >
> > > there is a comment further down which talks about one resource being
> > > "buggy"... expecting cluster software to magically compensate for
> > > inherently broken resources is unrealistic.
> >
> > You are right, of course! I only wanted to produce some errors for the
> > test-scenario ;-)
> >
> >
> > >> eg:
> > >>
> > >> a group with 5 resources, 2 nodes
> > >> location constraint: score 1 for node1, score 10 for node2
> > >> resource stickiness: 10
> > >> failure stickiness: 5
> > >>
> > >> resource failed over to node1 because of an unexpected server hang of
> > >> node2, node2 up again (I assume the location scores working correctly
> > >> ;-) )
> > >>
> > >> node1: 5*(1) + 5*(10) = 55
> > >> node2: 5*(10)        = 50
> > >>
> > >> ok ... resource stays on node1
> > >>
> > >> one resource is buggy, heartbeat starts to stop/start it
> > >>
> > >> restart1:
> > >>
> > >> node1: 5*1 + 5*10 - 1*5 = 50
> > >> node2: 5*10             = 50
> > >>
> > >> ok ... resource stays on node1
> >
> > So with equal scores the group is moved away because of the "lower load"
> > of node2? Is this computed by the number of resources running on each node?
>
> right
>
> >
> > >>
> > >> restart2:
> > >> node1: 5*1 + 5*10 - 2*5  = 45
> > >> node2: 5*10              = 50
> > >>
> > >> takeover, after 1 local restart, am I right?
> > >
> > > you tell me - try ptest and see what it does.
> >
> > OK. The group is moved away when either the combined score of node1 is
> > lower than node2 or if the score for one resource is negative.
> >
> > >
> > >> resource group is on node2,
> > >
> > >> failcount reset on node1:
> > >
> > > the failcount is never reset automatically
> >
> > I did it manually ;-)
> >
> > >> node1: 5*1         = 5
> > >> node2: 5*10 + 5*10 = 100
> > >>
> > >> hmm ... that's a problem, or have I missed something?
> > >
> > > why is this a problem?
> > >
> > >> that would lead to about 20 local restarts before a failover to node1
> > >> happens ....
> >
> > Not so many, but more than on the other node with the lower scores. The
> > group fails over when the local score for the failing resource is negative.
> >
> > >
> > > so choose different values
> > > or dont apply the rsc_location preference to every member of the group
> >
> > I tried to configure instance_attributes for the group with different
> > resource_failure_stickiness values but without success, the rule never
> > matches:
> >
> > ptest[27606]: 2006/12/06_17:30:43 debug: debug2: test_rule:rules.c
> > Testing rule higher_failure_stickiness_rule
> > ptest[27606]: 2006/12/06_17:30:43 debug: debug2: test_expression:rules.c
> > Expression test failed on all ndoes
> > ptest[27606]: 2006/12/06_17:30:43 debug: debug3: test_rule:rules.c
> > Expression higher_failure_stickiness_rule/test failed
> > ptest[27606]: 2006/12/06_17:30:43 debug: debug3: unpack_attr_set:rules.c
> > Adding attributes from lower_failure_stickiness_inst
> >
> >
> > <instance_attributes id="higher_failure_stickiness_inst" score="100">
> >         <rule id="higher_failure_stickiness_rule" boolean_op="and">
> >                 <expression attribute="#uname" operation="eq"
> >                         value="sms-nfs-02" id="test"/>
> >         </rule>
> >         <attributes>
> >                 <nvpair id="higher_failure_stickiness_id"
> >                         name="resource_failure_stickiness" value="-10"/>
> >         </attributes>
> > </instance_attributes>
> > <instance_attributes id="lower_failure_stickiness_inst" score="10">
> >         <attributes>
> >                 <nvpair id="lower_failure_stickiness_id"
> >                         name="resource_failure_stickiness" value="-1"/>
> >         </attributes>
> > </instance_attributes>
> >
> > Andrew, do you have a hint why this is not working? The group is
> > currently running on the node sms-nfs-02. I tried the same with a time
> > based rule and it worked.
>
> i dont think what you want to do is possible (yet anyway)
> the mechanism was intended for setting RA properties _after_ we've
> decided to place it somewhere (ie. on nodeX use NIC=eth1, otherwise
> use NIC=eth0)
>
> so trying to set some variables based on the current location is
> somewhat more problematic - though i seem to remember it working in
> the past so maybe i broke something.
>
> let me get back to you...

as of this version it should work:
    http://hg.beekhof.net/lha/crm-stable/rev/1045cec0d37d
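
For anyone following the arithmetic quoted above, here is a minimal
Python sketch of it. It only reproduces the simplified model used in
Andreas's example (location score and stickiness counted once per group
member, failure stickiness subtracted once per recorded failure). The
function name and its parameters are made up for illustration; this is
not how the CRM computes placement internally, and ptest remains the
thing to trust.

```python
def node_score(n_resources, location_score, stickiness=0,
               running_here=False, failcount=0, failure_stickiness=0):
    """Combined group score for one node, per the thread's simplified model.

    Hypothetical helper: it mirrors the hand arithmetic in the example,
    not the CRM's real placement algorithm.
    """
    # Every group member contributes its location preference for this node.
    score = n_resources * location_score
    # Stickiness only counts on the node where the group currently runs.
    if running_here:
        score += n_resources * stickiness
    # Each recorded failure on this node costs one failure-stickiness unit.
    score -= failcount * failure_stickiness
    return score


# Values from the example: 5 resources, location 1 (node1) / 10 (node2),
# resource stickiness 10, failure stickiness 5, group running on node1.
for fails in range(3):
    n1 = node_score(5, 1, stickiness=10, running_here=True,
                    failcount=fails, failure_stickiness=5)
    n2 = node_score(5, 10)
    print(f"failcount={fails}: node1={n1} node2={n2}")
```

With these inputs node1 goes 55, 50, 45 while node2 stays at 50, which
matches the numbers in the example. The sketch does not cover the other
two behaviours mentioned in the thread: the load-based tie-break at
equal scores, and the move triggered when a single resource's own score
goes negative.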

>
> >
> > Regards,
> > Andi
> >
> > >
> > >>
> > >> If I am completely wrong please correct me!
> > >>
> > >> Regards,
> > >> Andreas
> > >>
> > >> >
> > >> > On 11/28/06, Andre van der Vlies <andre at vandervlies.xs4all.nl> wrote:
> > >> >>
> > >> >> Andreas Kurz wrote:
> > >> >> > Andre van der Vlies wrote:
> > >> >> >> Andrew Beekhof wrote:
> > >> >> >>>> So, given:
> > >> >> >>>>   IPaddr_1
> > >> >> >>>>   IPaddr_2
> > >> >> >>>>   NFS_1
> > >> >> >>>>   NFS_2
> > >> >> >>>>   PG
> > >> >> >>>>
> > >> >> >>>> there's no way I can prevent NFS_2 and PG from being stopped and
> > >> >> >>>> started if NFS_1 fails, make NFS_1 retry 5 times and if it doesn't
> > >> >> >>>> succeed the whole group needs to failover...  :-/
> > >> >> >>>
> > >> >> >>> not in a group.
> > >> >> >>> but groups are only a syntactic shortcut for a bunch of colocation
> > >> >> >>> and ordering constraints.
> > >> >> >>>
> > >> >> >>> so dont use a group and dont make NFS_2 depend on NFS_1
> > >> >> >>>
> > >> >> >>
> > >> >> >> Sorry, I still don't get it...
> > >> >> >>
> > >> >> >> I've got 5 resources.
> > >> >> >> I make constraints to start them in the right order (1, 2, 3, 4, 5)
> > >> >> >> I make constraints to get them start on the same node...
> > >> >> >
> > >> >> > That's what a group implies, you don't need to make them 'by hand'
> > >> >> > or if you prefer it that way you can disable all constraints from the
> > >> >> > group. Then your group is only a naming convention for your convenience.
> > >> >> >
> > >> >> >>
> > >> >> >>   As a bonus I can do stuff with the stickiness of a resource. For
> > >> >> >>   instance resource 3 fails and is retried 5 times before it fails
> > >> >> >>   over to another node; which makes all the other resources migrate...
> > >> >> >>
> > >> >> >
> > >> >> > Yes, because of the colocation constraints.
> > >> >> >
> > >> >> >> But....
> > >> >> >> If I put those 5 resources in a group (colocation, order), I can only
> > >> >> >> use the stickiness of the last resource in the group. None of the
> > >> >> >> others seems to have any vote in the matter. And if a 'midlist'
> > >> >> >> resource fails all lower resources are stopped and started....
> > >> >> >
> > >> >> > The stickiness, no matter if it's the 'resource_failure_stickiness' or
> > >> >> > the 'resource_stickiness', is bound to a resource independent from where
> > >> >> > the resource is defined in the group.
> > >> >> >
> > >> >>
> > >> >> Okay.
> > >> >>
> > >> >> > All resources in a group are bound together by the colocation
> > >> >> > constraints so a failing resource has influence on the whole group and
> > >> >> > the score of the group. The sum of all scores of all resources in a
> > >> >> > group decides on which node the whole group has to run. So if you define
> > >> >> > a failure stickiness every failing resource lowers the group score.
> > >> >> >
> > >> >>
> > >> >> That has been my reasoning too...  My experience tells me otherwise
> > >> >>
> > >> >> > Because the ordering constraints are per default symmetric they imply
> > >> >> > also a stop_before and not only the defined start_before constraint, and
> > >> >> > I think it makes sense most of the time ... but it can also be disabled.
> > >> >> >
> > >> >>
> > >> >> Hmmm....  How do I do this exactly?
> > >> >>
> > >> >> > Hope that helps ;-)
> > >> >> >
> > >> >>
> > >> >> A bit. I have been reasoning along the same path. The behaviour of my
> > >> >> cluster is (very) different from what I expected...
> > >> >>
> >
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA at lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
>

