[Linux-HA] Ipfail support for heartbeat 2.0.x

Alan Robertson alanr at unix.sh
Thu Oct 20 08:52:22 MDT 2005


Lars Marowsky-Bree wrote:
> On 2005-10-19T08:38:02, Andrew Beekhof <beekhof at gmail.com> wrote:
> 
> (Moving discussion to -dev)
> 
>>>In this case, having it in the one common place for all variables is a
>>>simplification - since it can be done once instead of many times.
>>>
>>So what do people want?  I'm not getting it...
> 
>>If the PE detects a resource should be moved it will move it then and
>>there - not at some arbitrary point in the future.
> 
> OK. I'll try to summarize again, maybe a more formal description of the
> problem will help us all to get to a common understanding. The good part
> is that this is clearly simpler than multi-state resources ;-)
> 
> We have a metric "N"; say, number of paths to the storage, number of
> external nodes which can be reached, whatever. 
> 
> This metric is monitored on each node n -> N(n). As nodes are not
> running in lock-step, it's also observed at a specific time -> N(n,t).
> 
> Requirements:
> 
> 1. We want to be able to specify a dependency on running where N is
>    maximal; so that our webserver runs on the node with the best
>    connectivity, for example, or that, if the storage of a node has
>    failed, we do a pro-active switch to a node which still has >=2 paths
>    at least.
> 
> 2. We want to _minimize_ switching resources, because otherwise we
>    create more downtime (ie, by switching to a node which doesn't
>    provide us any benefit and we have to switch again) than as if we had
>    done nothing.
>  
> These requirements are slightly conflicting. 
> 
> R.1 is easy: just feed the attribute into the CIB raw, and let the PE do
> its thing right away - selecting a node based on a maximum value for a
> given node attribute isn't difficult. In fact, this will _converge_ on a
> correct solution, yet we violate R.2.
> 
> For example, for ping nodes to monitor external connectivity, it is
> quite likely that not all of them will be reachable all the time; it's
> expected they fluctuate. If we bounce resources every time a single ping
> node hiccups for a few seconds, the admin will not be happy - the
> switch-over caused unneeded downtime.

If that's the case, it will be equally visible to all nodes in the 
cluster (given the note below).  AND, we already have this type of 
hysteresis in the ping nodes "deadtime" computation - so this isn't needed.

> Or, if the ping node goes down for
> real, all nodes will eventually see that - so it's silly to bounce
> resources because n1 has already noticed while n2 hasn't _yet_.

This part is exactly what ipfail currently does.  And, without any 
special effort for the previous case, there have been no complaints 
about it.

> So, R.2 requires that we dampen the events, and just trigger the PE
> after the situation had a chance to stabilize. (Or, as any update to the
> CIB triggers the PE, this for us means to not update the CIB before it
> has stabilized a bit.)
> 
> I think we a) need to average the N(n,t) metric over a configurable
> history - this will dampen at a per-node level and prevent minor hiccups
> on a single node from bouncing resources, unless the error re-occurs
> frequently.

I disagree on this approach.  It's more complicated than needed and only 
works with integer values.

When someone tells you to set an attribute value with hysteresis, then 
go ahead and set it internally now, but delay a specified amount before 
notifying the CRM/PE that it has changed.  This is what we do in 
reporting newly-added nodes (and it's basically what ipfail does):

Here is how it might work:
	Hysteresis set to 3 time units

	time 1:	node A changes the value of its node attribute to 2
	time 2:	node B changes the value of its node attribute to 2
	time 4:	CIB reports the CIB update to the CRM/PE.
		no action is taken, because the values are now
		the same.
(ping device recovers)
	time 21: node A changes the value of its node attribute to 3
	time 22: node B changes the value of its node attribute to 3
	time 24: CIB reports the CIB update to the CRM/PE.
		 no action is taken, because the values are now
		 the same.

Without this change, here's what happens (worst case):
	time 1: node A changes the attribute to 2
		CIB reports change to CRM/PE
		CRM->PE->TE->move resources around
	time 2: node B changes the attribute to 2
		CIB reports change to CRM/PE
		CRM->PE->TE->move resources back where they were
(ping device recovers)
	time 21: node A changes the value of its node attribute to 3
		CIB reports change to CRM/PE
		CRM->PE->TE->move resources around
	time 22: node B changes the value of its node attribute to 3
		CIB reports change to CRM/PE
		CRM->PE->TE->move resources around

	In this case up to 4 outages were incurred when none was needed.
	This was a cause of major complaints with early versions of
	ipfail.
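
The two timelines above can be replayed with a minimal Python sketch.
This is only an illustration, not heartbeat or CIB code: the class
DampedAttributes, the replay() helper, the logical clock, and the
starting value of 3 on both nodes are all assumptions made for the
example.  A value is stored as soon as it is set, but the report to the
CRM/PE goes out only once the hysteresis delay, started by the first
unreported change, has expired.

```python
class DampedAttributes:
    """Node attributes that change at once but are reported after a delay."""

    def __init__(self, hysteresis, initial):
        self.hysteresis = hysteresis    # delay in logical time units
        self.values = dict(initial)     # current values, updated immediately
        self.deadline = None            # when the pending report is due
        self.notifications = []         # (time, snapshot) pairs sent onward

    def flush(self, now):
        """Send the pending report if its deadline has passed."""
        if self.deadline is not None and now >= self.deadline:
            self.notifications.append((self.deadline, dict(self.values)))
            self.deadline = None

    def set(self, now, node, value):
        """Record a new value; start the delay timer if none is running."""
        self.flush(now)
        if self.values.get(node) == value:
            return
        self.values[node] = value
        if self.deadline is None:
            self.deadline = now + self.hysteresis
        self.flush(now)     # with hysteresis 0 the report goes out at once


# The four attribute changes from the timelines: (time, node, new value).
EVENTS = [(1, "A", 2), (2, "B", 2), (21, "A", 3), (22, "B", 3)]


def replay(hysteresis, end=30):
    """Replay EVENTS and return the reports the CRM/PE would receive."""
    attrs = DampedAttributes(hysteresis, {"A": 3, "B": 3})
    for now, node, value in EVENTS:
        attrs.set(now, node, value)
    attrs.flush(end)    # let any pending report fire
    return attrs.notifications


print([t for t, _ in replay(3)])    # [4, 24]
print([t for t, _ in replay(0)])    # [1, 2, 21, 22]
```

With a hysteresis of 3, the CRM/PE hears about the changes twice, and
both times the two nodes already agree, so there is nothing to move.
With a hysteresis of 0 it hears about them four times, and every report
is a chance to move resources.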

Of course, given the difficulties you've seen with CIB consistency, this 
may be an evil choice.  If so, then by all means, say so...

There's probably another way...

> But, this isn't enough; it's still a black-or-white decision whether
> that value is higher or smaller than some other nodes.
> 
> So, b), we can't do a black-or-white decision, but we need to be able to
> say "N(n_1) greater than all other N(n_x) by d".

Take care of this yourself - in the attribute values.  For example, 
don't report temperatures in tenths of a degree.  Report them as being 
OK, too warm, and way too warm with just 3 values.  Unlike the 
hysteresis, this is easily done by the monitor processes.

Or you could report them as:
	"green"
	"yellow"
	"red"
and write corresponding rules with arbitrary weights...
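
As a rough sketch of what such a monitor process might do (in Python;
the thresholds, the function name classify(), and the weights are made
up for illustration and are not taken from any existing agent):

```python
# Hypothetical weights a rule might attach to each reported level.
WEIGHTS = {"green": 100, "yellow": 10, "red": -1000}


def classify(temp_c, warm=60.0, hot=80.0):
    """Collapse a raw temperature reading into one of three levels."""
    if temp_c >= hot:
        return "red"
    if temp_c >= warm:
        return "yellow"
    return "green"


# Small fluctuations in the raw reading do not change the reported value.
for reading in (45.3, 45.9, 61.0, 95.2):
    level = classify(reading)
    print(reading, level, WEIGHTS[level])
```

Because the attribute only changes when a threshold is crossed, readings
that wobble by a few tenths of a degree never reach the CIB at all.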

> This is, I think, sufficient to achieve the desired effect.

I believe that you can already do this by giving a rule whose weight is 
an attribute value.
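
To make the idea concrete, here is a small Python sketch of placement
when the weight is simply the attribute value.  It does not show the
PE's actual algorithm or rule syntax; the function place() and the
"stay put on a tie" behaviour are assumptions for the example.

```python
def place(attrs, current=None):
    """Pick the node with the highest attribute value.

    attrs maps node name -> attribute value, used directly as the weight.
    If the current node is tied for the best weight, stay there
    (an assumption made here to avoid needless moves).
    """
    best = max(attrs.values())
    if current in attrs and attrs[current] == best:
        return current
    # Otherwise choose deterministically among the best-scoring nodes.
    return min(node for node, value in attrs.items() if value == best)


print(place({"A": 2, "B": 3}, current="A"))    # B
print(place({"A": 2, "B": 2}, current="B"))    # B
```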

Hysteresis is something the monitor agents can do by themselves only 
with great effort.  Everything else you proposed is easily built into the 
monitoring agents.

I would suggest putting nothing in the CRM that the monitoring agents 
can easily do themselves.

-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce


