[Linux-HA] Ipfail support for heartbeat 2.0.x
lmb at suse.de
Wed Oct 19 06:59:37 MDT 2005
On 2005-10-19T08:38:02, Andrew Beekhof <beekhof at gmail.com> wrote:
(Moving discussion to -dev)
> > In this case, having it in the one common place for all variables is a
> > simplification - since it can be done once instead of many times.
> So what do people want? I'm not getting it...
> If the PE detects a resource should be moved it will move it then and
> there - not at some arbitrary point in the future.
OK. I'll try to summarize again, maybe a more formal description of the
problem will help us all to get to a common understanding. The good part
is that this is clearly simpler than multi-state resources ;-)
We have a metric "N"; say, number of paths to the storage, number of
external nodes which can be reached, whatever.
This metric is monitored on each node n -> N(n). As nodes are not
running in lock-step, it's also observed at a specific time -> N(n,t).
1. We want to be able to specify a dependency on running where N is
maximal; so that our webserver runs on the node with the best
connectivity, for example, or that, if the storage of a node has
failed, we do a pro-active switch to a node which still has >=2 paths
2. We want to _minimize_ switching resources, because otherwise we
create more downtime (ie, by switching to a node which doesn't
provide us any benefit and we have to switch again) than as if we had
These requirements are slightly conflicting.
R.1 is easy: just feed the attribute into the CIB raw, and let the PE do
its thing right away - selecting a node based on a maximum value for a
given node attribute isn't difficult. In fact, this will _converge_ on a
correct solution; yet, it violates R.2.
For example, for ping nodes to monitor external connectivity, it is
quite likely that not all of them will be reachable all the time; it's
expected they fluctuate. If we bounce resources every time a single ping
node hiccups for a few seconds, the admin will not be happy - the
switch-over caused unneeded downtime. Or, if the ping node goes down for
real, all nodes will eventually see that - so it's silly to bounce
resources because n1 has already noticed while n2 hasn't _yet_.
So, R.2 requires that we dampen the events, and only trigger the PE
after the situation has had a chance to stabilize. (Or, as any update to
the CIB triggers the PE, for us this means not updating the CIB before
the situation has stabilized a bit.)
I think we a) need to average the N(n,t) metric over a configurable
history - this will dampen at a per-node level and prevent minor hiccups
on a single node from bouncing resources, unless the error re-occurs
But, this isn't enough; it's still a black-or-white decision whether
that value is higher or lower than that of some other node.
So, b), we can't do a black-or-white decision, but we need to be able to
say "N(n_1) greater than all other N(n_x) by d".
This is, I think, sufficient to achieve the desired effect.
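To make b) concrete, here is a minimal Python sketch of such a decision.
The function name pick_node, the dict of per-node metrics and the sample
values are assumptions for illustration only; this is not how the PE is
implemented. The resource only moves when the best node beats every
other node by at least d:

```python
def pick_node(metrics, current, d):
    """Decide where the resource should run.

    metrics: dict mapping node name -> (already averaged) metric N(n)
    current: node currently running the resource, or None
    d:       margin by which the best node must beat all other nodes
    """
    if not metrics:
        return current
    best = max(metrics, key=metrics.get)
    if current is None or current not in metrics:
        # Nothing to protect from bouncing: take the highest metric.
        return best
    others = [value for node, value in metrics.items() if node != best]
    # Move only if the best node leads every other node by at least d.
    if best != current and all(metrics[best] - v >= d for v in others):
        return best
    return current
```

With d = 2, a resource on n2 stays put for {"n1": 3, "n2": 2}, and
moves to n1 only once the picture is something like {"n1": 4, "n2": 2}.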
Now, averaging the per-node history is something which, I think, should
be done outside the CIB prior to feeding the value in. That's easy
enough: a small local daemon which tracks this, and either gathers the
metric itself or is fed the metric, whatever...
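The per-node averaging could look roughly like the Python sketch below.
The class name MetricHistory and the plain moving average over the last
"window" samples are assumptions; a weighted or time-based average
would fit the description just as well:

```python
from collections import deque


class MetricHistory:
    """Keeps the last `window` samples of N(n,t) for the local node."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, value):
        """Record one observation of the metric."""
        self.samples.append(value)

    def average(self):
        """Return the moving average, or None before the first sample."""
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)
```

With a window of 4 and three reachable ping nodes, a single hiccup
(samples 3, 3, 0, 3) only pulls the average down to 2.25 instead of
dropping the fed value to 0.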
But, as for the decision made in b) - if we simply feed the value into
the CIB every single time it changes, the PE can make a correct
decision, but it will be invoked excessively, and running the PE carries
quite a high overhead. However, it might be acceptable for now, as the real solution
We cannot simply delay feeding said value until it changes by at least
d locally though, which at least to me seemed the first solution ;-) The
problem here is that then, again, one node will be the first to feed a
value which has changed by this margin, and we don't know yet whether
this is true for other nodes too or not. We'd have recreated the same
If we want to reduce the number of PE invocations, we need to move this
decision on the values into a small daemon which coordinates this across
the cluster and feeds the values for the attributes to the CIB en-bloc
when the margin d is exceeded, so the PE sees a consistent picture.
I can think of a second path, which doesn't require the daemon to be
cluster-aware and yet reduces PE invocations: We have two margins. d1 is
the difference between the metric and the one in the CIB at which we
_update_ the CIB; d2 would be the margin at which the PE considers the
difference across nodes significant. If d1 were approximately half the
value of d2, this would reduce the invocations of the PE significantly,
and yet should also reduce pointless bouncing.
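The local half of this (the d1 margin) can be sketched in Python as
follows. DampedPublisher and its publish callback are hypothetical names;
the callback stands in for whatever actually writes the attribute to the
CIB. The d2 margin is not shown here, as it would be applied by the PE
when comparing nodes:

```python
class DampedPublisher:
    """Writes the local metric to the CIB only when it moved by >= d1."""

    def __init__(self, d1, publish):
        self.d1 = d1
        self.publish = publish   # hypothetical callback: updates the CIB
        self.published = None    # value currently stored in the CIB

    def observe(self, value):
        """Feed a new local value; return True if the CIB was updated."""
        if self.published is None or abs(value - self.published) >= self.d1:
            self.published = value
            self.publish(value)
            return True
        # Change is below d1: keep the CIB (and thus the PE) quiet.
        return False
```

For d2 = 2 and d1 = 1, feeding 3.0, 2.5, 2.0 results in two CIB
updates (3.0 and 2.0); the intermediate 2.5 never triggers the PE.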
If setting d1 to 1/2 d2 is sufficient, this might be the best way to go,
because then it can be automatically done, and it might be in fact the
best solution, because it doesn't require yet another daemon to be
Lars Marowsky-Brée <lmb at suse.de>
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
	 -- Charles Darwin