[Linux-HA] ipfail in V2
david.lang at digitalinsight.com
Thu Oct 20 15:59:55 MDT 2005
On Thu, 20 Oct 2005, Andrew Beekhof wrote:
>> You MUST know what everyone else's values are - before they update the
> for the perfect algorithm, yes. but i never said this was trying to be.
the problem is that in most cases failing over is a fairly traumatic event.
in some cases (stateless packet filtering firewall for example) the outage
during the failover may be fast enough that you don't care about it, but
as your boxes start to actually do more things a failover event causes
also the tendency is that as your box is doing more things the process of
failing over also takes longer, causing even more impact.
and heaven help you if you have stonith configured, you have boxes
actually power down and require manual intervention to get them back up.
this is a case where a poor implementation is in most cases worse than no
This being said I want to raise one additional possibility for people to
consider as a future enhancement.
once there is the ability to have health values that get propagated around
then the possibility arises to do even fancier things when you have
for example you could put the 5 min loadave into the CIB and have 20
resources on 3 machines and tell the system that when the difference in
the load gets to be >x migrate some resources away from the heavily loaded
now doing this sort of thing will require changes to the CIB from what I
am reading, but the concept is very powerful and adding it is probably
worth the effort involved.
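
as a rough illustration of the load example above, here is a minimal sketch
in Python. it is not heartbeat or CIB code: the function names, the node
names and the idea of passing the loads in as a plain dict are all
assumptions made for the example. os.getloadavg() is the standard Python
call and its second element is the 5 min loadave.

```python
import os
from typing import Dict, Optional, Tuple


def five_minute_load() -> float:
    """Return this machine's 5 minute load average (Unix only)."""
    return os.getloadavg()[1]


def pick_migration(loads: Dict[str, float], x: float) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: decide whether resources should move.

    loads maps node name -> 5 minute load average as published by each node.
    Returns (source, destination) when the gap between the most and least
    loaded nodes is greater than x, otherwise None.
    """
    if len(loads) < 2:
        return None  # nothing to compare against
    busiest = max(loads, key=loads.get)
    idlest = min(loads, key=loads.get)
    if loads[busiest] - loads[idlest] > x:
        return busiest, idlest
    return None
```

how many resources to move, and which ones, is left open here just as it is
in the description above.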
If I am understanding things properly, currently the CIB has a health
value that is acted on immediately.
this would require a second health value with the following
characteristics (or possibly allow for an arbitrary number of such values)
1. like the normal health value it needs to be updated regularly and if
not updated for a sufficient time period needs to be considered to be bad.
2. there needs to be a configurable delay before acting on a difference in
3. there needs to be a configurable delta that's acceptable (i.e. in some
cases health values of 78 from one machine and 79 from another should not
trigger an action)
4. the action to be taken when a difference exceeds the threshold needs
to be able to be specified (either a built-in function like 'fail the
node' or an external script gets run)
5. after an action takes place it should be possible to raise a flag that
will prevent further actions for a configurable time period (in my example
above, you move resources off a loaded machine, now you need to let the
loadave settle again before you decide if you need to move more off)
note that with a value of 0 for #2, 1 for #3, 'fail the node' for #4 and
'no delay' for #5 this degenerates down to the existing health value
(which may be the way to go for it instead of implementing two completely
different types of things)
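
the five characteristics above could be sketched roughly as follows. this is
only a sketch under stated assumptions, not how the CIB works: HealthPolicy,
HealthTracker and every field name are hypothetical, higher values are
assumed to mean healthier, a stale value is treated as 0, and a difference
is assumed to trigger once it is at least the configured delta (so that a
delta of 1 with whole-number values means any difference triggers).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class HealthPolicy:
    stale_after: float             # 1: seconds without an update before a value is bad
    act_delay: float               # 2: seconds a difference must persist before acting
    max_delta: float               # 3: a gap below this is acceptable
    action: Callable[[str], None]  # 4: called with the name of the worst node
    cooldown: float                # 5: seconds to hold off after an action
    bad_value: float = 0.0         # assumed stand-in for a stale value


class HealthTracker:
    def __init__(self, policy: HealthPolicy) -> None:
        self.policy = policy
        self.readings: Dict[str, Tuple[float, float]] = {}  # node -> (value, time)
        self.exceeded_since: Optional[float] = None
        self.quiet_until = float("-inf")

    def update(self, node: str, value: float, now: float) -> None:
        self.readings[node] = (value, now)

    def effective(self, node: str, now: float) -> float:
        value, stamp = self.readings[node]
        if now - stamp > self.policy.stale_after:
            return self.policy.bad_value  # not updated recently enough
        return value

    def check(self, now: float) -> Optional[str]:
        """Run the policy once; return the node acted on, or None."""
        if not self.readings or now < self.quiet_until:
            return None  # nothing known yet, or still settling after an action
        values = {n: self.effective(n, now) for n in self.readings}
        worst = min(values, key=values.get)
        gap = max(values.values()) - values[worst]
        if gap < self.policy.max_delta:
            self.exceeded_since = None  # difference is acceptable again
            return None
        if self.exceeded_since is None:
            self.exceeded_since = now   # start timing the difference
        if now - self.exceeded_since < self.policy.act_delay:
            return None
        self.policy.action(worst)
        self.exceeded_since = None
        self.quiet_until = now + self.policy.cooldown
        return worst
```

with act_delay=0, max_delta=1, cooldown=0 and an action that fails the node,
check() acts on the first difference it sees, which is the degenerate case
described above.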
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare