[Linux-HA] ipfail in V2

David Lang david.lang at digitalinsight.com
Thu Oct 20 15:59:55 MDT 2005


On Thu, 20 Oct 2005, Andrew Beekhof wrote:

>>
>> You MUST know what everyone else's values are - before they update the
>> CRM.
>
> for the perfect algorithm, yes.  but i never said this was trying to be.
>

the problem is that in most cases failing over is a fairly traumatic event. 
in some cases (a stateless packet filtering firewall, for example) the outage 
during the failover may be short enough that you don't care about it, but 
as your boxes start to actually do more things, a failover event causes 
more grief.

also, the tendency is that as your box does more things, the process of 
failing over also takes longer, causing even more impact.

and heaven help you if you have stonith configured: then you have boxes 
that actually power down and require manual intervention to get them back up.

this is a case where a poor implementation is in most cases worse than no 
implementation.

This being said I want to raise one additional possibility for people to 
consider as a future enhancement.

once there is the ability to have health values that get propagated around, 
the possibility arises to do even fancier things when you have 
multiple resources.

for example, you could put the 5-minute load average into the CIB, have 20 
resources on 3 machines, and tell the system that when the difference in 
load gets to be >x, it should migrate some resources away from the heavily 
loaded machine.
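
as a rough illustration of the load-difference idea above (a sketch only, 
not anything heartbeat or the CRM provides): the function names, the node 
names and the idea of passing the per-node loads in as a plain dictionary 
are all assumptions made for the example. os.getloadavg() is the standard 
Python call that returns the 1, 5 and 15 minute load averages on Unix.

```python
import os
from typing import Dict, Optional, Tuple


def local_five_minute_load() -> float:
    """Return this machine's 5-minute load average (Unix only)."""
    # os.getloadavg() returns (1 min, 5 min, 15 min); index 1 is the 5 min value.
    return os.getloadavg()[1]


def pick_migration(loads: Dict[str, float],
                   max_spread: float) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: decide whether resources should be moved.

    loads maps a node name to the 5-minute load average it last reported.
    Returns (busiest, idlest) when the gap between them exceeds max_spread
    (the ">x" in the text), otherwise None.
    """
    if len(loads) < 2:
        return None  # nothing to compare against
    busiest = max(loads, key=loads.get)
    idlest = min(loads, key=loads.get)
    if loads[busiest] - loads[idlest] > max_spread:
        return busiest, idlest
    return None
```

calling pick_migration({"n1": 6.0, "n2": 1.5, "n3": 2.0}, 3.0) would 
suggest moving something from n1 to n2; how many resources to move, and 
which ones, is left open here just as it is in the text.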

now doing this sort of thing will require changes to the CIB from what I 
am reading, but the concept is very powerful and adding it is probably 
worth the effort involved.

If I am understanding things properly, currently the CIB has a health 
value that is acted on immediately.

this would require a second health value with the following 
characteristics (or possibly allow for an arbitrary number of such values):

1. like the normal health value, it needs to be updated regularly, and if 
not updated for a sufficient time period it needs to be considered bad.

2. there needs to be a configurable delay before acting on a difference in 
the value

3. there needs to be a configurable delta that's acceptable (i.e. in some 
cases health values of 78 from one machine and 79 from another should not 
trigger an action)

4. the action to be taken when a difference exceeds the threshold needs 
to be configurable (either a built-in function like 'fail the 
node' or an external script that gets run)

5. after an action takes place it should be possible to raise a flag that 
will prevent further actions for a configurable time period (in my example 
above, you move resources off a loaded machine, and now you need to let the 
load average settle again before you decide whether you need to move more off)

note that with a value of 0 for #2, 1 for #3, 'fail the node' for #4 and 
'no delay' for #5 this degenerates down to the existing health value 
(which may be the way to go for it instead of implementing two completely 
different types of things)
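
to make the five points concrete, here is a minimal sketch of how such a 
value might be evaluated. everything in it is hypothetical: the class and 
field names are invented for illustration and nothing here is existing CIB 
or CRM code. it also assumes that a higher value means healthier, that a 
stale value counts as 0, and that a difference at or above the configured 
delta is what triggers the action.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class HealthPolicy:
    stale_after: float              # 1: seconds without an update before a value is bad
    delay: float                    # 2: seconds a difference must persist before acting
    delta: float                    # 3: differences below this are ignored
    action: Callable[[str], None]   # 4: called with the name of the worst node
    cooldown: float                 # 5: seconds to hold off after an action
    _exceeded_since: Optional[float] = field(default=None, init=False)
    _quiet_until: float = field(default=float("-inf"), init=False)

    def evaluate(self, now: float,
                 reports: Dict[str, Tuple[float, float]]) -> Optional[str]:
        """reports maps node -> (value, time of last update).

        Returns the node that was acted on, or None if nothing happened.
        """
        if not reports or now < self._quiet_until:
            self._exceeded_since = None      # 5: still settling, do nothing
            return None
        # 1: a value that has not been refreshed recently counts as bad (0).
        values = {node: (value if now - updated <= self.stale_after else 0.0)
                  for node, (value, updated) in reports.items()}
        worst = min(values, key=values.get)
        spread = max(values.values()) - values[worst]
        if spread < self.delta:              # 3: difference is acceptable
            self._exceeded_since = None
            return None
        if self._exceeded_since is None:
            self._exceeded_since = now       # start timing the difference
        if now - self._exceeded_since < self.delay:
            return None                      # 2: has not persisted long enough
        self.action(worst)                   # 4: built-in or external script
        self._quiet_until = now + self.cooldown
        self._exceeded_since = None
        return worst
```

with delay=0, delta=1 and cooldown=0, and an action that fails the node, 
this acts as soon as any whole-number difference appears, which matches 
the degenerate case described above.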

David Lang


-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare


