[Linux-HA] ipfail in V2

Andrew Beekhof beekhof at gmail.com
Fri Oct 21 07:57:11 MDT 2005


On 10/21/05, Alan Robertson <alanr at unix.sh> wrote:

[snip]

> > there are a couple of problems with a timer like you're proposing.
> >
> > the first is that it may be nullified by an unrelated update (ie. from
> > a failed monitor) that _must_ be acted on straight away.
>
> Understood that this is going to happen from time to time - but rarely -
> since I presume that one only updates values when something has changed
> - like the number of visible ping nodes.  The moral of the story for
> this is _always_ use a fairly coarse granularity measure.
>
> In general _any_ kind of recovery action is supposed to be rare.
>
> > the second is that ping node access isn't the only thing you'd want to
> > monitor in this fashion.  so you also potentially have multiple,
> > unrelated hystereses tripping over each other.
>
> If one sticks to the principle that updates only happen when something
> interesting (like # of ping nodes) changes, "event tripping" should be
> rare.  And, then the shortest hysteresis interval will effectively win -
> so you don't need to have more than one of these timers active.  One
> should suffice.  If another event comes in with a shorter interval than
> the amount remaining on the current one, replace it with the shorter
> one.  Otherwise ignore it.
>
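A minimal sketch of the "shortest interval wins" rule described above, assuming a single pending deadline measured in seconds. The class and method names are hypothetical and are not taken from the Heartbeat or CRM code.

```python
class HysteresisTimer:
    """Holds at most one pending deadline (hypothetical illustration)."""

    def __init__(self):
        # Absolute time at which the pending update would fire, or None.
        self.deadline = None

    def request(self, now, interval):
        """Handle an event arriving at time `now` with a hysteresis `interval`.

        The timer is (re)armed only if nothing is pending, or if the new
        interval is shorter than the time remaining on the current one.
        Returns True if the timer was armed, False if the event was ignored.
        """
        if self.deadline is None or now + interval < self.deadline:
            self.deadline = now + interval
            return True
        return False


timer = HysteresisTimer()
timer.request(now=0, interval=60)   # e.g. a temperature event: armed for t=60
timer.request(now=10, interval=5)   # e.g. a ping node event: shorter, replaces it
timer.request(now=11, interval=60)  # longer than the time remaining: ignored
```

With this rule only one timer is ever active, at the cost that a long-interval event can be acted on early when a short-interval event arrives.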
> The hysteresis interval for ping nodes is "keepalive" time - which is
> typically short - minimizing this danger _for ping nodes_.  For
> temperature, the hysteresis interval might be a minute or maybe even more.
>
> But, if this happens it means it's getting hotter and ping nodes have
> both gone out at the same time...  [Sounds hinky to me].
>
> > third, it sounds a lot like work :-)
>
> Does having a single "repoke" interval like I described make it any
> easier?  From what you say below, it may be much closer to being done
> than I had thought.  Minor changes to the repoke interval (or a clone of
> the code) might be just what the doctor ordered.
>
> > there is also the ability to set a "repoke" interval for the PE -
> > would that be sufficient (again, only as a short-term option)?
>
> Is this a one-shot timer?  What happens when you get conflicting
> "repokes"?  Does the last one win?  That might be OK - at least for
> events with similar hysteresis intervals.

the repoke is controlled by the DC.
it is started when the DC enters the idle state and cancelled if it
ever moves out of it.
so there is never a conflict - because there is only 1 timer and only
1 node running it.
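
a rough sketch of that behaviour, assuming time is passed in explicitly and the states are plain strings. the names here are made up for illustration and are not the actual crmd code.

```python
class RepokeTimer:
    """Single timer owned by the DC (hypothetical illustration)."""

    def __init__(self, interval):
        self.interval = interval  # seconds to wait in the idle state before repoking
        self.deadline = None      # None means the timer is not running

    def on_state_change(self, new_state, now):
        # Start the timer on entering idle; cancel it on any other state.
        if new_state == "idle":
            self.deadline = now + self.interval
        else:
            self.deadline = None

    def due(self, now):
        # True once the DC has stayed idle for the full interval.
        return self.deadline is not None and now >= self.deadline


repoke = RepokeTimer(interval=30)
repoke.on_state_change("idle", now=0)
repoke.due(now=30)                      # timer expired while still idle
repoke.on_state_change("busy", now=31)  # leaving idle cancels the timer
```

since only the DC runs this and there is only one deadline, there is nothing for a second request to conflict with.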

what you're thinking of is a timer running in the CIB.
you'd need to indicate somehow that this change should start/extend a timer.
you'd also need to keep track of which updates have been sent out.
you'll start confusing clients because the order will be all messed up.
you could even have a situation where the update doesn't even exist
anymore because the whole CIB was replaced in the meantime.

the other option is to send the notifications, but have the TE (which
normally triggers the PE) managing the timers.

but that's messy too.

>
> > the other option is to later trigger the change with an extra update
> > that doesn't use the super-top-secret flag.
>
> By the way, I'm not 100% sure that having this interval be the shortest
> is always the best choice.  Having it be the longest might be a better
> choice in some circumstances.
>
> If this is true, there are circumstances when the optimal choice is
> undecidable.  But, since this is rare - we probably shouldn't worry
> about it _that_ much :-).

sorry, you lost me here.


