[Linux-HA] Ipfail support for heartbeat 2.0.x

Guochun Shi gshi at ncsa.uiuc.edu
Tue Oct 18 13:33:06 MDT 2005


At 09:14 PM 10/18/2005 +0200, you wrote:
>On 2005-10-18T20:50:12, Tim Verhoeven <tim.verhoeven.be at gmail.com> wrote:
>
>> Will ipfail ever get support in 2.0.x / CRM style resource management ?
>
>Yes, we'll eventually implement similar functionality in 2.0.x.
>
>A cheap-skate version would be quite easily implemented: just feed the
>number of nodes any node can ping into the CIB as a node attribute and
>put dependencies on that.
>
>Or use a resource agent which just checks the ping status on "monitor"
>and have it fail if not; then the resources would migrate too.
>
>However, this isn't good enough, and the reason is subtle and seems to
>escape most people most of the time ;-)
>
>The problem here is that this will cause pointless resource bouncing if
>the ping node is actually having the problem, or if the error affects
>all nodes (or even just a subset); in that case, bouncing the resource
>around is totally pointless and causes actual harm - because we might
>bounce it to a node which just hasn't yet noticed it has the same
>error.

I think the cheap version is good.
If ping nodes are unstable, then we cannot do anything about it except move resources
accordingly. As for the normal case where a ping node dies, we can avoid moving
a resource if ipfail delays reporting the dead node by one heartbeat interval -- by that time
all nodes should have noticed that the ping node is dead.
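
To make the delay idea concrete, here is a rough sketch in Python. It is not ipfail or heartbeat code; the class name, the `delay` parameter and the `observe` method are all made up for illustration. It only passes on a change in ping status once the change has lasted for the full delay (e.g. one heartbeat interval), so a short blip is never reported.

```python
class DelayedPingReporter:
    """Hypothetical helper: report a ping-node state change only after
    it has persisted for `delay` seconds (e.g. one heartbeat interval)."""

    def __init__(self, delay, initial_alive=True):
        self.delay = delay
        self.reported = initial_alive   # last state passed on to the cluster
        self._pending = None            # candidate new state, not yet reported
        self._pending_since = None      # time the candidate was first seen

    def observe(self, alive, now):
        """Feed one ping result taken at time `now`; return the state to report."""
        if alive == self.reported:
            # Observation agrees with the reported state: drop any pending change.
            self._pending = None
            self._pending_since = None
        elif self._pending != alive:
            # First observation of a different state: start the clock.
            self._pending = alive
            self._pending_since = now
        elif now - self._pending_since >= self.delay:
            # The change has lasted for the full delay: report it.
            self.reported = alive
            self._pending = None
            self._pending_since = None
        return self.reported
```

With `delay=2.0`, a ping node that stops answering at t=1 is still reported alive at t=2 and is reported dead from t=3 on; a single failed ping followed by a good one changes nothing.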


>So, the node attribute needs to be coordinated, dampened and hysteresis
>etc be implemented. Preferably by a small external daemon or something
>which provides this feature generically, so we can also use it for, say,
>connectivity to storage too...
>
>Any takers? ;-)
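
For the "coordinated" part, a rough sketch in Python of what the decision could look like. This is only an assumption about one possible rule, not the CRM's behaviour: the function name and the `reachable` dict (node name -> number of ping nodes that node can reach) are hypothetical. The resource is moved only if some other node has strictly better connectivity, so a dead ping node that nobody can reach causes no bouncing.

```python
def pick_better_node(current, reachable):
    """Hypothetical rule: return a node whose ping count is strictly
    higher than that of `current`, or None if moving would not help.

    `reachable` maps node name -> number of ping nodes it can reach.
    """
    # Sort names first so ties are broken deterministically by name.
    best = max(sorted(reachable), key=lambda node: reachable[node])
    if reachable[best] > reachable[current]:
        return best
    # Everyone is equally well (or badly) connected: stay put.
    return None
```

For example, with `{"a": 0, "b": 0}` the resource stays on "a" because the ping node is gone for everybody, while `{"a": 0, "b": 1}` moves it to "b".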

Open an entry in Bugzilla and temporarily assign it to someone; people who have free time
will jump in and take it :)

-Guochun



