STONITH implementations

Alan Robertson alanr@suse.com
Fri, 28 Apr 2000 13:00:18 -0600


David Brower wrote:
> 
> This doesn't totally work.  One of the things you are trying
> to protect against is the temporarily hung kernel.  The other
> nodes may send their 'shoot him' messages, but it is hard to
> know if it is listening.

I assume that an ACK could be used to solve this.  The problem with this
would be that you really need to keep the protocol up and running long
enough to be sure that the ack itself didn't get lost.  A hardware
safeguard is probably more reliable than a software safeguard (sigh).
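
To make the lost-ACK problem concrete, here is a minimal sketch of a
"shoot him" request retried until an ACK comes back. The function and
its two callables (request_arrives, ack_arrives) are hypothetical
stand-ins for a real transport, not anything that exists in heartbeat.

```python
def fence_with_ack(request_arrives, ack_arrives, retries=3):
    """Send a shutdown request up to `retries` times, waiting for an ACK.

    request_arrives() and ack_arrives() are hypothetical callables that
    return True when that message survived the trip.
    Returns (acked, peer_received): what the sender learned, and what
    actually happened on the peer.
    """
    peer_received = False
    for _ in range(retries):
        if request_arrives():
            peer_received = True
            if ack_arrives():
                return True, peer_received
    # No ACK: the sender cannot tell a hung peer from a lost ACK.
    return False, peer_received


# Peer is hung and never hears the request.
hung = fence_with_ack(lambda: False, lambda: True)
# Peer hears the request and acts, but every ACK is lost.
ack_lost = fence_with_ack(lambda: True, lambda: False)
```

In both cases the sender sees acked == False, even though the peer
acted on the request in only one of them. That ambiguity is why a
hardware safeguard is still needed.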

If you changed heartbeat to lock itself in memory and use one of the
more realtime scheduling methods, then the kernel is probably only a
little more reliable than heartbeat.  If you limit yourself to only one
media type in the kernel, then it's probably more reliable.  I've
thought about a shutdown signal or message.  This would still have to be
backed up by a hardware method for the more paranoid IMHO.
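
As a rough illustration of "lock itself in memory and use realtime
scheduling", the sketch below calls mlockall(2) through ctypes and
asks for SCHED_FIFO. It is not heartbeat code. The MCL_* constants are
the common Linux values (they differ on a few architectures), and both
calls normally need root or the matching capabilities, so failures are
reported rather than raised.

```python
import ctypes
import ctypes.util
import os

# Assumed values from <sys/mman.h> on most Linux architectures.
MCL_CURRENT = 1
MCL_FUTURE = 2


def lock_memory():
    """Try to pin all current and future pages in RAM; True on success."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0
    except (OSError, AttributeError, TypeError):
        return False


def use_realtime_scheduling():
    """Try to switch this process to SCHED_FIFO; True on success."""
    if not hasattr(os, "sched_setscheduler"):
        return False  # platform without the POSIX scheduling calls
    try:
        prio = os.sched_get_priority_min(os.SCHED_FIFO)
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(prio))
        return True
    except (OSError, AttributeError):
        return False  # typically PermissionError when unprivileged


def harden_process():
    """Apply both measures and report which ones took effect."""
    return {
        "memory_locked": lock_memory(),
        "realtime": use_realtime_scheduling(),
    }
```

A daemon would call harden_process() once at startup and log the
result. A process that could not be hardened still runs, just with
weaker guarantees against being paged out or starved of CPU.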

> At the same time, if you are willing to go down the path of
> an active kernel agent, then you should also be trying to
> do something more intelligent than panicking.  It becomes reasonable
> and appropriate to consider it the agent of a generic resource
> fencing protocol.  For sake of argument, the GRITS protocol we
> have started to discuss on linux-ha-dev.
> 
> -dB

Neither of these protects very well against loss of communications
media.  Heartbeat is configured to talk over as many media as possible.

Have you thought about whether it would make sense to make use of the
heartbeat comm layer rather than implement a kernel agent?

It would probably be slightly less reliable than doing it all in the
kernel, but you could ride its coattails (so to speak) in terms of
redundancy, and multiple media types.  If you're really paranoid, you
still need to have a hardware safeguard anyway.
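
The redundancy idea can be sketched as follows: every message goes
out over every configured medium, and the receiver drops duplicates by
sequence number. The class names and the message layout are invented
for illustration and do not reflect heartbeat's actual wire format.

```python
class RedundantSender:
    """Send each message over every medium; survive individual failures."""

    def __init__(self, media):
        # media: list of callables taking one message dict (hypothetical)
        self.media = media
        self.seq = 0

    def send(self, payload):
        self.seq += 1
        msg = {"seq": self.seq, "payload": payload}
        delivered = 0
        for transmit in self.media:
            try:
                transmit(dict(msg))
                delivered += 1
            except OSError:
                pass  # this medium is down; the others may still work
        return delivered  # number of media that accepted the message


class DedupReceiver:
    """Deliver each sequence number once, whichever medium it came in on."""

    def __init__(self):
        self.seen = set()  # a real receiver would bound this set

    def accept(self, msg):
        if msg["seq"] in self.seen:
            return None
        self.seen.add(msg["seq"])
        return msg["payload"]


def broken_serial(msg):
    raise OSError("serial link unplugged")


inbox = []
sender = RedundantSender([broken_serial, inbox.append, inbox.append])
accepted_by = sender.send("shoot node2")
receiver = DedupReceiver()
delivered = [p for p in map(receiver.accept, inbox) if p is not None]
```

Here one medium fails and two succeed, so the receiver sees the
message twice on the wire but hands it to the application once.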

Heartbeat is already doing strong authentication as well, and implements
serial ring protocols, so Fábio's major concerns are all addressed.
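
For readers unfamiliar with message authentication, a minimal sketch
using a shared key and HMAC follows. The choice of HMAC-SHA256 and the
"message|tag" packet layout are assumptions made for the example, not
a description of what heartbeat puts on the wire.

```python
import hashlib
import hmac


def sign(key, message):
    """Append a hex HMAC tag to the message (hypothetical packet layout)."""
    tag = hmac.new(key, message, hashlib.sha256).hexdigest().encode("ascii")
    return message + b"|" + tag


def verify(key, packet):
    """Return the message if its tag checks out, otherwise None."""
    message, sep, tag = packet.rpartition(b"|")
    if not sep:
        return None  # no tag present at all
    expected = hmac.new(key, message, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest avoids leaking the match position through timing
    return message if hmac.compare_digest(tag, expected) else None


key = b"shared-cluster-secret"
packet = sign(key, b"shoot node2")
```

A node that does not hold the key cannot forge a "shoot" request that
passes verify(). This sketch does not stop replay of an old valid
packet; that needs a sequence number or timestamp inside the signed
message.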

It is this kind of application, where you need extremely high
reliability, bounded latency, and low bandwidth, that made me architect
the heartbeat comm layer the way I did.  This kind of application is
exactly what it's designed for.

	Comments?

	-- Alan Robertson
	   alanr@suse.com