More Linux-HA heartbeat thoughts

alanr at bell-labs.com
Wed Mar 18 08:50:10 MST 1998


Andy Poling wrote:
> I've been working (slowly, for years) on a protocol to make a (system-wise
> and geographically) distributed application robust.  My best stab at it so
> far involves the notion of a quorum along with a tolerance for slowly
> changing battery size (I call it a "battery" instead of a "cluster")
> permitting a re-formulation of the size of the required quorum.  The idea
> is that if systems disappear slowly and individually, they will probably
> not come back together on the other side of some black hole and form a new
> quorum.  It's really a gamble though, since you still have to have some way
> of preventing such a reformation.  At present, I require human intervention
> when a majority of the systems find themselves trying to form a battery
> from scratch.

This is a good overview of the kinds of problems one runs into when doing
"real" HA with a non-trivial application suite.

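To make the quoted idea concrete, here is a rough sketch of a quorum whose
required size follows a slowly changing membership.  The class name Battery,
the one-member-per-round adjustment and the simple-majority rule are all
assumptions made for illustration; they are not Andy's actual protocol.

```python
class Battery:
    """Hypothetical model of a quorum that tolerates slow membership change."""

    def __init__(self, size):
        self.size = size  # battery size the members currently agree on

    def quorum(self):
        # Assumption: quorum is a simple majority of the agreed size.
        return self.size // 2 + 1

    def observe(self, visible):
        """Record how many members are visible this round.

        Returns True if the visible members still form a quorum.  Only then
        is the agreed size allowed to drift, and by at most one member per
        round, so a sudden large loss cannot redefine the quorum.
        """
        if visible < self.quorum():
            return False  # no quorum: leave the size alone, ask a human
        if visible < self.size:
            self.size -= 1
        elif visible > self.size:
            self.size += 1
        return True
```

With this sketch, a battery of 5 that loses members one at a time keeps a
quorum down to 2 members, while a battery of 5 that suddenly sees only 2
members does not.  It does nothing to stop the lost members from re-forming
on their own, which is the gamble described above.
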
> PS - I think maybe it would be fruitful to think more about these higher
> level issues before we begin crafting the supporting protocols (such as
> heartbeats)...

I did consider doing this, but decided to plunge ahead where angels fear to
tread because of the value of having a platform to experiment with, learn from,
and criticize, and the energizing value of having something working.  Also, the
mechanism I designed is very low-level.  It doesn't DECIDE anything; it only
reports what it sees -- at the lowest level.  Some applications (like web
servers) don't require much in the way of application-level recovery, so it
seems a shame not to move forward and learn from a working platform.
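
As a rough illustration of "reports, doesn't decide", a monitor of this kind
might look something like the sketch below.  The names HeartbeatMonitor,
heard_from, report and deadtime are hypothetical and are not taken from the
actual heartbeat code; timestamps are passed in by the caller.

```python
class HeartbeatMonitor:
    """Hypothetical low-level monitor: records heartbeats, reports status."""

    def __init__(self, deadtime):
        self.deadtime = deadtime  # seconds of silence before "silent"
        self.last_seen = {}       # node name -> time of last heartbeat

    def heard_from(self, node, now):
        # Called whenever a heartbeat from `node` arrives at time `now`.
        self.last_seen[node] = now

    def report(self, now):
        # Report what was seen; what to do about it is left to higher layers.
        return {
            node: "up" if now - seen <= self.deadtime else "silent"
            for node, seen in self.last_seen.items()
        }
```

Any failover or recovery policy would sit on top of report() rather than
inside the monitor.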

Simple two- and three-machine clusters in close physical proximity don't have
nearly the complexity of larger, more distributed clusters, yet are very useful
in their own right.  My particular application is an HA software build
engine.  A 3-node cluster of dual-processor DEC Alphas is a VERY useful cluster
for this purpose, but avoids the highest level of complexity associated with
transaction processing, truly distributed nodes and larger clusters.  My
recovery might be "don't check files in or out of source control" when in
degraded mode.  This level of support is very simple and yet quite useful in
this application.
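
A degraded-mode rule of that kind could be sketched as follows.  The function
name, the operation names and the definition of "degraded" as "fewer nodes up
than the full cluster" are assumptions made for illustration only.

```python
def allowed_operations(nodes_up, cluster_size=3):
    """Return the set of operations permitted for the build engine.

    Assumption: the cluster is "degraded" whenever any node is missing.
    """
    if nodes_up <= 0:
        return set()  # nothing is running, so nothing is allowed
    allowed = {"build"}  # builds continue even in degraded mode
    if nodes_up >= cluster_size:
        # Only a full cluster may touch source control.
        allowed |= {"checkin", "checkout"}
    return allowed
```

For a 3-node cluster this permits builds with two nodes up but refuses
check-ins and check-outs until the third node is back.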

As processors get faster and faster, such 2-3 node HA clusters will become more
and more prevalent.

	-- Alan Robertson
	   alanr at bell-labs.com


