Web page for Linux-HA
Michael Rowan
mtr@cutaway.com
Mon, 19 Oct 1998 14:39:20 -0400
Jim Trocki wrote:
>
> On Mon, 19 Oct 1998 alanr@bell-labs.com wrote:
>
> > The purpose of the serial port in the diagram is ONLY so that the cluster
> > can communicate its internal status with itself, including diagnostic
> > status. For example, if two machines cannot communicate with each other,
> > there are at least three possible failed components: M1's ethernet, M2's
> > ethernet, or the hub/switch that connects them. If you use diagnostics like
> > "I can ping my router/switch" you can gracefully take down the truly
> > isolated node with both sides knowing and agreeing on it.
>
> A fourth thing which can contribute to a failure is faulty software in
> any number of devices.
>
> One thing I've experienced in many many different ways with many many
> different pieces of equipment is that failure modes vary widely. One of
> the big problems with any system which provides some amount of resiliency
> is that they depend on knowing what may fail and what the failure mode
> may be. It is not always possible to know the failure mode.
>
> For example, to throw a wrench into the above example of pinging the
> switch, if some bunk-head accidentally gives his machine the same
> IP address as the switch (I've seen it happen) or some mis-configured
> device starts to proxy arp for the entire universe (I've seen it happen)
> then it will look like a network failure, when quite possibly the switch
> might be forwarding packets between everything fine.
>
> The point is that things go wrong which you do not and cannot predict.
>
> I'm just saying that it's a tricky problem, not that there isn't anything that
> can be done about it.
>
> Jim
You're right, from my standpoint. The trick, I have found,
with HA is to address as many of the things you can predict
directly, provide a mechanism for classifying the things you
can't predict, and have a way to promote types of failures
to other types, such as escalating to node failure when you
can't figure out exactly the right response.
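To make the classify-then-promote idea concrete, here is a minimal
sketch in Python. Everything in it is an assumption made for
illustration: the failure classes, the symptom names, and the
promotion order are hypothetical and do not come from any
particular cluster manager.

```python
from enum import Enum


class FailureClass(Enum):
    # Hypothetical failure classes, narrowest to broadest.
    UNKNOWN = "unknown"
    NIC = "nic"
    NETWORK = "network"
    NODE = "node"


# Hypothetical table of symptoms we know how to classify directly.
KNOWN_SYMPTOMS = {
    "link_down": FailureClass.NIC,
    "switch_unreachable": FailureClass.NETWORK,
    "heartbeat_lost": FailureClass.NODE,
}

# Promotion rules: each class escalates to a broader one.
# Anything we cannot classify goes straight to node failure.
PROMOTION = {
    FailureClass.UNKNOWN: FailureClass.NODE,
    FailureClass.NIC: FailureClass.NETWORK,
    FailureClass.NETWORK: FailureClass.NODE,
}


def classify(symptom):
    """Map a symptom to a failure class, or UNKNOWN if unpredicted."""
    return KNOWN_SYMPTOMS.get(symptom, FailureClass.UNKNOWN)


def promote(failure):
    """Return the next broader class; NODE promotes to itself."""
    return PROMOTION.get(failure, failure)


def respond(symptom, handlers):
    """Try the handler for the classified failure and promote until
    one reports success or there is nothing broader to promote to.

    handlers maps FailureClass -> callable returning True on success.
    Returns (final_class, list_of_classes_tried).
    """
    failure = classify(symptom)
    tried = []
    while True:
        tried.append(failure)
        handler = handlers.get(failure)
        if handler is not None and handler():
            return failure, tried
        broader = promote(failure)
        if broader == failure:
            # Already at node failure; nothing left to escalate to.
            return failure, tried
        failure = broader
```

With this shape, an unpredicted symptom needs no special case: it is
classified as UNKNOWN, has no handler, and ends up handled as a node
failure.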
For instance, I have seen a case where the network becomes
read-only for a particular node. I get packets, but none
get out. It wreaks havoc on a distributed app like the
cluster manager, as you can imagine. Oftentimes, in this
particular case, cycling the NIC helps, but in general, you
would promote this to node or network failure depending on
the configuration (do you have another network you can
rebind your apps to? etc.)
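The receive-only case could be handled along these lines. This is
only a sketch under stated assumptions: the counters, the use of
peer acknowledgments to detect that nothing is getting out, and the
action names are all hypothetical, not taken from a real cluster
manager.

```python
def looks_receive_only(packets_received, sends_attempted, sends_acked):
    """Guess whether a node can hear the network but not talk on it.

    Assumption: peers acknowledge what we send, so traffic arriving
    while none of our attempted sends is acknowledged suggests that
    nothing is getting out.
    """
    return packets_received > 0 and sends_attempted > 0 and sends_acked == 0


def choose_response(nic_already_cycled, has_alternate_network):
    """Pick the next action for a node that looks receive-only.

    First try the cheap local fix (cycling the NIC). If that has
    already been tried, promote: to a network failure when there is
    another network to rebind apps to, otherwise to a node failure.
    """
    if not nic_already_cycled:
        return "cycle_nic"
    if has_alternate_network:
        return "network_failure"
    return "node_failure"
```

The configuration question from the paragraph above shows up as the
has_alternate_network flag: the same symptom is promoted differently
depending on whether there is somewhere else to move the apps.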
It's the massive list of failures that is the problem; we
usually try to handle all the first-order failures, directly
or indirectly if a failure falls in the not-predicted
category, but the cluster must be able to stabilize in the
presence of second-order failures, particularly those that
happen within the processing of the first failure. Often,
the natural response within a cluster itself to a
first-order failure causes other failures. This is what makes
the state transition stuff so intense, and why there is such
an interdependency on resources within the cluster proper.
mtr