Web page for Linux-HA

alanr at bell-labs.com
Mon Oct 19 09:50:59 MDT 1998

Jim Trocki wrote:

> A fourth thing which can contribute to a failure is faulty software in
> any number of devices.
> One thing I've experienced in many, many different ways with many, many
> different pieces of equipment is that failure modes vary widely. One of
> the big problems with any system which provides some amount of resiliency
> is that it depends on knowing what may fail and what the failure mode
> may be. It is not always possible to know the failure mode.
> For example, to throw a wrench into the above example of pinging the
> switch, if some bunk-head accidentally gives his machine the same
> IP address as the switch (I've seen it happen) or some mis-configured
> device starts to proxy ARP for the entire universe (I've seen it happen)
> then it will look like a network failure, when quite possibly the switch
> might be forwarding packets between everything fine.
> The point is that things go wrong which you do not and cannot predict.
> I'm just saying that it's a tricky problem, not that there isn't anything that
> can be done about it.

Certainly these kinds of things happen, and more frequently than anyone would
want.  The right question is:

        From the *symptoms* that I observe, can I take an action which will
        make things better?

This is not the same as:

        Do I really know what's wrong?
        Am I taking the optimal action?

But it's very important that we not take the pessimal action :-)

It's also important that we eventually develop a model of failure dependencies
which will allow us to percolate failures up the chain.  Mon is headed that way,
but this is a somewhat more deliberative thing.  Paging someone unnecessarily is
a smaller crime than taking a node out of service unnecessarily.
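
As a rough illustration of choosing an action from symptoms alone, here is a
minimal Python sketch.  Everything in it is hypothetical (the Action names, the
choose_action function, the check names) and none of it comes from Mon or any
existing Linux-HA code.  It assumes we run several independent checks and only
take the drastic action when all of them agree; a single failed check, such as
a ping of the switch that might really be a duplicate IP address, only pages
someone.

```python
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical actions, ordered by the cost of taking them unnecessarily."""
    NONE = 0
    PAGE_OPERATOR = 1
    REMOVE_NODE = 2


def choose_action(failed_checks, total_checks):
    """Pick the least drastic action that the observed symptoms justify.

    failed_checks: set of names of independent checks that failed
    total_checks:  number of independent checks that were run
    """
    if total_checks <= 0:
        raise ValueError("at least one check must have been run")
    if not failed_checks:
        return Action.NONE
    # Assumption: only unanimous failure of more than one independent check
    # is strong enough evidence to take the node out of service.
    if total_checks > 1 and len(failed_checks) == total_checks:
        return Action.REMOVE_NODE
    # Ambiguous symptoms: a human looks at it, the node stays in service.
    return Action.PAGE_OPERATOR
```

With three checks, a lone failed "ping_switch" yields PAGE_OPERATOR, and only
the failure of all three yields REMOVE_NODE.  The thresholds are arbitrary; the
point is only that the cheaper mistake is the default.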

    -- Alan Robertson
       alanr at bell-labs.com
