Web page for Linux-HA
Mon, 19 Oct 1998 09:50:59 -0600
Jim Trocki wrote:
> A fourth thing which can contribute to a failure is faulty software in
> any number of devices.
> One thing I've experienced in many many different ways with many many
> different pieces of equipment is that failure modes vary widely. One of
> the big problems with any system which provides some amount of resiliency
> is that they depend on knowing what may fail and what the failure mode
> may be. It is not always possible to know the failure mode.
> For example, to throw a wrench into the above example of pinging the
> switch, if some bunk-head accidentally gives his machine the same
> IP address as the switch (I've seen it happen) or some mis-configured
> device starts to proxy arp for the entire universe (I've seen it happen)
> then it will look like a network failure, when quite possibly the switch
> might be forwarding packets between everything fine.
> The point is that things go wrong which you do not and cannot predict.
> I'm just saying that it's a tricky problem, not that there isn't anything that
> can be done about it.
Certainly these kinds of things happen, and more frequently than anyone would
want. The right question is:
From the *symptoms* that I observe, can I take an action which will make
things better?
This is not the same as:
Do I really know what's wrong?
Am I taking the optimal action?
But it's very important that we not take the pessimal action :-)
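One way to picture "not taking the pessimal action" is to pick the action whose worst outcome, over every cause consistent with the symptom, is the least bad. The sketch below is only an illustration: the cause names, action names, and cost numbers are all made up for the example and do not come from Mon or any other tool.

```python
# Hypothetical costs (arbitrary units) of each action under each possible
# underlying cause. All names and numbers here are assumptions.
COST = {
    "do_nothing":    {"switch_dead": 10, "duplicate_ip": 2, "proxy_arp": 2},
    "page_operator": {"switch_dead": 3,  "duplicate_ip": 1, "proxy_arp": 1},
    "remove_node":   {"switch_dead": 1,  "duplicate_ip": 8, "proxy_arp": 8},
}

# Causes that could produce each observed symptom (also assumed).
CAUSES_FOR_SYMPTOM = {
    "switch_ping_fails": {"switch_dead", "duplicate_ip", "proxy_arp"},
}


def choose_action(symptom):
    """Return the action with the smallest worst-case cost for a symptom."""
    causes = CAUSES_FOR_SYMPTOM[symptom]

    def worst_case(action):
        # We cannot tell which cause is real, so assume the most costly one.
        return max(COST[action][cause] for cause in causes)

    return min(COST, key=worst_case)
```

With these made-up numbers, a failed ping of the switch leads to paging an operator: it is never the best response to any single cause, but it is never the worst either.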
It's also important that eventually we develop a model of failure dependencies
which will allow us to percolate things up the chain. Mon is headed that way,
but that is a somewhat more deliberative process. Paging someone unnecessarily is a
smaller crime than taking a node out of service unnecessarily.
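A model of failure dependencies could, in its simplest form, be a map from each monitored item to the things it depends on, with a failure attributed to the failed items furthest down the chain. This is a rough sketch under that assumption. The item names and the `root_causes` helper are hypothetical and are not how Mon represents dependencies.

```python
# Hypothetical dependency map: each item lists what it depends on.
DEPENDS_ON = {
    "web_service": ["node_a"],
    "node_a": ["switch"],
    "switch": [],
}


def root_causes(item, failed, depends_on=DEPENDS_ON):
    """Return the failed items furthest down the chain of failures from item.

    Only unbroken chains of failed items are followed; a dependency that
    still looks healthy ends the search along that path.
    """
    if item not in failed:
        return set()
    deeper = set()
    for dep in depends_on.get(item, []):
        deeper |= root_causes(dep, failed, depends_on)
    # If no dependency has failed, this item is itself the likely cause.
    return deeper or {item}
```

For example, if `web_service`, `node_a`, and `switch` are all reported as failed, `root_causes("web_service", ...)` returns only `{"switch"}`, so a single alert about the switch can stand in for three.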
-- Alan Robertson