More Linux-HA heartbeat thoughts
hm at seneca.muc.de
Sat Mar 21 04:33:00 MST 1998
alanr at bell-labs.com wrote:
> If you implement a "bidirectional ring" network with serial ports and you have
> a two or 3-machine network, you *have* redundancy in your communication path.
> And since the underlying hardware is much simpler and less failure prone than
> ethernet (and doesn't require any additional slots), you have a very
> inexpensive, reliable (low-bandwidth) communication medium with no single point
> of failure.
Alan, I get your point, but you need to check that all critical network
adapters are working correctly anyway. The easiest and cheapest way to do
this is to send HB packets along the networks, between Service IP addresses
and Standby IP addresses, separately. This also buys you redundancy in
the HB mechanism itself, for free. I totally agree that the RS232 path
is likely to be more reliable (unless someone pulls a cable, which
can lead to node isolation), but that was not my point.
What we must avoid is having a single HB path/technology which could be a
SPOF. For example, most PC mainboards have 2 serial interfaces, but both
are attached to one single communications chip on the MB. If this chip
fails, the machine is isolated if no other HB path exists. This is
where IP-based HB comes in handy. You could put in an additional serial
card, but in HA configurations most PCI/ISA boards have very few free slots
left! And I don't much like the idea of giving end users lots of rules like
"do not use both internal serial interfaces for HB". These things are
error-prone and, since nobody reads documentation, likely to be ignored :-)
My proposal: have a common high-level HB API which hides what is below.
Could be serial, anything IP-based, TM SCSI or whatever. If an entire network
fails (for example, because the single serial comms chip on the MB fails),
issue a "network down" event, and let the end user customize the behaviour
for "network down". Could be an immediate shutdown in some cases (to
prevent data corruption, for example), could be an SNMP trap in others
(to notify a Netview administrator that you just lost HB redundancy and
should do something about it).
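
To make the proposal a bit more concrete, here is a minimal sketch in Python
of what such a high-level HB layer might look like. It is an illustration
only, not an existing Linux-HA interface: the names HeartbeatPath,
HeartbeatMonitor, on_network_down and the timeout values are all
hypothetical. Timestamps are passed in explicitly so the logic is easy to
follow and to test.

```python
from typing import Callable, Dict, List, Set


class HeartbeatPath:
    """One low-level HB path (hypothetical): a serial line, an IP network, ..."""

    def __init__(self, name: str, medium: str, timeout: float = 5.0):
        self.name = name          # e.g. "ttyS0" or "eth0" (illustrative names)
        self.medium = medium      # "serial", "ip", "tm-scsi", ...
        self.timeout = timeout    # seconds of silence before the path counts as down
        self.last_seen = 0.0      # time of the last heartbeat seen on this path

    def is_alive(self, now: float) -> bool:
        return (now - self.last_seen) <= self.timeout


# A handler receives the failed path and the number of paths still alive.
Handler = Callable[[HeartbeatPath, int], None]


class HeartbeatMonitor:
    """High-level HB layer: callers never see which medium is underneath."""

    def __init__(self) -> None:
        self.paths: Dict[str, HeartbeatPath] = {}
        self.handlers: List[Handler] = []
        self.down: Set[str] = set()

    def add_path(self, path: HeartbeatPath, now: float) -> None:
        path.last_seen = now      # start the silence timer at registration
        self.paths[path.name] = path

    def on_network_down(self, handler: Handler) -> None:
        """Register a user-defined reaction to a "network down" event."""
        self.handlers.append(handler)

    def heartbeat_received(self, name: str, now: float) -> None:
        self.paths[name].last_seen = now
        self.down.discard(name)   # the path has recovered

    def check(self, now: float) -> None:
        """Fire "network down" once for each path that has gone silent."""
        for path in self.paths.values():
            if path.name in self.down or path.is_alive(now):
                continue
            self.down.add(path.name)
            remaining = len(self.paths) - len(self.down)
            for handler in self.handlers:
                handler(path, remaining)
```

A site could then register its own policy, for instance shutting down when
the remaining count reaches zero and only sending a notification while
other paths are still alive.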
Your bidirectional ring is exactly what I was thinking of. Let us have the
same logical bi-di ring for IP HB as well, that is, for all low-level HB
paths that are supported by the proposed high-level HB interface.
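
As a rough illustration of the "same logical ring on every path" idea, the
following Python sketch computes, for each node, the two neighbours it
would exchange heartbeats with, and repeats that plan for every HB path.
The function names and the node/path names in the usage note are made up
for the example; how the ring order is actually agreed on is left open.

```python
from typing import Dict, List, Tuple


def ring_neighbours(nodes: List[str], me: str) -> Tuple[str, str]:
    """Return (previous, next) neighbour of `me` in the logical ring.

    The ring order is simply the order of `nodes`. In a two-node
    cluster both neighbours are the same peer.
    """
    i = nodes.index(me)
    n = len(nodes)
    return nodes[(i - 1) % n], nodes[(i + 1) % n]


def ring_plan(nodes: List[str], paths: List[str]) -> Dict[str, Dict[str, Tuple[str, str]]]:
    """Apply the same logical ring to every low-level HB path."""
    return {
        path: {node: ring_neighbours(nodes, node) for node in nodes}
        for path in paths
    }
```

For example, ring_plan(["a", "b", "c"], ["serial", "ip"]) gives node "a"
the neighbours ("c", "b") on both the serial and the IP path, so each node
sends heartbeats in both directions on every medium.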