ACT/ACT [was:A few issues with heartbeat]
banibrata dutta
bdutta@hotmail.com
Fri, 07 Apr 2000 22:37:41 PDT
hi horms,
i have not been an active developer of Linux-HA, but i must admit
i have learnt a lot from these mailing lists. i completed the
development of a Solaris-HA (proprietary) implementation some 6
months back (which i am still maintaining), and right up to the end
of this project, the only major problem left was the ACTIVE/ACTIVE
case, i.e. the case where all communication links on the FT-SUN
Netra boxes (we have 4-level redundancy) have failed.
after many discussions we concluded that our product, which happens
to be deployed in an absolutely mission-critical scenario, would be
better off with a complete outage than with the total chaos that can
be caused by two isolated ACTIVE/ACTIVE boxes. because of the
floating IP addresses, an intermittent last link to the
Service-Router (the router which connects towards the users of the
service provided by our HA-servers) going up and down (due to, say,
an electrical problem) could cause the floating IP address to switch
back and forth between the two boxes, with each box always thinking
that it is the owner of the floating IP. So we decided in favor of
both boxes down rather than both boxes ACTIVE. I don't know whether
this applies to stateless, not-so-critical HTTP servers etc., but i
guess the idea is quite general.
of course, when such a thing happens, we'd get all the alarm bells
ringing at the top of their voice -- paging, e-mail, fax-out, alarms,
logs etc.
we were sure that if, after a series of checks (ethernet layer-2
checks, electrical-circuit-completion checks, state-transition
checks, layer-3 checks, redundant-path-between-boxes checks, and
redundant-path-to-Service-Router checks), we determine that both
boxes MIGHT be active, we would shut down both boxes.
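
To make the decision rule above concrete, here is a minimal sketch in
Python. It is not our product's code. It assumes that each box runs
the same logic on its own and halts itself, and that a single check
which proves the peer is not active is enough to keep running. The
names `dual_active_possible` and `decide`, and the idea of modelling a
check as a function that returns True, are made up for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical model of one check: it returns True only when it PROVES
# that the peer is not also active (for example, the peer answered over
# that path and reported standby). False means "could not tell".
Check = Callable[[], bool]


def dual_active_possible(checks: List[Tuple[str, Check]]) -> Tuple[bool, List[str]]:
    """Run the checks in order.

    Returns (possible, inconclusive_names). `possible` is True when no
    check could rule out that both boxes are active.
    """
    inconclusive: List[str] = []
    for name, check in checks:
        try:
            if check():
                # One path gave a definite answer, so stop here.
                return False, inconclusive
        except OSError:
            # A probe that fails outright is treated as inconclusive.
            pass
        inconclusive.append(name)
    return True, inconclusive


def decide(checks: List[Tuple[str, Check]]) -> str:
    """Prefer a full outage over a possible ACTIVE/ACTIVE state."""
    possible, _ = dual_active_possible(checks)
    return "shutdown" if possible else "continue"
```

The point of the design is the default: an inconclusive result counts
towards shutdown, so the box only stays up on positive evidence.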
In this case, we have sufficient checks at machine startup, bootup,
cold-start, warm-start etc. to see whether the last outage on THIS
box was due to that ACTIVE/ACTIVE possibility detection or not. If it
was, then all sorts of cleanup actions and floating IP arbitration
are done, and if the other box is up, a handshake establishes the
real scenario more clearly, and more cleanups are done.
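
The startup side can be sketched the same way. Again this is only an
illustration under assumptions: the marker file, the reason code
`dual_active_suspected`, and the step names are all hypothetical, and
the real cleanup and arbitration actions are not shown.

```python
import json
import os
from pathlib import Path
from typing import List

DUAL_ACTIVE = "dual_active_suspected"  # hypothetical reason code


def record_shutdown_reason(marker: Path, reason: str) -> None:
    """Persist why this box is going down, before it halts.

    Written to a temporary file and renamed, so a crash in the middle
    cannot leave a half-written marker behind.
    """
    tmp = marker.with_suffix(".tmp")
    tmp.write_text(json.dumps({"reason": reason}))
    os.replace(tmp, marker)


def startup_plan(marker: Path, peer_is_up: bool) -> List[str]:
    """Return the ordered startup steps for THIS box."""
    reason = None
    if marker.exists():
        reason = json.loads(marker.read_text()).get("reason")
    if reason != DUAL_ACTIVE:
        return ["normal_start"]
    # The last outage was a suspected ACTIVE/ACTIVE shutdown.
    steps = ["cleanup", "arbitrate_floating_ip"]
    if peer_is_up:
        # Only a live peer can confirm what really happened.
        steps += ["handshake_with_peer", "post_handshake_cleanup"]
    return steps
```

For example, after `record_shutdown_reason(marker, DUAL_ACTIVE)` a
reboot with the peer down gives only the local cleanup and arbitration
steps, and the handshake steps are added once the peer is reachable.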
Due to the proprietary nature of our product, and my commitment
to my employers, i can't disclose more (although, given a choice,
i'd love to), but if friends in this group have better ideas, and
i can contribute anything more to the discussion, i'd love to
go for it.
Thanks and Regards,
Banibrata Dutta.
----
>As an aside. One test that you haven't reported a problem with, which we
>are still working on a solution to is if the nodes lose communication with
>each other. In your situation this will occur if both the serial link and
>ethernet link are broken, while both nodes are functional. In this case you
>can expect both nodes to become active :( We are working on this and in any
>case you do have two links so the likelihood of this occurring in
>production is low.