Any tools for switch over, changing IP addresses, etc?
mtr at cutaway.com
Tue Oct 6 17:39:16 MDT 1998
Per-Ola Mard wrote:
> Hello Richard,
> Three follow-up questions,
> 1) What type of apps do you run at the two (web/oracle/sybase/informix)?
> 2) How quickly (no outage, one transaction loss or 3 minutes outage)?
> 3) What type of meltdown do you plan to sustain?
> I'm just a bit curious about how we, people on this list, think and what the
> targets are.
> Where are the boundaries/limitations, targets and possibilities?
> Richard Sharpe wrote:
> > Hi,
> > I notice that this list is not a high volume list.
> > I have a site where we have two identical machines (dual 200MHz
> > dual-processor Pentium Pro's), and we would like to switch the machines
> > around quickly in the event of a failure on the primary machine.
> > Are there any tools to help with this? We need to move IP addresses
> > around, and it seems like Linux 2.0.35 does not do gratuitous ARP when an
> > interface is ifconfig'd.
> > Regards
> > -------
> > Richard Sharpe, sharpe at ns.aus.com, NIC-Handle:RJS96
> > NS Computer Software and Services P/L,
> > Ph: +61-8-8281-0063, FAX: +61-8-8250-2080,
> > Samba, Linux, Apache, Digital UNIX, AIX, Netscape, Stronghold, C, ...
It's not really the server machines that are a problem when it comes to the IP
takeover/ARP issue. It's the client machines. There are two general methods
of forcing ARP updates in an enterprise where there are OS's that don't pick up
ARP responses not directed at them. Both are annoying. In the first, you
ping a list of client machines, and the ping will force an ARP update. In the
second, the client machines, upon seeing a change through any of a number of
possible mechanisms, delete the appropriate entries in their ARP cache. There
are other combinations or derivatives of the two (like machines in the local
loop dinking the whole ARP cache every N seconds). Luckily you only need to do
that on your side of the router, so the list is finite, though that doesn't
have to mean small.
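For reference, the gratuitous ARP mentioned in the quoted question is just a
broadcast ARP request in which the sender and target IP are the same address.
Below is a rough Python sketch of building such a frame. The function names,
the example MAC and IP, and the interface name are all made up for
illustration, and sending needs root on Linux. Clients that ignore ARP traffic
not directed at them will still need one of the two methods above.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_gratuitous_arp(src_mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for `ip`."""
    mac = mac_to_bytes(src_mac)
    addr = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    eth = BROADCAST_MAC + mac + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac + addr            # sender hardware and protocol address
    arp += b"\x00" * 6 + addr    # target hardware unknown, target IP is our own
    return eth + arp


def send_frame(ifname: str, frame: bytes) -> None:
    """Send a raw frame on `ifname` (Linux only, requires root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        sock.send(frame)


if __name__ == "__main__":
    # Hypothetical MAC and address, chosen only for the example.
    frame = build_gratuitous_arp("02:00:00:00:00:01", "192.0.2.10")
    print(len(frame), frame.hex())
    # send_frame("eth0", frame)  # uncomment on the machine taking over the address
```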
Or you can opt for taking over addresses at the MAC address level -- the NIC's
hardware address. You avoid ARP issues in this case since ARP is a translation
from logical IP addresses to hardware MAC addresses. Swap both and the network
continues running without any ARP fuss.
Back to IP takeover, the other issue with doing this comes from conflicting
addresses on two machines -- if you have an address flipping back and forth,
there need to be at least three addresses assigned -- one for each machine to
come online with during the boot process, and a third shared address to toss
back and forth. Otherwise, you might do a takeover so that your second machine
is using the primary IP address you have, and the primary machine gets rebooted,
or comes back online, and shoves the old IP address back out on the NIC, and
thoroughly confuses things from there on out.
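To make the three-address idea concrete, here is a small Python sketch of the
bookkeeping only. The Node class, the function names and the addresses are
hypothetical, and it assumes each node can somehow ask whether its peer is
reachable and holding the shared address. It does not touch any interface.

```python
from dataclasses import dataclass

SHARED_IP = "192.0.2.10"  # hypothetical service address that moves between nodes


@dataclass
class Node:
    name: str
    boot_ip: str               # private address, never taken over
    holds_shared: bool = False


def addresses_after_boot(node: Node, peer: Node, peer_reachable: bool) -> list:
    """Addresses a node should configure when it comes up.

    It always uses its own boot address, and claims the shared address only
    if the peer is not already serving it.
    """
    addrs = [node.boot_ip]
    if not (peer_reachable and peer.holds_shared):
        node.holds_shared = True
        addrs.append(SHARED_IP)
    return addrs


def take_over(survivor: Node, failed: Node) -> None:
    """Move the shared address to the surviving node."""
    failed.holds_shared = False
    survivor.holds_shared = True


if __name__ == "__main__":
    primary = Node("primary", "192.0.2.11")
    secondary = Node("secondary", "192.0.2.12")
    print(addresses_after_boot(primary, secondary, peer_reachable=False))
    print(addresses_after_boot(secondary, primary, peer_reachable=True))
    take_over(secondary, primary)   # primary has failed
    # Primary reboots: it comes up on its boot address only, so no conflict.
    print(addresses_after_boot(primary, secondary, peer_reachable=True))
```

Treating an unreachable peer as "not holding the address" is a simplification
for the sketch; a real setup needs something better to avoid both machines
claiming the shared address at once.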
On timings, there are two time issues really -- how fast do you want to
recognize the failure and how fast can you have the cluster "black box" back up
and operational. The former is fairly fixed depending on a few give/take
issues; a certain amount of time has to elapse in order to be sure that you are
dealing with a real machine/NIC failure and not a simple short term outage.
You can crank this number way down, but you end up with more false failures
(where you think it failed, and initiate an (expensive) failover, but in
reality the network had an issue that delayed packets by a few seconds, or a
machine was abnormally busy during a burst that normally would have fixed
itself in a very short time period). Crank it up and you get a much better
chance of only dealing with real failures, but your failover window grows with
it.
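The detection trade-off can be sketched as a counter of consecutive missed
heartbeats. This is only an illustration in Python; the class name, the
heartbeat interval and the thresholds are assumptions, not taken from any
particular HA package.

```python
class FailureDetector:
    """Declare a peer failed after N consecutive missed heartbeats."""

    def __init__(self, interval_s: float, misses_allowed: int):
        self.interval_s = interval_s
        self.misses_allowed = misses_allowed
        self.missed = 0

    @property
    def detection_window_s(self) -> float:
        """Roughly how long a real failure goes unnoticed."""
        return self.interval_s * self.misses_allowed

    def observe(self, heartbeat_seen: bool) -> bool:
        """Record one heartbeat interval; return True if failure is declared."""
        if heartbeat_seen:
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.misses_allowed


if __name__ == "__main__":
    # A short network hiccup (two lost heartbeats) followed by a real failure.
    trace = [True, False, False, True, False, False, False]
    for misses in (1, 3):
        detector = FailureDetector(interval_s=2.0, misses_allowed=misses)
        verdicts = [detector.observe(seen) for seen in trace]
        print(misses, detector.detection_window_s, verdicts)
```

With the threshold at 1 the hiccup already triggers a failover; with 3 it is
ridden out, at the cost of a longer detection window.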
The latter (getting the cluster operational again) is dependent on many, many
things. Acquiring disks that are on a shared SCSI bus or the equivalent, and
how many of these disks there are. Re-mounting filesystems and checking them
after a failure (or redoing the logs if you have a logged or journaled system),
and how many of those there are. How long it takes to start your shared
services (oracle, web servers, what not).... All of these things add up and are
included in the failure recovery time you are targeting.
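As a back-of-the-envelope illustration of how these add up, here is a tiny
Python sketch. Every number in it is a made-up placeholder, not a measurement,
and the function and parameter names are hypothetical.

```python
def recovery_time_s(disks: int, acquire_s_per_disk: float,
                    filesystems: int, check_s_per_fs: float,
                    service_start_s: dict) -> float:
    """Sum the per-step costs of bringing the cluster back up."""
    return (disks * acquire_s_per_disk
            + filesystems * check_s_per_fs
            + sum(service_start_s.values()))


if __name__ == "__main__":
    # Placeholder figures only; measure your own.
    recovery = recovery_time_s(
        disks=4, acquire_s_per_disk=2.0,
        filesystems=3, check_s_per_fs=30.0,
        service_start_s={"database": 60.0, "web": 5.0},
    )
    detection = 15.0  # time spent deciding the failure is real
    print("recovery:", recovery, "total outage:", detection + recovery)
```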
Timing these things at the transaction level is not feasible. You can do some
things at that level, but more at the middleware layer or disk driver layer,
assuming you are doing some kind of network forwarding of disk data, and
even then... Transactional integrity is a higher level concept....