Cluster Management Strategy

alanr@bell-labs.com
Sat, 31 Oct 1998 00:48:24 -0700


I have been thinking about cluster management.  I wrote up my thoughts
and put them on my HA web page, and have included them below for your
reading pleasure :-)

As it says at the bottom, I'm sure some of you won't agree with me,
but this is where I have ended up after thinking further about our
previous discussions.

    -- Alan Robertson
       alanr@bell-labs.com


---------- Excerpt from http://www.henge.com/~alanr/ha/index.html ----------

A primary capability for a High-Availability system is what I call a
cluster management strategy.  A good cluster management strategy should
consist of the following things:

   1. A set of resources and resource states to be managed by the
      strategy
   2. A collection of (diagnostic) tests to run periodically and/or on
      demand
   3. A set of recovery actions to take when the various tests fail
   4. A set of tools for manually observing and managing the resources
      and their states

These components work together to monitor and manage each node's
resources, and ultimately the cluster itself.  Each node in the cluster
has a view of itself, and of the cluster as a whole.
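
To make the four components concrete, here is a minimal sketch in
Python of how they might fit together on one node.  None of these
names (Resource, Strategy, State, run_tests, status) come from any
existing tool; they are made up for illustration.  It also assumes
each resource has exactly one test and one recovery action, which is
simpler than a real strategy would need.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class State(Enum):
    """Hypothetical resource states; a real strategy may need more."""
    UNKNOWN = "unknown"
    UP = "up"
    DOWN = "down"


@dataclass
class Resource:
    name: str
    test: Callable[[], bool]        # diagnostic test: True means healthy
    recover: Callable[[], None]     # recovery action to take on failure
    state: State = State.UNKNOWN


@dataclass
class Strategy:
    resources: Dict[str, Resource] = field(default_factory=dict)

    def add(self, resource: Resource) -> None:
        self.resources[resource.name] = resource

    def run_tests(self) -> List[str]:
        """Run every test once, invoke the recovery action for each
        failure, and return the names of the resources that failed."""
        failed = []
        for res in self.resources.values():
            res.state = State.UP if res.test() else State.DOWN
            if res.state is State.DOWN:
                res.recover()
                failed.append(res.name)
        return failed

    def status(self) -> Dict[str, str]:
        """Manual observation: report each resource's last known state."""
        return {name: res.state.value for name, res in self.resources.items()}
```

Calling run_tests() from a timer gives the "periodically" case, and
calling it by hand gives the "on demand" case; status() stands in for
the manual observation tools.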

Examples of resources which might be managed in this fashion include:

     HTTP service (on port 80)
     IP connectivity via IP address www.xxx.yyy.zzz
     Ethernet connectivity via MAC address [aa bb cc dd ...]
     Ethernet hub or switch
     Ethernet NIC eth0
     Disk Subsystem
     Basic Node Sanity

Sample recovery actions which might be taken include:

     Restart the HTTP service
     Bring the NIC in question down, then back up (ifconfig down,
     then ifconfig up)
     Reconfigure the IP address to a redundant NIC on the same node
     Reconfigure another node in the cluster to take over the IP address
     Reconfigure another node in the cluster to take over the MAC
     address
     Reconfigure DNS to remove the node from a DNS round-robin group
     Reconfigure IP addresses to reroute traffic through a redundant hub
     or switch
     Reboot the node
     Notify system administration staff via pager
     et cetera
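
One way to combine several of these actions is to try them in order,
least disruptive first, re-running the test after each one.  The text
above does not prescribe any ordering; the sketch below is only one
possible reading, and the function name escalate and the (name,
action) pairs are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A recovery action paired with a human-readable name (hypothetical).
Action = Tuple[str, Callable[[], None]]


def escalate(test: Callable[[], bool], ladder: List[Action]) -> Optional[str]:
    """Run the actions in order, re-running the test after each one.

    Returns the name of the action after which the test passed, or
    None if the resource is still failing when the ladder runs out.
    """
    for name, action in ladder:
        action()
        if test():
            return name
    return None
```

A ladder for the HTTP service might be: restart the service, move the
IP address to another node, then page the administrators.  Each step
is tried only if the test still fails after the one before it.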

Note that, in my view, monitoring and managing the various networking
components is just another element of the cluster management strategy.
It is an important element, with many dependencies, but it should be
managed in the same fashion as any other resource.  This view is not
universally shared.