Cluster Management Strategy
alanr@bell-labs.com
Sat, 31 Oct 1998 00:48:24 -0700
I was thinking some about cluster management. I wrote up my thoughts
and put them on my HA web page, and have included them below for your
reading pleasure :-)
As it says at the bottom, I'm sure there are those of you who won't
agree with me, but here are my thoughts after thinking some more about
our previous discussions.
-- Alan Robertson
alanr@bell-labs.com
---------- Excerpt from http://www.henge.com/~alanr/ha/index.html ----------
A primary capability for a High-Availability system is what I call a
cluster management strategy. A good cluster management strategy should
consist of the following things:
  1. A set of resources and resource states to be managed by the
     strategy
  2. A collection of (diagnostic) tests to run periodically and/or on
     demand
  3. A set of recovery actions to take when the various tests fail
  4. A set of tools for manually observing and managing the resources
     and their states
These components work together to monitor and manage the node's
resources, and ultimately the cluster itself. Each node in the cluster
has a view of itself, and of the cluster as a whole.
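As a rough illustration of how the four components might fit together, here is a minimal Python sketch. All of the names in it (State, Resource, run_tests, status_report) are hypothetical and chosen only for this example; the tests are stand-in lambdas, and the recovery actions are stored but not invoked.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class State(Enum):
    """Resource states tracked by the strategy (component 1)."""
    UNKNOWN = "unknown"
    UP = "up"
    DOWN = "down"


@dataclass
class Resource:
    name: str
    # Diagnostic test (component 2): returns True when the resource is healthy.
    test: Callable[[], bool]
    # Recovery actions (component 3): stored here, not invoked in this sketch.
    recovery_actions: List[Callable[[], None]] = field(default_factory=list)
    state: State = State.UNKNOWN


def run_tests(resources: List[Resource]) -> None:
    """Run every resource's test once and record the resulting state."""
    for r in resources:
        r.state = State.UP if r.test() else State.DOWN


def status_report(resources: List[Resource]) -> List[str]:
    """One line per resource, for manual observation (component 4)."""
    return [f"{r.name}: {r.state.value}" for r in resources]


# Stand-in tests: a real node would probe the actual service and NIC.
resources = [
    Resource("http", test=lambda: True),
    Resource("eth0", test=lambda: False),
]
run_tests(resources)
print("\n".join(status_report(resources)))
```

A real node would call run_tests periodically and on demand, and would keep a similar table for its view of the rest of the cluster.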
Examples of resources which might be managed in this fashion include:
    HTTP service (on port 80)
    IP connectivity via IP address www.xxx.yyy.zzz
    Ethernet connectivity via MAC address [aa bb cc dd ...]
    Ethernet hub or switch
    Ethernet NIC eth0
    Disk Subsystem
    Basic Node Sanity
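To make the idea of a diagnostic test concrete, here is one possible test for the first resource in the list, the HTTP service on port 80. It is only a sketch: the function name tcp_port_open is made up for this example, and it checks only that something accepts a TCP connection on the port, not that a valid HTTP response comes back.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises an OSError subclass on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example: probe the local HTTP service (the result depends on the machine).
print(tcp_port_open("127.0.0.1", 80))
```

A stricter test might send a real HTTP request and check the response; the shallow probe is shown here because it is cheap enough to run frequently.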
Sample recovery actions which might be taken include:
    Restart the HTTP service
    Bring the NIC in question down, then back up, with ifconfig
    Reconfigure the IP address to a redundant NIC on the same node
    Reconfigure another node in the cluster to take over the IP address
    Reconfigure another node in the cluster to take over the MAC address
    Reconfigure DNS to remove the node from a DNS round-robin group
    Reconfigure IP addresses to reroute traffic through a redundant hub
        or switch
    Reboot the node
    Notify system administration staff via pager
    et cetera
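One way such recovery actions could be combined is as an escalation ladder: try an action, re-run the test, and move to the next action only if the test still fails. The sketch below assumes that ordering policy, which is an assumption made for illustration and not something prescribed above. The escalate function, the action functions, and the simulated service dictionary are all hypothetical.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, Callable[[], None]]


def escalate(test: Callable[[], bool], actions: List[Action]) -> List[str]:
    """Run actions in order until the test passes; return the names attempted."""
    attempted = []
    for name, action in actions:
        if test():          # resource is healthy again: stop escalating
            break
        action()
        attempted.append(name)
    return attempted


# Simulated resource state, standing in for a real diagnostic test.
service = {"healthy": False}


def restart_http() -> None:
    pass  # simulated: the restart does not fix the problem


def fail_over_ip() -> None:
    service["healthy"] = True  # simulated: another node takes over the IP


def page_admin() -> None:
    print("paging system administration staff")


ladder = [
    ("restart HTTP service", restart_http),
    ("move IP address to another node", fail_over_ip),
    ("page system administration staff", page_admin),
]
attempted = escalate(lambda: service["healthy"], ladder)
print(attempted)
```

In this simulated run the second action repairs the service, so the pager step is never reached.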
Note that in my personal view, the monitoring and management of the
various networking components is just another element of the cluster
management strategy. It is an important element, with many
dependencies, but it should be managed in the same fashion as any
other resource. This view is not universally shared.