Cluster manager structure for Linux-HA (?)

alanr at bell-labs.com alanr at bell-labs.com
Sat Mar 21 06:04:43 MST 1998


> alanr at bell-labs.com wrote:
> > But the ability to monitor "things", detect non-working facilities and
> > services, invoke recovery actions (and in a future release) handle
> > dependencies seems like a good place to start, rather than re-inventing
> > the wheel. 
> 
> This is why I proposed to unify SCSI adapter error reporting in the first
> place. Michael Neuffer said he wanted to implement this during the re-write
> of the SCSI mid layer code but we aren't very likely to see this before the
> 2.3 kernels I am afraid. 
> 
> Alan, you're right with your proposal. What we need is a generic, unified
> error reporting mechanism in the kernel itself, like AIX has, so that
> device drivers can report error codes in accordance with an error code
> table, which in turn could be easily parsed by the cluster manager. One
> _could_ also set up syslogd (together with klogd) to write to a pipe and let
> the cluster manager read from that pipe, but if a device driver developer
> decides to slightly modify his error strings you will have a hard time
> parsing these strings with regexps :-( 
> 
> I don't know mon, but does it work differently from this approach?
> 
> As far as HA is specifically concerned, what we need is asynchronous error
> detection to make sure errors are noticed instantly. The "voting" part
> would be a functionality within the CM daemon. 

What "mon" provides is a framework for diagnostics that test things on a
periodic basis.  For example, mon comes with a monitor which tests ethernet
connectivity.  You can use that to test the path to your hub (if it has its
own ethernet address), to other elements of the cluster, and to the router.
It also comes with a monitor which tests connectivity to an httpd server, so
you can see if your httpd server is working.  Since more failures are software
than hardware, and some components (like your hubs and routers) are outside
the cluster, you need to be able to test many things -- only a few of which
can be measured easily by device drivers as a matter of course.  Of course, if
your drivers are capable, they can feed mon too.
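
To make the "periodic tests" idea concrete, here is a minimal sketch in
Python.  It is not mon's actual interface: the Monitor class, the run_due()
function and the interval values are all hypothetical names chosen for
illustration, and the tests are stand-in functions rather than real ethernet
or httpd probes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Monitor:
    """One periodic diagnostic test (hypothetical structure, not mon's)."""
    name: str
    interval: float            # seconds between runs
    test: Callable[[], bool]   # returns True when the thing tested is healthy
    next_due: float = 0.0      # time at which the test should next run


def run_due(monitors: List[Monitor], now: float) -> Dict[str, bool]:
    """Run every monitor whose interval has elapsed; return name -> healthy."""
    results = {}
    for m in monitors:
        if now >= m.next_due:
            results[m.name] = m.test()
            m.next_due = now + m.interval
    return results


# Stand-in tests: a real monitor would probe the hub or the httpd server here.
monitors = [
    Monitor("hub", 10.0, lambda: True),
    Monitor("httpd", 30.0, lambda: False),
]
```

A driver loop would call run_due(monitors, time.time()) every second or so
and hand the results to whatever interprets them.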

Mon gives you a structure for running periodic tests.  Once its author
finishes adding dependency support, you will also be able to give it knowledge
of the system dependency structure.  For example, connectivity to your router
depends on connectivity to your hub.  If your hub goes out, you don't want to
diagnose the router as bad, because something it depends on (the hub) is
already marked bad.  Similarly, you don't want to call your other nodes'
networks dead if you can't talk to your hub.
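
The hub/router example can be sketched as follows.  This is only an
illustration of the idea, assuming each component depends on at most one
other component; the diagnose() function and the depends_on table are
hypothetical and say nothing about how mon will actually represent
dependencies.

```python
from typing import Dict, Set


def diagnose(results: Dict[str, bool], depends_on: Dict[str, str]) -> Set[str]:
    """Return the components to blame: failed ones with no failed ancestor."""

    def ancestor_failed(name: str) -> bool:
        seen = set()                       # guards against dependency cycles
        parent = depends_on.get(name)
        while parent is not None and parent not in seen:
            if not results.get(parent, True):   # untested parents count as OK
                return True
            seen.add(parent)
            parent = depends_on.get(parent)
        return False

    return {name for name, ok in results.items()
            if not ok and not ancestor_failed(name)}


# Assumed topology: the router and the other node are reached through the hub.
depends_on = {"router": "hub", "node2": "hub"}
```

With the hub down, diagnose() blames only the hub, even though the router
and node2 tests fail as well.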

Here's how I would see a "mon-based" structure:


		Configuration	   Diagnostic Tests
                  Manager	   (mon "monitors")
	   (mon "alert" scripts)	  |
		     |			  |
		     |			  |
		      \			 /
		       \		/
			\	       /
			 \	      /
		          "Mon" daemon
				|
			Heartbeat Monitor
			 (invokes "mon" ?)

I'm not 100% sure exactly how the heartbeat monitor should tie in.
Mon schedules the tests and "interprets" their results (by looking
at how often something has failed, what the dependencies are, etc.).

Mon, in turn, notifies the configuration manager of a system component
or service failure (httpd daemon has died, ethernet connectivity lost, etc.),
and the configuration manager then decides what to do and takes appropriate
actions.
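
One way to picture the "interpret, then notify" step is the sketch below.
It assumes a simple rule (alert after N consecutive failures), which is my
own simplification and not necessarily what mon does; FailureTracker and
the alert callback are hypothetical names standing in for the daemon and
for an alert script that reaches the configuration manager.

```python
from collections import defaultdict
from typing import Callable


class FailureTracker:
    """Count consecutive failures and alert once when a threshold is hit."""

    def __init__(self, threshold: int, alert: Callable[[str], None]):
        self.threshold = threshold
        self.alert = alert                  # stand-in for an "alert" script
        self.failures = defaultdict(int)    # name -> consecutive failures

    def record(self, name: str, healthy: bool) -> None:
        if healthy:
            self.failures[name] = 0         # a success resets the count
            return
        self.failures[name] += 1
        if self.failures[name] == self.threshold:
            self.alert(name)                # fires once per outage


# The "configuration manager" here is just a list collecting notifications.
notified = []
tracker = FailureTracker(threshold=3, alert=notified.append)
```

The configuration manager's decision logic (fail over, restart the service,
and so on) would live behind the alert callback.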

I found out after sending out this suggestion that the author of mon is on this
list.  I'm not 100% sure that it is easy for the heartbeat monitor (or log
monitor) to invoke "mon" directly.  Jim Trocki:  If you're reading this mail,
could you please comment on this aspect?

	-- Alan Robertson
	   alanr at bell-labs.com


