cluster layering

Peter J. Braam braam@cs.cmu.edu
Wed, 24 Mar 1999 11:08:18 -0500


Stephen,

Can we talk a bit more about the layering?  I'm thinking about a lock
manager and a caching protocol for SAN file systems, a la VMS with some
changes.  Clearly the DLM will sit above the connection manager.  Both the
DLM and the connection manager will sit on top of a communications layer
(note that UDP could be a substrate exporting the communications layer
interface, but should not be identified with it).  So I suppose that while a
large part of the connection manager can work in user land, a component will
live in the kernel.

I'd like to understand the following example in some more detail.  Suppose
our cluster has three hosts as members, A, B and C, connected by, for
example, Memory Channel.  Suppose that A and B share a disk over SCSI, and
that host C is working with files on the shared disk, doing its
communication through A.  So A is mastering the locks on some part of the
file system on the shared disk, and C holds such a lock.

Now a state change happens: A dies.  We would like for B to take over from
A, so that C can continue using the disk.  From Coda I know that C can find
out about A's disappearance in multiple ways:

1.  the membership component of the connection manager notices it first

2.  an error is returned by the lower layers of the file & I/O system on C
when doing I/O from C through A to the disk

Let's look at case 2 (I believe that 1 is slightly simpler).  C's I/O
subsystem will do some retries before it decides that something pretty bad
has happened.  I'd like to raise the following two questions for discussion:

A. How do you envision that the connection manager gains control when the
retries have failed a few times?

B. How can the connection manager restart the operation initiated by C's I/O
subsystem, in effect replacing A by B, after the transition in the cluster
has completed?

I envision something like the following.  Each resource (think of the disk)
has a name and a storage group associated with it (the storage group would
be {A,B}).  When we get a lock, we also get a preferred server for the
resource.  If I/O fails with ETIMEDOUT, we (i) trap the error, (ii) detect
that it is a cluster resource, (iii) ask the connection manager to give us a
new preferred server, and retry.
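
A minimal sketch of that trap-and-retry idea, written in user-level Python
purely for illustration (the real thing would live in or near the kernel).
All names here (ClusterResource, ConnectionManager, new_preferred_server,
do_io) are hypothetical, and the connection manager is reduced to a simple
list of live hosts:

```python
import errno
from dataclasses import dataclass


@dataclass
class ClusterResource:
    """A named resource (e.g. the shared disk) and the hosts that can serve it."""
    name: str
    storage_group: list      # e.g. ["A", "B"]
    preferred_server: str    # handed out together with the lock


class ConnectionManager:
    """Hypothetical stand-in for the connection manager's membership view."""

    def __init__(self, live_hosts):
        self.live_hosts = set(live_hosts)

    def new_preferred_server(self, resource, failed):
        # Assumption: a server that timed out is treated as dead.
        self.live_hosts.discard(failed)
        for host in resource.storage_group:
            if host in self.live_hosts:
                return host
        raise OSError(errno.EIO, "no live server left in storage group")


def do_io(resource, conn_mgr, io_fn, local_retries=3):
    """Run io_fn(server); on repeated timeouts, fail over and try again."""
    while True:
        server = resource.preferred_server
        for _ in range(local_retries):
            try:
                return io_fn(server)
            except OSError as exc:
                # (i) trap only timeouts; anything else propagates
                if exc.errno != errno.ETIMEDOUT:
                    raise
        # (ii) is implicit: only cluster resources are passed to do_io
        # (iii) ask for a new preferred server, then retry
        resource.preferred_server = conn_mgr.new_preferred_server(
            resource, server)


def fake_io(server):
    """Simulated I/O path: host A has died, every other host works."""
    if server == "A":
        raise OSError(errno.ETIMEDOUT, "A is gone")
    return "written via " + server


disk = ClusterResource("shared-disk", ["A", "B"], "A")
cm = ConnectionManager(["A", "B", "C"])
result = do_io(disk, cm, fake_io)
```

Here the write that first went to A ends up being served by B, and the
resource remembers B as its preferred server afterwards.  The sketch leaves
out the hard part: the lock database has to be reconfigured before the
retry is safe.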

Where do we trap the error?  In the buffer cache which fails during
flushing? It probably cannot be done in the file system above it, since that
merely writes to the buffer cache.   Also, it seems like the context in
which this happens is possibly not the context of (e.g.) the writer in the
file system, but instead the context of another process which needs memory.
So what is the layer here?  It looks like the communications layer or the
lock manager exports state (namely the preferred servers) to the buffer
cache.

We could also make a clearer separation, and build a disk "class driver".
The file system talks with that device and the disk class driver is in turn
a customer of the buffer cache.  Is this perhaps preferable in a future
with NAS etc.?

Also note that we are asking for a lot of action while flushing buffers -
namely to reconfigure the cluster and lock database and then try again.  In
particular, we need to have sufficient memory to spare to run your user
level programs to reconfigure stuff.

Just some thoughts.  Are yours going in the same direction?

- Peter -


----- Original Message -----
From: Stephen C. Tweedie <sct@redhat.com>
To: <alanr@bell-labs.com>
Cc: Tom Vogt <tv@wlwonline.de>; Linux-HA mailing list <linux-ha@muc.de>;
Stephen Tweedie <sct@redhat.com>
Sent: Wednesday, March 24, 1999 9:07 AM
Subject: Re: udp broadcast


> Hi,
>
> On Tue, 23 Mar 1999 07:11:15 -0700, alanr@bell-labs.com said:
>
> >> em... can anyone tell me how I listen on the broadcast address
> >> without a need for root privileges? is that possible? if not, what's
> >> the recommended solution?
>
> > I *think* that's required.
>
> No, the udp side of my cluster-comms code already does neighbourhood
> discovery automatically using broadcast, all from an unprivileged
> daemon.  It just sets the SO_BROADCAST socket option for sending, and
> binds to the local host adapter's IP address for receiving.
>
> > However, you should note that the HA subsystem needs lots of privileges
> > anyway because it has to do things only trusted users can do (like
> > change IP configurations, reboot machines, mount filesystems, etc.)
>
> The HA subsystem needs to be layered very carefully.  The layer which
> keeps track of the cluster state needs no privileges, but the layer
> which runs user-level startup/failover scripts obviously needs to be
> able to run as the appropriate user for each service.
>
> --Stephen
>
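
The unprivileged socket setup described in the quoted message (SO_BROADCAST
for sending, a bind to the local adapter's address for receiving) could be
sketched roughly as follows.  This is an illustration only, not Stephen's
cluster-comms code; the function names are made up, and the port must be an
unprivileged one (1024 or above, or 0 to let the kernel choose):

```python
import socket


def make_broadcast_sender():
    """UDP socket that is allowed to send to a broadcast address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return s


def make_receiver(local_ip, port):
    """UDP socket bound to the local adapter's IP address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((local_ip, port))
    return s
```

Neither call needs root as long as the port is unprivileged.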