drbd question

Alan Robertson alanr@suse.com
Thu, 06 Apr 2000 01:34:49 -0600


"Stephen C. Tweedie" wrote:
> 
> Hi,
> 
> On Tue, Apr 04, 2000 at 09:12:43PM -0600, Alan Robertson wrote:
> >
> > It is reliable in that it *won't* provide bad data.  In the proposal I
> > described before, the machine would not come up automatically in the
> > circumstances I believe you have described.
> 
> What prevents it?
> 
> >  Although I do confess to
> > not being quite sure exactly what you were describing.
> 
> 5 machines: A,B,C,D,E.  A drbd device replicated on A, B, C, and D.
> The cluster partitions into (A,B,E) and (C,D).  The first partition
> has quorum.  We then take another fault and repartition into
> (A,B) and (C,D,E).  The second partition now has quorum, and has
> two copies of the drbd data which (as far as it is concerned) are
> still recent, because neither C nor D has seen the new data on A or
> B.  However, that new quorate partition has stale data.  Bad news.

Good.  This helps.

When you partition into (A,B,E) and (C,D), then A and B have the "good"
data and quorum, and the (C, D) drbd mirrors are shut down due to lack
of quorum.  They no longer have continuous memory.  So far, everything
is OK.  Next, when you partition into (A,B) and (C,D,E), then the mirror
on (A,B) is shut down due to lack of quorum, and *they* no longer have
continuous memory.

When the DRBD replicators on C and D restart after gaining quorum in the
cluster transition, they go into sit_and_cry() mode because neither can
contact a node which has continuous memory.  They have no one with valid
data to slave from when they restart.
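
To make the walk-through above concrete, here is a rough Python sketch of
the continuous memory rule applied to the two partitionings.  Everything
in it (the Replica class, has_quorum(), restart_state(), the state
names) is made up for illustration and is not drbd code; it also assumes
quorum means a strict majority of the five voting nodes.

```python
# Hypothetical model of the "continuous memory" rule; all names are illustrative.

def has_quorum(partition, cluster_size):
    """A partition is quorate when it holds a strict majority of voting nodes."""
    return len(partition) > cluster_size // 2


class Replica:
    def __init__(self, name):
        self.name = name
        self.continuous = True   # has held quorum without interruption so far

    def on_transition(self, partition, cluster_size):
        # Losing quorum shuts the mirror down and forfeits continuous memory.
        if not has_quorum(partition, cluster_size):
            self.continuous = False


def apply_partitioning(partitions, replicas, cluster_size):
    """Tell every replica which partition it landed in."""
    for partition in partitions:
        for name in partition:
            if name in replicas:
                replicas[name].on_transition(partition, cluster_size)


def restart_state(replica, partition, replicas):
    """State a replica enters when it restarts inside a quorate partition."""
    if replica.continuous:
        return "serving"
    peers = [replicas[n] for n in partition if n in replicas and n != replica.name]
    if any(p.continuous for p in peers):
        return "resync_from_peer"
    return "sit_and_cry"   # nobody reachable has valid data; wait for a human


replicas = {n: Replica(n) for n in "ABCD"}   # E votes but holds no drbd copy
apply_partitioning([{"A", "B", "E"}, {"C", "D"}], replicas, 5)
apply_partitioning([{"A", "B"}, {"C", "D", "E"}], replicas, 5)
states = {n: restart_state(replicas[n], {"C", "D", "E"}, replicas) for n in "CD"}
```

After the second partitioning no replica has continuous memory left, so
both C and D end up in the sit_and_cry state even though (C,D,E) is
quorate.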
 
> If you keep the updated sequence number for the device replicated
> on all voting nodes, then in this scenario, the (A,B,E) partition
> will record the new sequence on all nodes, including E; and in the
> second half of the situation, the (C,D,E) partition knows that
> neither C nor D are uptodate, because they can see the incremented
> sequence number held by E.
> 
> > Hopefully this (CRACC) is a rare occurrence :-)  A complete shutdown for
> > some administrative reason is not too surprising, but hopefully a
> > complete and catastrophic failure resulting in no partition having
> > quorum happens only rarely.
>
> Yes, but the scenario above, in which we have a partition and where
> one node migrates from one partition to another taking quorum with
> it, is not at all uncommon if you have dodgy ethernet cabling or
> bridging.  This scenario is just as bad as the complete cluster
> reboot case if you don't allow the moving node to hold a sufficient
> record of the cluster state of the last quorate partition it was a
> member of.

Up to this point, it appears that this case is handled identically by
both methods.  In both methods, access to data is denied - as it should
be. 

What the sequence number method does is allow the data on (A,B) to be
automatically declared the newest when A and B rejoin a quorate
partition at some point in the future.  The continuous memory method
needs human intervention to decide which copy to trust.
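
As a rough illustration of the sequence number method, the Python sketch
below replays the same scenario.  The dictionaries and the up_to_date()
helper are hypothetical names of my own, and it assumes a quorate
partition bumps one integer sequence number on every voting member when
it accepts new data.

```python
# Hypothetical model of the sequence-number scheme; all names are illustrative.

def up_to_date(partition, recorded_seq, data_seq):
    """Return the data holders in a partition whose copy matches the newest
    sequence number recorded by any voting node in that partition."""
    newest = max(recorded_seq[n] for n in partition)
    return sorted(n for n in partition if data_seq.get(n) == newest)


recorded_seq = {n: 1 for n in "ABCDE"}   # kept by every voting node, E included
data_seq = {n: 1 for n in "ABCD"}        # generation of each drbd copy

# (A,B,E) is the quorate partition and accepts new data:
# bump the sequence number on all of its members.
for n in {"A", "B", "E"}:
    recorded_seq[n] = 2
    if n in data_seq:
        data_seq[n] = 2

# Second fault: (C,D,E) is quorate, but E remembers sequence 2.
stale_view = up_to_date({"C", "D", "E"}, recorded_seq, data_seq)

# Later the partition heals and A and B are part of the quorum again.
healed_view = up_to_date({"A", "B", "C", "D", "E"}, recorded_seq, data_seq)
```

Here stale_view is empty, so (C,D,E) refuses to serve data, while
healed_view names A and B as the copies to resynchronize from, with no
human involved.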

Of course, if you add nodes to or remove nodes from the cluster, then the
"memory" of the last, best sequence number becomes tricky to handle,
because you want to be sure that enough machines to have constituted a
majority under the old cluster size all agree that this is the best
data.  If you have a flaky piece of hardware, one of
the more likely things to do is to reconfigure the cluster to take it
out, so that you can keep quorum, and keep the cluster running...  This
complicates the voting.  Perhaps human intervention is required for this
case?
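
A small Python sketch of why reconfiguration complicates the vote.  The
memberships and the is_majority() helper are invented for the example;
the point is only that agreement which looks like a majority under the
new cluster size may not have been one under the old size.

```python
# Hypothetical example; the memberships and helper name are illustrative.

def is_majority(nodes, membership):
    """True when the given nodes form a strict majority of the membership."""
    return len(set(nodes) & set(membership)) > len(membership) // 2


old_membership = {"A", "B", "C", "D", "E"}
new_membership = {"A", "B", "C"}   # two flaky nodes were configured out
agreeing = {"A", "B"}              # nodes that agree on the best sequence number

ok_under_new = is_majority(agreeing, new_membership)   # 2 of 3
ok_under_old = is_majority(agreeing, old_membership)   # 2 of 5
```

With these made-up memberships, A and B are a majority of the shrunken
cluster but not of the original five, so their agreement alone does not
prove that no newer data was written under the old configuration.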

The nice thing about the continuous memory method is that it's simple,
and the cluster manager doesn't have to know or do anything special for
a drbd resource, or know anything about the drbd topology.  In this
sense, it's a cleaner design.  Too bad it requires human intervention in
a few more (hopefully rare) cases.

	-- Alan Robertson
	   alanr@suse.com