Stephen C. Tweedie
Thu, 6 Apr 2000 11:11:31 +0100
On Thu, Apr 06, 2000 at 01:34:49AM -0600, Alan Robertson wrote:
> When you partition into (A,B,E) and (C,D), then A and B have the "good"
> data and quorum, and the (C, D) drbd mirrors are shut down due to lack
> of quorum. They no longer have continuous memory.
Ah, OK. I follow you now --- I was just missing the exact definition
of "continuous memory".
> When the DRBD replicator on C and D restart after gaining quorum in the
> cluster transition, they go into sit_and_cry() mode because neither can
> contact a node which has continuous memory. They have no one with valid
> data to slave from when they restart.
That's a big problem. Complete cluster reboots _do_ happen. Power
supply problems are the main reason (somebody mistook the big red
switch by the door for a light switch, for example). You really
_have_ to deal with that properly. Unfortunately, this is a situation
where multiple faults are depressingly common: if you have a number of
machines that are running 24*365, then the chances are quite high that
on a power cycle, one or more will fail to start up (disks are
notorious for failing in this manner). That just makes things harder.
> What the sequence number method does is allow the data on (A,B) to be
> automatically declared as the newest when they become part of the quorum
> at some point in the future. The continuous memory method needs human
> intervention to know what to do.
Yes. Fortunately, it is possible to hide this kind of detail behind
a decent API: if you can say "I have a resource on nodes X,Y,Z, please
tell me if I have quorum", then the mechanism for determining quorum
can be hidden behind a clean abstraction layer and can be replaced in
the future if necessary.
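
As a rough illustration of that abstraction (a sketch only, not the
actual design): callers ask one question, and the mechanism behind it is
swappable. The names QuorumPolicy, MajorityVotes and have_quorum are
hypothetical, and the strict-majority-of-votes rule is an assumption
chosen just to make the example concrete.

```python
from abc import ABC, abstractmethod


class QuorumPolicy(ABC):
    """Hypothetical interface: callers only ever ask this one question."""

    @abstractmethod
    def have_quorum(self, nodes, reachable):
        """Return True if the reachable subset of `nodes` holds quorum."""


class MajorityVotes(QuorumPolicy):
    """One possible mechanism (assumed here): strict majority of votes."""

    def __init__(self, votes):
        self.votes = dict(votes)  # node name -> number of votes

    def have_quorum(self, nodes, reachable):
        nodes = set(nodes)
        total = sum(self.votes.get(n, 0) for n in nodes)
        present = sum(self.votes.get(n, 0) for n in nodes & set(reachable))
        return 2 * present > total


# The partition from earlier in the thread: (A,B,E) versus (C,D).
policy = MajorityVotes({n: 1 for n in "ABCDE"})
side_abe = policy.have_quorum("ABCDE", {"A", "B", "E"})  # 3 of 5 votes
side_cd = policy.have_quorum("ABCDE", {"C", "D"})        # 2 of 5 votes
```

Replacing MajorityVotes with some other QuorumPolicy would not change
any caller, which is the point of hiding the mechanism.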
> Of course, if you add or remove nodes to the cluster, then the "memory"
> of the last, best sequence number becomes tricky to handle, because you
> want to make sure that you know that a number of machines which would
> have constituted a majority under the old cluster size all agree that
> this is the best data.
My quorum design deals with changing the cluster size very cleanly.
That was an important part of the design (I want to be able to add
nodes to the quorum database as automatically as possible as the admin
adds machines to the cluster).
Basically, the quorum database cannot be modified unless you have
quorum. That one provision makes the implementation enormously
simpler. The entire quorum database is replicated on all voting nodes.
If you want to expand the cluster, you first of all need to add the
new node to the quorum database. Only once that operation is complete
can the node join in quorum.
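
A minimal sketch of that rule, under stated assumptions: QuorumDB,
NoQuorum and add_node are hypothetical names, quorum is taken to be a
strict majority of votes, and replication of the database to the voting
nodes is left out entirely.

```python
class NoQuorum(Exception):
    """Raised when a change is attempted without quorum."""


class QuorumDB:
    """Hypothetical in-memory stand-in for the replicated quorum database."""

    def __init__(self, votes):
        self.votes = dict(votes)  # node name -> number of votes

    def has_quorum(self, reachable):
        # Only nodes already in the database can contribute votes.
        total = sum(self.votes.values())
        present = sum(v for n, v in self.votes.items() if n in reachable)
        return 2 * present > total

    def add_node(self, name, votes, reachable):
        # The database is read-only unless the caller's side has quorum.
        if not self.has_quorum(reachable):
            raise NoQuorum("cannot modify the quorum database without quorum")
        self.votes[name] = votes


db = QuorumDB({"A": 1, "B": 1, "C": 1})

# The majority side (A,B) may register the new node D.
db.add_node("D", 1, reachable={"A", "B"})

# A minority side (C alone) may not change anything.
try:
    db.add_node("E", 1, reachable={"C"})
    minority_rejected = False
except NoQuorum:
    minority_rejected = True
```

Because has_quorum only counts nodes already present in the database, a
new node contributes nothing until add_node has completed.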
> If you have a flaky piece of hardware, one of
> the more likely things to do is to reconfigure the cluster to take it
> out, so that you can keep quorum, and keep the cluster running... This
> complicates the voting.
A node which is in the cluster can adjust the number of votes it has,
as long as we have quorum. In particular, it can adjust its votes down
to zero before leaving the cluster. (There's a subtle difference
between having a contribution of zero votes and having no vote at all:
all contributing nodes are expected to replicate the quorum database,
even those whose votes are zero, so we don't have to adjust the
replicator list for the quorum database when we abdicate a vote on
a node which is about to leave the cluster for maintenance.)
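
To make the zero-votes case concrete, here is a small sketch. The
helper names have_quorum and set_votes are hypothetical, and the
strict-majority rule is again an assumption rather than the actual
implementation.

```python
def have_quorum(votes, reachable):
    """Strict majority of all configured votes (assumed rule)."""
    total = sum(votes.values())
    present = sum(v for node, v in votes.items() if node in reachable)
    return 2 * present > total


def set_votes(votes, node, new_votes, reachable):
    """Return an updated vote table; only permitted while we hold quorum."""
    if not have_quorum(votes, reachable):
        raise RuntimeError("no quorum: vote counts are read-only")
    updated = dict(votes)
    updated[node] = new_votes
    return updated


votes = {"A": 1, "B": 1, "C": 1}

# C abdicates its vote before going down for maintenance.
after = set_votes(votes, "C", 0, reachable={"A", "B", "C"})

# C still has an entry, so the replicator list is unchanged.
replicators = sorted(after)

ab_quorum = have_quorum(after, {"A", "B"})  # 2 of 2 votes present
a_alone = have_quorum(after, {"A"})         # 1 of 2 votes present
a_and_c = have_quorum(after, {"A", "C"})    # C contributes zero votes
```

Note that after the abdication both A and B are needed for quorum,
whereas before it any two of the three nodes would have done.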
> Perhaps human intervention is required for this
Human intervention is required to know when to do this. A normal
reboot of a node should not try to return votes if we expect the
node to come back immediately, as the more votes we lose in this
manner, the greater the chance that a fault happens which loses
quorum (and remember, a node which has abdicated its vote cannot
restore quorum when it returns to the cluster). However, if a
node is expected to be offline for some time, then the admin may
well want its vote to be omitted from quorum calculations. That
decision should be up to the admin. The actual mechanics of
doing quorum can be done automatically in either case.
> The nice thing about the continuous memory method is that it's simple,
> and the cluster manager doesn't have to know or do anything special for
> a drbd resource, or know anything about the drbd topology.
Yes, but it has the disadvantage that it isn't highly available. :-)
> In this
> sense, it's a cleaner design. Too bad it requires human intervention in
> a few more (hopefully rare) cases.
Cluster reboots are not rare. Think of the simple cases: a couple of
machines beside each other on a desk sharing the same power cord, and
a cleaner coming into the room and needing a power socket for the