[Linux-ha-dev] Pseudo-quorum and nice failback: Was: The nice
nice_failback :)
Alan Robertson
alanr@suse.com
Tue, 18 Apr 2000 07:37:13 -0600
Horms wrote:
>
> On Mon, Apr 17, 2000 at 12:10:10PM -0600, Alan Robertson wrote:
> > it seems to me that the nice_failback still has to
> > worry about whether the other side has any of the resources (which may be
> > where we started this conversation). You could always add a message type
> > and ask... The only difference between nice_failback and normal is if
> > you ask whether the other side has the resources, or if you just take
> > them over anyway.
>
> I don't follow this, if you take over a resource (that has no master)
> without checking to see if it is already owned by a node then aren't you
> going to end up with two nodes owning the resource?
>
> > If you don't want to add any new message types, you could always
> > implement nice failback as a case where the side coming back up (the
> > "natural" master) gets a "no" response when it asks for the resources
> > from the other side. You could then even make nice_failback a special
> > resource, so that the nice-failback property is then a property of the
> > group, not the whole configuration. When asked to give up any group with
> > the nice-failback resource in it, the far end machine always says "no".
>
> I'm completely lost now. If it asks for the resource, don't we need
> a new message type to do the asking?
There is already a message type which says "give me the resources". It
just always assumes that it will get an OK from the other end. If you
change the semantics of the message slightly, then it would be
permissible to respond with "no-way-jose", in which case it should just
say "oh, well", and go on and *not* take the resources over.
Current scenario:
"natural master" sends "ip-request", and either gets back an
"ip-request-resp" that ALWAYS has ok=OK, or gets nothing back (timeout).
New scenario:
"natural master" sends "ip-request", and gets back nothing (timeout)
or an "ip-request-resp" message with ok=OK or ok=NO.
If it gets a timeout or ok=OK, then it initiates takeover.
If it gets ok=NO, then it doesn't take over the resource.
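To make the new scenario concrete, here is a rough sketch of both ends in Python. It is not the heartbeat code: the function names, the use of a dict to stand in for a message, and the idea of checking a group for a resource literally named "nice_failback" are assumptions made for illustration. Only the "ip-request-resp" message and its ok=OK / ok=NO values come from the description above.

```python
from typing import Iterable, Optional


def answer_ip_request(group_resources: Iterable[str]) -> dict:
    """Far-end side (hypothetical): build the reply to an "ip-request".

    Assumes nice failback is modelled as a special resource named
    "nice_failback" inside the resource group being asked for.
    """
    refuse = "nice_failback" in set(group_resources)
    return {"type": "ip-request-resp", "ok": "NO" if refuse else "OK"}


def should_take_over(reply: Optional[dict]) -> bool:
    """Natural-master side (hypothetical): decide whether to take over.

    reply is None when no "ip-request-resp" arrived before the timeout.
    """
    if reply is None:
        # Timeout: assume the other side is dead and take over,
        # exactly as the current scenario does.
        return True
    ok = reply.get("ok")
    if ok == "OK":
        return True
    if ok == "NO":
        # Refusal: say "oh, well" and leave the resources where they are.
        return False
    raise ValueError(f"unexpected ok value: {ok!r}")
```

With this shape, a group that has no nice_failback resource behaves just as it does today, so the refusal path is the only new behaviour.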
The current protocol has the possibility of refusal implicitly designed
in, but the code implementing it doesn't know what to do if it gets a
refusal.
There is currently an issue if giving up a resource takes "too long".
The requester then times out and takes over the resource, assuming the
other side is dead. This change neither fixes that nor makes it worse.
-- Alan Robertson
alanr@suse.com