[Linux-HA] Standby Node Refuses to Take Over
lars.ellenberg at linbit.com
Wed Oct 6 06:22:01 MDT 2010
On Tue, Oct 05, 2010 at 11:47:37AM +0100, Steve Davies wrote:
> Yes, the haresources model is simple. I have encountered the issue
> above, and other similar issues.
> Right now I have 3 situations that I have discovered, and plan to work
> around them (comments from more experienced HA users are welcome):
> Fail 1) A node needs to go active, but this fails. This causes an
> attempt to go back to slave. RM does not record that it is not-active
> unless it can speak to the other node.
> Solution 1) Really? I just stopped everything... Of course I should
> no longer be active! I plan to have the RM record that I am inactive
> even after the failed ha_standby request, or perhaps beforehand (I'll
> add a timeout, I guess). This will have knock-on effects, which will
> need chasing down :)
> Fail 2) Split-brain. This restarts the 'heartbeat' daemon on both
> nodes, and will kill a perfectly working node.
> Solution 2) The restart is an understandable solution, but sometimes
> it can be more clever. I hope to add a F_SPLITBRAIN message that
> includes a SETWEIGHT. This will then run an rc script on each node,
> and allow the 2 nodes to fight it out. If that fails, then we'll do
> the restart. The script in its simplest form can of course just do a
> heartbeat daemon restart.
> Fail 3) If 2 nodes get split, but also get out-of-sync, the split
> brain is not recognised, and when reconnected, an "Active" message is
> exchanged/logged, but ignored.
> Solution 3) The "Active" message already causes a 'status' script to
> run. I plan to extend this script to raise a Splitbrain alert when
> appropriate, triggering the same resolution as in 2) above.
> Note, all of the above are theoretical solutions, and I do not know
> when I might get round to improving them. I just thought it might be
> useful to publish my findings so far, given that they seem to relate
> to this thread.
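To make the "fight it out" idea in Solution 2 concrete, here is a minimal sketch of weight-based arbitration. It is only an illustration: the payload layout of F_SPLITBRAIN, the meaning of the SETWEIGHT value, and the names `SplitBrainMsg`, `Action` and `resolve_splitbrain` are all assumptions, not anything that exists in heartbeat.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STAY_ACTIVE = "stay_active"
    GO_STANDBY = "go_standby"
    RESTART = "restart"  # fallback: restart the heartbeat daemon


@dataclass(frozen=True)
class SplitBrainMsg:
    # Hypothetical payload of the proposed F_SPLITBRAIN message.
    node: str
    weight: int  # value assumed to be carried by SETWEIGHT


def resolve_splitbrain(local: SplitBrainMsg, peer: SplitBrainMsg) -> Action:
    """Decide what the local node does once a split brain is detected."""
    if local.weight > peer.weight:
        return Action.STAY_ACTIVE
    if local.weight < peer.weight:
        return Action.GO_STANDBY
    # Equal weights: the fight has no winner, so fall back to the restart.
    return Action.RESTART
```

Both nodes run the same comparison on the same pair of weights, so they reach complementary decisions without further negotiation. Only a tie ends in the restart.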
If you can create test cases for any of these,
maybe even in a form that the "CTS" understands,
that would probably help a lot.
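As a rough idea of what a test case for "Fail 1" could pin down, here is a self-contained toy model. It is not CTS code and does not use any heartbeat API. The class `ToyResourceManager`, its fields, and the `force_inactive` switch are hypothetical names introduced only to state the reported behaviour and the proposed fix as assertions.

```python
class ToyResourceManager:
    """Toy stand-in for the haresources-style RM; not heartbeat's real code."""

    def __init__(self, start_ok, peer_reachable, force_inactive=False):
        self.start_ok = start_ok              # do local resources start?
        self.peer_reachable = peer_reachable  # can we talk to the other node?
        self.force_inactive = force_inactive  # proposed Solution 1 switch
        self.active = False

    def go_active(self):
        # Assumption: the RM considers itself active once takeover begins.
        self.active = True
        if self.start_ok:
            return True
        # Takeover failed: resources are stopped, so try to go back to standby.
        self._request_standby()
        return False

    def _request_standby(self):
        # Reported behaviour: the state only changes if the peer answers.
        if self.peer_reachable or self.force_inactive:
            self.active = False


def test_fail1_is_reproduced():
    rm = ToyResourceManager(start_ok=False, peer_reachable=False)
    assert rm.go_active() is False
    assert rm.active is True   # stale: nothing runs, yet still "active"


def test_solution1_records_inactive():
    rm = ToyResourceManager(start_ok=False, peer_reachable=False,
                            force_inactive=True)
    assert rm.go_active() is False
    assert rm.active is False  # state matches reality without the peer


if __name__ == "__main__":
    test_fail1_is_reproduced()
    test_solution1_records_inactive()
    print("ok")
```

A real test would drive two heartbeat nodes and cut the link between them. The toy version only fixes the expected outcome.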
> The "old" resource manager is beautifully lightweight, and does not
> /require/ hundreds of megabytes of Python and XML libraries to
> operate. I am working on keeping it lightweight so it can be used in
> small systems. Wish me luck :)
I certainly do.
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.