[Linux-HA] Standby Node Refuses to Take Over

Lars Ellenberg lars.ellenberg at linbit.com
Fri Oct 1 09:59:18 MDT 2010


On Mon, Sep 27, 2010 at 09:43:37AM -0700, Robinson, Eric wrote:
> The primary node hung and the applications became unresponsive, but DRBD
> status was good and up to date on both nodes, so I did a hb_takeover on
> the standby node. Following is all that appeared in the ha-debug.log on
> the standby. (I could not see the log on the primary because I could not
> login to it.)
> 
> heartbeat[15853]: 2010/09/27_05:47:12 debug: }/*G_remove_client;*/
> heartbeat[15853]: 2010/09/27_05:50:29 debug: StartNextRemoteRscReq() -
> calling hook
> heartbeat[15853]: 2010/09/27_05:50:29 debug: notify_world: invoking
> harc: OLD status: active
> heartbeat[15853]: 2010/09/27_05:50:29 debug: Process [hb_takeover]
> started pid 23304
> heartbeat[15853]: 2010/09/27_05:50:29 debug: Starting notify process
> [hb_takeover]
> heartbeat[23304]: 2010/09/27_05:50:29 debug: notify_world: setting
> SIGCHLD Handler to SIG_DFL
> heartbeat[23304]: 2010/09/27_05:50:29 debug: notify_world: Running harc
> hb_takeover
> harc[23304]:    2010/09/27_05:50:29 info: Running
> /etc/ha.d/rc.d/hb_takeover hb_takeover
> heartbeat[15853]: 2010/09/27_05:50:29 info: Managed hb_takeover process
> 23304 exited with return code 0.
> heartbeat[15853]: 2010/09/27_05:50:29 debug: RscMgmtProc 'hb_takeover'
> exited code 0

This is haresources mode; its resource management model is simplistic.

It thought it had taken over successfully, and marked itself as holding
"all resources".  hb_takeover finished within the same second it started
(05:50:29 in your log), so possibly it thought it held all resources
already, for whatever reason.

Hm. Maybe it is even worse: hb_takeover is actually implemented as
sending a "please shut down your resources" message to the other node,
then waiting for its "thanks, I went standby on my resources, please
proceed" answer. So there is no "forceful takeover" here, only
cooperative takeover, and if the other node does not cooperate, then
nothing moves.
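
To make the cooperative handshake concrete, here is a toy Python model
of it. This is only a sketch of the behaviour described above, not
heartbeat's actual code: the Node class, its methods and the resource
names are all made up for illustration.

```python
# Toy model of a cooperative takeover; NOT heartbeat's actual code.
# All names here (Node, request_takeover, ...) are hypothetical.

class Node:
    def __init__(self, name, resources=(), responsive=True):
        self.name = name
        self.resources = set(resources)
        self.responsive = responsive

    def handle_release_request(self):
        """The peer asks us to give up our resources.

        Returns the released resources, or None if we never answer
        (for example because we are hung)."""
        if not self.responsive:
            return None
        released, self.resources = self.resources, set()
        return released

    def request_takeover(self, peer):
        """Ask the peer to release, then take only what it handed over."""
        released = peer.handle_release_request()
        if released is None:
            # No forceful path: without the peer's answer nothing moves.
            return False
        self.resources |= released
        return True


# A hung primary never answers, so the standby ends up with nothing.
hung_primary = Node("primary", {"drbd", "ip", "app"}, responsive=False)
standby = Node("standby")
moved = standby.request_takeover(hung_primary)
```

In this model `moved` is False and `standby.resources` stays empty,
because the only way resources change hands is via the peer's answer.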

I'm not sure what happens if A sends that takeover request,
B is too busy to respond, and B then finally dies while A is still
waiting for that standby message. Possibly a "node dead" event is not
deemed good enough while waiting for an "I'm standby now" message?

Probably exactly your situation.

> I went so far as to turn off the primary, but the standby still never
> took over. When I brought the power on the primary back up, it came up
> secondary and I had to do a hb_takeover on it, but after that all was
> well.

The rebooted node joined the cluster, and the still running node told it
that it (the running node) held all resources, so both thought there was
nothing to do. Then you asked the rebooted node to take over, they both
ran their scripts again, and this time actually started something.
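
The same sequence as a toy Python model, under the assumption that a
node's belief about what it holds ("claimed") can differ from what is
really started ("running"). Again this is only a sketch, not
heartbeat's code, and every name in it is hypothetical.

```python
# Toy model of the bookkeeping described above; NOT heartbeat's code.
# "claimed" is what a node believes it holds, "running" is what is
# actually started. All names are hypothetical.

ALL = {"drbd", "ip", "app"}

class Node:
    def __init__(self, name):
        self.name = name
        self.claimed = set()
        self.running = set()

def join(newcomer, survivor):
    """Newcomer joins and starts only what the survivor does not claim."""
    todo = ALL - survivor.claimed
    newcomer.claimed |= todo
    newcomer.running |= todo
    return todo

def takeover(requester, peer):
    """Cooperative takeover: the peer releases, the requester starts all."""
    peer.claimed.clear()
    peer.running.clear()
    requester.claimed = set(ALL)
    requester.running = set(ALL)

survivor = Node("standby")
survivor.claimed = set(ALL)       # marked itself as holding everything,
                                  # but never actually started anything

rebooted = Node("primary")
started_on_join = join(rebooted, survivor)   # nothing left to start
running_before = survivor.running | rebooted.running

takeover(rebooted, survivor)                 # the explicit hb_takeover
running_after = survivor.running | rebooted.running
```

Here `started_on_join` and `running_before` are both empty, and only
after the explicit takeover does `running_after` contain everything.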

Seemingly a limitation of the simplistic haresources model.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.