[Linux-HA] failover trouble with DRBD after Stonith reset of active server
dwdha at drdykstra.us
Mon Jul 18 11:36:18 MDT 2005
On Thu, Jul 14, 2005 at 06:47:36PM -0600, Alan Robertson wrote:
> Dave Dykstra wrote:
> >In the failover my production server experienced this morning, the standby
> >server initially failed to take over because starting DRBD right after it
> >had Stonithed the active server failed. I'm quite sure this is a case
> >of the same problem I had previously reported to the DRBD mailing list but
> >which has so far not received a satisfactory answer. I now think
> >the answer may primarily have to come from heartbeat, not DRBD.
> >The problem comes in when, as far as DRBD is concerned, the other node
> >is active right up until the time the Stonith happens: DRBD refuses to
> >allow the standby node to become active until its own timeout period has
> >elapsed. In the past I had only noticed that when doing a kill -9 of the
> >active heartbeat process on the active server, in order to test Stonith.
> >I wasn't too bothered by not having an answer for that because I reasoned
> >that maybe it wasn't too likely that heartbeat would die before DRBD in
> >practice, but this morning I experienced that kind of situation; DRBD
> >uses the point-to-point link between the two servers, and the standby
> >DRBD noticed no problem until the active server was Stonithed.
> >Heartbeat on the standby server did eventually recover by restarting
> >itself; noticing that its peer still wasn't alive, it attempted to take
> >over again, and this time enough time had elapsed for DRBD to not complain.
> >It sure seems that there should be a more controlled recovery than that
> >though. What would it take for heartbeat to be able to tell DRBD that
> >it knows for sure that the other side is dead so DRBD should go ahead
> >and take over? I had first suggested on the DRBD list that DRBD send a
> >flurry of packets in a short time to try to determine whether or not the
> >other side is still up, but they didn't go for that and I don't blame them.
> >Somehow the information that heartbeat knows, that the other side is
> >really dead, needs to be passed to DRBD.
> >On the other hand, should DRBD always be trusting that heartbeat would
> >never tell it to start unless the other side is really stopped? Maybe all
> >that really needs to change is /etc/ha.d/resource.d/drbddisk, to pass
> >the drbdsetup --do-what-I-say option on the command to become primary.
> >That would send the ball back to DRBD because drbddisk is part of that
> >package. What's the opinion of the linux-ha team?
> >Below are the full log entries for the failed & retried takeover. Note that
> >there's no error message from drbddisk saved in the log, but I'm 98%
> >sure that's what the problem was because that's how DRBD behaves.
> IIRC, if you set your heartbeat deadtime a bit higher than your DRBD dead
> time, I think this problem should solve itself.
No, this is a different situation. Deadtime specifies the time *before*
heartbeat declares the other node dead. In this situation, DRBD is still
communicating right up to the time that the standby heartbeat hits the
active node with stonith, and then heartbeat immediately (the exact same
second according to the logs although stonith takes 2 seconds) tries
to bring up the standby server. DRBD has too little time to realize
the other side is dead and refuses to come up until after its waiting
period has elapsed. I use the default 6 seconds for DRBD's timeout and
the default 10 seconds for heartbeat's deadtime.
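
To make the timing concrete, here is a minimal sketch in Python. It is
only a model of the behaviour described above, not heartbeat or DRBD code.
The function name and the rule it encodes (DRBD allows the takeover only
after a full timeout of silence from the peer) are assumptions made for
illustration; the 6 and 10 second values are the defaults quoted above.

```python
# Illustrative timing model only; this is not heartbeat or DRBD code.
DRBD_TIMEOUT_S = 6.0          # DRBD timeout quoted above
HEARTBEAT_DEADTIME_S = 10.0   # heartbeat deadtime quoted above


def drbd_allows_primary(last_drbd_traffic_s, takeover_s,
                        drbd_timeout_s=DRBD_TIMEOUT_S):
    """Assumed rule: DRBD lets the standby become primary only after it has
    heard nothing from the peer for a full timeout period."""
    return takeover_s - last_drbd_traffic_s >= drbd_timeout_s


# Case 1: the whole node dies at t=0, so heartbeat and DRBD traffic stop
# together. Heartbeat waits out its deadtime before taking over, by which
# time DRBD has already timed out.
whole_node_dies = drbd_allows_primary(0.0, HEARTBEAT_DEADTIME_S)

# Case 2: only heartbeat traffic stops. DRBD keeps talking until the stonith
# reset at t=10, and heartbeat attempts the takeover in that same second.
stonith_s = HEARTBEAT_DEADTIME_S
drbd_alive_until_stonith = drbd_allows_primary(stonith_s, stonith_s)

print(whole_node_dies, drbd_alive_until_stonith)  # True False
```

Case 1 is the situation where raising deadtime above the DRBD timeout
helps. Case 2 is the one I hit, and no deadtime setting changes it.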
Now if there were a parameter for heartbeat to wait *after* it stoniths
the other side to allow time for DRBD to also notice that the other side
is dead, that would be one way to handle it. I hate the extra delay,
however, and would rather have there be a way for heartbeat to tell
DRBD to --do-what-I-say to skip the wait.
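
For the first option, the delay would only need to cover whatever part of
DRBD's timeout has not yet elapsed since the stonith. A rough sketch in
Python follows; `post_stonith_wait` is a hypothetical helper, not an
existing heartbeat parameter, and it assumes that DRBD traffic stopped at
the moment of the stonith reset.

```python
def post_stonith_wait(stonith_s, now_s, drbd_timeout_s=6.0):
    """Seconds still to wait before asking DRBD to become primary.

    Hypothetical helper: assumes DRBD traffic stopped at the stonith reset
    and that DRBD needs one full timeout of silence before it cooperates.
    """
    elapsed = now_s - stonith_s
    # Never return a negative wait once the timeout has already passed.
    return max(0.0, drbd_timeout_s - elapsed)


# Takeover attempted in the same second as the stonith: the full timeout remains.
print(post_stonith_wait(100.0, 100.0))  # 6.0
# Two seconds later (roughly how long the stonith itself takes): 4 seconds remain.
print(post_stonith_wait(100.0, 102.0))  # 4.0
```

With the defaults that is up to 6 seconds of added failover time, which is
the delay I would like to avoid.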