[Linux-HA] Re: deadtime, warntime, and drbd

Jason Joines support at bus.okstate.edu
Tue Mar 8 11:27:55 MST 2005


Lars Marowsky-Bree wrote:

>*snip*
>
>So, you got indeed five problems, in order of severity:
>
>1. You only have ONE heartbeat channel according to these logs. That's
>BROKEN.
>  
>
    I think I do have two channels, if that is what you meant.  Each 
node is supposed to communicate with the other over both eth0 and eth1, 
and with the router over eth0.

########### begin nodea ha.cf ###########
logfacility daemon
node nodea nodeb
keepalive 1
deadtime 64               # 16 at the time
warntime 16               # nonexistent at the time
ucast eth0 172.18.88.93   # nodeb's public interface
ucast eth1 172.17.1.1     # nodeb's private interface
ping 172.18.91.254        # Ping router
auto_failback off
respawn hacluster /usr/lib/heartbeat/ipfail
########### end nodea ha.cf ###########

########### begin nodeb ha.cf ###########
logfacility daemon
node nodea nodeb
keepalive 1
deadtime 64               # 16 at the time
warntime 16               # nonexistent at the time
ucast eth0 172.18.89.67   # nodea's public interface
ucast eth1 172.17.1.2     # nodea's private interface
ping 172.18.91.254        # Ping router
auto_failback off
respawn hacluster /usr/lib/heartbeat/ipfail
########### end nodeb ha.cf ###########
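
    To double-check the channel count, here is a rough Python sketch 
that counts the heartbeat media lines in an ha.cf and compares warntime 
to deadtime.  It is not part of heartbeat; summarize_ha_cf is a made-up 
helper name.  It assumes that only ucast, bcast, mcast and serial lines 
count as heartbeat channels (the ping line is a ping node, not a channel 
to the peer) and that the timers are plain whole seconds with no "ms" 
suffix.

```python
# Hypothetical helper, not part of heartbeat: counts the heartbeat media
# declared in an ha.cf and collects the timing directives.
MEDIA = {"ucast", "bcast", "mcast", "serial"}  # assumed set of media directives
TIMERS = {"keepalive", "warntime", "deadtime"}


def summarize_ha_cf(text):
    """Return (channels, timers) parsed from the text of an ha.cf."""
    channels = []
    timers = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        key, args = fields[0], fields[1:]
        if key in MEDIA:
            channels.append((key, args))
        elif key in TIMERS and args:
            # Assumes plain seconds; a value such as "500ms" would fail here.
            timers[key] = int(args[0])
    return channels, timers


# nodea's ha.cf as posted above.
NODEA = """\
logfacility daemon
node nodea nodeb
keepalive 1
deadtime 64               # 16 at the time
warntime 16               # nonexistent at the time
ucast eth0 172.18.88.93   # nodeb's public interface
ucast eth1 172.17.1.1     # nodeb's private interface
ping 172.18.91.254        # Ping router
auto_failback off
respawn hacluster /usr/lib/heartbeat/ipfail
"""

channels, timers = summarize_ha_cf(NODEA)
print(len(channels), "heartbeat channels:", channels)
print("warntime < deadtime:", timers["warntime"] < timers["deadtime"])
```

    Run against nodea's file it reports the two ucast lines, one per 
interface.  That only shows what is configured, not what was actually 
passing traffic at the time of the logs.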

>2. You don't have STONITH to protect you against split-brain scenarios.
>  
>
    I'll try to remedy this.

>3. Something pretty bad happened to that one channel causing it to be
>completely gone for 7 minutes. You better find out what that was... Did
>someone unplug the cable?
>  
>
    Not that I know of.  It seemed fine during testing before bringing 
this stuff up and has seemed fine ever since.  I've invalidated the drbd 
device on the secondary nodes and let them get a complete resync (320 
GB) from the Primary (nodea:drbd0-->nodeb:drbd0 and 
nodeb:drbd1-->nodea:drbd1) several times without a problem.
    Something must have happened, though.  I just went back through the 
logs on nodeb and found this:
Feb 25 12:37:56 nodeb kernel: NETDEV WATCHDOG: eth1: transmit timed out
Feb 25 12:37:56 nodeb kernel: tg3: eth1: transmit timed out, resetting
Feb 25 12:37:56 nodeb kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Feb 25 12:37:56 nodeb kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Feb 25 12:37:56 nodeb kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Feb 25 12:37:56 nodeb kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Feb 25 12:37:56 nodeb kernel: tg3: eth1: Link is down.
Feb 25 12:37:59 nodeb kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Feb 25 12:37:59 nodeb kernel: tg3: eth1: Flow control is on for TX and on for RX.
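
    To see how long the link was actually down, here is a small Python 
sketch that pairs up the tg3 "Link is down" and "Link is up" lines and 
prints the gap between them.  It is only an illustration: link_outages 
is a made-up helper, the regex is written against the tg3 messages 
quoted above and nothing else, the year is assumed to be 2005 because 
syslog timestamps omit it, and the month names assume an English locale.

```python
import re
from datetime import datetime

# Matches only the tg3 link-state lines in the form quoted above.
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: tg3: "
    r"(?P<dev>\w+): Link is (?P<state>down|up)"
)


def link_outages(lines, year=2005):
    """Return a list of (device, seconds_down) for each down/up pair."""
    down_at = {}
    outages = []
    for line in lines:
        m = LINK_RE.match(line)
        if not m:
            continue
        # Syslog timestamps carry no year, so one is supplied (assumption).
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        dev = m["dev"]
        if m["state"] == "down":
            down_at[dev] = ts
        elif dev in down_at:
            started = down_at.pop(dev)
            outages.append((dev, (ts - started).total_seconds()))
    return outages


LOG = [
    "Feb 25 12:37:56 nodeb kernel: NETDEV WATCHDOG: eth1: transmit timed out",
    "Feb 25 12:37:56 nodeb kernel: tg3: eth1: transmit timed out, resetting",
    "Feb 25 12:37:56 nodeb kernel: tg3: eth1: Link is down.",
    "Feb 25 12:37:59 nodeb kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.",
]

outages = link_outages(LOG)
for dev, seconds in outages:
    print(f"{dev} was down for {seconds:.0f} s")
```

    On the excerpt above this gives a 3 second gap on eth1, which is the 
only link reset I have found in nodeb's logs so far.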

>4. drbd got stuck somewhere. That's bad, and shouldn't have happened,
>and the logs don't tell why. The good news is that heartbeat recovered
>fine ;-) I'd report that one to the drbd list.
>  
>
    Will do.

>5. udev/hotplug/subfs etc are acting up and trying to mount stuff they
>shouldn't touch. Bad, no cookies for them, that spews the logs quite
>awkwardly. Maybe that's what confused drbd, too. I wonder why there's no
>such message in the logs on nodea? Is the configuration different
>somewhere?
>  
>
    I may just get rid of subfs.

>
>Sincerely,
>    Lars Marowsky-Brée <lmb at suse.de>
>  
>

    Thanks for all the analysis.  It helps a lot!

Jason
===========


