[Linux-HA] Re: deadtime, warntime, and drbd
Jason Joines
support at bus.okstate.edu
Tue Mar 8 11:27:55 MST 2005
Lars Marowsky-Bree wrote:
>*snip*
>
>So, you got indeed five problems, in order of severity:
>
>1. You only have ONE heartbeat channel according to these logs. That's
>BROKEN.
>
>
I think I have two channels if this is what you meant. Each node is
supposed to communicate with the other over both eth0 and eth1 and with
the router over eth0.
########### begin nodea ha.cf ###########
logfacility daemon
node nodea nodeb
keepalive 1
deadtime 64 # 16 at the time
warntime 16 # nonexistent at the time
ucast eth0 172.18.88.93 # nodeb's public interface
ucast eth1 172.17.1.1 # nodeb's private interface
ping 172.18.91.254 # Ping router
auto_failback off
respawn hacluster /usr/lib/heartbeat/ipfail
########### end nodea ha.cf ###########
########### begin nodeb ha.cf ###########
logfacility daemon
node nodea nodeb
keepalive 1
deadtime 64 # 16 at the time
warntime 16 # nonexistent at the time
ucast eth0 172.18.89.67 # nodea's public interface
ucast eth1 172.17.1.2 # nodea's private interface
ping 172.18.91.254 # Ping router
auto_failback off
respawn hacluster /usr/lib/heartbeat/ipfail
########### end nodeb ha.cf ###########
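For anyone wanting to double-check how many heartbeat channels a given ha.cf declares, counting its ucast lines is a quick sanity test. The Python below is only a minimal sketch under stated assumptions: the function name summarize_ha_cf and the NODEA_HA_CF sample string are illustrative (not part of heartbeat), it only looks at the ucast, keepalive, warntime and deadtime directives shown above, and it assumes the timing values are plain integer seconds.

```python
def summarize_ha_cf(text):
    """Return (channels, timings) found in the text of an ha.cf file.

    Hypothetical helper for illustration only; not a heartbeat tool.
    """
    channels = []   # one (interface, peer address) tuple per ucast line
    timings = {}    # keepalive / warntime / deadtime, assumed whole seconds
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and banner lines
        if not line:
            continue
        fields = line.split()
        if fields[0] == "ucast" and len(fields) >= 3:
            channels.append((fields[1], fields[2]))
        elif fields[0] in ("keepalive", "warntime", "deadtime") and len(fields) >= 2:
            timings[fields[0]] = int(fields[1])
    return channels, timings


# Sample input copied from the nodea ha.cf quoted above.
NODEA_HA_CF = """\
logfacility daemon
node nodea nodeb
keepalive 1
deadtime 64 # 16 at the time
warntime 16 # nonexistent at the time
ucast eth0 172.18.88.93 # nodeb's public interface
ucast eth1 172.17.1.1 # nodeb's private interface
ping 172.18.91.254 # Ping router
auto_failback off
respawn hacluster /usr/lib/heartbeat/ipfail
"""

channels, timings = summarize_ha_cf(NODEA_HA_CF)
print(len(channels), "ucast channels:", channels)
print("warntime < deadtime:", timings["warntime"] < timings["deadtime"])
```

This only reports what the file declares. It says nothing about whether both links were actually carrying heartbeats at the time of the failure.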
>2. You don't have STONITH to protect you against split-brain scenarios.
>
>
I'll try to remedy this.
>3. Something pretty bad happened to that one channel causing it to be
>completely gone for 7 minutes. You better find out what that was... Did
>someone unplug the cable?
>
>
Not that I know of. The link seemed fine during testing before I brought
this stuff up and has seemed fine ever since. I've invalidated the drbd
device on the secondary nodes and let them get a complete resync (320
GB) from the Primary (nodea:drbd0-->nodeb:drbd0 and
nodeb:drbd1-->nodea:drbd1) several times without a problem.
Something must have happened, though. I just went back through the logs
on nodeb and found this:
Feb 25 12:37:56 nodeb kernel: NETDEV WATCHDOG: eth1: transmit timed out
Feb 25 12:37:56 nodeb kernel: tg3: eth1: transmit timed out, resetting
Feb 25 12:37:56 nodeb kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Feb 25 12:37:56 nodeb kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Feb 25 12:37:56 nodeb kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Feb 25 12:37:56 nodeb kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Feb 25 12:37:56 nodeb kernel: tg3: eth1: Link is down.
Feb 25 12:37:59 nodeb kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Feb 25 12:37:59 nodeb kernel: tg3: eth1: Flow control is on for TX and on for RX.
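To see how long the interface was actually down according to these kernel messages, the "Link is down" and "Link is up" timestamps can be paired up. The following Python is a rough sketch, not an existing tool: link_outages and LINK_RE are made-up names, the regular expression only targets the tg3 message format shown in this excerpt, the year is supplied by hand because syslog lines omit it, and the month abbreviation is assumed to be in English.

```python
import re
from datetime import datetime

# Matches only the tg3 "Link is down/up" lines as they appear in the excerpt.
LINK_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: tg3: (\w+): Link is (down|up)"
)


def link_outages(lines, year=2005):
    """Return a list of (interface, seconds_down) for each down/up pair."""
    down_since = {}   # interface -> datetime of the last "Link is down"
    outages = []
    for line in lines:
        match = LINK_RE.match(line)
        if not match:
            continue
        stamp, iface, state = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if state == "down":
            down_since[iface] = when
        elif iface in down_since:
            delta = when - down_since.pop(iface)
            outages.append((iface, delta.total_seconds()))
    return outages


# Two lines taken from the nodeb log excerpt above.
sample = [
    "Feb 25 12:37:56 nodeb kernel: tg3: eth1: Link is down.",
    "Feb 25 12:37:59 nodeb kernel: tg3: eth1: Link is up at 1000 Mbps, full",
]
outages = link_outages(sample)
print(outages)
```

For the two sample lines this reports eth1 down for 3 seconds, matching the 12:37:56 and 12:37:59 timestamps in the excerpt.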
>4. drbd got stuck somewhere. That's bad, and shouldn't have happened,
>and the logs don't tell why. The good news is that heartbeat recovered
>fine ;-) I'd report that one to the drbd list.
>
>
Will do.
>5. udev/hotplug/subfs etc are acting up and trying to mount stuff they
>shouldn't touch. Bad, no cookies for them, that spews the logs quite
>awkwardly. Maybe that's what confused drbd, too. I wonder why there's no
>such message in the logs on nodea? Is the configuration different
>somewhere?
>
>
I may just get rid of subfs.
>
>Sincerely,
> Lars Marowsky-Brée <lmb at suse.de>
>
>
Thanks for all the analysis. It helps a lot!
Jason