[Linux-HA] Re: deadtime, warntime, and drbd

Marc Cousin mcousin at sigma.fr
Mon Mar 7 02:16:38 MST 2005


On Friday 04 March 2005 19:34, Jason Joines wrote:
> Alan Robertson wrote:
> > Jason Joines wrote:
> >>    I recently experienced the "Cluster node returning after
> >> partition" problem described in FAQ #12.  I have two nodes and two
> >> resource groups, one is the preferred node for each.  Nodea is the
> >> preferred node for drbd0, its filesystem, an ip address, and samba.
> >> Nodeb is the preferred node for drbd1.  Both are connected to a public
> >> 100 Mbps switch via eth0 and a private 1 Gbps switch via eth1.
> >>    At the time this occurred, nodea was serving smb requests to a
> >> large number of clients via eth0.  I had mounted drbd1 on nodeb,
> >> exported it via NFS, and was rapidly copying the entire filesystem of
> >> another box to it via eth1.  Apparently the load got high enough on
> >> nodeb that communication between the nodes failed and mass confusion
> >> ensued (at least that's what I can make of the logs).  Eventually
> >> nodeb rebooted itself, the drbds went into either StandAlone or
> >> Disconnected mode and I had to manually tell nodea to take the smb
> >> resource group back.
> >>    My timing settings in ha.cf at the time were
> >> keepalive 1
> >> deadtime 16
> >>    Following the FAQ suggestion I have upped deadtime to 64 and set
> >> warntime to 16 so I can watch the logs for a while.  However, I'm
> >> unsure how my drbd timing settings are interacting with this.  They
> >> were, and at the moment still are,
> >> connect-int 8
> >> ping-int 4
> >> timeout 20
> >>    Any suggestions for modifying these settings to be more in tune
> >> with heartbeat?
> >
> > What version of heartbeat are you trying this on?
>
>     I'm using heartbeat 1.2.3 and drbd 0.7.10 on SuSE 9.2 with kernel
> 2.6.10.
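
On the drbd side of the question, one thing that makes the two sets of numbers hard to compare is the units. The sketch below is only a rough aid, not a tuning recommendation. It assumes the drbd 0.7 convention that "timeout" is given in tenths of a second while "connect-int" and "ping-int" are in seconds (please check the drbd.conf man page for your version), and that the ha.cf values are plain seconds. The helper name drbd_seconds is made up for this example.

```python
# Values quoted in this thread.
heartbeat = {"keepalive": 1, "warntime": 16, "deadtime": 64}  # seconds
drbd = {"connect-int": 8, "ping-int": 4, "timeout": 20}

def drbd_seconds(settings):
    """Convert drbd net settings to seconds.

    Assumption: 'timeout' is in tenths of a second (drbd 0.7 convention),
    every other key is already in seconds.
    """
    return {
        name: (value / 10.0 if name == "timeout" else float(value))
        for name, value in settings.items()
    }

drbd_s = drbd_seconds(drbd)

for name in sorted(drbd_s):
    print("drbd %-12s %5.1f s" % (name, drbd_s[name]))
for name in sorted(heartbeat):
    print("heartbeat %-9s %5.1f s" % (name, heartbeat[name]))
```

Under that assumption drbd gives up on its peer far sooner than heartbeat declares the node dead, which may be worth keeping in mind when reading the logs.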

I'm reacting to the mention of 2.6.10.
We just tried to run heartbeat inside Xen hosts.
Heartbeat fails here with 2.6.10, first with
heartbeat[608]: ERROR: No local heartbeat. Forcing restart
and then with
heartbeat[608]: WARN: Late heartbeat: Node fwautocom-2: interval 41580 ms

It does not fail with 2.4.29 (rock stable). Is anybody else having the same
kind of problems with 2.6.10?
I must mention we have no load whatsoever on the failing nodes with 2.6.10, 
and that the failure is systematic.
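
For anyone who wants to compare numbers, here is a small sketch for pulling the late-heartbeat intervals out of the logs. It only assumes the message format shown above ("WARN: Late heartbeat: Node <name>: interval <n> ms"); the function names are made up for this example, and you would feed it lines from wherever your syslog puts heartbeat messages.

```python
import re

# Matches lines such as:
# heartbeat[608]: WARN: Late heartbeat: Node fwautocom-2: interval 41580 ms
LATE_RE = re.compile(r"WARN: Late heartbeat: Node (\S+): interval (\d+) ms")

def late_intervals(lines):
    """Return {node: [interval_ms, ...]} for every late-heartbeat warning."""
    result = {}
    for line in lines:
        match = LATE_RE.search(line)
        if match:
            node, interval_ms = match.group(1), int(match.group(2))
            result.setdefault(node, []).append(interval_ms)
    return result

def worst_interval_seconds(lines):
    """Return the largest late interval in seconds, or None if there is none."""
    all_ms = [ms for values in late_intervals(lines).values() for ms in values]
    return max(all_ms) / 1000.0 if all_ms else None

sample = [
    "heartbeat[608]: ERROR: No local heartbeat. Forcing restart",
    "heartbeat[608]: WARN: Late heartbeat: Node fwautocom-2: interval 41580 ms",
]
print(late_intervals(sample))
print(worst_interval_seconds(sample))
```

The largest interval seen over a representative period gives a rough lower bound to compare against deadtime.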

We're using Xen 2.0.4, Debian unstable, and heartbeat 1.2.3 (as packaged by
Debian).
Of course, the kernels are patched to use Xen.

>
> Jason
> ===========