Jason Joines support at bus.okstate.edu
Mon Mar 7 13:00:12 MST 2005

Lars Marowsky-Bree wrote:

>On 2005-03-04T09:43:21, Jason Joines <support at bus.okstate.edu> wrote:
>>   At the time this occurred, nodea was serving smb requests to a large 
>>number of clients via eth0.  I had mounted drbd1 on nodeb, exported it 
>>via NFS, and was rapidly copying the entire filesystem of another box to 
>>it via eth1.  Apparently the load got high enough on nodeb that 
>>communication between the nodes failed and mass confusion ensued (at 
>>least that's what I can make of the logs).  Eventually nodeb rebooted 
>>itself, the drbds went into either StandAlone or Disconnected mode and I 
>>had to manually tell nodea to take the smb resource group back.
>It literally rebooted itself? Are you using the watchdog timer?
>Please provide the log messages of the node directly prior to the
>    Lars Marowsky-Brée <lmb at suse.de>

    Yep, literally.  I'm having trouble getting the logs through due to 
the 40 KB message size limit on the list.  Looks like mine hit 57 KB.  
I'm going to try to send them separately.

    Honestly, I don't even know what the "watchdog timer" is.  Both 
boxes are Dell PowerEdges.  Nodea is a 2450 and nodeb is a 2550.  Both 
are using onboard Adaptec aic7899 Ultra160 SCSI adapters.  Both boxes 
are using IBM Ultrastar, 146 GB, Ultra320 SCSI drives.  Both have drbd0 
as sdb and drbd1 as sdc.  The following messages contain every line from 
the logs on both boxes that contains drbd, ipfail, or heartbeat, from 
the time I started the NFS operation on nodeb (12:02:42) through the 
reboot of nodeb (12:38:16) and up until nodeb came back up (12:42:38).


