[Linux-HA] heartbeat 2.0.8: causing nfs kernel oops

Gerry Reno greno at verizon.net
Tue May 1 10:48:14 MDT 2007


Alan Robertson wrote:
> Gerry Reno wrote:
>   
>> I'm seeing some very strange things lately.  Whenever heartbeat is
>> running there are these messages in the log:
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: write failure on
>> bcast eth0.: No such device
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: glib: Unable to
>> send bcast [-1] packet(len=214): No such device
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG: Dumping
>> message with 10 fields
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[0] :
>> [t=NS_ackmsg]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[1] :
>> [dest=grp-01-30-02]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[2] :
>> [ackseq=40cd2]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[3] :
>> [(1)destuuid=0x835cfc8(37 28)]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[4] :
>> [src=grp-01-30-01]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[5] :
>> [(1)srcuuid=0x8361848(36 27)]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[6] : [hg=a1]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[7] :
>> [ts=46367de0]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[8] : [ttl=4]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[9] : [auth=1
>> dcf0feb393f46354b060306713eb72adc15eecf3]
>>
>> Yet in most other respects eth0 seems to behave perfectly normally.
>> I even went so far as to swap out the NIC for eth0, with the same
>> result.  I can ping, ftp, ssh, etc. over eth0 with no problems.  Where
>> I do see a problem is with NFS.  If I mount a remote NFS export and
>> try to push a compressed tar to the NFS-mounted directory, after about
>> 1GB of transfer I get a kernel oops in the NFS code.  If I shut down
>> heartbeat and run the same compressed tar, it completes correctly
>> without any oops.  So I'm baffled by this.  Is there any known problem
>> that would cause the above log messages on an otherwise perfectly good
>> network connection and also cause some type of interaction with NFS?
>> This problem seems to follow the primary node.  In other words, the
>> lockup occurs on whichever node has the primary IPaddr.  I can post the
>> log, but it's hundreds of megabytes of this same message.
>>     
>
> Yes.
>
> Running DHCP on a network link.  Taking the link down manually.  Other
> things that involve messing around with eth0.
>   
Alan,
  Where do you think this problem lies?  Is it a kernel problem or a
heartbeat problem?  Is this something that has been, or can be, addressed
by the heartbeat team?  Is there a workaround or fix?  This problem greatly
interferes with other network activity on our servers, such as backups.
That is how I discovered it: none of the backups were completing
overnight, and the whole machine would be locked up due to the kernel oops.
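
Since the full log is hundreds of megabytes of largely identical lines,
one way to make it postable is to collapse it into a count per distinct
message.  The following is only a rough sketch, assuming each log entry
sits on a single line in the "date host heartbeat: [pid]: LEVEL: text"
layout shown above (the wrapping in the excerpt comes from the mail
client).  The function name summarize is hypothetical and is not part of
heartbeat.

```python
import re
from collections import Counter

# Assumed layout of one unwrapped syslog line, e.g.
# "Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: write failure on bcast eth0.: No such device"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+heartbeat:\s+"
    r"\[(?P<pid>\d+)\]:\s+(?P<level>[A-Z]+):\s*(?P<msg>.*)$"
)


def summarize(lines):
    """Return (count, first_timestamp, level, message) tuples, most frequent first.

    Lines that do not match the assumed heartbeat layout are skipped.
    """
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue
        key = (m.group("level"), m.group("msg"))
        counts[key] += 1
        # Remember when each distinct message first appeared.
        first_seen.setdefault(key, m.group("ts"))
    return [(n, first_seen[k], k[0], k[1]) for k, n in counts.most_common()]
```

It can be fed an open log file, for example summarize(open(path)) where
path is wherever heartbeat logs on the node in question.  The per-field
MSG[n] dump lines carry changing values (ackseq, ts), so they show up as
many low-count entries, while the "write failure" and "Unable to send"
errors collapse to one line each.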

thx,
-Gerry



