[Linux-HA] heartbeat 2.0.8: causing nfs kernel oops
Gerry Reno
greno at verizon.net
Tue May 1 10:48:14 MDT 2007
Alan Robertson wrote:
> Gerry Reno wrote:
>
>> I'm seeing some very strange things lately. Whenever heartbeat is
>> running there are these messages in the log:
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: write failure on bcast eth0.: No such device
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: glib: Unable to send bcast [-1] packet(len=214): No such device
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG: Dumping message with 10 fields
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[0] : [t=NS_ackmsg]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[1] : [dest=grp-01-30-02]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[2] : [ackseq=40cd2]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[3] : [(1)destuuid=0x835cfc8(37 28)]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[4] : [src=grp-01-30-01]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[5] : [(1)srcuuid=0x8361848(36 27)]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[6] : [hg=a1]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[7] : [ts=46367de0]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[8] : [ttl=4]
>> Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: MSG[9] : [auth=1 dcf0feb393f46354b060306713eb72adc15eecf3]
>>
>> Yet in most other respects eth0 seems to behave perfectly normally.
>> I even swapped out the NIC for eth0, with the same result. I can ping,
>> ftp, ssh, etc. over eth0 with no problems. Where I do see a problem is
>> with NFS. If I mount a remote NFS export and try to push a compressed
>> tar to the NFS-mounted directory, after about 1GB of transfer I get a
>> kernel oops in the NFS code. If I shut down heartbeat and perform the
>> same compressed tar, it completes correctly without any oops. So I'm
>> baffled by this. Is there any known problem that would cause the above
>> log messages on an otherwise perfectly good network connection and also
>> cause some type of interaction with NFS? This problem seems to follow
>> the primary node. In other words, the lockup occurs on whichever node
>> holds the primary IPaddr. I can post the log, but it's hundreds of
>> megabytes of this same message.
>>
>
> Yes.
>
> Running DHCP on a network link, taking the link down manually, or other
> things that involve messing around with eth0 can all cause this.
>
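One way to test whether something is taking eth0 away is to poll for the interface while heartbeat runs. The following is only a minimal sketch: `interface_exists` and `watch_interface` are hypothetical helper names, and it assumes that the "No such device" errors coincide with the kernel briefly not knowing the interface name, which the thread does not confirm.

```python
import socket
import time


def interface_exists(name):
    """Return True if the kernel currently knows an interface with this name."""
    try:
        # if_nametoindex raises OSError when no such interface exists.
        socket.if_nametoindex(name)
        return True
    except OSError:
        return False


def watch_interface(name, checks=5, interval=1.0, sleep=time.sleep):
    """Poll the interface `checks` times; return (timestamp, present) samples."""
    samples = []
    for i in range(checks):
        samples.append((time.time(), interface_exists(name)))
        if i < checks - 1:
            sleep(interval)
    return samples


if __name__ == "__main__":
    # "eth0" is the interface named in the log above.
    for stamp, present in watch_interface("eth0", checks=10, interval=0.5):
        print(time.strftime("%H:%M:%S", time.localtime(stamp)),
              "present" if present else "MISSING")
```

If the interface ever shows as missing while a DHCP client or a manual ifdown/ifup is active, that would point at link management rather than at the NIC itself.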
Alan,
Where do you think this problem lies? Is it a kernel problem or a
heartbeat problem? Is this something that is being, has been, or can be
addressed by the heartbeat team? Is there a workaround or fix? This
problem greatly interferes with other network activities that need to
take place on our servers, such as backups. That is how I discovered it:
none of the backups were completing overnight, and the whole machine
would be locked up due to the kernel oops.
thx,
-Gerry
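
Since the log is reported to be hundreds of megabytes of the same message, a short summary may be easier to post than the file itself. Below is a rough sketch that counts distinct heartbeat messages. The regular expression is an assumption based only on the lines quoted above (each log entry on a single line), and `summarize` is a hypothetical helper, not part of heartbeat.

```python
import re
from collections import Counter

# Prefix as it appears in the quoted log, for example:
# "Apr 30 19:38:08 grp-01-30-01 heartbeat: [2533]: ERROR: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+heartbeat:\s+"
    r"\[(?P<pid>\d+)\]:\s+(?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count heartbeat log entries by (level, message); skip other lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a heartbeat entry in the assumed format
        counts[(match.group("level"), match.group("msg"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py /path/to/logfile (the path is up to the reader).
    with open(sys.argv[1], errors="replace") as handle:
        for (level, msg), n in summarize(handle).most_common(10):
            print(f"{n:8d}  {level}: {msg}")
```

Iterating over the file handle reads one line at a time, so the whole log never has to fit in memory.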