[Linux-HA] Broadcasts stop on all links with only 1 link broken.

Chris Paulson-Ellis chris at edesix.com
Wed Jul 6 04:54:26 MDT 2005


Hi,

I have 2 private LANs (eth0, eth1) used for broadcast heartbeats 
(version 1.2.3). Everything works fine as long as both LANs remain 
connected. If I disconnect one of them (say eth1), then the heartbeat 
broadcasts continue on the other (eth0) for about 10 minutes, then stop! 
With full debug logging turned on, all logging stops at the same time as 
the heartbeats stop.

I used strace on the heartbeat processes to work out what is going on...

The master control process appears to be stuck in an endless loop trying 
to send to a full FIFO...

Process 19875 attached - interrupt to quit
send(11, "r\0\0\0", 4, MSG_DONTWAIT|0x4000) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 2000001}, NULL)           = 0
send(11, "r\0\0\0", 4, MSG_DONTWAIT|0x4000) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 2000001}, NULL)           = 0
...

The broadcast process for the disconnected interface (eth1) is blocked 
trying to send a heartbeat packet...

Process 19929 attached - interrupt to quit
sendto(14, ">>>\nt=status\nst=active\nsrc=helm\n"..., 114, 0, {sa_family=AF_INET, sin_port=htons(694), sin_addr=inet_addr("192.168.2.255")}, 16

All other processes are waiting for something to happen either by 
reading FIFOs or waiting for a network packet.

I assume that the sendto() on the disconnected network blocks, as the 
manual page allows it to. That stops the broadcast process from reading 
and writing the command FIFO, which in turn leaves the master control 
process stuck in an endless loop trying to place a message in a full 
FIFO.
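
The EAGAIN half of this is easy to reproduce outside heartbeat. The 
sketch below is only an illustration, not heartbeat code. It assumes the 
command channel behaves like an AF_UNIX stream socket (strace shows 
send() on fd 11, and 0x4000 is MSG_NOSIGNAL on Linux), and 
fill_until_eagain() is a hypothetical helper name. It queues the same 
4-byte message until the kernel refuses it because nobody is reading the 
other end.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical demo helper (not from the heartbeat source).
 * Sends the 4-byte message seen in the strace output with
 * MSG_DONTWAIT until the socket buffer is full.
 * Returns the number of messages queued before EAGAIN,
 * or -1 on any other error or a short write.
 */
static long fill_until_eagain(int fd)
{
    static const char msg[4] = { 'r', 0, 0, 0 };
    long queued = 0;

    assert(fd >= 0);

    for (;;) {
        ssize_t rc = send(fd, msg, sizeof msg,
                          MSG_DONTWAIT | MSG_NOSIGNAL);

        if (rc == (ssize_t)sizeof msg) {
            queued++;           /* message accepted by the kernel */
            continue;
        }
        if (rc < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return queued;      /* buffer full: the reader is not draining */
        return -1;              /* short write or unexpected error */
    }
}
```

With a socketpair() whose far end is never read, the first call returns 
some positive count and every later call returns 0. That is the state 
the master control process appears to be retrying in.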

The Ethernet driver is the e1000 (Intel Gigabit) on a 2.4.26 kernel. I'm 
a bit surprised that it blocks sendto() rather than just tossing away 
packets, but on the other hand it is not unreasonable to apply back 
pressure to prevent a process from flooding it with packets when it 
knows they are not leaving its transmit buffer. It will certainly do 
this when Ethernet flow control is enabled and the link is busy.

If you reconnect the link, then 2 heartbeat packets come out 
back-to-back and then normal service is resumed, so maybe the driver is 
tossing away packets (2 packets is not enough to fill the send queue), 
but is also queuing packets it cannot deliver at least for a while. It 
may not be tossing them away, but rather they get removed when the link 
comes up because they are too old to transmit.

The 10 minute delay on this failure is either the time it takes to fill 
the transmit queue on the interface, or the time it takes to fill the 
command FIFO with messages that are not being drained. I'm not sure which.

Perhaps the broadcast code (and other code which uses sendto()) should 
use a non-blocking socket to prevent this happening? It would simply log 
a warning and give up if it gets EAGAIN.
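
A minimal sketch of what I have in mind, assuming the broadcast code can 
pass MSG_DONTWAIT on each call rather than switching the socket to 
O_NONBLOCK. The helper name, return values and log text are 
hypothetical, not taken from the heartbeat source.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from the heartbeat source).
 * Tries to send one heartbeat packet without ever blocking.
 * Returns  1 if the packet was handed to the kernel,
 *          0 if it was dropped because the send would have blocked,
 *         -1 on any other error.
 */
static int send_heartbeat_nonblocking(int fd, const void *buf, size_t len,
                                      const struct sockaddr *dest,
                                      socklen_t destlen)
{
    ssize_t rc;

    assert(buf != NULL && dest != NULL);

    rc = sendto(fd, buf, len, MSG_DONTWAIT, dest, destlen);
    if (rc >= 0)
        return 1;

    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        /* Transmit queue is full: warn and drop this packet only. */
        fprintf(stderr, "WARN: send would block, dropping heartbeat\n");
        return 0;
    }

    fprintf(stderr, "ERROR: sendto failed: %s\n", strerror(errno));
    return -1;
}
```

The caller would treat a 0 return as a lost packet and carry on 
servicing its FIFO, so a dead link costs heartbeats on that link only.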

Regards,
Chris.



