[Linux-HA] Broadcasts stop on all links with only 1 link broken.

Alan Robertson alanr at unix.sh
Wed Jul 6 07:22:18 MDT 2005


Chris Paulson-Ellis wrote:
> Hi,
> 
> I have 2 private LANs (eth0, eth1) used for broadcast heartbeats 
> (version 1.2.3). Everything works fine as long as both LANs remain 
> connected. If I disconnect one of them (say eth1), then the heartbeat 
> broadcasts continue on the other (eth0) for about 10 minutes, then stop! 
> With full debug logging turned on, all logging stops at the same time as 
> the heartbeats stop.
> 
> I used strace on the heartbeat processes to work out what is going on...
> 
> The master control process appears to be stuck in an endless loop trying 
> to send to a full FIFO...
> 
> Process 19875 attached - interrupt to quit
> send(11, "r\0\0\0", 4, MSG_DONTWAIT|0x4000) = -1 EAGAIN (Resource temporarily unavailable)
> nanosleep({0, 2000001}, NULL)           = 0
> send(11, "r\0\0\0", 4, MSG_DONTWAIT|0x4000) = -1 EAGAIN (Resource temporarily unavailable)
> nanosleep({0, 2000001}, NULL)           = 0
> ...
> 
> The broadcast process for the disconnected interface (eth1) is blocked 
> trying to send a heartbeat packet...
> 
> Process 19929 attached - interrupt to quit
> sendto(14, ">>>\nt=status\nst=active\nsrc=helm\n"..., 114, 0, {sa_family=AF_INET, sin_port=htons(694), sin_addr=inet_addr("192.168.2.255")}, 16
> 
> All other processes are waiting for something to happen either by 
> reading FIFOs or waiting for a network packet.
> 
> I assume that what is happening here is that the sendto() on the 
> disconnected network blocks, as it is allowed to do according to the 
> manual page and that this causes the broadcast process to stop reading 
> and writing the command FIFO, which causes the master control process to 
> get stuck in an endless loop trying to place a message in a full FIFO.
> 
> The ethernet driver is the e1000 (Intel Gigabit) on a 2.4.26 kernel. I'm 
> a bit surprised that it blocks sendto() rather than just tossing away 
> packets, but on the other hand it is not unreasonable to apply back 
> pressure to prevent a process from flooding it with packets when it 
> knows they are not leaving its transmit buffer. It will certainly do 
> this when ethernet flow control is enabled and the link is busy.
> 
> If you reconnect the link, then 2 heartbeat packets come out 
> back-to-back and then normal service is resumed, so maybe the driver is 
> tossing away packets (2 packets is not enough to fill the send queue), 
> but is also queuing packets it cannot deliver at least for a while. It 
> may not be tossing them away, but rather they get removed when the link 
> comes up because they are too old to transmit.
> 
> The 10 minute delay on this failure is either the time it takes to fill 
> the transmit queue on the interface, or the time it takes to fill the 
> command FIFO with messages that are not being drained. I'm not sure which.
> 
> Perhaps the broadcast code (and other code which uses sendto()) should 
> use a non-blocking socket to prevent this happening? It would simply log 
> a warning and give up if it gets EAGAIN.
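
For reference, the non-blocking approach suggested above could look
roughly like the sketch below. It is written in Python for brevity
rather than in heartbeat's own C, and it is not taken from the heartbeat
source: the names make_broadcast_socket, send_heartbeat and DROP_ERRNOS
are hypothetical, and treating ENOBUFS the same way as EAGAIN is an
assumption.

```python
import errno
import socket

# Errors treated as "transmit path is full, drop this packet".
# Including ENOBUFS here is an assumption, not heartbeat behaviour.
DROP_ERRNOS = {errno.EAGAIN, errno.EWOULDBLOCK, errno.ENOBUFS}


def make_broadcast_socket():
    """Create a UDP socket that may broadcast and never blocks on send."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.setblocking(False)
    return sock


def send_heartbeat(sock, payload, addr, log=print):
    """Try to send one packet; return True if sent, False if dropped.

    If the send would block, log a warning and give up on this packet
    so that the caller can keep servicing its other work.
    """
    try:
        sock.sendto(payload, addr)
        return True
    except OSError as exc:
        if exc.errno in DROP_ERRNOS:
            log("warning: heartbeat to %s:%d dropped (%s)"
                % (addr[0], addr[1], errno.errorcode[exc.errno]))
            return False
        raise  # anything else is a real error for the caller to handle
```

A broadcast process built this way would call send_heartbeat(sock,
packet, ("192.168.2.255", 694)) on each interval and carry on with its
loop whether or not the packet went out, so one dead link could not
stall it.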

What version is this?

We haven't used the FIFO for much of anything since 1.0.x days.  I'm 
pretty sure that 1.2.x is better in that respect.

If a similar thing happens in 1.2.x, the symptoms should be quite 
different.  It is something we could handle in 1.2.x much better (but 
maybe we do, and maybe we don't).


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce


