[Linux-HA] Broadcasts stop on all links with only 1 link broken.
Chris Paulson-Ellis
chris at edesix.com
Wed Jul 6 04:54:26 MDT 2005
Hi,
I have 2 private LANs (eth0, eth1) used for broadcast heartbeats
(version 1.2.3). Everything works fine as long as both LANs remain
connected. If I disconnect one of them (say eth1), then the heartbeat
broadcasts continue on the other (eth0) for about 10 minutes, then stop!
With full debug logging turned on, all logging stops at the same time as
the heartbeats stop.
I used strace on the heartbeat processes to work out what is going on...
The master control process appears to be stuck in an endless loop trying
to send to a full FIFO...
Process 19875 attached - interrupt to quit
send(11, "r\0\0\0", 4, MSG_DONTWAIT|0x4000) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 2000001}, NULL) = 0
send(11, "r\0\0\0", 4, MSG_DONTWAIT|0x4000) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 2000001}, NULL) = 0
...
The broadcast process for the disconnected interface (eth1) is blocked
trying to send a heartbeat packet...
Process 19929 attached - interrupt to quit
sendto(14, ">>>\nt=status\nst=active\nsrc=helm\n"..., 114, 0,
{sa_family=AF_INET, sin_port=htons(694),
sin_addr=inet_addr("192.168.2.255")}, 16
All other processes are waiting for something to happen either by
reading FIFOs or waiting for a network packet.
I assume that what is happening here is that the sendto() on the
disconnected network blocks, as it is allowed to do according to the
manual page and that this causes the broadcast process to stop reading
and writing the command FIFO, which causes the master control process to
get stuck in an endless loop trying to place a message in a full FIFO.
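The EAGAIN half of this theory can be reproduced in isolation. The following is only a minimal sketch, not heartbeat's own code: it assumes a Unix socket pair as a stand-in for the command FIFO, and the helper name fill_until_eagain is made up for illustration. The reader end is never drained, so the writer eventually gets EAGAIN on every send, just like the master control process in the strace output.

```python
import socket


def fill_until_eagain(sock, payload=b"r\0\0\0"):
    """Send payload repeatedly on a non-blocking socket.

    Returns the number of send() calls that succeeded before the
    kernel reported EAGAIN (raised as BlockingIOError in Python).
    """
    sent = 0
    while True:
        try:
            sock.send(payload)
        except BlockingIOError:
            return sent
        sent += 1


# Stand-in for the command channel (an assumption, not heartbeat's real setup).
writer, reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
writer.setblocking(False)

# Nobody reads from `reader`, which mimics the blocked broadcast process.
accepted = fill_until_eagain(writer)
print("sends accepted before EAGAIN:", accepted)
```

The count printed depends on the kernel's socket buffer settings, so it says nothing about the 10 minute figure. It only shows that a writer using non-blocking sends sees EAGAIN once its peer stops reading.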
The ethernet driver is the e1000 (Intel Gigabit) on a 2.4.26 kernel. I'm
a bit surprised that it blocks sendto() rather than just tossing away
packets, but on the other hand it is not unreasonable to apply back
pressure to prevent a process from flooding it with packets when it
knows they are not leaving its transmit buffer. It will certainly do
this when ethernet flow control is enabled and the link is busy.
If you reconnect the link, then 2 heartbeat packets come out
back-to-back and then normal service is resumed. Since 2 packets are
not enough to fill the send queue, maybe the driver is tossing away
most packets while still queuing some that it cannot deliver, at least
for a while. Alternatively, it may not be tossing them away at all, but
rather they get removed when the link comes up because they are too old
to transmit.
The 10 minute delay on this failure is either the time it takes to fill
the transmit queue on the interface, or the time it takes to fill the
command FIFO with messages that are not being drained. I'm not sure which.
Perhaps the broadcast code (and other code which uses sendto()) should
use a non-blocking socket to prevent this happening? It would simply log
a warning and give up if it gets EAGAIN.
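To make the suggestion concrete, here is a rough sketch of the idea in Python rather than in heartbeat's own C. The function names make_bcast_socket and send_heartbeat and the logger name are hypothetical and do not come from the heartbeat source. It assumes that dropping a single heartbeat is acceptable, since the next one follows shortly afterwards.

```python
import logging
import socket

log = logging.getLogger("hb_bcast_sketch")  # hypothetical logger name


def make_bcast_socket():
    """Create a UDP socket that may broadcast and never blocks on send."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.setblocking(False)
    return sock


def send_heartbeat(sock, packet, addr):
    """Try to send one heartbeat packet without blocking.

    Returns True if the kernel accepted the packet. If the send would
    block (EAGAIN/EWOULDBLOCK), logs a warning, drops the packet and
    returns False. Other socket errors are left to propagate.
    """
    try:
        sock.sendto(packet, addr)
        return True
    except BlockingIOError:
        log.warning("send to %s:%d would block; dropping heartbeat", *addr)
        return False
```

A caller would use it roughly as send_heartbeat(sock, packet, ("192.168.2.255", 694)) and then go straight back to servicing the command FIFO whether or not the packet went out. That is what keeps the master control process from backing up behind a dead link.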
Regards,
Chris.