[Linux-HA] Re: Broadcasts stop on all links with only 1 link broken.

Chris Paulson-Ellis chris at edesix.com
Wed Jul 6 09:39:09 MDT 2005


Alan Robertson wrote:
> 1.2.x NEVER blocks writing to IPC sockets.  It should eventually get an 
> error and then restart.

Strace showed that the master control process is not blocked in a system
call. It is stuck in a loop doing send() on the IPC socket, which returns
EAGAIN (also called EWOULDBLOCK in this context), then nanosleep(). So it
is logically blocked in a retry loop.

Looking at socket_resume_io_write(), it calls cl_shortsleep() and loops
on EAGAIN. There is no timeout at this level. The following comment
appears at this point in the code :-):

                     /* FIXME! KLUDGE! */
                     /* We could fix this if we kept better
                      * state info so we could retry this
                      * operation later and not be confused.
                      * This is the right thing to do!
                      */

> I'd like to see the logs for this...

They stop (for all processes) when the heartbeats stop and don't seem to 
contain anything out of the ordinary towards the end, but I can dig them 
up or reproduce them if you like.

I don't have heartbeat compiled with -DDEBUG, so there is no repeating 
"Sent n byte message header", "socket send returned EAGAIN" in the log, 
but from the strace output it is clear where we are.

Chris.


