[Linux-HA] Re: Broadcasts stop on all links with only 1 link broken.
chris at edesix.com
Wed Jul 6 09:39:09 MDT 2005
Alan Robertson wrote:
> 1.2.x NEVER blocks writing to IPC sockets. It should eventually get an
> error and then restart.
Strace showed that the master control process is not blocked in the
kernel. It is stuck in a busy loop: send() on the IPC socket returns
EAGAIN (also called EWOULDBLOCK in this context), then it calls
nanosleep() and tries again. So it is logically blocked in a retry loop.
Looking at socket_resume_io_write(), it calls cl_shortsleep() and loops
on EAGAIN. There is no timeout at this level. The following comment
appears at this point in the code :-):
/* FIXME! KLUDGE! */
/* We could fix this if we kept better
* state info so we could retry this
* operation later and not be confused.
* This is the right thing to do!
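
For illustration only, here is a minimal sketch of what a bounded version
of that kind of retry loop could look like. This is not heartbeat code:
send_bounded(), demo_stalled_peer(), the max_retries and sleep_ns
parameters, and the choice of ETIMEDOUT as the give-up error are all
hypothetical names and conventions made up for the example. It also
assumes a platform that supports MSG_DONTWAIT (Linux does).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, NOT part of heartbeat: send on a socket without
 * blocking, sleeping briefly and retrying on EAGAIN/EWOULDBLOCK, but
 * giving up after max_retries sleeps instead of looping forever.
 * Returns the number of bytes sent, or -1 with errno set
 * (ETIMEDOUT if the retry budget was exhausted). */
static ssize_t send_bounded(int fd, const void *buf, size_t len,
                            int max_retries, long sleep_ns)
{
    assert(buf != NULL && max_retries >= 0);
    assert(sleep_ns >= 0 && sleep_ns < 1000000000L);

    for (int attempt = 0; ; attempt++) {
        ssize_t n = send(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0)
            return n;               /* sent (possibly partially) */
        if (errno == EINTR)
            continue;               /* interrupted: just try again */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;              /* real error: report to caller */
        if (attempt >= max_retries) {
            errno = ETIMEDOUT;      /* assumed convention for "gave up" */
            return -1;
        }
        struct timespec ts = { 0, sleep_ns };
        nanosleep(&ts, NULL);       /* short sleep before the next try */
    }
}

/* Demo: fill one end of a socketpair whose peer never reads, then call
 * the bounded sender. Returns the errno it reported, or 0 if the send
 * unexpectedly succeeded. */
static int demo_stalled_peer(void)
{
    int sv[2];
    char chunk[4096] = {0};

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return errno;

    /* Keep sending until the socket buffer is full (first failure). */
    while (send(sv[0], chunk, sizeof chunk, MSG_DONTWAIT) > 0)
        ;

    int saved = 0;
    if (send_bounded(sv[0], chunk, sizeof chunk, 3, 1000000L) < 0)
        saved = errno;

    close(sv[0]);
    close(sv[1]);
    return saved;
}
```

With a peer that never drains the socket, demo_stalled_peer() gets an
error back after a few short sleeps rather than spinning indefinitely,
so the caller has a chance to log the failure or restart.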
> I'd like to see the logs for this...
They stop (for all processes) when the heartbeats stop and don't seem to
contain anything out of the ordinary towards the end, but I can dig them
up or reproduce them if you like.
I don't have heartbeat compiled with -DDEBUG, so the repeating
"Sent n byte message header" and "socket send returned EAGAIN" messages
do not appear in the log, but from the strace output it is clear where
we are.