[Linux-ha-dev] Error in serial code of heartbeat?

Horms horms at vergenet.net
Fri Apr 21 02:27:20 MDT 2000


On Thu, Apr 20, 2000 at 09:24:43AM +0200, Holger Kiehl wrote:
> Hello
> 
> There seems to be a bug in heartbeat serial code. I have been using
> heartbeat for a very long time and have had no problems. But since I moved
> the machine and put a higher constant load on it, I am getting
> the following errors every hour:
>    TTY write timeout on [/dev/ttyS1] (no connection?)
> 
> At first I was running version 0.4.6c when these errors popped up. I
> rebooted both nodes several times, but this did not help. The error
> always popped up again. I then tried to do an strace on the heartbeat
> doing the serial stuff and could see that it always reads every two
> seconds from the serial fd, although the serial buffer was full with
> data! I could verify this by simply disconnecting the serial connection
> and the heartbeat process was still reading data from the serial
> port for about 5 - 10 minutes before the buffer was empty! Connecting
> it again, this time with a serial analyser between the two, one
> could see the buffer fill up until it was full again and the RTS
> signal dropped.
>
> It seems that heartbeat is reading just one record every two seconds
> and does not read everything from the buffer. So if the process
> writing to the port writes faster, it will always fill the
> buffer and heartbeat will NOT detect if the other node has
> crashed for 5 - 10 minutes until the buffer is empty.
> 
> Two days ago I decided to upgrade to 0.4.7 and everything seemed to
> be running. However looking at the log files this morning I see that
> the same messages appear in my log files on both nodes.
> 
> As I said this all started to happen when I moved the nodes from one
> room to another one and have more processes running on it causing a
> higher load on the active node:
> 
>  9:06am  up 1 day, 22:19,  5 users,  load average: 0.87, 0.61, 0.44
> 
> There are about 195 processes now running on the active node. Before
> I moved load average was always around zero.


I'm trying to track this down but I'm not having a lot of luck.
The serial code does read only one line at a time, but the process
that handles the reading of the data should (by my reading of the code)
be continuously reading information from the serial port.
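
To make the distinction concrete, here is a rough sketch of what "read
everything that is buffered on each wakeup" would look like, as opposed to
taking one line and going back to sleep. It is in Python for brevity and is
not heartbeat's actual C code; the name drain_lines, the newline-terminated
record format and the 4096-byte chunk size are all assumptions made for
illustration.

```python
import os


def drain_lines(fd, pending=b"", chunk=4096):
    """Read everything currently buffered on a non-blocking fd.

    Returns (complete_lines, leftover), where leftover holds the bytes
    of a trailing partial line to be passed back in on the next call.
    """
    data = pending
    while True:
        try:
            block = os.read(fd, chunk)
        except BlockingIOError:
            break  # nothing more is buffered right now
        if not block:
            break  # EOF: the writer closed its end
        data += block
    # Everything before the last newline is complete; the tail is not.
    *lines, rest = data.split(b"\n")
    return lines, rest


if __name__ == "__main__":
    # A pipe stands in for the serial fd in this demonstration.
    rfd, wfd = os.pipe()
    os.set_blocking(rfd, False)
    os.write(wfd, b"hb 1\nhb 2\nhb 3\npart")
    print(drain_lines(rfd))
```

With a reader of this shape the backlog cannot grow between wakeups, because
each call empties whatever has accumulated since the previous one.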

One possibility I thought of is that the buffer for the pipe
used to communicate status between heartbeat processes is being
filled. The process that reads the pipe should, again by my reading
of the code, be doing this continuously.
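
The failure mode I have in mind can be shown with a small sketch: if nothing
reads the pipe, the writer can only queue a bounded amount before it would
block. This is Python rather than heartbeat's C, and fill_pipe and the
512-byte record size are invented for the example; the actual pipe capacity
depends on the kernel.

```python
import os


def fill_pipe(wfd, record=b"x" * 512):
    """Write records to a pipe until the kernel refuses more.

    Returns the number of bytes accepted before the write would block.
    """
    os.set_blocking(wfd, False)
    written = 0
    while True:
        try:
            written += os.write(wfd, record)
        except BlockingIOError:
            # The pipe buffer is full; a blocking writer would stall here.
            return written


if __name__ == "__main__":
    rfd, wfd = os.pipe()
    print("pipe accepted", fill_pipe(wfd), "bytes with no reader draining it")
```

A writer using blocking I/O would simply hang at that point until the reader
caught up, which would look like a stalled heartbeat process.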

Another possibility is that heartbeat is taking so long to write
messages that it is unable to read messages fast enough (from the pipe).
This seems unlikely, though it would tie in with the load requirement.
If this is the case then a mechanism for flushing the buffers continuously,
and discarding backlogged messages, would be required.
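
One way such discarding could work, sketched in Python under the assumption
that each message carries a node name and a sequence number (this is a
hypothetical format, not heartbeat's real one, and newest_per_node is an
invented name): after draining the buffer, keep only the most recent message
from each node and drop the stale ones.

```python
def newest_per_node(messages):
    """Collapse a backlog to the latest message from each node.

    messages: iterable of (node, seq) tuples in arrival order.
    Returns (latest, dropped): a dict mapping node to its newest seq,
    and the number of older messages that were discarded.
    """
    latest = {}
    dropped = 0
    for node, seq in messages:
        if node in latest:
            dropped += 1  # an older message from this node is superseded
        latest[node] = seq
    return latest, dropped


if __name__ == "__main__":
    backlog = [("nodeA", 1), ("nodeB", 7), ("nodeA", 2), ("nodeA", 3)]
    print(newest_per_node(backlog))
```

The point of the design is that a stale heartbeat carries no useful
information once a newer one from the same node has arrived, so a reader
that has fallen behind can catch up in one pass instead of replaying minutes
of old data.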

I am going to try to reproduce this problem to understand it
better. Alan, do you have any ideas on what the cause might be?



-- 
Horms



