A few issues with heartbeat
Alan Robertson
alanr at suse.com
Fri Apr 7 22:05:09 MDT 2000
Horms wrote:
>
> On Fri, Apr 07, 2000 at 11:22:31PM +1000, Adrian Head wrote:
> >
> First up I would recommend moving to heartbeat 0.4.7 which I believe
> will address some of your problems, and at any rate 0.4.7 is
> more polished than 0.4.6c. I believe that 0.4.7 also contains
> Rudy Pawul's rsync document. I would also recommend moving to a 2.2.14
> kernel.
>
> [snip]
> > heartbeat: 2000/04/05_14:24:01 error: ha_msg_add_nv: line doesn't
> > contain '='
> > heartbeat: 2000/04/05_14:24:01 error: 1 0.03 0.01 2/34 5132
> > heartbeat: 2000/04/05_14:24:01 INFO: Running /etc/ha.d/resource.d/IPaddr
> > 192.168.0.200 status
> >
> > I assume that a transmission error occurred. How can I tell whether it
> > occurred on the Ethernet crossover or on the null serial cable? In the
> > last 7 days this error has only occurred once so I think I can assume
> > that it is a non-serious error. Does this error harm heartbeat's
> > operation at all?
>
> It shouldn't affect heartbeat, other than that the heartbeat message
> would be ignored.
And, in 0.4.7, that message would actually be resent. 0.4.7 is really
the way to go.
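To illustrate why a damaged line gets rejected, here is a rough sketch of name=value parsing. It assumes, going only by the error text above, that a message is a series of "name=value" lines. The function name and the sample field names ("t", "src") are made up for the example and are not heartbeat's actual code or wire format.

```python
def parse_nv_lines(lines):
    """Split 'name=value' lines into a dict; collect lines that lack '='.

    Hypothetical helper: it mimics the kind of check that would produce
    "line doesn't contain '='", not heartbeat's real parser.
    """
    fields = {}
    rejected = []
    for line in lines:
        name, sep, value = line.partition("=")
        if not sep:
            # No '=' anywhere in the line: treat it as corrupt and skip it.
            rejected.append(line)
            continue
        fields[name] = value
    return fields, rejected
```

Feeding it the line from the log ("1 0.03 0.01 2/34 5132") puts that line in the rejected list, while well-formed lines are kept.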
> > heartbeat: 2000/04/06_03:39:11 info: MSG stats: 100/43211 age 2
> > [pid4860/CONTROL]
> > heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 2134/950692
> > 85952/48870 [pid4860/CONTROL]
> > heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 87216 total
> > malloc bytes. pid 4860/CONTROL]
> > heartbeat: 2000/04/06_03:39:11 info: MSG stats: 0/134890 age 0
> > [pid4863/MST_STATUS]
> > heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 0/2357458 0/0
> > [pid4863/MST_STATUS]
> > heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 1616 total malloc
> > bytes. pid 4863/MST_STATUS]
>
> [snip]
>
> Heartbeat is dumping statistics for each of its processes.
> This is not an error, just informational.
And in particular, it dumps them so that everyone can be sure that
heartbeat doesn't have any nasty memory leaks in it. The Malloc stats
shouldn't be growing over time. When heartbeat is known to be really
mature (dead, I suppose :-)), these could be turned off. I don't
anticipate that happening soon. I actually thought about having
heartbeat restart itself if it sees its memory statistics growing too
large. Of course, I haven't done it, but I thought about it. Memory
leaks are incompatible with high-availability :-)
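If you want to watch those numbers yourself, something along these lines could do it. This is only a sketch: it assumes each log entry is on a single unwrapped line in the "RealMalloc stats: N total malloc bytes. pid PID/NAME]" form shown above, and the function names and the "rose at every step" rule are my own choices for illustration, not anything heartbeat does.

```python
import re
from collections import defaultdict

# Pattern taken from the log lines quoted above (assumed unwrapped).
REALMALLOC = re.compile(
    r"RealMalloc stats: (\d+) total malloc bytes\. pid \[?(\d+)/(\w+)\]"
)


def malloc_history(log_lines):
    """Return {(pid, process_name): [byte counts in log order]}."""
    history = defaultdict(list)
    for line in log_lines:
        match = REALMALLOC.search(line)
        if match:
            key = (int(match.group(2)), match.group(3))
            history[key].append(int(match.group(1)))
    return dict(history)


def looks_like_leak(samples, min_samples=3):
    """True if the count rose at every step over at least min_samples readings.

    A deliberately crude rule, chosen only for this example.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))
```

Run malloc_history over the log, then call looks_like_leak on each process's list. A flat series is what you hope to see.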
> > /dev/ttyS0
> > heartbeat: 2000/04/05_09:13:40 error: Error binding socket: Address
> > already in use
> > heartbeat: 2000/04/05_09:13:40 error: cannot open udp eth1
>
> Briefly, what is happening is that when heartbeat exits it is
> not closing the socket cleanly. This behaviour appears to
> be caused by the SO_BINDTODEVICE socket option that heartbeat
> utilises to enable it to have separate sockets - and hence
> processes - listening on different interfaces.
>
> Here are some options to get around this problem.
>
> 1. I have been unable to reproduce this problem using 2.2.14 kernels. It
> seems that you have found an environment where this problem occurs, which
> is good because I have been trying to find one so the problem can be
> resolved, but upgrading your kernel should help. Please let me know if the
> problem persists under 2.2.14 as I would like to know what is causing the
> problem.
>
> 2. I have a patch that removes the code that sets SO_BINDTODEVICE. This
> effectively means that heartbeat can only listen on one interface. This is
> fine as you only have one ethernet interface.
>
> 3. It would be possible to change heartbeat so binding is controlled
> by addresses rather than devices.
>
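For anyone unfamiliar with the difference between options 2 and 3 and the current behaviour, here is a small sketch of the two binding styles on a UDP socket. It is not heartbeat's code (heartbeat is written in C). The function names are invented for the example, and the device variant assumes Linux and sufficient privilege.

```python
import socket


def open_udp_by_address(local_addr, port):
    """Bind to one local IP address, so the address selects the interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((local_addr, port))
    return sock


def open_udp_by_device(ifname, port):
    """Bind to a named device with SO_BINDTODEVICE (Linux only).

    Usually needs root; the interface name is passed as a
    NUL-terminated byte string.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(
        socket.SOL_SOCKET, socket.SO_BINDTODEVICE, ifname.encode() + b"\0"
    )
    sock.bind(("", port))
    return sock
```

With the address style, one socket per interface address gives separate listeners without touching SO_BINDTODEVICE at all.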
> > The other problem is that the cluster sometimes partitions when one of
> > the nodes is restarted. I was looking through the mail list and found
> > someone discussing a problem about cluster partitioning during startup
> > caused by the time the scripts take to execute, but I don't think that
> > this applies here as one node is already up. In this situation using
> > "cat </dev/ttyS0" it seems that heartbeat has stopped sending the
> > heartbeat as nothing seems to be coming through. I'm not sure if my
> > diagnosis is correct as I've not put a serial analyser on the serial line
> > to double check. The logs give no clues at all - ha-log or ha-debug give
> > nothing away. It seems as if neither heartbeat sees the
> > other.
>
> I noticed this too :) You should try heartbeat 0.4.7; I have been unable to
> reproduce the problem with this version.
If you think that heartbeat isn't working right, you can send a SIGUSR1
to the lead process, and it will up the debug level by one. Doing this
5 or 6 times gets into some really serious debugging. Giving it SIGUSR2
decrements it correspondingly.
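If you'd rather script that than type kill several times, a sketch like the following would work. The helper name and its parameters are made up for this example, and you have to find the lead process's pid yourself. The short pause between signals is there because identical standard signals that are still pending are not queued, so sending them back to back may count as one.

```python
import os
import signal
import time


def change_debug_level(pid, steps, send=os.kill, delay=0.1):
    """Raise (steps > 0) or lower (steps < 0) the debug level of `pid`.

    Sends SIGUSR1 once per step up, SIGUSR2 once per step down.
    `send` is injectable so the logic can be tried without a real process.
    """
    sig = signal.SIGUSR1 if steps > 0 else signal.SIGUSR2
    for _ in range(abs(steps)):
        send(pid, sig)
        # Give the target time to handle each signal before the next one.
        time.sleep(delay)
```

For example, change_debug_level(pid, 5) bumps the level five times, and change_debug_level(pid, -5) brings it back down.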
> As an aside, one case that you haven't reported a problem with, and which we
> are still working on a solution to, is when the nodes lose communication with
> each other. In your situation this will occur if both the serial link and
> ethernet link are broken while both nodes are functional. In this case you
> can expect both nodes to become active :( We are working on this, and in any
> case you do have two links, so the likelihood of this occurring in
> production is low.
Well said, Horms.
Thanks!
-- Alan Robertson
alanr at suse.com