A few issues with heartbeat

Horms horms at vergenet.net
Fri Apr 7 20:49:54 MDT 2000


On Fri, Apr 07, 2000 at 11:22:31PM +1000, Adrian Head wrote:
> 
> I'm a newbie to this Linux thing (4Months) and HA (7days) so if I'm way
> out please excuse me.
> 
> At work I'm looking to replace our current M$ Windoze PDC/BDC with a
> HA-Linux SAMBA solution and am therefore trialling different methods of
> achieving this.
> 
> Currently I have SAMBA as a HA service running on RH6.1 boxes working
> quite nicely using the smb script from the French Linux Expo sourced
> from this mail list and a few rsync scripts I wrote - and it works very
> well.  (Since I don't have CVS access at the moment could I ask a kind
> soul to email Rudy Pawul's rsync guide to me please so I can determine
> if I'm crossing ground that has already been explored - Thanks in
> advance)
> 
> Linux Distribution:	RH6.1 on both machines.
> Kernel version:		2.2.12-20
> heartbeat version:	0.4.6c RPM link from the website; however, "less
> heartbeat" gives ..... heartbeat.c ... v1.37 25/12/1999 ....	and
> "heartbeat -v" doesn't display the version.

First up, I would recommend moving to heartbeat 0.4.7, which I believe
will address some of your problems; at any rate, 0.4.7 is
more polished than 0.4.6c. I believe that 0.4.7 also contains
Rudy Pawul's rsync document. I would also recommend moving to a 2.2.14
kernel.

[snip]

> The original test procedure was to kill the power to one of the
> machines; however, I scrambled my ext2fs :(.  So I modified it to just
> stopping the heartbeat service as Rudy Pawul suggested in his Getting
> Started Guide.
> However, after one of the tests this was found in the ha-log.
> 
> Apollo ha-log
> 
> heartbeat: 2000/04/05_14:23:59 info: node artemis.local: status up
> heartbeat: 2000/04/05_14:23:59 INFO: Running /etc/ha.d/rc.d/status
> status
> heartbeat: 2000/04/05_14:24:00 INFO: Running /etc/ha.d/rc.d/ip-request
> ip-request
> heartbeat: 2000/04/05_14:24:01 error: ha_msg_add_nv: line doesn't
> contain '='
> heartbeat: 2000/04/05_14:24:01 error: 1 0.03 0.01 2/34 5132
> heartbeat: 2000/04/05_14:24:01 INFO: Running /etc/ha.d/resource.d/IPaddr
> 192.168.0.200 status
> 
> I assume that a transmission error occurred.  How can I tell whether it
> occurred on the Ethernet crossover or on the null serial cable?  In the
> last 7 days this error has only occurred once so I think I can assume
> that it is a non-serious error.  Does this error harm heartbeat's
> operation at all?

It shouldn't affect heartbeat, other than that the heartbeat message
in question would be ignored.
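
As a rough illustration of why such a line gets rejected: the error text
suggests that each line of a message is expected to have the form
name=value. The Python sketch below is not heartbeat's code (heartbeat is
written in C), and parse_nv_line and parse_message are hypothetical names
used only here. It assumes that a message containing a malformed line is
dropped as a whole.

```python
def parse_nv_line(line):
    """Split one 'name=value' line; return None if it has no '='."""
    name, sep, value = line.rstrip("\n").partition("=")
    if not sep or not name:
        return None
    return name, value


def parse_message(lines):
    """Return the parsed fields, or None if any line is malformed.

    Assumption for this sketch: one bad line causes the whole
    message to be ignored.
    """
    fields = {}
    for line in lines:
        pair = parse_nv_line(line)
        if pair is None:
            return None
        fields[pair[0]] = pair[1]
    return fields
```

With this sketch, the line "1 0.03 0.01 2/34 5132" from the log above has
no '=', so parse_nv_line returns None and the message is discarded.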

> After a test which left artemis turned off for a 24hour period I found
> these messages in the ha-log file on apollo. It appeared to occur after
> 10.5 hours of operation (heartbeat started 17:04 - problem occurred at
> 03:39) I have been unable to recreate this problem again - but I'm still
> trying.  What does it mean?  Is this a bug? Do you need more
> information? What information do you require?
> 
> heartbeat: 2000/04/06_03:39:11 info: MSG stats: 100/43211 age 2
> [pid4860/CONTROL]
> heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 2134/950692
> 85952/48870 [pid4860/CONTROL]
> heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 87216 total
> malloc bytes. pid 4860/CONTROL]
> heartbeat: 2000/04/06_03:39:11 info: MSG stats: 0/134890 age 0
> [pid4863/MST_STATUS]
> heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 0/2357458  0/0
> [pid4863/MST_STATUS]
> heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 1616 total malloc
> bytes. pid 4863/MST_STATUS]

[snip]

Heartbeat is dumping statistics for each of its processes.
This is not an error, just informational.

> Then on artemis sometimes I get the following message.  I get it mostly
> when heartbeat has been stopped for a while (hours).  I assume that
> something has stolen the socket - how can I determine what?  Only a
> reboot fixes the problem, restarting networking won't work.  Does anyone
> have any suggestions on what else to try to free the socket up?
> 
> Artemis ha-log
> 
> heartbeat: 2000/04/05_09:13:39 info: ***********************
> heartbeat: 2000/04/05_09:13:39 info: Configuration validated. Starting
> heartbeat.
> heartbeat: 2000/04/05_09:13:40 notice: Starting serial heartbeat on tty
> /dev/ttyS0
> heartbeat: 2000/04/05_09:13:40 error: Error binding socket: Address
> already in use
> heartbeat: 2000/04/05_09:13:40 error: cannot open udp eth1

Briefly, what is happening is that when heartbeat exits it is
not closing the socket cleanly. This behaviour appears to
be caused by the SO_BINDTODEVICE socket option, which heartbeat
uses so that it can have separate sockets, and hence separate
processes, listening on different interfaces.
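
On the question of how to see what is holding the port: on Linux systems
that expose /proc/net/udp, the table of UDP sockets can be inspected
directly. The Python sketch below is only an illustration, not part of
heartbeat. udp_sockets_on_port is a hypothetical helper, and the port to
look for is whatever UDP port your heartbeat configuration uses.

```python
def udp_sockets_on_port(proc_text, port):
    """Return (local_ip_hex, inode) for each UDP socket bound to port.

    proc_text is the full text of /proc/net/udp. Column 1 holds the
    local address as HEXIP:HEXPORT and column 9 holds the socket inode.
    """
    matches = []
    for line in proc_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10:
            continue
        ip_hex, port_hex = fields[1].split(":")
        if int(port_hex, 16) == port:
            matches.append((ip_hex, fields[9]))
    return matches
```

Read /proc/net/udp, pass its contents and the port number to the function,
and compare each returned inode with the "socket:[inode]" links under
/proc/<pid>/fd. If no running process lists that inode, that would be
consistent with the socket not having been released when heartbeat exited.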

Here are some options to get around this problem.

1. I have been unable to reproduce this problem using 2.2.14 kernels.  It
seems that you have found an environment where this problem occurs, which
is good because I have been trying to find one so the problem can be
resolved. Upgrading your kernel should help, though. Please let me know if
the problem persists under 2.2.14 as I would like to know what is causing
the problem.

2. I have a patch that removes the code that sets SO_BINDTODEVICE.  This
effectively means that heartbeat can only listen on one interface. This is
fine as you only have one ethernet interface.

3. It would be possible to change heartbeat so binding is controlled
by addresses rather than devices.
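
To make the difference between device binding and the address binding in
option 3 concrete, here is a minimal Python sketch. It is not the patch
mentioned in option 2 and not heartbeat's code; open_udp_by_address and
open_udp_by_device are hypothetical names, and the sketch assumes a Linux
system where Python exposes socket.SO_BINDTODEVICE.

```python
import socket


def open_udp_by_address(local_ip, port):
    """Open a UDP socket bound to one local IP address (option 3)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((local_ip, port))
    return sock


def open_udp_by_device(device, port):
    """Open a UDP socket tied to one interface via SO_BINDTODEVICE.

    Linux-only, and it usually needs elevated privileges.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                    device.encode() + b"\0")
    sock.bind(("", port))
    return sock
```

With address binding, the interface is chosen implicitly by the address
assigned to it, so SO_BINDTODEVICE is not needed at all.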

> The other problem is that the cluster sometimes partitions when one of
> the nodes is restarted.  I was looking through the mail list and found
> someone discussing a problem about cluster partitioning during startup
> caused by the time the scripts take to execute, but I don't think that
> this applies here as one node is already up.  In this situation using
> "cat </dev/ttyS0" it seems that heartbeat has stopped sending the
> heartbeat as nothing seems to be coming through.  I'm not sure if my
> diagnosis is correct as I've not put a serial analyser on the serial line
> to double check.  The logs give no clues at all - ha-log or ha-debug give
> nothing away.  It seems as if either heartbeat just doesn't see the
> other.

I noticed this too :) You should try heartbeat 0.4.7; I have been unable to
reproduce the problem with this version.

As an aside: one test that you haven't reported a problem with, and which we
are still working on a solution to, is what happens if the nodes lose
communication with each other. In your situation this will occur if both the
serial link and the ethernet link are broken while both nodes are functional.
In this case you can expect both nodes to become active :( We are working on
this, and in any case you do have two links, so the likelihood of this
occurring in production is low.

-- 
Horms


