A few issues with heartbeat

Adrian Head adrian.head at bytecomm.com.au
Fri Apr 7 07:22:31 MDT 2000


I'm a newbie to this Linux thing (4 months) and HA (7 days), so if I'm way
out please excuse me.

At work I'm looking to replace our current M$ Windoze PDC/BDC with a
HA-Linux SAMBA solution and am therefore trialling different methods of
achieving this.

Currently I have SAMBA as a HA service running on RH6.1 boxes working
quite nicely using the smb script from the French Linux Expo sourced
from this mail list and a few rsync scripts I wrote - and it works very
well.  (Since I don't have CVS access at the moment could I ask a kind
soul to email Rudy Pawul's rsync guide to me please so I can determine
if I'm crossing ground that has already been explored - Thanks in
advance)

Linux Distribution:	RH6.1 on both machines.
Kernel version:		2.2.12-20
heartbeat version:	0.4.6c RPM link from the website; however, "less
heartbeat" gives ..... heartbeat.c ... v1.37 25/12/1999 ....	and
"heartbeat -v" doesn't display the version.

However, I have come across a few things that I need help on.

The setup is as follows:  (a picture is worth a 1000 words)

            /--------------x----------------\
            |                               |
            |      /-------x--------\       |
            |eth1  |ttyS0           |ttyS0  |eth1              
         /------------\          /------------\
         |            |          |            |
         |   apollo   |          |  artemis   |
         |            |          |            |
         |            |          |            |
         \------------/          \------------/
               |eth0                    |eth0
               |                        |
               |                        |
               V      To Network        V

The original test procedure was to kill the power to one of the
machines; however, I scrambled my ext2fs :(.  So I changed it to just
stopping the heartbeat service, as Rudy Pawul suggested in his Getting
Started Guide.
However, after one of the tests this was found in the ha-log.

Apollo ha-log

heartbeat: 2000/04/05_14:23:59 info: node artemis.local: status up
heartbeat: 2000/04/05_14:23:59 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 2000/04/05_14:24:00 INFO: Running /etc/ha.d/rc.d/ip-request ip-request
heartbeat: 2000/04/05_14:24:01 error: ha_msg_add_nv: line doesn't contain '='
heartbeat: 2000/04/05_14:24:01 error: 1 0.03 0.01 2/34 5132
heartbeat: 2000/04/05_14:24:01 INFO: Running /etc/ha.d/resource.d/IPaddr 192.168.0.200 status

I assume that a transmission error occurred.  How can I tell whether it
occurred on the Ethernet crossover or on the null serial cable?  In the
last 7 days this error has only occurred once so I think I can assume
that it is a non-serious error.  Does this error harm heartbeat's
operation at all?
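
The rejected text looks like the tail of a /proc/loadavg line (load
averages, running/total processes, last PID) with its first field cut
short, so one guess is that a message was split or truncated somewhere
along the way.  As a rough illustration of the kind of name=value check
that would reject such a line, here is a minimal sketch.  It is not
taken from the heartbeat source; the function name and the "status=up"
sample are made up for the example.

```python
def split_name_value(line):
    """Split a 'name=value' line; return None when there is no '='.

    Illustrative only: this is NOT heartbeat's ha_msg_add_nv, just the
    same kind of check written out in a few lines.
    """
    name, sep, value = line.rstrip("\n").partition("=")
    if not sep:
        # No '=' anywhere in the line, so it cannot be a name=value pair.
        return None
    return name, value


if __name__ == "__main__":
    # The line from the ha-log above, plus a made-up well-formed one.
    for sample in ("1 0.03 0.01 2/34 5132", "status=up"):
        print(repr(sample), "->", split_name_value(sample))
```

Under that reading, the second error line in the log is simply the
fragment that failed the check being echoed back.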

After a test which left artemis turned off for a 24-hour period I found
these messages in the ha-log file on apollo.  They appeared after about
10.5 hours of operation (heartbeat started 17:04 - problem occurred at
03:39).  I have been unable to recreate this problem - but I'm still
trying.  What does it mean?  Is this a bug?  Do you need more
information?  What information do you require?

heartbeat: 2000/04/06_03:39:11 info: MSG stats: 100/43211 age 2 [pid4860/CONTROL]
heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 2134/950692 85952/48870 [pid4860/CONTROL]
heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 87216 total malloc bytes. pid 4860/CONTROL]
heartbeat: 2000/04/06_03:39:11 info: MSG stats: 0/134890 age 0 [pid4863/MST_STATUS]
heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 0/2357458  0/0 [pid4863/MST_STATUS]
heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 1616 total malloc bytes. pid 4863/MST_STATUS]
heartbeat: 2000/04/06_03:39:11 info: MSG stats: 0/43242 age 2 [pid4864/HBWRITE]
heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 0/951340  0/0 [pid4864/HBWRITE]
heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 1264 total malloc bytes. pid 4864/HBWRITE]
heartbeat: 2000/04/06_03:39:11 info: MSG stats: 1/281 age 47710 [pid4865/HBREAD]
heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 5/6150  544/348 [pid4865/HBREAD]
heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 1216 total malloc bytes. pid 4865/HBREAD]
heartbeat: 2000/04/06_03:39:11 info: MSG stats: 0/43243 age 0 [pid4866/HBWRITE]
heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 0/951362  0/0 [pid4866/HBWRITE]
heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 1264 total malloc bytes. pid 4866/HBWRITE]
heartbeat: 2000/04/06_03:39:11 info: MSG stats: 0/48194 age 0 [pid4867/HBREAD]
heartbeat: 2000/04/06_03:39:11 info: ha_malloc stats: 0/1060284  0/0 [pid4867/HBREAD]
heartbeat: 2000/04/06_03:39:11 info: RealMalloc stats: 1264 total malloc bytes. pid 4867/HBREAD]

Then on artemis sometimes I get the following message.  I get it mostly
when heartbeat has been stopped for a while (hours).  I assume that
something has stolen the socket - how can I determine what?  Only a
reboot fixes the problem, restarting networking won't work.  Does anyone
have any suggestions on what else to try to free the socket up?
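
A rough way to confirm that the port really is still held, before
starting heartbeat again, is simply to try binding it.  The sketch below
is only an illustration and not part of heartbeat: the function name is
made up, and the port passed in has to be whatever UDP port heartbeat is
configured to use on the machine.  One guess worth ruling out first is a
heartbeat child process left over from the previous run, which ps would
show.

```python
import errno
import socket


def udp_port_in_use(port, host="0.0.0.0"):
    """Return True if binding a UDP socket to (host, port) fails with
    'Address already in use', False if the bind succeeds.

    Hypothetical helper for illustration; pass the UDP port that
    heartbeat is configured to use.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise  # some other problem, e.g. permission denied on a low port
    else:
        return False
    finally:
        sock.close()
```

This only says whether the port is taken, not who has it; on Linux,
netstat or /proc/net/udp lists the bound UDP ports.  Binding a port
below 1024 needs root.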

Artemis ha-log

heartbeat: 2000/04/05_09:13:39 info: ***********************
heartbeat: 2000/04/05_09:13:39 info: Configuration validated. Starting heartbeat.
heartbeat: 2000/04/05_09:13:40 notice: Starting serial heartbeat on tty /dev/ttyS0
heartbeat: 2000/04/05_09:13:40 error: Error binding socket: Address already in use
heartbeat: 2000/04/05_09:13:40 error: cannot open udp eth1


The other problem is that the cluster sometimes partitions when one of
the nodes is restarted.  I was looking through the mail list and found
someone discussing a problem with cluster partitioning during startup
caused by the time the scripts take to execute, but I don't think that
this applies here as one node is already up.  In this situation, using
"cat </dev/ttyS0", it seems that heartbeat has stopped sending the
heartbeat, as nothing seems to be coming through.  I'm not sure if my
diagnosis is correct as I've not put a serial analyser on the serial
line to double check.  The logs give no clues at all - neither ha-log
nor ha-debug gives anything away.  It seems as if each heartbeat just
doesn't see the other.
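
Since "cat </dev/ttyS0" just blocks when nothing arrives, a read with a
timeout gives a more definite "nothing came through".  A minimal sketch
follows; the function names and the 10 second default are made up for
the example, the timeout should be comfortably longer than the
configured heartbeat interval, and the serial port is assumed to be set
up already (speed etc.).  If heartbeat is running on the same node it
reads the port too, so it may consume the bytes first.

```python
import os
import select


def data_arrives(fd, timeout_s):
    """Return True if fd becomes readable within timeout_s seconds."""
    readable, _, _ = select.select([fd], [], [], timeout_s)
    return bool(readable)


def watch_device(path="/dev/ttyS0", timeout_s=10.0):
    """Return the first bytes seen on path, or b'' if none arrive in time.

    Hypothetical helper for illustration only.
    """
    # O_NONBLOCK keeps open() from hanging on a serial line without carrier.
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK | os.O_NOCTTY)
    try:
        if data_arrives(fd, timeout_s):
            return os.read(fd, 256)
        return b""
    finally:
        os.close(fd)
```

Run on the receiving node, an empty result would support the idea that
the other side has stopped sending over the serial link.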


Please, if I have not given enough information tell me what is needed
and I will email it to the list.

Thanks in Advance.

HA-Linux stuff seems really cool - keep up the good work.

Adrian Head
