[Linux-HA] two node firewall using heartbeat v2 problems

Andrew Beekhof beekhof at gmail.com
Mon Oct 1 06:37:54 MDT 2007


On 10/1/07, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> Hi,
>
> On Fri, Sep 28, 2007 at 04:20:23PM -0500, Matt Zagrabelny wrote:
> > Hello,
> >
> > I am having problems of biblical proportions. This may be long-winded,
> > apologies up front. I have been reading about Heartbeat (version 2) all
> > week and have been making progress in my specific implementation,
> > however I have hit a major wall and can no longer make heads or tails of
> > what is broken or how to fix it.
> >
> > I am trying to set up a firewall cluster in the following manner:
> >
> >
> >                  Internet
> >
> > +---eth0---+                 +---eth0---+
> > |          |                 |          |
> > |       eth2-----------------eth2       |
> > |  (cody)  |     Heartbeat   |   (tim)  |
> > | /dev/ttyS0-----------------/dev/ttyS0 |
> > |          |                 |          |
> > +---eth1---+                 +---eth1---+
> >
> >                  Intranet
> >
> >
> > 'cody' is the primary firewall box and 'tim' is the backup.
> > Things seem to go "okay" if only one of the nodes is up; however, when the
> > second node starts up, things go wacky.
> >
> > Heartbeat details:
> >
> > Debian Etch
> > % dpkg -l heartbeat-2
> > ii  heartbeat-2                              2.0.7-2
> >
> >
> > Here are some config files:
> >
> > % ls -l /etc/ha.d/ha.cf
> > -rw-r--r--  1 root root  324 2007-09-28 09:39 /etc/ha.d/ha.cf
> >
> > % cat /etc/ha.d/ha.cf
> > use_logd on
> >
> > keepalive 1
> > deadtime 5
> > initdead 120
> >
> > udpport 694
> > baud 19200
> > serial /dev/ttyS0
> > bcast  eth2
> >
> > autojoin any
> > crm on
> >
> > % ls -l /etc/ha.d/authkeys
> > -rw------- 1 root root 49 2007-09-28 11:16 /etc/ha.d/authkeys
> >
> > % cat /etc/ha.d/authkeys
> > auth 2
> > 1 sha1 $uper$secret
> > 2 crc
> >
> > % ls -l /var/lib/heartbeat/crm/cib.xml
> > -rw------- 1 hacluster haclient 3564 2007-09-28 11:30 /var/lib/heartbeat/crm/cib.xml
> >
> > (there will be no <status/> element in the following file, I believe
> > that this is due to me manually 'kill -9'ing the processes after they
> > would not stop nicely)
>
> No, the status section is never saved to a file. It only exists
> in running nodes.
>
> > Here are some snippets from the log files. I am not sure which pieces are
> > valuable and which are not. The files themselves are long (600 and
> > 900 lines for the primary and backup servers). The (almost
> > complete) log files are located at:
> >
> > http://www.d.umn.edu/~mzagrabe/ha-log.cody.txt
> > http://www.d.umn.edu/~mzagrabe/ha-log.tim.txt
>
> From cody:
>
> heartbeat[18326]: 2007/09/28_11:29:27 WARN: string2msg_ll: node [tim] failed authentication
>
> This one's interesting. It shouldn't be happening.
>
> heartbeat[18326]: 2007/09/28_11:29:27 WARN: 6 lost packet(s) for [tim] [253:260]
> heartbeat[18326]: 2007/09/28_11:29:27 WARN: Late heartbeat: Node tim: interval 3000 ms
>
> Flaky network?
>
> heartbeat[18330]: 2007/09/28_11:29:29 WARN: glib: TTY write timeout on [/dev/ttyS0] (no connection or bad cable? [see documentation])
>
> Problems with serial?

One of the nice things about v2 is that it keeps the resource config
in sync between nodes. However, this also includes the status section,
which means the data being transferred could quite conceivably
max out a serial connection.

A second NIC and a crossover cable is usually a good alternative.
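
To put a rough number on the serial concern, here is a back-of-the-envelope
sketch using only the figures quoted in this thread (baud 19200, keepalive 1,
deadtime 5, and the 3564-byte cib.xml). It assumes 8N1 framing (10 bits on the
wire per byte) and ignores whatever framing, chunking or retransmission
Heartbeat adds, so treat it as a lower bound rather than a measurement. The
function names are made up for illustration.

```python
# Values taken from the ha.cf and `ls -l` output quoted in this thread.
BAUD = 19200          # "baud 19200"
CIB_BYTES = 3564      # size of cib.xml on disk (no status section included)
KEEPALIVE_S = 1       # "keepalive 1"
DEADTIME_S = 5        # "deadtime 5"

# Assumption: 8N1 framing, i.e. 10 bits on the wire per payload byte.
BITS_PER_BYTE_ON_WIRE = 10


def serial_bytes_per_second(baud, bits_per_byte=BITS_PER_BYTE_ON_WIRE):
    """Best-case payload rate of a serial line at the given baud rate."""
    return baud / bits_per_byte


def transfer_seconds(payload_bytes, baud):
    """Best-case time to push payload_bytes over the serial line."""
    return payload_bytes / serial_bytes_per_second(baud)


if __name__ == "__main__":
    rate = serial_bytes_per_second(BAUD)
    seconds = transfer_seconds(CIB_BYTES, BAUD)
    print(f"best-case rate: {rate:.0f} bytes/s")
    print(f"one copy of the on-disk CIB: {seconds:.2f} s")
    print(f"keepalive: {KEEPALIVE_S} s, deadtime: {DEADTIME_S} s")
```

Under these assumptions a single copy of the on-disk CIB already takes longer
than one keepalive interval, and the live CIB with its status section is
larger still.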

> heartbeat[18326]: 2007/09/28_11:29:56 CRIT: Cluster node tim returning after partition.
>
> The node is leaving and coming back. Looks like the
> network/serial connection doesn't deliver what we expect. Perhaps
> you could try some other combinations:
>
> - without serial/higher baud
> - ucast or mcast instead of bcast
>
> Also, check the interface statistics/cables/firewall rules.
>
> > % grep -i ERROR ha-log.tim
> > crmd[14298]: 2007/09/28_11:29:37 info: process_lrm_event:lrm.c LRM operation (2) monitor_0 on external_VIP Error: (7) not running
> > crmd[14298]: 2007/09/28_11:29:37 info: process_lrm_event:lrm.c LRM operation (3) monitor_0 on internal_VIP Error: (7) not running
> > tengine[14305]: 2007/09/28_11:29:48 info: match_graph_event:events.c Re-mapping op status to LRM_OP_ERROR for external_VIP_monitor_0
> > tengine[14305]: 2007/09/28_11:29:48 ERROR: match_graph_event:events.c Action external_VIP_monitor_0 on cody failed (target: 7 vs. rc: -1): Error
>
> There is a problem with the RA: it returns -1. It should return 7 for
> not running and 0 for OK.

I believe -1 is a timeout.
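
If it helps when going through the full logs, a small sketch like the one
below could pull the expected and actual codes out of the tengine "failed"
lines. The helper and its names are hypothetical, not part of Heartbeat, and
the meanings in the table are only what has been said in this thread: 0 and 7
come from Dejan's note, and -1 as a timeout is my belief, not something I
have confirmed.

```python
import re

# Hypothetical helper, not part of Heartbeat: matches tengine lines of the
# form "Action <op> on <node> failed (target: N vs. rc: M)".
FAILED_ACTION = re.compile(
    r"Action (?P<action>\S+) on (?P<node>\S+) failed "
    r"\(target: (?P<target>-?\d+) vs\. rc: (?P<rc>-?\d+)\)"
)

# Meanings as stated in this thread only; -1 is a belief, not confirmed.
RC_MEANING = {
    0: "ok",
    7: "not running",
    -1: "timeout (believed, per this thread)",
}


def parse_failed_action(line):
    """Return action, node, target and rc from a failed-action line, or None."""
    match = FAILED_ACTION.search(line)
    if match is None:
        return None
    return {
        "action": match.group("action"),
        "node": match.group("node"),
        "target": int(match.group("target")),
        "rc": int(match.group("rc")),
    }


if __name__ == "__main__":
    # The ERROR line from ha-log.tim quoted above, joined onto one line.
    sample = (
        "tengine[14305]: 2007/09/28_11:29:48 ERROR: match_graph_event:events.c "
        "Action external_VIP_monitor_0 on cody failed (target: 7 vs. rc: -1): Error"
    )
    info = parse_failed_action(sample)
    print(info)
    print("expected:", RC_MEANING.get(info["target"], "unknown"))
    print("got:", RC_MEANING.get(info["rc"], "unknown"))
```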

> Finally, you should try a newer version: 2.1.2.

Definitely.