[Linux-HA] two node firewall using heartbeat v2 problems

Dejan Muhamedagic dejanmm at fastmail.fm
Mon Oct 1 03:44:32 MDT 2007


Hi,

On Fri, Sep 28, 2007 at 04:20:23PM -0500, Matt Zagrabelny wrote:
> Hello,
> 
> I am having problems of biblical proportions. This may be long winded,
> apologies up front. I have been reading about Heartbeat (version 2) all
> week and have been making progress in my specific implementation,
> however I have hit a major wall and can no longer make heads or tails of
> what is broken or how to fix it.
> 
> I am trying to set up a firewall cluster in the following manner:
> 
> 
>                  Internet
> 
> +---eth0---+                 +---eth0---+
> |          |                 |          |
> |       eth2-----------------eth2       |
> |  (cody)  |     Heartbeat   |   (tim)  |
> | /dev/ttyS0-----------------/dev/ttyS0 |
> |          |                 |          |
> +---eth1---+                 +---eth1---+
> 
>                  Intranet
> 
> 
> 'cody' is the primary firewall box and 'tim' is the backup.
> Things seem to go "okay" if only one of the nodes is up, however when the
> second node starts up, things go wacky.
> 
> Heartbeat details:
> 
> Debian Etch
> % dpkg -l heartbeat-2
> ii  heartbeat-2                              2.0.7-2 
> 
> 
> Here are some config files:
> 
> % ls -l /etc/ha.d/ha.cf
> -rw-r--r--  1 root root  324 2007-09-28 09:39 /etc/ha.d/ha.cf
> 
> % cat /etc/ha.d/ha.cf
> use_logd on
> 
> keepalive 1
> deadtime 5
> initdead 120
> 
> udpport 694
> baud 19200
> serial /dev/ttyS0
> bcast  eth2
> 
> autojoin any
> crm on
> 
> % ls -l /etc/ha.d/authkeys
> -rw------- 1 root root 49 2007-09-28 11:16 /etc/ha.d/authkeys
> 
> % cat /etc/ha.d/authkeys
> auth 2
> 1 sha1 $uper$secret
> 2 crc
> 
> % ls -l /var/lib/heartbeat/crm/cib.xml
> -rw------- 1 hacluster haclient 3564 2007-09-28 11:30 /var/lib/heartbeat/crm/cib.xml
> 
> (there will be no <status/> element in the following file, I believe
> that this is due to me manually 'kill -9'ing the processes after they
> would not stop nicely)

No, the status section is never saved to a file. It only exists
in running nodes.

> Here are some snippets from the log files, I am not sure what are the
> valuable pieces and what are not. The files themselves are long (600 and
> 900 lines for the primary and backup servers). The (almost complete)
> log files are at:
> 
> http://www.d.umn.edu/~mzagrabe/ha-log.cody.txt
> http://www.d.umn.edu/~mzagrabe/ha-log.tim.txt

From cody:

heartbeat[18326]: 2007/09/28_11:29:27 WARN: string2msg_ll: node [tim] failed authentication

This one's interesting. It shouldn't be happening.

heartbeat[18326]: 2007/09/28_11:29:27 WARN: 6 lost packet(s) for [tim] [253:260]
heartbeat[18326]: 2007/09/28_11:29:27 WARN: Late heartbeat: Node tim: interval 3000 ms

Flaky network?

heartbeat[18330]: 2007/09/28_11:29:29 WARN: glib: TTY write timeout on [/dev/ttyS0] (no connection or bad cable? [see documentation])

Problems with serial?

heartbeat[18326]: 2007/09/28_11:29:56 CRIT: Cluster node tim returning after partition.

The node is leaving and coming back. Looks like the
network/serial connection doesn't deliver what we expect. Perhaps
you could try some other combinations:

- without the serial link, or with the serial link at a higher baud rate
- ucast or mcast instead of bcast
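
As a rough sketch of the second combination, the media section of ha.cf could look something like the fragment below. The peer address 192.168.100.2 and the multicast group 225.0.0.1 are placeholders, not values from your setup, and the old lines are only commented out so they are easy to restore:

```
udpport 694
# baud 19200
# serial /dev/ttyS0
# bcast  eth2

# Placeholder peer address: use the other node's real eth2 address.
ucast eth2 192.168.100.2

# Alternative to ucast (placeholder group): device, group, port, ttl, loop.
# mcast eth2 225.0.0.1 694 1 0
```

Changing one thing at a time should make it easier to see which medium is causing the lost packets.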

Also, check the interface statistics/cables/firewall rules.
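
One possible way to look at the interface statistics is to read the counters the kernel exposes under sysfs. This is only a sketch: the function name is made up, and it assumes a 2.6 kernel that provides /sys/class/net/<interface>/statistics.

```shell
# Print the error and drop counters the kernel keeps for one interface.
# $1 = interface name, $2 = sysfs base directory (defaults to /sys/class/net).
# iface_error_counters is a hypothetical helper name, not a heartbeat tool.
iface_error_counters() {
    iface="$1"
    base="${2:-/sys/class/net}"
    stats="$base/$iface/statistics"

    # Bail out if the interface (or its statistics directory) is missing.
    [ -d "$stats" ] || { echo "no statistics for $iface" >&2; return 1; }

    for counter in rx_errors tx_errors rx_dropped tx_dropped; do
        printf '%s %s\n' "$counter" "$(cat "$stats/$counter")"
    done
}
```

Running `iface_error_counters eth2` on both nodes before and after a burst of "lost packet(s)" warnings would show whether the counters move along with them.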

> % grep -i ERROR ha-log.tim
> crmd[14298]: 2007/09/28_11:29:37 info: process_lrm_event:lrm.c LRM
> operation (2) monitor_0 on external_VIP Error: (7) not running
> crmd[14298]: 2007/09/28_11:29:37 info: process_lrm_event:lrm.c LRM
> operation (3) monitor_0 on internal_VIP Error: (7) not running
> tengine[14305]: 2007/09/28_11:29:48 info: match_graph_event:events.c
> Re-mapping op status to LRM_OP_ERROR for external_VIP_monitor_0
> tengine[14305]: 2007/09/28_11:29:48 ERROR: match_graph_event:events.c
> Action external_VIP_monitor_0 on cody failed (target: 7 vs. rc: -1):
> Error

There is a problem with the RA (resource agent): its monitor
action returns -1. It should return 7 when the resource is not
running and 0 when it is ok.
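
To illustrate the expected return codes, here is a minimal sketch of a monitor check for a virtual IP. It is not the code of the RA you are using: vip_monitor is a made-up name, and checking the output of `ip -o -4 addr show` is just one assumed way to test for the address.

```shell
# Return codes from the OCF resource agent convention.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# vip_monitor ADDRESS [LISTING]
# Reports whether ADDRESS is configured on this node.
# LISTING is optional text in `ip -o -4 addr show` format; when it is
# omitted the function runs that command itself.
vip_monitor() {
    vip="$1"
    if [ "$#" -ge 2 ]; then
        listing="$2"
    else
        listing="$(ip -o -4 addr show 2>/dev/null)"
    fi

    # Match "inet ADDRESS/" literally so 10.0.0.1 does not match 10.0.0.10.
    if printf '%s\n' "$listing" | grep -qF "inet $vip/"; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}
```

The point is that a monitor on a node where the address is simply absent exits with 7, never with a negative or arbitrary value.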

Finally, you should try a newer version: 2.1.2.

Thanks,

Dejan

> heartbeat[14284]: 2007/09/28_11:29:59 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[14284]: 2007/09/28_11:29:59 ERROR: Shutting down.
> heartbeat[14284]: 2007/09/28_11:29:59 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> tengine[14305]: 2007/09/28_11:29:59 info: match_graph_event:events.c
> Re-mapping op status to LRM_OP_ERROR for internal_VIP_monitor_0
> heartbeat[14284]: 2007/09/28_11:29:59 ERROR: Shutting down.
> tengine[14305]: 2007/09/28_11:29:59 ERROR: match_graph_event:events.c
> Action internal_VIP_monitor_0 on cody failed (target: 7 vs. rc: -1):
> Error
> heartbeat[14284]: 2007/09/28_11:29:59 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[14284]: 2007/09/28_11:29:59 ERROR: Shutting down.
> 
> [ many many of the previous two log messages repeated here ]
> 
> heartbeat[14284]: 2007/09/28_11:29:59 ERROR: Cannot write to media pipe
> 0: Resource temporarily unavailable
> heartbeat[14284]: 2007/09/28_11:29:59 ERROR: Shutting down.
> cib[14294]: 2007/09/28_11:29:59 ERROR: cib_ha_connection_destroy:main.c
> Heartbeat connection lost!  Exiting.
> stonithd[14296]: 2007/09/28_11:29:59 ERROR: Disconnected with heartbeat
> daemon
> crmd[14298]: 2007/09/28_11:29:59 ERROR: cib_native_msgready:cib_native.c
> Message pending on command channel [14294]
> ccm[14293]: 2007/09/28_11:29:59 ERROR: Lost connection to heartbeat
> service. Need to bail out.
> tengine[14305]: 2007/09/28_11:29:59 ERROR:
> cib_native_msgready:cib_native.c Message pending on command channel
> [14294]
> crmd[14298]: 2007/09/28_11:29:59 ERROR: #========= cib:cmd message start
> ==========#
> tengine[14305]: 2007/09/28_11:29:59 ERROR: #========= cib:cmd message
> start ==========#
> crmd[14298]: 2007/09/28_11:29:59 ERROR: MSG: No message to dump
> attrd[14297]: 2007/09/28_11:29:59 ERROR:
> cib_native_msgready:cib_native.c Message pending on command channel
> [14294]
> tengine[14305]: 2007/09/28_11:29:59 ERROR: MSG: No message to dump
> attrd[14297]: 2007/09/28_11:29:59 ERROR: #========= cib:cmd message
> start ==========#
> attrd[14297]: 2007/09/28_11:29:59 ERROR: MSG: No message to dump
> crmd[14298]: 2007/09/28_11:29:59 ERROR:
> crmd_cib_connection_destroy:callbacks.c Connection to the CIB
> terminated...
> tengine[14305]: 2007/09/28_11:29:59 ERROR: stonithd_op_result_ready:
> failed due to not on signon status.
> tengine[14305]: 2007/09/28_11:29:59 ERROR:
> tengine_stonith_connection_destroy:callbacks.c Fencing daemon has left
> us
> crmd[14298]: 2007/09/28_11:29:59 ERROR: do_log:misc.c [[FSA]] Input
> I_ERROR from crmd_cib_connection_destroy() received in state
> (S_INTEGRATION)
> crmd[14298]: 2007/09/28_11:29:59 info: do_state_transition:fsa.c tim:
> State transition S_INTEGRATION -> S_RECOVERY [ input=I_ERROR
> cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
> crmd[14298]: 2007/09/28_11:29:59 ERROR: do_recover:control.c Action
> A_RECOVER (0000000001000000) not supported
> crmd[14298]: 2007/09/28_11:29:59 ERROR: do_log:misc.c [[FSA]] Input
> I_STOP from do_recover() received in state (S_RECOVERY)
> crmd[14298]: 2007/09/28_11:29:59 ERROR: ccm_dispatch:callbacks.c CCM
> connection appears to have failed: rc=-1.
> crmd[14298]: 2007/09/28_11:29:59 ERROR: do_log:misc.c [[FSA]] Input
> I_ERROR from ccm_dispatch() received in state (S_STOPPING)
> crmd[14298]: 2007/09/28_11:29:59 info: do_state_transition:fsa.c tim:
> State transition S_STOPPING -> S_TERMINATE [ input=I_ERROR
> cause=C_CCM_CALLBACK origin=ccm_dispatch ]
> crmd[14298]: 2007/09/28_11:29:59 ERROR: do_exit:control.c Performing
> A_EXIT_1 - forcefully exiting the CRMd
> crmd[14298]: 2007/09/28_11:29:59 ERROR: do_exit:control.c Could not
> recover from internal error
> 
> Again, these logs are so verbose and don't seem to point to a specific
> error, I don't really know where to look from here.