[Linux-HA] weird messages from ipfail

Sebastian Vieira sebvieira at gmail.com
Fri Jun 19 06:15:16 MDT 2009


Hi,

We have a pair of heartbeat nodes, Balancer1 and Balancer2, whose NICs are
set up as follows:

eth0 and eth2 are bonded as bond0 for the user LAN
eth1 and eth3 are bonded as bond1 for the heartbeats

eth1 on ha01 and eth1 on ha02 are connected directly with a crossover
cable. The link between the two eth3 interfaces runs over a switch.

Ipfail is configured to ping the default gateway, which is reachable over
bond0.

In a test we unplugged the cable from eth1 on Balancer2, waited about 10s
and plugged it back in. Then we did the same with eth3. After that, ipfail
started to log some weird messages. We then repeated the test with eth0 and
eth2, and that went fine:

Jun 18 13:54:25 Balancer2 ipfail: [10570]: info: Ping node count is
balanced.
Jun 18 13:55:36 Balancer2 kernel: bnx2: eth1 NIC Copper Link is Down
Jun 18 13:55:36 Balancer2 kernel: bonding: bond1: link status definitely
down for interface eth1, disabling it
Jun 18 13:55:54 Balancer2 kernel: bnx2: eth1 NIC Copper Link is Up, 1000
Mbps full duplex, receive & transmit flow control ON
Jun 18 13:55:54 Balancer2 kernel: bonding: bond1: link status definitely up
for interface eth1.
Jun 18 13:56:06 Balancer2 kernel: bnx2: eth3 NIC Copper Link is Down
Jun 18 13:56:06 Balancer2 kernel: bonding: bond1: link status definitely
down for interface eth3, disabling it
Jun 18 13:56:06 Balancer2 kernel: bonding: bond1: making interface eth1 the
new active one.
Jun 18 13:56:16 Balancer2 ipfail: [10570]: info: Status update: Node
Balancer1.amg.local now has status dead
Jun 18 13:56:17 Balancer2 ipfail: [10570]: info: NS: We are still alive!
Jun 18 13:56:17 Balancer2 ipfail: [10570]: info: Link Status update: Link
Balancer1.amg.local/bond1 now has status dead
Jun 18 13:56:19 Balancer2 ipfail: [10570]: info: Asking other side for ping
node count.
Jun 18 13:56:19 Balancer2 ipfail: [10570]: info: Checking remote count of
ping nodes.
Jun 18 13:56:20 Balancer2 kernel: bnx2: eth3 NIC Copper Link is Up, 100 Mbps
full duplex
Jun 18 13:56:20 Balancer2 kernel: bonding: bond1: link status definitely up
for interface eth3.
Jun 18 13:56:26 Balancer2 kernel: bnx2: eth0 NIC Copper Link is Down
Jun 18 13:56:26 Balancer2 kernel: bonding: bond0: link status definitely
down for interface eth0, disabling it
Jun 18 13:56:26 Balancer2 kernel: bonding: bond0: making interface eth2 the
new active one.
Jun 18 13:56:39 Balancer2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000
Mbps full duplex
Jun 18 13:56:39 Balancer2 kernel: bonding: bond0: link status definitely up
for interface eth0.
Jun 18 13:57:03 Balancer2 kernel: bnx2: eth2 NIC Copper Link is Down
Jun 18 13:57:03 Balancer2 kernel: bonding: bond0: link status definitely
down for interface eth2, disabling it
Jun 18 13:57:03 Balancer2 kernel: bonding: bond0: making interface eth0 the
new active one.
Jun 18 13:57:21 Balancer2 kernel: bnx2: eth2 NIC Copper Link is Up, 1000
Mbps full duplex
Jun 18 13:57:21 Balancer2 kernel: bonding: bond0: link status definitely up
for interface eth2.

This is from Balancer1:

Jun 18 13:54:13 Balancer1 logd: [7710]: info: logd started with default
configuration.
Jun 18 13:54:13 Balancer1 logd: [7711]: info: G_main_add_SignalHandler:
Added signal handler for signal 15
Jun 18 13:54:13 Balancer1 logd: [7710]: info: G_main_add_SignalHandler:
Added signal handler for signal 15
Jun 18 13:54:21 Balancer1 ipfail: [7816]: info: Status update: Node
Balancer2.amg.local now has status active
Jun 18 13:54:22 Balancer1 ipfail: [7816]: info: Asking other side for ping
node count.
Jun 18 13:54:25 Balancer1 ipfail: [7816]: info: No giveup timer to abort.
Jun 18 13:55:36 Balancer1 kernel: bnx2: eth1 NIC Copper Link is Down
Jun 18 13:55:36 Balancer1 kernel: bonding: bond1: link status definitely
down for interface eth1, disabling it
Jun 18 13:55:54 Balancer1 kernel: bnx2: eth1 NIC Copper Link is Up, 1000
Mbps full duplex, receive & transmit flow control ON
Jun 18 13:55:54 Balancer1 kernel: bonding: bond1: link status definitely up
for interface eth1.
Jun 18 13:56:16 Balancer1 ipfail: [7816]: info: Status update: Node
Balancer2.amg.local now has status dead
Jun 18 13:56:16 Balancer1 ipfail: [7816]: info: NS: We are still alive!
Jun 18 13:56:17 Balancer1 ipfail: [7816]: info: Link Status update: Link
Balancer2.amg.local/bond1 now has status dead
Jun 18 13:56:18 Balancer1 ipfail: [7816]: info: Asking other side for ping
node count.
Jun 18 13:56:18 Balancer1 ipfail: [7816]: info: Checking remote count of
ping nodes.

Everything went okay and no failover occurred, but the ipfail messages are
strange: why suddenly declare the node dead more than 20s after both slaves
of bond1 were up again? It doesn't make sense (to me). And why did ipfail
immediately say 'NS: We are still alive!'?
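
To make the timing concrete, here is a rough sketch that computes the gaps
between the relevant events. The timestamps are copied from the log excerpts
above; the year 2009 is taken from the date of this mail, and the helper names
(parse, gap_seconds) are only for illustration. It says nothing about why
ipfail behaves this way, it only does the arithmetic.

```python
from datetime import datetime

# Syslog lines carry no year; 2009 is assumed from the date of this post.
YEAR = "2009"
FMT = "%Y %b %d %H:%M:%S"


def parse(stamp):
    """Turn a syslog timestamp such as 'Jun 18 13:55:54' into a datetime."""
    return datetime.strptime(f"{YEAR} {stamp}", FMT)


def gap_seconds(earlier, later):
    """Whole seconds elapsed between two syslog timestamps."""
    return int((parse(later) - parse(earlier)).total_seconds())


# Balancer1: last bond1 link event it logged vs. the 'dead' status update.
b1_eth1_up = "Jun 18 13:55:54"
b1_node_dead = "Jun 18 13:56:16"
b1_gap = gap_seconds(b1_eth1_up, b1_node_dead)

# Balancer2: eth3 unplugged, 'dead' status update, eth3 plugged back in.
b2_eth3_down = "Jun 18 13:56:06"
b2_node_dead = "Jun 18 13:56:16"
b2_eth3_up = "Jun 18 13:56:20"
b2_down_to_dead = gap_seconds(b2_eth3_down, b2_node_dead)
b2_dead_to_up = gap_seconds(b2_node_dead, b2_eth3_up)

print("Balancer1: eth1 up -> node dead:", b1_gap, "s")
print("Balancer2: eth3 down -> node dead:", b2_down_to_dead, "s")
print("Balancer2: node dead -> eth3 up:", b2_dead_to_up, "s")
```

The Balancer1 excerpt contains no eth3 link events, so its gap is measured
from the eth1 link-up line.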

Anyone?

Regards,

Sebastian

