[Linux-HA] Cluster unexpectedly failing over
Chris Rogers
chris.rogers at netwise.net
Mon Apr 19 11:58:02 MDT 2004
Kevin:
Below are excerpts of the ipfail messages from my messages log file for the time frame of this event.
Here is the info for linux1:
Apr 15 23:01:55 linux1 ipfail[1409]: info: Status update: Node group1 now has status dead
Apr 15 23:01:55 linux1 ipfail[1409]: info: NS: We are dead. :<
Apr 15 23:01:55 linux1 ipfail[1409]: info: Link Status update: Link group1/group1 now has status dead
Apr 15 23:01:55 linux1 ipfail[1409]: info: We are dead. :<
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG: Dumping message with 12 fields
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[0]: [t=ask_resources]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[1]: [rsc_hold=all]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[2]: [src=linux1]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[3]: [info=me]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[4]: [from_id=ipfail]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[5]: [to_id=ipfail]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[6]: [seq=100b1]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[7]: [hg=52]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[8]: [ts=407f5ab3]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[9]: [ld=0.29 0.27 0.23 1/170 24368]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[10]: [ttl=4]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[11]: [auth=1 39f97b2d]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG: Dumping message with 10 fields
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[0]: [t=ask_resources]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[1]: [rsc_hold=all]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[2]: [info=other]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[3]: [src=linux0]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[4]: [seq=eb74]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[5]: [hg=31]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[6]: [ts=407f5ac3]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[7]: [ld=0.12 0.08 0.01 3/136 28563]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[8]: [ttl=4]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[9]: [auth=1 214bd22b]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG: Dumping message with 10 fields
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[0]: [t=ask_resources]
Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[1]: [rsc_hold=all]
Apr 15 23:01:56 linux1 ipfail[1409]: info: MSG[2]: [info=done]
Apr 15 23:01:56 linux1 ipfail[1409]: info: MSG[3]: [src=linux1]
Apr 15 23:01:56 linux1 ipfail[1409]: info: MSG[4]: [seq=100b3]
Apr 15 23:01:56 linux1 ipfail[1409]: info: MSG[5]: [hg=52]
Apr 15 23:01:56 linux1 ipfail[1409]: info: MSG[6]: [ts=407f5ab3]
Apr 15 23:01:56 linux1 ipfail[1409]: info: MSG[7]: [ld=0.29 0.27 0.23 1/174 24567]
Apr 15 23:01:56 linux1 ipfail[1409]: info: MSG[8]: [ttl=4]
Apr 15 23:01:56 linux1 ipfail[1409]: info: MSG[9]: [auth=1 bde5d335]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG: Dumping message with 10 fields
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[0]: [t=ask_resources]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[1]: [rsc_hold=all]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[2]: [info=done]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[3]: [src=linux0]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[4]: [seq=eb77]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[5]: [hg=31]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[6]: [ts=407f5ac5]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[7]: [ld=0.12 0.08 0.01 3/142 28859]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[8]: [ttl=4]
Apr 15 23:01:57 linux1 ipfail[1409]: info: MSG[9]: [auth=1 7669f67]
Here is the info for linux0:
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG: Dumping message with 12 fields
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[0]: [t=ask_resources]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[1]: [rsc_hold=all]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[2]: [src=linux1]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[3]: [info=me]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[4]: [from_id=ipfail]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[5]: [to_id=ipfail]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[6]: [seq=100b1]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[7]: [hg=52]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[8]: [ts=407f5ab3]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[9]: [ld=0.29 0.27 0.23 1/170 24368]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[10]: [ttl=4]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[11]: [auth=1 39f97b2d]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG: Dumping message with 10 fields
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[0]: [t=ask_resources]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[1]: [rsc_hold=all]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[2]: [info=other]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[3]: [src=linux0]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[4]: [seq=eb74]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[5]: [hg=31]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[6]: [ts=407f5ac3]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[7]: [ld=0.12 0.08 0.01 3/136 28563]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[8]: [ttl=4]
Apr 15 23:02:11 linux0 ipfail[1384]: info: MSG[9]: [auth=1 214bd22b]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG: Dumping message with 10 fields
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[0]: [t=ask_resources]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[1]: [rsc_hold=all]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[2]: [info=done]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[3]: [src=linux1]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[4]: [seq=100b3]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[5]: [hg=52]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[6]: [ts=407f5ab3]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[7]: [ld=0.29 0.27 0.23 1/174 24567]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[8]: [ttl=4]
Apr 15 23:02:12 linux0 ipfail[1384]: info: MSG[9]: [auth=1 bde5d335]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG: Dumping message with 10 fields
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[0]: [t=ask_resources]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[1]: [rsc_hold=all]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[2]: [info=done]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[3]: [src=linux0]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[4]: [seq=eb77]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[5]: [hg=31]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[6]: [ts=407f5ac5]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[7]: [ld=0.12 0.08 0.01 3/142 28859]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[8]: [ttl=4]
Apr 15 23:02:13 linux0 ipfail[1384]: info: MSG[9]: [auth=1 7669f67]
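For anyone comparing the two excerpts, here is a minimal Python sketch that groups the dumped "MSG[n]: [key=value]" lines into one dictionary per message. The names parse_dumps and decode_ts are made up for this illustration and are not part of heartbeat or ipfail. decode_ts assumes the ts field is a hexadecimal count of seconds since the Unix epoch; that is an assumption, not something the logs above state.

```python
import re
from datetime import datetime, timezone

# Matches the line that starts each dumped message.
HEADER = re.compile(r"MSG: Dumping message with (\d+) fields")
# Matches one field line, e.g. "MSG[3]: [src=linux0]".
FIELD = re.compile(r"MSG\[(\d+)\]: \[([^=\]]+)=(.*)\]\s*$")


def parse_dumps(lines):
    """Return a list of dicts, one per 'Dumping message' block."""
    messages = []
    current = None
    for line in lines:
        if HEADER.search(line):
            # A new dump begins; start collecting its fields.
            current = {}
            messages.append(current)
            continue
        match = FIELD.search(line)
        if match and current is not None:
            current[match.group(2)] = match.group(3)
    return messages


def decode_ts(hex_ts):
    """Convert a ts value to UTC, ASSUMING it is hex epoch seconds."""
    return datetime.fromtimestamp(int(hex_ts, 16), tz=timezone.utc)


if __name__ == "__main__":
    sample = [
        "Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG: Dumping message with 10 fields",
        "Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[2]: [info=other]",
        "Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[3]: [src=linux0]",
        "Apr 15 23:01:55 linux1 ipfail[1409]: info: MSG[6]: [ts=407f5ac3]",
    ]
    for msg in parse_dumps(sample):
        print(msg["src"], msg["info"], decode_ts(msg["ts"]))
```

If the assumption about ts holds, the values stamped by linux1 (407f5ab3) and by linux0 (407f5ac3) are 16 seconds apart, which is the same offset seen between the syslog timestamps on the two nodes.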
--Chris
-----Original Message-----
From: linux-ha-bounces at lists.linux-ha.org [mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Kevin Dwyer
Sent: Monday, April 19, 2004 9:51 AM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Cluster unexpectedly failing over
Hi Chris,
Chris Rogers wrote:
>
> I am at a loss for why this is happening. I need your help. I have
> attached excerpts of the ha-log file from each of the linux nodes. The
> excerpt captures a failover that happened around 11:02pm. I have also
> included the ha.cf and haresources file from each node.
>
I think your logs are missing some of the information needed to diagnose this problem. For instance, I don't see any messages from ipfail (other than the one or two logged API messages). ipfail usually writes quite a few informational and debug messages, which should end up in syslog.
You mentioned that one node will see the ping group die. Is it possible to test connectivity to that group from that node when this happens? If it really cannot reach the ping nodes, then I don't think heartbeat is at fault. If the ping nodes are still reachable, then we'll need those other logs.
I know you tested by sniffing with tcpdump, but I'm concerned that the OS isn't actually handling the reply for some reason.
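One way to run the connectivity test suggested above is a small script on the affected node, started when the ping group is reported dead. This is only a sketch: check_ping_group is a name made up for this example, the addresses are placeholders for the real ping group members in ha.cf, and the -c and -W options assume the Linux iputils version of ping.

```python
import subprocess


def check_ping_group(hosts, run=subprocess.run):
    """Send one ICMP echo to each host and report which ones answered.

    'run' is injectable so the function can be exercised without a network.
    """
    results = {}
    for host in hosts:
        # -c 1: send a single echo request; -W 2: wait at most 2 s for a reply.
        proc = run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # ping exits with 0 only if a reply was actually received.
        results[host] = proc.returncode == 0
    return results


if __name__ == "__main__":
    # Placeholder addresses; substitute the members of your ping group.
    ping_group = ["192.0.2.1", "192.0.2.2"]
    for host, alive in check_ping_group(ping_group).items():
        print(host, "reachable" if alive else "NOT reachable")
```

Because this relies on the operating system delivering the reply to the ping process, it checks more than a tcpdump capture does: a reply can appear on the wire and still never reach the application.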
--
- kpd