[Linux-HA] When both nodes lose contact with ping node
Andrew Beekhof
beekhof at gmail.com
Tue Jan 9 13:10:34 MST 2007
This is a haresources cluster, right?
On 1/9/07, Paul Walsh <Paul.Walsh at uce.ac.uk> wrote:
> Just thought I'd share this with the list in case anyone hits a similar problem:
>
> We have two systems (nodeA and nodeB for the purposes of this email) running heartbeat 2.0.2 and drbd 0.7.15 under
> SLES9. The two nodes use their secondary NICs connected via a crossover cable for heartbeat/drbd traffic. Resources
> (other than drbd filesystems) controlled by heartbeat are Apache and MySQL. Both nodes use the same IP address as their
> designated ping node.
>
> Shortly before Christmas we had an incident which resulted in the heartbeat resources being DOWN on *both* nodes for
> approximately 8 hours. This email is by way of a "heads up" as to what can happen, and a request for information to help
> avoid the same happening in future...
>
> On the night in question, there was a problem with the core router which resulted in both nodes losing contact with the
> ping node (the fact that they both use the same ping node is largely irrelevant in this case, as the nature of the
> "hiccup" in the network would have resulted in them losing contact with *any* ping node). After trawling through the
> logs, this is what I think happened:
>
>
> 23:01:43 nodeA loses contact with ping node and asks nodeB for a ping node count
> 23:01:43 nodeB tells nodeA that it can still see its ping node.
> 23:01:43 nodeA initiates a delayed give-up of resources (to occur in 4 secs time)
>
> (This, I understand, is normal behaviour: A can't see its ping node, so it assumes its NIC has died. At this point B
> *can* see its ping node, so A gets ready to release its resources.)
>
> 23:01:45 nodeB detects loss of connection to its ping node
> 23:01:45 nodeA and nodeB both now have a ping node count of 0 and so consider themselves to be "balanced"
> 23:01:45 as the ping node counts are balanced, nodeA aborts the delayed giveup
>
> (which you would normally expect if the ping node counts were equal)
>
> 23:01:51 nodeB regains contact with its ping node. nodeB then checks ping node count with nodeA
> 23:01:51 nodeA still can't see the ping node so initiates another delayed giveup
> 23:01:52 nodeA re-establishes contact with its ping node and aborts the giveup
> 23:01:52 nodeB initiates a delayed giveup as the counts are now balanced
> 23:01:56 nodeB asks to go into standby
> 23:02:06 nodeA acquires resources from nodeB (null effect as nodeA hadn't relinquished them anyway at this point)
> 23:04:23 nodeB loses contact with ping node and asks nodeA for ping node count
> 23:04:23 nodeA replies that it can see its ping node
> 23:04:23 nodeA loses connection
> 23:04:23 nodeB aborts giveup as ping node count is balanced
> 23:04:33 nodeB re-establishes contact and asks nodeA for ping count
> 23:04:33 nodeA has lower count than nodeB so initiates giveup
> 23:04:34 nodeA regains connection
> 23:04:34 nodeA aborts delayed giveup
> 23:04:34 nodeB initiates a giveup
> 23:04:38 nodeB asks to go standby
> 23:04:49 nodeA "acquires" resources from nodeB
> 23:04:57 nodeA loses connection
> 23:04:57 nodeA has lower count than nodeB so initiates giveup
> 23:05:01 nodeA wants to go standby
> 23:05:03 nodeB loses connection
> 23:05:06 nodeB regains connection
> 23:05:06 nodeA initiates delayed giveup
> 23:05:06 nodeA regains connection
> 23:05:30 nodeA completes giveup
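
The comparison rule that the timeline above keeps re-applying can be sketched in a few lines of Python. This is only a rough model of the behaviour as described in the post (a node with the lower ping node count schedules a delayed giveup, and a balanced count cancels a pending one); it is not ipfail's actual code. The names `Node`, `compare_counts` and `GIVEUP_DELAY_S` are made up for illustration, and times are seconds relative to 23:01:43.

```python
from dataclasses import dataclass
from typing import Optional

GIVEUP_DELAY_S = 4  # the "4 secs" delay quoted in the 23:01:43 entry


@dataclass
class Node:
    name: str
    ping_count: int = 1               # ping nodes this node can currently reach
    giveup_at: Optional[int] = None   # time at which a pending giveup would fire


def compare_counts(me: Node, peer: Node, now: int) -> str:
    """Apply the rule as described in the post (a simplification)."""
    if me.ping_count < peer.ping_count:
        # Peer has better connectivity: schedule a delayed giveup.
        me.giveup_at = now + GIVEUP_DELAY_S
        return "delayed giveup"
    if me.ping_count == peer.ping_count and me.giveup_at is not None:
        # Counts are balanced again: cancel the pending giveup.
        me.giveup_at = None
        return "abort giveup"
    return "no action"


node_a, node_b = Node("nodeA"), Node("nodeB")

# t=0 (23:01:43): nodeA loses its ping node, nodeB can still see it.
node_a.ping_count = 0
first = compare_counts(node_a, node_b, now=0)

# t=2 (23:01:45): nodeB loses it too, so both counts are 0 and balanced.
node_b.ping_count = 0
second = compare_counts(node_a, node_b, now=2)

print(first, "/", second)  # prints "delayed giveup / abort giveup"
```

Because each decision looks only at the two counts at that instant, an outage that flaps faster than the giveup delay can keep both nodes scheduling and cancelling giveups, which is the pattern visible in the log.
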
>
> So, in essence, due to the freak timings of the network outages the nodes both ended up with the resources down and,
> because the ping node count was the same, no action was then taken to start them up. I'm not sure what the solution to
> this is, other than having some process check a) the ping node count and b) the status of the resources:
>
> IF ping node counts are equal
> THEN
>     IF resources down on both nodes
>     THEN
>         start resources on nodeA
>     ENDIF
> ELSE
>     failover to node with higher ping node count
> ENDIF
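
As a minimal sketch, the check proposed above could look like this in Python. The function name `decide` and its arguments are hypothetical; how the ping node counts and the resource status would actually be collected from heartbeat is not covered here, and the function only returns a description of the action rather than performing it.

```python
def decide(count_a: int, count_b: int, up_a: bool, up_b: bool) -> str:
    """Return the action suggested by the pseudocode above.

    count_a / count_b: ping node counts seen by nodeA and nodeB.
    up_a / up_b: whether the resources are currently running on each node.
    """
    if count_a == count_b:
        if not up_a and not up_b:
            # Balanced counts and nothing running anywhere: pick nodeA.
            return "start resources on nodeA"
        return "no action"
    # Counts differ: prefer the node with the higher ping node count.
    better = "nodeA" if count_a > count_b else "nodeB"
    return f"fail over to {better}"


# The state the cluster was left in: balanced counts, resources down everywhere.
print(decide(1, 1, up_a=False, up_b=False))  # prints "start resources on nodeA"
```

Note that the "fail over" branch is a no-op when the resources are already running on the better-connected node.
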
>
> Or maybe the problems were caused by a combination of the timing of the outages and these settings in ha.cf:
>
> keepalive 2
> deadtime 15
> warntime 10
>
> Any suggestions/advice gratefully received.
>
>
>
> --
> Paul Walsh
>