[Linux-HA] Ping node disconnection triggers stopping of resources

Andrew Beekhof beekhof at gmail.com
Wed Feb 14 08:12:26 MST 2007


What version are you running? There was a well-known bug in .7 that did this.

On 2/14/07, Pavol Gono <palo.gono at gmail.com> wrote:
> Hi
>
> When using
> respawn root /usr/local/lib/heartbeat/pingd -m 10 -d 5s
> in my ha.cf, and this constraint:
> <rsc_location id="x_location_pingd" rsc="x_Dummy">
>   <rule id="x_rule_pingd" score_attribute="pingd">
>     <expression id="x_expr_pingd" attribute="pingd" operation="defined"/>
>   </rule>
> </rsc_location>
>
> I expected the following to happen:
> - the network connection between a ping node and some of the cluster nodes is broken
> - after 5 seconds, each cluster node knows whether it still has a
> connection to the ping node
> - cluster nodes with no connection decrease the score of x_Dummy by 10
> - only then is the decision about stopping / starting resources made
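
To make the expected arithmetic concrete, here is a rough sketch in Python (not heartbeat code). It assumes the "pingd" attribute is simply the -m multiplier times the number of ping nodes a cluster node can still reach, and that score_attribute="pingd" uses that value directly as the location score. The function name is made up for illustration.

```python
MULTIPLIER = 10  # taken from "pingd -m 10" in ha.cf


def pingd_score(reachable_ping_nodes, multiplier=MULTIPLIER):
    """Hypothetical helper: value assumed to end up in the 'pingd' attribute."""
    return multiplier * len(reachable_ping_nodes)


# Both ping nodes reachable, then 10.0.0.8 is cut off by iptables.
before = pingd_score({"10.0.0.8", "10.0.0.9"})
after = pingd_score({"10.0.0.9"})
print(before, after)
```

Under those assumptions a node that loses one of the two ping nodes drops by exactly 10, which matches the third step in the list.
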
>
> But I observed that the decision about stopping resources is made
> before the score changes have arrived from all cluster nodes.
> The result: after disconnecting a ping node from a two-node
> cluster, resources are (sometimes) temporarily stopped.
>
> Is this known and expected behaviour?
>
>
> Test environment:
> HB revision 10161 with a small patch to Dummy (see attachment)
> cluster nodeA - sk16251c (DC)
> cluster nodeB - linux-sles1
> pingnodes - 10.0.0.8, 10.0.0.9
>
> In the test script I emulate disconnection of the ping node from both
> cluster nodes. Since I use iptables on the cluster nodes, there
> are two possible orders:
> - disconnect nodeA, then nodeB
> - disconnect nodeB, then nodeA
>
> The test script writes a group of characters (such as MMMM, NNNN, OOOO,
> PPPP) after each step, so that I can identify the relevant log
> messages. I run the test script on the DC machine (cluster nodeA,
> sk16251c). I use a parsing script to filter the relevant messages from
> the DC's log.
>
> From the filtered output I recognized the following behaviour:
>
> When the first disconnect order was used:
> disconnect nodeA - sk16251c (DC)
> disconnect nodeB - linux-sles1
>
> After this operation, the first score evaluation gave nodeA a score 10
> lower than nodeB, even though iptables disconnected both nodes at almost
> the same moment. This results in a STOP request on nodeA. But the
> very next score evaluation has the updated info from nodeB and is
> therefore comparing equal scores (resources stopped, one ping node
> connected); it chooses the DC node to run the resources, so they are
> restarted.
>
> When the second disconnect order was used:
> disconnect nodeB - linux-sles1
> disconnect nodeA - sk16251c (DC)
>
> Now two results are observed, depending on the moment at which
> the message from nodeB is processed. If it is received before the
> first score evaluation, both nodes are scored as having only one
> ping node connection alive, and nothing happens: resources remain on
> the DC node.
> But if the message from the peer node is processed after the first
> score evaluation, this leads to the same behaviour (STOP, and
> then after the next evaluation START on the same node) as described
> for the first case above.
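
The ordering effect described in both cases can be replayed with a toy model. This is only a sketch under stated assumptions, not how the policy engine is implemented: it assumes a placement decision is recomputed after every single attribute update, that the highest score wins, and that ties go to the DC. The names place and replay are invented for the example.

```python
def place(scores, dc="nodeA"):
    """Pick the node with the highest score; ties go to the DC (assumption)."""
    best = max(scores.values())
    candidates = [node for node, score in scores.items() if score == best]
    return dc if dc in candidates else sorted(candidates)[0]


def replay(updates, initial=20):
    """Apply attribute updates one at a time, re-evaluating after each one."""
    scores = {"nodeA": initial, "nodeB": initial}
    placements = [place(scores)]
    for node, score in updates:
        scores[node] = score
        placements.append(place(scores))
    return placements


# The DC sees its own update first and evaluates with stale nodeB data.
stale = replay([("nodeA", 10), ("nodeB", 10)])
# nodeB's update is processed before the first evaluation that matters.
fresh = replay([("nodeB", 10), ("nodeA", 10)])
print(stale)
print(fresh)
```

In this model the first sequence moves the resource away from nodeA and back again (the STOP followed by START), while the second leaves it on nodeA throughout, which is the pattern reported above.
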
>
> Palo
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
>

