[Linux-HA] observations after some fencing tests in a two node

Andrew Beekhof beekhof at gmail.com
Fri Nov 9 09:23:30 MST 2007


On Nov 7, 2007, at 4:43 PM, Sebastian Reitenbach wrote:

> Hi all,
>
> I did some fencing tests in a two node cluster, here are some  
> details of my
> setup:
>
> - use stonith external/ilo for fencing (ssh to ilo board and issue a  
> reset
> command)
> - both nodes are connected via two bridged ethernet interfaces to two
> redundant switches. The ilo boards are connected to each of the
> switches.
>
> My first observation:
> - when removing the network cables from the node that is the DC at the
> moment, it took at least three minutes, until it decided to stonith  
> the
> other node and to startup the resources that ran on the node without  
> network
> connectivity
> - when removing the network cables from the node that is not the DC,  
> then it
> was a matter of e.g. 20 seconds, then this node fenced the DC, and  
> then
> became DC
>
> Why is there such a difference? The first one takes too long in my  
> eyes to
> detect the outage, but I hope there are timeout values that I can  
> tweak. For
> which ones shall I take a look?

I see later on you said you can't reproduce this, but I'd really like
to see those logs if you still have them.
Also, hb_report can be used after you find a problem - it's not
necessary to be able to reproduce it.

> Also I recognized the following line in the logfile from the DC in  
> the first
> case:
> tengine: ... info: extract_event: Stonith/shutdown of <uuid> not  
> matched
> This line shows up immediately after the DC detects that the other  
> node is
> unreachable.

That's the TE noticing the node go away - which is good

> From then it takes at least two minutes until the DC decides to
> fence the other node.

This part - not so good.

> The second thing I observed:
> My stonith is working via ssh to the ilo board to the node that  
> shall be
> fenced. When I remove the ethernet cables from one node, stonith  
> will fail
> to kill the other node.
>
> take case two from above, remove the cables from the node that is  
> not the
> DC, where I observed the following:
> The DC needs some minutes to decide to fence the other node, because
> of the above observed behaviour. Meanwhile the non DC node without
> network cables tried to fence the DC, that failed, and the node was in
> an unclean state, until the DC fenced it in the end.
> Luckily the stonith of the DC failed. Now assume that instead of ssh as
> the stonith resource, a stonith device connected to e.g. a serial port
> is used.
> In that case, the non DC node would be able to fence the DC, and then
> become DC itself, starting all resources, mounting all filesystems, ...
> Meanwhile the DC is restarted, and either heartbeat is not started
> automatically, then the cluster is unusable, because the one node  
> that is DC
> has no network. Or when heartbeat is started automatically, it cannot
> communicate to the second node, and will assume this one is dead,

Actually it won't assume that.
Instead it will try to shoot the other node and only after that
succeeds will it start any resources.

Safe but not very smart (since clearly each side will take turns  
shooting the other until the fault is repaired).

Which is why 2 node clusters are not a very good idea :-)
In a 3-node cluster the disconnected node won't have quorum and isn't
allowed to try and kill anyone.
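
To make the quorum argument concrete, here is a minimal sketch of a
strict-majority rule. It is an illustration only, not Heartbeat's actual
quorum code; the function names has_quorum and may_fence are made up for
this example, and tying fencing permission directly to quorum is an
assumption taken from the sentence above.

```python
def has_quorum(visible_nodes: int, total_nodes: int) -> bool:
    """Strict majority: a partition needs more than half of all nodes."""
    return visible_nodes * 2 > total_nodes


def may_fence(visible_nodes: int, total_nodes: int) -> bool:
    """Assumption for this sketch: only a quorate partition may fence."""
    return has_quorum(visible_nodes, total_nodes)


if __name__ == "__main__":
    # 3-node cluster: the isolated node sees only itself, the rest see 2.
    print("isolated node, 3-node cluster:", may_fence(1, 3))
    print("remaining pair, 3-node cluster:", may_fence(2, 3))
    # 2-node cluster: each half sees 1 of 2, so a strict majority
    # cannot tell the halves apart.
    print("either half, 2-node cluster:", may_fence(1, 2))
```

With three nodes exactly one side of a split can hold a majority. With
two nodes the halves are symmetric, so a majority rule alone cannot pick
a survivor.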

Alternatively, use stonith-action=poweroff
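
A toy model of why this helps, under the assumption that a rebooted node
comes back, cannot see its peer, and shoots it, while a powered-off node
stays down. The node names and the function fence_rounds are hypothetical
and nothing here calls real cluster code.

```python
def fence_rounds(stonith_action: str, max_rounds: int = 6):
    """Return the (shooter, victim) pairs seen in a split two-node cluster."""
    shooter, victim = "node-b", "node-a"
    shots = []
    for _ in range(max_rounds):
        shots.append((shooter, victim))
        if stonith_action == "poweroff":
            # The victim stays down, so nobody is left to shoot back.
            break
        # reboot: the victim restarts, still cannot see its peer,
        # and takes its turn as the shooter.
        shooter, victim = victim, shooter
    return shots


if __name__ == "__main__":
    print("reboot:  ", fence_rounds("reboot", max_rounds=4))
    print("poweroff:", fence_rounds("poweroff"))
```

In this model "reboot" keeps alternating until max_rounds is reached
(standing in for "until the fault is repaired"), while "poweroff" stops
after a single shot.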

> and start
> all its resources, so that e.g. filesystems could be mounted on both  
> nodes.
>
> I don't have a hardware fencing device to test my theory, but could  
> that
> happen or not? Could the usage of some ping nodes, combined with a  
> pingd or
> an external quorumd help to solve the dilemma?
>
> Well, I am running heartbeat 2.1.2-15 on sles10sp1, any hints and  
> comments
> are appreciated.
>
> kind regards
> Sebastian


