[Linux-HA] invalid config info?

Andrew Beekhof beekhof at gmail.com
Wed Oct 5 11:18:07 MDT 2005


On 10/5/05, Peter Kruse <pk at q-leap.com> wrote:
> Alan Robertson wrote:
> >>
> >> The powerswitch indeed didn't respond to snmp requests.  I had
> >> to restart it, and the error disappeared...
> >
> > Great!
>
> right, but it still does not fence the node with the failed stop action.
> attached are the syslogs of both nodes and the cib.xml.
> In the logs there is:
>
> Oct  5 14:24:38 ha-test-1 pengine: [5906]: WARN: mask(stages.c:stage6):
> Scheduling Node ha-test-1 for STONITH
> ...
> Oct  5 14:24:38 ha-test-1 tengine: [5905]: info:
> mask(tengine.c:initiate_action): Executing pseudo-event (14): stop on (null)
> Oct  5 14:24:38 ha-test-1 tengine: [5905]: info:
> mask(tengine.c:cib_action_updated): Initiating action 9: stop
> fence1:apc1:0 on ha-test-1
> Oct  5 14:24:38 ha-test-1 tengine: [5905]: info:
> mask(tengine.c:cib_action_updated): Initiating action 11: stop
> fence1:apc1:1 on ha-test-2
> Oct  5 14:24:38 ha-test-1 tengine: [5905]: info:
> mask(tengine.c:initiate_action): Executing pseudo-event (27): stop on (null)
> Oct  5 14:24:38 ha-test-1 crmd: [5873]: info: mask(lrm.c:do_lrm_rsc_op):
> Performing op stop on fence1:apc1:0
> ...
> Oct  5 14:24:38 ha-test-1 crmd: [5873]: WARN: mask(lrm.c:do_lrm_event):
> LRM operation (3) monitor on fence1:apc1:0 cancelled
> Oct  5 14:24:38 ha-test-1 lrmd: [7609]: debug: Will send the stonith RA
> operation to stonithd: apcmastersnmp stop
>
> but I'm not sure what this really means.  Did Heartbeat try
> to fence node ha-test-1?  and why is this operation done on
> the node itself?  Shouldn't the other node ha-test-2 call the stonith
> (that's at least what the name suggests ;)
> It's not easy for me to understand all these messages.  At the moment
> stonith doesn't work for me...

For the future reference of all:

A log containing "Executing fencing operation" means the CRM actually
called the STONITHd.  If you see that and the node wasn't fenced, ask
China :)

(Peter, this message isn't present in your logs, so it's my bug.)

A log containing "Scheduling Node {some node} for STONITH" means the
CRM knew it was supposed to fence the node.  If you see this but not
the previous message - come complain to me (via the list of course ;-)
because you may have found an ordering bug in the transition graph. 
(Basically the TE is waiting for something to happen before it pulls
the trigger - but that something doesn't ever happen.)
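
The two checks above can be scripted.  What follows is only a rough
sketch, not something shipped with Heartbeat: the function name
diagnose() and its return strings are made up for illustration, and it
assumes each syslog entry is a single line (the logs quoted in this
mail are wrapped by the mail client).  Only the two message fragments
come from the explanation above.

```python
import re

# Message fragments described above; everything else is an assumption.
SCHEDULED = re.compile(r"Scheduling Node (\S+) for STONITH")
EXECUTED = "Executing fencing operation"


def diagnose(log_lines):
    """Classify a syslog excerpt by how far a fencing attempt got.

    Returns (verdict, scheduled_nodes), where scheduled_nodes lists the
    nodes named in "Scheduling Node ... for STONITH" messages.
    """
    scheduled = []
    executed = False
    for line in log_lines:
        match = SCHEDULED.search(line)
        if match and match.group(1) not in scheduled:
            scheduled.append(match.group(1))
        if EXECUTED in line:
            executed = True

    if executed:
        # The CRM called the STONITHd; look there if the node survived.
        return "fencing requested from STONITHd", scheduled
    if scheduled:
        # The CRM knew it had to fence, but the TE never got that far.
        return "scheduled but never executed", scheduled
    return "no fencing scheduled", scheduled


if __name__ == "__main__":
    sample = [
        "Oct  5 14:24:38 ha-test-1 pengine: [5906]: WARN: "
        "mask(stages.c:stage6): Scheduling Node ha-test-1 for STONITH",
    ]
    print(diagnose(sample))
```

Fed the pengine line from Peter's log, this reports the second case:
the node was scheduled for STONITH but no fencing operation was
executed.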

The other logs, i.e.
> Oct  5 14:24:38 ha-test-1 tengine: [5905]: info:
> mask(tengine.c:cib_action_updated): Initiating action 11: stop
> fence1:apc1:1 on ha-test-2

show the CRM moving healthy resources away from the node that is about
to be shot.

Normally, that's a really good idea.  Except when that resource is
the STONITHd RA that's going to shoot your node.  That _may_ be what's
going on here.

The problem is made even worse by the way clones were allocated
(sequentially starting from 0).  So if clone:1 was stopped/died/etc,
then all the other ones clone:N (N>1) would be shuffled around
(usually with no benefit at all).  So the moment at which the stonith
should happen may be exactly the moment at which all the stonith RAs
have been stopped.
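
To make the shuffling concrete, here is a toy model.  It is not the
CRM's allocation code; the function allocate_sequentially() and the
node names are invented, and the only thing taken from the description
above is the idea of handing out instance numbers in order starting
from 0.

```python
def allocate_sequentially(nodes):
    """Toy model: hand out clone instances in order, starting from 0."""
    return {f"clone:{i}": node for i, node in enumerate(nodes)}


# Hypothetical four-node cluster, one clone instance per node.
before = allocate_sequentially(["node-a", "node-b", "node-c", "node-d"])

# node-b goes away; renumbering from 0 reassigns every later instance.
after = allocate_sequentially(["node-a", "node-c", "node-d"])

# Instances that now sit on a different node than before.
moved = [inst for inst in after if before[inst] != after[inst]]

if __name__ == "__main__":
    for inst in moved:
        print(f"{inst}: {before[inst]} -> {after[inst]}")
```

In this model losing one node moves two instances that were perfectly
healthy, and each move implies a stop followed by a start.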

But there is good news.

I have actually spent the last few days working out how to avoid this
- and it's now in CVS.


As to the question of why ha-test-1 is shooting itself (or at least
trying to): in this case an otherwise healthy node was being shot
because a resource failed to stop.

"Otherwise healthy" includes the ability to be the DC, which is where
the decision to fence nodes is made and where all the fencing requests
are made.  The STONITHd then figures out internally which node should
pull the trigger.

So just because we made the request on ha-test-1 doesn't mean
ha-test-1 will also pull the trigger.

>
>         Peter


