[Linux-HA] strange monitor behaviour

Andrew Beekhof beekhof at gmail.com
Thu Jan 11 02:59:01 MST 2007


On 1/10/07, Pavol Gono <palo.gono at gmail.com> wrote:
> On 1/10/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > On 1/10/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > > But let's go to the original topic :)
> > > > I installed heartbeat from sources, changeset 9934, configure options
> > > > are custom like in previous posts. Distribution SLES10, nodes
> > > > deboserver and pgbook. BasicSanityCheck was successful on both.
> > > > I made very similar configuration like in the first post, resources
> > > > IPaddr and Dummy.
> > > > When I removed directory /tmp/a on machine, where resources were
> > > > running, the same situation occurred: Dummy resource is stopped, IPaddr
> > > > resource remains on original node, no failover.
> > > >
> > > > Is this correct behaviour?
> > >
> > > if failure_count was not incremented for that resource on that node,
> > > then this is not the expected behavior
> > >
> > > i will look at the logs momentarily
> >
> > i see:
> >
> > tengine[9819]: 2007/01/10_16:49:22 WARN: update_failcount: Updating
> > failcount for x_Dummy on 92ba1bad-9c97-4f5d-b2f7-48492256893c after
> > failed monitor: rc=7
> >
> > tengine[9819]: 2007/01/10_16:49:22 debug: log_data_element:
> > abort_transition_graph: Cause       <nvpair
> > id="status-92ba1bad-9c97-4f5d-b2f7-48492256893c-fail-count-x_Dummy"
> > name="fail-count-x_Dummy" value="1"/>
> >
> > cib[9752]: 2007/01/10_16:49:22 debug: log_data_element: cib:diff: +
> >          <nvpair
> > id="status-92ba1bad-9c97-4f5d-b2f7-48492256893c-fail-count-x_Dummy"
> > name="fail-count-x_Dummy" value="1"/>
> >
> > which would indicate that things are working as they should so far.
> >
> > can you also attach the following file on pgbook:
> > /var/lib/heartbeat/pengine/pe-input-47.bz2
>
> attached
>
> >
> > for some reason we consider deboserver out-of-bounds for x_Dummy:
> > pengine[9820]: 2007/01/10_16:49:24 debug: native_print: Allocating:
> > x_Dummy     (heartbeat::ocf:Dummy): Stopped
> > pengine[9820]: 2007/01/10_16:49:24 debug: native_assign_node: Color
> > x_Dummy, Node[0] pgbook: 1000000
> > pengine[9820]: 2007/01/10_16:49:24 debug: native_assign_node: Color
> > x_Dummy, Node[1] deboserver: -1000000
> > pengine[9820]: 2007/01/10_16:49:24 debug: native_assign_node:
> > Assigning pgbook to x_Dummy
> > pengine[9820]: 2007/01/10_16:49:24 notice: StartRsc:  pgbook    Start x_Dummy
> > pengine[9820]: 2007/01/10_16:49:24 notice: Recurring: pgbook
> > x_Dummy_monitor_5000
> >
> > (btw. those are the node weights for the x_Dummy resource)
>
> My intention was forcing failover when one of resources fails (by
> monitor or start). Is anything wrong with my configuration or are
> out-of-bounds the problem?


the problem here is:

       <rsc_colocation id="x_colocation" from="x_Dummy" to="x_IPaddrL" score="INFINITY"/>


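prior to 2.0.7, the members of a resource group were either _all_
running or _none_ of them were.  this was not completely acceptable.

so colocation constraints are no longer symmetrical.  the constraint
above only says that x_Dummy must run where x_IPaddrL runs; it says
nothing about x_IPaddrL following x_Dummy, which is why x_IPaddrL
stays put when x_Dummy fails.  to get the behavior you want, add:
       <rsc_colocation id="x_colocation_2" from="x_IPaddrL" to="x_Dummy" score="INFINITY"/>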

i'm still working on allowing:
       <rsc_colocation id="x_colocation" from="x_Dummy" to="x_IPaddrL" score="INFINITY" symmetrical="true"/>
which would have the same effect as the two one-way constraints
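
as a quick way to spot one-way colocations like the one above, here is
a minimal sketch in Python. it is not part of heartbeat: the helper
name one_way_colocations and the sample CONSTRAINTS string are made up
for illustration, and it only looks at the from/to/score attributes
shown in this thread (it does not know about the proposed
symmetrical="true" attribute).

```python
import xml.etree.ElementTree as ET

# Sample fragment built from the two constraints in this thread
# (assumed to sit inside a <constraints> element for parsing purposes).
CONSTRAINTS = """
<constraints>
  <rsc_colocation id="x_colocation" from="x_Dummy" to="x_IPaddrL" score="INFINITY"/>
  <rsc_colocation id="x_colocation_2" from="x_IPaddrL" to="x_Dummy" score="INFINITY"/>
</constraints>
"""


def one_way_colocations(xml_text):
    """Return sorted (from, to) pairs that have no matching reverse constraint.

    Hypothetical helper: only INFINITY-score rsc_colocation elements are
    considered, and only their from/to attributes are compared.
    """
    root = ET.fromstring(xml_text)
    pairs = {
        (c.get("from"), c.get("to"))
        for c in root.iter("rsc_colocation")
        if c.get("score") == "INFINITY"
    }
    # A pair is one-way when the reversed pair is not also present.
    return sorted(p for p in pairs if (p[1], p[0]) not in pairs)


if __name__ == "__main__":
    print(one_way_colocations(CONSTRAINTS))
```

with both constraints present the helper returns an empty list; with
only the original x_colocation constraint it would report the
("x_Dummy", "x_IPaddrL") pair as one-way.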

