[Linux-HA] strange monitor behaviour

Andrew Beekhof beekhof at gmail.com
Thu Jan 4 09:01:24 MST 2007


you don't still have /var/lib/heartbeat/pengine/pe-input-1557.bz2 on
fico by any chance, do you?

i've been running tests today without luck... any chance you could
reproduce it with "debug 1" in ha.cf please?

On 1/4/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> On 1/3/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > Hi
> >
> > I was doing some tests with my configuration of two nodes, focusing on
> > monitoring ability of heartbeat. The following description should
> > reveal at least two bugs (or maybe configuration mistakes :)
> >
> > To keep it simple, I replaced my resources with IPaddr and Dummy,
> > which should run on the same node. Dummy has its state file in
> > /tmp/a/b. Both resources are monitored at 5-second intervals.
> >
> > First I started heartbeat on both machines, and the resources came up
> > on node debo. Everything was working fine...
> >
> > Then I removed the directory /tmp/a to force a monitor operation
> > failure. As expected, monitor failed on node debo:
> > crmd[20580]: 2007/01/03_17:26:23 info: process_lrm_event: LRM
> > operation (7) monitor_5000 on x_Dummy complete
> > Heartbeat stopped Dummy resource:
> > crmd[20580]: 2007/01/03_17:26:26 info: process_lrm_event: LRM
> > operation (9) stop_0 on x_Dummy complete
> >
> > At this point I would expect the Dummy resource on debo to get a
> > failure count, and heartbeat to stop IPaddr on debo and then start
> > both resources on node fico, because
> > default-resource-failure-stickiness is -INFINITY.
> >
> > Instead, heartbeat tried to start the Dummy resource again on debo:
> > crmd[20580]: 2007/01/03_17:26:26 info: do_lrm_rsc_op: Performing
> > op=x_Dummy_start_0 key=3:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> > But Dummy can no longer touch the file /tmp/a/b, so the start fails:
> > crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM
> > operation (10) start_0 on x_Dummy Error: (1) unknown error
> > crmd[20580]: 2007/01/03_17:26:29 info: do_lrm_rsc_op: Performing
> > op=x_Dummy_stop_0 key=4:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> >
> > Now I would expect heartbeat to finally move the resources to node
> > fico, but the cluster stayed in the following state:
> > IPaddr running on node debo, Dummy running nowhere.
> >
> > Attached you can find the logs, cibadmin outputs and ha.cf from this state.
> >
> > I tried the above procedure once again after restarting heartbeat on
> > both nodes (and recreating the /tmp/a directory), and the same thing
> > happened. Then I stopped heartbeat on node debo.
> > Now the resources moved to node fico, as expected. But to my surprise,
> > the virtual IP address 10.0.12 on interface eth0:0 remained active on
> > node debo. Snippet from the log:
> > crmd[28502]: 2007/01/03_18:07:39 ERROR: ghash_print_pending: Pending
> > action: x_IPaddrL:6
> > crmd[28502]: 2007/01/03_18:07:39 ERROR: lrm_get_all_rscs(615): failed
> > to receive a reply message of getall.
> > crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Performing A_EXIT_1 -
> > forcefully exiting the CRMd
> > crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Could not recover
> > from internal error
> > (the complete log of this situation is in file ha-log_ipaddr_bug)
> >
> > The heartbeat sources were taken from http://hg.linux-ha.org/dev,
> > changeset 9857 (latest commit on 12 Dec 2006 08:01:37 -0700).
> > I hope it is easily reproducible for you.
>
> grumble... you go on holidays and look what happens :-(
>
> i'll take a look at this tomorrow
>
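
The crmd excerpts quoted above were wrapped by the mail client, which
makes them awkward to grep. As a rough aid for anyone comparing logs
from a reproduction attempt, here is a minimal Python sketch that
strips the "> " quote prefixes, rejoins wrapped lines and picks out
the ERROR entries. The pattern is inferred only from the lines quoted
in this thread, parse_log and errors are hypothetical helper names
(not part of heartbeat), and the input is assumed to contain log lines
only, since any other text would be appended to the previous entry.

```python
import re

# Leading mail quote markers such as "> > ".
QUOTE_RE = re.compile(r"^(?:>\s*)+")

# Line layout inferred from the excerpts in this thread, e.g.
# "crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM ..."
# The optional "(615)" after the function name is matched but not kept.
LOG_RE = re.compile(
    r"^(?P<daemon>\w+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+):\s+"
    r"(?P<func>\w+)(?:\(\d+\))?:\s+"
    r"(?P<msg>.*)$"
)


def parse_log(text):
    """Return one dict per log entry, rejoining lines the mailer wrapped."""
    entries = []
    for raw in text.splitlines():
        line = QUOTE_RE.sub("", raw).strip()
        if not line:
            continue
        match = LOG_RE.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries:
            # Not a new entry: treat it as the wrapped tail of the last one.
            entries[-1]["msg"] += " " + line
    return entries


def errors(entries):
    """Keep only the entries logged at ERROR level."""
    return [e for e in entries if e["level"] == "ERROR"]


if __name__ == "__main__":
    sample = (
        "> > crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM\n"
        "> > operation (10) start_0 on x_Dummy Error: (1) unknown error\n"
    )
    for entry in errors(parse_log(sample)):
        print(entry["stamp"], entry["func"], entry["msg"])
```

Matching on the "daemon[pid]:" prefix to decide where an entry starts
is a simplification; a real ha-log file is not wrapped and could be
read line by line without the rejoining step.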

