[Linux-HA] strange monitor behaviour
Andrew Beekhof
beekhof at gmail.com
Wed Jan 3 17:20:01 MST 2007
On 1/3/07, Pavol Gono <palo.gono at gmail.com> wrote:
> Hi
>
> I was running some tests on my two-node configuration, focusing on
> heartbeat's monitoring ability. The following description should
> reveal at least two bugs (or maybe configuration mistakes :)
>
> To keep it simple, I replaced my resources with IPaddr and Dummy,
> which should run on the same node. Dummy keeps its state file in
> /tmp/a/b. Both resources are monitored at 5-second intervals.
>
> First I started heartbeat on both machines, resources appeared on node
> debo. Everything working fine...
>
> Then I removed the directory /tmp/a to force the monitor operation
> to fail. As expected, the monitor failed on node debo:
> crmd[20580]: 2007/01/03_17:26:23 info: process_lrm_event: LRM
> operation (7) monitor_5000 on x_Dummy complete
> Heartbeat stopped Dummy resource:
> crmd[20580]: 2007/01/03_17:26:26 info: process_lrm_event: LRM
> operation (9) stop_0 on x_Dummy complete
>
> Now I would expect the Dummy resource on debo to get a failure
> count, and heartbeat to stop IPaddr on debo and then start both
> resources on node fico, because
> default-resource-failure-stickiness is -INFINITY.
>
> Heartbeat then tried to start Dummy resource again on debo:
> crmd[20580]: 2007/01/03_17:26:26 info: do_lrm_rsc_op: Performing
> op=x_Dummy_start_0 key=3:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> But Dummy can no longer touch the file /tmp/a/b, so the start fails:
> crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM
> operation (10) start_0 on x_Dummy Error: (1) unknown error
> crmd[20580]: 2007/01/03_17:26:29 info: do_lrm_rsc_op: Performing
> op=x_Dummy_stop_0 key=4:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
>
> So now I would expect heartbeat to really try to move the resources
> to node fico, but it remained in the following state:
> IPaddr running on node debo, Dummy running nowhere.
>
> Attached you can find the logs, cibadmin outputs and ha.cf from this state.
>
> I tried the above procedure once more after restarting heartbeat on
> both nodes (and recreating the /tmp/a directory), and the same thing
> happened. Then I stopped heartbeat on node debo.
> Now the resources moved to node fico, as expected. But to my
> surprise, the virtual IP address 10.0.12 on interface eth0:0
> remained active on node debo. Snippet from the log:
> crmd[28502]: 2007/01/03_18:07:39 ERROR: ghash_print_pending: Pending
> action: x_IPaddrL:6
> crmd[28502]: 2007/01/03_18:07:39 ERROR: lrm_get_all_rscs(615): failed
> to receive a reply message of getall.
> crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Performing A_EXIT_1 -
> forcefully exiting the CRMd
> crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Could not recover
> from internal error
> (The complete log of this situation is in the file ha-log_ipaddr_bug.)
>
> The heartbeat sources were taken from http://hg.linux-ha.org/dev
> changeset 9857 (latest commit on 12 Dec 2006 08:01:37 -0700).
> I hope it is easily reproducible for you.
Grumble... you go on holidays and look what happens :-(
I'll take a look at this tomorrow.
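
For anyone following along, the state-file part of the report (monitor fails once /tmp/a is gone, and the restart then fails because /tmp/a/b cannot be created) can be sketched roughly as below. This is only a Python stand-in written for illustration, not the actual Dummy agent: the function names and the reproduce() helper are hypothetical, a temporary directory replaces /tmp/a, and the return values assume the usual OCF codes (0 success, 1 generic error, 7 not running). It says nothing about how the CRM reacts to those results, which is the open question in this thread.

```python
import os
import shutil
import tempfile

# Assumed OCF-style return codes (illustrative constants, not imported from heartbeat).
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def start(state_file):
    """Mark the resource as running by creating (touching) the state file."""
    try:
        with open(state_file, "a"):
            pass
    except OSError:
        # Parent directory is missing, so the file cannot be created.
        return OCF_ERR_GENERIC
    return OCF_SUCCESS


def monitor(state_file):
    """The resource counts as running only while the state file exists."""
    return OCF_SUCCESS if os.path.isfile(state_file) else OCF_NOT_RUNNING


def stop(state_file):
    """Remove the state file; an already-missing file is not an error."""
    try:
        os.remove(state_file)
    except FileNotFoundError:
        pass
    return OCF_SUCCESS


def reproduce():
    """Replay the sequence from the report against a temporary directory."""
    base = tempfile.mkdtemp()
    state_dir = os.path.join(base, "a")      # stands in for /tmp/a
    state_file = os.path.join(state_dir, "b")  # stands in for /tmp/a/b
    os.mkdir(state_dir)
    results = [("start", start(state_file)), ("monitor", monitor(state_file))]
    shutil.rmtree(state_dir)                 # the "rm -r /tmp/a" step
    results.append(("monitor", monitor(state_file)))
    results.append(("stop", stop(state_file)))
    results.append(("start", start(state_file)))
    shutil.rmtree(base)
    return results


if __name__ == "__main__":
    for op, rc in reproduce():
        print(op, rc)
```

Under these assumptions the final start returns 1, which lines up with the "start_0 on x_Dummy Error: (1) unknown error" entry in the quoted log.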