[Linux-HA] strange monitor behaviour

vini.bill at gmail.com vini.bill at gmail.com
Wed Jan 3 12:34:33 MST 2007


I think there is a configuration issue, reported on both machines. The
ha-log shows this on line 50:

cib[23419]: 2007/01/03_17:20:30 WARN: crm_is_writable:
/var/lib/heartbeat/crm/cib.xml should be owned and r/w by group cluster
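
To check this on each node, a small Python sketch like the one below might
help. It is only my reading of the warning text, not what heartbeat itself
does internally: it assumes the file must belong to the named group and have
both group read and group write bits set. The function names are made up for
this example, and the group name is whatever the warning prints on your
system.

```python
import grp
import os
import stat


def describe_problems(st_mode, st_gid, expected_gid):
    """Compare a file's mode bits and gid with what the warning asks for.

    Assumption: "owned and r/w by group" means the file's gid matches the
    group and both the group-read and group-write bits are set.
    """
    problems = []
    if st_gid != expected_gid:
        problems.append(f"file gid is {st_gid}, expected {expected_gid}")
    group_rw = stat.S_IRGRP | stat.S_IWGRP
    if (st_mode & group_rw) != group_rw:
        problems.append("group does not have both read and write permission")
    return problems


def check_file(path, group_name):
    """Stat the file, look up the group by name, and list any problems."""
    # grp.getgrnam raises KeyError if the group does not exist.
    expected_gid = grp.getgrnam(group_name).gr_gid
    # os.stat raises FileNotFoundError if the file does not exist.
    st = os.stat(path)
    return describe_problems(st.st_mode, st.st_gid, expected_gid)
```

Calling check_file("/var/lib/heartbeat/crm/cib.xml", "cluster") on a node
should return an empty list if ownership and permissions match the warning's
requirement, and a list of short descriptions otherwise.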

Then, on line 62 of the ha-log from debo, you get this message:

crmd[20580]: 2007/01/03_17:20:28 WARN: cib_native_signon: Connection to CIB
failed: connection failed
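
To find lines like these without reading the whole log, a short Python sketch
along these lines could list every WARN and ERROR entry with its line number.
It assumes each entry sits on a single line in the real ha-log (the wrapping
above comes from the mail client) and follows the "daemon[pid]: timestamp
LEVEL: text" layout seen in this thread. The regular expression and the
function name are my own, not part of heartbeat.

```python
import re

# Layout assumed from the excerpts in this thread:
#   crmd[20580]: 2007/01/03_17:20:28 WARN: cib_native_signon: ...
LINE_RE = re.compile(
    r"^(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]: "
    r"(?P<stamp>\S+) (?P<level>[A-Za-z]+): (?P<text>.*)$"
)


def find_problems(lines, levels=("WARN", "ERROR")):
    """Return (line_number, daemon, level, text) for each matching entry."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group("level") in levels:
            hits.append(
                (number, match.group("daemon"),
                 match.group("level"), match.group("text"))
            )
    return hits
```

Passing an open log file to find_problems (a file object iterates over its
lines) gives the line numbers to quote when reporting a problem to the list.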

So I think it's a configuration issue, but I'm too new to HA and
heartbeat to say so with certainty.

... Vinicius Menezes ...

On 1/3/07, Pavol Gono <palo.gono at gmail.com> wrote:
>
> Hi
>
> I was doing some tests with my two-node configuration, focusing on
> the monitoring ability of heartbeat. The following description should
> reveal at least two bugs (or maybe configuration mistakes :)
>
> To keep it simple, I replaced my resources with IPaddr and Dummy,
> which should run on the same node. Dummy has its state file in
> /tmp/a/b. Both resources are monitored at 5-second intervals.
>
> First I started heartbeat on both machines, and the resources appeared
> on node debo. Everything worked fine...
>
> Then I removed the directory /tmp/a to force a monitor operation
> failure. As expected, the monitor failed on node debo:
> crmd[20580]: 2007/01/03_17:26:23 info: process_lrm_event: LRM
> operation (7) monitor_5000 on x_Dummy complete
> Heartbeat stopped the Dummy resource:
> crmd[20580]: 2007/01/03_17:26:26 info: process_lrm_event: LRM
> operation (9) stop_0 on x_Dummy complete
>
> Now I would expect the Dummy resource on debo to get a failure
> count, and heartbeat to stop IPaddr on debo and then start both
> resources on node fico, because default-resource-failure-stickiness
> is -INFINITY.
>
> Heartbeat then tried to start the Dummy resource again on debo:
> crmd[20580]: 2007/01/03_17:26:26 info: do_lrm_rsc_op: Performing
> op=x_Dummy_start_0 key=3:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> But now Dummy cannot touch the file /tmp/a/b, so the start fails:
> crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM
> operation (10) start_0 on x_Dummy Error: (1) unknown error
> crmd[20580]: 2007/01/03_17:26:29 info: do_lrm_rsc_op: Performing
> op=x_Dummy_stop_0 key=4:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
>
> So now I would expect heartbeat to really try to move the resources
> to node fico, but the cluster remained in the following state:
> IPaddr running on node debo, Dummy running nowhere.
>
> Attached you can find the logs, cibadmin outputs and ha.cf from this
> state.
>
> I tried the above procedure once again after restarting heartbeat on
> both nodes (and recreating the /tmp/a directory), and the same thing
> happened. Then I stopped heartbeat on node debo.
> Now the resources moved to node fico, as expected. But to my surprise,
> on node debo the virtual IP address 10.0.12 on interface eth0:0
> remained active. Snippet from the log:
> crmd[28502]: 2007/01/03_18:07:39 ERROR: ghash_print_pending: Pending
> action: x_IPaddrL:6
> crmd[28502]: 2007/01/03_18:07:39 ERROR: lrm_get_all_rscs(615): failed
> to receive a reply message of getall.
> crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Performing A_EXIT_1 -
> forcefully exiting the CRMd
> crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Could not recover
> from internal error
> (the complete log of this situation is in file ha-log_ipaddr_bug)
>
> The heartbeat sources were taken from http://hg.linux-ha.org/dev,
> changeset 9857 (latest commit on 12 Dec 2006 08:01:37 -0700).
> I hope it is easily reproducible for you.
>
> Palo

