[Linux-HA] strange monitor behaviour
Pavol Gono
palo.gono at gmail.com
Wed Jan 3 11:10:35 MST 2007
Hi
I was running some tests on my two-node configuration, focusing on
the monitoring ability of heartbeat. The following description should
reveal at least two bugs (or maybe configuration mistakes :)
To keep it simple, I replaced my resources with IPaddr and Dummy,
which should run on the same node. Dummy keeps its state file in
/tmp/a/b. Both resources are monitored at 5-second intervals.
First I started heartbeat on both machines, and the resources came up
on node debo. Everything worked fine...
Then I removed the directory /tmp/a to make the monitor operation
fail. As expected, the monitor failed on node debo:
crmd[20580]: 2007/01/03_17:26:23 info: process_lrm_event: LRM
operation (7) monitor_5000 on x_Dummy complete
Heartbeat stopped the Dummy resource:
crmd[20580]: 2007/01/03_17:26:26 info: process_lrm_event: LRM
operation (9) stop_0 on x_Dummy complete
Now I would expect the Dummy resource on debo to get a failure count,
and heartbeat to stop IPaddr on debo and then start both resources on
node fico, because default-resource-failure-stickiness is -INFINITY.
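To make that expectation concrete, here is a minimal sketch of the
scoring I have in mind. It is only a model of what I expected, not
heartbeat's actual code: the function names, the base scores and the
numeric value used for INFINITY are all assumptions for illustration.

```python
# Sketch of the expected placement logic (assumptions, not CRM source code).
INFINITY = 1000000  # assumption: scores at or beyond this magnitude count as infinite


def clamp(score):
    """Keep a score inside the range [-INFINITY, +INFINITY]."""
    return max(-INFINITY, min(INFINITY, score))


def add_scores(a, b):
    """Add two scores; -INFINITY wins over everything, then +INFINITY."""
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    return clamp(a + b)


def effective_score(base_score, fail_count, failure_stickiness):
    """Node preference after applying fail_count * failure_stickiness."""
    penalty = clamp(fail_count * failure_stickiness)
    return add_scores(base_score, penalty)


def pick_node(scores):
    """Return the best node that is not excluded by -INFINITY, or None."""
    eligible = {node: s for node, s in scores.items() if s > -INFINITY}
    return max(eligible, key=eligible.get) if eligible else None


# One failure on debo with failure stickiness -INFINITY (base scores are made up).
scores = {
    "debo": effective_score(100, 1, -INFINITY),
    "fico": effective_score(0, 0, -INFINITY),
}
chosen = pick_node(scores)
```

Under these assumptions a single failure on debo is enough to exclude
it, so fico is the only node left for the resource.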
Instead, heartbeat tried to start the Dummy resource on debo again:
crmd[20580]: 2007/01/03_17:26:26 info: do_lrm_rsc_op: Performing
op=x_Dummy_start_0 key=3:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
But now Dummy cannot touch the file /tmp/a/b, so the start fails:
crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM
operation (10) start_0 on x_Dummy Error: (1) unknown error
crmd[20580]: 2007/01/03_17:26:29 info: do_lrm_rsc_op: Performing
op=x_Dummy_stop_0 key=4:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
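For anyone who wants to follow the sequence of operations in the full
log, here is a small sketch that pulls the process_lrm_event entries
out of crmd log lines. The regular expression is derived only from the
lines quoted in this mail, so treat the format as an assumption; the
helper names are made up, and the input is assumed to contain only log
lines (with the mail wrapping still in place).

```python
import re

# Pattern inferred from the log lines quoted above (an assumption, not a spec).
LRM_EVENT = re.compile(
    r"crmd\[(?P<pid>\d+)\]: (?P<ts>\S+) (?P<level>\w+): process_lrm_event: "
    r"LRM operation \((?P<call_id>\d+)\) (?P<op>\S+) on (?P<rsc>\S+) "
    r"(?P<outcome>complete|Error: .*)$"
)


def unwrap(lines):
    """Re-join log entries that the mail client wrapped onto a second line."""
    joined = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("crmd[") or not joined:
            joined.append(line)
        else:
            joined[-1] += " " + line
    return joined


def parse_events(text):
    """Return (call_id, operation, resource, completed) per process_lrm_event entry."""
    events = []
    for line in unwrap(text.splitlines()):
        m = LRM_EVENT.match(line)
        if m:
            completed = m.group("outcome") == "complete"
            events.append((int(m.group("call_id")), m.group("op"),
                           m.group("rsc"), completed))
    return events


SAMPLE = """\
crmd[20580]: 2007/01/03_17:26:23 info: process_lrm_event: LRM
operation (7) monitor_5000 on x_Dummy complete
crmd[20580]: 2007/01/03_17:26:26 info: process_lrm_event: LRM
operation (9) stop_0 on x_Dummy complete
crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM
operation (10) start_0 on x_Dummy Error: (1) unknown error
"""

events = parse_events(SAMPLE)
for call_id, op, rsc, completed in events:
    print(call_id, op, rsc, "complete" if completed else "error")
```

Note that "complete" here only mirrors the wording in the log; it says
that the operation finished, not that the resource was healthy.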
So now I would expect heartbeat to really try to move the resources to
node fico, but it remained in the following state:
IPaddr running on node debo, Dummy running nowhere.
In the attachment you can find the logs, cibadmin outputs and ha.cf from this state.
I tried the above procedure once again after restarting heartbeat on
both nodes (and recreating the /tmp/a directory), and the same thing
happened. Then I stopped heartbeat on node debo.
Now the resources moved to node fico, as expected. But to my surprise,
the virtual IP address 10.0.12 on interface eth0:0 remained active on
node debo. Snippet from the log:
crmd[28502]: 2007/01/03_18:07:39 ERROR: ghash_print_pending: Pending
action: x_IPaddrL:6
crmd[28502]: 2007/01/03_18:07:39 ERROR: lrm_get_all_rscs(615): failed
to receive a reply message of getall.
crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Performing A_EXIT_1 -
forcefully exiting the CRMd
crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Could not recover
from internal error
(the complete log of this situation is in file ha-log_ipaddr_bug)
The heartbeat sources were taken from http://hg.linux-ha.org/dev,
changeset 9857 (latest commit on 12 Dec 2006 08:01:37 -0700).
I hope it is easily reproducible for you.
Palo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strange_monitor_behaviour.tar.bz2
Type: application/x-bzip2
Size: 13160 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20070103/27852d96/strange_monitor_behaviour.tar.bin