[Linux-HA] strange monitor behaviour
Pavol Gono
palo.gono at gmail.com
Thu Jan 4 16:39:32 MST 2007
In the meantime I reinstalled heartbeat on fico and debo at changeset 9918
(latest commit on Tue, 02 Jan 2007 11:33:56 +0100). I repeated the
procedure and saved, I hope, every possible bit of information. I also
added some comments to the logs, to make it easier to see what is going on.
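For anyone reading the attached logs, the process_lrm_event lines can be
pulled out with a small script. The following is only a sketch in Python,
assuming each log entry sits on a single line in the log file (the lines
quoted in this mail are wrapped by the mail client). The names LrmEvent and
parse_lrm_event are made up for illustration and are not part of heartbeat.

```python
import re
from typing import NamedTuple, Optional


class LrmEvent(NamedTuple):
    """One process_lrm_event entry (hypothetical helper type)."""
    pid: int
    timestamp: str
    level: str
    op_id: int
    operation: str
    resource: str
    outcome: str


# Matches lines of the form quoted in this thread, e.g.
# crmd[20580]: 2007/01/03_17:26:23 info: process_lrm_event: LRM operation (7) monitor_5000 on x_Dummy complete
LRM_EVENT = re.compile(
    r"crmd\[(?P<pid>\d+)\]: (?P<ts>\S+) (?P<level>\w+): process_lrm_event: "
    r"LRM operation \((?P<op_id>\d+)\) (?P<op>\S+) on (?P<rsc>\S+) (?P<outcome>.+)"
)


def parse_lrm_event(line: str) -> Optional[LrmEvent]:
    """Return the parsed event, or None if the line is not an LRM event."""
    # Collapse whitespace so that a mail-wrapped entry still matches.
    match = LRM_EVENT.search(" ".join(line.split()))
    if match is None:
        return None
    return LrmEvent(
        pid=int(match["pid"]),
        timestamp=match["ts"],
        level=match["level"],
        op_id=int(match["op_id"]),
        operation=match["op"],
        resource=match["rsc"],
        outcome=match["outcome"],
    )
```

Filtering the parsed events on level == "ERROR" or on a resource name such
as x_Dummy gives the order of monitor, stop and start operations per node.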
The unsuccessful failover and the bug seen with IPaddr were both reproduced.
The strange attempt to start Dummy again was not seen this time (the failure
count of Dummy is now set).
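Why removing the state directory breaks both the monitor and the following
start can be modelled in a few lines. This is a simplified Python model of
the Dummy agent's state-file logic, not the real agent (which is a shell
script). The class name DummySketch, the reproduce() helper and the use of a
temporary directory in place of /tmp are assumptions for illustration. The
return codes follow the OCF convention (0 success, 1 generic error, 7 not
running); that the monitor reported 7 here is an assumption, since the log
line only says the operation completed.

```python
import os
import shutil
import tempfile

# OCF return codes used by resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


class DummySketch:
    """Simplified model: the resource is 'running' while its state file exists."""

    def __init__(self, state_file):
        self.state_file = state_file

    def start(self):
        # Equivalent of touching the state file; fails if the directory is gone.
        try:
            with open(self.state_file, "a"):
                pass
        except OSError:
            return OCF_ERR_GENERIC
        return OCF_SUCCESS

    def monitor(self):
        if os.path.isfile(self.state_file):
            return OCF_SUCCESS
        return OCF_NOT_RUNNING

    def stop(self):
        # Stopping an already stopped resource still counts as success.
        try:
            os.remove(self.state_file)
        except FileNotFoundError:
            pass
        return OCF_SUCCESS


def reproduce():
    """Replay the test: start, monitor, remove the directory, monitor, start."""
    base = tempfile.mkdtemp()
    try:
        state_dir = os.path.join(base, "a")  # stands in for /tmp/a
        os.mkdir(state_dir)
        dummy = DummySketch(os.path.join(state_dir, "b"))
        started = dummy.start()
        healthy = dummy.monitor()
        shutil.rmtree(state_dir)  # the step that triggers the failure
        failed_monitor = dummy.monitor()
        dummy.stop()
        restart = dummy.start()
        return started, healthy, failed_monitor, restart
    finally:
        shutil.rmtree(base, ignore_errors=True)
```

In this model the restart on the same node cannot succeed until the
directory is recreated, which matches the start_0 error (1) in the logs.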
Palo
On 1/4/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> you don't still have /var/lib/heartbeat/pengine/pe-input-1557.bz2 on
> fico by any chance, do you?
>
> i've been running tests today without luck... any chance you could
> reproduce it with "debug 1" in ha.cf please?
>
> On 1/4/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > On 1/3/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > Hi
> > >
> > > I was doing some tests with my two-node configuration, focusing on
> > > the monitoring ability of heartbeat. The following description should
> > > reveal at least two bugs (or maybe configuration mistakes :)
> > >
> > > To keep it simple, I replaced my resources with IPaddr and Dummy,
> > > which should run on the same node. Dummy has its state file in
> > > /tmp/a/b. Both resources are monitored at 5-second intervals.
> > >
> > > First I started heartbeat on both machines, and the resources appeared
> > > on node debo. Everything was working fine...
> > >
> > > Then I removed the directory /tmp/a to force a monitor operation
> > > failure. As expected, the monitor failed on node debo:
> > > crmd[20580]: 2007/01/03_17:26:23 info: process_lrm_event: LRM
> > > operation (7) monitor_5000 on x_Dummy complete
> > > Heartbeat stopped the Dummy resource:
> > > crmd[20580]: 2007/01/03_17:26:26 info: process_lrm_event: LRM
> > > operation (9) stop_0 on x_Dummy complete
> > >
> > > Now I would expect the Dummy resource on debo to get a failure
> > > count, and heartbeat to stop IPaddr on debo and then start both
> > > resources on node fico, because default-resource-failure-stickiness
> > > is -INFINITY.
> > >
> > > Heartbeat then tried to start the Dummy resource again on debo:
> > > crmd[20580]: 2007/01/03_17:26:26 info: do_lrm_rsc_op: Performing
> > > op=x_Dummy_start_0 key=3:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> > > But now Dummy cannot touch the file /tmp/a/b, so the start fails:
> > > crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM
> > > operation (10) start_0 on x_Dummy Error: (1) unknown error
> > > crmd[20580]: 2007/01/03_17:26:29 info: do_lrm_rsc_op: Performing
> > > op=x_Dummy_stop_0 key=4:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> > >
> > > So now I would expect heartbeat to really try to move the resources
> > > to node fico, but it remained in the following state:
> > > IPaddr running on node debo, Dummy running nowhere.
> > >
> > > In the attachment you can find the logs, cibadmin outputs and ha.cf
> > > from this state.
> > >
> > > I tried the above procedure once again after restarting heartbeat on
> > > both nodes (and recreating the /tmp/a directory), and the same thing
> > > happened. Then I stopped heartbeat on node debo.
> > > Now the resources moved to node fico, as expected. But to my surprise,
> > > the virtual IP address 10.0.12 on interface eth0:0 remained active on
> > > node debo. Snippet from the log:
> > > crmd[28502]: 2007/01/03_18:07:39 ERROR: ghash_print_pending: Pending
> > > action: x_IPaddrL:6
> > > crmd[28502]: 2007/01/03_18:07:39 ERROR: lrm_get_all_rscs(615): failed
> > > to receive a reply message of getall.
> > > crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Performing A_EXIT_1 -
> > > forcefully exiting the CRMd
> > > crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Could not recover
> > > from internal error
> > > (the complete log of this situation is in file ha-log_ipaddr_bug)
> > >
> > > The heartbeat sources were taken from http://hg.linux-ha.org/dev
> > > changeset 9857 (latest commit on 12 Dec 2006 08:01:37 -0700).
> > > I hope it is easily reproducible for you.
> >
> > grumble... you go on holidays and look what happens :-(
> >
> > i'll take a look at this tomorrow
> >
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strange_monitor_behaviour3.tar.bz2
Type: application/x-bzip2
Size: 24835 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20070105/e31e8038/strange_monitor_behaviour3.tar-0001.bin