[Linux-HA] strange monitor behaviour

Andrew Beekhof beekhof at gmail.com
Fri Jan 5 07:24:59 MST 2007


something is very very wrong with this installation:

heartbeat[5658]: 2007/01/04_23:36:23 info: Starting
"/usr/local/lib/heartbeat/crmd" as uid 103  gid 104 (pid 5658)
crmd[5662]: 2007/01/04_23:36:23 debug: crm_set_env_options:
HA_conn_logd_time = 60

these pid numbers are supposed to be the same.  this is causing no end
of trouble when heartbeat is shut down and does not bode well for the
health of the installation.
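
a rough way to look for this kind of mismatch in a log is sketched below. it assumes one log entry per line (not wrapped as in this mail) and the "Starting ... (pid N)" wording shown above; the function name find_pid_mismatches is made up for illustration and is not part of heartbeat.

```python
import re

# Matches heartbeat's announcement that it is starting a child, e.g.
#   ... info: Starting "/usr/local/lib/heartbeat/crmd" as uid 103  gid 104 (pid 5658)
START_RE = re.compile(
    r'Starting\s+"(?P<path>[^"]+)"\s+as uid\s+\d+\s+gid\s+\d+\s+\(pid\s+(?P<pid>\d+)\)'
)
# Matches the "name[pid]:" tag at the start of a log line, e.g. crmd[5662]:
TAG_RE = re.compile(r'^(?P<name>[\w.-]+)\[(?P<pid>\d+)\]:')


def find_pid_mismatches(lines):
    """Return (name, announced_pid, logged_pid) for each child whose first
    log line carries a different pid than the one heartbeat announced."""
    announced = {}   # child name -> pid announced by heartbeat
    mismatches = []
    for line in lines:
        start = START_RE.search(line)
        if start:
            # Use the last path component as the child's log name (assumption).
            name = start.group("path").rsplit("/", 1)[-1]
            announced[name] = int(start.group("pid"))
            continue
        tag = TAG_RE.match(line)
        if tag and tag.group("name") in announced:
            name = tag.group("name")
            expected = announced.pop(name)   # check each start only once
            logged = int(tag.group("pid"))
            if logged != expected:
                mismatches.append((name, expected, logged))
    return mismatches
```

fed the two log entries above (unwrapped), it reports crmd as announced with pid 5658 but logging as pid 5662.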

can you try running BasicSanityCheck (no options required) on both
nodes (when the cluster is shut down) and report the results please?

On 1/5/07, Pavol Gono <palo.gono at gmail.com> wrote:
> In the meantime I reinstalled heartbeat on fico and debo at changeset 9918
> (latest commit at Tue, 02 Jan 2007 11:33:56 +0100). I repeated the
> procedure and saved, hopefully, every possible bit of information. I
> also added some comments to the logs, to show better what is going on.
>
> The unsuccessful failover and the bug seen with IPaddr were reproduced; the
> strange attempt to start Dummy was not seen (the failure count of Dummy is
> now set).
>
> Palo
>
> On 1/4/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > you don't still have /var/lib/heartbeat/pengine/pe-input-1557.bz2 on
> > fico by any chance, do you?
> >
> > i've been running tests today without luck... any chance you could
> > reproduce it with "debug 1" in ha.cf please?
> >
> > On 1/4/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > On 1/3/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > > Hi
> > > >
> > > > I was doing some tests with my two-node configuration, focusing on
> > > > the monitoring ability of heartbeat. The following description should
> > > > reveal at least two bugs (or maybe configuration mistakes :)
> > > >
> > > > To keep it simple, I replaced my resources with IPaddr and Dummy,
> > > > which must run on the same node. Dummy has its state file in
> > > > /tmp/a/b. Both resources are monitored at 5-second intervals.
> > > >
> > > > First I started heartbeat on both machines, and the resources appeared
> > > > on node debo. Everything was working fine...
> > > >
> > > > Then I removed the directory /tmp/a to force a monitor operation
> > > > failure. As expected, the monitor failed on node debo:
> > > > crmd[20580]: 2007/01/03_17:26:23 info: process_lrm_event: LRM
> > > > operation (7) monitor_5000 on x_Dummy complete
> > > > Heartbeat stopped Dummy resource:
> > > > crmd[20580]: 2007/01/03_17:26:26 info: process_lrm_event: LRM
> > > > operation (9) stop_0 on x_Dummy complete
> > > >
> > > > Now I would expect that the Dummy resource on debo gets a failure
> > > > count, and that heartbeat tries to stop IPaddr on debo and then start
> > > > both resources on node fico, because
> > > > default-resource-failure-stickiness is -INFINITY.
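
The arithmetic behind that expectation can be sketched roughly as follows. This assumes a node's score is a base score plus failcount times the failure stickiness, and that any sum involving -INFINITY stays -INFINITY. The names node_score and add_scores and the value 1000000 for INFINITY are illustrative assumptions, not taken from the heartbeat sources.

```python
INFINITY = 1000000  # assumed magnitude treated as "infinite"


def add_scores(a, b):
    """Add two scores; -INFINITY dominates, then +INFINITY, else clamp."""
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    return max(-INFINITY, min(INFINITY, a + b))


def node_score(base, failcount, failure_stickiness):
    """Score of a resource on a node after `failcount` failures there
    (hypothetical model: base + failcount * failure_stickiness)."""
    penalty = failcount * failure_stickiness
    return add_scores(base, penalty)


# One failure on debo with failure stickiness -INFINITY: debo is excluded.
debo = node_score(base=0, failcount=1, failure_stickiness=-INFINITY)
# No failures on fico: its score is unaffected.
fico = node_score(base=0, failcount=0, failure_stickiness=-INFINITY)
```

Under these assumptions a single failure drives debo's score to -INFINITY while fico stays at its base score, which is why a move to fico would be expected.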
> > > >
> > > > Heartbeat then tried to start Dummy resource again on debo:
> > > > crmd[20580]: 2007/01/03_17:26:26 info: do_lrm_rsc_op: Performing
> > > > op=x_Dummy_start_0 key=3:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> > > > But now Dummy is not able to touch the file /tmp/a/b, so the start fails:
> > > > crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM
> > > > operation (10) start_0 on x_Dummy Error: (1) unknown error
> > > > crmd[20580]: 2007/01/03_17:26:29 info: do_lrm_rsc_op: Performing
> > > > op=x_Dummy_stop_0 key=4:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> > > >
> > > > So now I would expect heartbeat to really try to move the resources to
> > > > node fico, but it remained in the following state:
> > > > IPaddr running on node debo, Dummy running nowhere.
> > > >
> > > > In attachment you can find logs, cibadmin outputs and ha.cf from this
> > > > state.
> > > >
> > > > I tried the above procedure once again after restarting heartbeat on
> > > > both nodes (and recreating the /tmp/a directory), and the same thing
> > > > happened. Then I stopped heartbeat on node debo.
> > > > Now the resources moved to node fico, as expected. But to my surprise, on
> > > > node debo the virtual IP address 10.0.12 on interface eth0:0 remained
> > > > active. Snippet from the log:
> > > > crmd[28502]: 2007/01/03_18:07:39 ERROR: ghash_print_pending: Pending
> > > > action: x_IPaddrL:6
> > > > crmd[28502]: 2007/01/03_18:07:39 ERROR: lrm_get_all_rscs(615): failed
> > > > to receive a reply message of getall.
> > > > crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Performing A_EXIT_1 -
> > > > forcefully exiting the CRMd
> > > > crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Could not recover
> > > > from internal error
> > > > (the complete log of this situation is in file ha-log_ipaddr_bug)
> > > >
> > > > Sources of heartbeat were taken from http://hg.linux-ha.org/dev
> > > > changeset 9857 (latest commit at 12 Dec 2006 08:01:37 -0700).
> > > > I hope it is easily reproducible for you.
> > >
> > > grumble... you go on holidays and look what happens :-(
> > >
> > > i'll take a look at this tomorrow
> > >
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA at lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >