[Linux-HA] strange monitor behaviour

Pavol Gono palo.gono at gmail.com
Fri Jan 5 09:25:24 MST 2007


Attached is the log from fico.
The only difference between the two installations is the beginning of
the configure options (debo runs Debian, fico runs Gentoo):
./configure --with-group-name=cluster --with-ccmuser-name=cluster
--with-group-id=65 --with-ccmuser-id=65 "CFLAGS=-fno-unit-at-a-time -g
-O0" ...

Palo

On 1/5/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> something is very very wrong with this installation:
>
> heartbeat[5658]: 2007/01/04_23:36:23 info: Starting
> "/usr/local/lib/heartbeat/crmd" as uid 103  gid 104 (pid 5658)
> crmd[5662]: 2007/01/04_23:36:23 debug: crm_set_env_options:
> HA_conn_logd_time = 60
>
> these pid numbers are supposed to be the same.  this is causing no end
> of trouble when heartbeat is shut down and does not bode well for the
> health of the installation.
>
> can you try running BasicSanityCheck (no options required) on both
> nodes (when the cluster is shut down) and report the results please?
>
> On 1/5/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > In the meantime I reinstalled heartbeat on fico and debo at changeset
> > 9918 (latest commit at Tue, 02 Jan 2007 11:33:56 +0100). I repeated the
> > procedure and saved, hopefully, every possible bit of information. I
> > also added some comments to the logs to show better what is going on.
> >
> > The unsuccessful failover and the bug seen with IPaddr were reproduced;
> > the strange attempt to start Dummy again was not seen (the failure count
> > of Dummy is now set).
> >
> > Palo
> >
> > On 1/4/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > you don't still have /var/lib/heartbeat/pengine/pe-input-1557.bz2 on
> > > fico by any chance do you?
> > >
> > > i've been running tests today without luck... any chance you could
> > > reproduce it with "debug 1" in ha.cf please?
> > >
> > > On 1/4/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > > On 1/3/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > > > Hi
> > > > >
> > > > > > I was doing some tests with my two-node configuration, focusing on
> > > > > > the monitoring ability of heartbeat. The following description should
> > > > > > reveal at least two bugs (or maybe configuration mistakes :)
> > > > >
> > > > > > To keep it simple, I replaced my resources with IPaddr and Dummy,
> > > > > > which must run on the same node. Dummy has its state file in
> > > > > > /tmp/a/b. Both resources are monitored at 5-second intervals.
> > > > >
> > > > > > First I started heartbeat on both machines; the resources came up on
> > > > > > node debo. Everything was working fine...
> > > > > >
> > > > > > Then I removed the directory /tmp/a to force a monitor operation
> > > > > > failure. As expected, the monitor failed on node debo:
> > > > > crmd[20580]: 2007/01/03_17:26:23 info: process_lrm_event: LRM
> > > > > operation (7) monitor_5000 on x_Dummy complete
> > > > > Heartbeat stopped Dummy resource:
> > > > > crmd[20580]: 2007/01/03_17:26:26 info: process_lrm_event: LRM
> > > > > operation (9) stop_0 on x_Dummy complete
> > > > >
> > > > > > At this point I would expect the Dummy resource on debo to get a
> > > > > > failure count, and heartbeat to stop IPaddr on debo and then start
> > > > > > both resources on node fico, because
> > > > > > default-resource-failure-stickiness is -INFINITY.
> > > > >
> > > > > Heartbeat then tried to start Dummy resource again on debo:
> > > > > crmd[20580]: 2007/01/03_17:26:26 info: do_lrm_rsc_op: Performing
> > > > > op=x_Dummy_start_0 key=3:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> > > > > > But now Dummy is not able to touch file /tmp/a/b, start is
> > > > > > unsuccessful:
> > > > > crmd[20580]: 2007/01/03_17:26:28 ERROR: process_lrm_event: LRM
> > > > > operation (10) start_0 on x_Dummy Error: (1) unknown error
> > > > > crmd[20580]: 2007/01/03_17:26:29 info: do_lrm_rsc_op: Performing
> > > > > op=x_Dummy_stop_0 key=4:80d7e03b-06a7-4583-b3e3-a9bf755cc5af)
> > > > >
> > > > > > So now I would expect heartbeat to really try to move the resources
> > > > > > to node fico, but it remained in the following state:
> > > > > > IPaddr running on node debo, Dummy running nowhere.
> > > > > >
> > > > > > In the attachment you can find logs, cibadmin outputs and ha.cf from
> > > > > > this state.
> > > > >
> > > > > > I repeated the above procedure after restarting heartbeat on both
> > > > > > nodes (and recreating the /tmp/a directory), and the same thing
> > > > > > happened. Then I stopped heartbeat on node debo.
> > > > > > Now the resources moved to node fico, as expected. But to my
> > > > > > surprise, the virtual IP address 10.0.12 on interface eth0:0
> > > > > > remained active on node debo. Snippet from the log:
> > > > > crmd[28502]: 2007/01/03_18:07:39 ERROR: ghash_print_pending: Pending
> > > > > action: x_IPaddrL:6
> > > > > > crmd[28502]: 2007/01/03_18:07:39 ERROR: lrm_get_all_rscs(615): failed
> > > > > > to receive a reply message of getall.
> > > > > > crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Performing A_EXIT_1 -
> > > > > > forcefully exiting the CRMd
> > > > > crmd[28502]: 2007/01/03_18:07:39 ERROR: do_exit: Could not recover
> > > > > from internal error
> > > > > (the complete log of this situation is in file ha-log_ipaddr_bug)
> > > > >
> > > > > Sources of heartbeat were taken from http://hg.linux-ha.org/dev
> > > > > changeset 9857 (latest commit at 12 Dec 2006 08:01:37 -0700).
> > > > > > I hope it is easily reproducible for you.
> > > >
> > > > grumble... you go on holidays and look what happens :-(
> > > >
> > > > i'll take a look at this tomorrow
> > > >
> > > _______________________________________________
> > > Linux-HA mailing list
> > > Linux-HA at lists.linux-ha.org
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
> > >
>
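The pid mismatch pointed out in the quoted log lines (heartbeat reports starting crmd as pid 5658, while crmd itself logs as crmd[5662]) can be checked mechanically. Below is a minimal sketch in Python, assuming the log format seen in this thread and unwrapped log lines (one entry per line, unlike the mail-wrapped quotes above). The function name and the regular expressions are illustrative only and are not part of heartbeat.

```python
import re

# e.g.: heartbeat[5658]: ... info: Starting "/path/to/crmd" as uid 103  gid 104 (pid 5658)
START_RE = re.compile(
    r'Starting\s+"(?P<path>[^"]+)"\s+as uid \d+\s+gid \d+\s+\(pid (?P<pid>\d+)\)'
)
# e.g.: crmd[5662]: 2007/01/04_23:36:23 debug: ...
PREFIX_RE = re.compile(r'^(?P<name>[\w.-]+)\[(?P<pid>\d+)\]:')


def find_pid_mismatches(lines):
    """Return sorted (daemon, announced_pid, logged_pid) tuples that disagree."""
    announced = {}      # daemon name -> pid heartbeat said it started
    mismatches = set()
    for line in lines:
        start = START_RE.search(line)
        if start:
            name = start.group("path").rsplit("/", 1)[-1]
            announced[name] = int(start.group("pid"))
            continue
        prefix = PREFIX_RE.match(line)
        if prefix and prefix.group("name") in announced:
            name = prefix.group("name")
            pid = int(prefix.group("pid"))
            if pid != announced[name]:
                mismatches.add((name, announced[name], pid))
    return sorted(mismatches)


if __name__ == "__main__":
    sample = [
        'heartbeat[5658]: 2007/01/04_23:36:23 info: Starting '
        '"/usr/local/lib/heartbeat/crmd" as uid 103  gid 104 (pid 5658)',
        'crmd[5662]: 2007/01/04_23:36:23 debug: crm_set_env_options: '
        'HA_conn_logd_time = 60',
    ]
    for name, announced_pid, logged_pid in find_pid_mismatches(sample):
        print(f"{name}: started as pid {announced_pid}, logs as pid {logged_pid}")
```

Running the same check over the ha-log of both nodes would show whether only one of the two installations is affected.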
-------------- next part --------------
A non-text attachment was scrubbed...
Name: linux-ha.testlog_fico.bz2
Type: application/x-bzip2
Size: 29871 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20070105/a80969d2/linux-ha.testlog_fico-0001.bin