[Linux-HA] Message hist queue is filling up

Matt Wilder grewaru at gmail.com
Wed Dec 6 10:22:37 MST 2006


How do I go about disabling resource management without shutting down
my services?

On 12/6/06, Andreas Kurz <akurz at sms.at> wrote:
> ha at ew.nsci.us wrote:
> > On Tue, 5 Dec 2006, Matt Wilder wrote:
> >
> >> Greetings,
> >>
> >> I applied the patch pointed to above with no issue.  I have installed
> >> the patched version and restarted heartbeat on both nodes and the 99%
> >> cpu issue appears to be gone.  However, I am still getting the
> >> following messages in syslog and it seems as if resource handover isn't
> >> working quite right.  Can anyone point me to what these messages mean?
> >> I can provide more logs if necessary.
> >>
> >
> > I am posting a me-too.  We had the same problem with a node doing this
> > and have not found a resolution.  The node ran out of disk space and
> > hung. Ultimately I ripped out anything heartbeat related I could find
> > and deleted anything that was left which might be heartbeat related on
> > that node.  Next I removed the 2.0.5 rpm, and reinstalled with 2.0.7.
> > After reinstalling, we had the same error and the node would not see the
> > cluster.  The only thing I can think of is to stop the entire cluster,
> > upgrade to 2.0.7, and start again.  Unfortunately we have not had a
> > moment to restart the cluster to do this over the past month or so; the
> > node with problems is still offline.
>
> How about disabling resource management? This should allow a restart of
> heartbeat without interrupting your services.
>
> Regards,
> Andreas
>
> > Originally the entire cluster was
> > 2.0.5.  Now the cluster is all 2.0.5 except for the node which was
> > having trouble, which is now 2.0.7 'cause yum installed the latest
> > version (FC5).
> >
> > Any thoughts?
> >
> > -Eric
> >
> >
> >> Thanks.
> >>
> >> Primary Node (active):
> >> Dec  5 12:25:49 glider1 lrmd: [886]: WARN: G_SIG_dispatch: Dispatch
> >> function for SIGCHLD was delayed 1000 ms (> 100 ms) before being
> >> called (GSource: 0x522418)
> >> Dec  5 12:25:49 glider1 crmd: [888]: WARN:
> >> do_dc_join_finalize:join_dc.c join-2: We are still in a transition.
> >> Delaying until the TE completes.
> >> Dec  5 12:25:49 glider1 crmd: [888]: WARN:
> >> do_dc_join_finalize:join_dc.c join-2: We are still in a transition.
> >> Delaying until the TE completes.
> >> Dec  5 12:25:51 glider1 tengine: [899]: notice: run_graph:graph.c
> >> Transition 1: (Complete=18, Pending=0, Fired=0, Skipped=2,
> >> Incomplete=0)
> >> Dec  5 12:29:52 glider1 heartbeat: [837]: ERROR: Message hist queue is
> >> filling up (151 messages in queue)
> >> Dec  5 12:29:54 glider1 heartbeat: [837]: ERROR: Message hist queue is
> >> filling up (152 messages in queue)
> >> Dec  5 12:29:56 glider1 heartbeat: [837]: ERROR: Message hist queue is
> >> filling up (153 messages in queue)
> >> Dec  5 12:29:58 glider1 heartbeat: [837]: ERROR: Message hist queue is
> >> filling up (154 messages in queue)
> >>
> >> Secondary node:
> >> Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> >> packet: node glider1.domainit.com seq 135
> >> Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> >> packet: node glider1.domainit.com seq 135
> >> Dec  5 12:30:18 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> >> packet: node glider1.domainit.com seq 143
> >> Dec  5 12:30:28 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> >> packet: node glider1.domainit.com seq 148
> >> Dec  5 12:30:34 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> >> packet: node glider1.domainit.com seq 151
> >> Dec  5 12:30:39 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> >> packet: node glider1.domainit.com seq 153
> >>
> >>
> >>
> >> On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
> >>> I will look into this, as I am also having the 99% cpu issue.
> >>>
> >>> Any idea whether this will make it into a release?
> >>>
> >>>
> >>> On 11/30/06, Oren Nechushtan <oren at forescout.com> wrote:
> >>> > Hi,
> >>> > We've encountered something like that in the past.
> >>> > Check out the messages titled "[Linux-HA] RE: 99% CPU heartbeat &
> >>> > rexmit (seqno too low)"
> >>> > from September 2006. The (unofficial) patch there solved it for us,
> >>> > though by now it may require minor changes.
> >>> >
> >>> > Best,
> >>> > Oren.
> >>> >
> >>> > > -----Original Message-----
> >>> > > From: linux-ha-bounces at lists.linux-ha.org
> >>> > > [mailto:linux-ha-bounces at lists.linux-ha.org]On Behalf Of Matt Wilder
> >>> > > Sent: Thursday, November 30, 2006 8:03 PM
> >>> > > To: General Linux-HA mailing list
> >>> > > Subject: Re: [Linux-HA] Message hist queue is filling up
> >>> > >
> >>> > >
> >>> > > What would cause this to happen?  There are no network connectivity
> >>> > > issues between the two nodes.
> >>> > >
> >>> > > On 11/30/06, Serge Dubrouski <sergeyfd at gmail.com> wrote:
> >>> > > > Lost packets between nodes in cluster.
> >>> > > >
> >>> > > > On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
> >>> > > > > Can anyone tell me the cause of the following messages showing up
> >>> > > > > in syslog from heartbeat?  I have checked network connectivity
> >>> > > > > between the two machines in my cluster and everything looks fine.
> >>> > > > > These messages are occurring on a semi-frequent basis and do not
> >>> > > > > seem to be stopping.
> >>> > > > >
> >>> > > > > Node1 syslog (currently serving all resources):
> >>> > > > > Nov 28 18:06:36 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (196 messages in queue)
> >>> > > > > Nov 28 18:06:38 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (197 messages in queue)
> >>> > > > > Nov 28 18:06:40 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (198 messages in queue)
> >>> > > > > Nov 28 18:06:42 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (199 messages in queue)
> >>> > > > > Nov 28 18:06:44 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)
> >>> > > > > Nov 28 18:06:50 glider1 last message repeated 3 times
> >>> > > > > Nov 28 18:06:50 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt 614508 for glider2.domainit.com: seqno too low
> >>> > > > > Nov 28 18:06:52 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)
> >>> > > > > Nov 28 18:06:56 glider1 last message repeated 2 times
> >>> > > > > Nov 28 18:06:56 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt 614511 for glider2.domainit.com: seqno too low
> >>> > > > > Nov 28 18:06:58 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)
> >>> > > > > Nov 28 18:07:06 glider1 last message repeated 4 times
> >>> > > > >
> >>> > > > >
> >>> > > > > Node2 syslog:
> >>> > > > > Nov 28 18:05:05 glider2 heartbeat: [568]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 614508
> >>> > > > > Nov 28 18:05:11 glider2 heartbeat: [568]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 614511
> >>> > > > > _______________________________________________
> >>> > > > > Linux-HA mailing list
> >>> > > > > Linux-HA at lists.linux-ha.org
> >>> > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >>> > > > > See also: http://linux-ha.org/ReportingProblems
> >>> > > > >
>
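
For anyone lining up the two nodes' logs quoted above, here is a minimal
sketch in Python (not part of heartbeat) that pulls the queue depth and the
lost sequence numbers out of syslog lines. It assumes each syslog entry sits
on a single line, as it does in the raw log file rather than in wrapped mail.
The names summarize, QUEUE_RE and LOST_RE are made up for illustration; the
message texts matched are the ones shown in this thread.

```python
import re

# Patterns copied from the message texts quoted in this thread.
QUEUE_RE = re.compile(r"Message hist queue is filling up \((\d+) messages in queue\)")
LOST_RE = re.compile(r"Irretrievably lost packet: node (\S+) seq (\d+)")


def summarize(lines):
    """Return (queue depths in log order, {node: sorted unique lost seqs})."""
    depths = []
    lost = {}
    for line in lines:
        m = QUEUE_RE.search(line)
        if m:
            depths.append(int(m.group(1)))
            continue
        m = LOST_RE.search(line)
        if m:
            # A set drops entries that syslog logged more than once.
            lost.setdefault(m.group(1), set()).add(int(m.group(2)))
    return depths, {node: sorted(seqs) for node, seqs in lost.items()}


# Unwrapped copies of a few entries from the Dec 5 logs above.
sample = [
    "Dec  5 12:29:52 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (151 messages in queue)",
    "Dec  5 12:29:54 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (152 messages in queue)",
    "Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 135",
    "Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 135",
    "Dec  5 12:30:18 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 143",
]

depths, lost = summarize(sample)
print("queue depths:", depths)
print("lost seqs:", lost)
```

A steadily rising depth list next to a growing set of lost sequence numbers
on the peer is the pattern reported in this thread; the script only
summarizes the logs and does not diagnose the cause.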

