[Linux-HA] Message hist queue is filling up
Oren Nechushtan
oren at forescout.com
Tue Dec 5 18:03:58 MST 2006
Hi,
In addition to the patch mentioned earlier, we also changed ha.cf as follows:
#max_rexmit_delay 250 # set the maximum rexmit delay time
max_rexmit_delay 10000
hbgenmethod time
I don't remember any errors (of this kind) for months now.
Hope this helps!
Best,
Oren.
P.S.
If you could send a minimal cluster configuration (and logs) demonstrating the problem, maybe someone can retest it and prepare the relevant patches for a release :)
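For anyone gathering logs to send, here is a minimal sketch in Python that pulls the relevant heartbeat errors out of syslog lines and summarizes them. The function name `summarize` and the regular expressions are illustrative assumptions based only on the message formats quoted in this thread, not on the heartbeat source.

```python
import re

# Patterns are assumptions derived from the log lines quoted in this thread.
QUEUE_RE = re.compile(r"Message hist queue is filling up \((\d+) messages in queue\)")
LOST_RE = re.compile(r"Irretrievably lost packet: node (\S+) seq (\d+)")
REXMIT_RE = re.compile(r"Cannot rexmit pkt (\d+) for (\S+): seqno too low")


def summarize(lines):
    """Return the peak queue depth and the sequence numbers reported as lost."""
    max_queue = 0
    lost = set()           # seqs the receiver reported as irretrievably lost
    rexmit_failed = set()  # seqs the sender could no longer retransmit
    for line in lines:
        m = QUEUE_RE.search(line)
        if m:
            max_queue = max(max_queue, int(m.group(1)))
            continue
        m = LOST_RE.search(line)
        if m:
            lost.add(int(m.group(2)))
            continue
        m = REXMIT_RE.search(line)
        if m:
            rexmit_failed.add(int(m.group(1)))
    return {
        "max_queue": max_queue,
        "lost": sorted(lost),
        "rexmit_failed": sorted(rexmit_failed),
        # seqs that appear in both nodes' logs
        "both": sorted(lost & rexmit_failed),
    }


# Sample lines copied from the logs quoted below (one from each message type).
sample = [
    "Nov 28 18:06:44 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)",
    "Nov 28 18:06:50 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt 614508 for glider2.domainit.com: seqno too low",
    "Nov 28 18:05:05 glider2 heartbeat: [568]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 614508",
    "Nov 28 18:05:11 glider2 heartbeat: [568]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 614511",
]

print(summarize(sample))
```

To run it over real logs, pass an open file instead of `sample`, with the syslog from both nodes concatenated so that the "both" field can match sequence numbers across them. The log file location depends on your syslog setup.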
> -----Original Message-----
> From: linux-ha-bounces at lists.linux-ha.org
> [mailto:linux-ha-bounces at lists.linux-ha.org]On Behalf Of Matt Wilder
> Sent: Tuesday, December 05, 2006 8:26 PM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Message hist queue is filling up
>
>
> FYI I am running heartbeat 2.0.7 with the patch listed above under
> FreeBSD 6.1-RELEASE-p3
>
> On 12/5/06, ha at ew.nsci.us <ha at ew.nsci.us> wrote:
> > On Tue, 5 Dec 2006, Matt Wilder wrote:
> >
> > > Greetings,
> > >
> > > I applied the patch pointed to above with no issue. I have installed
> > > the patched version and restarted heartbeat on both nodes, and the 99%
> > > CPU issue appears to be gone. However, I am still getting the
> > > following messages in syslog, and it seems as if resource handover isn't
> > > working quite right. Can anyone point me to what these messages mean?
> > > I can provide more logs if necessary.
> > >
> >
> > I am posting a me-too. We had the same problem with a node doing this and
> > have not found a resolution. The node ran out of disk space and hung.
> > Ultimately I ripped out anything heartbeat related I could find and
> > deleted anything that was left which might be heartbeat related on that
> > node. Next I removed the 2.0.5 rpm, and reinstalled with 2.0.7. After
> > reinstalling, we had the same error and the node would not see the
> > cluster. The only thing I can think of is to stop the entire cluster,
> > upgrade to 2.0.7, and start again. Unfortunately we have not had a moment
> > to restart the cluster to do this over the past month or so; the node with
> > problems is still offline. Originally the entire cluster was 2.0.5. Now
> > the cluster is all 2.0.5 except for the node which was having trouble,
> > which is now 2.0.7 'cause yum installed the latest version (FC5).
> >
> > Any thoughts?
> >
> > -Eric
> >
> >
> > > Thanks.
> > >
> > > Primary Node (active):
> > > Dec 5 12:25:49 glider1 lrmd: [886]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource: 0x522418)
> > > Dec 5 12:25:49 glider1 crmd: [888]: WARN: do_dc_join_finalize:join_dc.c join-2: We are still in a transition. Delaying until the TE completes.
> > > Dec 5 12:25:49 glider1 crmd: [888]: WARN: do_dc_join_finalize:join_dc.c join-2: We are still in a transition. Delaying until the TE completes.
> > > Dec 5 12:25:51 glider1 tengine: [899]: notice: run_graph:graph.c Transition 1: (Complete=18, Pending=0, Fired=0, Skipped=2, Incomplete=0)
> > > Dec 5 12:29:52 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (151 messages in queue)
> > > Dec 5 12:29:54 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (152 messages in queue)
> > > Dec 5 12:29:56 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (153 messages in queue)
> > > Dec 5 12:29:58 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (154 messages in queue)
> > >
> > > Secondary node:
> > > Dec 5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 135
> > > Dec 5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 135
> > > Dec 5 12:30:18 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 143
> > > Dec 5 12:30:28 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 148
> > > Dec 5 12:30:34 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 151
> > > Dec 5 12:30:39 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 153
> > >
> > >
> > >
> > > On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
> > >> I will look into this, as I am also having the 99% cpu issue.
> > >>
> > >> Any ideas as to if this will make it into a release?
> > >>
> > >>
> > >> On 11/30/06, Oren Nechushtan <oren at forescout.com> wrote:
> > >> > Hi,
> > >> > We've encountered something like that in the past.
> > >> > Check out the messages titled
> > >> > "[Linux-HA] RE: 99% CPU heartbeat & rexmit (seqno too low)"
> > >> > from September 2006. The (unofficial) patch there solved it for us,
> > >> > though it may require minor changes to date.
> > >> >
> > >> > Best,
> > >> > Oren.
> > >> >
> > >> > > -----Original Message-----
> > >> > > From: linux-ha-bounces at lists.linux-ha.org
> > >> > > [mailto:linux-ha-bounces at lists.linux-ha.org]On Behalf Of Matt Wilder
> > >> > > Sent: Thursday, November 30, 2006 8:03 PM
> > >> > > To: General Linux-HA mailing list
> > >> > > Subject: Re: [Linux-HA] Message hist queue is filling up
> > >> > >
> > >> > >
> > >> > > What would cause this to happen? There are no network connectivity
> > >> > > issues between the two nodes.
> > >> > >
> > >> > > On 11/30/06, Serge Dubrouski <sergeyfd at gmail.com> wrote:
> > >> > > > Lost packets between nodes in cluster.
> > >> > > >
> > >> > > > On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
> > >> > > > > Can anyone tell me what causes the following messages to show up
> > >> > > > > in syslog from heartbeat? I have checked network connectivity between
> > >> > > > > the two machines in my cluster and everything looks fine. These
> > >> > > > > messages are occurring on a semi-frequent basis and do not seem to be
> > >> > > > > stopping.
> > >> > > > >
> > >> > > > > Node1 syslog (currently serving all resources):
> > >> > > > > Nov 28 18:06:36 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (196 messages in queue)
> > >> > > > > Nov 28 18:06:38 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (197 messages in queue)
> > >> > > > > Nov 28 18:06:40 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (198 messages in queue)
> > >> > > > > Nov 28 18:06:42 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (199 messages in queue)
> > >> > > > > Nov 28 18:06:44 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)
> > >> > > > > Nov 28 18:06:50 glider1 last message repeated 3 times
> > >> > > > > Nov 28 18:06:50 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt 614508 for glider2.domainit.com: seqno too low
> > >> > > > > Nov 28 18:06:52 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)
> > >> > > > > Nov 28 18:06:56 glider1 last message repeated 2 times
> > >> > > > > Nov 28 18:06:56 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt 614511 for glider2.domainit.com: seqno too low
> > >> > > > > Nov 28 18:06:58 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)
> > >> > > > > Nov 28 18:07:06 glider1 last message repeated 4 times
> > >> > > > >
> > >> > > > >
> > >> > > > > Node2 syslog:
> > >> > > > > Nov 28 18:05:05 glider2 heartbeat: [568]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 614508
> > >> > > > > Nov 28 18:05:11 glider2 heartbeat: [568]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 614511
> > >> > > > > _______________________________________________
> > >> > > > > Linux-HA mailing list
> > >> > > > > Linux-HA at lists.linux-ha.org
> > >> > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > >> > > > > See also: http://linux-ha.org/ReportingProblems