[Linux-HA] Message hist queue is filling up

Oren Nechushtan oren at forescout.com
Tue Dec 5 18:03:58 MST 2006


Hi,
In addition to the patch mentioned earlier, we also changed ha.cf as follows:
#max_rexmit_delay        250     #       set the maximum rexmit delay time
max_rexmit_delay        10000    
hbgenmethod time                 

I don't recall seeing any errors of this kind for months now.
Hope this helps!

Best,
Oren.

P.S.
If you could send a minimal cluster configuration (and logs) demonstrating the problem, maybe someone can retest it and make the relevant release patches :)
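
If it helps with collecting the logs, here is a rough sketch of a filter for the relevant heartbeat errors. It is only an illustration, not something we run in production: the function name `summarize` is made up, and the patterns are copied from the log excerpts quoted below, so they may not match other heartbeat versions.

```python
import re
from collections import Counter

# Patterns copied from the syslog excerpts quoted in this thread.
# Assumption: other heartbeat versions may word these messages differently.
PATTERNS = {
    "hist_queue": re.compile(
        r"Message hist queue is filling up \((\d+) messages in queue\)"),
    "lost_packet": re.compile(
        r"Irretrievably lost packet: node (\S+) seq (\d+)"),
    "rexmit_too_low": re.compile(
        r"Cannot rexmit pkt (\d+) for (\S+): seqno too low"),
}


def summarize(lines):
    """Return (counts per error type, deepest queue seen, matching lines)."""
    counts = Counter()
    max_queue = 0
    matched = []
    for line in lines:
        # Only look at lines logged by the heartbeat daemon itself.
        if " heartbeat: " not in line:
            continue
        for name, pattern in PATTERNS.items():
            m = pattern.search(line)
            if m is None:
                continue
            counts[name] += 1
            matched.append(line.rstrip("\n"))
            if name == "hist_queue":
                max_queue = max(max_queue, int(m.group(1)))
            break
    return counts, max_queue, matched
```

To use it, open the syslog file on each node (its location depends on your system) and pass the file object to `summarize`; the returned list of matching lines is what would be worth attaching to a report.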

> -----Original Message-----
> From: linux-ha-bounces at lists.linux-ha.org 
> [mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Matt Wilder
> Sent: Tuesday, December 05, 2006 8:26 PM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Message hist queue is filling up
> 
> 
> FYI I am running heartbeat 2.0.7 with the patch listed above under
> FreeBSD 6.1-RELEASE-p3
> 
> On 12/5/06, ha at ew.nsci.us <ha at ew.nsci.us> wrote:
> > On Tue, 5 Dec 2006, Matt Wilder wrote:
> >
> > > Greetings,
> > >
> > > I applied the patch pointed to above with no issues.  I have installed
> > > the patched version and restarted heartbeat on both nodes, and the 99%
> > > CPU issue appears to be gone.  However, I am still getting the
> > > following messages in syslog, and it seems as if resource handover isn't
> > > working quite right.  Can anyone tell me what these messages mean?
> > > I can provide more logs if necessary.
> > >
> >
> > I am posting a me-too.  We had the same problem with a node doing this and
> > have not found a resolution.  The node ran out of disk space and hung.
> > Ultimately I ripped out anything heartbeat related I could find and
> > deleted anything that was left which might be heartbeat related on that
> > node.  Next I removed the 2.0.5 rpm, and reinstalled with 2.0.7.  After
> > reinstalling, we had the same error and the node would not see the
> > cluster.  The only thing I can think of is to stop the entire cluster,
> > upgrade to 2.0.7, and start again.  Unfortunately we have not had a moment
> > to restart the cluster to do this over the past month or so; the node with
> > problems is still offline.  Originally the entire cluster was 2.0.5.  Now
> > the cluster is all 2.0.5 except for the node which was having trouble,
> > which is now 2.0.7 'cause yum installed the latest version (FC5).
> >
> > Any thoughts?
> >
> > -Eric
> >
> >
> > > Thanks.
> > >
> > > Primary Node (active):
> > > Dec  5 12:25:49 glider1 lrmd: [886]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource: 0x522418)
> > > Dec  5 12:25:49 glider1 crmd: [888]: WARN: do_dc_join_finalize:join_dc.c join-2: We are still in a transition. Delaying until the TE completes.
> > > Dec  5 12:25:49 glider1 crmd: [888]: WARN: do_dc_join_finalize:join_dc.c join-2: We are still in a transition. Delaying until the TE completes.
> > > Dec  5 12:25:51 glider1 tengine: [899]: notice: run_graph:graph.c Transition 1: (Complete=18, Pending=0, Fired=0, Skipped=2, Incomplete=0)
> > > Dec  5 12:29:52 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (151 messages in queue)
> > > Dec  5 12:29:54 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (152 messages in queue)
> > > Dec  5 12:29:56 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (153 messages in queue)
> > > Dec  5 12:29:58 glider1 heartbeat: [837]: ERROR: Message hist queue is filling up (154 messages in queue)
> > >
> > > Secondary node:
> > > Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 135
> > > Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 135
> > > Dec  5 12:30:18 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 143
> > > Dec  5 12:30:28 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 148
> > > Dec  5 12:30:34 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 151
> > > Dec  5 12:30:39 glider2 heartbeat: [559]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 153
> > >
> > >
> > >
> > > On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
> > >> I will look into this, as I am also having the 99% cpu issue.
> > >>
> > >> Any idea whether this will make it into a release?
> > >>
> > >>
> > >> On 11/30/06, Oren Nechushtan <oren at forescout.com> wrote:
> > >> > Hi,
> > >> > We've encountered something like that in the past.
> > >> > Check out the messages titled
> > >> > "[Linux-HA] RE: 99% CPU heartbeat & rexmit (seqno too low)"
> > >> > from September 2006. The (unofficial) patch there solved it for us,
> > >> > though it may require minor changes by now.
> > >> >
> > >> > Best,
> > >> > Oren.
> > >> >
> > >> > > -----Original Message-----
> > >> > > From: linux-ha-bounces at lists.linux-ha.org
> > >> > > [mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Matt Wilder
> > >> > > Sent: Thursday, November 30, 2006 8:03 PM
> > >> > > To: General Linux-HA mailing list
> > >> > > Subject: Re: [Linux-HA] Message hist queue is filling up
> > >> > >
> > >> > >
> > >> > > What would cause this to happen?  There are no network connectivity
> > >> > > issues between the two nodes.
> > >> > >
> > >> > > On 11/30/06, Serge Dubrouski <sergeyfd at gmail.com> wrote:
> > >> > > > Lost packets between nodes in cluster.
> > >> > > >
> > >> > > > On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
> > >> > > > > Can anyone tell me the cause of the following messages showing up
> > >> > > > > in syslog from heartbeat?  I have checked network connectivity between
> > >> > > > > the two machines in my cluster and everything looks fine.  These
> > >> > > > > messages are occurring on a semi-frequent basis and do not seem to be
> > >> > > > > stopping.
> > >> > > > >
> > >> > > > > Node1 syslog (currently serving all resources):
> > >> > > > > Nov 28 18:06:36 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (196 messages in queue)
> > >> > > > > Nov 28 18:06:38 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (197 messages in queue)
> > >> > > > > Nov 28 18:06:40 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (198 messages in queue)
> > >> > > > > Nov 28 18:06:42 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (199 messages in queue)
> > >> > > > > Nov 28 18:06:44 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)
> > >> > > > > Nov 28 18:06:50 glider1 last message repeated 3 times
> > >> > > > > Nov 28 18:06:50 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt 614508 for glider2.domainit.com: seqno too low
> > >> > > > > Nov 28 18:06:52 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)
> > >> > > > > Nov 28 18:06:56 glider1 last message repeated 2 times
> > >> > > > > Nov 28 18:06:56 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt 614511 for glider2.domainit.com: seqno too low
> > >> > > > > Nov 28 18:06:58 glider1 heartbeat: [80229]: ERROR: Message hist queue is filling up (200 messages in queue)
> > >> > > > > Nov 28 18:07:06 glider1 last message repeated 4 times
> > >> > > > >
> > >> > > > >
> > >> > > > > Node2 syslog:
> > >> > > > > Nov 28 18:05:05 glider2 heartbeat: [568]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 614508
> > >> > > > > Nov 28 18:05:11 glider2 heartbeat: [568]: ERROR: Irretrievably lost packet: node glider1.domainit.com seq 614511
> > >> > > > > _______________________________________________
> > >> > > > > Linux-HA mailing list
> > >> > > > > Linux-HA at lists.linux-ha.org
> > >> > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > >> > > > > See also: http://linux-ha.org/ReportingProblems

