[Linux-HA] Message hist queue is filling up

ha at ew.nsci.us
Tue Dec 5 10:43:29 MST 2006


On Tue, 5 Dec 2006, Matt Wilder wrote:

> Greetings,
>
> I applied the patch pointed to above with no issue.  I have installed
> the patched version and restarted heartbeat on both nodes and the 99%
> cpu issue appears to be gone.  However, I am still getting the
> following messages in syslog and it seems as if resource handover isn't
> working quite right.  Can anyone point me to what these messages mean?
> I can provide more logs if necessary.
>

I am posting a me-too.  We had the same problem on one node and have not
found a resolution.  The node ran out of disk space and hung.  Ultimately
I removed everything heartbeat-related I could find on that node, then
removed the 2.0.5 rpm and installed 2.0.7.  After reinstalling, we got the
same error and the node would not see the cluster.  The only thing I can
think of is to stop the entire cluster, upgrade every node to 2.0.7, and
start again.  Unfortunately we have not had a chance to restart the
cluster over the past month or so, so the problem node is still offline.
Originally the entire cluster was 2.0.5.  Now every node is still 2.0.5
except the problem node, which is 2.0.7 because yum installed the latest
version (FC5).

Any thoughts?

-Eric


> Thanks.
>
> Primary Node (active):
> Dec  5 12:25:49 glider1 lrmd: [886]: WARN: G_SIG_dispatch: Dispatch
> function for SIGCHLD was delayed 1000 ms (> 100 ms) before being
> called (GSource: 0x522418)
> Dec  5 12:25:49 glider1 crmd: [888]: WARN:
> do_dc_join_finalize:join_dc.c join-2: We are still in a transition.
> Delaying until the TE completes.
> Dec  5 12:25:49 glider1 crmd: [888]: WARN:
> do_dc_join_finalize:join_dc.c join-2: We are still in a transition.
> Delaying until the TE completes.
> Dec  5 12:25:51 glider1 tengine: [899]: notice: run_graph:graph.c
> Transition 1: (Complete=18, Pending=0, Fired=0, Skipped=2,
> Incomplete=0)
> Dec  5 12:29:52 glider1 heartbeat: [837]: ERROR: Message hist queue is
> filling up (151 messages in queue)
> Dec  5 12:29:54 glider1 heartbeat: [837]: ERROR: Message hist queue is
> filling up (152 messages in queue)
> Dec  5 12:29:56 glider1 heartbeat: [837]: ERROR: Message hist queue is
> filling up (153 messages in queue)
> Dec  5 12:29:58 glider1 heartbeat: [837]: ERROR: Message hist queue is
> filling up (154 messages in queue)
>
> Secondary node:
> Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> packet: node glider1.domainit.com seq 135
> Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> packet: node glider1.domainit.com seq 135
> Dec  5 12:30:18 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> packet: node glider1.domainit.com seq 143
> Dec  5 12:30:28 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> packet: node glider1.domainit.com seq 148
> Dec  5 12:30:34 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> packet: node glider1.domainit.com seq 151
> Dec  5 12:30:39 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> packet: node glider1.domainit.com seq 153
>
>
>
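To compare the two nodes' logs side by side, something like the following may help. It is only a rough sketch: `summarize` is a hypothetical helper, not part of heartbeat, and the two regular expressions are copied from the message text quoted above, so they may not match other heartbeat versions. It assumes each syslog entry is on one line (the quoted lines above are wrapped by the mailer).

```python
import re

# Patterns taken from the log messages quoted in this thread (assumption:
# other heartbeat versions may word these messages differently).
QUEUE_RE = re.compile(
    r"Message hist queue is filling up \((\d+) messages in queue\)")
LOST_RE = re.compile(r"Irretrievably lost packet: node (\S+) seq (\d+)")


def summarize(lines):
    """Return (queue_depths, lost).

    queue_depths: queue sizes in the order they were logged.
    lost: dict mapping node name -> sorted list of unique lost seqnos.
    """
    depths = []
    lost = {}
    for line in lines:
        m = QUEUE_RE.search(line)
        if m:
            depths.append(int(m.group(1)))
            continue
        m = LOST_RE.search(line)
        if m:
            lost.setdefault(m.group(1), set()).add(int(m.group(2)))
    return depths, {node: sorted(seqs) for node, seqs in lost.items()}


# A few of the entries quoted above, joined back onto single lines.
sample = [
    "Dec  5 12:29:52 glider1 heartbeat: [837]: ERROR: Message hist queue is"
    " filling up (151 messages in queue)",
    "Dec  5 12:29:54 glider1 heartbeat: [837]: ERROR: Message hist queue is"
    " filling up (152 messages in queue)",
    "Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost"
    " packet: node glider1.domainit.com seq 135",
    "Dec  5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost"
    " packet: node glider1.domainit.com seq 135",
    "Dec  5 12:30:18 glider2 heartbeat: [559]: ERROR: Irretrievably lost"
    " packet: node glider1.domainit.com seq 143",
]

depths, lost = summarize(sample)
print("queue depths:", depths)
print("lost seqnos:", lost)
```

In practice the same function could be fed the lines of the real syslog file from each node; a queue depth that only ever grows, together with a growing list of lost sequence numbers on the peer, matches the pattern reported in this thread.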
> On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
>> I will look into this, as I am also having the 99% cpu issue.
>> 
>> Any ideas as to if this will make it into a release?
>> 
>> 
>> On 11/30/06, Oren Nechushtan <oren at forescout.com> wrote:
>> > Hi,
>> > We've encountered something like that in the past.
>> > Check out the messages titled "[Linux-HA] RE: 99% CPU heartbeat & rexmit
>> > (seqno too low)"
>> > from September 2006. The (unofficial) patch there solved it for us,
>> > though it may require minor changes to date.
>> >
>> > Best,
>> > Oren.
>> >
>> > > -----Original Message-----
>> > > From: linux-ha-bounces at lists.linux-ha.org
>> > > [mailto:linux-ha-bounces at lists.linux-ha.org]On Behalf Of Matt Wilder
>> > > Sent: Thursday, November 30, 2006 8:03 PM
>> > > To: General Linux-HA mailing list
>> > > Subject: Re: [Linux-HA] Message hist queue is filling up
>> > >
>> > >
>> > > What would cause this to happen?  There are no network connectivity
>> > > issues between the two nodes.
>> > >
>> > > On 11/30/06, Serge Dubrouski <sergeyfd at gmail.com> wrote:
>> > > > Lost packets between nodes in cluster.
>> > > >
>> > > > On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
>> > > > > Can anyone tell me the cause of the following messages showing up
>> > > > > in syslog from heartbeat?  I have checked network connectivity between
>> > > > > the two machines in my cluster and everything looks fine.  These
>> > > > > messages are occurring on a semi-frequent basis and do not seem to be
>> > > > > stopping.
>> > > > >
>> > > > > Node1 syslog (currently serving all resources):
>> > > > > Nov 28 18:06:36 glider1 heartbeat: [80229]: ERROR: Message hist queue
>> > > > > is filling up (196 messages in queue)
>> > > > > Nov 28 18:06:38 glider1 heartbeat: [80229]: ERROR: Message hist queue
>> > > > > is filling up (197 messages in queue)
>> > > > > Nov 28 18:06:40 glider1 heartbeat: [80229]: ERROR: Message hist queue
>> > > > > is filling up (198 messages in queue)
>> > > > > Nov 28 18:06:42 glider1 heartbeat: [80229]: ERROR: Message hist queue
>> > > > > is filling up (199 messages in queue)
>> > > > > Nov 28 18:06:44 glider1 heartbeat: [80229]: ERROR: Message hist queue
>> > > > > is filling up (200 messages in queue)
>> > > > > Nov 28 18:06:50 glider1 last message repeated 3 times
>> > > > > Nov 28 18:06:50 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt
>> > > > > 614508 for glider2.domainit.com: seqno too low
>> > > > > Nov 28 18:06:52 glider1 heartbeat: [80229]: ERROR: Message hist queue
>> > > > > is filling up (200 messages in queue)
>> > > > > Nov 28 18:06:56 glider1 last message repeated 2 times
>> > > > > Nov 28 18:06:56 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt
>> > > > > 614511 for glider2.domainit.com: seqno too low
>> > > > > Nov 28 18:06:58 glider1 heartbeat: [80229]: ERROR: Message hist queue
>> > > > > is filling up (200 messages in queue)
>> > > > > Nov 28 18:07:06 glider1 last message repeated 4 times
>> > > > >
>> > > > >
>> > > > > Node2 syslog:
>> > > > > Nov 28 18:05:05 glider2 heartbeat: [568]: ERROR: Irretrievably lost
>> > > > > packet: node glider1.domainit.com seq 614508
>> > > > > Nov 28 18:05:11 glider2 heartbeat: [568]: ERROR: Irretrievably lost
>> > > > > packet: node glider1.domainit.com seq 614511
>> > > > > _______________________________________________
>> > > > > Linux-HA mailing list
>> > > > > Linux-HA at lists.linux-ha.org
>> > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> > > > > See also: http://linux-ha.org/ReportingProblems
>> > > > >
>> > > >
>> > >
>> >
>> 
>

