[Linux-HA] Message hist queue is filling up
Matt Wilder
grewaru at gmail.com
Tue Dec 12 09:32:23 MST 2006
All,
I discovered the source of this problem:
I am running pf (firewall software) on both systems, and I had enabled
"scrub" rules on the interfaces that carry the heartbeat packets. This
apparently interferes with the transmission of the crm data from node
to node, although the heartbeat itself was not affected. After I
removed the scrub rules from both nodes, these messages disappeared.
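
For anyone who wants to check whether the same messages are present in
(or gone from) their own logs, here is a minimal Python sketch. It is not
part of the original setup. The two patterns are copied from the heartbeat
messages quoted further down in this thread; the function name `summarize`
and the default log path are assumptions made for illustration only.

```python
import re
import sys

# Patterns copied from the heartbeat log lines quoted in this thread.
QUEUE_RE = re.compile(
    r"Message hist queue is filling up \((\d+) messages in queue\)")
LOST_RE = re.compile(r"Irretrievably lost packet: node (\S+) seq (\d+)")


def summarize(lines):
    """Return (warning_count, peak_queue_depth, lost_seqs_by_node).

    `lines` is any iterable of unwrapped syslog lines. Syslog's
    "last message repeated N times" lines are not expanded, so the
    warning count is a lower bound.
    """
    depths = []
    lost = {}
    for line in lines:
        m = QUEUE_RE.search(line)
        if m:
            depths.append(int(m.group(1)))
            continue
        m = LOST_RE.search(line)
        if m:
            # Collect distinct lost sequence numbers per sending node.
            lost.setdefault(m.group(1), set()).add(int(m.group(2)))
    return len(depths), max(depths, default=0), lost


if __name__ == "__main__":
    # The default path is an assumption; pass your syslog file as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as fh:
        count, peak, lost = summarize(fh)
    print("hist queue warnings:", count, "peak depth:", peak)
    for node, seqs in sorted(lost.items()):
        print("lost packets from", node, ":", sorted(seqs))
```

Run it against the log of each node before and after changing the
firewall rules. If both the warning count and the lost-packet list drop to
zero for the later period, the change most likely had the intended effect.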
Matt
On 12/7/06, Matt Wilder <grewaru at gmail.com> wrote:
> I set the following in my ha.cf on both cluster machines and restarted
> heartbeat.
> I am still having the 99% cpu issue and the Message hist queue issues.
>
> I have attached my ha.cf, my cib.xml, and the logs from both machines
> during the startup phase. Please let me know if anything else can be
> of assistance:
>
> FYI I omitted the IP addresses from all config files for security reasons :)
>
> ha.cf:
> use_logd on
> logfacility local7
> debug 0
> keepalive 2
> warntime 10
> deadtime 30
> initdead 40
> max_rexmit_delay 10000
> hbgenmethod time
> auto_failback on
> ucast bge0 <omitted>
> node glider1.domainit.com glider2.domainit.com
> respawn hacluster /usr/local/lib/heartbeat/ipfail
> crm yes
>
>
> cib.xml:
> <cib generated="true" admin_epoch="0" have_quorum="true"
> num_peers="2" cib_feature_revision="1.3" epoch="173"
> num_updates="14968" cib-last-written="Thu Dec 7 12:47:29 2006"
> ccm_transition="2" dc_uuid="756cb762-f9f5-42b0-bfb3-52ed277d2f97">
> <configuration>
> <crm_config>
> <cluster_property_set id="60451331-6583-42e4-b40d-117256f59751">
> <attributes>
> <nvpair id="stonith_enabled" name="stonith_enabled" value="no"/>
> <nvpair id="remove_after_stop" name="remove_after_stop" value="yes"/>
> <nvpair id="stop_orphan_resources"
> name="stop_orphan_resources" value="true"/>
> <nvpair id="stop_orphan_actions" name="stop_orphan_actions"
> value="true"/>
> <nvpair id="is_managed_default" name="is_managed_default"
> value="false"/>
> <nvpair id="default_resource_failure_stickiness"
> name="default_resource_failure_stickiness" value="-100"/>
> <nvpair id="default_resource_stickiness"
> name="default_resource_stickiness" value="500"/>
> </attributes>
> </cluster_property_set>
> <cluster_property_set id="cib-bootstrap-options">
> <attributes>
> <nvpair name="last-lrm-refresh"
> id="cib-bootstrap-options-last-lrm-refresh" value="1158861804"/>
> <nvpair id="cib-bootstrap-options-is_managed_default"
> name="is_managed_default" value="true"/>
> </attributes>
> </cluster_property_set>
> </crm_config>
> <nodes>
> <node id="eb9ed0c8-66ae-4368-9f4c-5a562550df3b"
> uname="glider2.domainit.com" type="normal"/>
> <node id="756cb762-f9f5-42b0-bfb3-52ed277d2f97"
> uname="glider1.domainit.com" type="normal"/>
> </nodes>
> <resources>
> <group id="apache_group">
> <primitive class="ocf" type="IPaddr" provider="heartbeat"
> id="ip_domainit">
> <instance_attributes id="f30220b3-ace2-48e6-9585-cab6e1b2bcb9">
> <attributes>
> <nvpair name="ip" value="<omitted>"
> id="e1ccdf0b-c0b2-4dbd-8c85-7495eeb02209"/>
> </attributes>
> </instance_attributes>
> <instance_attributes id="ip_domainit">
> <attributes>
> <nvpair id="ip_domainit-is_managed" name="is_managed"
> value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive class="ocf" type="IPaddr" provider="heartbeat"
> id="ip_nervecenter">
> <instance_attributes id="29068aef-a233-48c0-883b-b8e42876debe">
> <attributes>
> <nvpair name="ip" value="<omitted>"
> id="aa743d14-8872-4564-b701-e5382e2d85fc"/>
> </attributes>
> </instance_attributes>
> <instance_attributes id="ip_nervecenter">
> <attributes>
> <nvpair id="ip_nervecenter-is_managed"
> name="is_managed" value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="apache" class="ocf" type="DmitApache"
> provider="heartbeat">
> <operations>
> <op id="apache-monitor" name="monitor" interval="1min"
> timeout="30s"/>
> </operations>
> <instance_attributes id="apache">
> <attributes>
> <nvpair id="apache-is_managed" name="is_managed" value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="periodics" class="ocf" type="PeriodicScripts"
> provider="heartbeat">
> <instance_attributes id="periodics">
> <attributes>
> <nvpair id="periodics-is_managed" name="is_managed"
> value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="verisigntransport" class="ocf"
> type="TransportVerisign" provider="heartbeat">
> <operations>
> <op id="verisigntransport-monitor" name="monitor"
> interval="1min" timeout="30s"/>
> </operations>
> <instance_attributes id="verisigntransport">
> <attributes>
> <nvpair id="verisigntransport-is_managed"
> name="is_managed" value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="orgtransport" class="ocf" type="TransportOrg"
> provider="heartbeat">
> <operations>
> <op id="orgtransport-monitor" name="monitor"
> interval="1min" timeout="30s"/>
> </operations>
> <instance_attributes id="orgtransport">
> <attributes>
> <nvpair id="orgtransport-is_managed" name="is_managed"
> value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="biztransport" class="ocf" type="TransportBiz"
> provider="heartbeat">
> <operations>
> <op id="biztransport-monitor" name="monitor"
> interval="1min" timeout="30s"/>
> </operations>
> <instance_attributes id="biztransport">
> <attributes>
> <nvpair id="biztransport-is_managed" name="is_managed"
> value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="ustransport" class="ocf" type="TransportUS"
> provider="heartbeat">
> <operations>
> <op id="ustransport-monitor" name="monitor"
> interval="1min" timeout="30s"/>
> </operations>
> <instance_attributes id="ustransport">
> <attributes>
> <nvpair id="ustransport-is_managed" name="is_managed"
> value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="infotransport" class="ocf"
> type="TransportInfo" provider="heartbeat">
> <operations>
> <op id="infotransport-monitor" name="monitor"
> interval="1min" timeout="30s"/>
> </operations>
> <instance_attributes id="infotransport">
> <attributes>
> <nvpair id="infotransport-is_managed" name="is_managed"
> value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="namestoretransport" class="ocf"
> type="TransportNamestore" provider="heartbeat">
> <operations>
> <op id="namestoretransport-monitor" name="monitor"
> interval="1min" timeout="30s"/>
> </operations>
> <instance_attributes id="namestoretransport">
> <attributes>
> <nvpair id="namestoretransport-is_managed"
> name="is_managed" value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="mobitransport" class="ocf"
> type="TransportMobi" provider="heartbeat">
> <operations>
> <op id="mobitransport-monitor" name="monitor"
> interval="1min" timeout="30s"/>
> </operations>
> <instance_attributes id="mobitransport">
> <attributes>
> <nvpair id="mobitransport-is_managed" name="is_managed"
> value="true"/>
> </attributes>
> </instance_attributes>
> </primitive>
> </group>
> </resources>
> <constraints>
> <rsc_location id="run_apache_group" rsc="apache_group">
> <rule id="pref_gliderweb1" score="1500">
> <expression attribute="#uname" operation="eq"
> value="glider1.domainit.com"
> id="d4794175-9f03-44b6-970f-93a84e04f183"/>
> </rule>
> <rule id="pref_gliderweb2" score="1000">
> <expression attribute="#uname" operation="eq"
> value="glider2.domainit.com"
> id="116cbf7a-268a-40dc-994c-8a3884bd8a96"/>
> </rule>
> </rsc_location>
> </constraints>
> </configuration>
> </cib>
>
>
>
> On 12/6/06, Andreas Kurz <akurz at sms.at> wrote:
> >
> > >How do I go about disabling resource management without shutting down
> > >my services?
> >
> > Have a look at http://www.linux-ha.org/v2/upgrade/reattach
> >
> > You can use:
> > crm_attribute -t crm_config -n is_managed_default -v false
> >
> > or manipulate the cib with cibadmin:
> > cibadmin -R -o crm_config -X '<nvpair id="is_managed_default" name="is_managed_default" value="false"/>'
> >
> > Regards,
> > Andreas
> >
> > On 12/6/06, Andreas Kurz <akurz at sms.at> wrote:
> > > ha at ew.nsci.us wrote:
> > > > On Tue, 5 Dec 2006, Matt Wilder wrote:
> > > >
> > > >> Greetings,
> > > >>
> > > >> I applied the patch pointed to above with no issue. I have installed
> > > >> the patched version and restarted heartbeat on both nodes and the 99%
> > > > >> cpu issue appears to be gone. However, I am still getting the
> > > > >> following messages in syslog, and it seems as if resource handover
> > > > >> isn't working quite right. Can anyone point me to what these messages mean?
> > > >> I can provide more logs if necessary.
> > > >>
> > > >
> > > > I am posting a me-too. We had the same problem with a node doing this
> > > > and have not found a resolution. The node ran out of disk space and
> > > > hung. Ultimately I ripped out anything heartbeat related I could find
> > > > and deleted anything that was left which might be heartbeat related on
> > > > that node. Next I removed the 2.0.5 rpm, and reinstalled with 2.0.7.
> > > > After reinstalling, we had the same error and the node would not see the
> > > > cluster. The only thing I can think of is to stop the entire cluster,
> > > > upgrade to 2.0.7, and start again. Unfortunately we have not had a
> > > > moment to restart the cluster to do this over the past month or so; the
> > > > node with problems is still offline.
> > >
> > > How about disabling resource management? This should allow a restart of
> > > heartbeat without interrupting your services.
> > >
> > > Regards,
> > > Andreas
> > >
> > > > > Originally the entire cluster was
> > > > 2.0.5. Now the cluster is all 2.0.5 except for the node which was
> > > > having trouble, which is now 2.0.7 'cause yum installed the latest
> > > > version (FC5).
> > > >
> > > > Any thoughts?
> > > >
> > > > -Eric
> > > >
> > > >
> > > >> Thanks.
> > > >>
> > > >> Primary Node (active):
> > > >> Dec 5 12:25:49 glider1 lrmd: [886]: WARN: G_SIG_dispatch: Dispatch
> > > >> function for SIGCHLD was delayed 1000 ms (> 100 ms) before being
> > > >> called (GSource: 0x522418)
> > > >> Dec 5 12:25:49 glider1 crmd: [888]: WARN:
> > > >> do_dc_join_finalize:join_dc.c join-2: We are still in a transition.
> > > >> Delaying until the TE completes.
> > > >> Dec 5 12:25:49 glider1 crmd: [888]: WARN:
> > > >> do_dc_join_finalize:join_dc.c join-2: We are still in a transition.
> > > >> Delaying until the TE completes.
> > > >> Dec 5 12:25:51 glider1 tengine: [899]: notice: run_graph:graph.c
> > > >> Transition 1: (Complete=18, Pending=0, Fired=0, Skipped=2,
> > > >> Incomplete=0)
> > > >> Dec 5 12:29:52 glider1 heartbeat: [837]: ERROR: Message hist queue is
> > > >> filling up (151 messages in queue)
> > > >> Dec 5 12:29:54 glider1 heartbeat: [837]: ERROR: Message hist queue is
> > > >> filling up (152 messages in queue)
> > > >> Dec 5 12:29:56 glider1 heartbeat: [837]: ERROR: Message hist queue is
> > > >> filling up (153 messages in queue)
> > > >> Dec 5 12:29:58 glider1 heartbeat: [837]: ERROR: Message hist queue is
> > > >> filling up (154 messages in queue)
> > > >>
> > > >> Secondary node:
> > > >> Dec 5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> > > >> packet: node glider1.domainit.com seq 135
> > > >> Dec 5 12:30:03 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> > > >> packet: node glider1.domainit.com seq 135
> > > >> Dec 5 12:30:18 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> > > >> packet: node glider1.domainit.com seq 143
> > > >> Dec 5 12:30:28 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> > > >> packet: node glider1.domainit.com seq 148
> > > >> Dec 5 12:30:34 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> > > >> packet: node glider1.domainit.com seq 151
> > > >> Dec 5 12:30:39 glider2 heartbeat: [559]: ERROR: Irretrievably lost
> > > >> packet: node glider1.domainit.com seq 153
> > > >>
> > > >>
> > > >>
> > > >> On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
> > > >>> I will look into this, as I am also having the 99% cpu issue.
> > > >>>
> > > > >>> Any idea whether this will make it into a release?
> > > >>>
> > > >>>
> > > >>> On 11/30/06, Oren Nechushtan <oren at forescout.com> wrote:
> > > >>> > Hi,
> > > >>> > We've encountered something like that in the past.
> > > > >>> > Check out the messages titled "[Linux-HA] RE: 99% CPU heartbeat &
> > > > >>> > rexmit (seqno too low)"
> > > > >>> > from September 2006. The (unofficial) patch there solved it for us,
> > > > >>> > though it may need minor changes to bring it up to date.
> > > >>> >
> > > >>> > Best,
> > > >>> > Oren.
> > > >>> >
> > > >>> > > -----Original Message-----
> > > >>> > > From: linux-ha-bounces at lists.linux-ha.org
> > > >>> > > [mailto:linux-ha-bounces at lists.linux-ha.org]On Behalf Of Matt Wilder
> > > >>> > > Sent: Thursday, November 30, 2006 8:03 PM
> > > >>> > > To: General Linux-HA mailing list
> > > >>> > > Subject: Re: [Linux-HA] Message hist queue is filling up
> > > >>> > >
> > > >>> > >
> > > >>> > > What would cause this to happen? There are no network connectivity
> > > >>> > > issues between the two nodes.
> > > >>> > >
> > > >>> > > On 11/30/06, Serge Dubrouski <sergeyfd at gmail.com> wrote:
> > > >>> > > > Lost packets between nodes in cluster.
> > > >>> > > >
> > > >>> > > > On 11/30/06, Matt Wilder <grewaru at gmail.com> wrote:
> > > > >>> > > > > Can anyone tell me what is causing the following messages to
> > > > >>> > > > > show up in syslog from heartbeat? I have checked network
> > > > >>> > > > > connectivity between the two machines in my cluster and
> > > > >>> > > > > everything looks fine. These messages are occurring on a
> > > > >>> > > > > semi-frequent basis and do not seem to be stopping.
> > > >>> > > > >
> > > >>> > > > > Node1 syslog (currently serving all resources):
> > > > >>> > > > > Nov 28 18:06:36 glider1 heartbeat: [80229]: ERROR: Message hist queue
> > > > >>> > > > > is filling up (196 messages in queue)
> > > > >>> > > > > Nov 28 18:06:38 glider1 heartbeat: [80229]: ERROR: Message hist queue
> > > > >>> > > > > is filling up (197 messages in queue)
> > > > >>> > > > > Nov 28 18:06:40 glider1 heartbeat: [80229]: ERROR: Message hist queue
> > > > >>> > > > > is filling up (198 messages in queue)
> > > > >>> > > > > Nov 28 18:06:42 glider1 heartbeat: [80229]: ERROR: Message hist queue
> > > > >>> > > > > is filling up (199 messages in queue)
> > > > >>> > > > > Nov 28 18:06:44 glider1 heartbeat: [80229]: ERROR: Message hist queue
> > > > >>> > > > > is filling up (200 messages in queue)
> > > > >>> > > > > Nov 28 18:06:50 glider1 last message repeated 3 times
> > > > >>> > > > > Nov 28 18:06:50 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt
> > > > >>> > > > > 614508 for glider2.domainit.com: seqno too low
> > > > >>> > > > > Nov 28 18:06:52 glider1 heartbeat: [80229]: ERROR: Message hist queue
> > > > >>> > > > > is filling up (200 messages in queue)
> > > > >>> > > > > Nov 28 18:06:56 glider1 last message repeated 2 times
> > > > >>> > > > > Nov 28 18:06:56 glider1 heartbeat: [80229]: ERROR: Cannot rexmit pkt
> > > > >>> > > > > 614511 for glider2.domainit.com: seqno too low
> > > > >>> > > > > Nov 28 18:06:58 glider1 heartbeat: [80229]: ERROR: Message hist queue
> > > > >>> > > > > is filling up (200 messages in queue)
> > > > >>> > > > > Nov 28 18:07:06 glider1 last message repeated 4 times
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > Node2 syslog:
> > > > >>> > > > > Nov 28 18:05:05 glider2 heartbeat: [568]: ERROR: Irretrievably lost
> > > > >>> > > > > packet: node glider1.domainit.com seq 614508
> > > > >>> > > > > Nov 28 18:05:11 glider2 heartbeat: [568]: ERROR: Irretrievably lost
> > > > >>> > > > > packet: node glider1.domainit.com seq 614511
> > > >>> > > > > _______________________________________________
> > > >>> > > > > Linux-HA mailing list
> > > >>> > > > > Linux-HA at lists.linux-ha.org
> > > >>> > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > >>> > > > > See also: http://linux-ha.org/ReportingProblems
> > > >>> > > > >