heartbeat restarting every morning?
Matt Stockdale
mstockda at logicworks.net
Wed Dec 18 10:13:29 MST 2002
Wonderful. What's the most recent Red Hat kernel that's known to work?
I'm assuming that if I compile a 2.4.20 kernel from source, I won't have this problem?
On Wed, Dec 18, 2002 at 09:59:03AM -0600, Brian Tinsley wrote:
> Sounds like the now infamous Red Hat kernel bug:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=77058
>
> Although the report never explicitly mentions 2.4.18-10, I can personally
> attest to the fact that it also suffers from this problem.
>
> Matt Stockdale wrote:
>
> Red Hat 7.3, with the Red Hat 2.4.18-10 kernel.
>
> On Wed, Dec 18, 2002 at 08:27:01AM -0600, Brian Tinsley wrote:
>
>
> What kernel/distribution are you using?
>
>
> Matt Stockdale wrote:
>
>
>
> I've got a fairly simple HA setup for a firewall, but I'm seeing some strange behaviour every morning at 5:55am (give or take a few seconds).
>
> A few snippets from the secondary machine's log:
>
> heartbeat: 2002/12/18_05:54:09 info: Daily informational memory statistics
> heartbeat: 2002/12/18_05:54:09 info: MSG stats: 100/85563 age 1 [pid25180/CONTROL]
> heartbeat: 2002/12/18_05:54:09 info: ha_malloc stats: 2500/2224638 92800/49900 [pid25180/CONTROL]
> heartbeat: 2002/12/18_05:54:09 info: RealMalloc stats: 94064 total malloc bytes. pid [25180/CONTROL]
> heartbeat: 2002/12/18_05:54:09 info: MSG stats: 0/85563 age 1 [pid25182/HBWRITE]
> heartbeat: 2002/12/18_05:54:09 info: ha_malloc stats: 0/2224638 0/0 [pid25182/HBWRITE]
> heartbeat: 2002/12/18_05:54:09 info: RealMalloc stats: 1264 total malloc bytes. pid [25182/HBWRITE]
> heartbeat: 2002/12/18_05:54:09 info: MSG stats: 0/128554 age 1 [pid25183/HBREAD]
> heartbeat: 2002/12/18_05:54:09 info: ha_malloc stats: 0/3342398 0/0 [pid25183/HBREAD]
> heartbeat: 2002/12/18_05:54:09 info: RealMalloc stats: 1264 total malloc bytes. pid [25183/HBREAD]
> heartbeat: 2002/12/18_05:54:09 info: MSG stats: 0/299680 age 1 [pid25184/MST_STATUS]
> heartbeat: 2002/12/18_05:54:09 info: ha_malloc stats: 0/6379673 0/0 [pid25184/MST_STATUS]
> heartbeat: 2002/12/18_05:54:09 info: RealMalloc stats: 1696 total malloc bytes. pid [25184/MST_STATUS]
> heartbeat: 2002/12/18_05:54:09 info: These are nothing to worry about.
> heartbeat: 2002/12/18_05:55:22 WARN: node mailpat-pri: is dead
> heartbeat: 2002/12/18_05:55:22 info: Resources being acquired from mailpat-pri.
> heartbeat: 2002/12/18_05:55:22 WARN: node mailpat-sec: is dead
> heartbeat: 2002/12/18_05:55:22 ERROR: No local heartbeat. Forcing shutdown.
> heartbeat: 2002/12/18_05:55:22 info: Link mailpat-pri:eth2 dead.
> heartbeat: 2002/12/18_05:55:22 WARN: Cluster node mailpat-prireturning after partition
> heartbeat: 2002/12/18_05:55:22 info: giveup_resources: current status: active
> heartbeat: 2002/12/18_05:55:22 info: killing notify world process group 27118 with signal 9
> heartbeat: 2002/12/18_05:55:22 info: Heartbeat shutdown in progress. (25184)
> heartbeat: 2002/12/18_05:55:22 info: Link mailpat-pri:eth2 up.
> heartbeat: 2002/12/18_05:55:22 WARN: Late heartbeat: Node mailpat-pri: interval 5570 ms
> heartbeat: 2002/12/18_05:55:22 info: Status update for node mailpat-pri: status active
> heartbeat: 2002/12/18_05:55:22 info: Giving up all HA resources.
> heartbeat: 2002/12/18_05:55:22 info: Heartbeat shutdown already underway.
> heartbeat: 2002/12/18_05:55:22 WARN: node mailpat-sec: is dead
> heartbeat: 2002/12/18_05:55:22 ERROR: No local heartbeat. Forcing shutdown.
> heartbeat: 2002/12/18_05:55:22 info: heartbeat: version 0.4.9e
> heartbeat: 2002/12/18_05:55:22 info: Running /etc/ha.d/rc.d/status status
> heartbeat: 2002/12/18_05:55:22 info: Running /etc/ha.d/rc.d/status status
> heartbeat: 2002/12/18_05:55:22 WARN: node mailpat-sec: is dead
> heartbeat: 2002/12/18_05:55:22 ERROR: No local heartbeat. Forcing shutdown.
> heartbeat: 2002/12/18_05:55:22 info: Taking over resource group IPaddr::206.252.135.253
>
> It's worrying that it sees mailpat-pri (the master node) as up, yet it continues to take over the IP resource anyway.
>
> I can't see anything that's triggering a shutdown/restart. Is heartbeat coded to do this at 5:55 am? It's been happening ever since I brought up the cluster a few days ago, and heartbeat does it on both machines almost simultaneously. It's usually not a problem, as things get back to normal in 10 seconds or so, but this morning the secondary machine didn't relinquish the resources, even after the primary machine took them back.
>
> What does "ERROR: No local heartbeat. Forcing shutdown." mean? This always seems to happen about a minute and 10 seconds after it prints out the daily informational memory statistics.
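
For anyone who wants to check that gap against their own logs, here is a rough Python sketch. It assumes the log lines look like the ones quoted above ("heartbeat: YYYY/MM/DD_HH:MM:SS level: text"); the names parse and gap_seconds are just illustrative helpers, not anything that ships with heartbeat.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the log excerpt in this thread:
#   heartbeat: 2002/12/18_05:54:09 info: Daily informational memory statistics
LINE_RE = re.compile(
    r"heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) (\w+): (.*)"
)


def parse(line):
    """Return (timestamp, level, text) for a heartbeat log line, else None."""
    m = LINE_RE.search(line)  # search() so leading "> " quoting is tolerated
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%Y/%m/%d_%H:%M:%S")
    return stamp, m.group(2), m.group(3)


def gap_seconds(lines, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the next
    line containing second_marker, or None if either is missing."""
    start = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        stamp, _level, text = parsed
        if start is None:
            if first_marker in text:
                start = stamp
        elif second_marker in text:
            return (stamp - start).total_seconds()
    return None


# Three lines copied from the quoted log.
SAMPLE = [
    "> heartbeat: 2002/12/18_05:54:09 info: Daily informational memory statistics",
    "> heartbeat: 2002/12/18_05:55:22 WARN: node mailpat-pri: is dead",
    "> heartbeat: 2002/12/18_05:55:22 ERROR: No local heartbeat. Forcing shutdown.",
]

if __name__ == "__main__":
    print(gap_seconds(SAMPLE,
                      "Daily informational memory statistics",
                      "No local heartbeat"))  # prints 73.0
```

On the excerpt above this gives 73 seconds, which matches the "about a minute and 10 seconds" estimate. Running it over several days of logs would show whether the gap is really constant.
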
>
> There's not a lot of configuration info here. If no one has seen anything similar, I'd be happy to go into more detail.
>
> Thanks,
> Matt
>
>
>
>
>
> --
>
> -[========================]-
> -[ Brian Tinsley ]-
> -[ Chief Systems Engineer ]-
> -[ Emageon ]-
> -[========================]-
--
---------------------------------------------------------------
Matt Stockdale Sr. Network Engineer - logicworks.net
mstockda at logicworks.net "Dura lex, sed lex"