[Linux-HA] Re: memory leaks of crmd and tengine in 2.0.8
Andrew Beekhof
beekhof at gmail.com
Wed Feb 7 09:09:06 MST 2007
On 2/7/07, Pavol Gono <palo.gono at gmail.com> wrote:
> On 2/6/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > On 2/5/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > > Hi Pavol,
> > >
> > > Sorry for the delay, I'm not ignoring you, I've just been busy elsewhere.
> > >
> > > If I'm reading your data correctly, attrd and crmd seem to be the
> > > worst offenders with tengine a bit behind. I'd not realized the
> > > numbers were so extreme :-(
> > >
> > > if you look in lib/cl_plumbing/cl_malloc.c there are a number of
> > > #defines that may help tracking this down. i will start tackling this
> > > tomorrow (starting with attrd given its low complexity).
> >
> > If you're inclined, you could rerun your tests to verify my attrd changes.
> >
> > Patches are (in order):
> > * http://hg.linux-ha.org/dev/rev/30b947bd77e5
> > * http://hg.linux-ha.org/dev/rev/8ff8ca1f9294
> > * http://hg.linux-ha.org/dev/rev/5cc8305990e2
> >
>
> Hi
>
> I compiled revision 10101 and it seems many attrd leaks remain. I am
> now using a more complex test scenario. The Dummy resource is patched
> to not support reloads.
>
> I defined HA_MALLOC_TRACK, which seems to have a negative effect on BSC.
nod, it makes the crm very noisy and the output is not always correct
(the comms layer is often populating message queues, which throws off
the results).
my latest strategy, as per a separate mail on the subject, is to check
for memory that has not been free'd when the process exits. I'm having
success but it will be an iterative process.
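
The exit-time check described above can be sketched roughly as follows. This is only an illustration of the idea in Python, not the cl_malloc code: the AllocationTracker class, its method names and the call-site labels are all hypothetical. Every allocation is recorded, every free removes its record, and whatever is still recorded when the process exits is reported as a suspected leak.

```python
import atexit


class AllocationTracker:
    """Hypothetical stand-in for a tracking allocator (not the cl_malloc API)."""

    def __init__(self):
        self._live = {}      # handle -> (size in bytes, call-site label)
        self._next_handle = 0

    def alloc(self, size, site):
        """Record an allocation and return a handle for it."""
        handle = self._next_handle
        self._next_handle += 1
        self._live[handle] = (size, site)
        return handle

    def free(self, handle):
        """Forget an allocation; raises KeyError on a double free."""
        del self._live[handle]

    def outstanding(self):
        """Group still-live allocations by call site: site -> (count, bytes)."""
        by_site = {}
        for size, site in self._live.values():
            count, total = by_site.get(site, (0, 0))
            by_site[site] = (count + 1, total + size)
        return by_site


tracker = AllocationTracker()


def report_at_exit():
    """Print one line per call site that still holds memory at exit."""
    for site, (count, total) in sorted(tracker.outstanding().items()):
        print(f"LEAK: {count} block(s), {total} bytes, allocated at {site}")


atexit.register(report_at_exit)

# Illustrative usage with made-up call sites.
msg = tracker.alloc(64, "attrd: update message")
tmp = tracker.alloc(128, "attrd: scratch buffer")
tracker.free(tmp)                 # released properly
leaks = tracker.outstanding()     # 'msg' was never freed
```

Checking once at exit avoids the noise from message queues that are legitimately populated while the process runs, at the cost of only seeing leaks after a clean shutdown.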
> The details and logs are huge, so I put them at
> http://fornax.elf.stuba.sk/~palino/hb_10101_leaks.tar.bz2
>
> Memory of processes after first and last iteration:
>
> PID VIRT RES DATA SHR %MEM TIME+ S COMMAND
> Wed Feb 7 12:51:56 CET 2007 - mach14s10
> 411 3084 1208 536 744 0.0 0:10.15 S ha_logd: read process
> 412 2816 908 268 628 0.0 0:05.53 S ha_logd: write process
> 443 11372 10m 8336 2920 0.3 0:00.49 S heartbeat: master control process
> 456 4300 4300 1264 2920 0.1 0:00.00 S heartbeat: FIFO reader
> 457 4428 4428 1392 2920 0.1 0:00.05 S heartbeat: write: ucast eth4
> 458 4428 4428 1392 2920 0.1 0:00.04 S heartbeat: read: ucast eth4
> 459 4428 4428 1392 2920 0.1 0:00.06 S heartbeat: write: ucast eth5
> 460 4428 4428 1392 2920 0.1 0:00.03 S heartbeat: read: ucast eth5
> 461 4428 4428 1392 2920 0.1 0:00.36 S heartbeat: write: ping 10.54.0.6
> 462 4428 4428 1392 2920 0.1 0:00.08 S heartbeat: read: ping 10.54.0.6
> 463 4428 4428 1392 2920 0.1 0:00.28 S heartbeat: write: ping 10.55.0.4
> 464 4428 4428 1392 2920 0.1 0:00.12 S heartbeat: read: ping 10.55.0.4
> 465 4428 4428 1392 2920 0.1 0:00.28 S heartbeat: write: ping 10.42.100.1
> 466 4428 4428 1392 2920 0.1 0:00.05 S heartbeat: read: ping 10.42.100.1
> 477 2976 1312 264 1096 0.0 0:00.01 S /usr/local/lib/heartbeat/pingd -m
> 478 3356 1396 704 1112 0.0 0:00.02 S /usr/local/lib/heartbeat/ccm
> 479 6440 3356 2180 1624 0.1 0:08.13 S /usr/local/lib/heartbeat/cib
> 480 3300 1608 528 1144 0.0 0:00.13 S /usr/local/lib/heartbeat/lrmd -r
> 481 3060 3060 392 2572 0.1 0:00.00 S /usr/local/lib/heartbeat/stonithd
> 482 4092 2388 1312 1168 0.1 0:00.15 S /usr/local/lib/heartbeat/attrd
> 483 6260 4104 2956 1684 0.1 0:01.61 S /usr/local/lib/heartbeat/crmd
> 550 3924 2312 1052 1220 0.1 0:00.50 S /usr/local/lib/heartbeat/tengine
> 551 4752 3096 1760 1360 0.1 0:01.14 S /usr/local/lib/heartbeat/pengine
> ...
> Wed Feb 7 14:02:10 CET 2007 - mach14s10
> 411 3084 1268 536 744 0.0 1:49.60 S ha_logd: read process
> 412 2816 908 268 628 0.0 0:56.62 S ha_logd: write process
> 443 11372 10m 8336 2920 0.3 0:04.65 S heartbeat: master control process
> 456 4300 4300 1264 2920 0.1 0:00.00 S heartbeat: FIFO reader
> 457 4428 4428 1392 2920 0.1 0:00.68 S heartbeat: write: ucast eth4
> 458 4428 4428 1392 2920 0.1 0:00.57 S heartbeat: read: ucast eth4
> 459 4428 4428 1392 2920 0.1 0:00.56 S heartbeat: write: ucast eth5
> 460 4428 4428 1392 2920 0.1 0:00.50 S heartbeat: read: ucast eth5
> 461 4428 4428 1392 2920 0.1 0:04.19 S heartbeat: write: ping 10.54.0.6
> 462 4428 4428 1392 2920 0.1 0:03.26 S heartbeat: read: ping 10.54.0.6
> 463 4428 4428 1392 2920 0.1 0:03.12 S heartbeat: write: ping 10.55.0.4
> 464 4428 4428 1392 2920 0.1 0:02.60 S heartbeat: read: ping 10.55.0.4
> 465 4428 4428 1392 2920 0.1 0:02.56 S heartbeat: write: ping 10.42.100.1
> 466 4428 4428 1392 2920 0.1 0:03.05 S heartbeat: read: ping 10.42.100.1
> 477 2976 1312 264 1096 0.0 0:00.05 S /usr/local/lib/heartbeat/pingd -m
> 478 3488 1516 836 1112 0.0 0:00.06 S /usr/local/lib/heartbeat/ccm
> 479 6440 3396 2180 1640 0.1 1:29.81 S /usr/local/lib/heartbeat/cib
> 480 3300 1608 528 1144 0.0 0:01.25 S /usr/local/lib/heartbeat/lrmd -r
> 481 3060 3060 392 2572 0.1 0:00.04 S /usr/local/lib/heartbeat/stonithd
> 482 15972 13m 12m 1168 0.4 0:02.68 S /usr/local/lib/heartbeat/attrd
> 483 16952 14m 13m 1684 0.4 0:20.32 S /usr/local/lib/heartbeat/crmd
> 550 4848 3236 1976 1220 0.1 0:05.47 S /usr/local/lib/heartbeat/tengine
> 551 4824 3180 1832 1360 0.1 0:15.31 S /usr/local/lib/heartbeat/pengine
>
> Summary of 70-minute test:
> attrd: +11 MB, increase after each test loop
> crmd: +10 MB, increase after each test loop
> tengine: +924 KB
> pengine: +72 KB, very random increase
> ccm: +132 KB, very random increase
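
The small-process figures in this summary match the difference in the VIRT column between the first and last snapshot, and the attrd and crmd figures are consistent with it. A minimal sketch of that calculation follows, assuming the column order shown in the header line (PID VIRT RES DATA SHR %MEM TIME+ S COMMAND); the helper names to_kb and parse_snapshot are made up for this example.

```python
def to_kb(field):
    """Convert a top memory field such as '4092' or '13m' to KB."""
    if field.endswith("m"):
        return int(float(field[:-1]) * 1024)
    return int(field)


def parse_snapshot(text):
    """Return {command: VIRT in KB} for rows laid out as
    PID VIRT RES DATA SHR %MEM TIME+ S COMMAND."""
    virt = {}
    for line in text.strip().splitlines():
        fields = line.split()
        command = " ".join(fields[8:])   # COMMAND may contain spaces
        virt[command] = to_kb(fields[1])
    return virt


# Rows copied from the first and last snapshots quoted above.
first = """
478 3356 1396 704 1112 0.0 0:00.02 S /usr/local/lib/heartbeat/ccm
482 4092 2388 1312 1168 0.1 0:00.15 S /usr/local/lib/heartbeat/attrd
483 6260 4104 2956 1684 0.1 0:01.61 S /usr/local/lib/heartbeat/crmd
550 3924 2312 1052 1220 0.1 0:00.50 S /usr/local/lib/heartbeat/tengine
551 4752 3096 1760 1360 0.1 0:01.14 S /usr/local/lib/heartbeat/pengine
"""
last = """
478 3488 1516 836 1112 0.0 0:00.06 S /usr/local/lib/heartbeat/ccm
482 15972 13m 12m 1168 0.4 0:02.68 S /usr/local/lib/heartbeat/attrd
483 16952 14m 13m 1684 0.4 0:20.32 S /usr/local/lib/heartbeat/crmd
550 4848 3236 1976 1220 0.1 0:05.47 S /usr/local/lib/heartbeat/tengine
551 4824 3180 1832 1360 0.1 0:15.31 S /usr/local/lib/heartbeat/pengine
"""

before = parse_snapshot(first)
after = parse_snapshot(last)
growth = {cmd: after[cmd] - before[cmd] for cmd in before}

for cmd, delta in sorted(growth.items(), key=lambda item: -item[1]):
    print(f"{cmd.rsplit('/', 1)[-1]}: +{delta} KB")
```

For these rows the VIRT growth is 11880 KB for attrd, 10692 KB for crmd, 924 KB for tengine, 132 KB for ccm and 72 KB for pengine.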
>
> Palo
>
>
> > > On 2/4/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > > Hi
> > > >
> > > > I started another type of testing - simulation of disconnecting cables
> > > > with iptables. Failovers between nodes are triggered by blocking ICMP
> > > > responses from ping nodes (see script.txt).
> > > >
> > > > Two more processes are leaking:
> > > > attrd eats 396 KB per while loop
> > > > ccm sometimes displays messages like the following:
> > > > ccm: [27757]: WARN: leaking memory? previous arena=3108864 present arena=3244032
> > > > (very small memory increase)
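
To put a number on the ccm warning, the two arena values can be subtracted directly. The sketch below assumes both values are byte counts; the function name arena_growth_kb and the regular expression are made up for this example and are not part of heartbeat.

```python
import re

# Matches the two numbers in a "leaking memory?" warning line.
ARENA_RE = re.compile(r"previous arena=(\d+) present arena=(\d+)")


def arena_growth_kb(log_line):
    """Return arena growth in KB, or None if the line is not an arena warning."""
    match = ARENA_RE.search(log_line)
    if match is None:
        return None
    previous, present = (int(group) for group in match.groups())
    return (present - previous) / 1024


line = ("ccm: [27757]: WARN: leaking memory? "
        "previous arena=3108864 present arena=3244032")
growth_kb = arena_growth_kb(line)
print(f"arena grew by {growth_kb:.0f} KB")
```

Under that assumption the warning above corresponds to a growth of 132 KB.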
> > > >
> > > > Configuration is similar to previous post, only Dummy resource is
> > > > replaced by custom one.
> > > >
> > > > For my tests it is annoying that heartbeat eats hundreds of megabytes
> > > > after some hours/days. Can I help you to make fixes sooner?
> > > > What are the best configure switches for memory leak detection
> > > > (--enable-dmalloc/--enable-crm-dev/--enable-crm-dmalloc/--enable-crm-force-malloc)?
> > > > Is it better to make up simple testcases (less resources, less
> > > > operations) or the complex testcase, which contains all possible
> > > > memory leaks?
> > > > Should I use latest dev sources or latest stable sources?
> > > > (I would like to have fixes against 2.0.8 currently)
> > > >
> > > > The output of script for node sk16251c:
> > > > PID VIRT RES DATA SHR %MEM TIME+ S COMMAND
> > > > Fri Feb 2 18:26:13 CET 2007 - sk16251c
> > > > 27708 2944 1056 396 744 0.2 0:00.88 S ha_logd: read process
> > > > 27713 2812 864 264 620 0.2 0:00.87 S ha_logd: write process
> > > > 27756 2976 1284 264 1084 0.3 0:00.01 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
> > > > 27757 3356 1368 704 1104 0.3 0:00.01 S /usr/local/lib/heartbeat/ccm
> > > > 27758 4452 2308 1356 1388 0.5 0:10.35 S /usr/local/lib/heartbeat/cib
> > > > 27759 3168 1488 396 1136 0.3 0:00.25 S /usr/local/lib/heartbeat/lrmd -r
> > > > 27760 3060 3060 392 2572 0.6 0:00.00 S /usr/local/lib/heartbeat/stonithd
> > > > 27761 3968 2316 1188 1164 0.5 0:00.19 S /usr/local/lib/heartbeat/attrd
> > > > 27762 5500 3416 2192 1680 0.7 0:01.50 S /usr/local/lib/heartbeat/crmd
> > > > 27769 3660 1896 788 1196 0.4 0:00.46 S /usr/local/lib/heartbeat/tengine
> > > > 27770 4404 2596 1132 1416 0.5 0:02.39 S /usr/local/lib/heartbeat/pengine
> > > > ...
> > > > Sat Feb 3 01:54:47 CET 2007 - sk16251c
> > > > 27708 2944 1076 396 744 0.2 0:52.57 S ha_logd: read process
> > > > 27713 2812 876 264 620 0.2 0:43.66 S ha_logd: write process
> > > > 27756 2976 1284 264 1084 0.3 0:00.16 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
> > > > 27757 4016 2056 1364 1104 0.4 0:00.26 S /usr/local/lib/heartbeat/ccm
> > > > 27758 4452 2352 1356 1404 0.5 9:34.45 S /usr/local/lib/heartbeat/cib
> > > > 27759 3168 1500 396 1140 0.3 0:08.75 S /usr/local/lib/heartbeat/lrmd -r
> > > > 27760 3060 3060 392 2572 0.6 0:00.29 S /usr/local/lib/heartbeat/stonithd
> > > > 27761 69440 66m 65m 1164 13.4 0:17.46 S /usr/local/lib/heartbeat/attrd
> > > > 27762 34540 31m 30m 1680 6.4 1:41.60 S /usr/local/lib/heartbeat/crmd
> > > > 27769 3660 1900 788 1200 0.4 0:18.97 S /usr/local/lib/heartbeat/tengine
> > > > 27770 4980 3148 1708 1416 0.6 2:31.45 S /usr/local/lib/heartbeat/pengine
> > > >
> > > >
> > > > Palo
> > > >
> > > >
> > > > On 1/29/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > > > Hi
> > > > >
> > > > > I found memory leaks in the described processes when doing the
> > > > > following failovers:
> > > > > deboserver -> pgbook: with crm_standby
> > > > > pgbook -> deboserver: failing monitor operation of resource Dummy
> > > > > Frequency is 2 failovers per minute. Script and configuration attached.
> > > > > The crmd leak is the most pronounced: 132 KB per failover.
> > > > > pengine displays "Potential memory leak detected" messages, which
> > > > > should already be fixed upstream.
> > > > >
> > > > > Output:
> > > > > PID USER VIRT RES DATA SHR %MEM TIME+ S COMMAND
> > > > > Mon Jan 29 15:30:47 CET 2007
> > > > > 3437 hacluste 6152 2844 1492 1816 0.6 0:00.18 S crmd
> > > > > 3443 hacluste 5020 2084 796 1340 0.4 0:00.08 S tengine
> > > > > 3444 hacluste 5560 2564 940 1548 0.5 0:00.10 S pengine
> > > > > Mon Jan 29 15:31:13 CET 2007
> > > > > 3437 hacluste 6304 2980 1644 1820 0.6 0:00.36 S crmd
> > > > > 3443 hacluste 5020 2104 796 1352 0.4 0:00.15 S tengine
> > > > > 3444 hacluste 5768 2724 1148 1552 0.5 0:00.35 S pengine
> > > > > ...
> > > > > Mon Jan 29 15:34:17 CET 2007
> > > > > 3437 hacluste 7360 4096 2700 1820 0.8 0:01.63 S crmd
> > > > > 3443 hacluste 5152 2272 928 1352 0.4 0:00.61 S tengine
> > > > > 3444 hacluste 5768 2760 1148 1552 0.5 0:02.31 S pengine
> > > > > ...
> > > > > Mon Jan 29 15:48:19 CET 2007
> > > > > 3437 hacluste 12376 9084 7716 1820 1.8 0:07.75 S crmd
> > > > > 3443 hacluste 6472 3604 2248 1352 0.7 0:02.76 S tengine
> > > > > 3444 hacluste 5768 2804 1148 1552 0.5 0:11.46 S pengine
> > > > > Mon Jan 29 15:48:46 CET 2007
> > > > > 3437 hacluste 12508 9240 7848 1820 1.8 0:07.92 S crmd
> > > > > 3443 hacluste 6472 3648 2248 1352 0.7 0:02.81 S tengine
> > > > > 3444 hacluste 5840 2808 1220 1552 0.5 0:11.73 S pengine
> > > > > ...
> > > > > Mon Jan 29 16:16:26 CET 2007
> > > > > 3437 hacluste 22276 18m 17m 1820 3.7 0:19.82 S crmd
> > > > > 3443 hacluste 9244 6324 5020 1352 1.2 0:07.04 S tengine
> > > > > 3444 hacluste 5912 2888 1292 1552 0.6 0:29.18 S pengine
> > > > >
> > > > >
> > > > > I used stable 2.0.8 sources with minor modifications from upstream
> > > > > (see attached patch).
> > > > >
> > > > > Palo
>