[Linux-HA] Re: memory leaks of crmd and tengine in 2.0.8
Pavol Gono
palo.gono at gmail.com
Wed Feb 7 07:58:37 MST 2007
On 2/6/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> On 2/5/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > Hi Pavol,
> >
> > Sorry for the delay, I'm not ignoring you, I've just been busy elsewhere.
> >
> > If I'm reading your data correctly, attrd and crmd seem to be the
> > worst offenders with tengine a bit behind. I'd not realized the
> > numbers were so extreme :-(
> >
> > If you look in lib/cl_plumbing/cl_malloc.c, there are a number of
> > #defines that may help track this down. I will start tackling this
> > tomorrow (starting with attrd, given its low complexity).
>
> If you're inclined, you could rerun your tests to verify my attrd changes.
>
> Patches are (in order):
> * http://hg.linux-ha.org/dev/rev/30b947bd77e5
> * http://hg.linux-ha.org/dev/rev/8ff8ca1f9294
> * http://hg.linux-ha.org/dev/rev/5cc8305990e2
>
Hi
I compiled revision 10101 and it seems many attrd leaks remain. I am
now using a more complex test scenario. The Dummy resource is patched
so that it does not support reloads.
I defined HA_MALLOC_TRACK, which seems to have a negative effect on BSC.
The full details and logs are huge, so I put them at
http://fornax.elf.stuba.sk/~palino/hb_10101_leaks.tar.bz2
Memory of processes after first and last iteration:
PID VIRT RES DATA SHR %MEM TIME+ S COMMAND
Wed Feb 7 12:51:56 CET 2007 - mach14s10
411 3084 1208 536 744 0.0 0:10.15 S ha_logd: read process
412 2816 908 268 628 0.0 0:05.53 S ha_logd: write process
443 11372 10m 8336 2920 0.3 0:00.49 S heartbeat: master control process
456 4300 4300 1264 2920 0.1 0:00.00 S heartbeat: FIFO reader
457 4428 4428 1392 2920 0.1 0:00.05 S heartbeat: write: ucast eth4
458 4428 4428 1392 2920 0.1 0:00.04 S heartbeat: read: ucast eth4
459 4428 4428 1392 2920 0.1 0:00.06 S heartbeat: write: ucast eth5
460 4428 4428 1392 2920 0.1 0:00.03 S heartbeat: read: ucast eth5
461 4428 4428 1392 2920 0.1 0:00.36 S heartbeat: write: ping 10.54.0.6
462 4428 4428 1392 2920 0.1 0:00.08 S heartbeat: read: ping 10.54.0.6
463 4428 4428 1392 2920 0.1 0:00.28 S heartbeat: write: ping 10.55.0.4
464 4428 4428 1392 2920 0.1 0:00.12 S heartbeat: read: ping 10.55.0.4
465 4428 4428 1392 2920 0.1 0:00.28 S heartbeat: write: ping 10.42.100.1
466 4428 4428 1392 2920 0.1 0:00.05 S heartbeat: read: ping 10.42.100.1
477 2976 1312 264 1096 0.0 0:00.01 S /usr/local/lib/heartbeat/pingd -m
478 3356 1396 704 1112 0.0 0:00.02 S /usr/local/lib/heartbeat/ccm
479 6440 3356 2180 1624 0.1 0:08.13 S /usr/local/lib/heartbeat/cib
480 3300 1608 528 1144 0.0 0:00.13 S /usr/local/lib/heartbeat/lrmd -r
481 3060 3060 392 2572 0.1 0:00.00 S /usr/local/lib/heartbeat/stonithd
482 4092 2388 1312 1168 0.1 0:00.15 S /usr/local/lib/heartbeat/attrd
483 6260 4104 2956 1684 0.1 0:01.61 S /usr/local/lib/heartbeat/crmd
550 3924 2312 1052 1220 0.1 0:00.50 S /usr/local/lib/heartbeat/tengine
551 4752 3096 1760 1360 0.1 0:01.14 S /usr/local/lib/heartbeat/pengine
...
Wed Feb 7 14:02:10 CET 2007 - mach14s10
411 3084 1268 536 744 0.0 1:49.60 S ha_logd: read process
412 2816 908 268 628 0.0 0:56.62 S ha_logd: write process
443 11372 10m 8336 2920 0.3 0:04.65 S heartbeat: master control process
456 4300 4300 1264 2920 0.1 0:00.00 S heartbeat: FIFO reader
457 4428 4428 1392 2920 0.1 0:00.68 S heartbeat: write: ucast eth4
458 4428 4428 1392 2920 0.1 0:00.57 S heartbeat: read: ucast eth4
459 4428 4428 1392 2920 0.1 0:00.56 S heartbeat: write: ucast eth5
460 4428 4428 1392 2920 0.1 0:00.50 S heartbeat: read: ucast eth5
461 4428 4428 1392 2920 0.1 0:04.19 S heartbeat: write: ping 10.54.0.6
462 4428 4428 1392 2920 0.1 0:03.26 S heartbeat: read: ping 10.54.0.6
463 4428 4428 1392 2920 0.1 0:03.12 S heartbeat: write: ping 10.55.0.4
464 4428 4428 1392 2920 0.1 0:02.60 S heartbeat: read: ping 10.55.0.4
465 4428 4428 1392 2920 0.1 0:02.56 S heartbeat: write: ping 10.42.100.1
466 4428 4428 1392 2920 0.1 0:03.05 S heartbeat: read: ping 10.42.100.1
477 2976 1312 264 1096 0.0 0:00.05 S /usr/local/lib/heartbeat/pingd -m
478 3488 1516 836 1112 0.0 0:00.06 S /usr/local/lib/heartbeat/ccm
479 6440 3396 2180 1640 0.1 1:29.81 S /usr/local/lib/heartbeat/cib
480 3300 1608 528 1144 0.0 0:01.25 S /usr/local/lib/heartbeat/lrmd -r
481 3060 3060 392 2572 0.1 0:00.04 S /usr/local/lib/heartbeat/stonithd
482 15972 13m 12m 1168 0.4 0:02.68 S /usr/local/lib/heartbeat/attrd
483 16952 14m 13m 1684 0.4 0:20.32 S /usr/local/lib/heartbeat/crmd
550 4848 3236 1976 1220 0.1 0:05.47 S /usr/local/lib/heartbeat/tengine
551 4824 3180 1832 1360 0.1 0:15.31 S /usr/local/lib/heartbeat/pengine
Summary of 70-minute test:
attrd: +11 MB, increase after each test loop
crmd: +10 MB, increase after each test loop
tengine: +924 KB
pengine: +72 KB, very random increase
ccm: +132 KB, very random increase
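
The growth figures above can be recomputed from the log files instead of being read off by hand. The following is only a sketch and is not part of the attached script: the function name virt_growth is made up for illustration, and it assumes the log layout shown above (numeric PID in the first column, VIRT in the second, command from the ninth column on) with sizes in KB unless top appended an "m" suffix.

```shell
# Print VIRT growth (last sample minus first sample, in KB) for every PID
# found in a log with the layout: PID VIRT RES DATA SHR %MEM TIME+ S COMMAND
# Header, date and "..." lines are skipped because they do not start with a PID.
virt_growth() {
    awk '
        # convert a top size field to KB; a trailing "m" is taken to mean MB
        function kb(v) { return (v ~ /m$/) ? (v + 0) * 1024 : v + 0 }

        $1 ~ /^[0-9]+$/ {
            pid = $1
            if (!(pid in first)) {
                first[pid] = kb($2)
                # rebuild the command, which may contain spaces
                cmd = $9
                for (i = 10; i <= NF; i++) cmd = cmd " " $i
                name[pid] = cmd
                order[++n] = pid
            }
            last[pid] = kb($2)      # overwritten until the final sample
        }

        END {
            for (i = 1; i <= n; i++) {
                pid = order[i]
                printf "%s %s %+d KB\n", pid, name[pid], last[pid] - first[pid]
            }
        }
    ' "$1"
}
```

It would be called with one of the log files the script writes, for example `virt_growth log-mach14s10`. Because it compares only the first and last sample of each PID, it says nothing about whether the growth is steady or irregular between them.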
Palo
> > On 2/4/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > Hi
> > >
> > > I started another type of testing - simulation of disconnecting cables
> > > with iptables. Failovers between nodes are triggered by blocking ICMP
> > > responses from ping nodes (see script.txt).
> > >
> > > There are two more leaking processes:
> > > attrd eats 396 KB per iteration of the while loop
> > > ccm sometimes displays messages of the following type:
> > > ccm: [27757]: WARN: leaking memory? previous arena=3108864 present arena=3244032
> > > (very small memory increase)
> > >
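
The size of the increase reported in the ccm warning quoted above can be worked out from the two arena values. This is a small illustrative sketch, assuming the message has been unwrapped onto one line; the function name arena_delta is hypothetical and not part of heartbeat.

```shell
# Extract the two arena sizes from a ccm "leaking memory?" warning and
# print the difference in bytes and in whole KB.
arena_delta() {
    printf '%s\n' "$1" |
    sed -n 's/.*previous arena=\([0-9][0-9]*\) present arena=\([0-9][0-9]*\).*/\1 \2/p' |
    while read -r prev cur; do
        echo "$((cur - prev)) bytes ($(( (cur - prev) / 1024 )) KB)"
    done
}

arena_delta 'ccm: [27757]: WARN: leaking memory? previous arena=3108864 present arena=3244032'
# prints: 135168 bytes (132 KB)
```

Lines that do not contain both arena values produce no output, so the function can be applied to each line of a log without filtering first.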
> > > Configuration is similar to previous post, only Dummy resource is
> > > replaced by custom one.
> > >
> > > For my tests it is annoying that heartbeat eats hundreds of megabytes
> > > after some hours/days. Can I help you to make fixes sooner?
> > > What are the best configure switches for memory leak detection
> > > (--enable-dmalloc/--enable-crm-dev/--enable-crm-dmalloc/--enable-crm-force-malloc)?
> > > Is it better to make up simple testcases (fewer resources, fewer
> > > operations) or one complex testcase that contains all possible
> > > memory leaks?
> > > Should I use the latest dev sources or the latest stable sources?
> > > (I would currently like to have fixes against 2.0.8.)
> > >
> > > The output of script for node sk16251c:
> > > PID VIRT RES DATA SHR %MEM TIME+ S COMMAND
> > > Fri Feb 2 18:26:13 CET 2007 - sk16251c
> > > 27708 2944 1056 396 744 0.2 0:00.88 S ha_logd: read process
> > > 27713 2812 864 264 620 0.2 0:00.87 S ha_logd: write process
> > > 27756 2976 1284 264 1084 0.3 0:00.01 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
> > > 27757 3356 1368 704 1104 0.3 0:00.01 S /usr/local/lib/heartbeat/ccm
> > > 27758 4452 2308 1356 1388 0.5 0:10.35 S /usr/local/lib/heartbeat/cib
> > > 27759 3168 1488 396 1136 0.3 0:00.25 S /usr/local/lib/heartbeat/lrmd -r
> > > 27760 3060 3060 392 2572 0.6 0:00.00 S /usr/local/lib/heartbeat/stonithd
> > > 27761 3968 2316 1188 1164 0.5 0:00.19 S /usr/local/lib/heartbeat/attrd
> > > 27762 5500 3416 2192 1680 0.7 0:01.50 S /usr/local/lib/heartbeat/crmd
> > > 27769 3660 1896 788 1196 0.4 0:00.46 S /usr/local/lib/heartbeat/tengine
> > > 27770 4404 2596 1132 1416 0.5 0:02.39 S /usr/local/lib/heartbeat/pengine
> > > ...
> > > Sat Feb 3 01:54:47 CET 2007 - sk16251c
> > > 27708 2944 1076 396 744 0.2 0:52.57 S ha_logd: read process
> > > 27713 2812 876 264 620 0.2 0:43.66 S ha_logd: write process
> > > 27756 2976 1284 264 1084 0.3 0:00.16 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
> > > 27757 4016 2056 1364 1104 0.4 0:00.26 S /usr/local/lib/heartbeat/ccm
> > > 27758 4452 2352 1356 1404 0.5 9:34.45 S /usr/local/lib/heartbeat/cib
> > > 27759 3168 1500 396 1140 0.3 0:08.75 S /usr/local/lib/heartbeat/lrmd -r
> > > 27760 3060 3060 392 2572 0.6 0:00.29 S /usr/local/lib/heartbeat/stonithd
> > > 27761 69440 66m 65m 1164 13.4 0:17.46 S /usr/local/lib/heartbeat/attrd
> > > 27762 34540 31m 30m 1680 6.4 1:41.60 S /usr/local/lib/heartbeat/crmd
> > > 27769 3660 1900 788 1200 0.4 0:18.97 S /usr/local/lib/heartbeat/tengine
> > > 27770 4980 3148 1708 1416 0.6 2:31.45 S /usr/local/lib/heartbeat/pengine
> > >
> > >
> > > Palo
> > >
> > >
> > > On 1/29/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > > Hi
> > > >
> > > > I found memory leaks in the processes named in the subject when
> > > > doing the following failovers:
> > > > deboserver -> pgbook: with crm_standby
> > > > pgbook -> deboserver: failing monitor operation of resource Dummy
> > > > Frequency is 2 failovers per minute. Script and configuration
> > > > attached.
> > > > The memory leaks of crmd are the most pronounced: 132 KB per failover.
> > > > pengine displays the "Potential memory leak detected" messages, which
> > > > should already be fixed upstream.
> > > >
> > > > Output:
> > > > PID USER VIRT RES DATA SHR %MEM TIME+ S COMMAND
> > > > Mon Jan 29 15:30:47 CET 2007
> > > > 3437 hacluste 6152 2844 1492 1816 0.6 0:00.18 S crmd
> > > > 3443 hacluste 5020 2084 796 1340 0.4 0:00.08 S tengine
> > > > 3444 hacluste 5560 2564 940 1548 0.5 0:00.10 S pengine
> > > > Mon Jan 29 15:31:13 CET 2007
> > > > 3437 hacluste 6304 2980 1644 1820 0.6 0:00.36 S crmd
> > > > 3443 hacluste 5020 2104 796 1352 0.4 0:00.15 S tengine
> > > > 3444 hacluste 5768 2724 1148 1552 0.5 0:00.35 S pengine
> > > > ...
> > > > Mon Jan 29 15:34:17 CET 2007
> > > > 3437 hacluste 7360 4096 2700 1820 0.8 0:01.63 S crmd
> > > > 3443 hacluste 5152 2272 928 1352 0.4 0:00.61 S tengine
> > > > 3444 hacluste 5768 2760 1148 1552 0.5 0:02.31 S pengine
> > > > ...
> > > > Mon Jan 29 15:48:19 CET 2007
> > > > 3437 hacluste 12376 9084 7716 1820 1.8 0:07.75 S crmd
> > > > 3443 hacluste 6472 3604 2248 1352 0.7 0:02.76 S tengine
> > > > 3444 hacluste 5768 2804 1148 1552 0.5 0:11.46 S pengine
> > > > Mon Jan 29 15:48:46 CET 2007
> > > > 3437 hacluste 12508 9240 7848 1820 1.8 0:07.92 S crmd
> > > > 3443 hacluste 6472 3648 2248 1352 0.7 0:02.81 S tengine
> > > > 3444 hacluste 5840 2808 1220 1552 0.5 0:11.73 S pengine
> > > > ...
> > > > Mon Jan 29 16:16:26 CET 2007
> > > > 3437 hacluste 22276 18m 17m 1820 3.7 0:19.82 S crmd
> > > > 3443 hacluste 9244 6324 5020 1352 1.2 0:07.04 S tengine
> > > > 3444 hacluste 5912 2888 1292 1552 0.6 0:29.18 S pengine
> > > >
> > > >
> > > > I used stable 2.0.8 sources with minor modifications from upstream
> > > > (see attached patch).
> > > >
> > > > Palo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch_hb_10101m.diff
Type: text/x-diff
Size: 2972 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20070207/4e3451fd/patch_hb_10101m-0001.bin
-------------- next part --------------
#!/bin/sh
# Configurable stuff
# local node
NODE1='mach14s10'
# peer node
NODE2='mach13s10'
NODE1_IF1_IP='10.54.0.14'
NODE1_IF2_IP='10.55.0.14'
NODE2_IF1_IP='10.54.0.13'
NODE2_IF2_IP='10.55.0.13'
PING_NODES='10.54.0.6 10.55.0.4 10.42.100.1'
LOG_FILE_LOCAL="log-$NODE1"
LOG_FILE_PEER="log-$NODE2"
# how to connect to peer node (without password)
SSH_CMD="ssh root@$NODE2_IF1_IP"
# the first resource must be x_Dummy
RESOURCES='x_Dummy x_IPaddrL x_IPaddrR'
# pattern for egrep, to parse all heartbeat's processes from top
PROC_PATTERN='\<(ha_logd|pingd|ccm|cib|[lc]rmd|stonithd|attrd|[tp]engine)\>| heartbeat: '
PING_NODE1_CHAIN="INPUT -s $(echo $PING_NODES | awk '{print $1}') -p icmp -j DROP"
PING_NODE2_CHAIN="INPUT -s $(echo $PING_NODES | awk '{print $2}') -p icmp -j DROP"
LOCAL_LINK1_CHAIN="INPUT -d $NODE1_IF1_IP -p udp --dport 694 -j DROP"
LOCAL_LINK2_CHAIN="INPUT -d $NODE1_IF2_IP -p udp --dport 694 -j DROP"
PEER_LINK1_CHAIN="INPUT -d $NODE2_IF1_IP -p udp --dport 694 -j DROP"
PEER_LINK2_CHAIN="INPUT -d $NODE2_IF2_IP -p udp --dport 694 -j DROP"
FIRST_RESOURCE="$(echo $RESOURCES | awk '{print $1}')"
LOG_COUNTER=0
# write a numbered marker to syslog on both nodes
my_log() {
    LOG_COUNTER=$(($LOG_COUNTER+1))
    logger "$1 $LOG_COUNTER"
    $SSH_CMD logger "$1 $LOG_COUNTER"
}
echo "Rule for blocking ping node 1: $PING_NODE1_CHAIN"
echo "Rule for blocking ping node 2: $PING_NODE2_CHAIN"
echo "Blocking incoming heartbeats node 1 link 1: $LOCAL_LINK1_CHAIN"
echo "Blocking incoming heartbeats node 1 link 2: $LOCAL_LINK2_CHAIN"
echo "Blocking incoming heartbeats node 2 link 1: $PEER_LINK1_CHAIN"
echo "Blocking incoming heartbeats node 2 link 2: $PEER_LINK2_CHAIN"
iptables -D $PING_NODE1_CHAIN 2>/dev/null
iptables -D $PING_NODE2_CHAIN 2>/dev/null
$SSH_CMD iptables -D $PING_NODE1_CHAIN 2>/dev/null
$SSH_CMD iptables -D $PING_NODE2_CHAIN 2>/dev/null
iptables -D $LOCAL_LINK1_CHAIN 2>/dev/null
iptables -D $LOCAL_LINK2_CHAIN 2>/dev/null
$SSH_CMD iptables -D $PEER_LINK1_CHAIN 2>/dev/null
$SSH_CMD iptables -D $PEER_LINK2_CHAIN 2>/dev/null
top -bn1 | egrep '\<PID\>' | egrep -v grep > "$LOG_FILE_LOCAL"
$SSH_CMD top -bn1 | egrep '\<PID\>' > "$LOG_FILE_PEER"
for i in $RESOURCES ; do
    crm_failcount -D -r$i -U"$NODE1" 2>/dev/null
    crm_failcount -D -r$i -U"$NODE2" 2>/dev/null
done
crm_standby -D -U"$NODE1" 2>/dev/null
crm_standby -D -U"$NODE2" 2>/dev/null
echo -n "Press Enter to continue..."
# a bare "read" with no variable name fails in some /bin/sh implementations (e.g. dash)
read dummy
echo "Starting the test loop at $(date) on $NODE1 and $NODE2"
safe_disconnect() {
    # safe disconnecting of local links (one direction)
    my_log LLLL$1
    iptables -A $LOCAL_LINK1_CHAIN
    sleep 5
    my_log MMMM$1
    iptables -D $LOCAL_LINK1_CHAIN
    sleep 5
    my_log NNNN$1
    iptables -A $LOCAL_LINK2_CHAIN
    sleep 5
    my_log OOOO$1
    iptables -D $LOCAL_LINK2_CHAIN
    sleep 5
    # safe disconnecting of peer links (one direction)
    my_log PPPP$1
    $SSH_CMD iptables -A $PEER_LINK1_CHAIN
    sleep 5
    my_log QQQQ$1
    $SSH_CMD iptables -D $PEER_LINK1_CHAIN
    sleep 5
    my_log RRRR$1
    $SSH_CMD iptables -A $PEER_LINK2_CHAIN
    sleep 5
    my_log SSSS$1
    $SSH_CMD iptables -D $PEER_LINK2_CHAIN
    sleep 5
    # safe disconnecting of links (both directions)
    my_log TTTT$1
    iptables -A $LOCAL_LINK1_CHAIN
    $SSH_CMD iptables -A $PEER_LINK1_CHAIN
    sleep 5
    my_log UUUU$1
    iptables -D $LOCAL_LINK1_CHAIN
    $SSH_CMD iptables -D $PEER_LINK1_CHAIN
    sleep 5
    my_log VVVV$1
    iptables -A $LOCAL_LINK2_CHAIN
    $SSH_CMD iptables -A $PEER_LINK2_CHAIN
    sleep 5
    my_log WWWW$1
    iptables -D $LOCAL_LINK2_CHAIN
    $SSH_CMD iptables -D $PEER_LINK2_CHAIN
    sleep 5
    # safe removal of both connections to ping node
    my_log XXXX$1
    iptables -A $PING_NODE2_CHAIN
    $SSH_CMD iptables -A $PING_NODE2_CHAIN
    sleep 15
    my_log YYYY$1
    iptables -D $PING_NODE2_CHAIN
    $SSH_CMD iptables -D $PING_NODE2_CHAIN
    sleep 5
}
while : ; do
    echo "$(date) - $NODE1" >> "$LOG_FILE_LOCAL"
    top -bn1 | egrep "$PROC_PATTERN" | egrep -v grep | sort -n >> "$LOG_FILE_LOCAL"
    $SSH_CMD echo '$(date) -' "$NODE2" >> "$LOG_FILE_PEER"
    $SSH_CMD top -bn1 | egrep "$PROC_PATTERN" | sort -n >> "$LOG_FILE_PEER"
    # failover LOCAL->PEER
    my_log BBBB
    crm_standby -von -U"$NODE1"
    sleep 20
    my_log CCCC
    for i in $RESOURCES ; do
        crm_resource -C -r$i -H"$NODE1" &
    done
    my_log DDDD
    for i in $RESOURCES ; do
        crm_failcount -D -r$i -U"$NODE1" 2>/dev/null &
    done
    sleep 10
    my_log EEEE
    crm_standby -D -U"$NODE1"
    echo -n 1
    sleep 5
    # failover PEER->LOCAL
    my_log FFFF
    $SSH_CMD rm /tmp/a/a
    sleep 20
    for i in $RESOURCES ; do
        crm_failcount -D -r$i -U"$NODE1" 2>/dev/null &
        crm_failcount -D -r$i -U"$NODE2" 2>/dev/null &
    done
    echo -n 2
    sleep 5
    # failover LOCAL->PEER
    my_log GGGG
    iptables -A $PING_NODE1_CHAIN
    sleep 20
    my_log HHHH
    iptables -D $PING_NODE1_CHAIN
    sleep 10
    safe_disconnect 1
    for i in $RESOURCES ; do
        crm_failcount -D -r$i -U"$NODE1" 2>/dev/null &
        crm_failcount -D -r$i -U"$NODE2" 2>/dev/null &
    done
    echo -n 3
    sleep 5
    # failover PEER->LOCAL
    my_log IIII
    $SSH_CMD iptables -A $PING_NODE1_CHAIN
    sleep 20
    my_log JJJJ
    $SSH_CMD iptables -D $PING_NODE1_CHAIN
    sleep 10
    safe_disconnect 2
    echo -n 4
    sleep 5
done
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cib.start.xml
Type: text/xml
Size: 11440 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20070207/4e3451fd/cib.start-0001.bin