[Linux-HA] Re: memory leaks of crmd and tengine in 2.0.8
Pavol Gono
palo.gono at gmail.com
Sat Feb 3 21:22:47 MST 2007
Hi
I started another type of testing: simulating disconnected cables
with iptables. Failovers between the nodes are triggered by blocking ICMP
responses from the ping nodes (see script.txt).
There are two more leaking processes:
attrd grows by 396 KB per iteration of the while loop
ccm sometimes logs messages like the following:
ccm: [27757]: WARN: leaking memory? previous arena=3108864 present arena=3244032
(a very small memory increase)
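The arena values in that warning appear to be in bytes, so the line above amounts to 3244032 - 3108864 = 135168 bytes (132 KiB). A minimal sketch for pulling these deltas out of a log, assuming every warning uses exactly the "previous arena=N present arena=M" wording shown above; arena_delta is a hypothetical helper name, not part of heartbeat:

```shell
# Hypothetical helper: read ccm log lines on stdin and print, for each
# "leaking memory?" warning, how much the arena grew (present - previous).
# Assumes the exact "previous arena=N present arena=M" wording shown above.
arena_delta() {
    sed -n 's/.*previous arena=\([0-9]*\) present arena=\([0-9]*\).*/\1 \2/p' |
    while read prev present; do
        echo $(( present - prev ))
    done
}

echo 'ccm: [27757]: WARN: leaking memory? previous arena=3108864 present arena=3244032' |
    arena_delta    # prints 135168
```

Fed a whole log file (arena_delta < logfile), it prints one delta per matching warning and ignores all other lines.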
The configuration is similar to the one in my previous post; only the
Dummy resource has been replaced by a custom one.
For my tests it is annoying that heartbeat eats hundreds of megabytes
after several hours or days. Can I help get these fixed sooner?
What are the best configure switches for memory leak detection
(--enable-dmalloc/--enable-crm-dev/--enable-crm-dmalloc/--enable-crm-force-malloc)?
Is it better to write simple test cases (fewer resources, fewer
operations) or one complex test case that triggers all the possible
memory leaks?
Should I use the latest development sources or the latest stable sources?
(For now I would like to have fixes against 2.0.8.)
The output of the script for node sk16251c:
  PID  VIRT   RES  DATA   SHR %MEM   TIME+ S COMMAND
Fri Feb 2 18:26:13 CET 2007 - sk16251c
27708  2944  1056   396   744  0.2 0:00.88 S ha_logd: read process
27713  2812   864   264   620  0.2 0:00.87 S ha_logd: write process
27756  2976  1284   264  1084  0.3 0:00.01 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
27757  3356  1368   704  1104  0.3 0:00.01 S /usr/local/lib/heartbeat/ccm
27758  4452  2308  1356  1388  0.5 0:10.35 S /usr/local/lib/heartbeat/cib
27759  3168  1488   396  1136  0.3 0:00.25 S /usr/local/lib/heartbeat/lrmd -r
27760  3060  3060   392  2572  0.6 0:00.00 S /usr/local/lib/heartbeat/stonithd
27761  3968  2316  1188  1164  0.5 0:00.19 S /usr/local/lib/heartbeat/attrd
27762  5500  3416  2192  1680  0.7 0:01.50 S /usr/local/lib/heartbeat/crmd
27769  3660  1896   788  1196  0.4 0:00.46 S /usr/local/lib/heartbeat/tengine
27770  4404  2596  1132  1416  0.5 0:02.39 S /usr/local/lib/heartbeat/pengine
...
Sat Feb 3 01:54:47 CET 2007 - sk16251c
27708  2944  1076   396   744  0.2 0:52.57 S ha_logd: read process
27713  2812   876   264   620  0.2 0:43.66 S ha_logd: write process
27756  2976  1284   264  1084  0.3 0:00.16 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
27757  4016  2056  1364  1104  0.4 0:00.26 S /usr/local/lib/heartbeat/ccm
27758  4452  2352  1356  1404  0.5 9:34.45 S /usr/local/lib/heartbeat/cib
27759  3168  1500   396  1140  0.3 0:08.75 S /usr/local/lib/heartbeat/lrmd -r
27760  3060  3060   392  2572  0.6 0:00.29 S /usr/local/lib/heartbeat/stonithd
27761 69440   66m   65m  1164 13.4 0:17.46 S /usr/local/lib/heartbeat/attrd
27762 34540   31m   30m  1680  6.4 1:41.60 S /usr/local/lib/heartbeat/crmd
27769  3660  1900   788  1200  0.4 0:18.97 S /usr/local/lib/heartbeat/tengine
27770  4980  3148  1708  1416  0.6 2:31.45 S /usr/local/lib/heartbeat/pengine
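Once a value no longer fits its column, top apparently switches VIRT, RES and DATA to an "m" suffix, which makes the two snapshots awkward to compare by eye. A small sketch for turning them into KiB, assuming top prints either plain KiB or integers with an "m" (MiB) or "g" (GiB) suffix as in the output above; to_kb and growth_kb are hypothetical helpers:

```shell
# Hypothetical helpers for comparing top snapshots. Assumes top prints the
# memory columns either as plain KiB or as integers with an "m" (MiB) or
# "g" (GiB) suffix, as in the output above; decimals are not handled.
to_kb() {
    case "$1" in
        *g) echo $(( ${1%g} * 1024 * 1024 )) ;;
        *m) echo $(( ${1%m} * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# Growth in KiB from an earlier reading ($1) to a later one ($2)
growth_kb() {
    echo $(( $(to_kb "$2") - $(to_kb "$1") ))
}

# attrd DATA column: 1188 at 18:26:13, 65m at 01:54:47
growth_kb 1188 65m    # prints 65372
```

Suffixed values are only accurate to about 1 MiB, so for attrd this reads as roughly 64 MiB of growth in the 7.5 hours between the two snapshots.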
Palo
On 1/29/07, Pavol Gono <palo.gono at gmail.com> wrote:
> Hi
>
> I found memory leaks in the processes listed below when doing the
> following failovers:
> deboserver -> pgbook: with crm_standby
> pgbook -> deboserver: failing monitor operation of resource Dummy
> The frequency is 2 failovers per minute. The script and configuration are attached.
> The memory leaks in crmd are the most pronounced: 132 KB per failover.
> pengine logs "Potential memory leak detected" messages, but that
> should already be fixed upstream.
>
> Output:
>   PID USER      VIRT   RES  DATA   SHR %MEM   TIME+ S COMMAND
> Mon Jan 29 15:30:47 CET 2007
>  3437 hacluste  6152  2844  1492  1816  0.6 0:00.18 S crmd
>  3443 hacluste  5020  2084   796  1340  0.4 0:00.08 S tengine
>  3444 hacluste  5560  2564   940  1548  0.5 0:00.10 S pengine
> Mon Jan 29 15:31:13 CET 2007
>  3437 hacluste  6304  2980  1644  1820  0.6 0:00.36 S crmd
>  3443 hacluste  5020  2104   796  1352  0.4 0:00.15 S tengine
>  3444 hacluste  5768  2724  1148  1552  0.5 0:00.35 S pengine
> ...
> Mon Jan 29 15:34:17 CET 2007
>  3437 hacluste  7360  4096  2700  1820  0.8 0:01.63 S crmd
>  3443 hacluste  5152  2272   928  1352  0.4 0:00.61 S tengine
>  3444 hacluste  5768  2760  1148  1552  0.5 0:02.31 S pengine
> ...
> Mon Jan 29 15:48:19 CET 2007
>  3437 hacluste 12376  9084  7716  1820  1.8 0:07.75 S crmd
>  3443 hacluste  6472  3604  2248  1352  0.7 0:02.76 S tengine
>  3444 hacluste  5768  2804  1148  1552  0.5 0:11.46 S pengine
> Mon Jan 29 15:48:46 CET 2007
>  3437 hacluste 12508  9240  7848  1820  1.8 0:07.92 S crmd
>  3443 hacluste  6472  3648  2248  1352  0.7 0:02.81 S tengine
>  3444 hacluste  5840  2808  1220  1552  0.5 0:11.73 S pengine
> ...
> Mon Jan 29 16:16:26 CET 2007
>  3437 hacluste 22276   18m   17m  1820  3.7 0:19.82 S crmd
>  3443 hacluste  9244  6324  5020  1352  1.2 0:07.04 S tengine
>  3444 hacluste  5912  2888  1292  1552  0.6 0:29.18 S pengine
>
>
> I used stable 2.0.8 sources with minor modifications from upstream
> (see attached patch).
>
> Palo
-------------- next part --------------
#!/bin/sh
OUR_NODENAME=sk16251c
PEER_NODENAME=linux-sles1
LOG_FILE_OUR="log-$OUR_NODENAME"
LOG_FILE_PEER="log-$PEER_NODENAME"
SSH_PEER_NODE='ssh root@10.0.0.5'
RESOURCES='x_processResource x_IPaddrL x_IPaddrR'
PING_NODE1_CHAIN='INPUT -s 10.0.0.8 -p icmp -j DROP'
PING_NODE2_CHAIN='INPUT -s 10.0.0.9 -p icmp -j DROP'
OUR_LINK1_CHAIN=' INPUT -d 10.0.0.30 -p udp --dport 694 -j DROP'
OUR_LINK2_CHAIN=' INPUT -d 10.0.1.30 -p udp --dport 694 -j DROP'
PEER_LINK1_CHAIN='INPUT -d 10.0.0.5 -p udp --dport 694 -j DROP'
PEER_LINK2_CHAIN='INPUT -d 10.0.1.5 -p udp --dport 694 -j DROP'
PROC_PATTERN='\<(ha_logd|heartbeat:|pingd|ccm|cib|[lc]rmd|stonithd|attrd|[tp]engine)\>'
# Save the top header line (column names) at the start of both log files
top -bn1 | egrep '\<PID\>' | egrep -v grep > "$LOG_FILE_OUR"
$SSH_PEER_NODE top -bn1 | egrep '\<PID\>' > "$LOG_FILE_PEER"
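# Clean up after any previous run: clear failcounts, standby flags and
# leftover iptables rules on both nodes
for i in $RESOURCES ; do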
crm_failcount -D -r$i -U"$OUR_NODENAME" 2>/dev/null
crm_failcount -D -r$i -U"$PEER_NODENAME" 2>/dev/null
done
crm_standby -D -U"$OUR_NODENAME" 2>/dev/null
crm_standby -D -U"$PEER_NODENAME" 2>/dev/null
iptables -D $PING_NODE1_CHAIN 2>/dev/null
iptables -D $PING_NODE2_CHAIN 2>/dev/null
$SSH_PEER_NODE iptables -D $PING_NODE1_CHAIN 2>/dev/null
$SSH_PEER_NODE iptables -D $PING_NODE2_CHAIN 2>/dev/null
iptables -D $OUR_LINK1_CHAIN 2>/dev/null
iptables -D $OUR_LINK2_CHAIN 2>/dev/null
$SSH_PEER_NODE iptables -D $PEER_LINK1_CHAIN 2>/dev/null
$SSH_PEER_NODE iptables -D $PEER_LINK2_CHAIN 2>/dev/null
echo -n "Press Enter to continue..."
read dummy
echo "Starting the test loop at $(date) on $(uname -n)"
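safe_disconnect() {
# Briefly block each heartbeat link in turn, then ICMP from ping node 2
# on both nodes; the logger calls mark each step in syslog ($1 tags the pass)
# safe disconnecting of our links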
logger PPPP$1 ; $SSH_PEER_NODE logger PPPP$1
iptables -A $OUR_LINK1_CHAIN
sleep 5
logger QQQQ$1 ; $SSH_PEER_NODE logger QQQQ$1
iptables -D $OUR_LINK1_CHAIN
sleep 5
logger RRRR$1 ; $SSH_PEER_NODE logger RRRR$1
iptables -A $OUR_LINK2_CHAIN
sleep 5
logger SSSS$1 ; $SSH_PEER_NODE logger SSSS$1
iptables -D $OUR_LINK2_CHAIN
sleep 5
# safe disconnecting of peer links
logger TTTT$1 ; $SSH_PEER_NODE logger TTTT$1
$SSH_PEER_NODE iptables -A $PEER_LINK1_CHAIN
sleep 5
logger UUUU$1 ; $SSH_PEER_NODE logger UUUU$1
$SSH_PEER_NODE iptables -D $PEER_LINK1_CHAIN
sleep 5
logger VVVV$1 ; $SSH_PEER_NODE logger VVVV$1
$SSH_PEER_NODE iptables -A $PEER_LINK2_CHAIN
sleep 5
logger WWWW$1 ; $SSH_PEER_NODE logger WWWW$1
$SSH_PEER_NODE iptables -D $PEER_LINK2_CHAIN
sleep 5
# safe removal of both connections to ping node
logger XXXX$1 ; $SSH_PEER_NODE logger XXXX$1
iptables -A $PING_NODE2_CHAIN
$SSH_PEER_NODE iptables -A $PING_NODE2_CHAIN
sleep 11
logger YYYY$1 ; $SSH_PEER_NODE logger YYYY$1
iptables -D $PING_NODE2_CHAIN
$SSH_PEER_NODE iptables -D $PING_NODE2_CHAIN
sleep 5
}
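# Main loop: snapshot memory usage on both nodes, then fail over
# OUR->PEER and back, running safe_disconnect after each failover
while : ; do
echo "$(date) - $(uname -n)" >> "$LOG_FILE_OUR"
top -bn1 | egrep "$PROC_PATTERN" | egrep -v grep | sort -n >> "$LOG_FILE_OUR"
# single quotes: $(date) and $(uname -n) are expanded by the peer's shell
$SSH_PEER_NODE echo '$(date) - $(uname -n)' >> "$LOG_FILE_PEER"
$SSH_PEER_NODE top -bn1 | egrep "$PROC_PATTERN" | sort -n >> "$LOG_FILE_PEER"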
for i in $RESOURCES ; do
crm_failcount -D -r$i -U"$OUR_NODENAME" 2>/dev/null &
crm_failcount -D -r$i -U"$PEER_NODENAME" 2>/dev/null &
done
sleep 5
# failover OUR->PEER
logger AAAA ; $SSH_PEER_NODE logger AAAA
iptables -A $PING_NODE1_CHAIN
sleep 11
logger BBBB ; $SSH_PEER_NODE logger BBBB
iptables -D $PING_NODE1_CHAIN
sleep 5
safe_disconnect 1
for i in $RESOURCES ; do
crm_failcount -D -r$i -U"$OUR_NODENAME" 2>/dev/null &
crm_failcount -D -r$i -U"$PEER_NODENAME" 2>/dev/null &
done
sleep 5
# failover PEER->OUR
logger CCCC ; $SSH_PEER_NODE logger CCCC
$SSH_PEER_NODE iptables -A $PING_NODE1_CHAIN
sleep 11
logger DDDD ; $SSH_PEER_NODE logger DDDD
$SSH_PEER_NODE iptables -D $PING_NODE1_CHAIN
sleep 5
safe_disconnect 2
echo -n .
sleep 5
done
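Because top's DATA column is rounded to whole MiB once it grows, a finer-grained alternative for the loop above may be to read VmData and VmRSS (in kB) directly from /proc/PID/status on Linux. A rough sketch, assuming a Linux /proc; log_vm is a hypothetical helper that matches processes by the short name in the "Name:" field (e.g. crmd, attrd), not by full command line:

```shell
# Hypothetical helper: for each process whose short name (the "Name:" field
# of /proc/PID/status, Linux only) matches one of the arguments, print its
# PID, name, VmData and VmRSS (both in kB).
log_vm() {
    for status in /proc/[0-9]*/status; do
        # a process may exit between the glob and the read; ignore errors
        awk -v want="$*" '
            BEGIN { n = split(want, names, " ") }
            /^Name:/   { name = $2 }
            /^Pid:/    { pid = $2 }
            /^VmData:/ { data = $2 }
            /^VmRSS:/  { rss = $2 }
            END {
                for (i = 1; i <= n; i++)
                    if (name == names[i])
                        printf "%s %s VmData=%s VmRSS=%s\n", pid, name, data, rss
            }' "$status" 2>/dev/null || :
    done
}

echo "$(date) - $(uname -n)"
log_vm attrd crmd tengine ccm
```

Inside the loop it could stand in for the local `top -bn1 | egrep "$PROC_PATTERN"` line, e.g. `log_vm attrd crmd tengine >> "$LOG_FILE_OUR"`; on the peer the function would also have to be defined in the remote shell.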
-------------- next part --------------
keepalive 200ms
deadtime 3
warntime 2000ms
initdead 10
udpport 694
ucast eth1 10.0.0.5
ucast eth2 10.0.1.5
auto_failback off
watchdog /dev/watchdog
node linux-sles1 sk16251c
ping 10.0.0.8 10.0.0.9
respawn root /usr/local/lib/heartbeat/pingd -m 10 -d 5s
realtime on
debug 1
msgfmt netstring
use_logd yes
compression zlib
traditional_compression false
coredumps true
crm yes