[Linux-HA] Re: memory leaks of crmd and tengine in 2.0.8

Pavol Gono palo.gono at gmail.com
Sat Feb 3 21:22:47 MST 2007


Hi

I started another type of test: simulating disconnected cables with
iptables. Failovers between the nodes are triggered by blocking ICMP
responses from the ping nodes (see script.txt).
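
The cable-pull simulation comes down to appending a DROP rule, waiting
longer than the detection time, and deleting the rule again. A minimal
sketch of that pattern follows. The function name block_for and the
IPTABLES override (for a dry run without root) are illustrative names,
not part of the attached script.

```shell
#!/bin/sh
# Hypothetical helper: append an iptables rule, wait, then delete it again.
# IPTABLES may be overridden (e.g. IPTABLES=echo) for a dry run without root.
IPTABLES=${IPTABLES:-iptables}

block_for() {
   # $1 = seconds to keep the rule active, remaining args = rule specification
   secs=$1
   shift
   $IPTABLES -A "$@" || return 1
   sleep "$secs"
   $IPTABLES -D "$@"
}

# Example (needs root): drop ICMP from ping node 10.0.0.8 for 11 seconds
# block_for 11 INPUT -s 10.0.0.8 -p icmp -j DROP
```

The attached script does the same with separate -A and -D calls and a
sleep of 11 seconds in between.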

Two more processes are leaking:
attrd grows by about 396 KB per iteration of the while loop
ccm occasionally logs messages of the following type
ccm: [27757]: WARN: leaking memory? previous arena=3108864 present arena=3244032
(a very small memory increase)
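
A per-iteration figure like the one for attrd can be derived from two
top samples and the number of loop iterations between them. The sketch
below is one way to do that arithmetic. The helper names to_kb and
growth_per_iter are hypothetical, and it assumes top prints whole
numbers, either plain KB or with an m/g suffix as in the output in this
mail.

```shell
#!/bin/sh
# Convert a memory field as printed by top to KB.
# Assumes integer values: plain KB, or a trailing m (MB) or g (GB).
to_kb() {
   case $1 in
      *m) echo $(( ${1%m} * 1024 )) ;;
      *g) echo $(( ${1%g} * 1024 * 1024 )) ;;
      *)  echo "$1" ;;
   esac
}

# Average growth in KB per iteration between two samples (integer division).
# $1 = first value, $2 = last value, $3 = iterations between the samples
growth_per_iter() {
   first=$(to_kb "$1")
   last=$(to_kb "$2")
   echo $(( (last - first) / $3 ))
}

# Usage: growth_per_iter FIRST_DATA LAST_DATA ITERATIONS
```

The iteration count has to come from the test run itself, for example
from the number of dots the loop has printed.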

The configuration is similar to the one in my previous post, except that
the Dummy resource is replaced by a custom one.

For my tests it is annoying that heartbeat uses hundreds of megabytes
after some hours or days. Can I help you get fixes out sooner?
Which configure switches are best for memory leak detection
(--enable-dmalloc/--enable-crm-dev/--enable-crm-dmalloc/--enable-crm-force-malloc)?
Is it better to build simple test cases (fewer resources, fewer
operations) or one complex test case that exercises all possible
memory leaks?
Should I use the latest dev sources or the latest stable sources?
(For now I would like to have fixes against 2.0.8.)

The output of the script for node sk16251c:
  PID  VIRT  RES DATA  SHR %MEM    TIME+  S COMMAND
Fri Feb  2 18:26:13 CET 2007 - sk16251c
27708  2944 1056  396  744  0.2   0:00.88 S ha_logd: read process
27713  2812  864  264  620  0.2   0:00.87 S ha_logd: write process
27756  2976 1284  264 1084  0.3   0:00.01 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
27757  3356 1368  704 1104  0.3   0:00.01 S /usr/local/lib/heartbeat/ccm
27758  4452 2308 1356 1388  0.5   0:10.35 S /usr/local/lib/heartbeat/cib
27759  3168 1488  396 1136  0.3   0:00.25 S /usr/local/lib/heartbeat/lrmd -r
27760  3060 3060  392 2572  0.6   0:00.00 S /usr/local/lib/heartbeat/stonithd
27761  3968 2316 1188 1164  0.5   0:00.19 S /usr/local/lib/heartbeat/attrd
27762  5500 3416 2192 1680  0.7   0:01.50 S /usr/local/lib/heartbeat/crmd
27769  3660 1896  788 1196  0.4   0:00.46 S /usr/local/lib/heartbeat/tengine
27770  4404 2596 1132 1416  0.5   0:02.39 S /usr/local/lib/heartbeat/pengine
...
Sat Feb  3 01:54:47 CET 2007 - sk16251c
27708  2944 1076  396  744  0.2   0:52.57 S ha_logd: read process
27713  2812  876  264  620  0.2   0:43.66 S ha_logd: write process
27756  2976 1284  264 1084  0.3   0:00.16 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
27757  4016 2056 1364 1104  0.4   0:00.26 S /usr/local/lib/heartbeat/ccm
27758  4452 2352 1356 1404  0.5   9:34.45 S /usr/local/lib/heartbeat/cib
27759  3168 1500  396 1140  0.3   0:08.75 S /usr/local/lib/heartbeat/lrmd -r
27760  3060 3060  392 2572  0.6   0:00.29 S /usr/local/lib/heartbeat/stonithd
27761 69440  66m  65m 1164 13.4   0:17.46 S /usr/local/lib/heartbeat/attrd
27762 34540  31m  30m 1680  6.4   1:41.60 S /usr/local/lib/heartbeat/crmd
27769  3660 1900  788 1200  0.4   0:18.97 S /usr/local/lib/heartbeat/tengine
27770  4980 3148 1708 1416  0.6   2:31.45 S /usr/local/lib/heartbeat/pengine
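
To follow one process through a long log, the DATA column can be pulled
out with awk. This is only a sketch: the function name data_series is
hypothetical, and it assumes the column order shown above (DATA in
field 4, the command path in field 9) with each process on a single
unwrapped line in the log file.

```shell
#!/bin/sh
# Print the DATA column of every sample of one process from a top log.
# $1 = command name without path (e.g. attrd), $2 = log file
# Assumes the layout: PID VIRT RES DATA SHR %MEM TIME+ S COMMAND
data_series() {
   awk -v cmd="$1" '
      # process lines start with a numeric PID; date lines do not
      $1 ~ /^[0-9]+$/ && NF >= 9 {
         n = split($9, parts, "/")       # strip the directory part
         if (parts[n] == cmd) print $4   # field 4 is DATA
      }' "$2"
}

# Usage: data_series attrd log-sk16251c
```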


Palo


On 1/29/07, Pavol Gono <palo.gono at gmail.com> wrote:
> Hi
>
> I found memory leaks in the processes named in the subject when
> performing the following failovers:
> deboserver -> pgbook: with crm_standby
> pgbook -> deboserver: failing monitor operation of resource Dummy
> Frequency is 2 failovers per minute. Script and configuration attached.
> The crmd leak is the most pronounced: 132 KB per failover.
> pengine logs "Potential memory leak detected" messages; that leak
> should already be fixed upstream.
>
> Output:
>   PID USER      VIRT  RES DATA  SHR %MEM    TIME+  S COMMAND
> Mon Jan 29 15:30:47 CET 2007
>  3437 hacluste  6152 2844 1492 1816  0.6   0:00.18 S crmd
>  3443 hacluste  5020 2084  796 1340  0.4   0:00.08 S tengine
>  3444 hacluste  5560 2564  940 1548  0.5   0:00.10 S pengine
> Mon Jan 29 15:31:13 CET 2007
>  3437 hacluste  6304 2980 1644 1820  0.6   0:00.36 S crmd
>  3443 hacluste  5020 2104  796 1352  0.4   0:00.15 S tengine
>  3444 hacluste  5768 2724 1148 1552  0.5   0:00.35 S pengine
> ...
> Mon Jan 29 15:34:17 CET 2007
>  3437 hacluste  7360 4096 2700 1820  0.8   0:01.63 S crmd
>  3443 hacluste  5152 2272  928 1352  0.4   0:00.61 S tengine
>  3444 hacluste  5768 2760 1148 1552  0.5   0:02.31 S pengine
> ...
> Mon Jan 29 15:48:19 CET 2007
>  3437 hacluste 12376 9084 7716 1820  1.8   0:07.75 S crmd
>  3443 hacluste  6472 3604 2248 1352  0.7   0:02.76 S tengine
>  3444 hacluste  5768 2804 1148 1552  0.5   0:11.46 S pengine
> Mon Jan 29 15:48:46 CET 2007
>  3437 hacluste 12508 9240 7848 1820  1.8   0:07.92 S crmd
>  3443 hacluste  6472 3648 2248 1352  0.7   0:02.81 S tengine
>  3444 hacluste  5840 2808 1220 1552  0.5   0:11.73 S pengine
> ...
> Mon Jan 29 16:16:26 CET 2007
>  3437 hacluste 22276  18m  17m 1820  3.7   0:19.82 S crmd
>  3443 hacluste  9244 6324 5020 1352  1.2   0:07.04 S tengine
>  3444 hacluste  5912 2888 1292 1552  0.6   0:29.18 S pengine
>
>
> I used stable 2.0.8 sources with minor modifications from upstream
> (see attached patch).
>
> Palo
-------------- next part --------------
#!/bin/sh
OUR_NODENAME=sk16251c
PEER_NODENAME=linux-sles1
LOG_FILE_OUR="log-$OUR_NODENAME"
LOG_FILE_PEER="log-$PEER_NODENAME"
SSH_PEER_NODE='ssh root@10.0.0.5'
RESOURCES='x_processResource x_IPaddrL x_IPaddrR'
PING_NODE1_CHAIN='INPUT -s 10.0.0.8  -p icmp -j DROP'
PING_NODE2_CHAIN='INPUT -s 10.0.0.9  -p icmp -j DROP'
OUR_LINK1_CHAIN=' INPUT -d 10.0.0.30 -p udp --dport 694 -j DROP'
OUR_LINK2_CHAIN=' INPUT -d 10.0.1.30 -p udp --dport 694 -j DROP'
PEER_LINK1_CHAIN='INPUT -d 10.0.0.5  -p udp --dport 694 -j DROP'
PEER_LINK2_CHAIN='INPUT -d 10.0.1.5  -p udp --dport 694 -j DROP'
PROC_PATTERN='\<(ha_logd|heartbeat:|pingd|ccm|cib|[lc]rmd|stonithd|attrd|[tp]engine)\>'

top -bn1 | egrep '\<PID\>' | egrep -v grep > "$LOG_FILE_OUR"
$SSH_PEER_NODE top -bn1 | egrep '\<PID\>' > "$LOG_FILE_PEER"
for i in $RESOURCES ; do
   crm_failcount -D -r$i -U"$OUR_NODENAME" 2>/dev/null
   crm_failcount -D -r$i -U"$PEER_NODENAME" 2>/dev/null
done
crm_standby -D -U"$OUR_NODENAME" 2>/dev/null
crm_standby -D -U"$PEER_NODENAME" 2>/dev/null
iptables -D $PING_NODE1_CHAIN 2>/dev/null
iptables -D $PING_NODE2_CHAIN 2>/dev/null
$SSH_PEER_NODE iptables -D $PING_NODE1_CHAIN 2>/dev/null
$SSH_PEER_NODE iptables -D $PING_NODE2_CHAIN 2>/dev/null
iptables -D $OUR_LINK1_CHAIN 2>/dev/null
iptables -D $OUR_LINK2_CHAIN 2>/dev/null
$SSH_PEER_NODE iptables -D $PEER_LINK1_CHAIN 2>/dev/null
$SSH_PEER_NODE iptables -D $PEER_LINK2_CHAIN 2>/dev/null
printf 'Press Enter to continue...'
# POSIX read needs a variable name; a bare "read" fails in some /bin/sh shells
read dummy
echo "Starting the test loop at $(date) on $(uname -n)"

safe_disconnect() {
   # safe disconnecting of our links
   logger PPPP$1 ; $SSH_PEER_NODE logger PPPP$1
   iptables -A $OUR_LINK1_CHAIN
   sleep 5
   logger QQQQ$1 ; $SSH_PEER_NODE logger QQQQ$1
   iptables -D $OUR_LINK1_CHAIN
   sleep 5
   logger RRRR$1 ; $SSH_PEER_NODE logger RRRR$1
   iptables -A $OUR_LINK2_CHAIN
   sleep 5
   logger SSSS$1 ; $SSH_PEER_NODE logger SSSS$1
   iptables -D $OUR_LINK2_CHAIN
   sleep 5
   # safe disconnecting of peer links
   logger TTTT$1 ; $SSH_PEER_NODE logger TTTT$1
   $SSH_PEER_NODE iptables -A $PEER_LINK1_CHAIN
   sleep 5
   logger UUUU$1 ; $SSH_PEER_NODE logger UUUU$1
   $SSH_PEER_NODE iptables -D $PEER_LINK1_CHAIN
   sleep 5
   logger VVVV$1 ; $SSH_PEER_NODE logger VVVV$1
   $SSH_PEER_NODE iptables -A $PEER_LINK2_CHAIN
   sleep 5
   logger WWWW$1 ; $SSH_PEER_NODE logger WWWW$1
   $SSH_PEER_NODE iptables -D $PEER_LINK2_CHAIN
   sleep 5
   # safe removal of both connections to ping node
   logger XXXX$1 ; $SSH_PEER_NODE logger XXXX$1
   iptables -A $PING_NODE2_CHAIN
   $SSH_PEER_NODE iptables -A $PING_NODE2_CHAIN
   sleep 11
   logger YYYY$1 ; $SSH_PEER_NODE logger YYYY$1
   iptables -D $PING_NODE2_CHAIN
   $SSH_PEER_NODE iptables -D $PING_NODE2_CHAIN
   sleep 5
}

while : ; do
   echo "$(date) - $(uname -n)" >> "$LOG_FILE_OUR"
   top -bn1 | egrep "$PROC_PATTERN" | egrep -v grep | sort -n >> "$LOG_FILE_OUR"
   # single quotes: $(date) and $(uname -n) are expanded by the peer's shell
   $SSH_PEER_NODE echo '$(date) - $(uname -n)' >> "$LOG_FILE_PEER"
   $SSH_PEER_NODE top -bn1 | egrep "$PROC_PATTERN" | sort -n >> "$LOG_FILE_PEER"

   for i in $RESOURCES ; do
      crm_failcount -D -r$i -U"$OUR_NODENAME" 2>/dev/null &
      crm_failcount -D -r$i -U"$PEER_NODENAME" 2>/dev/null &
   done
   sleep 5
   # failover OUR->PEER
   logger AAAA ; $SSH_PEER_NODE logger AAAA
   iptables -A $PING_NODE1_CHAIN
   sleep 11
   logger BBBB ; $SSH_PEER_NODE logger BBBB
   iptables -D $PING_NODE1_CHAIN
   sleep 5
   safe_disconnect 1

   for i in $RESOURCES ; do
      crm_failcount -D -r$i -U"$OUR_NODENAME" 2>/dev/null &
      crm_failcount -D -r$i -U"$PEER_NODENAME" 2>/dev/null &
   done
   sleep 5
   # failover PEER->OUR
   logger CCCC ; $SSH_PEER_NODE logger CCCC
   $SSH_PEER_NODE iptables -A $PING_NODE1_CHAIN
   sleep 11
   logger DDDD ; $SSH_PEER_NODE logger DDDD
   $SSH_PEER_NODE iptables -D $PING_NODE1_CHAIN
   sleep 5
   safe_disconnect 2

   echo -n .
   sleep 5
done
-------------- next part --------------
keepalive 200ms
deadtime 3
warntime 2000ms
initdead 10
udpport  694
ucast eth1 10.0.0.5
ucast eth2 10.0.1.5
auto_failback off
watchdog /dev/watchdog
node  linux-sles1 sk16251c
ping 10.0.0.8 10.0.0.9
respawn root /usr/local/lib/heartbeat/pingd -m 10 -d 5s
realtime on
debug 1
msgfmt netstring
use_logd yes
compression zlib
traditional_compression false
coredumps true
crm yes

