[Linux-HA] Re: memory leaks of crmd and tengine in 2.0.8

Pavol Gono palo.gono at gmail.com
Wed Feb 7 07:58:37 MST 2007


On 2/6/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> On 2/5/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > Hi Pavol,
> >
> > Sorry for the delay, I'm not ignoring you, I've just been busy elsewhere.
> >
> > If I'm reading your data correctly, attrd and crmd seem to be the
> > worst offenders with tengine a bit behind.  I'd not realized the
> > numbers were so extreme :-(
> >
> > if you look in lib/cl_plumbing/cl_malloc.c there are a number of
> > #defines that may help tracking this down.  i will start tackling this
> > tomorrow (starting with attrd given its low complexity).
>
> If you're inclined, you could rerun your tests to verify my attrd changes.
>
> Patches are (in order):
> * http://hg.linux-ha.org/dev/rev/30b947bd77e5
> * http://hg.linux-ha.org/dev/rev/8ff8ca1f9294
> * http://hg.linux-ha.org/dev/rev/5cc8305990e2
>

Hi

I compiled revision 10101, and it seems that many attrd leaks remain.
I am now using a more complex test scenario. The Dummy resource is
patched so that it does not support reloads.

I defined HA_MALLOC_TRACK, which seems to have a negative effect on BSC.
The details and logs are huge, so I put them at
http://fornax.elf.stuba.sk/~palino/hb_10101_leaks.tar.bz2

Memory usage of the processes after the first and the last iteration:

PID  VIRT  RES DATA  SHR %MEM  TIME+  S COMMAND
Wed Feb  7 12:51:56 CET 2007 - mach14s10
411  3084 1208  536  744  0.0 0:10.15 S ha_logd: read process
412  2816  908  268  628  0.0 0:05.53 S ha_logd: write process
443 11372  10m 8336 2920  0.3 0:00.49 S heartbeat: master control process
456  4300 4300 1264 2920  0.1 0:00.00 S heartbeat: FIFO reader
457  4428 4428 1392 2920  0.1 0:00.05 S heartbeat: write: ucast eth4
458  4428 4428 1392 2920  0.1 0:00.04 S heartbeat: read: ucast eth4
459  4428 4428 1392 2920  0.1 0:00.06 S heartbeat: write: ucast eth5
460  4428 4428 1392 2920  0.1 0:00.03 S heartbeat: read: ucast eth5
461  4428 4428 1392 2920  0.1 0:00.36 S heartbeat: write: ping 10.54.0.6
462  4428 4428 1392 2920  0.1 0:00.08 S heartbeat: read: ping 10.54.0.6
463  4428 4428 1392 2920  0.1 0:00.28 S heartbeat: write: ping 10.55.0.4
464  4428 4428 1392 2920  0.1 0:00.12 S heartbeat: read: ping 10.55.0.4
465  4428 4428 1392 2920  0.1 0:00.28 S heartbeat: write: ping 10.42.100.1
466  4428 4428 1392 2920  0.1 0:00.05 S heartbeat: read: ping 10.42.100.1
477  2976 1312  264 1096  0.0 0:00.01 S /usr/local/lib/heartbeat/pingd -m
478  3356 1396  704 1112  0.0 0:00.02 S /usr/local/lib/heartbeat/ccm
479  6440 3356 2180 1624  0.1 0:08.13 S /usr/local/lib/heartbeat/cib
480  3300 1608  528 1144  0.0 0:00.13 S /usr/local/lib/heartbeat/lrmd -r
481  3060 3060  392 2572  0.1 0:00.00 S /usr/local/lib/heartbeat/stonithd
482  4092 2388 1312 1168  0.1 0:00.15 S /usr/local/lib/heartbeat/attrd
483  6260 4104 2956 1684  0.1 0:01.61 S /usr/local/lib/heartbeat/crmd
550  3924 2312 1052 1220  0.1 0:00.50 S /usr/local/lib/heartbeat/tengine
551  4752 3096 1760 1360  0.1 0:01.14 S /usr/local/lib/heartbeat/pengine
...
Wed Feb  7 14:02:10 CET 2007 - mach14s10
411  3084 1268  536  744  0.0 1:49.60 S ha_logd: read process
412  2816  908  268  628  0.0 0:56.62 S ha_logd: write process
443 11372  10m 8336 2920  0.3 0:04.65 S heartbeat: master control process
456  4300 4300 1264 2920  0.1 0:00.00 S heartbeat: FIFO reader
457  4428 4428 1392 2920  0.1 0:00.68 S heartbeat: write: ucast eth4
458  4428 4428 1392 2920  0.1 0:00.57 S heartbeat: read: ucast eth4
459  4428 4428 1392 2920  0.1 0:00.56 S heartbeat: write: ucast eth5
460  4428 4428 1392 2920  0.1 0:00.50 S heartbeat: read: ucast eth5
461  4428 4428 1392 2920  0.1 0:04.19 S heartbeat: write: ping 10.54.0.6
462  4428 4428 1392 2920  0.1 0:03.26 S heartbeat: read: ping 10.54.0.6
463  4428 4428 1392 2920  0.1 0:03.12 S heartbeat: write: ping 10.55.0.4
464  4428 4428 1392 2920  0.1 0:02.60 S heartbeat: read: ping 10.55.0.4
465  4428 4428 1392 2920  0.1 0:02.56 S heartbeat: write: ping 10.42.100.1
466  4428 4428 1392 2920  0.1 0:03.05 S heartbeat: read: ping 10.42.100.1
477  2976 1312  264 1096  0.0 0:00.05 S /usr/local/lib/heartbeat/pingd -m
478  3488 1516  836 1112  0.0 0:00.06 S /usr/local/lib/heartbeat/ccm
479  6440 3396 2180 1640  0.1 1:29.81 S /usr/local/lib/heartbeat/cib
480  3300 1608  528 1144  0.0 0:01.25 S /usr/local/lib/heartbeat/lrmd -r
481  3060 3060  392 2572  0.1 0:00.04 S /usr/local/lib/heartbeat/stonithd
482 15972  13m  12m 1168  0.4 0:02.68 S /usr/local/lib/heartbeat/attrd
483 16952  14m  13m 1684  0.4 0:20.32 S /usr/local/lib/heartbeat/crmd
550  4848 3236 1976 1220  0.1 0:05.47 S /usr/local/lib/heartbeat/tengine
551  4824 3180 1832 1360  0.1 0:15.31 S /usr/local/lib/heartbeat/pengine

Summary of the 70-minute test (growth in VIRT):
attrd: +11 MB, increases after each test loop
crmd: +10 MB, increases after each test loop
tengine: +924 KB
pengine: +72 KB, increases irregularly
ccm: +132 KB, increases irregularly

Palo


> > On 2/4/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > Hi
> > >
> > > I started another type of testing - simulation of disconnecting cables
> > > with iptables. Failovers between nodes are triggered by blocking ICMP
> > > responses from ping nodes (see script.txt).
> > >
> > > There are two more leaking processes:
> > > attrd eats 396 KB per while loop
> > > ccm sometimes displays messages of the following type:
> > > ccm: [27757]: WARN: leaking memory? previous arena=3108864 present arena=3244032
> > > (very small memory increase)
> > >
> > > The configuration is similar to the previous post, only the Dummy
> > > resource is replaced by a custom one.
> > >
> > > For my tests it is annoying that heartbeat eats hundreds of megabytes
> > > after some hours/days. Can I help you to get the fixes made sooner?
> > > What are the best configure switches for memory leak detection
> > > (--enable-dmalloc/--enable-crm-dev/--enable-crm-dmalloc/--enable-crm-force-malloc)?
> > > Is it better to make up simple test cases (fewer resources, fewer
> > > operations) or one complex test case that contains all possible
> > > memory leaks?
> > > Should I use the latest dev sources or the latest stable sources?
> > > (I would currently like to have fixes against 2.0.8.)
> > >
> > > The output of script for node sk16251c:
> > >   PID  VIRT  RES DATA  SHR %MEM    TIME+  S COMMAND
> > > Fri Feb  2 18:26:13 CET 2007 - sk16251c
> > > 27708  2944 1056  396  744  0.2   0:00.88 S ha_logd: read process
> > > 27713  2812  864  264  620  0.2   0:00.87 S ha_logd: write process
> > > 27756  2976 1284  264 1084  0.3   0:00.01 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
> > > 27757  3356 1368  704 1104  0.3   0:00.01 S /usr/local/lib/heartbeat/ccm
> > > 27758  4452 2308 1356 1388  0.5   0:10.35 S /usr/local/lib/heartbeat/cib
> > > 27759  3168 1488  396 1136  0.3   0:00.25 S /usr/local/lib/heartbeat/lrmd -r
> > > 27760  3060 3060  392 2572  0.6   0:00.00 S /usr/local/lib/heartbeat/stonithd
> > > 27761  3968 2316 1188 1164  0.5   0:00.19 S /usr/local/lib/heartbeat/attrd
> > > 27762  5500 3416 2192 1680  0.7   0:01.50 S /usr/local/lib/heartbeat/crmd
> > > 27769  3660 1896  788 1196  0.4   0:00.46 S /usr/local/lib/heartbeat/tengine
> > > 27770  4404 2596 1132 1416  0.5   0:02.39 S /usr/local/lib/heartbeat/pengine
> > > ...
> > > Sat Feb  3 01:54:47 CET 2007 - sk16251c
> > > 27708  2944 1076  396  744  0.2   0:52.57 S ha_logd: read process
> > > 27713  2812  876  264  620  0.2   0:43.66 S ha_logd: write process
> > > 27756  2976 1284  264 1084  0.3   0:00.16 S /usr/local/lib/heartbeat/pingd -m 10 -d 5s
> > > 27757  4016 2056 1364 1104  0.4   0:00.26 S /usr/local/lib/heartbeat/ccm
> > > 27758  4452 2352 1356 1404  0.5   9:34.45 S /usr/local/lib/heartbeat/cib
> > > 27759  3168 1500  396 1140  0.3   0:08.75 S /usr/local/lib/heartbeat/lrmd -r
> > > 27760  3060 3060  392 2572  0.6   0:00.29 S /usr/local/lib/heartbeat/stonithd
> > > 27761 69440  66m  65m 1164 13.4   0:17.46 S /usr/local/lib/heartbeat/attrd
> > > 27762 34540  31m  30m 1680  6.4   1:41.60 S /usr/local/lib/heartbeat/crmd
> > > 27769  3660 1900  788 1200  0.4   0:18.97 S /usr/local/lib/heartbeat/tengine
> > > 27770  4980 3148 1708 1416  0.6   2:31.45 S /usr/local/lib/heartbeat/pengine
> > >
> > >
> > > Palo
> > >
> > >
> > > On 1/29/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > > Hi
> > > >
> > > > I found memory leaks in the described processes when doing the
> > > > following failovers:
> > > > deboserver -> pgbook: with crm_standby
> > > > pgbook -> deboserver: failing monitor operation of resource Dummy
> > > > Frequency is 2 failovers per minute. Script and configuration attached.
> > > > Memory leaks of crmd are the most pronounced: 132 KB per failover.
> > > > pengine displays the "Potential memory leak detected" messages, which
> > > > should already be fixed upstream.
> > > >
> > > > Output:
> > > >   PID USER      VIRT  RES DATA  SHR %MEM    TIME+  S COMMAND
> > > > Mon Jan 29 15:30:47 CET 2007
> > > >  3437 hacluste  6152 2844 1492 1816  0.6   0:00.18 S crmd
> > > >  3443 hacluste  5020 2084  796 1340  0.4   0:00.08 S tengine
> > > >  3444 hacluste  5560 2564  940 1548  0.5   0:00.10 S pengine
> > > > Mon Jan 29 15:31:13 CET 2007
> > > >  3437 hacluste  6304 2980 1644 1820  0.6   0:00.36 S crmd
> > > >  3443 hacluste  5020 2104  796 1352  0.4   0:00.15 S tengine
> > > >  3444 hacluste  5768 2724 1148 1552  0.5   0:00.35 S pengine
> > > > ...
> > > > Mon Jan 29 15:34:17 CET 2007
> > > >  3437 hacluste  7360 4096 2700 1820  0.8   0:01.63 S crmd
> > > >  3443 hacluste  5152 2272  928 1352  0.4   0:00.61 S tengine
> > > >  3444 hacluste  5768 2760 1148 1552  0.5   0:02.31 S pengine
> > > > ...
> > > > Mon Jan 29 15:48:19 CET 2007
> > > >  3437 hacluste 12376 9084 7716 1820  1.8   0:07.75 S crmd
> > > >  3443 hacluste  6472 3604 2248 1352  0.7   0:02.76 S tengine
> > > >  3444 hacluste  5768 2804 1148 1552  0.5   0:11.46 S pengine
> > > > Mon Jan 29 15:48:46 CET 2007
> > > >  3437 hacluste 12508 9240 7848 1820  1.8   0:07.92 S crmd
> > > >  3443 hacluste  6472 3648 2248 1352  0.7   0:02.81 S tengine
> > > >  3444 hacluste  5840 2808 1220 1552  0.5   0:11.73 S pengine
> > > > ...
> > > > Mon Jan 29 16:16:26 CET 2007
> > > >  3437 hacluste 22276  18m  17m 1820  3.7   0:19.82 S crmd
> > > >  3443 hacluste  9244 6324 5020 1352  1.2   0:07.04 S tengine
> > > >  3444 hacluste  5912 2888 1292 1552  0.6   0:29.18 S pengine
> > > >
> > > >
> > > > I used stable 2.0.8 sources with minor modifications from upstream
> > > > (see attached patch).
> > > >
> > > > Palo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch_hb_10101m.diff
Type: text/x-diff
Size: 2972 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20070207/4e3451fd/patch_hb_10101m-0001.bin
-------------- next part --------------
#!/bin/sh

# Configurable stuff

# local node
NODE1='mach14s10'
# peer node
NODE2='mach13s10'
NODE1_IF1_IP='10.54.0.14'
NODE1_IF2_IP='10.55.0.14'
NODE2_IF1_IP='10.54.0.13'
NODE2_IF2_IP='10.55.0.13'
PING_NODES='10.54.0.6 10.55.0.4 10.42.100.1'

LOG_FILE_LOCAL="log-$NODE1"
LOG_FILE_PEER="log-$NODE2"
# how to connect to peer node (without password)
SSH_CMD="ssh root@$NODE2_IF1_IP"
# the first resource must be x_Dummy
RESOURCES='x_Dummy x_IPaddrL x_IPaddrR'
# pattern for egrep, to parse all heartbeat's processes from top
PROC_PATTERN='\<(ha_logd|pingd|ccm|cib|[lc]rmd|stonithd|attrd|[tp]engine)\>| heartbeat: '


PING_NODE1_CHAIN="INPUT -s $(echo $PING_NODES | awk '{print $1}') -p icmp -j DROP"
PING_NODE2_CHAIN="INPUT -s $(echo $PING_NODES | awk '{print $2}') -p icmp -j DROP"
LOCAL_LINK1_CHAIN="INPUT -d $NODE1_IF1_IP -p udp --dport 694 -j DROP"
LOCAL_LINK2_CHAIN="INPUT -d $NODE1_IF2_IP -p udp --dport 694 -j DROP"
PEER_LINK1_CHAIN="INPUT -d $NODE2_IF1_IP -p udp --dport 694 -j DROP"
PEER_LINK2_CHAIN="INPUT -d $NODE2_IF2_IP -p udp --dport 694 -j DROP"

FIRST_RESOURCE="$(echo $RESOURCES | awk '{print $1}')"
LOG_COUNTER=0

my_log() {
   LOG_COUNTER=$(($LOG_COUNTER+1))
   logger "$1 $LOG_COUNTER"
   $SSH_CMD logger "$1 $LOG_COUNTER"
}

echo "Rule for blocking ping node 1: $PING_NODE1_CHAIN"
echo "Rule for blocking ping node 2: $PING_NODE2_CHAIN"
echo "Blocking incoming heartbeats node 1 link 1: $LOCAL_LINK1_CHAIN"
echo "Blocking incoming heartbeats node 1 link 2: $LOCAL_LINK2_CHAIN"
echo "Blocking incoming heartbeats node 2 link 1: $PEER_LINK1_CHAIN"
echo "Blocking incoming heartbeats node 2 link 2: $PEER_LINK2_CHAIN"
iptables -D $PING_NODE1_CHAIN 2>/dev/null
iptables -D $PING_NODE2_CHAIN 2>/dev/null
$SSH_CMD iptables -D $PING_NODE1_CHAIN 2>/dev/null
$SSH_CMD iptables -D $PING_NODE2_CHAIN 2>/dev/null
iptables -D $LOCAL_LINK1_CHAIN 2>/dev/null
iptables -D $LOCAL_LINK2_CHAIN 2>/dev/null
$SSH_CMD iptables -D $PEER_LINK1_CHAIN 2>/dev/null
$SSH_CMD iptables -D $PEER_LINK2_CHAIN 2>/dev/null
top -bn1 | egrep '\<PID\>' | egrep -v grep > "$LOG_FILE_LOCAL"
$SSH_CMD top -bn1 | egrep '\<PID\>' > "$LOG_FILE_PEER"
for i in $RESOURCES ; do
   crm_failcount -D -r$i -U"$NODE1" 2>/dev/null
   crm_failcount -D -r$i -U"$NODE2" 2>/dev/null
done
crm_standby -D -U"$NODE1" 2>/dev/null
crm_standby -D -U"$NODE2" 2>/dev/null
# printf and a named read variable keep this portable to strict POSIX shells
printf 'Press Enter to continue...'
read _unused
echo "Starting the test loop at $(date) on $NODE1 and $NODE2"

safe_disconnect() {
   # safe disconnecting of local links (one direction)
   my_log LLLL$1
   iptables -A $LOCAL_LINK1_CHAIN
   sleep 5
   my_log MMMM$1
   iptables -D $LOCAL_LINK1_CHAIN
   sleep 5
   my_log NNNN$1
   iptables -A $LOCAL_LINK2_CHAIN
   sleep 5
   my_log OOOO$1
   iptables -D $LOCAL_LINK2_CHAIN
   sleep 5
   # safe disconnecting of peer links (one direction)
   my_log PPPP$1
   $SSH_CMD iptables -A $PEER_LINK1_CHAIN
   sleep 5
   my_log QQQQ$1
   $SSH_CMD iptables -D $PEER_LINK1_CHAIN
   sleep 5
   my_log RRRR$1
   $SSH_CMD iptables -A $PEER_LINK2_CHAIN
   sleep 5
   my_log SSSS$1
   $SSH_CMD iptables -D $PEER_LINK2_CHAIN
   sleep 5
   # safe disconnecting of links (both direction)
   my_log TTTT$1
   iptables -A $LOCAL_LINK1_CHAIN
   $SSH_CMD iptables -A $PEER_LINK1_CHAIN
   sleep 5
   my_log UUUU$1
   iptables -D $LOCAL_LINK1_CHAIN
   $SSH_CMD iptables -D $PEER_LINK1_CHAIN
   sleep 5
   my_log VVVV$1
   iptables -A $LOCAL_LINK2_CHAIN
   $SSH_CMD iptables -A $PEER_LINK2_CHAIN
   sleep 5
   my_log WWWW$1
   iptables -D $LOCAL_LINK2_CHAIN
   $SSH_CMD iptables -D $PEER_LINK2_CHAIN
   sleep 5
   # safe removal of both connections to ping node
   my_log XXXX$1
   iptables -A $PING_NODE2_CHAIN
   $SSH_CMD iptables -A $PING_NODE2_CHAIN
   sleep 15
   my_log YYYY$1
   iptables -D $PING_NODE2_CHAIN
   $SSH_CMD iptables -D $PING_NODE2_CHAIN
   sleep 5
}

while : ; do
   echo "$(date) - $NODE1" >> "$LOG_FILE_LOCAL"
   top -bn1 | egrep "$PROC_PATTERN" | egrep -v grep | sort -n >> "$LOG_FILE_LOCAL"
   $SSH_CMD echo '$(date) -' "$NODE2" >> "$LOG_FILE_PEER"
   $SSH_CMD top -bn1 | egrep "$PROC_PATTERN" | sort -n >> "$LOG_FILE_PEER"

   # failover LOCAL->PEER
   my_log BBBB
   crm_standby -von -U"$NODE1"
   sleep 20
   my_log CCCC
   for i in $RESOURCES ; do
      crm_resource -C -r$i -H"$NODE1" &
   done
   my_log DDDD
   for i in $RESOURCES ; do
      crm_failcount -D -r$i -U"$NODE1" 2>/dev/null &
   done
   sleep 10
   my_log EEEE
   crm_standby -D -U"$NODE1"
   echo -n 1
   sleep 5

   # failover PEER->LOCAL
   my_log FFFF
   $SSH_CMD rm /tmp/a/a
   sleep 20
   for i in $RESOURCES ; do
      crm_failcount -D -r$i -U"$NODE1" 2>/dev/null &
      crm_failcount -D -r$i -U"$NODE2" 2>/dev/null &
   done
   echo -n 2
   sleep 5

   # failover LOCAL->PEER
   my_log GGGG
   iptables -A $PING_NODE1_CHAIN
   sleep 20
   my_log HHHH
   iptables -D $PING_NODE1_CHAIN
   sleep 10
   safe_disconnect 1
   for i in $RESOURCES ; do
      crm_failcount -D -r$i -U"$NODE1" 2>/dev/null &
      crm_failcount -D -r$i -U"$NODE2" 2>/dev/null &
   done
   echo -n 3
   sleep 5

   # failover PEER->LOCAL
   my_log IIII
   $SSH_CMD iptables -A $PING_NODE1_CHAIN
   sleep 20
   my_log JJJJ
   $SSH_CMD iptables -D $PING_NODE1_CHAIN
   sleep 10
   safe_disconnect 2
   echo -n 4
   sleep 5
done
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cib.start.xml
Type: text/xml
Size: 11440 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20070207/4e3451fd/cib.start-0001.bin

