[Linux-HA] Weird HA-behavior with XEN/HA/DRBD/LVM

sussox sussox at gmail.com
Fri Jul 3 01:30:07 MDT 2009


I have an HA setup with Xen, Heartbeat, LVM and DRBD, built according to this howto:
http://www.asplund.nu/xencluster/xen-cluster-howto.html

I'm using two PowerEdge R710s (ha1 & ha2) with Ubuntu 8.04.

I can do a manual live migration of the domUs without any problems. Also, if
I shut down heartbeat on ha1 via its init.d script, the domUs are migrated to
ha2 successfully.

Also, if I pull the plug on ha1, after a while the domUs start on ha2 (as
they should).

However, when I do a "reboot" on ha1, the domUs begin to migrate but then
crash on ha2. I'm pasting ha-debug and xend.log below.

Any ideas why it keeps doing this? All I can think of is that some process is
being killed too fast while the domUs are being migrated, but I don't know
what to look for. Also, I ran a test a couple of weeks ago with the same
setup, but on one R710 and an older Shuttle box, and there was no problem
then. I've gone through the howto from scratch twice, with the same result.
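
To test the "killed too fast" theory I'm thinking of logging process state on
ha1 right before I issue the reboot, something like this (just a rough sketch;
the output file is an arbitrary path I picked):

( while true; do
    date '+%H:%M:%S'
    ps axlw | grep '[x]m migrate'    # is the migration still running?
    ps axlw | grep '[h]eartbeat'     # is heartbeat still up?
    sleep 1
  done ) >/root/migration-watch.log 2>&1 &

That should at least show whether "xm migrate" or heartbeat disappears while
the transfer is still going.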

Cheers! /Sussox

ha-debug:

Code:

heartbeat[28295]: 2009/06/30_13:46:12 info: Received shutdown notice from
'ha1.vbm.se'.
heartbeat[28295]: 2009/06/30_13:46:12 info: Resources being acquired from
ha1.vbm.se.
heartbeat[28295]: 2009/06/30_13:46:12 debug: StartNextRemoteRscReq(): child
count 1
heartbeat[29445]: 2009/06/30_13:46:12 info: acquire local HA resources
(standby).
ResourceManager[29472]: 2009/06/30_13:46:12 info: Acquiring resource group:
ha2.vbm.se xendomainsHA2
heartbeat[29446]: 2009/06/30_13:46:13 info: Local Resource acquisition
completed.
heartbeat[28295]: 2009/06/30_13:46:13 debug: StartNextRemoteRscReq(): child
count 2
heartbeat[28295]: 2009/06/30_13:46:13 debug: StartNextRemoteRscReq(): child
count 1
ResourceManager[29472]: 2009/06/30_13:46:13 info: Running
/etc/ha.d/resource.d/xendomainsHA2  start
ResourceManager[29472]: 2009/06/30_13:46:13 debug: Starting
/etc/ha.d/resource.d/xendomainsHA2  start
ResourceManager[29472]: 2009/06/30_13:46:13 debug:
/etc/ha.d/resource.d/xendomainsHA2  start done. RC=0
heartbeat[29445]: 2009/06/30_13:46:13 info: local HA resource acquisition
completed (standby).
heartbeat[28295]: 2009/06/30_13:46:13 info: Standby resource acquisition
done [foreign].
heartbeat[29559]: 2009/06/30_13:46:13 debug: notify_world: setting SIGCHLD
Handler to SIG_DFL
harc[29559]: 2009/06/30_13:46:13 info: Running /etc/ha.d/rc.d/status status
mach_down[29573]: 2009/06/30_13:46:13 info: Taking over resource group
xendomainsHA1
ResourceManager[29597]: 2009/06/30_13:46:13 info: Acquiring resource group:
ha1.vbm.se xendomainsHA1
ResourceManager[29597]: 2009/06/30_13:46:13 info: Running
/etc/ha.d/resource.d/xendomainsHA1  start
ResourceManager[29597]: 2009/06/30_13:46:13 debug: Starting
/etc/ha.d/resource.d/xendomainsHA1  start
Starting auto Xen domains: hejsan(skip) *   [done]
ResourceManager[29597]: 2009/06/30_13:46:13 debug:
/etc/ha.d/resource.d/xendomainsHA1  start done. RC=0
mach_down[29573]: 2009/06/30_13:46:13 info: /usr/share/heartbeat/mach_down:
nice_failback: foreign resources acquired
mach_down[29573]: 2009/06/30_13:46:13 info: mach_down takeover complete for
node ha1.vbm.se.
heartbeat[28295]: 2009/06/30_13:46:13 info: mach_down takeover complete.
heartbeat[29696]: 2009/06/30_13:46:13 debug: notify_world: setting SIGCHLD
Handler to SIG_DFL
harc[29696]: 2009/06/30_13:46:13 info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[29696]: 2009/06/30_13:46:13 received ip-request-resp
xendomainsHA2 OK yes
ResourceManager[29715]: 2009/06/30_13:46:13 info: Acquiring resource group:
ha2.vbm.se xendomainsHA2
ResourceManager[29715]: 2009/06/30_13:46:13 info: Running
/etc/ha.d/resource.d/xendomainsHA2  start
ResourceManager[29715]: 2009/06/30_13:46:13 debug: Starting
/etc/ha.d/resource.d/xendomainsHA2  start
ResourceManager[29715]: 2009/06/30_13:46:13 debug:
/etc/ha.d/resource.d/xendomainsHA2  start done. RC=0
heartbeat[28295]: 2009/06/30_13:46:24 WARN: node ha1.vbm.se: is dead
heartbeat[28295]: 2009/06/30_13:46:24 info: Dead node ha1.vbm.se gave up
resources.
heartbeat[28295]: 2009/06/30_13:46:24 info: Link ha1.vbm.se:eth0 dead.

xend.log:

Code:

[2009-06-30 13:46:11 5499] DEBUG (XendCheckpoint:210) restore:shadow=0x0,
_static_max=0x18000000, _static_min=0x0,
[2009-06-30 13:46:11 5499] DEBUG (balloon:151) Balloon: 398436 KiB free;
need 393216; done.
[2009-06-30 13:46:11 5499] DEBUG (XendCheckpoint:227) [xc_restore]:
/usr/lib/xen/bin/xc_restore 4 7 1 2 0 0 0
[2009-06-30 13:46:11 5499] INFO (XendCheckpoint:365) xc_domain_restore
start: p2m_size = 18800
[2009-06-30 13:46:11 5499] INFO (XendCheckpoint:365) Reloading memory pages:  
0%
[2009-06-30 13:46:14 5499] INFO (XendCheckpoint:365) ERROR Internal error:
Error when reading page (type was 0)
[2009-06-30 13:46:14 5499] INFO (XendCheckpoint:365) Restore exit with rc=1
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1913)
XendDomainInfo.destroy: domid=7
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1930)
XendDomainInfo.destroyDomain(7)
[2009-06-30 13:46:14 5499] ERROR (XendDomainInfo:1942)
XendDomainInfo.destroy: xc.domain_destroy failed.
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/xen/xend/XendDomainInfo.py", line
1937, in destroyDomain
    xc.domain_destroy(self.domid)
Error: (3, 'No such process')
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1553) No device model
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1555) Releasing devices
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing vif/0
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590)
XendDomainInfo.destroyDevice: deviceClass = vif, device = vif/0
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing vbd/51713
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590)
XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51713
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing vbd/51714
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590)
XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51714
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing console/0
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590)
XendDomainInfo.destroyDevice: deviceClass = console, device = console/0
[2009-06-30 13:46:14 5499] ERROR (XendDomain:1136) Restore failed
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/xen/xend/XendDomain.py", line 1134,
in domain_restore_fd
    return XendCheckpoint.restore(self, fd, paused=paused)
  File "/usr/lib/python2.5/site-packages/xen/xend/XendCheckpoint.py", line
231, in restore
    forkHelper(cmd, fd, handler.handler, True)
  File "/usr/lib/python2.5/site-packages/xen/xend/XendCheckpoint.py", line
353, in forkHelper
    raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib/xen/bin/xc_restore 4 7 1 2 0 0 0 failed 
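
My (uncertain) reading of the xc_restore error above is that the incoming
migration stream gets cut off before all pages have arrived, which would fit
the "killed too fast" idea. Once ha1 is back up I want to look at the order of
events on the sending side, roughly like this (log locations are guesses for
my Ubuntu 8.04 / Xen 3.x install):

grep -n 'migrat\|shutdown' /var/log/xen/xend.log | tail -50
grep -n 'migrat\|shutdown' /var/log/ha-debug | tail -50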

What xendomainsHA2 does:

#!/bin/bash
#
# /etc/init.d/xendomains
# Start / stop domains automatically when domain 0 boots / shuts down.
#
# chkconfig: 345 99 00
# description: Start / stop Xen domains.
#
# This script offers fairly basic functionality.  It should work on Redhat
# but also on LSB-compliant SuSE releases and on Debian with the LSB package
# installed.  (LSB is the Linux Standard Base)
#
# Based on the example in the "Designing High Quality Integrated Linux
# Applications HOWTO" by Avi Alkalay
# <http://www.tldp.org/HOWTO/HighQuality-Apps-HOWTO/>
#
### BEGIN INIT INFO
# Provides:          xendomains
# Required-Start:    $syslog $remote_fs xend
# Should-Start:
# Required-Stop:     $syslog $remote_fs xend
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Short-Description: Start/stop secondary xen domains
# Description:       Start / stop domains automatically when domain 0 
#                    boots / shuts down.
### END INIT INFO

# Correct exit code would probably be 5, but it's enough 
# if xend complains if we're not running as privileged domain
if ! [ -e /proc/xen/privcmd ]; then
	exit 0
fi

LOCKFILE=/var/lock/xendomainsHA2
XENDOM_CONFIG=/etc/default/xendomainsHA2

test -r $XENDOM_CONFIG || { echo "$XENDOM_CONFIG not existing";
	if [ "$1" = "stop" ]; then exit 0;
	else exit 6; fi; }

. $XENDOM_CONFIG

# Use the SUSE rc_ init script functions;
# emulate them on LSB, RH and other systems
if test -e /etc/rc.status; then
    # SUSE rc script library
    . /etc/rc.status
else    
    _cmd=$1
    declare -a _SMSG
    if test "${_cmd}" = "status"; then
	_SMSG=(running dead dead unused unknown)
	_RC_UNUSED=3
    else
	_SMSG=(done failed failed missed failed skipped unused failed failed)
	_RC_UNUSED=6
    fi
    if test -e /etc/init.d/functions; then
	# REDHAT
	. /etc/init.d/functions
	echo_rc()
	{
	    #echo -n "  [${_SMSG[${_RC_RV}]}] "
	    if test ${_RC_RV} = 0; then
		success "  [${_SMSG[${_RC_RV}]}] "
	    else
		failure "  [${_SMSG[${_RC_RV}]}] "
	    fi
	}
    elif test -e /lib/lsb/init-functions; then
	# LSB    
    	. /lib/lsb/init-functions
        if alias log_success_msg >/dev/null 2>/dev/null; then
	  echo_rc()
	  {
	       echo "  [${_SMSG[${_RC_RV}]}] "
	  }
        else
	  echo_rc()
	  {
	    if test ${_RC_RV} = 0; then
		log_success_msg "  [${_SMSG[${_RC_RV}]}] "
	    else
		log_failure_msg "  [${_SMSG[${_RC_RV}]}] "
	    fi
	  }
        fi
    else    
	# emulate it
	echo_rc()
	{
	    echo "  [${_SMSG[${_RC_RV}]}] "
	}
    fi
    rc_reset() { _RC_RV=0; }
    rc_failed()
    {
	if test -z "$1"; then 
	    _RC_RV=1;
	elif test "$1" != "0"; then 
	    _RC_RV=$1; 
    	fi
	return ${_RC_RV}
    }
    rc_check()
    {
	return rc_failed $?
    }	
    rc_status()
    {
	rc_failed $?
	if test "$1" = "-r"; then _RC_RV=0; shift; fi
	if test "$1" = "-s"; then rc_failed 5; echo_rc; rc_failed 3; shift; fi
	if test "$1" = "-u"; then rc_failed ${_RC_UNUSED}; echo_rc; rc_failed 3;
shift; fi
	if test "$1" = "-v"; then echo_rc; shift; fi
	if test "$1" = "-r"; then _RC_RV=0; shift; fi
	return ${_RC_RV}
    }
    rc_exit() { exit ${_RC_RV}; }
    rc_active() 
    {
	if test -z "$RUNLEVEL"; then read RUNLEVEL REST < <(/sbin/runlevel); fi
	if test -e /etc/init.d/S[0-9][0-9]${1}; then return 0; fi
	return 1
    }
fi

if ! which usleep >&/dev/null
then
  usleep()
  {
    if [ -n "$1" ]
    then
      sleep $(( $1 / 1000000 ))
    fi
  }
fi

# Reset status of this service
rc_reset

##
# Returns 0 (success) if the given parameter names a directory, and that
# directory is not empty.
#
contains_something()
{
  if [ -d "$1" ] && [ `/bin/ls $1 | wc -l` -gt 0 ]
  then
    return 0
  else
    return 1
  fi
}

# read name from xen config file
rdname()
{
    NM=$(xm create --quiet --dryrun --defconfig "$1" |
         sed -n 's/^.*(name \(.*\))$/\1/p')
}

rdnames()
{
    NAMES=
    if ! contains_something "$XENDOMAINS_AUTO"
    then 
	return
    fi
    for dom in $XENDOMAINS_AUTO/*; do
	rdname $dom
	if test -z $NAMES; then 
	    NAMES=$NM; 
	else
	    NAMES="$NAMES|$NM"
	fi
    done
}

parseln()
{
    name=`echo "$1" | cut -d\  -f1`
    name=${name%% *}
    rest=`echo "$1" | cut -d\  -f2-`
    read id mem cpu vcpu state tm < <(echo "$rest")
}

is_running()
{
    rdname $1
    RC=1
    while read LN; do
	parseln "$LN"
	if test "$id" = "0"; then continue; fi
	case $name in 
	    ($NM)
		RC=0
		;;
	esac
    done < <(xm list | grep -v '^Name')
    return $RC
}

start() 
{
    if [ -f $LOCKFILE ]; then 
	echo -n "xendomains already running (lockfile exists)"
	return; 
    fi

    saved_domains=" "
    if [ "$XENDOMAINS_RESTORE" = "true" ] &&
       contains_something "$XENDOMAINS_SAVE"
    then
        mkdir -p $(dirname "$LOCKFILE")
	touch $LOCKFILE
	echo -n "Restoring Xen domains:"
	saved_domains=`ls $XENDOMAINS_SAVE`
	for dom in $XENDOMAINS_SAVE/*; do
	    echo -n " ${dom##*/}"
	    xm restore $dom
	    if [ $? -ne 0 ]; then
		rc_failed $?
		echo -n '!'
	    else
		# mv $dom ${dom%/*}/.${dom##*/}
		rm $dom
	    fi
	done
	echo .
    fi

    if contains_something "$XENDOMAINS_AUTO"
    then
	touch $LOCKFILE
	echo -n "Starting auto Xen domains:"
	# We expect config scripts for auto starting domains to be in
	# XENDOMAINS_AUTO - they could just be symlinks to files elsewhere

	# Create all domains with config files in XENDOMAINS_AUTO.
	# TODO: We should record which domain name belongs 
	# so we have the option to selectively shut down / migrate later
	# If a domain statefile from $XENDOMAINS_SAVE matches a domain name
	# in $XENDOMAINS_AUTO, do not try to start that domain; if it didn't 
	# restore correctly it requires administrative attention.
	for dom in $XENDOMAINS_AUTO/*; do
	    echo -n " ${dom##*/}"
	    shortdom=$(echo $dom | sed -n 's/^.*\/\(.*\)$/\1/p')
	    echo $saved_domains | grep -w $shortdom > /dev/null
	    if [ $? -eq 0 ] || is_running $dom; then
		echo -n "(skip)"
	    else
		xm create --quiet --defconfig $dom
		if [ $? -ne 0 ]; then
		    rc_failed $?
		    echo -n '!'
		else
		    usleep $XENDOMAINS_CREATE_USLEEP
		fi
	    fi
	done
    fi	
}

all_zombies()
{
    while read LN; do
	parseln "$LN"
	if test $id = 0; then continue; fi
	if test "$state" != "-b---d" -a "$state" != "-----d"; then
	    return 1;
	fi
    done < <(xm list | grep -v '^Name')
    return 0
}

# Wait for max $XENDOMAINS_STOP_MAXWAIT for xm $1 to finish;
# if it has not exited by that time kill it, so the init script will
# succeed within a finite amount of time; if $2 is nonnull, it will
# kill the command as well as soon as no domain (except for zombies)
# are left (used for shutdown --all).
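# NOTE (my own annotation, not part of the stock xendomains script): if
# XENDOMAINS_STOP_MAXWAIT is shorter than the time a live migration needs,
# this watchdog will kill the "xm migrate" started from stop() while the
# transfer is still in progress; I wonder if that is what the receiving
# node's xc_restore is choking on.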
watchdog_xm()
{
    if test -z "$XENDOMAINS_STOP_MAXWAIT" -o "$XENDOMAINS_STOP_MAXWAIT" =
"0"; then
	exit
    fi
    usleep 20000
    for no in `seq 0 $XENDOMAINS_STOP_MAXWAIT`; do
	# exit if xm save/migrate/shutdown is finished
	PSAX=`ps axlw | grep "xm $1" | grep -v grep`
	if test -z "$PSAX"; then exit; fi
	echo -n "."; sleep 1
	# go to kill immediately if there's only zombies left
	if all_zombies && test -n "$2"; then break; fi
    done
    sleep 1
    read PSF PSUID PSPID PSPPID < <(echo "$PSAX")
    # kill xm $1
    kill $PSPID >/dev/null 2>&1
}

stop()
{
    # Collect list of domains to shut down
    if test "$XENDOMAINS_AUTO_ONLY" = "true"; then
	rdnames
    fi
    echo -n "Shutting down Xen domains:"
    while read LN; do
	parseln "$LN"
	if test $id = 0; then continue; fi
	echo -n " $name"
	if test "$XENDOMAINS_AUTO_ONLY" = "true"; then
	    case $name in
		($NAMES)
		    # nothing
		    ;;
		(*)
		    echo -n "(skip)"
		    continue
		    ;;
	    esac
	fi
	# XENDOMAINS_SYSRQ should be something like just "s" 
	# or "s e i u" or even "s e s i u o"
	# for the latter, you should set XENDOMAINS_USLEEP to 1200000 or so
	if test -n "$XENDOMAINS_SYSRQ"; then
	    for sysrq in $XENDOMAINS_SYSRQ; do
		echo -n "(SR-$sysrq)"
		xm sysrq $id $sysrq
		if test $? -ne 0; then
		    rc_failed $?
		    echo -n '!'
		fi
		# usleep just ignores empty arg
		usleep $XENDOMAINS_USLEEP
	    done
	fi
	if test "$state" = "-b---d" -o "$state" = "-----d"; then
	    echo -n "(zomb)"
	    continue
	fi
	if test -n "$XENDOMAINS_MIGRATE"; then
	    echo -n "(migr)"
	    watchdog_xm migrate &
	    WDOG_PID=$!
	    xm migrate $id $XENDOMAINS_MIGRATE
	    if test $? -ne 0; then
		rc_failed $?
		echo -n '!'
		kill $WDOG_PID >/dev/null 2>&1
	    else
		kill $WDOG_PID >/dev/null 2>&1
		continue
	    fi
	fi
	if test -n "$XENDOMAINS_SAVE"; then
	    echo -n "(save)"
	    watchdog_xm save &
	    WDOG_PID=$!
	    mkdir -p "$XENDOMAINS_SAVE"
	    xm save $id $XENDOMAINS_SAVE/$name
	    if test $? -ne 0; then
		rc_failed $?
		echo -n '!'
		kill $WDOG_PID >/dev/null 2>&1
	    else
		kill $WDOG_PID >/dev/null 2>&1
		continue
	    fi
	fi
	if test -n "$XENDOMAINS_SHUTDOWN"; then
	    # XENDOMAINS_SHUTDOWN should be "--halt --wait"
	    echo -n "(shut)"
	    watchdog_xm shutdown &
	    WDOG_PID=$!
	    xm shutdown $id $XENDOMAINS_SHUTDOWN
	    if test $? -ne 0; then
		rc_failed $?
		echo -n '!'
	    fi
	    kill $WDOG_PID >/dev/null 2>&1
	fi
    done < <(xm list | grep -v '^Name')

    # NB. this shuts down ALL Xen domains (politely), not just the ones in
    # AUTODIR/*
    # This is because it's easier to do ;-) but arguably if this script is run
    # on system shutdown then it's also the right thing to do.
    if ! all_zombies && test -n "$XENDOMAINS_SHUTDOWN_ALL"; then
	# XENDOMAINS_SHUTDOWN_ALL should be "--all --halt --wait"
	echo -n " SHUTDOWN_ALL "
	watchdog_xm shutdown 1 &
	WDOG_PID=$!
	xm shutdown $XENDOMAINS_SHUTDOWN_ALL
	if test $? -ne 0; then
	    rc_failed $?
	    echo -n '!'
	fi
	kill $WDOG_PID >/dev/null 2>&1
    fi

    # Unconditionally delete lock file
    rm -f $LOCKFILE
}

check_domain_up()
{
    while read LN; do
	parseln "$LN"
	if test $id = 0; then continue; fi
	case $name in 
	    ($1)
		return 0
		;;
	esac
    done < <(xm list | grep -v "^Name")
    return 1
}

check_all_auto_domains_up()
{
    if ! contains_something "$XENDOMAINS_AUTO"
    then
      return 0
    fi
    missing=
    for nm in $XENDOMAINS_AUTO/*; do
	rdname $nm
	found=0
	if check_domain_up "$NM"; then 
	    echo -n " $name"
	else 
	    missing="$missing $NM"
	fi
    done
    if test -n "$missing"; then
	echo -n " MISS AUTO:$missing"
	return 1
    fi
    return 0
}

check_all_saved_domains_up()
{
    if ! contains_something "$XENDOMAINS_SAVE" 
    then
      return 0
    fi
    missing=`/bin/ls $XENDOMAINS_SAVE`
    echo -n " MISS SAVED: " $missing
    return 1
}

# This does NOT necessarily restart all running domains: instead it
# stops all running domains and then boots all the domains specified in
# AUTODIR.  If other domains have been started manually then they will
# not get restarted.
# Commented out to avoid confusion!

restart()
{
    stop
    start
}

reload()
{
    restart
}


case "$1" in
    start)
	start
	rc_status
	if test -f $LOCKFILE; then rc_status -v; fi
	;;

    stop)
	stop
	rc_status -v
	;;

    restart)
	restart
	;;
    reload)
	reload
	;;
    force-reload)
	reload
	;;
    status)
	echo -n "Checking for xendomains:" 
	if test ! -f $LOCKFILE; then 
	    rc_failed 3
	else
	    check_all_auto_domains_up
	    rc_status
	    check_all_saved_domains_up
	    rc_status
	fi
	rc_status -v
	;;

    *)
	echo "Usage: $0 {start|stop|restart|reload|status}"
	rc_failed 3
	rc_status -v
	;;
esac

rc_exit
