[Linux-HA] Weird HA-behavior with XEN/HA/DRBD/LVM
sussox
sussox at gmail.com
Fri Jul 3 01:30:07 MDT 2009
I have an HA setup with Xen, Heartbeat, LVM and DRBD, built according to this
howto:
http://www.asplund.nu/xencluster/xen-cluster-howto.html
I'm using two PowerEdge R710s, ha1 and ha2, running Ubuntu 8.04.
I can do a manual live migration of the domUs without any problems. Also, if
I shut down heartbeat on ha1 with init.d/heartbeat, the domUs are migrated to
ha2 successfully.
Also, if I pull the plug on ha1, after a while the domUs start on ha2 (as
they should).
However! When doing a "reboot" on ha1, the domUs begin to migrate but then
crash on ha2. I'm pasting ha-debug and xend.log below.
Any ideas why it keeps doing this? All I can think of is that some process is
being killed too fast while the domUs are being migrated, but I don't know
what to look for. Also, I ran a test a couple of weeks ago with the same
setup, but on one R710 and an older Shuttle, and there was no problem then.
I have redone the howto twice, with the same result.
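One way to look into the "killed too fast" guess would be to check the order
in which the init scripts are stopped on reboot. This is only a minimal
sketch, assuming SysV-style K links in /etc/rc6.d (runlevel 6 is reboot) that
run in lexical order. The function names list_stop_order and stops_before are
made up for this example, and the actual link names depend on the
installation.

```shell
#!/bin/sh
# Print the stop (K*) links of a runlevel directory in the order the
# shell glob returns them (lexical order), one name per line.
# Assumption: SysV-style rc directory; defaults to /etc/rc6.d (reboot).
list_stop_order()
{
    for _link in "${1:-/etc/rc6.d}"/K[0-9][0-9]*; do
        # Skip the literal pattern when nothing matches.
        [ -e "$_link" ] || [ -L "$_link" ] || continue
        echo "${_link##*/}"
    done
}

# Succeed if service $2 is stopped before service $3 in directory $1.
# Fails if either service has no K link in that directory.
stops_before()
{
    _dir=$1; _first=$2; _second=$3
    _pos=0; _pos_first=; _pos_second=
    for _link in "$_dir"/K[0-9][0-9]*; do
        [ -e "$_link" ] || [ -L "$_link" ] || continue
        _pos=$((_pos + 1))
        _name=${_link##*/}
        # Strip the "Knn" prefix to get the bare service name.
        case "${_name#K[0-9][0-9]}" in
            "$_first")  _pos_first=$_pos ;;
            "$_second") _pos_second=$_pos ;;
        esac
    done
    [ -n "$_pos_first" ] && [ -n "$_pos_second" ] &&
        [ "$_pos_first" -lt "$_pos_second" ]
}
```

For example, `stops_before /etc/rc6.d heartbeat xend && echo ok` would show
whether heartbeat (whose stop triggers the migration) is stopped before xend;
the same check could be repeated for the networking and DRBD scripts.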
Cheers! /Sussox
ha-debug:
Code:
heartbeat[28295]: 2009/06/30_13:46:12 info: Received shutdown notice from
'ha1.vbm.se'.
heartbeat[28295]: 2009/06/30_13:46:12 info: Resources being acquired from
ha1.vbm.se.
heartbeat[28295]: 2009/06/30_13:46:12 debug: StartNextRemoteRscReq(): child
count 1
heartbeat[29445]: 2009/06/30_13:46:12 info: acquire local HA resources
(standby).
ResourceManager[29472]: 2009/06/30_13:46:12 info: Acquiring resource group:
ha2.vbm.se xendomainsHA2
heartbeat[29446]: 2009/06/30_13:46:13 info: Local Resource acquisition
completed.
heartbeat[28295]: 2009/06/30_13:46:13 debug: StartNextRemoteRscReq(): child
count 2
heartbeat[28295]: 2009/06/30_13:46:13 debug: StartNextRemoteRscReq(): child
count 1
ResourceManager[29472]: 2009/06/30_13:46:13 info: Running
/etc/ha.d/resource.d/xendomainsHA2 start
ResourceManager[29472]: 2009/06/30_13:46:13 debug: Starting
/etc/ha.d/resource.d/xendomainsHA2 start
ResourceManager[29472]: 2009/06/30_13:46:13 debug:
/etc/ha.d/resource.d/xendomainsHA2 start done. RC=0
heartbeat[29445]: 2009/06/30_13:46:13 info: local HA resource acquisition
completed (standby).
heartbeat[28295]: 2009/06/30_13:46:13 info: Standby resource acquisition
done [foreign].
heartbeat[29559]: 2009/06/30_13:46:13 debug: notify_world: setting SIGCHLD
Handler to SIG_DFL
harc[29559]: 2009/06/30_13:46:13 info: Running /etc/ha.d/rc.d/status status
mach_down[29573]: 2009/06/30_13:46:13 info: Taking over resource group
xendomainsHA1
ResourceManager[29597]: 2009/06/30_13:46:13 info: Acquiring resource group:
ha1.vbm.se xendomainsHA1
ResourceManager[29597]: 2009/06/30_13:46:13 info: Running
/etc/ha.d/resource.d/xendomainsHA1 start
ResourceManager[29597]: 2009/06/30_13:46:13 debug: Starting
/etc/ha.d/resource.d/xendomainsHA1 start
Starting auto Xen domains: hejsan(skip) * [done]
ResourceManager[29597]: 2009/06/30_13:46:13 debug:
/etc/ha.d/resource.d/xendomainsHA1 start done. RC=0
mach_down[29573]: 2009/06/30_13:46:13 info: /usr/share/heartbeat/mach_down:
nice_failback: foreign resources acquired
mach_down[29573]: 2009/06/30_13:46:13 info: mach_down takeover complete for
node ha1.vbm.se.
heartbeat[28295]: 2009/06/30_13:46:13 info: mach_down takeover complete.
heartbeat[29696]: 2009/06/30_13:46:13 debug: notify_world: setting SIGCHLD
Handler to SIG_DFL
harc[29696]: 2009/06/30_13:46:13 info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[29696]: 2009/06/30_13:46:13 received ip-request-resp
xendomainsHA2 OK yes
ResourceManager[29715]: 2009/06/30_13:46:13 info: Acquiring resource group:
ha2.vbm.se xendomainsHA2
ResourceManager[29715]: 2009/06/30_13:46:13 info: Running
/etc/ha.d/resource.d/xendomainsHA2 start
ResourceManager[29715]: 2009/06/30_13:46:13 debug: Starting
/etc/ha.d/resource.d/xendomainsHA2 start
ResourceManager[29715]: 2009/06/30_13:46:13 debug:
/etc/ha.d/resource.d/xendomainsHA2 start done. RC=0
heartbeat[28295]: 2009/06/30_13:46:24 WARN: node ha1.vbm.se: is dead
heartbeat[28295]: 2009/06/30_13:46:24 info: Dead node ha1.vbm.se gave up
resources.
heartbeat[28295]: 2009/06/30_13:46:24 info: Link ha1.vbm.se:eth0 dead.
xend.log:
Code:
[2009-06-30 13:46:11 5499] DEBUG (XendCheckpoint:210) restore:shadow=0x0,
_static_max=0x18000000, _static_min=0x0,
[2009-06-30 13:46:11 5499] DEBUG (balloon:151) Balloon: 398436 KiB free;
need 393216; done.
[2009-06-30 13:46:11 5499] DEBUG (XendCheckpoint:227) [xc_restore]:
/usr/lib/xen/bin/xc_restore 4 7 1 2 0 0 0
[2009-06-30 13:46:11 5499] INFO (XendCheckpoint:365) xc_domain_restore
start: p2m_size = 18800
[2009-06-30 13:46:11 5499] INFO (XendCheckpoint:365) Reloading memory pages:
0%
[2009-06-30 13:46:14 5499] INFO (XendCheckpoint:365) ERROR Internal error:
Error when reading page (type was 0)
[2009-06-30 13:46:14 5499] INFO (XendCheckpoint:365) Restore exit with rc=1
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1913)
XendDomainInfo.destroy: domid=7
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1930)
XendDomainInfo.destroyDomain(7)
[2009-06-30 13:46:14 5499] ERROR (XendDomainInfo:1942)
XendDomainInfo.destroy: xc.domain_destroy failed.
Traceback (most recent call last):
File "/usr/lib/python2.5/site-packages/xen/xend/XendDomainInfo.py", line
1937, in destroyDomain
xc.domain_destroy(self.domid)
Error: (3, 'No such process')
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1553) No device model
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1555) Releasing devices
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing vif/0
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590)
XendDomainInfo.destroyDevice: deviceClass = vif, device = vif/0
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing vbd/51713
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590)
XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51713
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing vbd/51714
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590)
XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51714
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:1561) Removing console/0
[2009-06-30 13:46:14 5499] DEBUG (XendDomainInfo:590)
XendDomainInfo.destroyDevice: deviceClass = console, device = console/0
[2009-06-30 13:46:14 5499] ERROR (XendDomain:1136) Restore failed
Traceback (most recent call last):
File "/usr/lib/python2.5/site-packages/xen/xend/XendDomain.py", line 1134,
in domain_restore_fd
return XendCheckpoint.restore(self, fd, paused=paused)
File "/usr/lib/python2.5/site-packages/xen/xend/XendCheckpoint.py", line
231, in restore
forkHelper(cmd, fd, handler.handler, True)
File "/usr/lib/python2.5/site-packages/xen/xend/XendCheckpoint.py", line
353, in forkHelper
raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib/xen/bin/xc_restore 4 7 1 2 0 0 0 failed
What xendomainsHA2 does:
#!/bin/bash
#
# /etc/init.d/xendomains
# Start / stop domains automatically when domain 0 boots / shuts down.
#
# chkconfig: 345 99 00
# description: Start / stop Xen domains.
#
# This script offers fairly basic functionality. It should work on Redhat
# but also on LSB-compliant SuSE releases and on Debian with the LSB package
# installed. (LSB is the Linux Standard Base)
#
# Based on the example in the "Designing High Quality Integrated Linux
# Applications HOWTO" by Avi Alkalay
# <http://www.tldp.org/HOWTO/HighQuality-Apps-HOWTO/>
#
### BEGIN INIT INFO
# Provides: xendomains
# Required-Start: $syslog $remote_fs xend
# Should-Start:
# Required-Stop: $syslog $remote_fs xend
# Should-Stop:
# Default-Start: 3 4 5
# Default-Stop: 0 1 2 6
# Short-Description: Start/stop secondary xen domains
# Description: Start / stop domains automatically when domain 0
# boots / shuts down.
### END INIT INFO
# Correct exit code would probably be 5, but it's enough
# if xend complains if we're not running as privileged domain
if ! [ -e /proc/xen/privcmd ]; then
exit 0
fi
LOCKFILE=/var/lock/xendomainsHA2
XENDOM_CONFIG=/etc/default/xendomainsHA2
test -r $XENDOM_CONFIG || { echo "$XENDOM_CONFIG not existing";
if [ "$1" = "stop" ]; then exit 0;
else exit 6; fi; }
. $XENDOM_CONFIG
# Use the SUSE rc_ init script functions;
# emulate them on LSB, RH and other systems
if test -e /etc/rc.status; then
# SUSE rc script library
. /etc/rc.status
else
_cmd=$1
declare -a _SMSG
if test "${_cmd}" = "status"; then
_SMSG=(running dead dead unused unknown)
_RC_UNUSED=3
else
_SMSG=(done failed failed missed failed skipped unused failed failed)
_RC_UNUSED=6
fi
if test -e /etc/init.d/functions; then
# REDHAT
. /etc/init.d/functions
echo_rc()
{
#echo -n " [${_SMSG[${_RC_RV}]}] "
if test ${_RC_RV} = 0; then
success " [${_SMSG[${_RC_RV}]}] "
else
failure " [${_SMSG[${_RC_RV}]}] "
fi
}
elif test -e /lib/lsb/init-functions; then
# LSB
. /lib/lsb/init-functions
if alias log_success_msg >/dev/null 2>/dev/null; then
echo_rc()
{
echo " [${_SMSG[${_RC_RV}]}] "
}
else
echo_rc()
{
if test ${_RC_RV} = 0; then
log_success_msg " [${_SMSG[${_RC_RV}]}] "
else
log_failure_msg " [${_SMSG[${_RC_RV}]}] "
fi
}
fi
else
# emulate it
echo_rc()
{
echo " [${_SMSG[${_RC_RV}]}] "
}
fi
rc_reset() { _RC_RV=0; }
rc_failed()
{
if test -z "$1"; then
_RC_RV=1;
elif test "$1" != "0"; then
_RC_RV=$1;
fi
return ${_RC_RV}
}
rc_check()
{
# "return" only accepts a number; rc_failed already returns the status
rc_failed $?
}
rc_status()
{
rc_failed $?
if test "$1" = "-r"; then _RC_RV=0; shift; fi
if test "$1" = "-s"; then rc_failed 5; echo_rc; rc_failed 3; shift; fi
if test "$1" = "-u"; then rc_failed ${_RC_UNUSED}; echo_rc; rc_failed 3;
shift; fi
if test "$1" = "-v"; then echo_rc; shift; fi
if test "$1" = "-r"; then _RC_RV=0; shift; fi
return ${_RC_RV}
}
rc_exit() { exit ${_RC_RV}; }
rc_active()
{
if test -z "$RUNLEVEL"; then read RUNLEVEL REST < <(/sbin/runlevel); fi
if test -e /etc/init.d/S[0-9][0-9]${1}; then return 0; fi
return 1
}
fi
if ! which usleep >&/dev/null
then
usleep()
{
if [ -n "$1" ]
then
sleep $(( $1 / 1000000 ))
fi
}
fi
# Reset status of this service
rc_reset
##
# Returns 0 (success) if the given parameter names a directory, and that
# directory is not empty.
#
contains_something()
{
if [ -d "$1" ] && [ `/bin/ls $1 | wc -l` -gt 0 ]
then
return 0
else
return 1
fi
}
# read name from xen config file
rdname()
{
NM=$(xm create --quiet --dryrun --defconfig "$1" |
sed -n 's/^.*(name \(.*\))$/\1/p')
}
rdnames()
{
NAMES=
if ! contains_something "$XENDOMAINS_AUTO"
then
return
fi
for dom in $XENDOMAINS_AUTO/*; do
rdname $dom
if test -z $NAMES; then
NAMES=$NM;
else
NAMES="$NAMES|$NM"
fi
done
}
parseln()
{
name=`echo "$1" | cut -d' ' -f1`
name=${name%% *}
rest=`echo "$1" | cut -d' ' -f2-`
read id mem cpu vcpu state tm < <(echo "$rest")
}
is_running()
{
rdname $1
RC=1
while read LN; do
parseln "$LN"
if test "$id" = "0"; then continue; fi
case $name in
($NM)
RC=0
;;
esac
done < <(xm list | grep -v '^Name')
return $RC
}
start()
{
if [ -f $LOCKFILE ]; then
echo -n "xendomains already running (lockfile exists)"
return;
fi
saved_domains=" "
if [ "$XENDOMAINS_RESTORE" = "true" ] &&
contains_something "$XENDOMAINS_SAVE"
then
mkdir -p $(dirname "$LOCKFILE")
touch $LOCKFILE
echo -n "Restoring Xen domains:"
saved_domains=`ls $XENDOMAINS_SAVE`
for dom in $XENDOMAINS_SAVE/*; do
echo -n " ${dom##*/}"
xm restore $dom
# save the exit status; $? is overwritten by the test below
xm_rc=$?
if [ $xm_rc -ne 0 ]; then
rc_failed $xm_rc
echo -n '!'
else
# mv $dom ${dom%/*}/.${dom##*/}
rm $dom
fi
done
echo .
fi
if contains_something "$XENDOMAINS_AUTO"
then
touch $LOCKFILE
echo -n "Starting auto Xen domains:"
# We expect config scripts for auto starting domains to be in
# XENDOMAINS_AUTO - they could just be symlinks to files elsewhere
# Create all domains with config files in XENDOMAINS_AUTO.
# TODO: We should record which domain name belongs
# so we have the option to selectively shut down / migrate later
# If a domain statefile from $XENDOMAINS_SAVE matches a domain name
# in $XENDOMAINS_AUTO, do not try to start that domain; if it didn't
# restore correctly it requires administrative attention.
for dom in $XENDOMAINS_AUTO/*; do
echo -n " ${dom##*/}"
shortdom=$(echo $dom | sed -n 's/^.*\/\(.*\)$/\1/p')
echo $saved_domains | grep -w $shortdom > /dev/null
if [ $? -eq 0 ] || is_running $dom; then
echo -n "(skip)"
else
xm create --quiet --defconfig $dom
# save the exit status; $? is overwritten by the test below
xm_rc=$?
if [ $xm_rc -ne 0 ]; then
rc_failed $xm_rc
echo -n '!'
else
usleep $XENDOMAINS_CREATE_USLEEP
fi
fi
done
fi
}
all_zombies()
{
while read LN; do
parseln "$LN"
if test $id = 0; then continue; fi
if test "$state" != "-b---d" -a "$state" != "-----d"; then
return 1;
fi
done < <(xm list | grep -v '^Name')
return 0
}
# Wait for max $XENDOMAINS_STOP_MAXWAIT for xm $1 to finish;
# if it has not exited by that time kill it, so the init script will
# succeed within a finite amount of time; if $2 is nonnull, it will
# kill the command as well as soon as no domain (except for zombies)
# are left (used for shutdown --all).
watchdog_xm()
{
if test -z "$XENDOMAINS_STOP_MAXWAIT" -o "$XENDOMAINS_STOP_MAXWAIT" = "0"; then
exit
fi
usleep 20000
for no in `seq 0 $XENDOMAINS_STOP_MAXWAIT`; do
# exit if xm save/migrate/shutdown is finished
PSAX=`ps axlw | grep "xm $1" | grep -v grep`
if test -z "$PSAX"; then exit; fi
echo -n "."; sleep 1
# go to kill immediately if there's only zombies left
if all_zombies && test -n "$2"; then break; fi
done
sleep 1
read PSF PSUID PSPID PSPPID < <(echo "$PSAX")
# kill xm $1
kill $PSPID >/dev/null 2>&1
}
stop()
{
# Collect list of domains to shut down
if test "$XENDOMAINS_AUTO_ONLY" = "true"; then
rdnames
fi
echo -n "Shutting down Xen domains:"
while read LN; do
parseln "$LN"
if test $id = 0; then continue; fi
echo -n " $name"
if test "$XENDOMAINS_AUTO_ONLY" = "true"; then
case $name in
($NAMES)
# nothing
;;
(*)
echo -n "(skip)"
continue
;;
esac
fi
# XENDOMAINS_SYSRQ could be something like just "s"
# or "s e i u" or even "s e s i u o"
# for the latter, you should set XENDOMAINS_USLEEP to 1200000 or so
if test -n "$XENDOMAINS_SYSRQ"; then
for sysrq in $XENDOMAINS_SYSRQ; do
echo -n "(SR-$sysrq)"
xm sysrq $id $sysrq
# save the exit status; $? is overwritten by the test below
xm_rc=$?
if test $xm_rc -ne 0; then
rc_failed $xm_rc
echo -n '!'
fi
# usleep just ignores empty arg
usleep $XENDOMAINS_USLEEP
done
fi
if test "$state" = "-b---d" -o "$state" = "-----d"; then
echo -n "(zomb)"
continue
fi
if test -n "$XENDOMAINS_MIGRATE"; then
echo -n "(migr)"
watchdog_xm migrate &
WDOG_PID=$!
xm migrate $id $XENDOMAINS_MIGRATE
# save the exit status; $? is overwritten by the test below
xm_rc=$?
if test $xm_rc -ne 0; then
rc_failed $xm_rc
echo -n '!'
kill $WDOG_PID >/dev/null 2>&1
else
kill $WDOG_PID >/dev/null 2>&1
continue
fi
fi
if test -n "$XENDOMAINS_SAVE"; then
echo -n "(save)"
watchdog_xm save &
WDOG_PID=$!
mkdir -p "$XENDOMAINS_SAVE"
xm save $id $XENDOMAINS_SAVE/$name
# save the exit status; $? is overwritten by the test below
xm_rc=$?
if test $xm_rc -ne 0; then
rc_failed $xm_rc
echo -n '!'
kill $WDOG_PID >/dev/null 2>&1
else
kill $WDOG_PID >/dev/null 2>&1
continue
fi
fi
if test -n "$XENDOMAINS_SHUTDOWN"; then
# XENDOMAINS_SHUTDOWN should be "--halt --wait"
echo -n "(shut)"
watchdog_xm shutdown &
WDOG_PID=$!
xm shutdown $id $XENDOMAINS_SHUTDOWN
# save the exit status; $? is overwritten by the test below
xm_rc=$?
if test $xm_rc -ne 0; then
rc_failed $xm_rc
echo -n '!'
fi
kill $WDOG_PID >/dev/null 2>&1
fi
done < <(xm list | grep -v '^Name')
# NB. this shuts down ALL Xen domains (politely), not just the ones in
# AUTODIR/*
# This is because it's easier to do ;-) but arguably if this script is run
# on system shutdown then it's also the right thing to do.
if ! all_zombies && test -n "$XENDOMAINS_SHUTDOWN_ALL"; then
# XENDOMAINS_SHUTDOWN_ALL should be "--all --halt --wait"
echo -n " SHUTDOWN_ALL "
watchdog_xm shutdown 1 &
WDOG_PID=$!
xm shutdown $XENDOMAINS_SHUTDOWN_ALL
# save the exit status; $? is overwritten by the test below
xm_rc=$?
if test $xm_rc -ne 0; then
rc_failed $xm_rc
echo -n '!'
fi
kill $WDOG_PID >/dev/null 2>&1
fi
# Unconditionally delete lock file
rm -f $LOCKFILE
}
check_domain_up()
{
while read LN; do
parseln "$LN"
if test $id = 0; then continue; fi
case $name in
($1)
return 0
;;
esac
done < <(xm list | grep -v "^Name")
return 1
}
check_all_auto_domains_up()
{
if ! contains_something "$XENDOMAINS_AUTO"
then
return 0
fi
missing=
for nm in $XENDOMAINS_AUTO/*; do
rdname $nm
found=0
if check_domain_up "$NM"; then
echo -n " $name"
else
missing="$missing $NM"
fi
done
if test -n "$missing"; then
echo -n " MISS AUTO:$missing"
return 1
fi
return 0
}
check_all_saved_domains_up()
{
if ! contains_something "$XENDOMAINS_SAVE"
then
return 0
fi
missing=`/bin/ls $XENDOMAINS_SAVE`
echo -n " MISS SAVED: " $missing
return 1
}
# This does NOT necessarily restart all running domains: instead it
# stops all running domains and then boots all the domains specified in
# AUTODIR. If other domains have been started manually then they will
# not get restarted.
# Commented out to avoid confusion!
restart()
{
stop
start
}
reload()
{
restart
}
case "$1" in
start)
start
rc_status
if test -f $LOCKFILE; then rc_status -v; fi
;;
stop)
stop
rc_status -v
;;
restart)
restart
;;
reload)
reload
;;
force-reload)
reload
;;
status)
echo -n "Checking for xendomains:"
if test ! -f $LOCKFILE; then
rc_failed 3
else
check_all_auto_domains_up
rc_status
check_all_saved_domains_up
rc_status
fi
rc_status -v
;;
*)
echo "Usage: $0 {start|stop|restart|reload|status}"
rc_failed 3
rc_status -v
;;
esac
rc_exit
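
A side note on the usleep fallback in the script above: when no usleep binary
is installed, it divides by 1000000 using integer arithmetic, so the
"usleep 20000" in watchdog_xm turns into "sleep 0". Below is a minimal sketch
of a fallback that keeps the sub-second part. It assumes a sleep that accepts
fractional seconds (GNU coreutils does; POSIX does not require it). The names
usec_to_sec and usleep_frac are made up for this example and are not part of
the howto's script.

```shell
#!/bin/sh
# Convert a microsecond count to a decimal number of seconds,
# e.g. 20000 -> 0.020000. Uses integer arithmetic only.
usec_to_sec()
{
    printf '%d.%06d\n' $(( $1 / 1000000 )) $(( $1 % 1000000 ))
}

# Hypothetical replacement for the emulated usleep.
# Like the original, it silently ignores an empty argument.
# Assumption: sleep accepts fractional seconds (true for GNU sleep).
usleep_frac()
{
    if [ -n "$1" ]; then
        sleep "$(usec_to_sec "$1")"
    fi
}
```

Whether the missing 20 ms delay matters for the failed migration is unclear;
it only shows that the timing in the watchdog differs between systems with and
without a real usleep.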