[Linux-HA] Re: failback (Andrew Beekhof)
Sripathi, Roopa (Roopa)
rsripathi at alcatel-lucent.com
Thu Oct 4 08:18:51 MDT 2007
Hi,
Attached is the input.xml generated by running:
ptest -L -VVVVVV --save-input input.xml
The only way I can get failback to happen is by running:
crm_resource -C -r IPaddr_cluster -H roopa1
crm_resource -C -r RES_X -H roopa1
I need failback to happen automatically once RES_X is fixed on the
first node (Node A) and RES_X then fails on Node B, where it had
failed over to.
That is, I need failback to happen without having to run the
crm_resource -C command.
Is that possible?
I changed the default resource stickiness to 0, but the behaviour is
still the same.
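The manual cleanup described above can be wrapped in a small helper so that it takes one call per node. This is only a sketch of the workaround, not automatic failback: it uses the same crm_resource -C -r ... -H ... invocation shown above, and the function name cleanup_resources and the DRY_RUN variable are made up for illustration.

```shell
#!/bin/bash
# Sketch: clear the failed-operation history of several resources on one node.
# cleanup_resources and DRY_RUN are hypothetical names, not part of heartbeat.
# With DRY_RUN=1 the commands are printed instead of executed.

cleanup_resources() {
    local node=$1
    shift
    local rsc
    for rsc in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "crm_resource -C -r $rsc -H $node"
        else
            # Same command as run by hand in this thread.
            crm_resource -C -r "$rsc" -H "$node" || return 1
        fi
    done
}

# Example:
#   DRY_RUN=1; cleanup_resources roopa1 IPaddr_cluster RES_X
```

Running it in dry-run mode first shows exactly which commands would be issued before anything touches the cluster.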
I am attaching the logs and cib.xml, zipped.
Looking for help asap,
Thanks,
Roopa Sripathi
Message: 5
Date: Mon, 1 Oct 2007 14:26:45 +0200
From: "Andrew Beekhof" <beekhof at gmail.com>
Subject: Re: [Linux-HA] failback
To: "General Linux-HA mailing list" <linux-ha at lists.linux-ha.org>
Message-ID:
<26ef5e70710010526h2f1f3c5eg4a53ff42bc1830c3 at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
this is why they're not being started anywhere
ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: All nodes
for resource IPaddr_cluster are unavailable, unclean or shutting down
the next question is why they are unavailable... which is alas
impossible to know without the current cluster status.
try adding:
--save-input input.xml
to the ptest command and attaching input.xml here
Hi,
I have the following problem.
I am using heartbeat 2.1.2
I have an IP address resource and another resource, RES_X.
There are two nodes, Node A and Node B.
In my Active Passive configuration, I will have the RES_X up all the
time on both servers.
On failover or heartbeat startup, a reload command has to be run on
this resource. So in the OCF RA script for RES_X, start runs a reload
of that resource, and stop does nothing but return success.
My default failure stickiness is -100 and my default resource
stickiness is 300.
RES_X has a score of 250, with Node A as the preferred location.
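To see how these numbers might interact, here is a rough sketch of a per-node score. It rests on assumptions that should be checked against the heartbeat 2.1.2 documentation: that the score is the location preference, plus the resource stickiness on the node where the resource currently runs, plus the fail count times the failure stickiness, and that a failed start pins the node to -1000000. The function node_score is hypothetical and is not policy engine code.

```shell
#!/bin/bash
# Simplified, ASSUMED scoring model for one resource on one node.
# This is an illustration only, not how the policy engine is implemented.

MINUS_INFINITY=-1000000   # the value that appears in the ptest output

# node_score PREFERENCE STICKINESS RUNNING_HERE FAILCOUNT FAILURE_STICKINESS START_FAILED
# RUNNING_HERE and START_FAILED are 0 or 1.
node_score() {
    local pref=$1 stickiness=$2 running_here=$3
    local failcount=$4 failure_stickiness=$5 start_failed=$6

    # Assumption: a failed start makes the node ineligible outright.
    if [ "$start_failed" -eq 1 ]; then
        printf '%s\n' "$MINUS_INFINITY"
        return 0
    fi

    local score=$(( pref + failcount * failure_stickiness ))
    # Stickiness only counts on the node currently hosting the resource.
    if [ "$running_here" -eq 1 ]; then
        score=$(( score + stickiness ))
    fi
    printf '%s\n' "$score"
}

# Node A, resource running there, no failures:
#   node_score 250 300 1 0 -100 0
# Node A after a failed start:
#   node_score 250 300 0 0 -100 1
```

Under this model no amount of stickiness tuning helps once a start has failed on a node, because the -1000000 overrides the sum.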
When RES_X is down on node A, it fails over to node B, and that's
great.
But when I fix the failure on node A and kill RES_X on node B, I need
the IP address and the RES_X start (that is, start is called but
reload is executed) to happen on node A.
This does not happen.
I ran ptest and I see that the scores are set to -1000000 on both
nodes.
I tried resetting the fail count, but it still does not work. The
scores are not reset.
How can I make the failback to Node A happen?
Ptest output is below:
ptest[10530]: 2007/09/26_07:03:37 debug: unpack_config: STONITH of
failed nodes is disabled
ptest[10530]: 2007/09/26_07:03:37 debug: unpack_config: Cluster is
symmetric - resources can run anywhere by default
ptest[10530]: 2007/09/26_07:03:37 debug: unpack_config: On loss of CCM
Quorum: Stop ALL resources
ptest[10530]: 2007/09/26_07:03:37 info: determine_online_status: Node
roopa2 is online
ptest[10530]: 2007/09/26_07:03:37 WARN: unpack_rsc_op: Processing failed
op (qip-named_2_start_0) on roopa2
ptest[10530]: 2007/09/26_07:03:37 WARN: unpack_rsc_op: Handling failed
start for qip-named_2 on roopa2
ptest[10530]: 2007/09/26_07:03:37 info: determine_online_status: Node
roopa1 is online
ptest[10530]: 2007/09/26_07:03:37 WARN: unpack_rsc_op: Processing failed
op (qip-named_2_start_0) on roopa1
ptest[10530]: 2007/09/26_07:03:37 WARN: unpack_rsc_op: Handling failed
start for qip-named_2 on roopa1
ptest[10530]: 2007/09/26_07:03:37 info: group_print: Resource Group:
group_1
ptest[10530]: 2007/09/26_07:03:37 info: native_print: IPaddr_cluster
(heartbeat::ocf:IPaddr): Stopped
ptest[10530]: 2007/09/26_07:03:37 info: native_print: qip-named_2
(heartbeat::ocf:qip-named): Stopped
ptest[10530]: 2007/09/26_07:03:37 info: native_print: qip-named_2
(heartbeat::ocf:qip-named): Stopped
ptest[10530]: 2007/09/26_07:03:37 debug: group_rsc_location: Processing
rsc_location prefered_location_group_1 for group_1
ptest[10530]: 2007/09/26_07:03:37 debug: native_print: Allocating:
IPaddr_cluster (heartbeat::ocf:IPaddr): Stopped
ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: Color
IPaddr_cluster, Node[0] roopa2: -1000000
ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: Color
IPaddr_cluster, Node[1] roopa1: -1000000
ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: All nodes
for resource IPaddr_cluster are unavailable, unclean or shutting down
ptest[10530]: 2007/09/26_07:03:37 WARN: native_color: Resource
IPaddr_cluster cannot run anywhere
ptest[10530]: 2007/09/26_07:03:37 debug: native_print: Allocating:
qip-named_2 (heartbeat::ocf:qip-named): Stopped
ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: Color
qip-named_2, Node[0] roopa2: -1000000
ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: Color
qip-named_2, Node[1] roopa1: -1000000
ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: All nodes
for resource qip-named_2 are unavailable, unclean or shutting down
ptest[10530]: 2007/09/26_07:03:37 WARN: native_color: Resource
qip-named_2 cannot run anywhere
ptest[10530]: 2007/09/26_07:03:37 debug: update_action: Ignoring
implies left - qip-named_2 already stopped
ptest[10530]: 2007/09/26_07:03:37 debug: update_action: * Marking
action group_1_start_0 un-runnable because of IPaddr_cluster_start_0
ptest[10530]: 2007/09/26_07:03:37 debug: init_dotfile: PE_DOT: digraph
"g" {
ptest[10530]: 2007/09/26_07:03:37 debug: main: PE_DOT: }
ptest[10530]: 2007/09/26_07:03:37 info: unpack_graph: Unpacked
transition 0: 0 actions in 0 synapses
ptest[10530]: 2007/09/26_07:03:37 info: set_default_graph_functions:
Setting default graph functions
ptest[10530]: 2007/09/26_07:03:37 debug: run_graph:
====================================================
ptest[10530]: 2007/09/26_07:03:37 info: run_graph: Transition 0:
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0)
~
[root at Roopa1 ~]# crm_failcount -D -U roopa1 -r qip-named_2
crm_failcount[10532]: 2007/09/26_07:04:05 info: Invoked: crm_failcount
-D -U roopa2 -r qip-named_2
[root at Roopa1 ~]# crm_failcount -G -U roopa1 -r qip-named_2
crm_failcount[10533]: 2007/09/26_07:04:07 info: Invoked: crm_failcount
-G -U roopa2 -r qip-named_2
name=fail-count-qip-named_2 value=0
[root at Roopa1 ~]#
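The native_assign_node lines in the ptest output above are the ones that show which nodes have been ruled out. A small filter can pull them out of a long log. This is a sketch that assumes each log entry sits on a single line (the mail client wrapped them here) and that the line format matches the output above; blocked_nodes is a made-up name.

```shell
#!/bin/bash
# Sketch: list "resource node" pairs that ptest scored at -1000000.
# Assumes unwrapped log lines in the format shown in this thread.
# blocked_nodes is a hypothetical helper name.

blocked_nodes() {
    sed -n 's/.*native_assign_node: Color \([^,]*\), Node\[[0-9]*] \([^:]*\): -1000000$/\1 \2/p'
}

# Example, assuming ptest writes its debug messages to standard error:
#   ptest -L -VVVVVV 2>&1 | blocked_nodes
```

Each output line names a resource and a node on which it cannot currently be placed.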
Snippet of the OCF script:
prog=named

# start does not launch the daemon; it reloads it if it is already running.
start() {
    echo -n $"Reloading $prog: "
    if [ -n "`pidofproc $prog`" ]; then
        echo -n $"$prog: running"
        reload
        return $?
    else
        echo -n $"$prog: not running"
        failure
        return 1
    fi
}

# stop deliberately leaves the daemon running and reports success.
stop() {
    echo -n $"stop $prog called, returning without stopping $prog "
    return $OCF_SUCCESS
}

# Map a status exit code of 1 to 7 (OCF "not running").
rndcstatus() {
    $xxx $xxxOPTIONS status >/dev/null 2>&1
    rc=$?
    if [ $rc -eq 1 ]; then
        return 7
    fi
    return $rc
}

restart() {
    stop
    sleep 3
    start
}

reload() {
    echo -n $"Reloading $prog: "
    $xxx $xxxOPTIONS reload >/dev/null 2>&1
    RETVAL=$?
    echo
    return $RETVAL
}
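The return-code mapping in rndcstatus can be written with named OCF exit codes, which makes it easier to see what the cluster is being told. The values 0, 1 and 7 are the standard OCF codes for success, generic error and not running. The function map_status_rc is hypothetical and takes the raw exit code as an argument so it can be tried without the real status command.

```shell
#!/bin/bash
# Sketch: translate a status command's exit code into an OCF exit code.
# map_status_rc is a hypothetical helper, not part of the original script.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# map_status_rc RAW_EXIT_CODE
map_status_rc() {
    case "$1" in
        0) return $OCF_SUCCESS ;;
        # As in the snippet above, 1 from the status command is treated
        # as "daemon stopped" rather than as a generic error.
        1) return $OCF_NOT_RUNNING ;;
        # Any other code is passed through unchanged, as in the original.
        *) return "$1" ;;
    esac
}

# Example:
#   some_status_command; map_status_rc $?
```

Returning 7 instead of 1 matters because the cluster treats "not running" as a clean stopped state, while 1 is reported as a failure.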
Thanks,
Roopa
-------------- next part --------------
A non-text attachment was scrubbed...
Name: input.xml
Type: text/xml
Size: 9322 bytes
Desc: input.xml
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20071004/9a226e10/input-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: failback_issue.zip
Type: application/zip
Size: 88033 bytes
Desc: failback_issue.zip
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20071004/9a226e10/failback_issue-0001.zip