[Linux-HA] Re: failback (Andrew Beekhof)

Sripathi, Roopa (Roopa) rsripathi at alcatel-lucent.com
Thu Oct 4 08:18:51 MDT 2007


Hi,

Attached is the input.xml generated by running the command:

ptest -L -VVVVVV  --save-input input.xml

The only way I can get failback to happen is by running:
crm_resource -C  -r IPaddr_cluster -H roopa1
crm_resource -C  -r RES_X  -H roopa1
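
(For completeness: together with resetting the fail-count, the manual
sequence on the recovered node looks roughly like this -- a sketch based
on the commands elsewhere in this mail; adjust resource and node names
as needed:)

# clear the failed-start history so the node becomes eligible again
crm_resource -C -r RES_X -H roopa1
# reset and then verify the fail-count for RES_X on that node
crm_failcount -D -U roopa1 -r RES_X
crm_failcount -G -U roopa1 -r RES_X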


I need an automatic failback to happen once RES_X is fixed on the
first node (Node A) and RES_X then fails on Node B, where it had
failed over to.

That is, I need the failback to happen without having to run the
crm_resource -C command.

Is this possible?

I changed the default resource stickiness to 0, but the behaviour is
still the same. I am attaching the logs and cib.xml, zipped.
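
(For reference, a sketch of how the default can be changed and
verified with crm_attribute -- assuming the Heartbeat 2.x option
syntax:)

# set the cluster-wide default resource stickiness to 0
crm_attribute -t crm_config -n default-resource-stickiness -v 0
# read it back to confirm what the cluster actually has
crm_attribute -t crm_config -n default-resource-stickiness -G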

Looking for help ASAP,

Thanks,

Roopa Sripathi
 


Message: 5
Date: Mon, 1 Oct 2007 14:26:45 +0200
From: "Andrew Beekhof" <beekhof at gmail.com>
Subject: Re: [Linux-HA] failback
To: "General Linux-HA mailing list" <linux-ha at lists.linux-ha.org>
Message-ID:
	<26ef5e70710010526h2f1f3c5eg4a53ff42bc1830c3 at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

this is why they're not being started anywhere

ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: All nodes
for resource IPaddr_cluster are unavailable, unclean or shutting down

the next question is why they are unavailable... which is alas
impossible to know without the current cluster status.

try adding:
   --save-input input.xml
to the ptest command and attaching input.xml here
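
i.e. the full command would then be something like:

   ptest -L -VVVVVV --save-input input.xml

(iirc the saved file can also be replayed offline later with something
like "ptest -x input.xml -VVVVVV" -- i'm not 100% sure of the exact
option on 2.1.2, check the ptest usage output)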

Hi,

I have the following problem.

I am using Heartbeat 2.1.2.

I have an IP address resource and another resource, RES_X, and two
nodes, Node A and Node B.

In my active/passive configuration, RES_X will be up all the time on
both servers. A failover (or a Heartbeat startup) only needs to run a
reload command on this resource, so in the OCF RA script for RES_X the
start action issues a reload of the resource, and the stop action does
nothing but return success.

My default failure stickiness is -100 and my default resource
stickiness is 300. RES_X has a location score of 250, with Node A as
the preferred location.

When RES_X goes down on Node A, it fails over to Node B, and that's
great.

But when I fix the failure on Node A and kill RES_X on Node B, I need
the IP address and RES_X to start on Node A (that is, start is called,
which executes the reload). This does not happen.

I ran ptest and I see that the scores are set to -1000000 on both
nodes. I tried resetting the failcount, but that does not help; the
scores are not reset.

How can I make the failback to Node A happen?

The ptest output is below:

ptest[10530]: 2007/09/26_07:03:37 debug: unpack_config: STONITH of
failed nodes is disabled

ptest[10530]: 2007/09/26_07:03:37 debug: unpack_config: Cluster is
symmetric - resources can run anywhere by default

ptest[10530]: 2007/09/26_07:03:37 debug: unpack_config: On loss of CCM
Quorum: Stop ALL resources

ptest[10530]: 2007/09/26_07:03:37 info: determine_online_status: Node
roopa2 is online

ptest[10530]: 2007/09/26_07:03:37 WARN: unpack_rsc_op: Processing failed
op (qip-named_2_start_0) on roopa2

ptest[10530]: 2007/09/26_07:03:37 WARN: unpack_rsc_op: Handling failed
start for qip-named_2 on roopa2

ptest[10530]: 2007/09/26_07:03:37 info: determine_online_status: Node
roopa1 is online

ptest[10530]: 2007/09/26_07:03:37 WARN: unpack_rsc_op: Processing failed
op (qip-named_2_start_0) on roopa1

ptest[10530]: 2007/09/26_07:03:37 WARN: unpack_rsc_op: Handling failed
start for qip-named_2 on roopa1

ptest[10530]: 2007/09/26_07:03:37 info: group_print: Resource Group:
group_1

ptest[10530]: 2007/09/26_07:03:37 info: native_print:     IPaddr_cluster
(heartbeat::ocf:IPaddr):        Stopped

ptest[10530]: 2007/09/26_07:03:37 info: native_print:     qip-named_2
(heartbeat::ocf:qip-named):     Stopped

ptest[10530]: 2007/09/26_07:03:37 info: native_print:     qip-named_2
(heartbeat::ocf:qip-named):     Stopped

ptest[10530]: 2007/09/26_07:03:37 debug: group_rsc_location: Processing
rsc_location prefered_location_group_1 for group_1

ptest[10530]: 2007/09/26_07:03:37 debug: native_print: Allocating:
IPaddr_cluster       (heartbeat::ocf:IPaddr):        Stopped

ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: Color
IPaddr_cluster, Node[0] roopa2: -1000000

ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: Color
IPaddr_cluster, Node[1] roopa1: -1000000

ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: All nodes
for resource IPaddr_cluster are unavailable, unclean or shutting down

ptest[10530]: 2007/09/26_07:03:37 WARN: native_color: Resource
IPaddr_cluster cannot run anywhere

ptest[10530]: 2007/09/26_07:03:37 debug: native_print: Allocating:
qip-named_2  (heartbeat::ocf:qip-named):     Stopped

ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: Color
qip-named_2, Node[0] roopa2: -1000000

ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: Color
qip-named_2, Node[1] roopa1: -1000000

ptest[10530]: 2007/09/26_07:03:37 debug: native_assign_node: All nodes
for resource qip-named_2 are unavailable, unclean or shutting down

ptest[10530]: 2007/09/26_07:03:37 WARN: native_color: Resource
qip-named_2 cannot run anywhere

ptest[10530]: 2007/09/26_07:03:37 debug: update_action:       Ignoring
implies left - qip-named_2 already stopped

ptest[10530]: 2007/09/26_07:03:37 debug: update_action:    * Marking
action group_1_start_0 un-runnable because of IPaddr_cluster_start_0

ptest[10530]: 2007/09/26_07:03:37 debug: init_dotfile: PE_DOT:  digraph
"g" {

ptest[10530]: 2007/09/26_07:03:37 debug: main: PE_DOT: }

ptest[10530]: 2007/09/26_07:03:37 info: unpack_graph: Unpacked
transition 0: 0 actions in 0 synapses

ptest[10530]: 2007/09/26_07:03:37 info: set_default_graph_functions:
Setting default graph functions

ptest[10530]: 2007/09/26_07:03:37 debug: run_graph:
====================================================

ptest[10530]: 2007/09/26_07:03:37 info: run_graph: Transition 0:
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0)


[root at Roopa1 ~]# crm_failcount -D -U roopa1 -r qip-named_2

crm_failcount[10532]: 2007/09/26_07:04:05 info: Invoked: crm_failcount
-D -U roopa2 -r qip-named_2

[root at Roopa1 ~]# crm_failcount -G -U roopa1 -r qip-named_2

crm_failcount[10533]: 2007/09/26_07:04:07 info: Invoked: crm_failcount
-G -U roopa2 -r qip-named_2

 name=fail-count-qip-named_2 value=0

[root at Roopa1 ~]#
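
(To double-check that no fail-count entries are left in the CIB status
section, something like this should work too -- a sketch using
cibadmin, assuming the Heartbeat 2.x syntax:)

# dump the status section and look for any remaining fail-count entries
cibadmin -Q -o status | grep fail-count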

 

 

 

Snippet of the OCF RA script:

prog=named

# start: if named is already running, just reload it; otherwise report
# failure (1 = OCF_ERR_GENERIC). pidofproc and failure are presumably
# sourced from the distro's init-script function library.
start() {
            echo -n $"Reloading $prog: "
            if [ -n "`pidofproc $prog`" ]; then
                        echo -n $"$prog: running"
                        reload
                        return $?
            else
                        echo -n $"$prog: not running"
                        failure
                        return 1
            fi
}

# stop: intentionally a no-op -- named is left running and success is
# always returned.
stop() {
            echo -n $"stop $prog called, returning without stopping $prog "
            return $OCF_SUCCESS
}

# status check via rndc; map a plain failure (1) to 7 (OCF_NOT_RUNNING).
rndcstatus() {
            $xxx $xxxOPTIONS status >/dev/null 2>&1
            rc=$?
            if [ $rc -eq 1 ]; then
                        return 7
            fi
            return $rc
}

restart() {
            stop
            sleep 3
            start
}

# reload: ask named to re-read its configuration.
reload() {
            echo -n $"Reloading $prog: "
            $xxx $xxxOPTIONS reload >/dev/null 2>&1
            RETVAL=$?
            echo
            return $RETVAL
}
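
(Not shown above: the RA also has the usual OCF action dispatch at the
bottom. A minimal sketch of what that looks like -- not the exact code
from my script, and the mapping of monitor/status onto rndcstatus is
an assumption:)

# dispatch the action requested by the cluster to the functions above
case "$1" in
            start)          start;      exit $? ;;
            stop)           stop;       exit $? ;;
            monitor|status) rndcstatus; exit $? ;;
            restart)        restart;    exit $? ;;
            reload)         reload;     exit $? ;;
            *)              echo "usage: $0 {start|stop|monitor|restart|reload}"
                            exit 3 ;;  # 3 = OCF_ERR_UNIMPLEMENTED
esac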

 

 

Thanks,

 

Roopa

 

 


-------------- next part --------------
A non-text attachment was scrubbed...
Name: input.xml
Type: text/xml
Size: 9322 bytes
Desc: input.xml
URL: <http://lists.linux-ha.org/pipermail/linux-ha/attachments/20071004/9a226e10/attachment.xml>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: failback_issue.zip
Type: application/zip
Size: 88033 bytes
Desc: failback_issue.zip
URL: <http://lists.linux-ha.org/pipermail/linux-ha/attachments/20071004/9a226e10/attachment.zip>

