[Linux-HA] failover test and behavior

Dejan Muhamedagic dejanmm at fastmail.fm
Thu Sep 6 13:34:50 MDT 2007


Hi,

On Thu, Sep 06, 2007 at 06:47:16PM +0200, FG wrote:
> Hi,
> 
> I use heartbeat 2.1.1 in an active/passive configuration.
> 
> I am testing different failover scenarios and need some explanations:
> 
> My nodes are castor (active) and pollux (standby).
> 
> I'm testing process failover with monitoring. My configuration uses
> default_stickiness = "200" and default_failure_stickiness = "-200", plus an
> rsc_location constraint on castor with a score of "200".
> With these options, I can have 5 process failures before all the services
> fail over from castor to pollux.
> 
> It works like a charm... :-)
> 
> The score on castor decreases from 1000 (4 resources x 200 +
> constraint score 200) to 0, and with the sixth failure, failover occurs.
> The scores after failover are: castor (-1000) and pollux (800).
> [root at castor crm]# ptest -L -VVVVVVVVVVVVVVVVVVVVV 2>&1|grep assign
> ptest[31985]: 2007/09/06_15:57:25 debug: debug5: do_calculations: assign
> nodes to colors
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> IPaddr_147_210_36_7, Node[0] pollux: 800
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> IPaddr_147_210_36_7, Node[1] castor: -1000
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
> pollux to IPaddr_147_210_36_7
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> Filesystem_2, Node[0] pollux: 1000000
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> Filesystem_2, Node[1] castor: -1000000
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
> pollux to Filesystem_2
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> cyrus-imapd_3, Node[0] pollux: 1000000
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> cyrus-imapd_3, Node[1] castor: -1000000
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
> pollux to cyrus-imapd_3
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> saslauthd_4, Node[0] pollux: 1000000
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> saslauthd_4, Node[1] castor: -1000000
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
> pollux to saslauthd_4
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> pingd-child:0, Node[0] castor: 1
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> pingd-child:0, Node[1] pollux: 0
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
> castor to pingd-child:0
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> pingd-child:1, Node[0] pollux: 1
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
> pingd-child:1, Node[1] castor: -1000000
> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
> pollux to pingd-child:1
> 
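
Just as a side note, the configuration you describe would correspond
roughly to CIB entries like the ones below. This is only a sketch: the
ids are made up and I'm guessing at the resource your location constraint
applies to, so compare it with the cibadmin -Q output you attached rather
than taking it literally:

  <crm_config>
    <cluster_property_set id="cib-bootstrap-options">
      <attributes>
        <nvpair id="opt-stickiness"
                name="default_resource_stickiness" value="200"/>
        <nvpair id="opt-failure-stickiness"
                name="default_resource_failure_stickiness" value="-200"/>
      </attributes>
    </cluster_property_set>
  </crm_config>

  <rsc_location id="prefer_castor" rsc="IPaddr_147_210_36_7">
    <rule id="prefer_castor_rule" score="200">
      <expression id="prefer_castor_expr" attribute="#uname"
                  operation="eq" value="castor"/>
    </rule>
  </rsc_location>

With four resources at stickiness 200 plus the 200 from the location
constraint, castor starts at 1000 and every monitor failure subtracts
200, which matches the arithmetic you give above.
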
> Now, to test further, I unplug the network card on pollux. I expected
> a new failover back to the first node (castor), but nothing happens...
> So I look at my scores and my logs:
> 
> [root at castor crm]# ptest -L -VVVVVVVVVVVVVVVVVVVVV 2>&1|grep assign
> ptest[32467]: 2007/09/06_16:17:11 debug: debug5: do_calculations: assign
> nodes to colors
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> IPaddr_147_210_36_7, Node[0] castor: -1000
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> IPaddr_147_210_36_7, Node[1] pollux: -1000000
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: All nodes
> for resource IPaddr_147_210_36_7 are unavailable, unclean or shutting down
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> Filesystem_2, Node[0] castor: -1000000
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> Filesystem_2, Node[1] pollux: -1000000
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: All nodes
> for resource Filesystem_2 are unavailable, unclean or shutting down
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> cyrus-imapd_3, Node[0] castor: -1000000
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> cyrus-imapd_3, Node[1] pollux: -1000000
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: All nodes
> for resource cyrus-imapd_3 are unavailable, unclean or shutting down
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> saslauthd_4, Node[0] castor: -1000000
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> saslauthd_4, Node[1] pollux: -1000000
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: All nodes
> for resource saslauthd_4 are unavailable, unclean or shutting down
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> pingd-child:0, Node[0] castor: 1
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> pingd-child:0, Node[1] pollux: 0
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Assigning
> castor to pingd-child:0
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> pingd-child:1, Node[0] pollux: 1
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
> pingd-child:1, Node[1] castor: -1000000
> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Assigning
> pollux to pingd-child:1
> 
> pengine[20890]: 2007/09/06_16:00:23 WARN: native_color: Resource
> IPaddr_147_210_36_7 cannot run anywhere
> pengine[20890]: 2007/09/06_16:00:23 WARN: native_color: Resource
> Filesystem_2 cannot run anywhere
> pengine[20890]: 2007/09/06_16:00:23 WARN: native_color: Resource
> cyrus-imapd_3 cannot run anywhere
> pengine[20890]: 2007/09/06_16:00:23 WARN: native_color: Resource
> saslauthd_4 cannot run anywhere
> 
> Could someone explain to me what's happening? Is that split-brain?

Yes, it is.

> Because pingd failed, and because of my rule with score="-INFINITY", I
> think the scores on pollux are logical, aren't they? And in the end we
> have the same score for the resources on the two nodes.
> How can I avoid this behavior?
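
The score="-INFINITY" pingd rule you mention is usually something along
these lines (again only a sketch; the rsc id "group_1" and the "pingd"
attribute name are assumptions, adjust them to your configuration):

  <rsc_location id="connected_only" rsc="group_1">
    <rule id="connected_only_rule" score="-INFINITY" boolean_op="or">
      <expression id="pingd_not_defined" attribute="pingd"
                  operation="not_defined"/>
      <expression id="pingd_zero" attribute="pingd"
                  operation="lte" value="0"/>
    </rule>
  </rsc_location>

Internally the CRM represents INFINITY as 1000000, which is why pollux
shows -1000000 for every resource in your second ptest output once its
network is gone.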

The cluster won't try to run resources on a node which has a negative
score, i.e. one on which the resource has failed too many times. That
seems to be your case. Try resetting the failcount and see if that
helps.
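
Something along these lines should do it for each failed resource (the
resource ids are taken from your ptest output; check crm_failcount --help
for the exact options in your version):

  # show the current failcount of cyrus-imapd_3 on castor
  crm_failcount -G -U castor -r cyrus-imapd_3

  # reset it
  crm_failcount -D -U castor -r cyrus-imapd_3

Once the failcounts are cleared, castor's score should be positive again
and the resources should be able to run there.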

Thanks.

Dejan

> I attach my settings (cibadmin -Q in a normal state); could you please
> help me verify them?
> 
> Thanks, regards
> 
> Fabrice
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems


