[Fwd: [Linux-HA] repeatable failovers]
Andreas Kurz
andreas.kurz at gmail.com
Tue Dec 19 16:17:50 MST 2006
Hello!
Some things I found out about the stickiness of resources .... and I
hope most of it is true ;-)
- with a resource-stickiness of INFINITY the resource stays on the
current node, even if another node with a higher score is online. The
exception is when the "start" operation fails, e.g. when restarting
after a failure of the "monitor" operation.
- with a resource-stickiness of -INFINITY the resource _always_ moves
away from the current node as soon as another node comes online with a
non-negative score for that resource (so the default score of 0 is
sufficient)
- if the monitor operation fails for a resource, heartbeat tries to
restart it locally as long as the "start" operation is successful (if
no resource-failure-stickiness is defined)
- if resource-failure-stickiness is defined for a resource, the
fail-counter is increased and the score of the current node for that
resource is decreased by the resource-failure-stickiness. To reset the
fail-counter manually, see http://www.linux-ha.org/v2/faq ...
search for "crm_failcount"
- a negative score for a resource enforces failover to another node
(with a positive score)
- if you really want your resource to fail over to the other node
after every error ... whether this is a good idea or not ... and
without a manual reset of the fail-count, have a look at:
http://www.linux-ha.org/_cache/HeartbeatTutorials__LinuxKongress-2006-tutorial.pdf
... search for "attrd_updater".
You could set your own score_attributes for your nodes depending on
the result of your special script. This should make it possible to
reduce the score for the current node without increasing the
fail-count (resource-failure-stickiness=0).
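
To make the scoring rules above easier to follow, here is a small
Python sketch of how I understand the placement decision. It is only a
toy model, not heartbeat's actual code. Assumptions: INFINITY is
treated as 1000000, -INFINITY wins whenever it is added to anything,
the score of a node is base score + stickiness (only on the current
node) + fail-count * failure-stickiness, a node with a negative score
is not eligible, and ties go to the first node listed. The names
add_scores, node_score and place are made up for this example.

```python
INFINITY = 1_000_000  # assumption: scores are capped at +/- this value


def add_scores(a, b):
    """Add two scores; -INFINITY dominates, then +INFINITY (assumed rule)."""
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    return max(-INFINITY, min(INFINITY, a + b))


def node_score(base, stickiness, failure_stickiness, failcount, is_current):
    """Score of one node for one resource under the simplified model."""
    score = base
    if is_current:
        # Stickiness only counts on the node the resource runs on.
        score = add_scores(score, stickiness)
    for _ in range(failcount):
        # Each recorded failure applies the failure stickiness once.
        score = add_scores(score, failure_stickiness)
    return score


def place(nodes, current, failcounts, stickiness, failure_stickiness):
    """Return the best node with a non-negative score, or None."""
    best, best_score = None, -1
    for name, base in nodes.items():
        score = node_score(base, stickiness, failure_stickiness,
                           failcounts.get(name, 0), name == current)
        if score > best_score:  # ties keep the earlier node
            best, best_score = name, score
    return best


nodes = {"debo": 0, "fico": 0}  # both nodes with default score 0

# stickiness INFINITY, failure stickiness -INFINITY
print(place(nodes, None, {}, INFINITY, -INFINITY))                         # debo
print(place(nodes, "debo", {"debo": 1}, INFINITY, -INFINITY))              # fico
print(place(nodes, "fico", {"debo": 1, "fico": 1}, INFINITY, -INFINITY))   # None

# stickiness larger than one failure penalty: the resource stays put
print(place(nodes, "debo", {"debo": 1}, 200, -100))                        # debo
```

Under these assumptions the third call shows why nothing runs after
the second failure: both fail-counts are non-zero, so both nodes end up
at -INFINITY until a fail-count is reset.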
Regards,
Andi
> Hi
>
> On my 2-node cluster, I have one network service monitored by a custom
> "watcher".
> For the old heartbeat (v1), I used the following configuration:
> - The watcher monitors the process of the network service. If the
> watcher detects any problem with the service, it calls hb_standby on
> the current machine.
> - The auto_failback setting is off. If there is no problem, resources
> shall remain on the same machine till eternity. If there is any
> problem, resources shall move to the other machine.
>
> So the typical usage (when excluding hardware/network/OS problems):
> - resources are on machine A
> - after some time, watcher on A detects problem with service on A,
> calling hb_standby
> - resources are moved to machine B
> - after some time, watcher on B detects problem with service on B,
> calling hb_standby
> - resources are moved to machine A
> - after some time, watcher on A detects problem with service on A,
> calling hb_standby
> - resources are moved to machine B
> ...
>
> I wanted to configure v2 heartbeat in a similar way. Instead of
> calling hb_standby, I wanted to use the monitoring functionality of
> heartbeat.
> My OCF resource agent is called processResource. For the monitor
> operation it behaves in the following way:
> - network service is running correctly - return 0 to heartbeat
> - service was not started or was stopped - return 7
> - service is starting up - return 1
> - service is stopping - return 1
> - watcher detected problem with service - return 1 (in meantime,
> watcher is stopping service)
> The start operation of processResource typically takes a few
> milliseconds; the stop operation takes up to 8 seconds.
>
> To describe the problem, let "A->B" mean the following failover:
> - resources are on machine A (processResource on A returns 0 to
> heartbeat's monitor operation)
> - after some time, watcher on A detects problem with service on A
> - processResource returns 1 to next heartbeat's monitor operation
> - heartbeat shall stop all resources on A
> - heartbeat shall start all resources on B
> - resources are on machine B (processResource on B returns 0 to
> heartbeat's monitor operation)
>
> Let "AxA" mean the following failover:
> - resources are on machine A (processResource on A returns 0 to
> heartbeat's monitor operation)
> - after some time, watcher on A detects problem with service on A
> - processResource returns 1 to next heartbeat's monitor operation
> - heartbeat restarts resource processResource on A
> - resources are on machine A (processResource on A returns 0 to
> heartbeat's monitor operation)
>
>
> So my problem is that I was not able to configure heartbeat to
> produce the following scenario:
> A->B->A->B->A->B ...
>
> According to
> http://www.linux-ha.org/v2/dtd1.0/annotated#default_resource_stickiness
> - I shall set default-resource-stickiness to INFINITY because of
> original auto_failback off. From
> http://www.linux-ha.org/v2/faq/forced_failover I understand that
> resources are moved to another machine immediately, if
> default_resource_failure_stickiness is low enough. Node score is zero,
> so I have default_resource_failure_stickiness -INFINITY.
> The result of this configuration is:
> - start heartbeat on A and B
> - resources are on A
> - A->B
> - after some time, watcher on B detects problem with service on B
> - on B, processResource returns 1 to next heartbeat's monitor operation
> - heartbeat stops all resources on B
> And now, no resources are running on the cluster!
> (see attachment, name of A: debo, name of B: fico)
> In this state crm_verify -VL gives:
> crm_verify[22877]: 2006/12/17_08:34:50 WARN: unpack_rsc_op: Processing
> failed op (x_processResource_monitor_5000) for x_processResource on
> debo
> crm_verify[22877]: 2006/12/17_08:34:50 WARN: unpack_rsc_op: Processing
> failed op (x_processResource_monitor_5000) for x_processResource on
> fico
>
>
> I suspect fail counts were set on both nodes and only a human can now
> start the resources again. (I tried to find out how to "deactivate"
> fail counts, but with no success.) I tried many combinations of
> stickiness and failure stickiness values, but repeatable failovers
> were not possible.
>
> Just a remark: when abs(stickiness) >= abs(failure stickiness), the
> usual behavior was:
> AxAxAxAxA ... or BxBxBxBxB ...
> And that is again useless.
>
> In attachment, there are ha.cf, logs, cibadmin -Ql outputs.
> debo machine: Linux, Debian sarge
> configure options: --with-group-name=haclient
> --with-ccmuser-name=hacluster --sysconfdir=/etc --localstatedir=/var
> --disable-tipc --disable-ldirectord --disable-snmp
> --enable-bundled_ltdl --enable-ltdl-convenience --disable-mgmt
> --disable-quorumd --disable-fatal-warnings --enable-crm-dev CFLAGS='-g
> -O0 -fno-unit-at-a-time'
> fico machine: Linux, Gentoo
> configure options: --with-group-name=cluster
> --with-ccmuser-name=cluster --with-group-id=65 --with-ccmuser-id=65
> --sysconfdir=/etc --localstatedir=/var --disable-tipc
> --disable-ldirectord --disable-snmp --enable-bundled_ltdl
> --enable-ltdl-convenience --disable-mgmt --disable-quorumd
> --disable-fatal-warnings --enable-crm-dev CFLAGS='-g -O0
> -fno-unit-at-a-time'
> Sources of heartbeat were taken from http://hg.linux-ha.org/dev
> changeset 9857.
>
>
> If you have any ideas how to get heartbeat to work in
> "A->B->A->B->A->B way", please let me know. Any help appreciated.
>
> Palo
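
The monitor behaviour described in the quoted message for
processResource can be written down as a small mapping from service
state to return code. This is only a Python sketch of that list, not
the real agent. The return codes 0, 1 and 7 are the standard OCF values
(OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING); the ServiceState enum
and the function monitor_exit_code are names invented for this
example.

```python
from enum import Enum

# Standard OCF resource agent return codes
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


class ServiceState(Enum):
    """Hypothetical states the watcher could report for the service."""
    RUNNING = "running"
    STOPPED = "stopped"      # never started, or cleanly stopped
    STARTING = "starting"
    STOPPING = "stopping"
    FAILED = "failed"        # watcher detected a problem


def monitor_exit_code(state):
    """Return the monitor result for a given service state."""
    if state is ServiceState.RUNNING:
        return OCF_SUCCESS
    if state is ServiceState.STOPPED:
        return OCF_NOT_RUNNING
    # Starting, stopping and failed are all reported as a generic error,
    # as in the list in the quoted message.
    return OCF_ERR_GENERIC


for state in ServiceState:
    print(state.value, monitor_exit_code(state))
```

Note that with this mapping a monitor call that happens while the
service is still starting or stopping is reported as a failure, just
like a real problem detected by the watcher.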