[Linux-HA] repeatable failovers

Pavol Gono palo.gono at gmail.com
Sun Dec 17 11:33:31 MST 2006


Hi

On my 2-node cluster, I have one network service monitored by a custom
"watcher". With the old heartbeat (v1), I used the following configuration:
- The watcher monitors the process of the network service. If the watcher
detects any problem with the service, it calls hb_standby on the current
machine.
- The auto_failback setting is off. If there is no problem, resources shall
remain on the same machine indefinitely. If there is any problem, resources
shall move to the other machine.
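
As an illustration only (a sketch, not the actual watcher), the v1 scheme
above amounts to a polling loop that gives up the resources once the
health check fails. The path in HB_STANDBY, the service_ok callback and
the polling interval are assumptions made for this example; the real path
depends on how heartbeat was installed.

```python
import subprocess
import time

# Assumed install path of hb_standby; it differs between builds.
HB_STANDBY = "/usr/lib/heartbeat/hb_standby"


def watch(service_ok, run=subprocess.call, command=(HB_STANDBY,),
          interval=5.0, sleep=time.sleep):
    """Poll service_ok() until it reports a problem, then run the
    standby command once and return its exit status.

    service_ok is a hypothetical callback returning True while the
    network service is healthy. run and sleep are injectable so the
    loop can be exercised without heartbeat installed.
    """
    while service_ok():
        sleep(interval)
    # The service is unhealthy: hand the resources to the other node.
    return run(list(command))
```

With auto_failback off, the other node keeps the resources until its own
watcher runs the same loop to completion.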

So the typical sequence (excluding hardware/network/OS problems) is:
- resources are on machine A
- after some time, the watcher on A detects a problem with the service on A
and calls hb_standby
- resources are moved to machine B
- after some time, the watcher on B detects a problem with the service on B
and calls hb_standby
- resources are moved to machine A
- after some time, the watcher on A detects a problem with the service on A
and calls hb_standby
- resources are moved to machine B
...

I wanted to configure heartbeat v2 in a similar way. Instead of
calling hb_standby, I wanted to use the monitoring functionality of
heartbeat.
My OCF resource agent is called processResource. For the monitor operation
it behaves as follows:
- network service is running correctly - return 0 to heartbeat
- service was not started or was stopped - return 7
- service is starting up - return 1
- service is stopping - return 1
- watcher detected a problem with the service - return 1 (in the meantime,
the watcher is stopping the service)
The start operation of processResource typically takes a few milliseconds;
the stop operation takes up to 8 seconds.
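
The monitor behavior listed above can be written down as a small mapping.
This is only a sketch of the decision table, not the processResource agent
itself; the ServiceState enum and the function name are hypothetical. The
three numeric values are the standard OCF exit codes 0 (OCF_SUCCESS),
1 (OCF_ERR_GENERIC) and 7 (OCF_NOT_RUNNING).

```python
from enum import Enum

# Standard OCF exit codes used by the monitor action.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


class ServiceState(Enum):
    """Hypothetical states the watcher could report for the service."""
    RUNNING = "running"
    STOPPED = "stopped"      # never started, or cleanly stopped
    STARTING = "starting"
    STOPPING = "stopping"
    FAILED = "failed"        # watcher detected a problem


def monitor_exit_code(state):
    """Map a service state to the monitor return code described above."""
    if state is ServiceState.RUNNING:
        return OCF_SUCCESS
    if state is ServiceState.STOPPED:
        return OCF_NOT_RUNNING
    # Starting, stopping and failed are all reported as a generic error.
    return OCF_ERR_GENERIC
```

Note that with this table a monitor call that happens to arrive while the
service is starting or stopping is indistinguishable from a real failure.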

To describe the problem, let "A->B" denote the following failover:
- resources are on machine A (processResource on A returns 0 to
heartbeat's monitor operation)
- after some time, the watcher on A detects a problem with the service on A
- processResource returns 1 to heartbeat's next monitor operation
- heartbeat shall stop all resources on A
- heartbeat shall start all resources on B
- resources are on machine B (processResource on B returns 0 to
heartbeat's monitor operation)

Let "AxA" denote the following failover:
- resources are on machine A (processResource on A returns 0 to
heartbeat's monitor operation)
- after some time, the watcher on A detects a problem with the service on A
- processResource returns 1 to heartbeat's next monitor operation
- heartbeat restarts the resource processResource on A
- resources are on machine A (processResource on A returns 0 to
heartbeat's monitor operation)


My problem is that I was not able to configure heartbeat to produce the
following scenario:
A->B->A->B->A->B ...

According to http://www.linux-ha.org/v2/dtd1.0/annotated#default_resource_stickiness
I should set default_resource_stickiness to INFINITY, to match the
original auto_failback off. From
http://www.linux-ha.org/v2/faq/forced_failover I understand that
resources are moved to another machine immediately if
default_resource_failure_stickiness is low enough. The node score is zero,
so I set default_resource_failure_stickiness to -INFINITY.
The result of this configuration is:
- start heartbeat on A and B
- resources are on A
- A->B
- after some time, the watcher on B detects a problem with the service on B
- on B, processResource returns 1 to heartbeat's next monitor operation
- heartbeat stops all resources on B
And now no resources are running on the cluster!
(see attachment; name of A: debo, name of B: fico)
In this state, crm_verify -VL gives:
crm_verify[22877]: 2006/12/17_08:34:50 WARN: unpack_rsc_op: Processing failed op (x_processResource_monitor_5000) for x_processResource on debo
crm_verify[22877]: 2006/12/17_08:34:50 WARN: unpack_rsc_op: Processing failed op (x_processResource_monitor_5000) for x_processResource on fico


I suspect the fail counts on both nodes were set, and only a human can now
start the resources again. (I tried to find out how to "deactivate" fail
counts, but without success.) I tried many combinations of stickiness
and failure stickiness values, but repeatable failovers were never
possible.

One remark: when abs(stickiness) >= abs(failure stickiness), the
usual behavior was:
AxAxAxAxA ... or BxBxBxBxB ...
which is again useless for my purpose.
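
The two outcomes described above can be reproduced with a toy scoring
model. This is a sketch under stated assumptions, not heartbeat's actual
policy engine: it assumes a node's score is its base score (zero), plus
the stickiness if the resource currently runs there, plus the fail count
times the failure stickiness; that -INFINITY wins over any other value;
and that a node at -INFINITY may never run the resource. All function
names, and the clear_on_move option that models a hypothetical manual
reset of the fail count, are invented for this example.

```python
INFINITY = float("inf")


def add_scores(a, b):
    """Add two scores; -INFINITY overrides everything (assumed rule)."""
    if a == -INFINITY or b == -INFINITY:
        return -INFINITY
    return a + b


def node_score(node, current, failcounts, stickiness, failure_stickiness):
    """Assumed score: 0 + stickiness (if current) + fails * failure stickiness."""
    score = 0.0
    if node == current:
        score = add_scores(score, stickiness)
    fails = failcounts[node]
    if fails:  # skip 0 * INFINITY, which would be NaN
        score = add_scores(score, fails * failure_stickiness)
    return score


def place(current, failcounts, stickiness, failure_stickiness):
    """Pick the best-scoring node (ties favor the current node), or
    None when every node is at -INFINITY."""
    scores = {n: node_score(n, current, failcounts, stickiness,
                            failure_stickiness) for n in failcounts}
    best = max(scores, key=lambda n: (scores[n], n == current))
    return None if scores[best] == -INFINITY else best


def simulate(failures, stickiness, failure_stickiness,
             nodes=("A", "B"), clear_on_move=False):
    """Return the placement history over a number of monitor failures."""
    failcounts = {n: 0 for n in nodes}
    current = place(None, failcounts, stickiness, failure_stickiness)
    history = [current]
    for _ in range(failures):
        if current is None:
            break  # nothing is running, so nothing can fail
        failcounts[current] += 1
        previous = current
        current = place(previous, failcounts, stickiness, failure_stickiness)
        history.append(current)
        if clear_on_move and current not in (None, previous):
            failcounts[previous] = 0  # hypothetical fail-count reset
    return history
```

In this model, simulate(3, INFINITY, -INFINITY) gives A, B, then nothing
running anywhere, because both fail counts are set. One instance of
abs(stickiness) >= abs(failure stickiness), simulate(3, INFINITY, -100),
stays on A forever (AxAxA). Only with the hypothetical fail-count reset,
simulate(4, INFINITY, -INFINITY, clear_on_move=True), does the model
produce A->B->A->B->A.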

The attachment contains ha.cf, logs, and cibadmin -Ql outputs.
debo machine: Linux, Debian sarge
configure options: --with-group-name=haclient
--with-ccmuser-name=hacluster --sysconfdir=/etc --localstatedir=/var
--disable-tipc --disable-ldirectord --disable-snmp
--enable-bundled_ltdl --enable-ltdl-convenience --disable-mgmt
--disable-quorumd --disable-fatal-warnings --enable-crm-dev CFLAGS='-g
-O0 -fno-unit-at-a-time'
fico machine: Linux, Gentoo
configure options: --with-group-name=cluster
--with-ccmuser-name=cluster --with-group-id=65 --with-ccmuser-id=65
--sysconfdir=/etc --localstatedir=/var --disable-tipc
--disable-ldirectord --disable-snmp --enable-bundled_ltdl
--enable-ltdl-convenience --disable-mgmt --disable-quorumd
--disable-fatal-warnings --enable-crm-dev CFLAGS='-g -O0
-fno-unit-at-a-time'
The heartbeat sources were taken from http://hg.linux-ha.org/dev changeset 9857.


If you have any ideas how to get heartbeat to work in the
"A->B->A->B->A->B" way, please let me know. Any help is appreciated.

Palo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb_208_cfg.tar.bz2
Type: application/x-bzip2
Size: 12524 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20061217/625368b7/hb_208_cfg.tar-0001.bin

