[Linux-HA] Issues with simple failover setup
sanelson at gmail.com
Sun Jan 4 03:04:58 MST 2009
I am running Heartbeat 2.3 on CentOS 5.2. I have 2 nodes - both
apache servers. All I want to achieve is a simple failover:
One of the two nodes runs httpd at any given time; if the running node
experiences a failure - httpd stops, or the machine stops responding
(ie the network has been lost or the machine is down hard) - the
service should fail over to the second node.
I seem to have achieved this when starting with a fresh install. I
have defined two resources:
<primitive class="ocf" id="IPaddr_10_0_0_53" provider="heartbeat" type="IPaddr">
  <operations>
    <op id="IPaddr_10_0_0_53_mon" interval="5s" name="monitor" timeout="5s"/>
  </operations>
  <instance_attributes id="IPaddr_10_0_0_53_inst_attrs">
    <attributes>
      <nvpair id="IPaddr_10_0_0_53_attr_0" name="ip" value="10.0.0.53"/>
    </attributes>
  </instance_attributes>
</primitive>
<primitive class="lsb" id="httpd_2" provider="heartbeat" type="httpd">
  <operations>
    <op id="httpd_2_mon" interval="20s" name="monitor" timeout="10s"/>
  </operations>
</primitive>
As I understand it, the IPaddr primitive has a monitor operation that
fires every 5 seconds and times out after 5 seconds, and it has one
attribute: the IP address itself.
The httpd primitive, type="httpd", really just refers to the
/etc/init.d/httpd script, since it is of class="lsb". It has only a
single operation and no attributes - the operation is a monitor which
fires every 20 seconds and will time out after 10 seconds. For an
init script, the monitor just consists of running the script as
"/etc/init.d/httpd status" and looking for "running" in the response.
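(In fact, I believe the lsb class keys off the exit status of the
"status" action - 0 meaning running, per the LSB spec - rather than
grepping the output; easy enough to check by hand:)

```shell
# Manual version of what the lsb-class monitor does, as I understand it:
# run the status action and look at the exit code, not the text.
/etc/init.d/httpd status
echo "exit status: $?"   # 0 = running, 3 = stopped, per the LSB spec
```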
I've defined one constraint:
<rsc_colocation id="web_same" from="IPaddr_10_0_0_53" to="httpd_2" score="INFINITY"/>
The IP address and the httpd are preferred to run on the same
machine, with INFINITE priority - in other words, they MUST run on the
same machine. This should have the effect of forcing both resources to
migrate together.
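A constraint like this can, I believe, be loaded into the live CIB
with cibadmin (colocation.xml here being a hypothetical file
containing just the rsc_colocation element):

```shell
# Create the constraint in the constraints section of the running CIB.
# colocation.xml is a hypothetical file holding only <rsc_colocation .../>.
cibadmin -C -o constraints -x colocation.xml

# Verify it took:
cibadmin -Q -o constraints
```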
I've modified default-resource-stickiness and
default-resource-failure-stickiness. AIUI, these two options define
how the CRM and the LRM handle failures. The
default-resource-stickiness is the score given to each active
resource on the active node, leading to a default score of 2000 for
the active node and 0 for the inactive node.
When there is a failure, the failure-stickiness score is applied, and
since it's negative, it should lower the score on the failed (active)
node to below 0, triggering a failover.
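In numbers (a sketch, assuming default-resource-stickiness=2000 and
default-resource-failure-stickiness=-6001, which matches the node
score I see later in this test):

```shell
# Hypothetical score arithmetic for the active node after one failure,
# assuming stickiness=2000 and failure-stickiness=-6001.
stickiness=2000
failure_stickiness=-6001
failcount=1

score=$((stickiness + failcount * failure_stickiness))
echo "$score"    # -4001: below 0, so the resource should move away
```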
If the second node fails as well, its score will also go negative,
leaving no nodes capable of running the resources. If a node reboots,
its score should reset to 0, or it can be reset manually by running
"crm_failcount -D -r httpd_2" on the previously-failed node.
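For the record, the failcount can be queried as well as deleted, and
from any node if you name the target node with -U (node2 being one of
my nodes here):

```shell
# Query the current failcount for httpd_2 on node2:
crm_failcount -G -U node2 -r httpd_2

# Reset it (equivalent to "crm_failcount -D -r httpd_2" run on node2 itself):
crm_failcount -D -U node2 -r httpd_2
```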
So far so good. Do please correct my understanding if I've gone wrong.
Live test below:
Ok - so: taking my cluster, erasing the cib with cibadmin -E, and
rebooting both nodes. I've not got httpd starting by default on
either machine, so when they come up I will start httpd on one
machine by hand. Interestingly, the result of cibadmin -E seems to
have been that cibadmin -Q now times out, so I've hacked around a
bit: deleting /var/lib/heartbeat/crm/cib.xml and trying to load a new
one, making its admin_epoch bigger than the one that seemed to be
there (though where that came from, I know not).
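Roughly what I ended up doing, stop-to-start (a sketch -
/root/cib.xml is a hypothetical location for the new CIB; as I
understand it, the .sig signature file must be removed too, or the
CRM will reject a hand-edited cib.xml):

```shell
# On each node, with heartbeat stopped:
/etc/init.d/heartbeat stop
rm -f /var/lib/heartbeat/crm/cib.xml /var/lib/heartbeat/crm/cib.xml.sig
# (heartbeat may also keep cib.xml.last / cib.xml.sig.last backups)

# Drop in the new cib.xml (with a higher admin_epoch), fix ownership, restart:
cp /root/cib.xml /var/lib/heartbeat/crm/cib.xml
chown hacluster:haclient /var/lib/heartbeat/crm/cib.xml
/etc/init.d/heartbeat start
```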
$ crm_resource -W -r httpd_2
seems to show that httpd_2 is running on node2, and I can confirm
this. I don't know how this happened, as I didn't start apache, but
it has happened...
So - if I shut down httpd on node 2, it should fail over, and it does.
So, now apache is running on node 1, and node 2 should have a score of
-6001 as it failed. This is reflected in the failcount on node 2.
I shouldn't be able to move the resource back to node2 - it still has
a failure count > 0.
However, it seems I can - using crm_resource -M -r httpd_2 -H node2
Ok - resetting the failcount to 0. The cluster should be in the same
state it was before - let's try to kill apache.
This time, apache seems to have restarted on node 2, and there was no
failover. I don't understand this. The failcount has gone back up to
1, but the resource hasn't moved.
Let's try to kill it again. Same again - it gets restarted on node 2.
The failure count hasn't gone to 2. Killing it one more time gives
the same behaviour. Oh well... let's try to move the resource to node 1.
Fine - that works with crm_resource, and now the cluster claims apache
is on node 1. I concur.
Let's reset failure count for good measure.
Now let's try killing apache on node1. Once again, apache gets
restarted on node1, but there's no failover.
So - what's going on - what have I got wrong? Also, could someone
please tell me the canonical way to reset the cluster and import a
new CIB?