[Linux-HA] ocf:heartbeat:apache resource agent and timeouts
lars.ellenberg at linbit.com
Thu Apr 5 10:53:10 MDT 2012
On Tue, Apr 03, 2012 at 01:53:41PM +0200, David Gubler wrote:
> Hi list,
> I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10
> (Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue
> with the apache resource agent.
> But first things first, my test setup:
> root@node0:~# crm configure show
> node $id="5a46c3c9-1f1e-45ad-9eb4-ebf216734d97" node1
> node $id="9270b333-9056-4560-8ca2-9f878b1f8966" node0
> primitive apache ocf:heartbeat:apache \
>     params testconffile="/etc/ha.d/doodletest.pm" testname="doodle" \
>     op monitor interval="30" timeout="120" \
>     meta is-managed="false"
> primitive site0ip ocf:heartbeat:IPaddr \
>     params ip="192.168.88.90" cidr_netmask="255.255.255.0" nic="eth0"
> primitive site1ip ocf:heartbeat:IPaddr \
>     params ip="192.168.88.91" cidr_netmask="255.255.255.0" nic="eth0"
> clone apacheClone apache
> colocation bothips -100: site0ip site1ip
> colocation site0 inf: site0ip apacheClone
> colocation site1 inf: site1ip apacheClone
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>     cluster-infrastructure="Heartbeat" \
>     no-quorum-policy="ignore" \
>     stonith-enabled="false" \
>     last-lrm-refresh="1333391544" \
> One of the tests I did was to simulate a messed-up apache (e.g. connection
> limit reached):
> $ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP
Uhm, "invalid test case".
iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT
iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT --reject-with tcp-reset
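The three rules differ in what the HTTP client sees, which is what decides how long a monitor run takes. A minimal sketch that only prints each rule (applying one needs root); `port80_rule` is a hypothetical helper name, not part of iptables or the resource agent:

```shell
# Hypothetical dry-run helper: prints the iptables rule for a given
# failure mode instead of applying it.
port80_rule() {
    base="iptables -I INPUT -p tcp --dport 80 -i lo"
    case $1 in
        # Packets are discarded silently; the client waits for its own timeout.
        drop)   echo "$base -j DROP" ;;
        # The client gets an ICMP port-unreachable error right away.
        reject) echo "$base -j REJECT" ;;
        # The client gets a TCP RST, as it would from a closed port.
        reset)  echo "$base -j REJECT --reject-with tcp-reset" ;;
        *)      echo "usage: port80_rule drop|reject|reset" >&2; return 2 ;;
    esac
}
```

Running `port80_rule reset` prints the second rule above; pipe the output to `sh` as root to actually insert it.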
> Of course, this should produce a monitor timeout, which should mark the
> apache as failed, and that's what happened.
> However, recovery didn't work after I did
> $ iptables -F
> The problem, according to what I could figure out:
> The apache resource agent
> does not have a timeout set for curl/wget. Curl has a default timeout of
> about 3 minutes, wget may even retry up to 20 times and thus may
> potentially take ages to time out.
> Thus, the monitor operation did time out instead of wget (thus,
> pacemaker thinks that the monitor itself has failed instead of the
> service it is monitoring, which is semantically just plain wrong, IMHO).
But it does not make a difference, practically.
The monitor timed out. That's a fact.
So why not have it show up in the logs?
Pacemaker behaviour is just the same,
whether a monitor action "timed out", or "failed".
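To illustrate the point, here is a sketch of a probe wrapper in which a timeout and an ordinary failure end up as the same return code and differ only in the log message. It assumes GNU coreutils `timeout`, which exits with 124 when the time limit is hit. `bounded_probe` is a hypothetical helper, not code from the apache resource agent; the two constants mirror the standard OCF return codes.

```shell
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical helper: run a probe command with a hard time limit.
# Usage: bounded_probe SECONDS COMMAND [ARGS...]
bounded_probe() {
    limit_s=$1
    shift
    timeout "$limit_s" "$@"
    rc=$?
    case $rc in
        0)
            return $OCF_SUCCESS ;;
        124)
            # GNU timeout reports an expired time limit as exit code 124.
            echo "probe timed out after ${limit_s}s" >&2
            return $OCF_ERR_GENERIC ;;
        *)
            echo "probe failed with exit code $rc" >&2
            return $OCF_ERR_GENERIC ;;
    esac
}
```

Both `bounded_probe 1 sleep 5` and `bounded_probe 5 false` return 1; only the message on stderr says which of the two happened.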
> Since the resource agent let the (still waiting) wget process hang
> around practically forever, it also didn't notice when apache had
> recovered (after iptables -F).
After the monitor action timed out or failed,
the recovery action by pacemaker would be to stop the service,
and restart it (there or elsewhere).
Did that not happen?
The start operation of the apache RA internally does monitor as well,
so it likely times out as well.
I'd expect the cluster to move the unresponsive apache to some other
node, after monitor and restart timed out. Which I think is the right
thing to do.
> Bottom line:
> I think the apache resource agent badly needs a timeout parameter which
> is supplied to wget/curl and the documentation should make clear that
> the current monitor timeout provided by pacemaker is not a substitute
> for that (it cannot really be used to detect non-responsive web
Why not? I think it can.
Again: a timeout is a timeout, regardless of the level at which it occurs.
If you want shorter timeouts,
configure shorter timeouts on the monitor action.
But I'm not opposed to adding "--connection-timeout=..."
and equivalent to the command line of the test clients.
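A rough sketch of what such a client-side limit could look like, not the resource agent's actual code. The option names used are the ones curl and wget really provide: `--connect-timeout` and `--max-time` for curl, `--timeout` and `--tries` for wget. `http_probe` and its default of 10 seconds are assumptions made for illustration.

```shell
# Hypothetical helper: fetch a URL with an explicit client-side time limit.
# Usage: http_probe URL [SECONDS]
http_probe() {
    url=$1
    limit_s=${2:-10}    # assumed default, pick something below the monitor timeout
    if command -v curl >/dev/null 2>&1; then
        # --connect-timeout bounds the connection phase,
        # --max-time bounds the whole transfer.
        curl --silent --fail --output /dev/null \
             --connect-timeout "$limit_s" --max-time "$limit_s" "$url"
    elif command -v wget >/dev/null 2>&1; then
        # --timeout bounds DNS, connect and read; --tries=1 disables retries.
        wget --quiet --output-document=/dev/null \
             --timeout="$limit_s" --tries=1 "$url"
    else
        echo "neither curl nor wget found" >&2
        return 1
    fi
}
```

For example, `http_probe http://localhost/server-status 20` (the URL is a placeholder) gives up after about 20 seconds, so with the monitor timeout of 120 from the configuration above the client would fail before the monitor operation itself times out.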
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com