[Linux-HA] debugging resource configuration
Greg Woods
woods at ucar.edu
Fri Oct 29 10:32:07 MDT 2010
On Thu, 2010-10-28 at 18:38 -0600, Eric Schoeller wrote:
> Just a shot in the dark here kind of ... but I know that when I had this
> type of problem with a stonith device it was timeout related. You could
> try boosting your timeouts all around, or even check what
>
> # time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start
>
> reports back.
[root at vmx1 log]# time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start
real 0m0.261s
user 0m0.188s
sys 0m0.068s
(after which it is working fine; I can connect to the virtual service
and get properly redirected to the real server).
I am convinced now that either there is a bug in the resource agent so
that the monitor process just doesn't work right, there is something
obvious and stupid that I just don't see in my configuration that is
wrong, or else the ldirectord script that I have (which came from the
CentOS heartbeat-ldirectord package) is incompatible with what the
resource agent is expecting.
> If timeouts aren't it, I would start breaking out parts of the cluster
> config and trying it again until it works
I still haven't been able to make it work, but I have eliminated a
number of variables. I got rid of all the IPAddr resources, order
statements, and colocation statements. All that is there now that is
relevant to ldirectord is:
primitive ldirectord ocf:heartbeat:ldirectord \
op start interval="20s" timeout="15s" \
op stop interval="20s" timeout="15s" \
op monitor interval="20s" timeout="20s"
(I have actually tried different interval and timeout numbers but the
result is always the same).
That's it. Then I configured the eth1:0 interface manually to correspond
with the IP address of the virtual server configured in ldirectord.cf,
and ran "crm resource ldirectord start". The result is the same start,
stop, FAILED scenario repeated. The logs appear to show that it is
running the status check every 2 seconds or so, despite my interval and
timeout settings:
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19214] Starting Linux Director
v1.186-ha-2.1.3 as daemon
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Added virtual server:
128.117.64.127:25
[...]
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Quiescent real server:
128.117.64.123:25 (128.117.64.127:25
) (Weight set to 0)
[...]
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Restored real server:
128.117.64.123:25 (128.117.64.127:25)
(Weight set to 1)
(there are similar pairs of entries for all the declared real servers)
So far so good, now comes the problem:
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Invoking ldirectord
invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] ldirectord
for /etc/ha.d/ldirectord.cf is running with pid: 19216
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Exiting from ldirectord
status
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Invoking ldirectord
invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] ldirectord
for /etc/ha.d/ldirectord.cf is running with pid: 19216
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Exiting from ldirectord
status
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Invoking ldirectord
invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] ldirectord
for /etc/ha.d/ldirectord.cf is running with pid: 19216
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Exiting from ldirectord
status
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19416] Invoking ldirectord
invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf stop
The status check should have succeeded, but the monitor process thinks
it failed. Also as can be seen, the status check is repeated only 2
seconds later. The corresponding log for lrmd shows:
Oct 29 10:11:05 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5526:
start
Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: RA output:
(ldirectord:start:stdout) /usr/sbin/ldirector
d /etc/ha.d/ldirectord.cf start
Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: Managed
ldirectord:start process 19203 exited with retur
n code 0.
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5527:
start
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906:
operation start[5527] on ocf::ldirector
d::ldirectord for client 4845, its parameters: CRM_meta_interval=[20000]
CRM_meta_timeout=[15000] crm_feature_set=[3.0.1] CRM_meta_name=[start]
for rsc is already running.
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2916:
postponing all ops on resource ldirectord by 1000 ms
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906:
operation start[5527] on ocf::ldirectord::ldirectord for client 4845,
its parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[15000]
crm_feature_set=[3.0.1] CRM_meta_name=[start] for rsc is already
running.
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2910:
operations on resource ldirectord already delayed
Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: Managed
ldirectord:start process 19401 exited with return code 0.
Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5528:
stop
Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: Managed
ldirectord:stop process 19406 exited with return code 0.
Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5529:
monitor
Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: WARN: Managed
ldirectord:monitor process 19418 exited with return code 7.
Oct 29 10:11:10 vmx1.ucar.edu lrmd: [4842]: info: cancel_op: operation
monitor[5529] on ocf::ldirectord::ldirectord for client 4845, its
parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[20000]
crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
Oct 29 10:11:10 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5530:
stop
Oct 29 10:11:10 vmx1.ucar.edu lrmd: [4842]: info: Managed
ldirectord:stop process 19432 exited with return code 0.
Can someone more expert than I interpret this for me? Why does it say
that it is already running? I checked the obvious and there was
no /var/run/ldirectord.ldirectord.cf.pid file prior to all this (there
is now that I have started it manually).
Thanks,
--Greg
More information about the Linux-HA
mailing list