AW: [Linux-HA] Apache failover / renaming the binary

Andrew Beekhof beekhof at gmail.com
Thu Jul 3 03:13:53 MDT 2008


On Thu, Jul 3, 2008 at 11:04, Ehlers, Kolja <ehlers at clinresearch.com> wrote:
> the patch fixed it, but now a new problem occurs. Here is what I did:
>
> 1. renamed httpd on www1test
> 2. started heartbeat on both nodes (now heartbeat successfully fails apache over to www2test)
>
> ============
> Last updated: Thu Jul  3 10:58:24 2008
> Current DC: www2test (5e0f97b7-6780-4487-baf9-6c36500b1276)
> 2 Nodes configured.
> 1 Resources configured.
> ============
>
> Node: www2test (5e0f97b7-6780-4487-baf9-6c36500b1276): online
> Node: www1test (3a325e23-2184-46ed-9e88-42a11f28c2be): online
>
> Resource Group: group_1
>    IPaddr_192_168_11_25        (ocf::heartbeat:IPaddr):        Started www2test
>
>    apache_2    (ocf::heartbeat:apache):        Started www2test
>
> Failed actions:
>    apache_2_start_0 (node=www1test, call=6, rc=5): complete
>
> 3. I renamed httpd- back to httpd on www1test
> 4. rebooted www2test and now apache is not starting on www1test

because it's not allowed to.
you fixed the problem (by renaming the binary again) but the cluster
isn't psychic... you need to tell it that it's ok to run apache there
again.
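
For context, the rc values in the quoted output are OCF return codes. Here is a minimal sketch of how the two codes seen in this thread map to where the resource may run. The function name explain_start_rc is made up for illustration, and the descriptions only paraphrase the crm_verify messages quoted in this thread:

```shell
#!/bin/sh
# Hypothetical helper (not part of heartbeat): describe what the return
# code of a failed "start" means for placement, based on the crm_verify
# messages quoted in this thread.
explain_start_rc() {
    case "$1" in
        0) echo "success" ;;
        # rc=5: "Preventing apache_2 from re-starting on www1test"
        5) echo "OCF_ERR_INSTALLED: resource is banned from this node only" ;;
        # rc=6: "Preventing apache_2 from re-starting anywhere in the cluster"
        6) echo "OCF_ERR_CONFIGURED: resource is banned from the whole cluster" ;;
        *) echo "other failure: handling depends on cluster settings" ;;
    esac
}

explain_start_rc 5
explain_start_rc 6
```

With rc=5 the ban sticks to www1test until it is cleared, which is why putting the binary back is not enough on its own.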

read up on:
  crm_resource -C
and
  crm_failcount

> - IPaddr is
>
> ============
> Last updated: Thu Jul  3 11:00:57 2008
> Current DC: www1test (3a325e23-2184-46ed-9e88-42a11f28c2be)
> 2 Nodes configured.
> 1 Resources configured.
> ============
>
> Node: www2test (5e0f97b7-6780-4487-baf9-6c36500b1276): OFFLINE
> Node: www1test (3a325e23-2184-46ed-9e88-42a11f28c2be): online
>
> Resource Group: group_1
>    IPaddr_192_168_11_25        (ocf::heartbeat:IPaddr):        Started www1test
>    apache_2    (ocf::heartbeat:apache):        Stopped
>
> Failed actions:
>    apache_2_start_0 (node=www1test, call=6, rc=5): complete
>
> www1test:/ # crm_verify -VVVVL
> crm_verify[19271]: 2008/07/03_11:02:00 info: main: =#=#=#=#= Getting XML =#=#=#=#=
> crm_verify[19271]: 2008/07/03_11:02:00 info: main: Reading XML from: live cluster
> crm_verify[19271]: 2008/07/03_11:02:00 notice: main: Required feature set: 2.0
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value 'false' for cluster option 'stonith-enabled'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value 'reboot' for cluster option 'stonith-action'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value '0' for cluster option 'default-resource-failure-stickiness'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value '60s' for cluster option 'cluster-delay'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value '30' for cluster option 'batch-limit'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value '20s' for cluster option 'default-action-timeout'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value 'true' for cluster option 'stop-orphan-resources'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value 'true' for cluster option 'stop-orphan-actions'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value 'false' for cluster option 'remove-after-stop'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value '-1' for cluster option 'pe-error-series-max'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value '-1' for cluster option 'pe-warn-series-max'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value '-1' for cluster option 'pe-input-series-max'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value 'true' for cluster option 'startup-fencing'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cluster_option: Using default value 'true' for cluster option 'start-failure-is-fatal'
> crm_verify[19271]: 2008/07/03_11:02:00 debug: unpack_config: Default action timeout: 20s
> crm_verify[19271]: 2008/07/03_11:02:00 debug: unpack_config: Default stickiness: 1000000
> crm_verify[19271]: 2008/07/03_11:02:00 debug: unpack_config: Default failure stickiness: 0
> crm_verify[19271]: 2008/07/03_11:02:00 debug: unpack_config: STONITH of failed nodes is disabled
> crm_verify[19271]: 2008/07/03_11:02:00 debug: unpack_config: Cluster is symmetric - resources can run anywhere by default
> crm_verify[19271]: 2008/07/03_11:02:00 debug: unpack_config: On loss of CCM Quorum: Stop ALL resources
> crm_verify[19271]: 2008/07/03_11:02:00 info: determine_online_status: Node www1test is online
> crm_verify[19271]: 2008/07/03_11:02:00 debug: common_apply_stickiness: fail-count-apache_2: INFINITY
> crm_verify[19271]: 2008/07/03_11:02:00 ERROR: unpack_rsc_op: Hard error: apache_2_start_0 failed with rc=5.
> crm_verify[19271]: 2008/07/03_11:02:00 ERROR: unpack_rsc_op:   Preventing apache_2 from re-starting on www1test
> crm_verify[19271]: 2008/07/03_11:02:00 WARN: unpack_rsc_op: Processing failed op apache_2_start_0 on www1test: Error
> crm_verify[19271]: 2008/07/03_11:02:00 WARN: unpack_rsc_op: Compatability handling for failed op apache_2_start_0 on www1test
> crm_verify[19271]: 2008/07/03_11:02:00 notice: group_print: Resource Group: group_1
> crm_verify[19271]: 2008/07/03_11:02:00 notice: native_print:     IPaddr_192_168_11_25   (ocf::heartbeat:IPaddr):        Started www1test
> crm_verify[19271]: 2008/07/03_11:02:00 notice: native_print:     apache_2       (ocf::heartbeat:apache):        Stopped
> crm_verify[19271]: 2008/07/03_11:02:00 debug: group_rsc_location: Processing rsc_location pref_run_apache_group for group_1
> crm_verify[19271]: 2008/07/03_11:02:00 debug: native_merge_weights: IPaddr_192_168_11_25: Rolling back scores from apache_2
> crm_verify[19271]: 2008/07/03_11:02:00 debug: native_assign_node: Assigning www1test to IPaddr_192_168_11_25
> crm_verify[19271]: 2008/07/03_11:02:00 debug: native_assign_node: All nodes for resource apache_2 are unavailable, unclean or shutting down
> crm_verify[19271]: 2008/07/03_11:02:00 WARN: native_color: Resource apache_2 cannot run anywhere
> crm_verify[19271]: 2008/07/03_11:02:00 notice: NoRoleChange: Leave resource IPaddr_192_168_11_25        (www1test)
> Warnings found during check: config may not be valid
> crm_verify[19271]: 2008/07/03_11:02:00 debug: cib_native_signoff: Signing out of the CIB Service
>
> Thanks for your help
>
> -----Original Message-----
> From: linux-ha-bounces at lists.linux-ha.org
> [mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Dominik Klein
> Sent: Thursday, July 3, 2008 10:23
> To: General Linux-HA mailing list
> Subject: Re: AW: [Linux-HA] Apache failover / renaming the binary
>
>
> Your testcase is not exactly the best, but it should still cause a failover.
>
> Please try the attached patch. I don't know why "start" was excluded at
> that place. Does not make sense to me. Maybe someone can explain on the
> dev list.
>
> Imho, what you're doing should not produce what you're seeing and this
> patch should fix it.
>
> Comments please!
>
> Regards
> Dominik
>
> Ehlers, Kolja wrote:
>> Thanks for the reply, but the problem remains. If apache cannot be started or restarted, it is not failed over to the second node. I have two identical servers and I want to run the virtual IP + apache (grouped) on either node. To test the configuration I renamed httpd on one node to httpd_; otherwise I am not sure how to simulate an apache that will not start. Either way, when heartbeat is started, the apache start fails on www1test and then nothing happens. I have attached my CIB and the logs.
>>
>> This is what crm_mon gives me:
>>
>> Refresh in 1s...
>>
>> ============
>> Last updated: Thu Jul  3 09:53:34 2008
>> Current DC: www2test (5e0f97b7-6780-4487-baf9-6c36500b1276)
>> 2 Nodes configured.
>> 1 Resources configured.
>> ============
>>
>> Node: www2test (5e0f97b7-6780-4487-baf9-6c36500b1276): online
>> Node: www1test (3a325e23-2184-46ed-9e88-42a11f28c2be): online
>>
>> Resource Group: group_1
>>     IPaddr_192_168_11_25        (ocf::heartbeat:IPaddr):        Started www1test
>>     apache_2    (ocf::heartbeat:apache):        Stopped
>>
>> Failed actions:
>>     apache_2_start_0 (node=www1test, call=6, rc=6): complete
>>
>>
>>
>> www1test:~ # crm_verify -VVVVL
>> crm_verify[8124]: 2008/07/03_09:54:55 info: main: =#=#=#=#= Getting XML =#=#=#=#=
>> crm_verify[8124]: 2008/07/03_09:54:55 info: main: Reading XML from: live cluster
>> crm_verify[8124]: 2008/07/03_09:54:55 notice: main: Required feature set: 2.0
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value 'false' for cluster option 'stonith-enabled'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value 'reboot' for cluster option 'stonith-action'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value '0' for cluster option 'default-resource-failure-stickiness'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value '60s' for cluster option 'cluster-delay'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value '30' for cluster option 'batch-limit'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value '20s' for cluster option 'default-action-timeout'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value 'true' for cluster option 'stop-orphan-resources'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value 'true' for cluster option 'stop-orphan-actions'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value 'false' for cluster option 'remove-after-stop'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value '-1' for cluster option 'pe-error-series-max'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value '-1' for cluster option 'pe-warn-series-max'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value '-1' for cluster option 'pe-input-series-max'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value 'true' for cluster option 'startup-fencing'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cluster_option: Using default value 'true' for cluster option 'start-failure-is-fatal'
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: unpack_config: Default action timeout: 20s
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: unpack_config: Default stickiness: 1000000
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: unpack_config: Default failure stickiness: 0
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: unpack_config: STONITH of failed nodes is disabled
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: unpack_config: Cluster is symmetric - resources can run anywhere by default
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: unpack_config: On loss of CCM Quorum: Stop ALL resources
>> crm_verify[8124]: 2008/07/03_09:54:55 info: determine_online_status: Node www2test is online
>> crm_verify[8124]: 2008/07/03_09:54:55 info: determine_online_status: Node www1test is online
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: common_apply_stickiness: fail-count-apache_2: INFINITY
>> crm_verify[8124]: 2008/07/03_09:54:55 ERROR: unpack_rsc_op: Hard error: apache_2_start_0 failed with rc=6.
>> crm_verify[8124]: 2008/07/03_09:54:55 ERROR: unpack_rsc_op:   Preventing apache_2 from re-starting anywhere in the cluster
>> crm_verify[8124]: 2008/07/03_09:54:55 WARN: unpack_rsc_op: Processing failed op apache_2_start_0 on www1test: Error
>> crm_verify[8124]: 2008/07/03_09:54:55 WARN: unpack_rsc_op: Compatability handling for failed op apache_2_start_0 on www1test
>> crm_verify[8124]: 2008/07/03_09:54:55 notice: group_print: Resource Group: group_1
>> crm_verify[8124]: 2008/07/03_09:54:55 notice: native_print:     IPaddr_192_168_11_25    (ocf::heartbeat:IPaddr):        Started www1test
>> crm_verify[8124]: 2008/07/03_09:54:55 notice: native_print:     apache_2        (ocf::heartbeat:apache):        Stopped
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: group_rsc_location: Processing rsc_location pref_run_apache_group for group_1
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: native_merge_weights: IPaddr_192_168_11_25: Rolling back scores from apache_2
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: native_assign_node: Assigning www1test to IPaddr_192_168_11_25
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: native_assign_node: All nodes for resource apache_2 are unavailable, unclean or shutting down
>> crm_verify[8124]: 2008/07/03_09:54:55 WARN: native_color: Resource apache_2 cannot run anywhere
>> crm_verify[8124]: 2008/07/03_09:54:55 notice: NoRoleChange: Leave resource IPaddr_192_168_11_25 (www1test)
>> Warnings found during check: config may not be valid
>> crm_verify[8124]: 2008/07/03_09:54:55 debug: cib_native_signoff: Signing out of the CIB Service
>>
>>
>>
>> -----Original Message-----
>> From: linux-ha-bounces at lists.linux-ha.org
>> [mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Dominik Klein
>> Sent: Thursday, July 3, 2008 08:27
>> To: General Linux-HA mailing list
>> Subject: Re: [Linux-HA] Apache failover / renaming the binary
>>
>>
>> http://hg.linux-ha.org/dev/file/5072025b79b8/resources/OCF/apache
>>
>> lines 516-518
>>
>> Another example of how to use exit codes incorrectly.
>>
>> I'll commit a patch soon.
>>
>> In your script: Make line 518 look like this (on all nodes!):
>> exit $OCF_ERR_INSTALLED
>>
>> Then clean up the resource or start the cluster from scratch and try
>> again. That should fix it.
>>
>> Regards
>> Dominik
>>
>>
>> Ehlers, Kolja wrote:
>>> Hello,
>>>
>>> My simple active/passive cluster seems to work, but when it is running and I do:
>>>
>>> /opt/apache2/bin/apachectl stop && mv /opt/apache2/bin/httpd /opt/apache2/bin/httpd_
>>>
>>> Heartbeat does not fail apache over to node2 (Hard error: apache_2_start_0 failed with rc=6.). This is really odd, because the log states "All 2 cluster nodes are eligible to run resources." but four lines further down it says "ERROR: unpack_rsc_op:   Preventing apache_2 from re-starting anywhere in the cluster". I am using a very simple CIB with one virtual IP and apache grouped. If I stop apache manually, heartbeat restarts it fine. By the way, can I configure it so that it fails over straight to the other node if apache is stopped or fails? When I stop heartbeat manually, the failover does work.
>>>
>>> So I am not sure which part of my configuration or logs you need to see. I guess I'm missing something important here.
>>>
>>> This is my cib
>>>
>>>  <cib admin_epoch="0" generated="true" have_quorum="true" ignore_dtd="false" num_peers="2" cib_feature_revision="2.0" crm_feature_set="2.0" epoch="38" num_updates="3" cib-last-written="Wed Jul  2 16:16:51 2008" ccm_transition="2" dc_uuid="5e0f97b7-6780-4487-baf9-6c36500b1276">
>>>    <configuration>
>>>      <crm_config>
>>>        <cluster_property_set id="cib-bootstrap-options">
>>>          <attributes>
>>>            <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="true"/>
>>>            <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="INFINITY"/>
>>>            <nvpair id="cib-bootstrap-options-is-managed-default" name="is-managed-default" value="true"/>
>>>            <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
>>>            <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.3-node: a3184d5240c6e7032aef9cce6e5b7752ded544b3"/>
>>>          </attributes>
>>>        </cluster_property_set>
>>>      </crm_config>
>>>      <nodes>
>>>        <node id="5e0f97b7-6780-4487-baf9-6c36500b1276" uname="www2test" type="normal"/>
>>>        <node id="3a325e23-2184-46ed-9e88-42a11f28c2be" uname="www1test" type="normal"/>
>>>      </nodes>
>>>      <resources>
>>>        <group id="group_1">
>>>          <primitive class="ocf" id="IPaddr_192_168_11_25" provider="heartbeat" type="IPaddr">
>>>            <operations>
>>>              <op id="IPaddr_192_168_11_25_mon" interval="5s" name="monitor" timeout="5s"/>
>>>            </operations>
>>>            <instance_attributes id="IPaddr_192_168_11_25_inst_attr">
>>>              <attributes>
>>>                <nvpair id="IPaddr_192_168_11_25_attr_0" name="ip" value="192.168.11.25"/>
>>>              </attributes>
>>>            </instance_attributes>
>>>          </primitive>
>>>          <primitive class="ocf" id="apache_2" provider="heartbeat" type="apache">
>>>            <operations>
>>>              <op id="apache_2_mon" interval="5s" name="monitor" timeout="10s"/>
>>>            </operations>
>>>            <instance_attributes id="apache_2_inst_attr">
>>>              <attributes>
>>>                <nvpair id="apache_2_attr_0" name="configfile" value="/opt/apache2/conf/httpd.conf"/>
>>>              </attributes>
>>>            </instance_attributes>
>>>            <instance_attributes id="apache_2">
>>>              <attributes>
>>>                <nvpair id="apache_2-httpd" name="httpd" value="/opt/apache2/bin/httpd"/>
>>>              </attributes>
>>>            </instance_attributes>
>>>          </primitive>
>>>        </group>
>>>      </resources>
>>>      <constraints>
>>>        <rsc_location id="run_group1" rsc="group_1">
>>>          <rule id="pref_run_apache_group" score="0">
>>>            <expression attribute="#uname" operation="eq" value="www1test" id="7667baf9-522d-40ac-a901-195bfe84a3df"/>
>>>          </rule>
>>>        </rsc_location>
>>>      </constraints>
>>>    </configuration>
>>>  </cib>

