[Linux-HA] 4 Node cluster insanity.

Andrew Beekhof beekhof at gmail.com
Mon Feb 12 00:51:33 MST 2007


On 2/9/07, John Lange <john.lange at open-it.ca> wrote:
> Situation: 4 node active, active, active, active cluster using shared
> storage (fiber connected SAN). Every node has an IP address, EVMS volume
> formatted with OCFS2 and exported via NFSServer.

Are you using user-space heartbeating?  (you're running sles10?)

>
> The goal here is to provide failover to another node should one go down.
>
> To summarize, here are the resources:
>
> EVMS Clone
> OCFS2 Clone
>
> (Those must be started in order.)
>
> IP1
> NFSServer1
> IP2
> NFSServer2
> IP3
> NFSServer3
> IP4
> NFSServer4
>
> The IP must start before nfs and nfs must restart on a node if it gets a
> new IP.
>
> Under its current configuration it sort of works. When the stars are
> aligned correctly, all 4 nodes boot and do their job.
>
> Unfortunately, it's horrendously unstable. If anything goes wrong the
> whole thing implodes like a house of cards. Nodes start fencing each
> other and rebooting seemingly randomly and general insanity ensues
> requiring extensive manual intervention to bring things back to life.
>
> So, for example, if I issue a "heartbeat stop" on node4, at least one
> other node gets fenced and reboots and general mayhem results. I can't
> for the life of me figure out why stopping node4 would reboot node1?
> That makes no sense.
>
> To top it off, I believe all this fencing may have corrupted a node. No
> matter what I do node3 will no longer mount the ocfs file system. Here
> is what dmesg looks like:
>
> OCFS2 Node Manager 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
> o2cb heartbeat: registered disk mode
> OCFS2 DLM 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
> OCFS2 DLMFS 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
> OCFS2 User DLM kernel interface loaded
> o2cb heartbeat: registered user mode
> Node vs2 is up in group 89FC5CB6C98B43B998AB8492874EA6CA
> o2net: connected to node vs2 (num 1) at 10.1.1.12:7777
> Node vs4 is up in group 89FC5CB6C98B43B998AB8492874EA6CA
> Node vs1 is up in group 89FC5CB6C98B43B998AB8492874EA6CA
> o2net: connected to node vs1 (num 0) at 10.1.1.11:7777
> Node vs3 is up in group 89FC5CB6C98B43B998AB8492874EA6CA
> OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
> (4160,0):o2net_connect_expired:1444 ERROR: no connection established with node 3 after 10 seconds, giving up and returning errors.
> (5014,1):dlm_request_join:786 ERROR: status = -107
> (5014,1):dlm_try_to_join_domain:935 ERROR: status = -107
> (5014,1):dlm_join_domain:1188 ERROR: status = -107
> (5014,1):dlm_register_domain:1381 ERROR: status = -107
> (5014,1):ocfs2_dlm_init:2007 ERROR: status = -107
> (5014,1):ocfs2_mount_volume:1090 ERROR: status = -107
> ocfs2: Unmounting device (253,13) on (node 2)
> Node vs3 is down in group 89FC5CB6C98B43B998AB8492874EA6CA
> Node vs1 is down in group 89FC5CB6C98B43B998AB8492874EA6CA
> o2net: no longer connected to node vs1 (num 0) at 10.1.1.11:7777
> Node vs4 is down in group 89FC5CB6C98B43B998AB8492874EA6CA
> Node vs2 is down in group 89FC5CB6C98B43B998AB8492874EA6CA
> o2net: no longer connected to node vs2 (num 1) at 10.1.1.12:7777
>
> =========
> ocfs simply will no longer start on node3.
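A quick way to read the dmesg output above is to pull out which o2net peers connected, which did not, and what the status code means. The sketch below is only an illustration: the log excerpt is copied from the message, and the helper name `summarize` is made up. On Linux, errno 107 is ENOTCONN ("Transport endpoint is not connected"). The log never names node 3. Since vs1 is num 0, vs2 is num 1 and the local mount reports node 2, it is presumably vs4, but that is an assumption.

```python
import errno
import re

# Excerpt of the dmesg output quoted above (taken on vs3).
DMESG = """\
o2net: connected to node vs2 (num 1) at 10.1.1.12:7777
o2net: connected to node vs1 (num 0) at 10.1.1.11:7777
(4160,0):o2net_connect_expired:1444 ERROR: no connection established with node 3 after 10 seconds, giving up and returning errors.
(5014,1):dlm_request_join:786 ERROR: status = -107
ocfs2: Unmounting device (253,13) on (node 2)
"""


def summarize(log):
    """Return (connected node numbers, unreachable node numbers, errno names)."""
    connected = {
        int(n)
        for n in re.findall(r"o2net: connected to node \S+ \(num (\d+)\)", log)
    }
    unreachable = {
        int(n)
        for n in re.findall(r"no connection established with node (\d+)", log)
    }
    # Kernel code reports failures as negative errno values, so negate
    # the number before looking up its symbolic name.
    statuses = {
        errno.errorcode.get(-int(s), "unknown")
        for s in re.findall(r"status = (-\d+)", log)
    }
    return connected, unreachable, statuses


connected, unreachable, statuses = summarize(DMESG)
print("connected to nodes:", sorted(connected))
print("could not reach nodes:", sorted(unreachable))
print("status codes:", sorted(statuses))
```

If that reading is right, the join fails because vs3 cannot open an o2net connection (port 7777 in this log) to one peer, which would point at the network path or the OCFS2 cluster configuration for that node rather than at on-disk corruption. That is a guess from the log alone.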
>
> One of the other problems is crm_mon shows nfsserver as being started
> when in reality it isn't. As you can see from the cib.xml below, I'm
> using lsb/nfsserver. I noticed in one of the log files a message like
> "nfsserver doesn't support restart". Is there a better resource agent
> that should be used? There doesn't appear to be an ocf version.
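For an lsb resource the cluster can only know what the init script's exit codes tell it, so a "started but not really" report is worth checking against the script itself. The sketch below maps the exit codes the LSB specification defines for the `status` action. The path `/etc/init.d/nfsserver` is an assumption based on the `lsb/nfsserver` name in the message, and the helper names are made up for illustration.

```python
import subprocess

# Exit codes the LSB specification defines for the "status" action.
LSB_STATUS = {
    0: "running",
    1: "dead, but pid file exists",
    2: "dead, but lock file exists",
    3: "not running",
    4: "status unknown",
}


def describe_status(code):
    """Translate an init script 'status' exit code into LSB wording."""
    return LSB_STATUS.get(code, "reserved or non-standard (%d)" % code)


def check(script="/etc/init.d/nfsserver"):
    """Run '<script> status' and return (exit code, description).

    The default path is an assumption; adjust it for the system at hand.
    """
    code = subprocess.call([script, "status"])
    return code, describe_status(code)
```

Calling `check()` once with the service stopped and once with it running shows whether the script reports 3 and 0 respectively. If it returns 0 while the daemons are down, the monitor result shown by crm_mon would be wrong for the same reason.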
>
> Anyhow, here is the cib.xml. If someone could please look it over and
> make some suggestions that would be greatly appreciated.
>
>  <cib generated="true" admin_epoch="0" have_quorum="true" num_peers="4" cib_feature_revision="1.3" ignore_dtd="false" ccm_transition="42" dc_uuid="f6ed8bf2-eb64-4fa0-8bab-c7e990193876" epoch="208" num_updates="9391" cib-last-written="Fri Feb  9 12:18:19 2007">
>    <configuration>
>      <crm_config>
>        <cluster_property_set id="cib-bootstrap-options">
>          <attributes>
>            <nvpair id="cib-bootstrap-options-transition_idle_timeout" name="transition-idle-timeout" value="60"/>
>            <nvpair id="cib-bootstrap-options-stonith_enabled" name="stonith-enabled" value="true"/>
>            <nvpair id="cib-bootstrap-options-stonith_action" name="stonith-action" value="reboot"/>
>            <nvpair id="cib-bootstrap-options-symmetric_cluster" name="symmetric-cluster" value="true"/>
>            <nvpair id="cib-bootstrap-options-no_quorum_policy" name="no-quorum-policy" value="ignore"/>
>            <nvpair id="cib-bootstrap-options-stop_orphan_resources" name="stop-orphan-resources" value="true"/>
>            <nvpair id="cib-bootstrap-options-stop_orphan_actions" name="stop-orphan-actions" value="true"/>
>            <nvpair id="cib-bootstrap-options-is_managed_default" name="is-managed-default" value="true"/>
>            <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1171041047"/>
>          </attributes>
>        </cluster_property_set>
>      </crm_config>
>      <nodes>
>        <node id="21a514da-4a8c-49a8-bd78-79179418a3f5" uname="vs1" type="normal"/>
>        <node id="f6ed8bf2-eb64-4fa0-8bab-c7e990193876" uname="vs4" type="normal"/>
>        <node id="4cb9baf7-747c-46dc-9d8c-debb00225d84" uname="vs3" type="normal"/>
>        <node id="9ba549a0-8f53-46fe-9946-02d1ea6acc2d" uname="vs2" type="normal"/>
>      </nodes>
>      <resources>
>        <clone id="stonithcloneset" globally_unique="false">
>          <instance_attributes id="stonithcloneset">
>            <attributes>
>              <nvpair id="stonithcloneset-01" name="clone_node_max" value="1"/>
>            </attributes>
>          </instance_attributes>
>          <primitive id="stonithclone" class="stonith" type="external/ssh" provider="heartbeat">
>            <operations>
>              <op name="monitor" interval="5s" timeout="20s" prereq="nothing" id="stonithclone-op-01"/>
>              <op name="start" timeout="20s" prereq="nothing" id="stonithclone-op-02"/>
>            </operations>
>            <instance_attributes id="stonithclone">
>              <attributes>
>                <nvpair id="stonithclone-01" name="hostlist" value="vs1,vs2,vs3,vs4"/>
>              </attributes>
>            </instance_attributes>
>          </primitive>
>        </clone>
>        <clone id="evmscloneset" notify="true" globally_unique="false">
>          <instance_attributes id="evmscloneset">
>            <attributes>
>              <nvpair id="evmscloneset-01" name="clone_node_max" value="1"/>
>              <nvpair id="evmscloneset_target_role" name="target_role" value="started"/>
>            </attributes>
>          </instance_attributes>
>          <primitive id="evmsclone" class="ocf" type="EvmsSCC" provider="heartbeat"/>
>        </clone>
>        <clone id="imagestorecloneset" notify="true" globally_unique="false">
>          <instance_attributes id="imagestorecloneset">
>            <attributes>
>              <nvpair id="imagestorecloneset-01" name="clone_node_max" value="1"/>
>              <nvpair id="imagestorecloneset-02" name="target_role" value="started"/>
>            </attributes>
>          </instance_attributes>
>          <primitive id="imagestoreclone" class="ocf" type="Filesystem" provider="heartbeat">
>            <operations>
>              <op name="monitor" interval="30s" timeout="60s" prereq="nothing" id="imagestoreclone-op-01"/>
>            </operations>
>            <instance_attributes id="imagestoreclone">
>              <attributes>
>                <nvpair id="imagestoreclone-01" name="device" value="/dev/evms/lv1/cameras"/>
>                <nvpair id="imagestoreclone-02" name="directory" value="/data/cameras"/>
>                <nvpair id="imagestoreclone-03" name="fstype" value="ocfs2"/>
>                <nvpair id="imagestoreclone:3_target_role" name="target_role" value="started"/>
>              </attributes>
>            </instance_attributes>
>          </primitive>
>        </clone>
>        <primitive class="ocf" type="IPaddr" provider="heartbeat" id="vs2vip" resource_stickiness="#default">
>          <instance_attributes id="vs2vip_instance_attrs">
>            <attributes>
>              <nvpair name="target_role" id="vs2vip_target_role" value="started"/>
>              <nvpair id="c7e3b680-d5a5-4fd9-be12-55b34e5ad71b" name="ip" value="142.160.197.59"/>
>              <nvpair id="8d68ab51-3fe9-47ea-8945-4dd65a2558a4" name="nic" value="eth0"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>        <primitive class="ocf" type="IPaddr" provider="heartbeat" id="vs1vip">
>          <instance_attributes id="vs1vip_instance_attrs">
>            <attributes>
>              <nvpair name="target_role" id="vs1vip_target_role" value="started"/>
>              <nvpair id="c41cd38b-dec5-49e2-8394-f487e50f77d3" name="ip" value="142.160.197.58"/>
>              <nvpair id="c716fb30-af32-4b78-9af6-1536beac6469" name="nic" value="eth0"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>        <primitive class="ocf" type="IPaddr" provider="heartbeat" id="vs3vip">
>          <instance_attributes id="vs3vip_instance_attrs">
>            <attributes>
>              <nvpair name="target_role" id="vs3vip_target_role" value="started"/>
>              <nvpair id="abca92c6-d079-49e0-a5b1-5c0473ff648a" name="ip" value="142.160.197.61"/>
>              <nvpair id="93e30e85-e929-4e49-931b-2a1e1cc7389f" name="nic" value="eth0"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>        <primitive id="vs4vip" class="ocf" type="IPaddr" provider="heartbeat">
>          <instance_attributes id="vs4vip_instance_attrs">
>            <attributes>
>              <nvpair id="vs4vip_target_role" name="target_role" value="started"/>
>              <nvpair id="26690918-cb87-429b-a5e0-439cc2100834" name="ip" value="142.160.197.62"/>
>              <nvpair id="414a8489-e7d8-4e50-ad04-b3606b30c687" name="nic" value="eth0"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>        <primitive id="nfsservervs1" class="lsb" type="nfsserver" provider="heartbeat">
>          <instance_attributes id="nfsservervs1_instance_attrs">
>            <attributes>
>              <nvpair id="nfsservervs1_target_role" name="target_role" value="started"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>        <primitive id="nfsservervs2" class="lsb" type="nfsserver" provider="heartbeat">
>          <instance_attributes id="nfsservervs2_instance_attrs">
>            <attributes>
>              <nvpair id="nfsservervs2_target_role" name="target_role" value="started"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>        <primitive id="nfsservervs3" class="lsb" type="nfsserver" provider="heartbeat">
>          <instance_attributes id="nfsservervs3_instance_attrs">
>            <attributes>
>              <nvpair id="nfsservervs3_target_role" name="target_role" value="started"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>        <primitive id="nfsservervs4" class="lsb" type="nfsserver" provider="heartbeat">
>          <instance_attributes id="nfsservervs4_instance_attrs">
>            <attributes>
>              <nvpair id="nfsservervs4_target_role" name="target_role" value="started"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>      </resources>
>      <constraints>
>        <rsc_order id="vm1orderconstraints-01" from="imagestorecloneset" to="evmscloneset"/>
>        <rsc_location id="place_vs1vip" rsc="vs1vip">
>          <rule id="prefered_place_vs1vip" score="100">
>            <expression attribute="#uname" id="c9241954-f81f-4c74-94c9-08718bbd1fc2" operation="eq" value="vs1"/>
>          </rule>
>        </rsc_location>
>        <rsc_location id="place_vs2vip" rsc="vs2vip">
>          <rule id="prefered_place_vs2vip" score="100">
>            <expression attribute="#uname" id="deed4f08-2ff5-4faf-9e49-c9bb4354a912" operation="eq" value="vs2"/>
>          </rule>
>        </rsc_location>
>        <rsc_location id="place_vs3vip" rsc="vs3vip">
>          <rule id="prefered_place_vs3vip" score="100">
>            <expression attribute="#uname" id="c0342b73-610f-481f-af36-12d36ef322de" operation="eq" value="vs3"/>
>          </rule>
>        </rsc_location>
>        <rsc_location id="place_vs4vip" rsc="vs4vip">
>          <rule id="prefered_place_vs4vip" score="100">
>            <expression attribute="#uname" id="b9d624c6-e52a-490e-a1ca-d6633de236e1" operation="eq" value="vs4"/>
>          </rule>
>        </rsc_location>
>        <rsc_order id="nfsserverorderconstraints-04" from="nfsservervs4" to="vs4vip"/>
>        <rsc_order id="nfsserverorderconstraints-02" from="nfsservervs2" to="vs2vip"/>
>        <rsc_order id="nfsserverorderconstraints-03" from="nfsservervs3" to="vs3vip"/>
>        <rsc_order id="nfsserverorderconstraints-01" from="nfsservervs1" to="vs1vip"/>
>        <rsc_location id="place_nfs1" rsc="nfsservervs1">
>          <rule id="prefered_place_nfs1" score="100">
>            <expression attribute="#uname" id="3a5dc056-4f28-4b03-b171-205f8cc4bb48" operation="eq" value="vs1"/>
>          </rule>
>        </rsc_location>
>        <rsc_location id="place_nfs2" rsc="nfsservervs2">
>          <rule id="prefered_place_nfs2" score="100">
>            <expression attribute="#uname" id="16e6421c-9c3f-412d-b433-a80f3682e509" operation="eq" value="vs2"/>
>          </rule>
>        </rsc_location>
>        <rsc_location id="place_nfs3" rsc="nfsservervs3">
>          <rule id="prefered_place_nfs3" score="100">
>            <expression attribute="#uname" id="fdf1fe6f-be5d-4afc-9323-94adf756d164" operation="eq" value="vs3"/>
>          </rule>
>        </rsc_location>
>        <rsc_location id="place_nfs4" rsc="nfsservervs4">
>          <rule id="prefered_place_nfs4" score="100">
>            <expression attribute="#uname" id="f652501e-5052-4dea-8288-ce9fdd3c8954" operation="eq" value="vs4"/>
>          </rule>
>        </rsc_location>
>      </constraints>
>    </configuration>
>  </cib>
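Two things stand out in the constraints section above: none of the `rsc_order` entries carries an explicit `type` attribute, and there is no `rsc_colocation` at all, so only the location scores of 100 keep each nfsserver next to its IP. The sketch below checks for both using a trimmed copy of the posted constraints. The function name `audit` is hypothetical, and which direction an untyped `rsc_order` takes depends on the DTD default of the Heartbeat version in use, which is not verified here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <constraints> section from the cib.xml above
# (rsc_location entries left out for brevity).
CONSTRAINTS = """
<constraints>
  <rsc_order id="vm1orderconstraints-01" from="imagestorecloneset" to="evmscloneset"/>
  <rsc_order id="nfsserverorderconstraints-01" from="nfsservervs1" to="vs1vip"/>
  <rsc_order id="nfsserverorderconstraints-02" from="nfsservervs2" to="vs2vip"/>
  <rsc_order id="nfsserverorderconstraints-03" from="nfsservervs3" to="vs3vip"/>
  <rsc_order id="nfsserverorderconstraints-04" from="nfsservervs4" to="vs4vip"/>
</constraints>
"""


def audit(constraints_xml):
    """Return (ids of rsc_order without 'type', ordered pairs not colocated)."""
    root = ET.fromstring(constraints_xml)
    # Collect every pair that some rsc_colocation ties together.
    colocated = {
        frozenset((c.get("from"), c.get("to")))
        for c in root.iter("rsc_colocation")
    }
    no_type = []
    no_colocation = []
    for order in root.iter("rsc_order"):
        pair = (order.get("from"), order.get("to"))
        if order.get("type") is None:
            no_type.append(order.get("id"))
        if frozenset(pair) not in colocated:
            no_colocation.append(pair)
    return no_type, no_colocation


no_type, no_colocation = audit(CONSTRAINTS)
print("rsc_order without explicit type:", no_type)
print("ordered but not colocated:", no_colocation)
```

The same listing also shows that nothing orders the nfsserver resources after imagestorecloneset. If NFS is exporting the OCFS2 mount, that dependency is probably wanted too, though the message does not say so explicitly.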
>
> ----
>
> John Lange
>
>

