[Linux-HA] Clean Linux HA installation (no CIB modifications) OC_EV_MS_NOT_PRIMARY and blades reboot

Mike Toler mike.toler at prodeasystems.com
Wed Jun 11 18:52:31 MDT 2008


I have a simple 3-server HA setup (at the moment) that just won't stay up.

The scenario is:
	1.  YUM install heartbeat
	2.  Create a SIMPLE ha.cf file that looks like this on the 3 blades:
				crm on
				auto_failback off
				logfacility     local0
				keepalive 2
				deadtime 10
				mcast eth0 230.0.0.1 894 1 0
				node dal-xcp-11.prodea-lo.net dal-xcp-21.prodea-lo.net dal-xcp-12.prodea-lo.net
	3.  Start Linux HA on all blades and see them all appear in the HB_GUI window.
	4.  Wait less than 5 minutes and watch as 2 nodes drop out of the cluster and the servers they are on reboot.
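
For anyone who wants to sanity-check the same ha.cf on each blade, here is a minimal Python sketch that splits the directives into a dictionary and prints the node list and the deadtime/keepalive ratio. The helper name parse_ha_cf is hypothetical, and the sketch assumes keepalive and deadtime are plain integer seconds, as they are in the file above. It does not validate anything the way heartbeat itself would.

```python
def parse_ha_cf(text):
    """Parse ha.cf-style text into {directive: [list of argument lists]}."""
    directives = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, *args = line.split()
        directives.setdefault(name, []).append(args)
    return directives


# The ha.cf from step 2, with the node directive on a single line.
HA_CF = """\
crm on
auto_failback off
logfacility     local0
keepalive 2
deadtime 10
mcast eth0 230.0.0.1 894 1 0
node dal-xcp-11.prodea-lo.net dal-xcp-21.prodea-lo.net dal-xcp-12.prodea-lo.net
"""

conf = parse_ha_cf(HA_CF)
nodes = [n for args in conf["node"] for n in args]
keepalive = int(conf["keepalive"][0][0])  # assumes integer seconds
deadtime = int(conf["deadtime"][0][0])    # assumes integer seconds

print("nodes:", nodes)
print("missed keepalives before a node is declared dead:", deadtime // keepalive)
```

Running the same script against the file from each blade and comparing the output is one quick way to rule out a copy/paste difference between the three nodes.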

The cib.xml looks like:
		<cib generated="false" admin_epoch="0" have_quorum="false"
		     ignore_dtd="false" num_peers="0" cib_feature_revision="2.0"
		     epoch="5" num_updates="1"
		     cib-last-written="Thu Jun 12 00:36:14 2008" ccm_transition="1">
		   <configuration>
		     <crm_config>
		       <cluster_property_set id="cib-bootstrap-options">
		         <attributes>
		           <nvpair id="cib-bootstrap-options-dc-version"
		                   name="dc-version"
		                   value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
		         </attributes>
		       </cluster_property_set>
		     </crm_config>
		     <nodes>
		       <node id="75aaa6e0-fb0c-478f-af94-f4b408d4538e"
		             uname="dal-xcp-11.prodea-lo.net" type="normal"/>
		       <node id="0c4e7148-a55a-4fb0-8a8a-ea2cade291d0"
		             uname="dal-xcp-21.prodea-lo.net" type="normal"/>
		       <node id="56e225d5-3e37-4550-8c35-c4ab45dff01d"
		             uname="dal-xcp-12.prodea-lo.net" type="normal"/>
		     </nodes>
		     <resources/>
		     <constraints/>
		   </configuration>
		 </cib>
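
To read the interesting fields out of a CIB like this without eyeballing the XML, a short Python sketch using only the standard library might look like the following. The embedded document is a trimmed copy of the cib.xml above (crm_config, resources and constraints omitted), and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cib.xml above: top-level attributes and the node list only.
CIB_XML = """
<cib generated="false" admin_epoch="0" have_quorum="false" num_peers="0"
     epoch="5" num_updates="1" ccm_transition="1">
  <configuration>
    <nodes>
      <node id="75aaa6e0-fb0c-478f-af94-f4b408d4538e"
            uname="dal-xcp-11.prodea-lo.net" type="normal"/>
      <node id="0c4e7148-a55a-4fb0-8a8a-ea2cade291d0"
            uname="dal-xcp-21.prodea-lo.net" type="normal"/>
      <node id="56e225d5-3e37-4550-8c35-c4ab45dff01d"
            uname="dal-xcp-12.prodea-lo.net" type="normal"/>
    </nodes>
  </configuration>
</cib>
"""

root = ET.fromstring(CIB_XML)

# Attributes on the <cib> element itself.
have_quorum = root.get("have_quorum")
num_peers = root.get("num_peers")

# uname of every <node> listed under <configuration>/<nodes>.
unames = [n.get("uname") for n in root.findall("./configuration/nodes/node")]

print("have_quorum:", have_quorum)
print("num_peers:", num_peers)
print("configured nodes:", unames)
```

Note that this copy of the CIB lists all three nodes while reporting have_quorum="false" and num_peers="0".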

These are the logs I'm seeing on dal-xcp-12 just before two of the blades reboot.

How can so simple a setup have this kind of problem?

Jun 12 00:21:08 dal-xcp-12 cib: [3232]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
Jun 12 00:21:08 dal-xcp-12 cib: [3232]: info: mem_handle_event: instance=3, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6
Jun 12 00:21:08 dal-xcp-12 crmd: [3236]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
Jun 12 00:21:08 dal-xcp-12 crmd: [3236]: info: mem_handle_event: instance=3, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6
Jun 12 00:21:08 dal-xcp-12 crmd: [3236]: info: crmd_ccm_msg_callback: Quorum lost after event=NOT PRIMARY (id=3)


Jun 12 00:22:26 dal-xcp-12 cib: [3232]: info: cib_client_status_callback: Status update: Client dal-xcp-21.prodea-lo.net/cib now has status [leave]
Jun 12 00:22:35 dal-xcp-12 heartbeat: [3223]: WARN: node dal-xcp-21.prodea-lo.net: is dead
Jun 12 00:22:35 dal-xcp-12 crmd: [3236]: notice: crmd_ha_status_callback: Status update: Node dal-xcp-21.prodea-lo.net now has status [dead]
Jun 12 00:22:36 dal-xcp-12 heartbeat: [3223]: info: Link dal-xcp-21.prodea-lo.net:eth0 dead.
Jun 12 00:22:49 dal-xcp-12 heartbeat: [3223]: WARN: node dal-xcp-11.prodea-lo.net: is dead
Jun 12 00:22:49 dal-xcp-12 heartbeat: [3223]: info: Link dal-xcp-11.prodea-lo.net:eth0 dead.
Jun 12 00:22:49 dal-xcp-12 crmd: [3236]: notice: crmd_ha_status_callback: Status update: Node dal-xcp-11.prodea-lo.net now has status [dead]
Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event: no mbr_track info
Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event: no mbr_track info
Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event: instance=1, nodes=1, new=0, lost=2, n_idx=0, new_idx=1, old_idx=4
Jun 12 00:23:05 dal-xcp-12 cib: [3232]: ERROR: cib_ccm_msg_callback: Membership instance ID went backwards! 3->1
Jun 12 00:23:05 dal-xcp-12 cib: [3232]: ERROR: crm_abort: cib_ccm_msg_callback: Triggered non-fatal assert at callbacks.c:1806 : current_instance <= membership->m_instance
Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event: instance=1, nodes=1, new=0, lost=2, n_idx=0, new_idx=1, old_idx=4
Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: crmd_ccm_msg_callback: Quorum lost after event=INVALID (id=1)
Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: ERROR: crmd_ccm_msg_callback: Membership instance ID went backwards! 3->1
Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: ERROR: crm_abort: crmd_ccm_msg_callback: Triggered non-fatal assert at callbacks.c:520 : current_ccm_membership_id
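
The "Membership instance ID went backwards! 3->1" errors can be cross-checked against the mem_handle_event lines themselves. Below is a rough Python sketch that tracks the last instance= value seen per daemon and reports any decrease. The function name membership_regressions and the regular expression are assumptions made for this illustration, not part of heartbeat, and the sketch assumes each log entry sits on a single line (mail wrapping undone). The sample lines are copied from the log above.

```python
import re

# Matches e.g. "cib: [3232]: info: mem_handle_event: instance=3, nodes=3, ..."
INSTANCE_RE = re.compile(
    r"(\w+): \[\d+\]: info: mem_handle_event: instance=(\d+), nodes=(\d+)"
)


def membership_regressions(lines):
    """Return (daemon, previous_instance, new_instance) for every decrease."""
    last_seen = {}
    regressions = []
    for line in lines:
        match = INSTANCE_RE.search(line)
        if not match:
            continue  # not a membership summary line
        daemon = match.group(1)
        instance = int(match.group(2))
        previous = last_seen.get(daemon)
        if previous is not None and instance < previous:
            regressions.append((daemon, previous, instance))
        last_seen[daemon] = instance
    return regressions


LOG_LINES = [
    "Jun 12 00:21:08 dal-xcp-12 cib: [3232]: info: mem_handle_event: instance=3, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6",
    "Jun 12 00:21:08 dal-xcp-12 crmd: [3236]: info: mem_handle_event: instance=3, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6",
    "Jun 12 00:22:35 dal-xcp-12 heartbeat: [3223]: WARN: node dal-xcp-21.prodea-lo.net: is dead",
    "Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event: instance=1, nodes=1, new=0, lost=2, n_idx=0, new_idx=1, old_idx=4",
    "Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event: instance=1, nodes=1, new=0, lost=2, n_idx=0, new_idx=1, old_idx=4",
]

result = membership_regressions(LOG_LINES)
for daemon, old, new in result:
    print(f"{daemon}: membership instance went backwards {old}->{new}")
```

On these sample lines the sketch reports a 3->1 drop for both cib and crmd, which lines up with the two ERROR entries in the log.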



Michael Toler




