[Linux-HA] Clean Linux HA installation (no CIB modifications): OC_EV_MS_NOT_PRIMARY and blades reboot
Andrew Beekhof
beekhof at gmail.com
Wed Jun 11 23:52:01 MDT 2008
You didn't say which version (kinda important), but I'm guessing it's too old.
The giveaway is:
Jun 12 00:23:05 dal-xcp-12 cib: [3232]: ERROR: cib_ccm_msg_callback:
Membership instance ID went backwards! 3->1
Which was fixed in the CCM at least a year ago.
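To check whether the other nodes log the same regression, a small script can scan the logs for membership instance IDs that decrease. This is only a minimal sketch and not part of heartbeat: the function name `find_regressions` and the regular expression are assumptions based solely on the `mem_handle_event` lines quoted below, and it expects each log entry on a single line (as in the log file itself, not as wrapped by the mail client).

```python
import re

# Matches the membership summary that cib and crmd log via mem_handle_event, e.g.
# "... dal-xcp-12 cib: [3232]: info: mem_handle_event: instance=3, nodes=3, ..."
# Assumption: this pattern is derived only from the log excerpt in this thread.
INSTANCE_RE = re.compile(
    r"(\S+): \[(\d+)\]: info: mem_handle_event:\s+instance=(\d+),"
)


def find_regressions(lines):
    """Return (daemon, previous_id, new_id) for every drop in instance ID."""
    last_seen = {}      # daemon name -> last membership instance ID seen
    regressions = []
    for line in lines:
        match = INSTANCE_RE.search(line)
        if match is None:
            continue    # not a membership summary line
        daemon = match.group(1)
        instance = int(match.group(3))
        previous = last_seen.get(daemon)
        if previous is not None and instance < previous:
            regressions.append((daemon, previous, instance))
        last_seen[daemon] = instance
    return regressions


# Entries taken from the quoted log, joined back onto single lines.
SAMPLE = [
    "Jun 12 00:21:08 dal-xcp-12 cib: [3232]: info: mem_handle_event: "
    "instance=3, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6",
    "Jun 12 00:21:08 dal-xcp-12 crmd: [3236]: info: mem_handle_event: "
    "instance=3, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6",
    "Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event: "
    "instance=1, nodes=1, new=0, lost=2, n_idx=0, new_idx=1, old_idx=4",
    "Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event: "
    "instance=1, nodes=1, new=0, lost=2, n_idx=0, new_idx=1, old_idx=4",
]

if __name__ == "__main__":
    for daemon, previous, current in find_regressions(SAMPLE):
        print(f"{daemon}: membership instance went backwards {previous}->{current}")
```

Run against the sample, this reports the 3->1 drop for both cib and crmd, matching the ERROR lines in the quoted log. To use it on a real log, pass the open file object in place of `SAMPLE`.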
On Thu, Jun 12, 2008 at 02:52, Mike Toler <mike.toler at prodeasystems.com> wrote:
> I have a simple 3 server HA setup (at the moment) that just won't stay
> up.
>
> The scenario is:
> 1. YUM install heartbeat
> 2. Create SIMPLE ha.cf file that looks like this on the 3
> blades.
> crm on
> auto_failback off
> logfacility local0
> keepalive 2
> deadtime 10
> mcast eth0 230.0.0.1 894 1 0
> node dal-xcp-11.prodea-lo.net
> dal-xcp-21.prodea-lo.net dal-xcp-12.prodea-lo.net
> 3. Start Linux HA on all blades and see them all appear in the
> HB_GUI window.
> 4. Wait < 5 minutes and watch as 2 nodes drop out of cluster
> and the servers they are on reboot
>
> The cib.xml looks like:
> <cib generated="false" admin_epoch="0"
> have_quorum="false" ignore_dtd="false" num_peers="0"
> cib_feature_revision="2.0" epoch="5" num_updates="1"
> cib-last-written="Thu Jun 12 00:36:14 2008" ccm_transition="1">
> <configuration>
> <crm_config>
> <cluster_property_set id="cib-bootstrap-options">
> <attributes>
> <nvpair id="cib-bootstrap-options-dc-version"
> name="dc-version" value="2.1.3-node:
> 552305612591183b1628baa5bc6e903e0f1e26a3"/>
> </attributes>
> </cluster_property_set>
> </crm_config>
> <nodes>
> <node id="75aaa6e0-fb0c-478f-af94-f4b408d4538e"
> uname="dal-xcp-11.prodea-lo.net" type="normal"/>
> <node id="0c4e7148-a55a-4fb0-8a8a-ea2cade291d0"
> uname="dal-xcp-21.prodea-lo.net" type="normal"/>
> <node id="56e225d5-3e37-4550-8c35-c4ab45dff01d"
> uname="dal-xcp-12.prodea-lo.net" type="normal"/>
> </nodes>
> <resources/>
> <constraints/>
> </configuration>
> </cib>
>
> These are the logs I'm seeing just before two of the blades reboot.
>
> How can so simple a setup have this kind of problem?
>
> Jun 12 00:21:08 dal-xcp-12 cib: [3232]: info: mem_handle_event: Got an
> event OC_EV_MS_NOT_PRIMARY from ccm
> Jun 12 00:21:08 dal-xcp-12 cib: [3232]: info: mem_handle_event:
> instance=3, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6
> Jun 12 00:21:08 dal-xcp-12 crmd: [3236]: info: mem_handle_event: Got an
> event OC_EV_MS_NOT_PRIMARY from ccm
> Jun 12 00:21:08 dal-xcp-12 crmd: [3236]: info: mem_handle_event:
> instance=3, nodes=3, new=3, lost=0, n_idx=0, new_idx=0, old_idx=6
> Jun 12 00:21:08 dal-xcp-12 crmd: [3236]: info: crmd_ccm_msg_callback:
> Quorum lost after event=NOT PRIMARY (id=3)
>
>
> Jun 12 00:22:26 dal-xcp-12 cib: [3232]: info:
> cib_client_status_callback: Status update: Client
> dal-xcp-21.prodea-lo.net/cib now has status [leave]
> Jun 12 00:22:35 dal-xcp-12 heartbeat: [3223]: WARN: node
> dal-xcp-21.prodea-lo.net: is dead
> Jun 12 00:22:35 dal-xcp-12 crmd: [3236]: notice:
> crmd_ha_status_callback: Status update: Node dal-xcp-21.prodea-lo.net
> now has status [dead]
> Jun 12 00:22:36 dal-xcp-12 heartbeat: [3223]: info: Link
> dal-xcp-21.prodea-lo.net:eth0 dead.
> Jun 12 00:22:49 dal-xcp-12 heartbeat: [3223]: WARN: node
> dal-xcp-11.prodea-lo.net: is dead
> Jun 12 00:22:49 dal-xcp-12 heartbeat: [3223]: info: Link
> dal-xcp-11.prodea-lo.net:eth0 dead.
> Jun 12 00:22:49 dal-xcp-12 crmd: [3236]: notice:
> crmd_ha_status_callback: Status update: Node dal-xcp-11.prodea-lo.net
> now has status [dead]
> Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event: Got an
> event OC_EV_MS_INVALID from ccm
> Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event: no
> mbr_track info
> Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event: Got an
> event OC_EV_MS_INVALID from ccm
> Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event: Got an
> event OC_EV_MS_INVALID from ccm
> Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event: no
> mbr_track info
> Jun 12 00:23:05 dal-xcp-12 cib: [3232]: info: mem_handle_event:
> instance=1, nodes=1, new=0, lost=2, n_idx=0, new_idx=1, old_idx=4
> Jun 12 00:23:05 dal-xcp-12 cib: [3232]: ERROR: cib_ccm_msg_callback:
> Membership instance ID went backwards! 3->1
> Jun 12 00:23:05 dal-xcp-12 cib: [3232]: ERROR: crm_abort:
> cib_ccm_msg_callback: Triggered non-fatal assert at callbacks.c:1806 :
> current_instance <= membership->m_instance
> Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event: Got an
> event OC_EV_MS_INVALID from ccm
> Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: mem_handle_event:
> instance=1, nodes=1, new=0, lost=2, n_idx=0, new_idx=1, old_idx=4
> Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: info: crmd_ccm_msg_callback:
> Quorum lost after event=INVALID (id=1)
> Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: ERROR: crmd_ccm_msg_callback:
> Membership instance ID went backwards! 3->1
> Jun 12 00:23:05 dal-xcp-12 crmd: [3236]: ERROR: crm_abort:
> crmd_ccm_msg_callback: Triggered non-fatal assert at callbacks.c:520 :
> current_ccm_membership_id
>
>
>
> Michael Toler