[Linux-HA] Resource in master state - no monitor operation
Dejan Muhamedagic
dejanmm at fastmail.fm
Tue Oct 2 14:33:27 MDT 2007
Hi,
On Tue, Oct 02, 2007 at 11:27:59AM -0700, Assaf N wrote:
> > Assaf N wrote:
> > > Hello,
> > >
> > > I started a small test cluster using heartbeat 2.1.1. The cluster contains
> > > one simple master/slave resource.
> > >
> > > While playing around with this cluster, I've noticed that whenever the
> > > resource is promoted to be the master on a machine, Heartbeat stops calling
> > > its monitor operation on this node. A quick look at the ha-debug log reveals
> > > that the monitor op is stopped intentionally, because of the resource
> > > promotion. However, the op is not restarted once the node becomes
> > > the master. When a second node starts and its resource takes the master
> > > role, our demoted resource starts to be monitored again.
> > >
> > > I'm attaching my cib.xml, ha-debug and the resource agent script. Do I
> > > have a configuration error, or have I encountered a bug?
> >
> > please refer to the following conversation and tell us whether this
> > resolves your issue:
> >
> > http://www.gossamer-threads.com/lists/linuxha/users/42529
> >
>
> Thanks, it does resolve my issue. How embarrassing to discover it was answered a few days ago... I searched the list a few days before it was posted, and neglected to search again before sending my question... :-)
>
> Now I've encountered a new issue - the 'success' return code from the monitor function is supposed to be 0 when the resource is a slave, and 8 when it's a master, right? Well, this is true when the resource is first started, but after the resource is promoted and demoted heartbeat still considers 8 to be the success return value, although the resource is not a master anymore. If I return 0 the resource is stopped and started, and the success return value is 0 again. Is this on purpose?
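
For reference, the return-code convention described in the question can be written down as a small sketch. The numeric values are the standard OCF exit codes (0, 7 and 8); the helper names `expected_monitor_rc` and `monitor_matches_role` are hypothetical and exist only for illustration. This shows what a monitor result would be expected to look like per role, not how heartbeat itself evaluates it.

```python
# Standard OCF exit codes relevant to a master/slave resource agent.
OCF_SUCCESS = 0          # resource is running (as slave, for a master/slave resource)
OCF_NOT_RUNNING = 7      # resource is cleanly stopped
OCF_RUNNING_MASTER = 8   # resource is running in the master role


def expected_monitor_rc(role):
    """Return the monitor exit code that counts as healthy for a role.

    Hypothetical helper: the role names follow the CIB spelling
    ("Master", "Slave", "Stopped").
    """
    expected = {
        "Master": OCF_RUNNING_MASTER,
        "Slave": OCF_SUCCESS,
        "Stopped": OCF_NOT_RUNNING,
    }
    return expected[role]


def monitor_matches_role(role, rc):
    """True if the monitor exit code agrees with the role the cluster assumes."""
    return rc == expected_monitor_rc(role)
```

Under this convention, a resource that has been demoted would be expected to report 0 again, so `monitor_matches_role("Slave", 8)` is false.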
>
> I'm also seeing strange behavior in the following scenario: one node is the DC and runs the master instance of a resource, and the second node runs the slave instance. When I stop the heartbeat service on the first node (rh4vm2, the DC), it takes a hundred seconds to go down, and it complains about the monitor action running on the second node (rh4vm1):
>
> crmd[11630]: 2007/10/02_12:46:58 info: stop_subsystem: Sent -TERM to tengine: [11680]
> crmd[11630]: 2007/10/02_12:46:58 info: do_shutdown: Waiting for subsystems to exit
> tengine[11680]: 2007/10/02_12:47:06 WARN: action_timer_callback: Timer popped (abort_level=1000000, complete=false)
> tengine[11680]: 2007/10/02_12:47:06 WARN: print_elem: Action missed its timeout[Action 2]: In-flight (id: rsc_smith:0_monitor_3000, loc: rh4vm1, priority: 20)
> tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Timer popped (abort_level=1000000, complete=false)
> tengine[11680]: 2007/10/02_12:48:37 info: unconfirmed_actions: Action rsc_smith:0_monitor_3000 2 unconfirmed from peer
> tengine[11680]: 2007/10/02_12:48:37 ERROR: unconfirmed_actions: Waiting on 1 unconfirmed actions
> tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Transition abort timeout reached... marking transition complete.
> tengine[11680]: 2007/10/02_12:48:37 info: notify_crmd: Exiting after transition
> tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Writing 1 unconfirmed actions to the CIB
> tengine[11680]: 2007/10/02_12:48:37 info: unconfirmed_actions: Action rsc_smith:0_monitor_3000 2 unconfirmed from peer
> tengine[11680]: 2007/10/02_12:48:37 ERROR: unconfirmed_actions: Waiting on 1 unconfirmed actions
>
> Any idea why this happens?
No. The CRM expected a reply from the LRM about the outcome of an
action, but apparently received none. There were bugs in this area
before, but they should have been fixed. Is this a test cluster? Can
you turn debug on? You can then use the brand new hb_report utility
to collect all the information (see
http://marc.info/?l=linux-ha&m=119091078501042&w=2 )
and finally file a bug report.
Thanks,
Dejan
> Thanks for your help,
> Assaf
>
> my cib:
>
> <cib admin_epoch="0" have_quorum="false" ignore_dtd="false" num_peers="0" cib_feature_revision="1.3" generated="false" epoch="1385" num_updates="1" cib-last-written="Tue Oct 2 12:50:31 2007">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cluster_properties">
>         <attributes>
>           <nvpair id="default-resource-stickiness" name="default-resource-stickiness" value="70"/>
>           <nvpair id="default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="-100"/>
>         </attributes>
>       </cluster_property_set>
>       <cluster_property_set id="cib-bootstrap-options">
>         <attributes>
>           <nvpair name="last-lrm-refresh" id="cib-bootstrap-options-last-lrm-refresh" value="1191307342"/>
>         </attributes>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="0441b161-2421-4218-8b03-0c044937e197" uname="rh4vm1" type="normal">
>         <instance_attributes id="master-0441b161-2421-4218-8b03-0c044937e197">
>           <attributes>
>             <nvpair id="nodes-master-rsc_smith:1-0441b161-2421-4218-8b03-0c044937e197" name="master-rsc_smith:1" value="20"/>
>             <nvpair id="nodes-master-rsc_smith:0-0441b161-2421-4218-8b03-0c044937e197" name="master-rsc_smith:0" value="20"/>
>           </attributes>
>         </instance_attributes>
>       </node>
>       <node uname="rh4vm2" type="normal" id="f55d8a1b-6931-4a84-989c-7f241ce2897e">
>         <instance_attributes id="master-f55d8a1b-6931-4a84-989c-7f241ce2897e">
>           <attributes>
>             <nvpair name="master-rsc_smith:0" id="nodes-master-rsc_smith:0-f55d8a1b-6931-4a84-989c-7f241ce2897e" value="20"/>
>             <nvpair name="master-rsc_smith:1" id="nodes-master-rsc_smith:1-f55d8a1b-6931-4a84-989c-7f241ce2897e" value="30"/>
>           </attributes>
>         </instance_attributes>
>       </node>
>     </nodes>
>     <resources>
>       <master_slave id="master_slave_mvap" ordered="false" interleave="false" notify="false">
>         <instance_attributes id="ia_clone_ip">
>           <attributes>
>             <nvpair id="nvpair_ms_grp_mvap_clone_max" name="clone_max" value="2"/>
>             <nvpair id="nvpair_ms_grp_mvap_clone_node_max" name="clone_node_max" value="1"/>
>             <nvpair id="nvpair_ms_grp_mvap_master_max" name="master_max" value="1"/>
>             <nvpair id="nvpair_ms_grp_mvap_master_node_max" name="master_node_max" value="1"/>
>           </attributes>
>         </instance_attributes>
>         <primitive id="rsc_smith" class="ocf" type="smith2_agent" provider="ML">
>           <operations>
>             <op id="op_smith_monitor_special" name="monitor" timeout="3s" interval="3000ms" start_delay="6s">
>               <instance_attributes id="ia_smith_monitor_special">
>                 <attributes>
>                   <nvpair id="nvpair_smith_monitor_special_action" name="monitor_action" value="BIT1"/>
>                 </attributes>
>               </instance_attributes>
>             </op>
>             <op id="op_smith_monitor_master" name="monitor" timeout="3s" interval="3001ms" start_delay="6s" role="Master">
>               <instance_attributes id="ia_smith_monitor_master">
>                 <attributes>
>                   <nvpair id="nvpair_smith_monitor_master_action" name="monitor_action" value="BIT2"/>
>                   <nvpair id="nvpair_smith_monitor_master_state" name="master_monitor" value="master"/>
>                 </attributes>
>               </instance_attributes>
>             </op>
>           </operations>
>         </primitive>
>       </master_slave>
>     </resources>
>     <constraints>
>       <rsc_location id="loc_smith0" rsc="rsc_smith:0">
>         <rule id="loc_smith0_rule_run" score="INFINITY">
>           <expression id="loc_smith0_expression_run" attribute="#uname" operation="eq" value="rh4vm1"/>
>         </rule>
>         <rule id="loc_smith0_rule_norun" score="-INFINITY">
>           <expression id="loc_smith0_expression_norun" attribute="#uname" operation="ne" value="rh4vm1"/>
>         </rule>
>       </rsc_location>
>       <rsc_location id="loc_smith1" rsc="rsc_smith:1">
>         <rule id="loc_smith1_rule_run" score="INFINITY">
>           <expression id="loc_smith1_expression_run" attribute="#uname" operation="eq" value="rh4vm2"/>
>         </rule>
>         <rule id="loc_smith1_rule_norun" score="-INFINITY">
>           <expression id="loc_smith1_expression_norun" attribute="#uname" operation="ne" value="rh4vm2"/>
>         </rule>
>       </rsc_location>
>     </constraints>
>   </configuration>
> </cib>
>
> > cheers,
> > raoul bhatia
> > --
> > ____________________________________________________________________
> > DI (FH) Raoul Bhatia M.Sc. email. r.bhatia at ipax.at
> > Technischer Leiter
> >
> > IPAX - Aloy Bhatia Hava OEG web. http://www.ipax.at
> > Barawitzkagasse 10/2/2/11 email. office at ipax.at
> > 1190 Wien tel. +43 1 3670030
> > FN 277995t HG Wien fax. +43 1 3670030 15
> > ____________________________________________________________________
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA at lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
>