[Linux-HA] Resource in master state - no monitor operation

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Oct 2 14:33:27 MDT 2007


Hi,

On Tue, Oct 02, 2007 at 11:27:59AM -0700, Assaf N wrote:
> > Assaf N wrote:
> > > Hello,
> > >
> > > I started a small test cluster using heartbeat 2.1.1. The cluster contains
> > > one simple master/slave resource.
> > >
> > > While playing around with this cluster, I've noticed that whenever the
> > > resource is promoted to master on a machine, Heartbeat stops calling
> > > its monitor operation on that node. A quick look at the ha-debug log reveals
> > > that the monitor op is stopped intentionally, because of the resource
> > > promotion. However, the op is never restarted once the node becomes
> > > the master. When a second node starts and its resource takes over the master
> > > role, our demoted resource starts to be monitored again.
> > >
> > > I'm attaching my cib.xml, ha-debug and the resource agent script. Do I
> > > have a configuration error, or have I encountered a bug?
> > 
> > please refer to the following conversation and tell us whether this
> > resolves your issue:
> > 
> >     http://www.gossamer-threads.com/lists/linuxha/users/42529
> > 
> 
> Thanks, it does resolve my issue. How embarrassing to discover it was answered a few days ago... I searched the list a few days before it was posted, and neglected to search again before sending my question... :-)
> 
> Now I've encountered a new issue - the 'success' return code from the monitor function is supposed to be 0 when the resource is a slave, and 8 when it's a master, right? Well, this is true when the resource is first started, but after the resource is promoted and then demoted, Heartbeat still considers 8 to be the success return value, although the resource is not a master anymore. If I return 0, the resource is stopped and started, and the success return value becomes 0 again. Is this on purpose?
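
For reference, the standard OCF return codes for a monitor operation on a
master/slave resource are 0 (OCF_SUCCESS) when the instance runs as a slave,
8 (OCF_RUNNING_MASTER) when it runs as the master, and 7 (OCF_NOT_RUNNING)
when it is stopped. A minimal monitor sketch along those lines (the pid and
marker files used for state detection here are purely hypothetical, not what
your smith2_agent actually does):

    #!/bin/sh
    # Standard OCF exit codes
    OCF_SUCCESS=0
    OCF_NOT_RUNNING=7
    OCF_RUNNING_MASTER=8

    smith_monitor() {
        # Hypothetical state detection via a pid file and a "master" marker
        if [ ! -f /var/run/smith.pid ]; then
            return $OCF_NOT_RUNNING
        fi
        if [ -f /var/run/smith.master ]; then
            return $OCF_RUNNING_MASTER    # running as master
        fi
        return $OCF_SUCCESS               # running as slave
    }
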
> 
> I'm experiencing another strange behavior in the following scenario - one node is the DC and runs the master instance of a resource, and the second runs the slave instance. When I stop the heartbeat service on the first node (rh4vm2, the DC) it takes about a hundred seconds to go down, and it complains about the monitor action running on the second node (rh4vm1):
> 
> crmd[11630]: 2007/10/02_12:46:58 info: stop_subsystem: Sent -TERM to tengine: [11680]
> crmd[11630]: 2007/10/02_12:46:58 info: do_shutdown: Waiting for subsystems to exit
> tengine[11680]: 2007/10/02_12:47:06 WARN: action_timer_callback: Timer popped (abort_level=1000000, complete=false)
> tengine[11680]: 2007/10/02_12:47:06 WARN: print_elem: Action missed its timeout[Action 2]: In-flight (id: rsc_smith:0_monitor_3000, loc: rh4vm1, priority: 20)
> tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Timer popped (abort_level=1000000, complete=false)
> tengine[11680]: 2007/10/02_12:48:37 info: unconfirmed_actions: Action rsc_smith:0_monitor_3000 2 unconfirmed from peer
> tengine[11680]: 2007/10/02_12:48:37 ERROR: unconfirmed_actions: Waiting on 1 unconfirmed actions
> tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Transition abort timeout reached... marking transition complete.
> tengine[11680]: 2007/10/02_12:48:37 info: notify_crmd: Exiting after transition
> tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Writing 1 unconfirmed actions to the CIB
> tengine[11680]: 2007/10/02_12:48:37 info: unconfirmed_actions: Action rsc_smith:0_monitor_3000 2 unconfirmed from peer
> tengine[11680]: 2007/10/02_12:48:37 ERROR: unconfirmed_actions: Waiting on 1 unconfirmed actions
> 
> Any idea why this happens?

No. The CRM expected a reply from the LRM about the outcome of an
action, but apparently never received one. There were bugs in this
area before, but they should have been fixed by now. Is this a test
cluster? Can you turn debug on? You can then use the brand new
hb_report utility to collect all the information (see
http://marc.info/?l=linux-ha&m=119091078501042&w=2 ).
And finally, file a bug report.
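
Roughly, that would look something like the following (the hb_report
options and the paths and timestamp below are only placeholders - check
the usage output of the version you have):

    # Turn debugging on in /etc/ha.d/ha.cf on both nodes, e.g.:
    #   debug 1
    # then restart heartbeat and reproduce the problem.

    # Collect logs, the CIB and related data covering the incident:
    hb_report -f "2007-10-02 12:40" /tmp/smith-report

    # Attach the resulting archive to the bug report.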

Thanks,

Dejan

> Thanks for your help,
> Assaf
> 
> 
> 
> 
> my cib:
> 
>  <cib admin_epoch="0" have_quorum="false" ignore_dtd="false" num_peers="0" cib_feature_revision="1.3" generated="false" epoch="1385" num_updates="1" cib-last-written="Tue Oct  2 12:50:31 2007">
>    <configuration>
>      <crm_config>
>        <cluster_property_set id="cluster_properties">
>          <attributes>
>            <nvpair id="default-resource-stickiness" name="default-resource-stickiness" value="70"/>
>            <nvpair id="default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="-100"/>
>          </attributes>
>        </cluster_property_set>
>        <cluster_property_set id="cib-bootstrap-options">
>          <attributes>
>            <nvpair name="last-lrm-refresh" id="cib-bootstrap-options-last-lrm-refresh" value="1191307342"/>
>          </attributes>
>        </cluster_property_set>
>      </crm_config>
>      <nodes>
>        <node id="0441b161-2421-4218-8b03-0c044937e197" uname="rh4vm1" type="normal">
>          <instance_attributes id="master-0441b161-2421-4218-8b03-0c044937e197">
>            <attributes>
>              <nvpair id="nodes-master-rsc_smith:1-0441b161-2421-4218-8b03-0c044937e197" name="master-rsc_smith:1" value="20"/>
>              <nvpair id="nodes-master-rsc_smith:0-0441b161-2421-4218-8b03-0c044937e197" name="master-rsc_smith:0" value="20"/>
>            </attributes>
>          </instance_attributes>
>        </node>
>        <node uname="rh4vm2" type="normal" id="f55d8a1b-6931-4a84-989c-7f241ce2897e">
>          <instance_attributes id="master-f55d8a1b-6931-4a84-989c-7f241ce2897e">
>            <attributes>
>              <nvpair name="master-rsc_smith:0" id="nodes-master-rsc_smith:0-f55d8a1b-6931-4a84-989c-7f241ce2897e" value="20"/>
>              <nvpair name="master-rsc_smith:1" id="nodes-master-rsc_smith:1-f55d8a1b-6931-4a84-989c-7f241ce2897e" value="30"/>
>            </attributes>
>          </instance_attributes>
>        </node>
>      </nodes>
>      <resources>
>        <master_slave id="master_slave_mvap" ordered="false" interleave="false" notify="false">
>          <instance_attributes id="ia_clone_ip">
>            <attributes>
>              <nvpair id="nvpair_ms_grp_mvap_clone_max" name="clone_max" value="2"/>
>              <nvpair id="nvpair_ms_grp_mvap_clone_node_max" name="clone_node_max" value="1"/>
>              <nvpair id="nvpair_ms_grp_mvap_master_max" name="master_max" value="1"/>
>              <nvpair id="nvpair_ms_grp_mvap_master_node_max" name="master_node_max" value="1"/>
>            </attributes>
>          </instance_attributes>
>          <primitive id="rsc_smith" class="ocf" type="smith2_agent" provider="ML">
>            <operations>
>              <op id="op_smith_monitor_special" name="monitor" timeout="3s" interval="3000ms" start_delay="6s">
>                <instance_attributes id="ia_smith_monitor_special">
>                  <attributes>
>                    <nvpair id="nvpair_smith_monitor_special_action" name="monitor_action" value="BIT1"/>
>                  </attributes>
>                </instance_attributes>
>              </op>
>              <op id="op_smith_monitor_master" name="monitor" timeout="3s" interval="3001ms" start_delay="6s" role="Master">
>                <instance_attributes id="ia_smith_monitor_master">
>                  <attributes>
>                    <nvpair id="nvpair_smith_monitor_master_action" name="monitor_action" value="BIT2"/>
>                    <nvpair id="nvpair_smith_monitor_master_state" name="master_monitor" value="master"/>
>                  </attributes>
>                </instance_attributes>
>              </op>
>            </operations>
>          </primitive>
>        </master_slave>
>      </resources>
>      <constraints>
>        <rsc_location id="loc_smith0" rsc="rsc_smith:0">
>          <rule id="loc_smith0_rule_run" score="INFINITY">
>            <expression id="loc_smith0_expression_run" attribute="#uname" operation="eq" value="rh4vm1"/>
>          </rule>
>          <rule id="loc_smith0_rule_norun" score="-INFINITY">
>            <expression id="loc_smith0_expression_norun" attribute="#uname" operation="ne" value="rh4vm1"/>
>          </rule>
>        </rsc_location>
>        <rsc_location id="loc_smith1" rsc="rsc_smith:1">
>          <rule id="loc_smith1_rule_run" score="INFINITY">
>            <expression id="loc_smith1_expression_run" attribute="#uname" operation="eq" value="rh4vm2"/>
>          </rule>
>          <rule id="loc_smith1_rule_norun" score="-INFINITY">
>            <expression id="loc_smith1_expression_norun" attribute="#uname" operation="ne" value="rh4vm2"/>
>          </rule>
>        </rsc_location>
>      </constraints>
>    </configuration>
>  </cib>
> 
> > cheers,
> > raoul bhatia
> > -- 
> > ____________________________________________________________________
> > DI (FH) Raoul Bhatia M.Sc.          email.          r.bhatia at ipax.at
> > Technischer Leiter
> > 
> > IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
> > Barawitzkagasse 10/2/2/11           email.            office at ipax.at
> > 1190 Wien                           tel.               +43 1 3670030
> > FN 277995t HG Wien                  fax.            +43 1 3670030 15
> > ____________________________________________________________________
> 
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems



More information about the Linux-HA mailing list