[Linux-HA] "Clones, Stonith and Suicide" The SysAdmin who had a nervous breakdown.

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Oct 2 14:28:06 MDT 2007


Hi,

On Tue, Oct 02, 2007 at 05:17:38PM +0100, Peter Farrell wrote:
> Can someone verify my CIB please?
> 
> It's not working as intended and the more I read the less I understand...
> I've stared at the config for the past 2 days hoping to be struck by
> sudden understanding... hasn't happened yet.

Don't worry, the learning curve is extremely steep. We all need
quite some patience.

> I don't understand how you make a rule, and then call that rule as a
> result of an action. I used the bit from the pingd FAQ page:
> http://www.linux-ha.org/v2/faq/pingd
> "Quickstart - Only Run my_resource on Nodes with Access to at Least
> One Ping Node"
> 
> So - for my pingd clone, the operation is 'monitor' and 'on_fail=fence'
> <op id="pingd-child-monitor" name="monitor" interval="20s"
> timeout="40s" prereq="nothing" on_fail="fence"/>
> 
> I assume that this literally means:
> "ask the LRM to see if pingd is running every 20s, if after 40s pingd
> is not running, call it 'failed', and as it's 'failed' - fence it off,
> which forces the resource to migrate to another node and marks this
> one as 'degraded' and will not allow resource to run until it's been
> cleaned up"
> 
> Is that right? If so, then this bit I'm OK with.

No, not exactly. The monitor operation may fail (i.e. the
resource agent reports that the resource isn't running) or time
out (that's what you described). Both are treated as failures by
the CRM. on_fail=fence means that if this operation fails, the
node will be fenced, i.e. rebooted, provided you have an
operational stonith device. That is perhaps a tad harsh for a
monitor failure.
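
To make the distinction concrete, here is a minimal Python sketch
of how a monitor result could be classified. It is an illustration
only, not CRM or LRM code: the function name and its arguments are
made up for this example. The only real-world values are the OCF
return codes 0 (success) and 7 (not running) and the 40s timeout
from the op quoted above.

```python
OCF_SUCCESS = 0       # OCF return code: resource is running
OCF_NOT_RUNNING = 7   # OCF return code: resource is cleanly stopped

def monitor_failed(rc, elapsed_s, timeout_s=40.0):
    """Hypothetical helper: True if a monitor run counts as a failure.

    rc is the agent's return code, or None if it never returned.
    """
    if rc is None or elapsed_s > timeout_s:
        return True            # the operation timed out
    return rc != OCF_SUCCESS   # the agent answered, but not with success
```

With on_fail=fence, either kind of failure leads to a fencing
request for the node.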

> But - the 'dampen and multiplier' - I don't get.
> <nvpair id="pingd-dampen" name="dampen" value="5s"/>
> Does this mean: Wait 5 seconds before saying "yep - pingd says there's
> nothing out there, once pingd says 'there's nothing out there;?" Now
> write it out to the CIB and let any actions take place?

Yes, the cluster sort of stands back a bit until everything
settles.

> <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
> This is a weighted score thing right? It's adding 100 to each node
> that 'can' ping?

Right.
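
A small sketch of the arithmetic, assuming pingd sets its node
attribute to the number of ping nodes it can currently reach times
the multiplier. The function name is hypothetical; only the value
100 comes from the config above.

```python
def pingd_attribute(reachable_ping_nodes, multiplier=100):
    """Illustrative model of the value pingd publishes for one node."""
    return reachable_ping_nodes * multiplier
```

So with a single ping node, a cluster node sees either 100
(reachable) or 0 (not reachable), and a rule testing
"pingd lte 0" matches only in the second case.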

> So if one can't ping, then the score gets knocked down and the
> resource wants to move to a "higher scoring" node?? I completely don't
> understand this... What if you already have a constraint set for a
> node preference, does this override it? Conflict with it?

The node with the highest score is chosen to run the service. If
more than one node has the same score, then one is chosen at
(pseudo)random. If no node has a non-negative score, then the
resource can't run anywhere.
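
As a rough illustration, the three rules in the constraints quoted
at the end of this message can be modelled in a few lines of
Python. This is a simplified sketch, not the CRM's placement code:
the numeric stand-in for INFINITY, the helper names and the second
node name (dmz2.scarceskills.com) are assumptions made for the
example, and ties are broken by dictionary order here rather than
pseudo-randomly.

```python
INFINITY = 1_000_000  # illustrative stand-in for the CRM's INFINITY

def add_scores(a, b):
    """Add two scores; -INFINITY beats everything, then +INFINITY."""
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    return a + b

def node_score(uname, pingd):
    """Score of group_1 on one node under the quoted constraints."""
    score = 0
    if uname == "dmz1.scarceskills.com":
        score = add_scores(score, 200)        # prefered_location_group_1
        score = add_scores(score, INFINITY)   # cli-prefer-rule-group_1
    if pingd <= 0:
        score = add_scores(score, -INFINITY)  # group_1:connected:rule
    return score

def choose_node(pingd_by_node):
    """Return the best eligible node, or None if no score is >= 0."""
    scores = {n: node_score(n, p) for n, p in pingd_by_node.items()}
    eligible = {n: s for n, s in scores.items() if s >= 0}
    if not eligible:
        return None  # "resource cannot run anywhere"
    return max(eligible, key=eligible.get)
```

In this model, the -INFINITY from lost connectivity overrides both
the 200 and the INFINITY preference for dmz1, so the group moves to
the other node. Once both nodes have pingd at 0, nothing is
eligible.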

> In any case - now that my node has no ping, and is fenced, I saw
> another bit of code called 'DoFencing' which I modified thinking it
> would now cause the node to commit suicide since it had no
> connectivity. But I've no idea about how it's meant to work... It's
> saying "your clone DoFencing is stonith via suicide" right?

I don't know until I see the code you're talking about. Typically,
though, stonith resources are configured to reboot other nodes,
not to commit suicide. There is a special stonith agent called
suicide for the latter purpose.

> What do the clone_max and clone_node_max mean?
> Is clone_max = 2, mean that there are a maximum of 2 nodes that use
> it? 2 stonith daemons that run on each node? What? Ditto for
> clone_node_max?

clone_max: The maximum number of instances of this clone in the cluster.

clone_node_max: The maximum number of instances of this clone on a single node.
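
A toy Python model of how the two limits interact. The round-robin
placement below is an assumption made for illustration (the real
allocation also takes scores and node state into account), and the
function and node names are hypothetical.

```python
def place_clone_instances(nodes, clone_max, clone_node_max):
    """Spread up to clone_max instances, at most clone_node_max per node."""
    placement = {node: 0 for node in nodes}
    remaining = clone_max
    for _ in range(clone_node_max):
        for node in nodes:
            if remaining == 0:
                return placement
            placement[node] += 1
            remaining -= 1
    return placement  # any instances still remaining cannot be placed
```

With clone_max=2 and clone_node_max=1 on a two-node cluster, each
node gets exactly one instance.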

> As for the operations on the DoFencing clone - what are they
> triggering? The timeouts are for what? the stonith daemon itself? Am I
> calling the stonith daemon itself to commit suicide? If so - why would
> I have a monitor or start operation?

This is admittedly a bit confusing. The start operation doesn't
do anything with the device, it just makes it available. The stop
operation is the opposite. In other words, in order for the
stonith device to be used it must first be started.

The monitor operation is essential because the cluster wants to
make sure that the stonith device is operational. Typically, it
consists of logging into the device and requesting some kind of
status.

The timeouts are for the operations on which they are defined.
The start operation implies a monitor.

> Do you need a constraint with a rule to 'start' this resource? ie.
> kill myself? Does it just 'know' to do this? I'm really not getting
> it.

Under some circumstances it is necessary to ensure that a node
has relinquished resources. A typical example is a failed stop
operation. In that case the CRM will issue a RESET or POWEROFF
request to the eligible stonith device.

> <clone id="DoFencing">
>  <instance_attributes>
>   <attributes>
>     <nvpair name="clone_max" value="2"/>
>     <nvpair name="clone_node_max" value="1"/>
>   </attributes>
>  </instance_attributes>
> <primitive class="stonith" id="child_DoFencing" type="suicide"
> provider="heartbeat">
>  <operations>
>   <op name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
>   <op name="start" timeout="20s" prereq="nothing"/>
>  </operations>
> </primitive>
> </clone>

The suicide stonith device is not exactly the best approach.
Ultimately it is not reliable, so it should not be used on
production clusters. If you can afford it, get a real (hardware)
stonith device.

> Intended actions:
> > node1 loses ping, (which in my world means that it's dead)
> > resources migrate to node2
> > node1 reboots (what I really want is for the fenced resource to be 'cleaned up' so that it can run again on this node - I'm not fussy about how I achieve that)
> > resource migrates back to node1 once ping (connectivity is restored).

Rebooting a node should imply a resource cleanup. In the next
release the cluster will also be able to "forget" about the
failure after some time.

> actual actions:
> > node1 loses ping,
> > resource migrates to node2.

And node1 is not rebooted? Then there's a problem with the
stonith setup. Any errors in the logs?

> > node2 loses ping but 'resource cannot run anywhere' ensues and both nodes are 'active' but no resources are being ran.
> 
> I think fundamentally my approach is wrong and that I should leave it
> to fail and have human intervention to clean it up rather than hope it
> will flip flop between nodes.

That depends on your needs of course. At any rate, it should be
possible to configure the cluster to fit those needs.

There is also the meatware stonith device which will prompt a
human to clean up/reboot.

> But - I'd like to have a better grasp of
> how V2 works in general before making the choice to fall back to a
> simpler config.

HTH.

Thanks,

Dejan

> -Peter
> 
> 
> Active / Passive set up.
> 2 nodes, one resource (ldirectord) balancing traffic for IP addresses
> on 2 web servers.
> 2 nics [eth0: dmz facing - eth1: crossover cable, on 10.0.0.1/2]
> 
> This relates to the previous post:
> "How can you clean up a degraded node w/out killing it (and not manually)?"
> 
> Versions:
> heartbeat-stonith-2.1.2-3.el4.centos
> heartbeat-pils-2.1.2-3.el4.centos
> heartbeat-ldirectord-2.1.2-3.el4.centos
> heartbeat-2.1.2-3.el4.centos

> <resources>
> 	<group id="group_1">
> 		<primitive class="ocf" id="IPaddr_212_140_130_37" provider="heartbeat" type="IPaddr">
> 			<operations>
> 				<op id="IPaddr_212_140_130_37_mon" interval="5s" name="monitor" timeout="5s"/>
> 			</operations>
> 			<instance_attributes id="IPaddr_212_140_130_37_inst_attr">
> 				<attributes>
> 					<nvpair id="IPaddr_212_140_130_37_attr_0" name="ip" value="212.140.130.37"/>
> 				</attributes>
> 			</instance_attributes>
> 		</primitive>
> 		<primitive class="ocf" id="IPaddr_212_140_130_38" provider="heartbeat" type="IPaddr">
> 			<operations>
> 				<op id="IPaddr_212_140_130_38_mon" interval="5s" name="monitor" timeout="5s"/>
> 			</operations>
> 			<instance_attributes id="IPaddr_212_140_130_38_inst_attr">
> 				<attributes>
> 					<nvpair id="IPaddr_212_140_130_38_attr_0" name="ip" value="212.140.130.38"/>
> 				</attributes>
> 			</instance_attributes>
> 		</primitive>
> 		<primitive class="ocf" id="ldirectord_3" provider="heartbeat" type="ldirectord">
> 			<operations>
> 				<op id="ldirectord_3_mon" interval="120s" name="monitor" timeout="60s"/>
> 			</operations>
> 			<instance_attributes id="ldirectord_3_inst_attr">
> 				<attributes>
> 					<nvpair id="ldirectord_3_attr_1" name="1" value="ldirectord.cf"/>
> 				</attributes>
> 			</instance_attributes>
> 		</primitive>
> 	</group>
> 	<clone id="pingd">
> 		<instance_attributes id="pingd">
> 			<attributes>
> 				<nvpair id="pingd-clone_node_max" name="clone_node_max" value="1"/>
> 			</attributes>
> 		</instance_attributes>
> 		<primitive id="pingd-child" provider="heartbeat" class="ocf" type="pingd">
> 			<operations>
> 				<op id="pingd-child-monitor" name="monitor" interval="20s" timeout="40s" prereq="nothing" on_fail="fence"/>
> 			</operations>
> 			<instance_attributes id="pingd_inst_attr">
> 				<attributes>
> 					<nvpair id="pingd-dampen" name="dampen" value="5s"/>
> 					<nvpair id="pingd-multiplier" name="multiplier" value="100"/>
> 				</attributes>
> 			</instance_attributes>
> 		</primitive>
> 	</clone>
> 	<clone id="DoFencing">
> 		<instance_attributes>
> 			<attributes>
> 				<nvpair name="clone_max" value="2"/>
> 				<nvpair name="clone_node_max" value="1"/>
> 			</attributes>
> 		</instance_attributes>
> 		<primitive id="child_DoFencing" class="stonith" type="suicide" provider="heartbeat">
> 			<operations>
> 				<op name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
> 				<op name="start" timeout="20s" prereq="nothing"/>
> 			</operations>
> 		</primitive>
> 	</clone>
> </resources>
> <constraints>
> 	<rsc_location rsc="group_1" id="rsc_location_group_1">
> 		<rule id="prefered_location_group_1" score="200">
> 			<expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="dmz1.scarceskills.com"/>
> 		</rule>
> 		<rule id="group_1:connected:rule" score="-INFINITY" boolean_op="and">
> 			<expression id="my_resource:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
> 		</rule>
> 	</rsc_location>
> 	<rsc_location id="cli-prefer-group_1" rsc="group_1">
> 		<rule id="cli-prefer-rule-group_1" score="INFINITY">
> 			<expression id="cli-prefer-expr-group_1" attribute="#uname" operation="eq" value="dmz1.scarceskills.com" type="string"/>
> 		</rule>
> 	</rsc_location>
> </constraints>

