[Linux-HA] "Clones, Stonith and Suicide" The SysAdmin who had a nervous breakdown.

Peter Farrell peter.d.farrell at gmail.com
Tue Oct 2 10:17:38 MDT 2007


Can someone verify my CIB please?

It's not working as intended and the more I read the less I understand...
I've stared at the config for the past 2 days hoping to be struck by
sudden understanding... hasn't happened yet.

I don't understand how you make a rule, and then call that rule as a
result of an action. I used the bit from the pingd FAQ page:
http://www.linux-ha.org/v2/faq/pingd
"Quickstart - Only Run my_resource on Nodes with Access to at Least
One Ping Node"
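
For reference, the constraint in that FAQ section looks roughly like the sketch below. It is reconstructed from the FAQ, not copied from my attached CIB. `my_resource` and every `id` value are placeholders, and `pingd` is assumed to be the attribute name the pingd resource writes (its default, as far as I know).

```xml
<!-- Sketch only: my_resource and all id values are placeholders. -->
<rsc_location id="my_resource_connected" rsc="my_resource">
  <!-- A score of -INFINITY bans the resource from any node matching the rule. -->
  <rule id="my_resource_connected_rule" score="-INFINITY" boolean_op="or">
    <!-- Matches a node where pingd has never reported a value... -->
    <expression id="my_resource_connected_undefined"
                attribute="pingd" operation="not_defined"/>
    <!-- ...or where it reports that no ping node is reachable. -->
    <expression id="my_resource_connected_zero"
                attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```

If I read it correctly, nothing "calls" this rule. The policy engine re-evaluates it whenever the pingd attribute changes in the CIB, and that is independent of the monitor operation's on_fail setting.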

So - for my pingd clone, the operation is 'monitor' and 'on_fail=fence'
<op id="pingd-child-monitor" name="monitor" interval="20s"
timeout="40s" prereq="nothing" on_fail="fence"/>

I assume that this literally means:
"ask the LRM to check every 20s whether pingd is running; if pingd is
not running after 40s, call it 'failed', and because it has 'failed',
fence it off, which forces the resource to migrate to another node,
marks this one as 'degraded', and will not allow the resource to run
here until it's been cleaned up"

Is that right? If so, then this bit I'm OK with.

But the 'dampen' and 'multiplier' settings I don't get.
<nvpair id="pingd-dampen" name="dampen" value="5s"/>
Does this mean: once pingd says "there's nothing out there", wait 5
seconds before accepting it, and only then write it out to the CIB and
let any actions take place?
<nvpair id="pingd-multiplier" name="multiplier" value="100"/>
This is a weighted score thing right? It's adding 100 to each node
that 'can' ping?
So if one can't ping, then the score gets knocked down and the
resource wants to move to a "higher scoring" node?? I completely don't
understand this... What if you already have a constraint set for a
node preference, does this override it? Conflict with it?
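
To make the question concrete, here is a sketch of how I think the scores would combine. It rests on several assumptions I have not verified: that pingd sets the attribute to (reachable ping nodes x multiplier), that location scores for the same resource and node are added together, that there is exactly one ping node, and that stickiness is 0. `my_resource`, `node1` and all `id` values are placeholders.

```xml
<!-- Sketch only: my_resource, node1 and all id values are placeholders. -->

<!-- An ordinary node preference: node1 gets +50 for my_resource. -->
<rsc_location id="my_resource_prefer_node1" rsc="my_resource">
  <rule id="my_resource_prefer_node1_rule" score="50">
    <expression id="my_resource_prefer_node1_expr"
                attribute="#uname" operation="eq" value="node1"/>
  </rule>
</rsc_location>

<!-- Connectivity: each node is scored with its own pingd value. -->
<rsc_location id="my_resource_connectivity" rsc="my_resource">
  <rule id="my_resource_connectivity_rule" score_attribute="pingd">
    <expression id="my_resource_connectivity_expr"
                attribute="pingd" operation="defined"/>
  </rule>
</rsc_location>
```

Under those assumptions, with multiplier=100: while both nodes can ping, node1 scores 50 + 100 = 150 and node2 scores 100, so the resource stays on node1. If node1 loses ping, it drops to 50 + 0 = 50 against node2's 100, and the resource should move. So the preference would not be overridden outright, only outweighed if the multiplier is larger than the preference.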

In any case - now that my node has no ping, and is fenced, I saw
another bit of code called 'DoFencing' which I modified thinking it
would now cause the node to commit suicide since it had no
connectivity. But I've no idea about how it's meant to work... It's
saying "your clone DoFencing is stonith via suicide" right?
What do clone_max and clone_node_max mean?
Does clone_max = 2 mean that a maximum of 2 nodes use it? That 2
stonith daemons run on each node? Something else? Ditto for
clone_node_max.

As for the operations on the DoFencing clone - what are they
triggering? What are the timeouts for? The stonith daemon itself? Am I
calling the stonith daemon itself to commit suicide? If so, why would
I have a monitor or start operation?
Do you need a constraint with a rule to 'start' this resource, i.e.
kill myself? Does it just 'know' to do this? I'm really not getting
it.

<clone id="DoFencing">
  <instance_attributes>
    <attributes>
      <nvpair name="clone_max" value="2"/>
      <nvpair name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive class="stonith" id="child_DoFencing" type="suicide" provider="heartbeat">
    <operations>
      <op name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
      <op name="start" timeout="20s" prereq="nothing"/>
    </operations>
  </primitive>
</clone>


Intended actions:
> node1 loses ping (which in my world means that it's dead)
> resources migrate to node2
> node1 reboots (what I really want is for the fenced resource to be 'cleaned up' so that it can run again on this node - I'm not fussy about how I achieve that)
> resource migrates back to node1 once ping (connectivity) is restored.

Actual actions:
> node1 loses ping
> resource migrates to node2
> node2 loses ping, but then 'resource cannot run anywhere' ensues: both nodes are 'active' but no resources are being run.

I think my approach is fundamentally wrong and that I should let it
fail and rely on human intervention to clean it up, rather than hope it
will flip-flop between nodes. But I'd like to have a better grasp of
how V2 works in general before choosing to fall back to a
simpler config.

-Peter


Active/passive setup.
2 nodes, one resource (ldirectord) balancing traffic for IP addresses
on 2 web servers.
2 NICs [eth0: DMZ-facing; eth1: crossover cable, on 10.0.0.1/2]

This relates to the previous post:
"How can you clean up a degraded node w/out killing it (and not manually)?"

Versions:
heartbeat-stonith-2.1.2-3.el4.centos
heartbeat-pils-2.1.2-3.el4.centos
heartbeat-ldirectord-2.1.2-3.el4.centos
heartbeat-2.1.2-3.el4.centos
-------------- next part --------------
A non-text attachment was scrubbed...
Name: working_cib_20070926.1432.xml
Type: text/xml
Size: 3144 bytes
Desc: not available
Url : http://lists.community.tummy.com/pipermail/linux-ha/attachments/20071002/b3e331e4/working_cib_20070926.1432.bin