[Linux-HA] "Clones, Stonith and Suicide" The SysAdmin who had a nervous breakdown.

Peter Farrell peter.d.farrell at gmail.com
Tue Oct 2 15:55:03 MDT 2007


On 02/10/2007, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> Hi,
>
> On Tue, Oct 02, 2007 at 05:17:38PM +0100, Peter Farrell wrote:
> > Can someone verify my CIB please?
> >
> > It's not working as intended and the more I read the less I understand...
> > I've stared at the config for the past 2 days hoping to be struck by
> > sudden understanding... hasn't happened yet.
>
> Don't worry, the learning curve is extremely steep. We all need
> quite some patience.
>
> > I don't understand how you make a rule, and then call that rule as a
> > result of an action. I used the bit from the pingd FAQ page:
> > http://www.linux-ha.org/v2/faq/pingd
> > "Quickstart - Only Run my_resource on Nodes with Access to at Least
> > One Ping Node"
> >
> > So - for my pingd clone, the operation is 'monitor' and 'on_fail=fence'
> > <op id="pingd-child-monitor" name="monitor" interval="20s"
> > timeout="40s" prereq="nothing" on_fail="fence"/>
> >
> > I assume that this literally means:
> > "ask the LRM to see if pingd is running every 20s, if after 40s pingd
> > is not running, call it 'failed', and as it's 'failed' - fence it off,
> > which forces the resource to migrate to another node and marks this
> > one as 'degraded' and will not allow resource to run until it's been
> > cleaned up"
> >
> > Is that right? If so, then this bit I'm OK with.
>
> No, not exactly. The monitor operation may fail (i.e. the
> resource agent says that the resource isn't running) or timeout
> (that's what you described). Of course, both are considered to be
> failures by CRM. on_fail=fence means that if this operation
> fails, the node will be fenced, i.e. rebooted if you have an
> operational stonith device. Perhaps a tad harsh for a monitor
> failure.

1. The approach for me (this is a test cluster - but I want to use it
to replace a production one) is: if either of the load balancers
can't ping one or two routers in my DMZ, then that box must be dead.
I figured if it can't see the router - how the hell can it see the
apache servers it's meant to be managing?
Is this 'correct political thought' or a sloppy foundation to begin with?

2. I didn't know that fence meant 'rebooted'. I thought it was sort of
'fenced off' and left in a degraded state should someone want to poke
around a bit.
RE: Perhaps a tad harsh for a monitor failure - I agree. But what's a
girl to do?
Am I on the right track here? Do I want it rebooting? Do I just want
Heartbeat to restart? Does it matter? If it comes up and the link is
still dead - will it cycle forever w/ reboots? (My guess at a softer
setting is sketched after question 3 below.)

3. The real bit I'm missing: let's say I want the node rebooted when
it's fenced. What 'commands it' to do so? Just the flag
'on_fail=fence'? Does that automatically look for a started stonith
device or resource and, if it finds one, just use it? I mean - how
does the stonith suicide resource (which doesn't work - but suppose
for a minute it did) get connected to another operational directive?
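(Re: question 2 - if a reboot is overkill, I'm guessing the monitor
op could be softened to something like this, assuming 'restart' is a
legal on_fail value, which I haven't verified:

<op id="pingd-child-monitor" name="monitor" interval="20s"
 timeout="40s" prereq="nothing" on_fail="restart"/>

- or maybe on_fail could be dropped entirely and the -INFINITY rule
left to do the moving.)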

> > But - the 'dampen and multiplier' - I don't get.
> > <nvpair id="pingd-dampen" name="dampen" value="5s"/>
> > Does this mean: Wait 5 seconds before saying "yep - pingd says there's
> > nothing out there, once pingd says 'there's nothing out there;?" Now
> > write it out to the CIB and let any actions take place?
>
> Yes, the cluster sort of stands back a bit until everything
> settles.
>
> > <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
> > This is a weighted score thing right? It's adding 100 to each node
> > that 'can' ping?
>
> Right.

Can you control the frequency of the pings themselves? What
constitutes a timeout in this case?  (n) packets lost? latency?
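And checking my understanding of the scoring bit below with my actual
numbers: with (say) two ping nodes and multiplier=100, the pingd
attribute on a node should be 200 if it can reach both routers, 100
if it can reach one, and 0 if it can reach neither. My constraint
gives group_1 a score of 200 on dmz1, and the second rule adds
-INFINITY when pingd is lte 0 (ignoring the cli-prefer-group_1
constraint that's also in there) - so dmz1 keeps winning as long as
it can ping at least one router, and the group only gets thrown off
when it can ping neither. Is that the right way to read it?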

> > So if one can't ping, then the score gets knocked down and the
> > resource wants to move to a "higher scoring" node?? I completely don't
> > understand this... What if you already have a constraint set for a
> > node preference, does this override it? Conflict with it?
>
> The node with the highest score is chosen to run the service. If
> there's more than one with the same score, then one's chosen at
> (pseudo)random. If no score is non-negative then the resource
> can't run anywhere.
>
> > In any case - now that my node has no ping, and is fenced, I saw
> > another bit of code called 'DoFencing' which I modified thinking it
> > would now cause the node to commit suicide since it had no
> > connectivity. But I've no idea about how it's meant to work... It's
> > saying "your clone DoFencing is stonith via suicide" right?
>
> I don't know until I see the code you're talking about. Typically
> though stonith resources are configured to reboot other nodes and
> not commit suicides. There's a special stonith agent called
> suicide for this purpose.
>
> > What do the clone_max and clone_node_max mean?
> > Is clone_max = 2, mean that there are a maximum of 2 nodes that use
> > it? 2 stonith daemons that run on each node? What? Ditto for
> > clone_node_max?
>
> clone_max: The maximum number of instances of this clone in the cluster.

What is the guidance for this? Should you have one per machine, or
one for the whole cluster?

> clone_node_max: The maximum number of instances of this clone at one node.

Ditto above: Should I have a stonith clone per resource / per node? Or
just one?
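My working assumption for this 2-node cluster is one instance on each
node, i.e.:

<nvpair name="clone_max" value="2"/>
<nvpair name="clone_node_max" value="1"/>

which is what DoFencing has now - but I don't know whether 'one
suicide clone per node' is actually the right pattern, so that's a
guess.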

> > As for the operations on the DoFencing clone - what are they
> > triggering? The timeouts are for what? the stonith daemon itself? Am I
> > calling the stonith daemon itself to commit suicide? If so - why would
> > I have a monitor or start operation?
>
> This is admittedly a bit confusing. The start operation doesn't
> do anything with the device, just makes it available. The stop
> operation is the opposite. In other words, in order for the
> stonith device to be used it must first be started.

So for any stonith resource (using the suicide / ssh methods) I'll
always want monitor, start & stop?
Monitor so the cluster can check it's operational, start to make it
available - and is stop effectively the 'reboot' bit?
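i.e. something along these lines (my guess at the op set and
timeouts, not a tested config):

<operations>
 <op id="DoFencing-mon" name="monitor" interval="20s" timeout="20s"
     prereq="nothing"/>
 <op id="DoFencing-start" name="start" timeout="20s" prereq="nothing"/>
 <op id="DoFencing-stop" name="stop" timeout="20s"/>
</operations>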

> The monitor operation is essential because the cluster wants to
> make sure that the stonith device is operational. Typically, it
> consists of logging into the device and requesting some kind of
> status.
>
> The timeouts are for the operations on which they are defined.
> The start operation implies a monitor.
>
> > Do you need a constraint with a rule to 'start' this resource? ie.
> > kill myself? Does it just 'know' to do this? I'm really not getting
> > it.
>
> Under some circumstances it is necessary to ensure that a node
> has relinquished resources. A typical example is a failed stop
> operation. In that case the CRM will issue a RESET or POWEROFF
> request to the eligible stonith device.

So - the previous 'on_fail=fence' for the pingd clone - where would
that go ideally?
(I mean - on which operation?)
pingd needs a monitor and needs a start. Does it need a stop?
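My guess at the answer, just so there's something concrete to shoot
down: keep on_fail="fence" on the monitor op and give the other ops
their own timeouts, e.g.

<op id="pingd-child-monitor" name="monitor" interval="20s"
 timeout="40s" on_fail="fence"/>
<op id="pingd-child-start" name="start" timeout="40s"/>
<op id="pingd-child-stop" name="stop" timeout="40s"/>

- but that may well be the wrong op to hang it on.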

> > <clone id="DoFencing">
> >  <instance_attributes>
> >   <attributes>
> >     <nvpair name="clone_max" value="2"/>
> >     <nvpair name="clone_node_max" value="1"/>
> >   </attributes>
> >  </instance_attributes>
> > <primitive class="stonith" id="child_DoFencing" type="suicide"
> > provider="heartbeat">
> >  <operations>
> >   <op name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
> >   <op name="start" timeout="20s" prereq="nothing"/>
> >  </operations>
> > </primitive>
> > </clone>
>
> The suicide stonith device is not exactly the best approach.
> Ultimately it is not reliable, so it should not be used on the
> production clusters. If you can afford it, get a real (hardware)
> stonith device.

Can't. No budget. Advice taken - I'll have to kill these via SSH or
suicide.
I set up ssh keys for every user - root, haclient, hacluster - but
they always fail authentication.
How can you tell which user / method it's using? And can you set
which interface it uses, to force the ssh traffic down the crossover
cables?
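For the record, what I was aiming at was roughly this (external/ssh
with a hostlist - the ids and values are illustrative, not a working
config):

<primitive id="child_DoFencing" class="stonith" type="external/ssh">
 <operations>
  <op id="DoFencing-ssh-mon" name="monitor" interval="20s"
      timeout="20s" prereq="nothing"/>
  <op id="DoFencing-ssh-start" name="start" timeout="20s"
      prereq="nothing"/>
 </operations>
 <instance_attributes id="DoFencing-ssh-ia">
  <attributes>
   <nvpair id="DoFencing-ssh-hostlist" name="hostlist"
           value="node1 node2"/>
  </attributes>
 </instance_attributes>
</primitive>

with "node1 node2" replaced by the real node names, and root's ssh
key exchanged between the nodes - my understanding is that the ssh
plugins connect as root rather than haclient/hacluster, but I'd be
glad to be corrected.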

>
> > Intended actions:
> > > node1 loses ping, (which in my world means that it's dead)
> > > resources migrate to node2
> > > node1 reboots (what I really want is for the fenced resource to be 'cleaned up' so that it can run again on this node - I'm not fussy about how I achieve that)
> > > resource migrates back to node1 once ping (connectivity is restored).
>
> Rebooting a node should imply a resource cleanup. In the next
> release the cluster will also be able to "forget" after some time
> about the failure.
>
> > actual actions:
> > > node1 loses ping,
> > > resource migrates to node2.
>
> And the node1 is not rebooted? Then there's a problem with the
> stonith setup. Any errors in logs?

It's never called. I've cocked up the config by experimenting via
'cut-n-paste' rather than taking the time to understand the thing
properly. Having said that, I've read the docs, watched Alan's talk
from down under, trawled the lists and re-arranged others' configs,
but it's still pretty random. Plus it's been 2 weeks and I'm an
instant gratification kind of guy, so I'm out of my comfort zone and
getting a little pissed (w/ myself) now :-)

I just don't get how it (stonith) is called in relation to another
resource failing - the mechanism, the relationships. It's not just
stonith: for example, if the ping failed and I wanted to start apache
on a cluster node to take over all the IP addresses and serve up a
'temp. out of service' page, I wouldn't have the foggiest.
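(If I had to guess, it would be the same pingd-attribute trick, e.g.
a location rule on a hypothetical 'apache_sorry' resource that bans
it from any node that still has connectivity, something like:

<rsc_location id="sorry_page_location" rsc="apache_sorry">
 <rule id="sorry_page_rule" score="-INFINITY">
  <expression id="sorry_page_expr" attribute="pingd"
              operation="gt" value="0"/>
 </rule>
</rsc_location>

but that's a guess at the pattern, not something I've tried.)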

>
> > > node2 loses ping but 'resource cannot run anywhere' ensues and both nodes are 'active' but no resources are being run.
> >
> > I think fundamentally my approach is wrong and that I should leave it
> > to fail and have human intervention to clean it up rather than hope it
> > will flip flop between nodes.
>
> That depends on your needs of course. At any rate, it should be
> possible to configure the cluster to fit those needs.
>
> There is also the meatware stonith device which will prompt a
> human to clean up/reboot.
>
> > But - I'd like to have a better grasp of
> > how V2 works in general before making the choice to fall back to a
> > simpler config.
>
> HTH.

It has. Thanks a lot.

-Peter

> Thanks,
>
> Dejan
>
> > -Peter
> >
> >
> > Active / Passive set up.
> > 2 nodes, one resource (ldirectord) balancing traffic for IP addresses
> > on 2 web servers.
> > 2 nics [eth0: dmz facing - eth1: crossover cable, on 10.0.0.1/2]
> >
> > This relates to the previous post:
> > "How can you clean up a degraded node w/out killing it (and not manually)?"
> >
> > Versions:
> > heartbeat-stonith-2.1.2-3.el4.centos
> > heartbeat-pils-2.1.2-3.el4.centos
> > heartbeat-ldirectord-2.1.2-3.el4.centos
> > heartbeat-2.1.2-3.el4.centos
>
> > <resources>
> >       <group id="group_1">
> >               <primitive class="ocf" id="IPaddr_212_140_130_37" provider="heartbeat" type="IPaddr">
> >                       <operations>
> >                               <op id="IPaddr_212_140_130_37_mon" interval="5s" name="monitor" timeout="5s"/>
> >                       </operations>
> >                       <instance_attributes id="IPaddr_212_140_130_37_inst_attr">
> >                               <attributes>
> >                                       <nvpair id="IPaddr_212_140_130_37_attr_0" name="ip" value="212.140.130.37"/>
> >                               </attributes>
> >                       </instance_attributes>
> >               </primitive>
> >               <primitive class="ocf" id="IPaddr_212_140_130_38" provider="heartbeat" type="IPaddr">
> >                       <operations>
> >                               <op id="IPaddr_212_140_130_38_mon" interval="5s" name="monitor" timeout="5s"/>
> >                       </operations>
> >                       <instance_attributes id="IPaddr_212_140_130_38_inst_attr">
> >                               <attributes>
> >                                       <nvpair id="IPaddr_212_140_130_38_attr_0" name="ip" value="212.140.130.38"/>
> >                               </attributes>
> >                       </instance_attributes>
> >               </primitive>
> >               <primitive class="ocf" id="ldirectord_3" provider="heartbeat" type="ldirectord">
> >                       <operations>
> >                               <op id="ldirectord_3_mon" interval="120s" name="monitor" timeout="60s"/>
> >                       </operations>
> >                       <instance_attributes id="ldirectord_3_inst_attr">
> >                               <attributes>
> >                                       <nvpair id="ldirectord_3_attr_1" name="1" value="ldirectord.cf"/>
> >                               </attributes>
> >                       </instance_attributes>
> >               </primitive>
> >       </group>
> >       <clone id="pingd">
> >               <instance_attributes id="pingd">
> >                       <attributes>
> >                               <nvpair id="pingd-clone_node_max" name="clone_node_max" value="1"/>
> >                       </attributes>
> >               </instance_attributes>
> >               <primitive id="pingd-child" provider="heartbeat" class="ocf" type="pingd">
> >                       <operations>
> >                               <op id="pingd-child-monitor" name="monitor" interval="20s" timeout="40s" prereq="nothing" on_fail="fence"/>
> >                       </operations>
> >                       <instance_attributes id="pingd_inst_attr">
> >                               <attributes>
> >                                       <nvpair id="pingd-dampen" name="dampen" value="5s"/>
> >                                       <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
> >                               </attributes>
> >                       </instance_attributes>
> >               </primitive>
> >       </clone>
> >       <clone id="DoFencing">
> >               <instance_attributes>
> >                       <attributes>
> >                               <nvpair name="clone_max" value="2"/>
> >                               <nvpair name="clone_node_max" value="1"/>
> >                       </attributes>
> >               </instance_attributes>
> >               <primitive id="child_DoFencing" class="stonith" type="suicide" provider="heartbeat">
> >                       <operations>
> >                               <op name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
> >                               <op name="start" timeout="20s" prereq="nothing"/>
> >                       </operations>
> >               </primitive>
> >       </clone>
> > </resources>
> > <constraints>
> >       <rsc_location rsc="group_1" id="rsc_location_group_1">
> >               <rule id="prefered_location_group_1" score="200">
> >                       <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="dmz1.scarceskills.com"/>
> >               </rule>
> >               <rule id="group_1:connected:rule" score="-INFINITY" boolean_op="and">
> >                       <expression id="my_resource:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
> >               </rule>
> >       </rsc_location>
> >       <rsc_location id="cli-prefer-group_1" rsc="group_1">
> >               <rule id="cli-prefer-rule-group_1" score="INFINITY">
> >                       <expression id="cli-prefer-expr-group_1" attribute="#uname" operation="eq" value="dmz1.scarceskills.com" type="string"/>
> >               </rule>
> >       </rsc_location>
> > </constraints>
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA at lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>


