[Linux-HA] "Clones, Stonith and Suicide" The SysAdmin who had a nervous breakdown.
Dejan Muhamedagic
dejanmm at fastmail.fm
Wed Oct 3 07:11:22 MDT 2007
Hi,
On Tue, Oct 02, 2007 at 10:55:03PM +0100, Peter Farrell wrote:
> On 02/10/2007, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> > Hi,
> >
> > On Tue, Oct 02, 2007 at 05:17:38PM +0100, Peter Farrell wrote:
> > > Can someone verify my CIB please?
> > >
> > > It's not working as intended and the more I read the less I understand...
> > > I've stared at the config for the past 2 days hoping to be struck by
> > > sudden understanding... hasn't happened yet.
> >
> > Don't worry, the learning curve is extremely steep. We all need
> > quite some patience.
> >
> > > I don't understand how you make a rule, and then call that rule as a
> > > result of an action. I used the bit from the pingd FAQ page:
> > > http://www.linux-ha.org/v2/faq/pingd
> > > "Quickstart - Only Run my_resource on Nodes with Access to at Least
> > > One Ping Node"
> > >
> > > So - for my pingd clone, the operation is 'monitor' and 'on_fail=fence'
> > > <op id="pingd-child-monitor" name="monitor" interval="20s"
> > > timeout="40s" prereq="nothing" on_fail="fence"/>
> > >
> > > I assume that this literally means:
> > > "ask the LRM to see if pingd is running every 20s, if after 40s pingd
> > > is not running, call it 'failed', and as it's 'failed' - fence it off,
> > > which forces the resource to migrate to another node and marks this
> > > one as 'degraded' and will not allow resource to run until it's been
> > > cleaned up"
> > >
> > > Is that right? If so, then this bit I'm OK with.
> >
> > No, not exactly. The monitor operation may fail (i.e. the
> > resource agent says that the resource isn't running) or timeout
> > (that's what you described). Of course, both are considered to be
> > failures by CRM. on_fail=fence means that if this operation
> > fails, the node will be fenced, i.e. rebooted if you have an
> > operational stonith device. Perhaps a tad harsh for a monitor
> > failure.
>
> 1. The approach for me is (this is a test cluster - but I want to use
> it to replace a production one) - if either of the load balancers
> can't ping one or two routers in my DMZ, then this must mean they're
> dead. I figured if they can't see the router - how the hell can they
> see the apache servers they're meant to be managing?
> Is this 'correct political thought' or a sloppy foundation to begin with?
It's just that the resources _are_ going to move. No need to kill
the cooperating node.
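
A rough sketch of what I mean, assuming the -INFINITY rule on the
pingd attribute from your constraints stays in place. Everything is
copied from your config; only the on_fail attribute is dropped, so the
cluster falls back to its default handling of a failed monitor:

```xml
<!-- pingd monitor without fencing: a failed or timed-out monitor
     no longer asks for the node to be rebooted -->
<op id="pingd-child-monitor" name="monitor" interval="20s"
    timeout="40s" prereq="nothing"/>
```

Moving group_1 away from a node that can't ping is then left to the
location rule (score -INFINITY when pingd is lte 0). As I understand
it, the monitor only tells you whether the pingd daemon itself is
running, not whether the pings get through.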
> 2. I didn't know that fence meant 'rebooted'. I thought it was sort of
> 'fenced off' and left in a degraded state should someone want to poke
> around a bit.
> RE: Perhaps a tad harsh for a monitor failure - I agree. But what's a
> girl to do?
> Am I on the right track here? Do I want it rebooting? Do I just want
> Heartbeat to restart? Does it matter? If it comes up and the link is
> still dead - will it cycle forever w/ reboots?
Not sure, but could be. Whenever a node comes up all resources
are probed, i.e. one monitor operation is fired.
> 3. the real bit I'm missing: Let's say I want it rebooted after
> fencing.
Fencing _is_ rebooting.
> What 'commands it' to do so? Just the flag 'on_fail=fence'?
Yes. Other things can trigger it too. For example, if a node has
quorum but cannot establish the state of another node, it kills the
other node just to make sure.
> Does that automatically look for a started stonith device or resource
> and if it finds one, it just uses it?
Yes.
> I mean - how does the stonith
> suicide (which doesn't work - but suppose for a minute it did) - how
> is it connected to another operational directive?
Not sure why this confuses you. Once a decision has been
reached that a node should be fenced (rebooted), the cluster will
try to find a means to do that. That means is a stonith resource.
> > > But - the 'dampen and multiplier' - I don't get.
> > > <nvpair id="pingd-dampen" name="dampen" value="5s"/>
> > > Does this mean: Wait 5 seconds before saying "yep - pingd says there's
> > > nothing out there, once pingd says 'there's nothing out there;?" Now
> > > write it out to the CIB and let any actions take place?
> >
> > Yes, the cluster sort of stands back a bit until everything
> > settles.
> >
> > > <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
> > > This is a weighted score thing right? It's adding 100 to each node
> > > that 'can' ping?
> >
> > Right.
>
> Can you control the frequency of the pings themselves? What
> constitutes a timeout in this case? (n) packets lost? latency?
I don't know. But it's supposed to do the "right thing".
> > > So if one can't ping, then the score gets knocked down and the
> > > resource wants to move to a "higher scoring" node?? I completely don't
> > > understand this... What if you already have a constraint set for a
> > > node preference, does this override it? Conflict with it?
> >
> > The node with the highest score is chosen to run the service. If
> > more than one node has the same score, one of them is chosen at
> > (pseudo)random. If every node's score is negative, the resource
> > can't run anywhere.
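
To make the scoring a bit more concrete, here is a sketch of a
location rule that adds the pingd value to a node's score instead of
banning nodes outright. It follows the pattern from the pingd FAQ
page; the ids are made up for this example, and I'm assuming the
attribute is named pingd, as in your existing rule:

```xml
<rsc_location id="group_1_connectivity" rsc="group_1">
  <!-- hypothetical ids; score_attribute takes the rule's score
       from the node's current pingd value -->
  <rule id="group_1_connectivity_rule" score_attribute="pingd">
    <expression id="group_1_connectivity_expr" attribute="pingd"
                operation="defined"/>
  </rule>
</rsc_location>
```

With multiplier=100 and one reachable ping node this should contribute
100, which is added to whatever your other rules give that node (for
example the 200 for dmz1.scarceskills.com). So, as far as I know, a
node preference doesn't conflict with it; the scores just add up.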
> >
> > > In any case - now that my node has no ping, and is fenced, I saw
> > > another bit of code called 'DoFencing' which I modified thinking it
> > > would now cause the node to commit suicide since it had no
> > > connectivity. But I've no idea about how it's meant to work... It's
> > > saying "your clone DoFencing is stonith via suicide" right?
> >
> > I don't know until I see the code you're talking about. Typically
> > though stonith resources are configured to reboot other nodes and
> > not commit suicides. There's a special stonith agent called
> > suicide for this purpose.
> >
> > > What do the clone_max and clone_node_max mean?
> > > Is clone_max = 2, mean that there are a maximum of 2 nodes that use
> > > it? 2 stonith daemons that run on each node? What? Ditto for
> > > clone_node_max?
> >
> > clone_max: The maximum number of instances of this clone in the cluster.
>
> What is the guidance for this? Should you have one per machine? One in general?
There's no guidance. Clones are just useful if you want to have
more than one instance of a resource. Typically this is set to
the number of nodes.
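
For your two-node cluster that could look roughly like this for the
pingd clone. This is only a sketch: your config sets clone_node_max
but not clone_max, so the clone_max line is my addition, and
"pingd_clone_attr" is a made-up id:

```xml
<clone id="pingd">
  <!-- "pingd_clone_attr" is a hypothetical id, chosen so that it
       does not repeat the clone's own id -->
  <instance_attributes id="pingd_clone_attr">
    <attributes>
      <!-- at most two instances in the cluster, one per node -->
      <nvpair id="pingd-clone_max" name="clone_max" value="2"/>
      <nvpair id="pingd-clone_node_max" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <!-- the pingd-child primitive goes here, unchanged -->
</clone>
```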
> > clone_node_max: The maximum number of instances of this clone at one node.
>
> Ditto above: Should I have a stonith clone per resource / per node? Or
> just one?
One typical example for clones is an NFS filesystem. If one wants
it mounted on all nodes, a cloned Filesystem resource suffices.
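
A minimal sketch of such a clone, assuming two nodes. The ids, the
NFS server, the export and the mount point are all placeholders, and
the monitor timings are simply borrowed from your pingd op:

```xml
<clone id="nfs_mount_clone">
  <instance_attributes id="nfs_mount_clone_attr">
    <attributes>
      <nvpair id="nfs_mount_clone_max" name="clone_max" value="2"/>
      <nvpair id="nfs_mount_clone_node_max" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="nfs_mount" class="ocf" provider="heartbeat" type="Filesystem">
    <operations>
      <op id="nfs_mount_mon" name="monitor" interval="20s" timeout="40s"/>
    </operations>
    <instance_attributes id="nfs_mount_attr">
      <attributes>
        <!-- placeholder server, export and mount point -->
        <nvpair id="nfs_mount_device" name="device" value="nfsserver:/export"/>
        <nvpair id="nfs_mount_directory" name="directory" value="/mnt/shared"/>
        <nvpair id="nfs_mount_fstype" name="fstype" value="nfs"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
```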
> > > As for the operations on the DoFencing clone - what are they
> > > triggering? The timeouts are for what? the stonith daemon itself? Am I
> > > calling the stonith daemon itself to commit suicide? If so - why would
> > > I have a monitor or start operation?
> >
> > This is admittedly a bit confusing. The start operation doesn't
> > do anything with the device, just makes it available. The stop
> > operation is the opposite. In other words, in order for the
> > stonith device to be used it must first be started.
>
> So for any stonith resource, (using suicide / ssh methods) I'll always
> want to have
> monitor, start & stop?
Just start and monitor. Normally, you don't have to use stop.
> Monitor for the cluster to use it, start to see it and stop is
> effectively the 'reboot' bit?
No, the stop bit is to stop the stonith resource.
> > The monitor operation is essential because the cluster wants to
> > make sure that the stonith device is operational. Typically, it
> > consists of logging into the device and requesting some kind of
> > status.
> >
> > The timeouts are for the operations on which they are defined.
> > The start operation implies a monitor.
> >
> > > Do you need a constraint with a rule to 'start' this resource? ie.
> > > kill myself? Does it just 'know' to do this? I'm really not getting
> > > it.
> >
> > Under some circumstances it is necessary to ensure that a node
> > has relinquished resources. A typical example is a failed stop
> > operation. In that case the CRM will issue a RESET or POWEROFF
> > request to the eligible stonith device.
>
> So - the previous 'on_fail=fence' for the pingd clone - where would
> that go ideally?
It's really simple: on_fail tells the cluster what to do if the
operation it is set on fails.
> (I mean - on which operation?)
> Ping needs a monitor and needs a start. Does it need a stop?
No.
> > > <clone id="DoFencing">
> > > <instance_attributes>
> > > <attributes>
> > > <nvpair name="clone_max" value="2"/>
> > > <nvpair name="clone_node_max" value="1"/>
> > > </attributes>
> > > </instance_attributes>
> > > <primitive class="stonith" id="child_DoFencing" type="suicide"
> > > provider="heartbeat">
> > > <operations>
> > > <op name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
> > > <op name="start" timeout="20s" prereq="nothing"/>
> > > </operations>
> > > </primitive>
> > > </clone>
> >
> > The suicide stonith device is not exactly the best approach.
> > Ultimately it is not reliable, so it should not be used on the
> > production clusters. If you can afford it, get a real (hardware)
> > stonith device.
>
> Can't. No budget. Advice taken - I'll have to kill these via SSH or suicide.
Note that once the cluster decides to stonith (reset) a node, it will
keep trying until it succeeds. Hence, if your stonith device is not
operational at that time, the cluster will basically block. That's
also why using ssh as a stonith device is dangerous. For example, if
the other node's power supply fails, ssh can never reach it, the reset
never succeeds, and the surviving node will never take over the
resources.
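
If you still want to try ssh on the test cluster, the DoFencing clone
might look roughly like this. The clone and op settings are copied
from your config. The hostlist parameter, its comma-separated form and
the second hostname (dmz2.scarceskills.com) are my assumptions, so
check them against what "stonith -t ssh -n" reports on your version:

```xml
<clone id="DoFencing">
  <instance_attributes>
    <attributes>
      <nvpair name="clone_max" value="2"/>
      <nvpair name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="child_DoFencing" class="stonith" type="ssh" provider="heartbeat">
    <operations>
      <op name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
      <op name="start" timeout="20s" prereq="nothing"/>
    </operations>
    <instance_attributes>
      <attributes>
        <!-- assumed parameter name and format; dmz2 is a guess
             at the name of your second node -->
        <nvpair name="hostlist"
                value="dmz1.scarceskills.com,dmz2.scarceskills.com"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
```

The warning above still applies: this can only reset a node that is
alive enough to accept an ssh login.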
> I set up ssh keys for every user, root, haclient, hacluster - they
> always fail authentication.
> How can you tell which user / method it's using?
ssh uses the root user. You should check for yourself whether it
works without a password.
> Can you set which
> interface they use (in order to force it (ssh) down the crossover
> cables?)
No. It's just as if you ran ssh on the command line.
> >
> > > Intended actions:
> > > > node1 loses ping, (which in my world means that it's dead)
> > > > resources migrate to node2
> > > > node1 reboots (what I really want is for the fenced resource to be 'cleaned up' so that it can run again on this node - I'm not fussy about how I achieve that)
> > > > resource migrates back to node1 once ping (connectivity is restored).
> >
> > Rebooting a node should imply a resource cleanup. In the next
> > release the cluster will also be able to "forget" after some time
> > about the failure.
> >
> > > actual actions:
> > > > node1 loses ping,
> > > > resource migrates to node2.
> >
> > And the node1 is not rebooted? Then there's a problem with the
> > stonith setup. Any errors in logs?
>
> It's never called. I've cocked up the config by experimenting via
> 'cut-n-paste' rather than taking the time to understand the thing
> properly. Having said that I've read the docs, watching Alan down
> under, trawled the lists and re-arranged others configs, but it's
> still pretty random. Plus it's been 2 weeks and I'm an instant
> gratification kind of guy, so I'm out of my comfort zone and getting a
> little pissed (w/ myself) now :-)
>
> I just don't get how it's (stonith) is called in relation to another
> resource failing. The mechanism, the relationships. It's not just
> stonith, for example if the ping failed and I wanted to start apache
> on a cluster node to take over all IP addresses and serve up a 'temp.
> out of service' page I wouldn't have the foggiest.
Well, it definitely takes some time to get used to it.
Thanks,
Dejan
> > > > node2 loses ping but 'resource cannot run anywhere' ensues and both nodes are 'active' but no resources are being ran.
> > >
> > > I think fundamentally my approach is wrong and that I should leave it
> > > to fail and have human intervention to clean it up rather than hope it
> > > will flip flop between nodes.
> >
> > That depends on your needs of course. At any rate, it should be
> > possible to configure the cluster to fit those needs.
> >
> > There is also the meatware stonith device which will prompt a
> > human to clean up/reboot.
> >
> > > But - I'd like to have a better grasp of
> > > how V2 works in general before making the choice to fall back to a
> > > simpler config.
> >
> > HTH.
>
> It has. Thanks a lot.
>
> -Peter
>
> > Thanks,
> >
> > Dejan
> >
> > > -Peter
> > >
> > >
> > > Active / Passive set up.
> > > 2 nodes, one resource (ldirectord) balancing traffic for IP addresses
> > > on 2 web servers.
> > > 2 nics [eth0: dmz facing - eth1: crossover cable, on 10.0.0.1/2]
> > >
> > > This relates to the previous post:
> > > "How can you clean up a degraded node w/out killing it (and not manually)?"
> > >
> > > Versions:
> > > heartbeat-stonith-2.1.2-3.el4.centos
> > > heartbeat-pils-2.1.2-3.el4.centos
> > > heartbeat-ldirectord-2.1.2-3.el4.centos
> > > heartbeat-2.1.2-3.el4.centos
> >
> > > <resources>
> > > <group id="group_1">
> > > <primitive class="ocf" id="IPaddr_212_140_130_37" provider="heartbeat" type="IPaddr">
> > > <operations>
> > > <op id="IPaddr_212_140_130_37_mon" interval="5s" name="monitor" timeout="5s"/>
> > > </operations>
> > > <instance_attributes id="IPaddr_212_140_130_37_inst_attr">
> > > <attributes>
> > > <nvpair id="IPaddr_212_140_130_37_attr_0" name="ip" value="212.140.130.37"/>
> > > </attributes>
> > > </instance_attributes>
> > > </primitive>
> > > <primitive class="ocf" id="IPaddr_212_140_130_38" provider="heartbeat" type="IPaddr">
> > > <operations>
> > > <op id="IPaddr_212_140_130_38_mon" interval="5s" name="monitor" timeout="5s"/>
> > > </operations>
> > > <instance_attributes id="IPaddr_212_140_130_38_inst_attr">
> > > <attributes>
> > > <nvpair id="IPaddr_212_140_130_38_attr_0" name="ip" value="212.140.130.38"/>
> > > </attributes>
> > > </instance_attributes>
> > > </primitive>
> > > <primitive class="ocf" id="ldirectord_3" provider="heartbeat" type="ldirectord">
> > > <operations>
> > > <op id="ldirectord_3_mon" interval="120s" name="monitor" timeout="60s"/>
> > > </operations>
> > > <instance_attributes id="ldirectord_3_inst_attr">
> > > <attributes>
> > > <nvpair id="ldirectord_3_attr_1" name="1" value="ldirectord.cf"/>
> > > </attributes>
> > > </instance_attributes>
> > > </primitive>
> > > </group>
> > > <clone id="pingd">
> > > <instance_attributes id="pingd">
> > > <attributes>
> > > <nvpair id="pingd-clone_node_max" name="clone_node_max" value="1"/>
> > > </attributes>
> > > </instance_attributes>
> > > <primitive id="pingd-child" provider="heartbeat" class="ocf" type="pingd">
> > > <operations>
> > > <op id="pingd-child-monitor" name="monitor" interval="20s" timeout="40s" prereq="nothing" on_fail="fence"/>
> > > </operations>
> > > <instance_attributes id="pingd_inst_attr">
> > > <attributes>
> > > <nvpair id="pingd-dampen" name="dampen" value="5s"/>
> > > <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
> > > </attributes>
> > > </instance_attributes>
> > > </primitive>
> > > </clone>
> > > <clone id="DoFencing">
> > > <instance_attributes>
> > > <attributes>
> > > <nvpair name="clone_max" value="2"/>
> > > <nvpair name="clone_node_max" value="1"/>
> > > </attributes>
> > > </instance_attributes>
> > > <primitive id="child_DoFencing" class="stonith" type="suicide" provider="heartbeat">
> > > <operations>
> > > <op name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
> > > <op name="start" timeout="20s" prereq="nothing"/>
> > > </operations>
> > > </primitive>
> > > </clone>
> > > </resources>
> > > <constraints>
> > > <rsc_location rsc="group_1" id="rsc_location_group_1">
> > > <rule id="prefered_location_group_1" score="200">
> > > <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="dmz1.scarceskills.com"/>
> > > </rule>
> > > <rule id="group_1:connected:rule" score="-INFINITY" boolean_op="and">
> > > <expression id="my_resource:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
> > > </rule>
> > > </rsc_location>
> > > <rsc_location id="cli-prefer-group_1" rsc="group_1">
> > > <rule id="cli-prefer-rule-group_1" score="INFINITY">
> > > <expression id="cli-prefer-expr-group_1" attribute="#uname" operation="eq" value="dmz1.scarceskills.com" type="string"/>
> > > </rule>
> > > </rsc_location>
> > > </constraints>
> > > _______________________________________________
> > > Linux-HA mailing list
> > > Linux-HA at lists.linux-ha.org
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems