[Linux-HA] "Clones, Stonith and Suicide" The SysAdmin who had a nervous breakdown.

Dave Blaschke debltc at us.ibm.com
Wed Oct 3 07:49:23 MDT 2007

Dejan Muhamedagic wrote:
> Hi,
> On Tue, Oct 02, 2007 at 10:55:03PM +0100, Peter Farrell wrote:
>> On 02/10/2007, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
>>> Hi,
>>> On Tue, Oct 02, 2007 at 05:17:38PM +0100, Peter Farrell wrote:
>>>> Can someone verify my CIB please?
>>>> It's not working as intended and the more I read the less I understand...
>>>> I've stared at the config for the past 2 days hoping to be struck by
>>>> sudden understanding... hasn't happened yet.
>>> Don't worry, the learning curve is extremely steep. We all need
>>> quite some patience.
>>>> I don't understand how you make a rule, and then call that rule as a
>>>> result of an action. I used the bit from the pingd FAQ page:
>>>> http://www.linux-ha.org/v2/faq/pingd
>>>> "Quickstart - Only Run my_resource on Nodes with Access to at Least
>>>> One Ping Node"
>>>> So - for my pingd clone, the operation is 'monitor' and 'on_fail=fence'
>>>> <op id="pingd-child-monitor" name="monitor" interval="20s"
>>>> timeout="40s" prereq="nothing" on_fail="fence"/>
>>>> I assume that this literally means:
>>>> "ask the LRM to see if pingd is running every 20s, if after 40s pingd
>>>> is not running, call it 'failed', and as it's 'failed' - fence it off,
>>>> which forces the resource to migrate to another node and marks this
>>>> one as 'degraded' and will not allow resource to run until it's been
>>>> cleaned up"
>>>> Is that right? If so, then this bit I'm OK with.
>>> No, not exactly. The monitor operation may fail (i.e. the
>>> resource agent says that the resource isn't running) or time out
>>> (that's what you described). Of course, both are considered to be
>>> failures by CRM. on_fail=fence means that if this operation
>>> fails, the node will be fenced, i.e. rebooted if you have an
>>> operational stonith device. Perhaps a tad harsh for a monitor
>>> failure.
>> 1. The approach for me is (this is a test cluster - but I want to use
>> it to replace a production one) - if either of the load balancers
>> can't ping one or two routers in my DMZ, then this must mean they're
>> dead. I figured if they can't see the router - how the hell can they
>> see the apache servers they're meant to be managing?
>> Is this 'correct political thought' or a sloppy foundation to begin with?
> It's just that the resources _are_ going to move. No need to kill
> the cooperating node.
>> 2. I didn't know that fence meant 'rebooted'. I thought it was sort of
>> 'fenced off' and left in a degraded state should someone want to poke
>> around a bit.
>> RE: Perhaps a tad harsh for a monitor failure - I agree. But what's a
>> girl to do?
>> Am I on the right track here? Do I want it rebooting? Do I just want
>> Heartbeat to restart? Does it matter? If it comes up and the link is
>> still dead - will it cycle forever w/ reboots?
> Not sure, but could be. Whenever a node comes up all resources
> are probed, i.e. one monitor operation is fired.
>> 3. the real bit I'm missing: Let's say I want it rebooted after
>> fencing.
> Fencing _is_ rebooting.
Sorry to jump into the middle of this thread, but can't you also power off
the node by setting stonith_action to poweroff instead of reboot? Of
course, you need a stonith device that supports ST_POWEROFF... I haven't
read through the code, but I'd assume that option works.
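
A minimal, untested sketch of how that might look in the crm_config
section of the CIB, assuming the Heartbeat 2.x nvpair layout. The id
values are made up for illustration, and depending on the release the
option names may be spelled with dashes (stonith-action) rather than
underscores:

```xml
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <attributes>
      <!-- fencing has to be enabled for stonith_action to matter -->
      <nvpair id="opt-stonith-enabled" name="stonith_enabled" value="true"/>
      <!-- assumption: "poweroff" leaves the fenced node down instead of rebooting it -->
      <nvpair id="opt-stonith-action" name="stonith_action" value="poweroff"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```

With a setting like this, a node fenced because of on_fail="fence" would
stay powered off until someone brings it back up by hand, which would
also avoid a reboot loop while the link is still dead.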
