[Linux-HA] stonith riloe - nodes kill each other

Jan Kalcic jandot at googlemail.com
Wed Jul 1 10:31:03 MDT 2009


Dejan Muhamedagic wrote:
> Hi,
>
> On Fri, Jun 26, 2009 at 04:33:30PM +0200, Jan Kalcic wrote:
>   
>> Andrew Beekhof wrote:
>>     
>>> On Fri, Jun 26, 2009 at 3:07 PM, Jan Kalcic<jandot at googlemail.com> wrote:
>>>   
>>>       
>>>> Andrew Beekhof wrote:
>>>>     
>>>>         
>>>>> On Fri, Jun 26, 2009 at 10:55 AM, Jan<jandot at googlemail.com> wrote:
>>>>>
>>>>>       
>>>>>           
>>>>>> Hi,
>>>>>>
>>>>>> a very boring issue with stonith using the plugin external/riloe (I have
>>>>>> never used it before). Whenever I try to simulate a split-brain condition
>>>>>> (using iptables) in order to test stonith, both nodes kill each other. Not
>>>>>> exactly what I expected.
>>>>>>
>>>>>>         
>>>>>>             
>>>>> Sure it is
>>>>>
>>>>> [snip]
>>>>>
>>>>>
>>>>>       
>>>>>           
>>>>>>        <nvpair id="nvpair-56c027e0-80c8-49a7-9cf1-1af593a9391f"
>>>>>> name="no-quorum-policy"
>>>>>> value="ignore"/>
>>>>>>
>>>>>>         
>>>>>>             
>>>>> With that option, this is exactly what I'd expect.
>>>>>
>>>>> Have a read of:
>>>>>    http://ourobengr.com/ha
>>>>>
>>>>>       
>>>>>           
>>>> From what I understood, probably wrongly, that should be the right option
>>>> for a two-node cluster, where a single node can't have quorum; that's
>>>> why it should be "ignore". Is this wrong?
>>>>
>>>> I had already taken a quick look at that document (I love that picture
>>>> btw) but not as deeply as now. I am going to review my timeout for sure.
>>>> Anyway, I don't get any hint about the quorum setting. Should it be
>>>> different from "ignore"?
>>>>     
>>>>         
>>> No, that's the right value for a two node cluster.
>>> But that value can also lead to the behavior you described.
>>>
>>> Though normally one side shoots the other before it can shoot back.
>>>   
>>>       
>> This does not happen. The reason could be that using iLO the node is not
>> actually shot but gracefully shut down. For this reason the shot node has
>> plenty of time to shoot the other side back. Does that make sense?
>>     
>
> Yes, it does.
>
>   
>> In this case I would need to stonith the other side not gracefully but
>> forcibly, like unplugging the cable, but it seems this is not available
>> with the riloe plugin, is it?
>>     
>
> Yes, it is. You should use the latest version of the plugin.
>   

I checked the plugin's version and it seems to be the latest one. It
is the one installed with SLES11-HA. A diff against the plugin available on
the openSuSE build service for openSuSE 11.1 reports they are the same.

> ilo_powerdown_method should be set to power, AFAIK. I think that
> that does a "cable pull" operation. If you still find a problem
> with nodes shooting each other at the same time, please file a
> bugzilla. I'm not sure if that can be fixed, depends on the
> timings when talking to the device.
>   

I will try the power option in the next few days. What confuses me
is the description below, which I extracted from the plugin: "power"
takes longer than "button". I would expect it to shoot the node
immediately so that the node cannot stonith back.

<shortdesc lang="en">Power down method</shortdesc>
<longdesc lang="en">
The method to powerdown the host in question.
* button - Emulate holding down the power button
* power - Emulate turning off the machines power

NB: A button request takes around 20 seconds. The power method
about half a minute.
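
To make the timing concern concrete, here is a minimal sketch of the
race, not anything taken from the plugin. It assumes that each node
sends its fencing request when it notices the peer is gone, that a
request already sent to the iLO completes even if the sender loses
power afterwards, and that the delay is the same in both directions.
The function name and the detection times are hypothetical; only the
20 and 30 second figures come from the description above.

```python
def fencing_outcome(detect_a, detect_b, power_off_delay):
    """Return the set of nodes that end up powered off.

    detect_a, detect_b: time (s) at which node A / node B notices the
        peer is gone and sends its fencing request (hypothetical values).
    power_off_delay: time (s) between sending a request and the target
        actually losing power (assumed equal in both directions).
    """
    off_at = {}  # node name -> time at which it loses power
    requests = [("A", "B", detect_a), ("B", "A", detect_b)]
    # Handle the requests in the order in which they are sent.
    for shooter, target, sent_at in sorted(requests, key=lambda r: r[2]):
        # A node that has already lost power cannot send a request.
        if shooter in off_at and off_at[shooter] <= sent_at:
            continue
        off_at[target] = sent_at + power_off_delay
    return set(off_at)


# Node B notices the split one second after node A (assumed offset).
print(sorted(fencing_outcome(0, 1, 20)))   # "button": ['A', 'B']
print(sorted(fencing_outcome(0, 1, 30)))   # "power":  ['A', 'B']
print(sorted(fencing_outcome(0, 1, 0.5)))  # fast power-off: ['B']
```

Under these assumptions a node survives only if the power-off delay is
shorter than the gap between the two detections, so with either method
taking 20 seconds or more both nodes would still go down.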

Thanks,
Jan
> Thanks,
>
> Dejan
>
>
>
>   
>> Thanks,
>> Jan
>>     
>>>> My issue isn't exactly the deathmatch described there, first of all
>>>> because the openais daemon is disabled at boot and secondly because the
>>>> stonith policy is poweroff. Rather, it is a strange situation where both
>>>> nodes kill themselves and they both shut down.
>>>>     
>>>>         
>>> They'd both be killing each other.
>>>
>>>   
>>>       
>>>> I wonder if it is a timeout issue. My timeout here for the stonith
>>>> resource is 15s. Does it mean that when a stonith is sent by the first
>>>> node to the second one and this node can't shut itself down within 15s, it
>>>> stoniths the first node?
>>>>     
>>>>         
>>> No.  This is unrelated
>>> _______________________________________________
>>> Linux-HA mailing list
>>> Linux-HA at lists.linux-ha.org
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>> See also: http://linux-ha.org/ReportingProblems
>>>
>>>   
>>>       



