[Linux-HA] stonith riloe - nodes kill each other

Jan Kalcic jandot at googlemail.com
Fri Jul 3 03:04:11 MDT 2009


Jan Kalcic wrote:
> Dejan Muhamedagic wrote:
>   
>> Hi,
>>
>> On Fri, Jun 26, 2009 at 04:33:30PM +0200, Jan Kalcic wrote:
>>   
>>     
>>> Andrew Beekhof wrote:
>>>     
>>>       
>>>> On Fri, Jun 26, 2009 at 3:07 PM, Jan Kalcic<jandot at googlemail.com> wrote:
>>>>   
>>>>       
>>>>         
>>>>> Andrew Beekhof wrote:
>>>>>     
>>>>>         
>>>>>           
>>>>>> On Fri, Jun 26, 2009 at 10:55 AM, Jan<jandot at googlemail.com> wrote:
>>>>>>
>>>>>>       
>>>>>>           
>>>>>>             
>>>>>>> Hi,
>>>>>>>
>>>>>>> a very annoying issue with stonith using the plugin external/riloe (never used
>>>>>>> it before). Whenever I try to simulate a split-brain condition (using iptables) in
>>>>>>> order to test stonith, both nodes kill each other. Not exactly what I
>>>>>>> expected.
>>>>>>>
>>>>>>>         
>>>>>>>             
>>>>>>>               
>>>>>> Sure it is
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>
>>>>>>       
>>>>>>           
>>>>>>             
>>>>>>>        <nvpair id="nvpair-56c027e0-80c8-49a7-9cf1-1af593a9391f"
>>>>>>> name="no-quorum-policy"
>>>>>>> value="ignore"/>
>>>>>>>
>>>>>>>         
>>>>>>>             
>>>>>>>               
>>>>>> With that option, this is exactly what I'd expect.
>>>>>>
>>>>>> Have a read of:
>>>>>>    http://ourobengr.com/ha
>>>>>>
>>>>>>       
>>>>>>           
>>>>>>             
>>>>> From what I understood, probably wrongly, that should be the right option
>>>>> for a two-node cluster, where a single node can't have quorum; that's
>>>>> why it should be "ignore". Is this wrong?
>>>>>
>>>>> I had already taken a quick look at that document (I love that picture
>>>>> btw) but not as deeply as now. I am going to review my timeout for sure.
>>>>> Anyway, I don't get any hint about the quorum setting. Should it be
>>>>> different from "ignore"?
>>>>>     
>>>>>         
>>>>>           
>>>> No, that's the right value for a two-node cluster.
>>>> But that value can also lead to the behavior you described.
>>>>
>>>> Though normally one side shoots the other before it can shoot back.
>>>>   
>>>>       
>>>>         
>>> This does not happen. The reason could be that using iLO the node is not
>>> actually shot but shut down gracefully. For this reason the shot node has
>>> all the time it needs to shoot the other side back. Does that make sense?
>>>     
>>>       
>> Yes, it does.
>>
>>   
>>     
>>> In this case I would need to stonith the other side not gracefully but
>>> forcibly, like unplugging the cable, but it seems this is not available
>>> with the riloe plugin, is it?
>>>     
>>>       
>> Yes, it is. You should use the latest version of the plugin.
>>   
>>     
>
> I checked the plugin's version and it seems to be the very latest one. It
> is the one installed with SLES11-HA. A diff with the plugin available on
> the openSuSE build service for openSuSE 11.1 reports they are the same.
>   
>> ilo_powerdown_method should be set to power, AFAIK. I think that
>> that does a "cable pull" operation. If you still find a problem
>> with nodes shooting each other at the same time, please file a
>> bugzilla. I'm not sure if that can be fixed, depends on the
>> timings when talking to the device.
>>   
>>     
>
> I will try with the power option in the next few days. What confuses
> me is the description below, which I extracted from the plugin: "power"
> takes longer than "button". I would expect it to shoot the node
> immediately so that it cannot stonith back.
>
> <shortdesc lang="en">Power down method</shortdesc>
> <longdesc lang="en">
> The method to powerdown the host in question.
> * button - Emulate holding down the power button
> * power - Emulate turning off the machines power
>
> NB: A button request takes around 20 seconds. The power method
> about half a minute.
>
>   
Ok, actually the power method was the one I was already using. What I
changed is the stonith action, from poweroff, which shuts the node down
gracefully, to reboot, which does reboot the server but also resets it
within a few seconds. The deathmatch no longer occurs. From the command
line I managed to stonith the node just the way I want, a reset with no
reboot (-T reset), but I could not "move" this command into pacemaker.
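
To make the timing argument concrete, here is a toy sketch in Python. It is
only a simplified model of the race, not how pacemaker or the riloe plugin
actually work; the function name, the parameters and the numbers are all
made up for illustration.

```python
def survivors(detect_a, detect_b, request_latency, kill_delay):
    """Toy model: return the set of nodes ("a", "b") still up after a split brain.

    detect_a, detect_b: seconds until each node decides to fence its peer.
    request_latency:    seconds for a fence request to reach the peer's iLO.
    kill_delay:         seconds between the request arriving and the victim
                        actually stopping (long for a graceful shutdown,
                        short for a hard reset).
    All values are hypothetical; nothing here is measured.
    """
    sent = {"a": detect_a + request_latency, "b": detect_b + request_latency}
    # The node whose request arrives first always gets its shot off.
    first, second = sorted(sent, key=sent.get)
    alive = {"a", "b"}
    alive.discard(second)  # the slower node is fenced by the faster one
    # The slower node can still shoot back if its own request goes out
    # before it has actually stopped.
    if sent[second] < sent[first] + kill_delay:
        alive.discard(first)
    return alive


# Slow, graceful shutdown: the victim has time to shoot back.
print(survivors(0.0, 1.0, 2.0, 20.0))
# Fast reset: the victim stops before its own request goes out.
print(survivors(0.0, 1.0, 2.0, 0.5))
```

In this model both nodes go down whenever the kill delay is longer than the
gap between the two requests, which matches what I saw with poweroff.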

Thanks,
Jan

> Thanks,
> Jan
>   
>> Thanks,
>>
>> Dejan
>>
>>
>>
>>   
>>     
>>> Thanks,
>>> Jan
>>>     
>>>       
>>>>> My issue isn't exactly the deathmatch described there, first of all
>>>>> because the openais daemon is disabled at boot and secondly because the
>>>>> stonith policy is poweroff. Rather, it is a strange situation where both
>>>>> nodes kill themselves and they both shut down.
>>>>>     
>>>>>         
>>>>>           
>>>> They'd both be killing each other.
>>>>
>>>>   
>>>>       
>>>>         
>>>>> I wonder if it is a timeout issue. My timeout here for the stonith
>>>>> resource is 15s. Does it mean that when a stonith is sent by the first
>>>>> node to the second one and this node can't shut itself down in 15s, it
>>>>> stoniths the first node?
>>>>>     
>>>>>         
>>>>>           
>>>> No.  This is unrelated
>>>> _______________________________________________
>>>> Linux-HA mailing list
>>>> Linux-HA at lists.linux-ha.org
>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>> See also: http://linux-ha.org/ReportingProblems
>>>>
>>>>   
>>>>       
>>>>         



