[Linux-HA] Reasonable values for timeouts

Andrew Beekhof beekhof at gmail.com
Fri Jul 13 01:43:37 MDT 2007


On 7/13/07, Max Hofer <max.hofer at apus.co.at> wrote:
>
> I agree with the experience described by Eddie.
>
> For the 'monitor' keep in mind that the timeout should be lower than the
> interval. It does not make sense to start a 2nd monitor cycle when the
> first one did not finish.



i _think_ that the interval is the time between one action ending and the
next one starting (rather than between both starting)

at least i hope that
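
To make the two readings of "interval" concrete, here is a small Python sketch. It is not heartbeat code: the function name and the 60s/45s figures are made up for illustration. It lists when each monitor run would start if the interval were counted from the end of the previous run versus from its start.

```python
def monitor_start_times(interval, duration, runs, measure_from="end"):
    """Return the start time (in seconds) of each monitor run.

    measure_from="end":   interval is counted from the end of the previous run
    measure_from="start": interval is counted from the start of the previous run
    """
    if measure_from not in ("end", "start"):
        raise ValueError("measure_from must be 'end' or 'start'")
    starts = []
    t = 0
    for _ in range(runs):
        starts.append(t)
        # From "end" the next run waits for the run itself plus the interval.
        t += interval + duration if measure_from == "end" else interval
    return starts


# Hypothetical numbers: a 60s interval and a monitor that takes 45s to finish.
from_end = monitor_start_times(60, 45, 3, "end")
from_start = monitor_start_times(60, 45, 3, "start")
print(from_end)    # [0, 105, 210]
print(from_start)  # [0, 60, 120]
```

If the interval is counted from the start, a monitor that runs longer than the interval would overlap the next one, which is the case Max warns about. Counted from the end, that cannot happen.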


> In the end it boils down to:
> * you have to know what kind of operations/commands the resource
> agent is doing
> * make a rough estimation how long those commands take in worst case
>
> In a perfect world the person who wrote the RA would provide you
> with reasonable standard values via the meta-information. Well ... you
> know how the world is ;-)
>
> Keep also in mind what happens when the action fails:
> * failed start ---> resource will never be able to start on the cluster
> node again until you clear it with crm_resource -C (crm_resource -V is
> your friend to find those resources)


we're working on that :-)
the idea is to have it use the same mechanism as a monitor failure

> * failed monitor --> fail-count increase
> * failed stop ---> resource is UNMANAGED, which means the cluster
> environment will not start it anywhere else until you have cleaned up the
> whole thing manually and run crm_resource -C


unless you have stonith enabled in which case we'll shoot the node so we can
continue.
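
The failure cases discussed above can be summarised as a small lookup. This is only a Python restatement of what is said in this thread, not code from heartbeat, and the function name is invented for illustration; behaviour in other versions may differ.

```python
def failure_consequence(action, stonith_enabled=False):
    """Summarise what this thread says happens when an action fails."""
    if action == "start":
        return "resource blocked on this node until cleared with crm_resource -C"
    if action == "monitor":
        return "fail-count increases"
    if action == "stop":
        if stonith_enabled:
            return "node is fenced so the cluster can continue"
        return "resource is unmanaged until manual cleanup and crm_resource -C"
    raise ValueError(f"no rule for action {action!r}")


for action in ("start", "monitor", "stop"):
    print(action, "->", failure_consequence(action))
print("stop (stonith) ->", failure_consequence("stop", stonith_enabled=True))
```

A failed stop is the expensive case, so the stop timeout is the one that most deserves a generous value.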


> On Thursday 12 July 2007, Eddie C wrote:
> > I have found a few things:
> >
> > 1) A status or monitor function: I would set a timeout of more than 30
> > seconds.
> > Why? Sometimes developers/administrators do not understand the heartbeat
> > capability. They only want to restart a service quickly. If you set
> > the status/monitor too low it detects little restarts and may cause a
> > failover. Also, if the service is broken somehow, heartbeat may try to
> > restart it very often, filling up the logs quickly.
> >
> > 2) As for the timeouts, setting them high (30 sec+) might be better as
> > well. I had a piece of code that started in a split second in the lab
> > with a testing configuration. In the real world it took over 20 seconds
> > to start. I had the timeout set at 5. This drove the system crazy
> > because things were starting after heartbeat gave up and attempted to
> > fail them over to another node.
> >
> > Remember that heartbeat provides HA (High Availability), not CA
> > (Continuous Availability). I personally found that a failover time of
> > ~60 seconds is good. If you go too low, the state machine mechanics can
> > start getting tricky.
> >
> >
> >
> > On 7/12/07, matilda matilda <matilda at grandel.de> wrote:
> > >
> > > >>> "Andrew Beekhof" <beekhof at gmail.com> 12.07.2007 15:40 >>>
> > > > >>> "Andrew Beekhof" <beekhof at gmail.com> 12.07.2007 13:53 >>>
> > > > On 7/12/07, matilda matilda <matilda at grandel.de> wrote:
> > > > > Hi all,
> > > > >
> > > > > > how do I get reasonable values for timeout attributes for certain
> > > > > > operations?
> > > > > How can I tune them?
> > > > > Or shall I use the values provided in the RA metadata?
> > > >
> > > > > the default-action-timeout option determines what is used by default.
> > > > > To use a different value for a particular operation, e.g. 300s for a
> > > > > start operation, go to the resource you wish to modify and add:
> > > >
> > > >            <operations>
> > > >              <op id="somevalue" name="start" timeout="300s"/>
> > > >            </operations>
> > > >
> > > > >or for a recurring monitor operation such as:
> > > > >      <op id="DoFencing-1" name="monitor" interval="60s" prereq="nothing"/>
> > > > >just change that to something like:
> > > > >      <op id="DoFencing-1" name="monitor" interval="60s" prereq="nothing" timeout="300s"/>
> > > >
> > > >
> > > >does that help?
> > >
> > >
> > > Thank you, but what I really wanted to know is:
> > > > How do I get a feeling for how long a certain action could take
> > > > before it is assumed that this action doesn't work? So, how could I
> > > > get a timeout value which is as short as possible but not too short?
> > > Is there a way to test a RA in different load situations?
> > >
> > > Best regards
> > > Andreas Mock
> > >

