[Linux-HA] Reasonable values for timeouts
Andrew Beekhof
beekhof at gmail.com
Fri Jul 13 01:43:37 MDT 2007
On 7/13/07, Max Hofer <max.hofer at apus.co.at> wrote:
>
> I agree with the experience described by Eddie.
>
> For the 'monitor' keep in mind that the timeout should be lower than the
> interval. It does not make sense to start a 2nd monitor cycle when the
> first one did not finish.
i _think_ that the interval is the time between one action ending and the
next one starting (rather than between both starting).
at least i hope so
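
To make the two readings of "interval" concrete, here is a small Python
sketch. It is not Heartbeat code: the function name and the numbers are
made up for illustration. It only lists when each monitor run would start
under either reading, for a monitor that runs longer than its interval.

```python
def monitor_start_times(interval, duration, count, end_to_start=True):
    """Return the start time (in seconds) of each of `count` monitor runs.

    end_to_start=True  : the interval is counted from the END of a run
    end_to_start=False : the interval is counted from the START of a run
    """
    starts = []
    t = 0
    for _ in range(count):
        starts.append(t)
        # Next run begins one interval after either the end or the start.
        t += (duration + interval) if end_to_start else interval
    return starts


# Hypothetical numbers: 60s interval, each monitor run takes 90s.
print(monitor_start_times(60, 90, 3, end_to_start=True))   # [0, 150, 300]
print(monitor_start_times(60, 90, 3, end_to_start=False))  # [0, 60, 120]
```

Under the start-to-start reading the second run would begin at 60s while
the first is still busy until 90s, which is the overlap Max warns about.
Under the end-to-start reading runs can never overlap, whatever the timeout.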
> In the end it boils down to:
> * you have to know what kind of operations/commands the resource
> agent is running
> * make a rough estimate of how long those commands take in the worst case
>
> In a perfect world the person who wrote the RA would provide you
> with reasonable standard values via the meta-information. Well ... you
> know how the world is ;-)
>
> Keep also in mind what happens when the action fails:
> * failed start ---> resource will never be able to start on the cluster
> node again until you clear it with crm_resource -C (crm_resource -V is
> your friend to find those resources)
we're working on that :-)
the idea is to have it use the same mechanism as a monitor failure
> * failed monitor --> fail-count increase
> * failed stop ---> resource is UNMANAGED, which means the cluster
> environment will not start it anywhere else until you have cleaned up the
> whole thing manually and run crm_resource -C
unless you have stonith enabled in which case we'll shoot the node so we can
continue.
> On Thursday 12 July 2007, Eddie C wrote:
> > I have found a few things:
> >
> > 1) A status or monitor function.. I would set a timeout of more than 30
> > seconds.
> > Why? Sometimes developers/administrators do not understand the heartbeat
> > capability. They only want to restart a service quickly. If you set the
> > status/monitor too low it detects little restarts and may cause a fail over.
> > Also if the service is broken somehow heartbeat may try to restart it very
> > often, filling up logs quickly
> >
> > 2) As for the timeouts, setting them high might be better as well, 30 sec+.
> > I had a piece of code that started in a split second in the lab with a
> > testing configuration. In the real world it took over 20 seconds to start.
> > I had the timeout set at 5. This drove the system crazy because things were
> > starting after heartbeat gave up and attempted to fail them over to another
> > node.
> >
> > Remember heartbeat is called HA, High Availability, not CA, Continuous
> > Availability. I personally found that a fail over of ~60 seconds is good.
> > If you go too low the state machine mechanics can start getting tricky.
> >
> >
> >
> > On 7/12/07, matilda matilda <matilda at grandel.de> wrote:
> > >
> > > >>> "Andrew Beekhof" <beekhof at gmail.com> 12.07.2007 15:40 >>>
> > > > >>> "Andrew Beekhof" <beekhof at gmail.com> 12.07.2007 13:53 >>>
> > > > On 7/12/07, matilda matilda <matilda at grandel.de> wrote:
> > > > > Hi all,
> > > > >
> > > > > > how do I get reasonable values for timeout attributes for certain
> > > > > > operations?
> > > > > How can I tune them?
> > > > > Or shall I use the values provided in the RA metadata?
> > > >
> > > > > the default-action-timeout option determines what is used by default.
> > > > > to use a different value for a particular operation, eg. 300s for a
> > > > > start operation, go to the resource you wish to modify and add:
> > > >
> > > > <operations>
> > > > <op id="somevalue" name="start" timeout="300s"/>
> > > > </operations>
> > > >
> > > > > or for a recurring monitor operation such as:
> > > > > <op id="DoFencing-1" name="monitor" interval="60s" prereq="nothing"/>
> > > > > just change that to something like:
> > > > > <op id="DoFencing-1" name="monitor" interval="60s" prereq="nothing" timeout="300s"/>
> > > > >
> > > > >
> > > > > does that help?
> > >
> > >
> > > Thank you, but what I really wanted to know is:
> > > > How do I get a feeling for how long a certain action could take before
> > > > it is assumed that this action doesn't work? So, how could I get a
> > > > timeout value which is as short as possible but not too short?
> > > Is there a way to test a RA in different load situations?
> > >
> > > Best regards
> > > Andreas Mock
> > >