[Linux-HA] Heartbeat 2.0.7 with CRM and IBM ServeRAID

Andrew Beekhof beekhof at gmail.com
Tue Oct 10 05:04:43 MDT 2006


On 10/10/06, Jon Fanti <Jon.Fanti at unique.com> wrote:
> I've now fixed the ServeRAID RA script, since there were two errors:
> firstly, it was trying to return $OCF_SUCCESS rather than exit
> $OCF_SUCCESS; secondly, the syntax of the ipssend utility had changed.
> I've included the modified version, but it still has some extra ha_log
> statements in it.

A patch would have been better, but I'll see what I can do.
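For anyone hitting the same first bug: in a shell resource agent, return only hands a status back from a function, while the process exit code is what Heartbeat actually reads. The top level of the script therefore has to exit with the OCF code. In bash, for example, a top-level return in a script that is not being sourced is an error rather than a way to end the script. A minimal sketch of the usual pattern follows. The agent_* functions and run_action are hypothetical stubs, not the real ServeRAID agent, and the OCF codes are defined inline here, although real agents normally get them by sourcing the OCF shell helper functions shipped with Heartbeat.

```shell
#!/bin/sh
# Sketch of the return-vs-exit pattern in an OCF-style agent.
# Hypothetical: the action bodies are stubs, not the real ServeRAID logic.

# Standard OCF return codes (normally provided by the OCF shell helpers).
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Inside a function, `return` hands a status back to the caller...
agent_start() {
    # ... real start-up work would go here ...
    return $OCF_SUCCESS
}

agent_stop() {
    return $OCF_SUCCESS
}

agent_monitor() {
    return $OCF_NOT_RUNNING
}

# Map the action name to a function; unknown actions are unimplemented.
run_action() {
    case "$1" in
        start)   agent_start ;;
        stop)    agent_stop ;;
        monitor) agent_monitor ;;
        *)       return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# ...but the script itself must `exit` with that status, because the
# process exit code is what the cluster manager sees.
if [ "$#" -gt 0 ]; then
    run_action "$1"
    exit $?
fi
```

Invoked as, say, ./agent start, the script runs the matching function and exits with its status.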

> However, it currently takes a long time for the ServeRAID array to
> become available for my filesystem resource to mount. Looking at the
> logs, I see that Heartbeat attempts to start ServeRAID multiple times.
> Starting ServeRAID by hand takes ~15 seconds, so I've added
> name="start" timeout="20s". This prevents Heartbeat from immediately
> deciding that the resource start-up has failed,

> but it doesn't seem to prevent
> Heartbeat from attempting to start the resource many times.

Multiple starts are always allowed.  This is mandated by both the LSB
and OCF specs.

Starting a started resource must succeed.
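In practice this means the start action should check the current state first and report success if the resource is already up; the same applies to stopping a stopped resource. A minimal sketch, assuming a hypothetical state file (STATE_FILE) stands in for "the array is available"; a real ServeRAID agent would query the controller instead.

```shell
#!/bin/sh
# Sketch of an idempotent start/stop pair.
# Hypothetical: a state file stands in for the real resource state.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

STATE_FILE="${STATE_FILE:-/tmp/demo-agent.state}"

agent_monitor() {
    if [ -f "$STATE_FILE" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

agent_start() {
    # Already running: a repeated start is not an error.
    if agent_monitor; then
        return $OCF_SUCCESS
    fi
    # ... the slow, real start-up work would go here ...
    touch "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

agent_stop() {
    # Stopping a stopped resource must succeed too.
    rm -f "$STATE_FILE"
    return $OCF_SUCCESS
}
```

Calling agent_start twice in a row returns 0 both times, which is exactly what the cluster relies on when it repeats a start.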

> Does Heartbeat issue a start command and then a monitor command to
> check that the resource is available, or should it just wait up to
> my start timeout value for the RA to return success? If the latter, I
> don't believe this is happening.

We send a monitor (interval=0) to see what state the resource is in.
If it reports 7 (not running), we then start the resource.
If the start reports 0 (success), we then start any recurring monitor
actions that were specified for the resource.
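The probe, start, recurring-monitor sequence above can be sketched as a small driver function. This is a simplified illustration rather than the actual CRM/LRM code: it assumes the agent is a script taking the action name as its first argument, treats any other probe result as a plain failure, and does not actually schedule the recurring monitors. bring_up is a hypothetical name.

```shell
#!/bin/sh
# Simplified sketch of the probe -> start -> recurring monitor sequence.
# Not the real CRM logic: "$agent" is assumed to be a script that takes
# the action as $1 and exits with an OCF return code.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7

bring_up() {
    agent="$1"

    # 1. Probe: a one-off monitor (interval=0) to learn the current state.
    rc=0; "$agent" monitor || rc=$?
    if [ "$rc" -eq $OCF_SUCCESS ]; then
        echo "already running"
        return 0
    elif [ "$rc" -ne $OCF_NOT_RUNNING ]; then
        echo "probe failed (rc=$rc)"
        return 1
    fi

    # 2. Not running, so start it and wait for the result.
    rc=0; "$agent" start || rc=$?
    if [ "$rc" -ne $OCF_SUCCESS ]; then
        echo "start failed (rc=$rc)"
        return 1
    fi

    # 3. Start succeeded: recurring monitors would be scheduled from here.
    echo "started; scheduling recurring monitors"
    return 0
}
```

Seen this way, an agent being invoked several times (monitor, then start, then monitor again) is the expected pattern rather than a sign of repeated failed starts.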

If the start fails, we will send a stop action.

If one of the recurring monitor actions fails, we will send a stop
followed by another start.
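The two recovery rules can be sketched in the same simplified style, under the same assumption about the agent script. The real cluster manager applies its own policy on top of this (it may, for example, move the resource to another node), so treat this only as an illustration of the ordering; try_start and on_monitor_result are hypothetical names.

```shell
#!/bin/sh
# Simplified sketch of the two recovery rules:
#   - a failed start is followed by a stop;
#   - a failed recurring monitor is followed by a stop, then a start.
# "$agent" is assumed to be a script taking the action as $1.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1

try_start() {
    agent="$1"
    if "$agent" start; then
        return $OCF_SUCCESS
    fi
    echo "start failed: stopping"
    "$agent" stop
    return $OCF_ERR_GENERIC
}

# Called with the exit code of one recurring monitor run.
on_monitor_result() {
    agent="$1"
    rc="$2"
    if [ "$rc" -eq $OCF_SUCCESS ]; then
        return $OCF_SUCCESS
    fi
    echo "monitor failed (rc=$rc): restarting"
    "$agent" stop || return $OCF_ERR_GENERIC
    try_start "$agent"
}
```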

For every action, we wait for it to complete before moving on.  The
problem is that you have a resource that takes 15+ seconds to start,
but default_action_timeout is set to 5s, so the start times out long
before it can finish.

You either need to increase this value for all actions or (as I
believe you have done) set a higher value on the start operation.
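The timeout mismatch can be reproduced in miniature with the GNU coreutils timeout command standing in for the cluster's action timeout (GNU timeout exits with 124 when the limit is hit). This is only an analogy: the numbers are scaled down (a 2-second "start" against 1- and 3-second limits) so it runs quickly, sleep is a hypothetical stand-in for the ~15-second ServeRAID start, and run_with_limit is a hypothetical name.

```shell
#!/bin/sh
# Analogy for an action timeout that is shorter than the action itself.
# Assumes GNU coreutils `timeout`, which exits with 124 on expiry.

run_with_limit() {
    limit="$1"
    rc=0
    # Hypothetical slow "start": just sleeps for 2 seconds.
    timeout "$limit" sleep 2 || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s"
    else
        echo "finished (rc=$rc)"
    fi
}
```

run_with_limit 1 reports a timeout, while run_with_limit 3 lets the 2-second start finish, which mirrors raising the start timeout from the 5s default to 20s.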

> I have attached the ha-log and ha-debug
> files.
>
> So, just to clarify what I know, and what works:
>
> - The ServeRAID RA now runs correctly, and verifies with the ocf-checker
> tool

excellent

> - ServeRAID RA can be run from the CLI by hand with success

also good

> - ServeRAID RA takes ~15 seconds to fully complete and return success

ok

> - If the ServeRAID RA is run from Heartbeat it is executed multiple
> times

normal

> - The ServeRAID RA requires a timeout value to prevent Heartbeat
> believing it has failed to start

Not true. It needs a _higher_ timeout than the default you set.

