[Linux-HA] the return code of failing start action

Andrew Beekhof beekhof at gmail.com
Tue Oct 9 00:53:15 MDT 2007


dejan, can you take a look at this pls?
rc for an operation seems to be changing in the lrmd somehow

On 10/9/07, Junko IKEDA <ikedaj at intellilink.co.jp> wrote:
> > > Hi,
> > >
> > > when I tried the following case,
> > > the return code of start action was something strange.
> > >
> > > 1) There are two nodes: an active node and a standby node
> > > 2) one resource is running on the active node
> > > 3) SplitBrain came up!
> >
> > you created a split brain or it occurred on its own?
>
> I created it on purpose.

ok

>
> > > 4) the resource then tries to start on both nodes,
> >
> > you don't have stonith configured, right?
> >
> > because this is exactly the reason why two-node clusters, particularly
> > ones without stonith configured, are a seriously bad idea.
> >
> > at least configure pingd so that only one side will try and run the
> > resources
>
> There is no stonith configuration for now.
> This might sound strange, but we are testing some worst cases without
> stonith.

just a little... it reminds me of the old joke:

patient: doctor, doctor, it hurts when i do this!
doctor: well, don't do that then


is the concern that some part of the stonith setup will fail and you
want to see how the cluster behaves without it?

otherwise i confess i don't see the point.

> Of course stonith would help in this situation if it were configured.
>
> > >    I made the start fail on purpose on the standby node,
> > >    so the return code of the start action would be -1 on standby.
> > >    (that part worked as expected)
> >
> > -1 means "timed out"... that's not a good value to return from an RA
>
> sorry for the lack of explanation...
> I created that on purpose, too.
> I wanted to know how heartbeat would behave if an RA "timed out".

ah

> > the whole concept of trying to handle this in a resource's start
> > action is a horrible substitute for a correctly configured cluster.
> > continuing down this path will only lead to pain.
> >
> > > 5) after recovering SplitBrain, the return code on standby node was
> > >    "-2"...
> > >    and crm_mon on the active node also showed it as -2.
> > >
> > > Why is it incremented?
> >
> > i'm not sure i follow this anymore... which return code are you talking
> > about?
> > if you're talking about the one from the start action, it is never
> > modified in any way
>
> the return code for "timed out" (maybe) became -2 after recovering from
> SplitBrain.
> It was -1 first.

how odd

> I gathered the log files with hb_report and attached them.
>
> build_operation_update() logged this:
>
> debug: build_operation_update: Calculated digest
> e68af41c5248ad5766285315f043c074 for prmDummy_start_0
> (2:-1;4:3:22520a1d-c026-4941-a403-717fc054c2c3)
>
> ...
>
> debug: build_operation_update: Calculated digest
> e68af41c5248ad5766285315f043c074 for prmDummy_start_0
> (2:-2;4:3:22520a1d-c026-4941-a403-717fc054c2c3)

in that case we're just using the value supplied by the lrm
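
to make the comparison of the two log lines above mechanical, here is a minimal sketch. it assumes the parenthesised string has the layout "a:b;c:d:uuid"; the field names (op_status, rc, action, transition) are labels chosen for illustration and are not confirmed anywhere in this thread. only the second field is treated as the return code, because that is the value reported as changing from -1 to -2.

```python
import re

# Assumed layout of the parenthesised string: "<a>:<b>;<c>:<d>:<uuid>".
# The field names are hypothetical labels for illustration only.
MAGIC_RE = re.compile(r"\((-?\d+):(-?\d+);(\d+):(\d+):([0-9a-f-]+)\)")


def parse_magic(line):
    """Extract the parenthesised fields from one build_operation_update line."""
    match = MAGIC_RE.search(line)
    if match is None:
        raise ValueError("no parenthesised field string found")
    op_status, rc, action, transition, uuid = match.groups()
    return {
        "op_status": int(op_status),    # assumed meaning
        "rc": int(rc),                  # the value reported as changing
        "action": int(action),          # assumed meaning
        "transition": int(transition),  # assumed meaning
        "uuid": uuid,
    }


def diff_fields(first, second):
    """Return only the fields whose values differ, as (first, second) pairs."""
    return {k: (first[k], second[k]) for k in first if first[k] != second[k]}


# The two log lines quoted above, with the mail line wrapping undone.
before = ("debug: build_operation_update: Calculated digest "
          "e68af41c5248ad5766285315f043c074 for prmDummy_start_0 "
          "(2:-1;4:3:22520a1d-c026-4941-a403-717fc054c2c3)")
after = ("debug: build_operation_update: Calculated digest "
         "e68af41c5248ad5766285315f043c074 for prmDummy_start_0 "
         "(2:-2;4:3:22520a1d-c026-4941-a403-717fc054c2c3)")

changed = diff_fields(parse_magic(before), parse_magic(after))
print(changed)
```

run against the two lines, only the second field differs; the digest and the rest of the string are identical, which fits the same operation being reported twice with a different value.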

