[Linux-HA] Questions on crm_resource -C

Andrew Beekhof beekhof at gmail.com
Thu Feb 1 07:45:26 MST 2007


On 2/1/07, Pavol Gono <palo.gono at gmail.com> wrote:
> On 2/1/07, Andrew Beekhof <beekhof at gmail.com> wrote:
> > On 1/30/07, Pavol Gono <palo.gono at gmail.com> wrote:
> > > Hi
> > >
> > > I was curious why execution of "crm_resource -C" takes at least 10
> > > seconds. I found two sleep(5) calls in crm/admin/crm_resource.c. How
> > > was this time determined?
> >
> > not very scientifically, unfortunately
> >
> > > Is it safe to decrease it, e.g. to sleep(1),
> > > when I have a fast network connection and powerful machines?
> >
> > you can always try :-)
>
> of course I can try :) What is the worst that can happen if
> the time is too short? Is the sleep used because you don't have
> bidirectional communication between processes? Or because of
> the possibility of race conditions?

a non-destructive race condition.  basically we'd run the PE (policy
engine) with the old data, so it would look like nothing had changed.
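
For illustration only: a fixed sleep is one way to dodge this kind of
race; a generic alternative is to poll for a condition with a timeout.
The sketch below is NOT what crm_resource.c does, and wait_until is a
hypothetical helper name. It assumes you have some command that
succeeds once the updated state is visible.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of heartbeat/crm): re-runs COMMAND about
# once per second until it succeeds or TIMEOUT_SECONDS have elapsed.
# Returns 0 if COMMAND succeeded, 1 on timeout.
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while ! "$@"; do
        # Give up once the time budget is used.
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

It would be called as "wait_until 10 some_check", where some_check is
whatever test you trust to tell you the new data has arrived. If the
check is wrong, you are back to the same race.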

>
> >
> > >
> > > The next question is about the possibility of running several
> > > crm_resource -C commands in parallel.
> > > My handling of a failure of my important resource is:
> > > - custom monitoring process detects internal failure of resource
> > > - crm_standby -v on -U node
> > > - wait till heartbeat stops all our resources
> > > (now machine runs no resources and I want to cleanup everything)
> > > - custom restarting of services
> >
> > BAD BAD BAD
> >
> > you've just guaranteed that the resource(s) are active on multiple nodes
> >
> >
> > what do you want done?  ensure that the resource is restarted when you
> > detect a failure?
> >
>
> ok, I didn't write it correctly...
> Let's say I have complex software behind one resource agent. The
> software has STOPPED and STARTED states. After a failure it is wise to
> restart this software from scratch to be sure. Of course, only while
> the related resource agent is "stopped" (this is the reason why I want
> to have the node in standby mode while doing all the related work)

wouldn't you need the whole cluster in standby mode?  otherwise one of
the other machines will still have it running while you verify
everything is ok...

>
> >
> > > - for all resources: crm_resource -C -H node -r res
> >
> >  crm_resource -C will only have a noticeable effect if a resource
> > failed to start when we asked it to.  so from what i can tell of your
> > scenario, this is having no effect.
>
> Let's say it is possible that an RA failed to start. I want to be sure
> that this RA has a chance to start again after cleanup.
>
> A restart of the whole machine would be better, but it is too slow and
> in many cases not necessary.
>
> >
> > > - for all resources: crm_failcount -D -U node -r res
> > > - crm_standby -D -U node
> > >
> > > When doing all of this sequentially, it takes too much time: number of
> > > resources x 10sec (crm_failcount and crm_standby are fast enough).
> > > Is it safe to run the crm_resource and crm_failcount commands in parallel? E.g.
> > > crm_resource -C -H node -r res1 & crm_resource -C -H node -r res2 &
> > > crm_resource -C -H node -r res3 & crm_failcount -D -U node -r res1 &
> > > crm_failcount -D -U node -r res2 & crm_failcount -D -U node -r res3 &
>
> I tried running several crm_resource commands in parallel, and sometimes
> nonfatal asserts of crmd appear (triggered by crm_resource). I am currently
> analyzing what exactly happened. So the answer to the above question is
> still missing :)
>
> Palo
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
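
For reference, the sequential variant of the cleanup steps quoted above
could be scripted roughly as below. This is only a sketch: the
crm_resource, crm_failcount and crm_standby invocations are copied from
this thread, while cleanup_node, run and DRY_RUN are made-up names. It
runs one command at a time on purpose, since the parallel runs are what
triggered the crmd asserts. By default it only prints what it would run.

```shell
#!/bin/sh
# DRY_RUN=1 (default in this sketch) prints commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# run COMMAND [ARGS...]: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# cleanup_node NODE RES [RES...]
# Same order as in the thread: clean up every resource, reset every
# failcount, then take the node out of standby.
cleanup_node() {
    node=$1
    shift
    for res in "$@"; do
        run crm_resource -C -H "$node" -r "$res"
    done
    for res in "$@"; do
        run crm_failcount -D -U "$node" -r "$res"
    done
    run crm_standby -D -U "$node"
}
```

Example: "cleanup_node node1 res1 res2 res3" prints the seven commands
in order. Keep in mind that crm_resource -C will only have a noticeable
effect for resources that failed to start.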

