[Linux-HA] Failure to start resource makes it impossible to fail back

Andrew Beekhof beekhof at gmail.com
Tue Nov 13 09:57:58 MST 2007


On Nov 13, 2007, at 5:17 PM, Anders Brownworth wrote:

> Thanks for the quick response, Andrew.
>
> 'crm_resource -C -r OpenSer' seems to work, but I do get an error
> saying that last-lrm-refresh could not be set:
>
> Nov 13 14:00:12 box01 crm_resource: [11391]: ERROR:  
> cib_native_perform_op: Call failed: The object/attribute does not  
> exist
> Nov 13 14:00:12 box01 crm_resource: [11391]: ERROR: update_attr:  
> Error setting last-lrm-refresh=1194962406 (section=crm_config,  
> set=cib-bootstrap-options): The object/attribute does not exist

This shouldn't be important.
What version are you running?

> The resource does, however, fail back when I do that AND set the  
> fail-count to 0 on the primary and backup.
>
> But the resource won't fail back unless fail-count is defined on the  
> backup. The fail-count is initially undefined:
>
> (box01:~) # crm_failcount -G -r OpenSer -U box02
> name=fail-count-OpenSer value=(null)
> Error performing operation: The object/attribute does not exist
>
> Because the service previously failed to start on the primary (box01),
> the fail-count is defined there. Once I define the fail-count on the
> backup (box02)
>
> (box01:~) # crm_failcount -v 0 -r OpenSer -U box02
> (box01:~) # crm_failcount -G -r OpenSer -U box02
> name=fail-count-OpenSer value=0
>
> it migrates back as expected.

That's really weird (and looks like a bug).
Can you try with a later version?

Unless what the update contains isn't important, only that there is
one, so that the TE (transition engine) gets triggered and does the
migration.

That's what the "last-lrm-refresh" update in the log above is supposed
to be doing. If it isn't working, that could cause this kind of behavior.
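
The workaround described in the quoted message (set the fail-count to 0 on each node, then clean up the resource) could be scripted roughly as below. This is only a sketch: the function name reset_failcounts is made up for illustration, and it relies solely on the crm_failcount and crm_resource options already shown in this thread. Whether it is needed at all on a later version is an open question.

```shell
#!/bin/sh
# Sketch only: reset the fail-count for a resource on every listed node,
# then clean up the resource so the cluster re-evaluates its placement.
# Usage: reset_failcounts RESOURCE NODE [NODE...]
# (reset_failcounts is a hypothetical helper, not part of Linux-HA.)
reset_failcounts() {
    rsc=$1
    shift
    for node in "$@"; do
        # Set the count to 0 explicitly instead of deleting it (-D),
        # so that a later query (-G) still returns a value.
        crm_failcount -v 0 -U "$node" -r "$rsc" || return 1
    done
    # Clear the failed state recorded for the resource.
    crm_resource -C -r "$rsc"
}
```

Called as 'reset_failcounts OpenSer box01 box02', this runs the same three commands used by hand above, stopping early if any crm_failcount call fails.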

> I suppose I should add a "set fail-count to 0" for both box01 and  
> box02 in my startup scripts so merely doing a 'crm_resource -C -r  
> OpenSer' migrates the service back after the initial failure.
>
> Is there a better way to be doing this?
>
> -Anders
>
> Andrew Beekhof wrote:
>> Prior to the latest interim build, start failures were always fatal
>> and required the use of crm_resource -C to make the node eligible again.
>>
>> As of the last interim release, just make sure
>> start-failure-is-fatal=false and use crm_failcount as you have below
>> for "normal" failures.
>>
>>> Additionally, I followed the advice under "Resetting Failure  
>>> Counts" in the V2 FAQ ( http://linux-ha.org/v2/faq ) where it  
>>> suggests:
>>>
>>> crm_failcount -D -U nodeA -r my_rsc
>>>
>>> Rather than reset the failure count, this just torches it in such  
>>> a way that you can't even read it with the query command given in  
>>> the next step of the same example. I found statically setting the  
>>> count back to 0 with:
>>>
>>> crm_failcount -v 0 -U box01 -r OpenSer
>>>
>>> worked much better and allowed me to push resources back and forth  
>>> just by moving the fail count up and down.
>>>
>> _______________________________________________
>> Linux-HA mailing list
>> Linux-HA at lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems


