[Linux-HA] failcount for master/slave resource
Andrew Beekhof
beekhof at gmail.com
Thu Apr 24 01:21:17 MDT 2008
On Tue, Apr 22, 2008 at 4:01 AM, Junko IKEDA <ikedaj at intellilink.co.jp> wrote:
> > > I have one master/slave resource.
> > > (Heartbeat 2.2.0 + Pacemaker 0.6.2)
> > >
> > > Master/Slave Set: ms-sf
> > > stateful-1:0 (ocf::heartbeat:Stateful):Master node-b
> > > stateful-1:1 (ocf::heartbeat:Stateful):Started node-a
> > >
> > > If stateful-1:0 fails, crm_mon shows this:
> > >
> > > Master/Slave Set: ms-sf
> > > stateful-1:0 (ocf::heartbeat:Stateful):Stopped
> > > stateful-1:1 (ocf::heartbeat:Stateful):Master node-a
> > >
> > > Failed actions:
> > > stateful-1:0_demote_0 (node=node-b, call=7, rc=7): complete
> > >
> > > I tried to clear the failcount of stateful-1:0 with crm_failcount.
> >
> > That doesn't remove the failed operation though... only the counter
> > which tracks how many times the resource failed.
> >
> > Perhaps try crm_resource -C
>
> ok, I tried this.
>
> (1) run the resource
>
>
> Master/Slave Set: ms-sf
> stateful-1:0 (ocf::heartbeat:Stateful):Master node-b
> stateful-1:1 (ocf::heartbeat:Stateful):Started node-a
>
>
> (2) break master resource
>
> # rm -f /var/run/heartbeat/rsctmp/Stateful-stateful-1\:0.state
>
>
> Master/Slave Set: ms-sf
> stateful-1:0 (ocf::heartbeat:Stateful):Stopped
> stateful-1:1 (ocf::heartbeat:Stateful):Master node-a
>
> Failed actions:
> stateful-1:0_demote_0 (node=node-b, call=7, rc=7): complete
>
>
> (3) clear master resource
>
> # crm_resource -C -r stateful-1:0 -H node-b
>
>
> Master/Slave Set: ms-sf
> stateful-1:0 (ocf::heartbeat:Stateful):Stopped
> stateful-1:1 (ocf::heartbeat:Stateful):Master node-a
>
>
> (4) reset the failcount to "0"
>
>
> # crm_failcount -r stateful-1:0 -U node-b -D
>
>
> Master/Slave Set: ms-sf
> stateful-1:0 (ocf::heartbeat:Stateful):Master node-b
> stateful-1:1 (ocf::heartbeat:Stateful):Stopped
>
>
> node-b became master again,
> but stateful-1:1 on node-a stopped instead of becoming a slave (status Started).
>
> At this point, a failure has been counted for stateful-1:1 on node-a.
>
> # cibadmin -Q | grep fail-count
> <nvpair
> id="status-c53511b5-7568-426e-bbd5-f258e24aa9ac-fail-count-stateful-1:1"
> name="fail-count-stateful-1:1" value="1"/>
>
> Does this failure need to be counted?
"sort of"
Given what happened, it is correct that the failcount was incremented.
The problem is that what happened was incorrect... the monitor-1s op
was being executed _before_ the instance was being demoted (which is
clearly wrong).
Fixed in: http://hg.clusterlabs.org/pacemaker/stable-0.6/rev/e105f4e7a3cf
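
For reference, the cleanup sequence from steps (3) and (4) above, plus the
fail-count check, could be wrapped roughly as follows. This is only a sketch:
the function names cleanup_failed_instance and list_failcounts and the DRY_RUN
variable are made up for illustration, and the crm_resource / crm_failcount
flags are copied as typed in this thread (Heartbeat 2.2.0 + Pacemaker 0.6.2),
so they may differ on other versions. By default the commands are only
printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: replay the cleanup used in this thread for one
# clone instance on one node. With DRY_RUN=1 (the default here) the
# commands are echoed instead of being run.
cleanup_failed_instance() {
    rsc="$1"     # e.g. stateful-1:0
    node="$2"    # e.g. node-b
    run=""
    if [ "${DRY_RUN:-1}" = "1" ]; then
        run="echo"
    fi
    # Step (3): remove the failed operation from the status section.
    $run crm_resource -C -r "$rsc" -H "$node"
    # Step (4): delete the fail-count attribute for that resource/node.
    $run crm_failcount -r "$rsc" -U "$node" -D
}

# Hypothetical helper: read CIB XML on stdin and print one
# "<resource> <count>" pair per fail-count nvpair. Assumes the name and
# value attributes sit on the same line, as in the output quoted above.
list_failcounts() {
    sed -n 's/.*name="fail-count-\([^"]*\)" value="\([^"]*\)".*/\1 \2/p'
}
```

Typical use would be "cleanup_failed_instance stateful-1:0 node-b" to preview
the commands (DRY_RUN=0 to actually run them), followed by
"cibadmin -Q | list_failcounts" to see which instances still carry a count.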