[Linux-HA] Re: nodes won't auto_failback after network failure

Andrew Beekhof beekhof at gmail.com
Wed Sep 12 00:38:24 MDT 2007


On 9/3/07, sebastien lorandel <lorandel.sebastien at gmail.com> wrote:
> Ok,
>
> Now I don't have this error anymore, but the services still don't fail back to
> my node (ha2). And even when my other node (ha1) is shut down, they don't. So I
> don't have any services running in my cluster anymore...
>
> - After I reconnect the cable, the cluster sees it and says that both nodes
> are able to run services:
>
> Sep  3 12:01:13 ha1 crmd: [20955]: info: do_state_transition: All 2 cluster
> nodes are eligible to run resources.
>
> - But then it can't make them run again on the node that failed:
>
> Sep  3 12:01:13 ha1 pengine: [22141]: info: determine_online_status: Node
> ha2 is online
> Sep  3 12:01:13 ha1 pengine: [22141]: WARN: unpack_rsc_op: Processing failed
> op (IPaddr_start_0) on ha2
> Sep  3 12:01:13 ha1 pengine: [22141]: WARN: unpack_rsc_op: Handling failed
> start for IPaddr on ha2
> Sep  3 12:01:13 ha1 pengine: [22141]: WARN: unpack_rsc_op: Processing failed
> op (IPaddr_monitor_5000) on ha2

Failed start actions are always fatal: that resource cannot run on that
node again until you fix the underlying problem and clear the failure with
crm_resource -C.
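
For example, something along these lines - just a sketch, using the resource
id (IPaddr) and node name (ha2) from your logs; check crm_resource's help
output for the exact option names in your version:

  crm_resource -C -r IPaddr -H ha2

That clears the failed-op record for IPaddr on ha2, after which the policy
engine will consider running the resource on that node again.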


> - Then when I stop ha1, which was running my resources, it says ha2 is
> eligible:
>
> Sep  3 12:17:42 ha2 crmd: [7347]: info: do_state_transition: All 1 cluster
> nodes are eligible to run resources.

This is a higher-level "is it allowed to run any resources at all" statement;
it does not reflect a specific node's ability to run a specific resource.

> - And then...
>
> Sep  3 12:17:42 ha2 pengine: [8293]: info: determine_online_status: Node ha2
> is online
> Sep  3 12:17:42 ha2 pengine: [8293]: WARN: unpack_rsc_op: Processing failed
> op (IPaddr_start_0) on ha2
> Sep  3 12:17:42 ha2 pengine: [8293]: WARN: unpack_rsc_op: Handling failed
> start for IPaddr on ha2
> Sep  3 12:17:42 ha2 pengine: [8293]: WARN: unpack_rsc_location: No resource
> (con=my_resource:connected, rsc=my_resource)
> Sep  3 12:17:42 ha2 pengine: [8293]: info: group_print: Resource Group:
> group1
> Sep  3 12:17:42 ha2 pengine: [8293]: info: native_print:     IPaddr
> (heartbeat::ocf:IPaddr):        Stopped
> Sep  3 12:17:42 ha2 pengine: [8293]: info: native_print:     sshd
> (lsb:sshd):     Stopped
> Sep  3 12:17:42 ha2 pengine: [8293]: info: clone_print: Clone Set: pingd
> Sep  3 12:17:42 ha2 pengine: [8293]: info: native_print:
> pingd-child:0      (heartbeat::ocf:pingd): Started ha2
> Sep  3 12:17:42 ha2 pengine: [8293]: info: native_print:
> pingd-child:1      (heartbeat::ocf:pingd): Stopped
> Sep  3 12:17:42 ha2 pengine: [8293]: info: native_color: Combine scores from
> sshd and IPaddr
> Sep  3 12:17:42 ha2 pengine: [8293]: WARN: native_color: Resource IPaddr
> cannot run anywhere
> Sep  3 12:17:42 ha2 pengine: [8293]: WARN: native_color: Resource sshd
> cannot run anywhere
> Sep  3 12:17:42 ha2 pengine: [8293]: WARN: native_color: Resource
> pingd-child:1 cannot run anywhere
>
>
> So my cluster just keeps running pingd on my node...
> Does anybody have an idea?
>
> I would greatly appreciate it :), thanks.
> Sebastien.
>
> On 8/31/07, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> >
> > Hi,
> >
> > On Fri, Aug 31, 2007 at 03:42:47PM +0200, sebastien lorandel wrote:
> > > Hi, I'm trying to catch your attention again :)
> > >
> > >
> > > While looking at my logs, I saw these lines:
> > >
> > > Aug 30 11:35:23 ha1 heartbeat: [5927]: WARN: duplicate client add request
> > > [pingd] [6467]
> > > Aug 30 11:35:23 ha1 heartbeat: [5927]: ERROR: api_process_registration_msg:
> > > cannot add client()
> > >
> > > This occurs after I unplugged eth0 from node2; all resources were restarted
> > > on ha1 and I got these errors.
> > >
> > > Node1, which should restart all the resources, ends up like this, and then it
> > > can't start the resources ("cannot run anywhere"):
> > >
> > > Aug 31 15:29:07 ha2 tengine: [8507]: info: notify_crmd: Transition 10
> > > status: te_complete - <null>
> > > Aug 31 15:29:07 ha2 pengine: [8508]: info: native_color: Combine scores from
> > > sshd and IPaddr
> > > Aug 31 15:29:07 ha2 pengine: [8508]: WARN: native_color: Resource IPaddr
> > > cannot run anywhere
> > > Aug 31 15:29:07 ha2 pengine: [8508]: WARN: native_color: Resource sshd
> > > cannot run anywhere
> > > Aug 31 15:29:07 ha2 pengine: [8508]: WARN: native_color: Resource
> > > pingd-child:1 cannot run anywhere
> > >
> > > I attach my ha.cf and my resource declaration from cib.xml:
> > > (I modified ha.cf a bit, and I removed my nodes from the ping line)
> > > *********************************************************************
> > > debugfile /var/log/ha/ha-debug
> > > logfile /var/log/ha/ha-log
> > >
> > > node ha1
> > > node ha2
> > > use_logd on
> > > udpport 694
> > > keepalive 500ms # 1 second
> > > deadtime 5
> > > initdead 80
> > > bcast eth1 #eth0
> > > crm yes
> > > auto_failback yes
> > >
> > > ping_group hb1 hb2 server 10.0.0.1
> > > respawn root /usr/lib64/heartbeat/pingd -m 100 -d 2s
> > > apiauth default uid=root # make sure we can run cluster control commands as root
> > >
> > ***************************************************************************
> > >   <clone id="pingd">
> > >     <instance_attributes id="pingd">
> > >       <attributes>
> > >         <nvpair id="pingd-clone_node_max" name="clone_node_max" value="1"/>
> > >       </attributes>
> > >     </instance_attributes>
> > >     <primitive id="pingd-child" provider="heartbeat" class="ocf" type="pingd">
> > >       <operations>
> > >         <op id="pingd-child-monitor" name="monitor" interval="20s" timeout="40s" prereq="nothing"/>
> > >         <op id="pingd-child-start" name="start" prereq="nothing"/>
> > >       </operations>
> > >       <instance_attributes id="pingd_inst_attr">
> > >         <attributes>
> > >           <nvpair id="pingd-dampen" name="dampen" value="5s"/>
> > >           <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
> > >         </attributes>
> > >       </instance_attributes>
> > >     </primitive>
> > >   </clone>
> > > ************************************************************************
> >
> > You need only one: either the ha.cf respawn pingd directive or
> > the pingd resource in the cib.
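
One way to follow that advice (just a sketch, based on the ha.cf you posted)
is to keep the pingd clone in the cib and comment out the respawn line in
ha.cf:

  #respawn root /usr/lib64/heartbeat/pingd -m 100 -d 2s

The ping_group line can stay either way, since pingd takes its list of ping
targets from heartbeat in both setups.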
> >
> > Dejan
> >
> > > Please, someone help me :) I have tried so many different configurations...
> > > I hope someone has an idea.
> > > Sébastien.
> > >
> > > I searched and saw somebody who had the same error for cl_status.
> > >
> > > On 8/31/07, sebastien lorandel < lorandel.sebastien at gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I have a 2-node cluster with eth1 as the heartbeat connection between the
> > > > nodes and eth0 interfaces connected to my clients. I installed ping to test
> > > > my network connection. I also declared auto_failback=yes in ha.cf, so that
> > > > when my node comes up after a failure, it gets its resources back (this is
> > > > working, I tested it).
> > > >
> > > > In ha.cf it is configured like this:
> > > >
> > > > respawn root /usr/lib64/heartbeat/pingd -m 100 -d 5s
> > > > ping_group ping_nodes 10.0.0.210 10.0.0.211
> > > > (where 10.0.0.210 and 10.0.0.211 are my nodes... I'm not sure this is the
> > > > right way to use pingd, but it works; I'd be better off defining other
> > > > servers, no?)
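
Pinging the cluster nodes themselves mostly tells you about the path between
the nodes; to detect loss of the client-facing network it is generally better
to ping something outside the cluster, such as your default gateway. A sketch,
with a hypothetical router address (adjust the group name and address to your
network):

  ping_group client_net 10.0.0.254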
> > > >
> > > > When a network failure occurs over eth0 on node1, all services on node1
> > > > are stopped and restarted on node2, OK.
> > > > But then, when the network failure is repaired and node1 comes back onto the
> > > > network, the resources don't fail back. Why not? The node has rejoined the
> > > > cluster and everything looks fine.
> > > > But when I restart the node, everything is OK... Does that mean I need
> > > > STONITH for such failures?
> > > >
> > > > I also get the same behaviour when eth1, the heartbeat connection, fails
> > > > (yes, I know there is no redundancy, but these are tests :) ).
> > > >
> > > > So my question is: is this behaviour normal?
> > > > Thanks in advance.
> > > > --
> > > > Sébastien Lorandel
> > >
> > >
> > >
> > >
> > > --
> > > Sébastien Lorandel
> >
>
>
>
> --
> Sébastien Lorandel
> IBM Deutschland Entwicklung
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>


