[Linux-HA] Re: nodes won't auto_failback after network failure

sebastien lorandel lorandel.sebastien at gmail.com
Mon Sep 3 04:27:22 MDT 2007


OK,

now I don't get this error anymore, but services still don't fail back to my
node (ha2). And even when my other node (ha1) is shut down, they don't. So I
don't have any services running in my cluster anymore...

- After I reconnect the cable, the cluster notices it and reports that both
nodes are able to run services:

Sep  3 12:01:13 ha1 crmd: [20955]: info: do_state_transition: All 2 cluster
nodes are eligible to run resources.

- But then it can't make the resources run again on the node that failed:

Sep  3 12:01:13 ha1 pengine: [22141]: info: determine_online_status: Node
ha2 is online
Sep  3 12:01:13 ha1 pengine: [22141]: WARN: unpack_rsc_op: Processing failed
op (IPaddr_start_0) on ha2
Sep  3 12:01:13 ha1 pengine: [22141]: WARN: unpack_rsc_op: Handling failed
start for IPaddr on ha2
Sep  3 12:01:13 ha1 pengine: [22141]: WARN: unpack_rsc_op: Processing failed
op (IPaddr_monitor_5000) on ha2
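
By the way, I suspect the CRM keeps that failed start for IPaddr on ha2 in the
status section and refuses to place the resource there until it is cleaned up.
If I understand the tools right, something like this should clear it (I'm not
100% sure of the exact crm_resource options on this heartbeat version; IPaddr
and ha2 are the resource and node from my logs):

  # clear the failed IPaddr operations recorded for ha2
  crm_resource -C -r IPaddr -H ha2

Is that the right way to do it, or should the cluster clear this by itself
once the network comes back?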

- Then, when I stop ha1, which was running my resources, it says ha2 is
eligible:

Sep  3 12:17:42 ha2 crmd: [7347]: info: do_state_transition: All 1 cluster
nodes are eligible to run resources.

- And then...

Sep  3 12:17:42 ha2 pengine: [8293]: info: determine_online_status: Node ha2
is online
Sep  3 12:17:42 ha2 pengine: [8293]: WARN: unpack_rsc_op: Processing failed
op (IPaddr_start_0) on ha2
Sep  3 12:17:42 ha2 pengine: [8293]: WARN: unpack_rsc_op: Handling failed
start for IPaddr on ha2
Sep  3 12:17:42 ha2 pengine: [8293]: WARN: unpack_rsc_location: No resource
(con=my_resource:connected, rsc=my_resource)
Sep  3 12:17:42 ha2 pengine: [8293]: info: group_print: Resource Group:
group1
Sep  3 12:17:42 ha2 pengine: [8293]: info: native_print:     IPaddr
(heartbeat::ocf:IPaddr):        Stopped
Sep  3 12:17:42 ha2 pengine: [8293]: info: native_print:     sshd
(lsb:sshd):     Stopped
Sep  3 12:17:42 ha2 pengine: [8293]: info: clone_print: Clone Set: pingd
Sep  3 12:17:42 ha2 pengine: [8293]: info: native_print:
pingd-child:0      (heartbeat::ocf:pingd): Started ha2
Sep  3 12:17:42 ha2 pengine: [8293]: info: native_print:
pingd-child:1      (heartbeat::ocf:pingd): Stopped
Sep  3 12:17:42 ha2 pengine: [8293]: info: native_color: Combine scores from
sshd and IPaddr
Sep  3 12:17:42 ha2 pengine: [8293]: WARN: native_color: Resource IPaddr
cannot run anywhere
Sep  3 12:17:42 ha2 pengine: [8293]: WARN: native_color: Resource sshd
cannot run anywhere
Sep  3 12:17:42 ha2 pengine: [8293]: WARN: native_color: Resource
pingd-child:1 cannot run anywhere


So my cluster just keeps running pingd on my node...
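
Side note: the unpack_rsc_location warning above makes me think I still have
the example pingd constraint in my cib.xml, referring to "my_resource", which
doesn't exist in my configuration. Maybe it should point at my group instead,
roughly like this (just my guess at the right rsc_location rule, with group1
substituted for my_resource):

<rsc_location id="group1:connected" rsc="group1">
  <rule id="group1:connected:rule" score="-INFINITY" boolean_op="or">
    <expression id="group1:connected:expr:undefined" attribute="pingd" operation="not_defined"/>
    <expression id="group1:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>

I can dump what is actually configured with "cibadmin -Q -o constraints" to
check.
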
Does anybody have an idea?

I would greatly appreciate it :), thanks.
Sebastien.

On 8/31/07, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
>
> Hi,
>
> On Fri, Aug 31, 2007 at 03:42:47PM +0200, sebastien lorandel wrote:
> > Hi, I tried to catch your attention again :)
> >
> >
> > While looking at my logs, I saw these lines:
> >
> > Aug 30 11:35:23 ha1 heartbeat: [5927]: WARN: duplicate client add request
> > [pingd] [6467]
> > Aug 30 11:35:23 ha1 heartbeat: [5927]: ERROR: api_process_registration_msg:
> > cannot add client()
> >
> > This occurs after I unplugged eth0 from node2; all resources were restarted
> > on ha1 and I get these errors.
> >
> > Node1, which should restart all the resources, ends up like this, and then
> > it can't start the resources ("cannot run anywhere"):
> >
> > Aug 31 15:29:07 ha2 tengine: [8507]: info: notify_crmd: Transition 10
> > status: te_complete - <null>
> > Aug 31 15:29:07 ha2 pengine: [8508]: info: native_color: Combine scores from
> > sshd and IPaddr
> > Aug 31 15:29:07 ha2 pengine: [8508]: WARN: native_color: Resource IPaddr
> > cannot run anywhere
> > Aug 31 15:29:07 ha2 pengine: [8508]: WARN: native_color: Resource sshd
> > cannot run anywhere
> > Aug 31 15:29:07 ha2 pengine: [8508]: WARN: native_color: Resource
> > pingd-child:1 cannot run anywhere
> >
> > I've attached my ha.cf and my resource declaration from cib.xml:
> > (I modified ha.cf a bit, and I removed my nodes from the ping line.)
> > *********************************************************************
> > debugfile /var/log/ha/ha-debug
> > logfile /var/log/ha/ha-log
> >
> > node ha1
> > node ha2
> > use_logd on
> > udpport 694
> > keepalive 500ms # 0.5 seconds
> > deadtime 5
> > initdead 80
> > bcast eth1 #eth0
> > crm yes
> > auto_failback yes
> >
> > ping_group hb1 hb2 server 10.0.0.1
> > respawn root /usr/lib64/heartbeat/pingd -m 100 -d 2s
> > apiauth default uid=root # make sure we can run cluster control commands
> > as root
> >
> > ***************************************************************************
> > <clone id="pingd">
> >   <instance_attributes id="pingd">
> >     <attributes>
> >       <nvpair id="pingd-clone_node_max" name="clone_node_max" value="1"/>
> >     </attributes>
> >   </instance_attributes>
> >   <primitive id="pingd-child" provider="heartbeat" class="ocf" type="pingd">
> >     <operations>
> >       <op id="pingd-child-monitor" name="monitor" interval="20s" timeout="40s" prereq="nothing"/>
> >       <op id="pingd-child-start" name="start" prereq="nothing"/>
> >     </operations>
> >     <instance_attributes id="pingd_inst_attr">
> >       <attributes>
> >         <nvpair id="pingd-dampen" name="dampen" value="5s"/>
> >         <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
> >       </attributes>
> >     </instance_attributes>
> >   </primitive>
> > </clone>
> > ************************************************************************
>
> You need only one: either the ha.cf respawn pingd directive or
> the pingd resource in the cib.
>
> Dejan
>
> > Please, someone help me :) I've tried so many different configurations...
> > hope someone has an idea.
> > Sébastien.
> >
> > I searched and saw somebody who had the same error with cl_status.
> >
> > On 8/31/07, sebastien lorandel < lorandel.sebastien at gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I have a 2-node cluster with eth1 as the heartbeat connection between the
> > > nodes and the eth0 interfaces connected to my clients. I set up ping
> > > monitoring (pingd) to test my network connection. I also declared
> > > auto_failback=yes in ha.cf, so that when my node comes up after a failure,
> > > it gets its resources back (it is working, I tested it).
> > >
> > > In ha.cf it is configured like this:
> > >
> > > respawn root /usr/lib64/heartbeat/pingd -m 100 -d 5s
> > > ping_group ping_nodes 10.0.0.210 10.0.0.211
> > > (where 10.0.0.210 and 10.0.0.211 are my nodes... not sure this is the right
> > > way to use pingd, but it works; I should probably define other servers, no?)
> > >
> > > When a network failure occurs on eth0 on node1, all services on node1
> > > are stopped and restarted on node2, OK.
> > > But then, when the network failure is repaired and node1 comes back into
> > > the network, the resources don't fail back. Why not? The node has rejoined
> > > the cluster and everything is fine.
> > > But when I restart the node, everything is OK... Does it mean I need
> > > STONITH for such failures?
> > >
> > > I also get the same behaviour when eth1, the heartbeat connection, fails
> > > (yes, I know there is no redundancy, but these are tests :) ).
> > >
> > > So my question is: is this behaviour normal?
> > > Thanks in advance.
> > > --
> > > Sébastien Lorandel
> >
> >
> >
> >
> > --
> > Sébastien Lorandel
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA at lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>



-- 
Sébastien Lorandel
IBM Deutschland Entwicklung


