[Linux-HA] Re: Re:Re:Re:Problems with resources failing over and other little problems

Chris Gallo chrisagallo at gmail.com
Tue Sep 5 13:12:19 MDT 2006


On 9/5/06, Serge Dubrouski <sergeyfd at gmail.com> wrote:
> On 9/5/06, Chris Gallo <chrisagallo at gmail.com> wrote:
> > Alright, here is my new cib file http://isthesuck.com/cib.xml
>
> There is still something wrong with the nodes section. You shouldn't
> have 3 nodes there. Probably you need to remove hostcache file and
> restart heartbeat.

Yep, that fixed it. The hostcache was different on the 2 machines, and
removing both hostcache files seems to have worked.
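
For anyone hitting the same symptom, one way to confirm that the two
hostcache files have diverged is to copy both to one machine and compare
them line by line. This is only a rough sketch: the function name and the
sample contents are made up for illustration, and it treats the files as
opaque lines rather than assuming anything about the hostcache format.

```python
def hostcache_diff(text_a, text_b):
    """Return (only_in_a, only_in_b) as sorted lists of non-empty lines."""
    lines_a = {line.strip() for line in text_a.splitlines() if line.strip()}
    lines_b = {line.strip() for line in text_b.splitlines() if line.strip()}
    return sorted(lines_a - lines_b), sorted(lines_b - lines_a)


# Hypothetical contents, for illustration only (not real hostcache data).
node1_copy = "ldap-1.ev1servers.net aaa\nldap-2.ev1servers.net bbb\n"
node2_copy = "ldap-1.ev1servers.net aaa\nldap-2.ev1servers.net ccc\n"

only_1, only_2 = hostcache_diff(node1_copy, node2_copy)
print("only on node 1:", only_1)
print("only on node 2:", only_2)
```

If either list is non-empty, the files disagree, which matches what I saw
before removing them on both nodes and restarting heartbeat.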

> >
> > and my ha.cf has remained the same
> > > > Here it is, this is pretty much what was in the walkthrough.
> > > > debugfile /var/log/ha-debug
> > > > logfile /var/log/ha-log
> > > > logfacility syslog
> > > > keepalive 2
> > > > deadtime 7
> > > > warntime 8
> > > > initdead 15
mcast eth1 225.0.0.1 694 1 0
udpport 694
> > > > watchdog /dev/watchdog
> > > > node    ldap-1.ev1servers.net
> > > > node    ldap-2.ev1servers.net
> > > > crm yes
> >
>
> No need for ping here. It's not supported this way in 2.0.x

What would be a better way to make sure I have internet connectivity?
The guides are a little unclear on where 1.X and 2.X end and begin.

>
> >
> > > > >Third. Set different scores for rsc_location for different nodes. Node
> > > > >with the higher score will be primary node.
> > > >
> > > > Well, I wanted ldap to run on both nodes at once (so the database will
> > > > get updated on both), which is why it's the same for both nodes. For
> > > > the IP address, however, the primary is 100 and the secondary is 0, so
> > > > it should go back to the primary when that comes back up. That is not
> > > > what happens.
> > >
> > >  Take a look at clones: http://www.linux-ha.org/v2/Concepts/Clones
> >
> > I put that in the cib, however my problem continues. ldap1 starts up
> > fine and brings my resources up. But when I bring ldap2 up ldap2 just
> > sits there. This is all that ldap2 generates in the logs when it
> > starts up.
> >
> > heartbeat[22466]: 2006/09/05_10:16:13 info: Configuration validated.
> > Starting heartbeat 2.0.4
> > heartbeat[22467]: 2006/09/05_10:16:13 info: heartbeat: version 2.0.4
> > heartbeat[22467]: 2006/09/05_10:16:13 info: Heartbeat generation: 60
> > heartbeat[22467]: 2006/09/05_10:16:13 info: G_main_add_TriggerHandler:
> > Added signal manual handler
> > heartbeat[22467]: 2006/09/05_10:16:13 info: G_main_add_TriggerHandler:
> > Added signal manual handler
> > heartbeat[22467]: 2006/09/05_10:16:13 info: Removing
> > /var/run/heartbeat/rsctmp failed, recreating.
> > heartbeat[22467]: 2006/09/05_10:16:13 info: glib: Starting serial
> > heartbeat on tty /dev/ttyS0 (19200 baud)
> > heartbeat[22467]: 2006/09/05_10:16:13 info: glib: UDP Broadcast
> > heartbeat started on port 694 (694) interface eth1
> > heartbeat[22467]: 2006/09/05_10:16:13 info: glib: UDP Broadcast
> > heartbeat closed on port 694 interface eth1 - Status: 1
> > heartbeat[22467]: 2006/09/05_10:16:13 info: glib: ping heartbeat started.
> > heartbeat[22467]: 2006/09/05_10:16:13 ERROR: Cannot open watchdog
> > device: /dev/watchdog
> > heartbeat[22467]: 2006/09/05_10:16:13 info: G_main_add_SignalHandler:
> > Added signal handler for signal 17
> > heartbeat[22467]: 2006/09/05_10:16:13 info: Local status now set to: 'up'
> > heartbeat[22467]: 2006/09/05_10:16:13 info: Exiting
> > write_hostcachedata process 22477 returned rc 0.
> > heartbeat[22467]: 2006/09/05_10:16:14 info: Link
> > ldap-1.ev1servers.net:/dev/ttyS0 up.
> > heartbeat[22467]: 2006/09/05_10:16:14 info: Status update for node
> > ldap-1.ev1servers.net: status active
> > heartbeat[22467]: 2006/09/05_10:16:15 info: Link ldap-1.ev1servers.net:eth1 up.
> > heartbeat[22467]: 2006/09/05_10:16:15 info: Link
> > 207.218.204.193:207.218.204.193 up.
> > heartbeat[22467]: 2006/09/05_10:16:15 info: Status update for node
> > 207.218.204.193: status ping
> > heartbeat[22467]: 2006/09/05_10:16:15 info: Link ldap-2.ev1servers.net:eth1 up.
> >
> > and then it just waits for ldap1 to die or lose connection. My main
> > problem is: why doesn't ldap2 start anything up or read its config like
> > ldap1 does? One thing I have noticed is that when both nodes are up
> > and have been up for a while, the cib.xml file still shows
> > num_peers=1. Shouldn't this be 2?
>
> There were some problems with serial connections between HA nodes. I
> personally never used it. Could you switch to UDP, just for testing?

Doing that did fix the problem. I updated the ha.cf above. Also, I
noticed that having bcast and mcast on at the same time doesn't work. Is
this expected? Having just mcast seems to work fine though, so no
worries.
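
In case it helps someone else spot this combination in their own ha.cf,
here is a minimal sketch that lists interfaces carrying both a bcast and
an mcast directive. The function names are hypothetical, the "bcast eth1"
line in the sample is my reconstruction of the old config (the logs show
broadcast on eth1), and the check only reflects what I observed here, not
a documented heartbeat rule.

```python
def media_by_interface(ha_cf_text):
    """Map each interface to the set of media directives (bcast/mcast) using it."""
    media = {}
    for raw in ha_cf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "bcast":
            devices = fields[1:]              # bcast takes one or more devices
        elif fields[0] == "mcast":
            devices = fields[1:2]             # mcast: device is the first argument
        else:
            continue
        for dev in devices:
            media.setdefault(dev, set()).add(fields[0])
    return media


def conflicting_interfaces(ha_cf_text):
    """Return the interfaces that carry both bcast and mcast, sorted by name."""
    return sorted(dev for dev, kinds in media_by_interface(ha_cf_text).items()
                  if kinds == {"bcast", "mcast"})


# Sample config, partly reconstructed (see note above).
sample = """\
bcast eth1
mcast eth1 225.0.0.1 694 1 0
udpport 694
"""
print(conflicting_interfaces(sample))
```

With only the mcast line left, as in the updated ha.cf, the list comes
back empty.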


> >
> >
> >
> > Another concern, although not quite as important, is how would I go
> > about decreasing the time between these 2 log entries?
> > crmd[22491]: 2006/09/05_10:41:25 info: mask(utils.c:crm_timer_popped):
> > Wait Timer (I_NULL) just popped!
> > crmd[22491]: 2006/09/05_10:42:25 info: mask(utils.c:crm_timer_popped):
> > Election Trigger (I_DC_TIMEOUT) just popped!
> >
> > When the node starts up, it waits 60s after starting the HA
> > services before starting my services. I can't seem to find anything
> > on decreasing this time. Is it possible?
>
> No way for that.

Well, now that I got the nodes to talk to each other when they are
both up, this isn't so much of a problem, so no worries here either.
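
For the record, the 60s figure comes straight from the two crmd entries
quoted above. A small sketch of how to measure that kind of gap from the
logs, assuming the timestamp is always the second whitespace-separated
field in the YYYY/MM/DD_HH:MM:SS form shown (the helper name is mine):

```python
from datetime import datetime


def log_time(line):
    """Extract the timestamp from a heartbeat-style log line."""
    # Lines look like: "crmd[22491]: 2006/09/05_10:41:25 info: ..."
    stamp = line.split()[1]
    return datetime.strptime(stamp, "%Y/%m/%d_%H:%M:%S")


first = ("crmd[22491]: 2006/09/05_10:41:25 info: "
         "mask(utils.c:crm_timer_popped): Wait Timer (I_NULL) just popped!")
second = ("crmd[22491]: 2006/09/05_10:42:25 info: "
          "mask(utils.c:crm_timer_popped): Election Trigger (I_DC_TIMEOUT) just popped!")

gap = log_time(second) - log_time(first)
print(gap.total_seconds())  # seconds between the two entries
```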

So right now I am stress testing it to hell and back to make sure
everything works like I expect. So you might be hearing back from me
with more questions :)

Thanks for all the help.

-Chris

