[Linux-HA] How to troubleshoot: No reply to standby request. Standby request cancelled.

Keith Christian keith1christian at gmail.com
Thu Jul 2 09:57:39 MDT 2009


I'm in the testing phase of building a HA cluster using Heartbeat and DRBD.

Both nodes are running the same configuration files, e.g. drbd.conf,
ha.cf, haresources.

The two nodes are connected with dual heartbeat connections:
        server_one, eth1 192.168.1.136---> server_two, eth1 192.168.1.137
        server_one, eth3 192.168.2.136---> server_two, eth3 192.168.2.137


Heartbeat is able to ping the gateway from both servers:
        Jul  1 17:43:22 server_one heartbeat: [21192]: info: Link
www.xxx.yyy.129:www.xxx.yyy.129 up.
        Jul  1 17:43:22 server_one heartbeat: [21192]: info: Status
update for node www.xxx.yyy.129: status ping
        Jul  1 17:33:18 server_two heartbeat: [19332]: info: Link
www.xxx.yyy.129:www.xxx.yyy.129 up.
        Jul  1 17:33:18 server_two heartbeat: [19332]: info: Status
update for node www.xxx.yyy.129: status ping


Both nodes are running the same version of the software:
	[root at server_one ~]# rpm -qa | grep -i heart
	heartbeat-stonith-2.1.3-3.el5.centos
	heartbeat-2.1.3-3.el5.centos
	heartbeat-pils-2.1.3-3.el5.centos

	[root at server_one ~]# rpm -qa | grep -i drbd
	drbd82-8.2.6-1.el5.centos
	kmod-drbd82-8.2.6-2

	[root at server_one ~]# service heartbeat status
	service heartbeat status
	heartbeat OK [pid 21192 et al] is running on server_one [server_one]...

	[root at server_two ~]# service heartbeat status
	heartbeat OK [pid 19332 et al] is running on server_two [server_two]...


When testing failover, I issue the command "service heartbeat standby"
but, failover does not occur and the message "Standby request
cancelled" appears in the logs of both servers:

Tell server_one to standby:
	[root at server_one ~]# service heartbeat standby
	service heartbeat standby
	auto_failback: off
	Attempting to enter standby mode
	2009/07/02_10:54:38 Going standby [all].

	Jul  2 10:54:38 server_one heartbeat: [21192]: info: server_one wants
to go standby [all]
	Jul  2 10:54:48 server_one heartbeat: [21192]: WARN: No reply to
standby request.  Standby request cancelled.


Tell server_two to standby:
	[root at server_two ~]# service heartbeat standby
	auto_failback: off
	Attempting to enter standby mode
	2009/07/02_10:55:16 Going standby [all].
	[root at server_two ~]#                                   [  OK  ]
						
	Jul  2 10:55:16 server_two heartbeat: [19332]: info: server_two wants
to go standby [all]
	Jul  2 10:55:28 server_two heartbeat: [19332]: WARN: No reply to
standby request.  Standby request cancelled.


Here is the log from server_one during a restart of drbd and heartbeat:

	[root at server_two ~]# service drbd restart ; sleep 10 ; service
heartbeat restart

	Jul  1 17:31:23 server_one kernel: drbd0: State change failed: Device
is held open by someone
	Jul  1 17:31:23 server_one kernel: drbd0:   state = { cs:WFConnection
st:Primary/Unknown ds:UpToDate/DUnknown r--- }
	Jul  1 17:31:23 server_one kernel: drbd0:  wanted = { cs:WFConnection
st:Secondary/Unknown ds:UpToDate/DUnknown r--- }
	Jul  1 17:31:33 server_one heartbeat: [14355]: info: Heartbeat
shutdown in progress. (14355)
	Jul  1 17:31:33 server_one heartbeat: [19482]: info: Giving up all HA
resources.
	Jul  1 17:31:33 server_one ResourceManager[19495]: info: Releasing
resource group: server_one IPaddr::www.xxx.yyy.150/26/eth0
drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/home::ext3 sendmail
	Jul  1 17:31:33 server_one ResourceManager[19495]: info: Running
/etc/init.d/sendmail  stop
	Jul  1 17:31:33 server_one ResourceManager[19495]: info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home ext3 stop
	Jul  1 17:31:33 server_one Filesystem[19575]: INFO: Running stop for
/dev/drbd0 on /home
	Jul  1 17:31:33 server_one Filesystem[19575]: INFO: Trying to unmount /home
	Jul  1 17:31:33 server_one Filesystem[19575]: INFO: unmounted /home
successfully
	Jul  1 17:31:33 server_one Filesystem[19564]: INFO:  Success
	Jul  1 17:31:33 server_one ResourceManager[19495]: info: Running
/etc/ha.d/resource.d/drbddisk drbd-resource-0 stop
	Jul  1 17:31:33 server_one kernel: drbd0: role( Primary -> Secondary )
	Jul  1 17:31:33 server_one kernel: drbd0: Writing meta data super block now.
	Jul  1 17:31:33 server_one ResourceManager[19495]: info: Running
/etc/ha.d/resource.d/IPaddr www.xxx.yyy.150/26/eth0 stop
	Jul  1 17:31:34 server_one IPaddr[19727]: INFO: ifconfig eth0:0 down
	Jul  1 17:31:34 server_one avahi-daemon[7115]: Withdrawing address
record for www.xxx.yyy.150 on eth0.
	Jul  1 17:31:34 server_one IPaddr[19698]: INFO:  Success
	Jul  1 17:31:34 server_one heartbeat: [19482]: info: All HA resources
relinquished.
	Jul  1 17:31:35 server_one heartbeat: [14355]: info: killing
/usr/lib/heartbeat/ipfail process group 14366 with signal 15
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: killing HBFIFO
process 14357 with signal 15
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: killing HBWRITE
process 14358 with signal 15
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: killing HBREAD
process 14359 with signal 15
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: killing HBWRITE
process 14360 with signal 15
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: killing HBREAD
process 14361 with signal 15
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: killing HBWRITE
process 14362 with signal 15
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: killing HBREAD
process 14363 with signal 15
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: Core process
14359 exited. 7 remaining
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: Core process
14363 exited. 6 remaining
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: Core process
14361 exited. 5 remaining
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: Core process
14358 exited. 4 remaining
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: Core process
14362 exited. 3 remaining
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: Core process
14357 exited. 2 remaining
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: Core process
14360 exited. 1 remaining
	Jul  1 17:31:36 server_one heartbeat: [14355]: info: server_one
Heartbeat shutdown complete.
	Jul  1 17:32:08 server_one kernel: drbd0: conn( WFConnection -> Disconnecting )
	Jul  1 17:32:08 server_one kernel: drbd0: Discarding network configuration.
	Jul  1 17:32:08 server_one kernel: drbd0: tl_clear()
	Jul  1 17:32:08 server_one kernel: drbd0: Connection closed
	Jul  1 17:32:08 server_one kernel: drbd0: conn( Disconnecting -> StandAlone )
	Jul  1 17:32:08 server_one kernel: drbd0: receiver terminated
	Jul  1 17:32:08 server_one kernel: drbd0: Terminating receiver thread
	Jul  1 17:32:08 server_one kernel: drbd0: disk( UpToDate -> Diskless )
	Jul  1 17:32:08 server_one kernel: drbd0: drbd_bm_resize called with
capacity == 0
	Jul  1 17:32:08 server_one kernel: drbd0: worker terminated
	Jul  1 17:32:08 server_one kernel: drbd0: Terminating worker thread
	Jul  1 17:32:08 server_one kernel: drbd: module cleanup done.
	Jul  1 17:32:08 server_one kernel: drbd: initialised. Version: 8.2.6
(api:88/proto:86-88)
	Jul  1 17:32:08 server_one kernel: drbd: GIT-hash:
3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
buildsvn at c5-i386-build, 2008-10-03 11:42:32
	Jul  1 17:32:08 server_one kernel: drbd: registered as block device major 147
	Jul  1 17:32:08 server_one kernel: drbd: minor_table @ 0xf7467240
	Jul  1 17:32:08 server_one kernel: drbd0: disk( Diskless -> Attaching )
	Jul  1 17:32:08 server_one kernel: drbd0: Starting worker thread
(from cqueue/1 [145])
	Jul  1 17:32:08 server_one kernel: klogd 1.4.1, ---------- state
change ----------
	Jul  1 17:32:08 server_one kernel: drbd0: Found 6 transactions (34
active extents) in activity log.
	Jul  1 17:32:08 server_one kernel: drbd0: max_segment_size ( = BIO
size ) = 32768
	Jul  1 17:32:08 server_one kernel: drbd0: drbd_bm_resize called with
capacity == 832761342
	Jul  1 17:32:08 server_one kernel: drbd0: resync bitmap:
bits=104095168 words=3252974
	Jul  1 17:32:08 server_one kernel: drbd0: size = 397 GB (416380671 KB)
	Jul  1 17:32:08 server_one kernel: drbd0: reading of bitmap took 21 jiffies
	Jul  1 17:32:08 server_one kernel: drbd0: recounting of set bits took
additional 14 jiffies
	Jul  1 17:32:08 server_one kernel: drbd0: 397 GB (104095168 bits)
marked out-of-sync by on disk bit-map.
	Jul  1 17:32:08 server_one kernel: drbd0: disk( Attaching -> UpToDate )
	Jul  1 17:32:08 server_one kernel: drbd0: Writing meta data super block now.
	Jul  1 17:32:08 server_one kernel: drbd0: conn( StandAlone -> Unconnected )
	Jul  1 17:32:08 server_one kernel: drbd0: Starting receiver thread
(from drbd0_worker [19793])
	Jul  1 17:32:08 server_one kernel: drbd0: receiver (re)started
	Jul  1 17:32:08 server_one kernel: drbd0: conn( Unconnected -> WFConnection )
	Jul  1 17:36:20 server_one kernel: device eth1 entered promiscuous mode
	Jul  1 17:37:01 server_one heartbeat: [20041]: info: Version 2 support: false
	Jul  1 17:37:01 server_one heartbeat: [20041]: WARN: Logging daemon
is disabled --enabling logging daemon is recommended
	Jul  1 17:37:01 server_one heartbeat: [20041]: info: **************************
	Jul  1 17:37:01 server_one heartbeat: [20041]: info: Configuration
validated. Starting heartbeat 2.1.3
	Jul  1 17:37:01 server_one heartbeat: [20042]: info: heartbeat: version 2.1.3
	Jul  1 17:37:01 server_one heartbeat: [20042]: info: Heartbeat
generation: 1245962690
	Jul  1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP
Broadcast heartbeat started on port 694 (694) interface eth1
	Jul  1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP
Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
	Jul  1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP
Broadcast heartbeat started on port 694 (694) interface eth3
	Jul  1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP
Broadcast heartbeat closed on port 694 interface eth3 - Status: 1
	Jul  1 17:37:01 server_one heartbeat: [20042]: info: glib: ping
heartbeat started.
	Jul  1 17:37:01 server_one heartbeat: [20042]: info:
G_main_add_TriggerHandler: Added signal manual handler
	Jul  1 17:37:01 server_one heartbeat: [20042]: info:
G_main_add_TriggerHandler: Added signal manual handler
	Jul  1 17:37:01 server_one heartbeat: [20042]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
	Jul  1 17:37:01 server_one heartbeat: [20042]: info: Local status now
set to: 'up'
	Jul  1 17:37:02 server_one heartbeat: [20042]: info: Link
www.xxx.yyy.129:www.xxx.yyy.129 up.
	Jul  1 17:37:02 server_one heartbeat: [20042]: info: Status update
for node www.xxx.yyy.129: status ping
	Jul  1 17:37:15 server_one kernel: device eth1 left promiscuous mode
	Jul  1 17:37:32 server_one heartbeat: [20042]: WARN: node server_two: is dead
	Jul  1 17:37:32 server_one heartbeat: [20042]: info: Comm_now_up():
updating status to active
	Jul  1 17:37:32 server_one heartbeat: [20042]: info: Local status now
set to: 'active'
	Jul  1 17:37:32 server_one heartbeat: [20042]: info: Starting child
client "/usr/lib/heartbeat/ipfail" (498,496)
	Jul  1 17:37:32 server_one heartbeat: [20042]: WARN: No STONITH
device configured.
	Jul  1 17:37:32 server_one heartbeat: [20042]: WARN: Shared disks are
not protected.
	Jul  1 17:37:32 server_one heartbeat: [20042]: info: Resources being
acquired from server_two.
	Jul  1 17:37:32 server_one heartbeat: [20055]: info: Starting
"/usr/lib/heartbeat/ipfail" as uid 498  gid 496 (pid 20055)
	Jul  1 17:37:32 server_one harc[20056]: info: Running
/etc/ha.d/rc.d/status status
	Jul  1 17:37:32 server_one mach_down[20086]: info:
/usr/share/heartbeat/mach_down: nice_failback: foreign resources
acquired
	Jul  1 17:37:32 server_one mach_down[20086]: info: mach_down takeover
complete for node server_two.
	Jul  1 17:37:32 server_one heartbeat: [20042]: info: Initial resource
acquisition complete (T_RESOURCES(us))
	Jul  1 17:37:32 server_one heartbeat: [20042]: info: mach_down
takeover complete.
	Jul  1 17:37:32 server_one IPaddr[20129]: INFO:  Resource is stopped
	Jul  1 17:37:32 server_one heartbeat: [20057]: info: Local Resource
acquisition completed.
	Jul  1 17:37:32 server_one harc[20193]: info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
	Jul  1 17:37:32 server_one ip-request-resp[20193]: received
ip-request-resp IPaddr::www.xxx.yyy.150/26/eth0 OK yes
	Jul  1 17:37:32 server_one ResourceManager[20214]: info: Acquiring
resource group: server_one IPaddr::www.xxx.yyy.150/26/eth0
drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/home::ext3 sendmail
	Jul  1 17:37:32 server_one IPaddr[20241]: INFO:  Resource is stopped
	Jul  1 17:37:32 server_one ResourceManager[20214]: info: Running
/etc/ha.d/resource.d/IPaddr www.xxx.yyy.150/26/eth0 start
	Jul  1 17:37:32 server_one IPaddr[20339]: INFO: Using calculated
netmask for www.xxx.yyy.150: 255.255.255.192
	Jul  1 17:37:32 server_one IPaddr[20339]: INFO: eval ifconfig eth0:0
www.xxx.yyy.150 netmask 255.255.255.192 broadcast www.xxx.yyy.191
	Jul  1 17:37:32 server_one avahi-daemon[7115]: Registering new
address record for www.xxx.yyy.150 on eth0.
	Jul  1 17:37:32 server_one avahi-daemon[7115]: Withdrawing address
record for www.xxx.yyy.150 on eth0.
	Jul  1 17:37:32 server_one avahi-daemon[7115]: Registering new
address record for www.xxx.yyy.150 on eth0.
	Jul  1 17:37:32 server_one IPaddr[20310]: INFO:  Success
	Jul  1 17:37:32 server_one ResourceManager[20214]: info: Running
/etc/ha.d/resource.d/drbddisk drbd-resource-0 start
	Jul  1 17:37:32 server_one kernel: drbd0: role( Secondary -> Primary )
	Jul  1 17:37:32 server_one kernel: drbd0: Writing meta data super block now.
	Jul  1 17:37:32 server_one Filesystem[20487]: INFO:  Resource is stopped
	Jul  1 17:37:32 server_one ResourceManager[20214]: info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home ext3 start
	Jul  1 17:37:32 server_one Filesystem[20568]: INFO: Running start for
/dev/drbd0 on /home
	Jul  1 17:37:32 server_one kernel: kjournald starting.  Commit
interval 5 seconds
	Jul  1 17:37:32 server_one kernel: EXT3 FS on drbd0, internal journal
	Jul  1 17:37:32 server_one kernel: EXT3-fs: mounted filesystem with
ordered data mode.
	Jul  1 17:37:32 server_one Filesystem[20557]: INFO:  Success
	Jul  1 17:37:32 server_one ResourceManager[20214]: info: Running
/etc/init.d/sendmail  start
	Jul  1 17:37:38 server_one heartbeat: [20042]: WARN: standby message
[me] from server_one ignored.  local resources in flux.
	Jul  1 17:37:42 server_one heartbeat: [20042]: info: Local Resource
acquisition completed. (none)
	Jul  1 17:37:42 server_one heartbeat: [20042]: info: local resource
transition completed.



There is another cluster that was built about a year ago running the
same versions of heartbeat and drbd.  I've configured this system in
the same way except for IP address changes, so it should work.

What am I overlooking here?  What else can I try to troubleshoot this
and get the failover resolved?

Thanks,

=======Keith



More information about the Linux-HA mailing list