[Linux-HA] udev devices disappear from time to time and random crashes

Dejan Muhamedagic dejanmm at fastmail.fm
Wed Jul 1 01:57:49 MDT 2009


Hi,

On Tue, Jun 23, 2009 at 10:49:22AM +0200, Heiko wrote:
> Hello,
> 
> I still have the problem that my ha/drbd/xen server pairs suddenly reboot
> from time to time.
> I wasn't able to find any log entries immediately before these reboots, but
> the ha-log has some lines
> that I would like you to explain to me.
> 
> First, these appear often:
> 
> Jun 22 18:52:38 xen-B1 heartbeat: [3059]: ERROR: glib: Unable to send [-1]
> ucast packet: No such device
> Jun 22 18:52:38 xen-B1 heartbeat: [3059]: ERROR: write_child: write failure
> on ucast eth0.: No such device

The network interface is down.
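
"No such device" suggests that eth0 was not there at all when heartbeat
tried to send. A minimal sketch of how one might check that from a script,
assuming a Linux sysfs mounted at /sys (the function name and the
sysfs_root parameter are made up for this example):

```python
import os


def interface_state(name, sysfs_root="/sys/class/net"):
    """Return 'missing' if the interface does not exist, else its operstate.

    sysfs_root is a parameter only so the function can be tried against a
    test directory; on a real system the default is what you want.
    """
    path = os.path.join(sysfs_root, name)
    if not os.path.isdir(path):
        # No directory for this name: the kernel has no such interface.
        return "missing"
    try:
        with open(os.path.join(path, "operstate")) as f:
            return f.read().strip()
    except OSError:
        # The interface exists but its state could not be read.
        return "unknown"


if __name__ == "__main__":
    print("eth0:", interface_state("eth0"))
```

Running something like this from cron and logging the result may show
whether eth0 really goes away around the time of the errors.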

> After one reboot the VM was still on the other machine, and I waited 20
> minutes; then it moved back to its original host. Is this behaviour expected,
> or is there also an error (it looks like one because of the first 2 lines)?

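Depends on the auto_failback setting in ha.cf. With auto_failback on,
resources move back to their preferred node once it returns; with off,
they stay where they are.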
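
To see quickly which value a node is configured with, something along
these lines could be used. It is only a sketch: the function name is made
up, the sample fragment is hypothetical, and reporting the last occurrence
of the directive is an assumption of this example, not a statement about
how heartbeat itself treats duplicates.

```python
def auto_failback_setting(ha_cf_text):
    """Return the last auto_failback value found in ha.cf text, or None."""
    value = None
    for raw in ha_cf_text.splitlines():
        # Drop anything after '#', which starts a comment in ha.cf.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "auto_failback":
            value = parts[1].lower()
    return value


if __name__ == "__main__":
    # Hypothetical ha.cf fragment, not taken from the poster's system.
    sample = """
    ucast eth0 172.20.1.1
    auto_failback on
    """
    print(auto_failback_setting(sample))
```

On a real node you would pass in the contents of /etc/ha.d/ha.cf instead
of the sample text.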

> heartbeat[3042]: 2009/06/23_08:32:36 ERROR: write_child: Exiting due to
> persistent errors: No such device
> heartbeat[3044]: 2009/06/23_08:32:36 ERROR: write_child: Exiting due to
> persistent errors: No such device
> heartbeat[3005]: 2009/06/23_08:32:36 WARN: Managed HBWRITE process 3042
> exited with return code 1.
> heartbeat[3005]: 2009/06/23_08:32:36 ERROR: HBWRITE process died.  Beginning
> communications restart process for comm channel 0.
> heartbeat[3005]: 2009/06/23_08:32:36 WARN: Managed HBWRITE process 3044
> exited with return code 1.
> heartbeat[3005]: 2009/06/23_08:32:36 ERROR: HBWRITE process died.  Beginning
> communications restart process for comm channel 1.
> heartbeat[3005]: 2009/06/23_08:32:36 WARN: Managed HBREAD process 3043
> killed by signal 9 [SIGKILL - Kill, unblockable].
> heartbeat[3005]: 2009/06/23_08:32:36 ERROR: Both comm processes for channel
> 0 have died.  Restarting.

You should check what's happening to the network interfaces.
If you're using DHCP, the leases may be lost for some time,
so it is better to use static IP addresses.
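
To line the failures up with whatever is happening to the interfaces, it
may help to pull the ERROR and CRIT entries with their timestamps out of
the ha-log and compare them with syslog. A rough sketch, assuming the log
lines look like the ones quoted in this thread (one entry per line, not
wrapped as in this mail); the regular expression and function name are
made up for this example:

```python
import re
from datetime import datetime

# Matches entries such as:
#   heartbeat[3042]: 2009/06/23_08:32:36 ERROR: write_child: ...
LINE_RE = re.compile(
    r"^(?P<proc>\S+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Za-z]+):\s+(?P<msg>.*)$"
)


def serious_entries(lines, levels=("ERROR", "CRIT")):
    """Return (timestamp, process, level, message) for the given levels."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") in levels:
            ts = datetime.strptime(m.group("ts"), "%Y/%m/%d_%H:%M:%S")
            found.append((ts, m.group("proc"), m.group("level"), m.group("msg")))
    return found


if __name__ == "__main__":
    sample = [
        "heartbeat[3042]: 2009/06/23_08:32:36 ERROR: write_child: "
        "Exiting due to persistent errors: No such device",
        "heartbeat[3005]: 2009/06/23_08:32:38 info: Link "
        "xen-a1.fra1.mailcluster:eth0 up.",
    ]
    for ts, proc, level, msg in serious_entries(sample):
        print(ts.isoformat(), proc, level, msg)
```

Lines that do not fit the pattern (for example entries without a severity
level) are simply skipped.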

Thanks,

Dejan

> heartbeat[3005]: 2009/06/23_08:32:38 CRIT: Cluster node
> xen-a1.fra1.mailcluster returning after partition.
> heartbeat[3005]: 2009/06/23_08:32:38 info: For information on cluster
> partitions, See URL: http://linux-ha.org/SplitBrain
> heartbeat[3005]: 2009/06/23_08:32:38 WARN: Deadtime value may be too small.
> heartbeat[3005]: 2009/06/23_08:32:38 info: See FAQ for information on tuning
> deadtime.
> heartbeat[3005]: 2009/06/23_08:32:38 info: URL:
> http://linux-ha.org/FAQ#heavy_load
> heartbeat[3005]: 2009/06/23_08:32:38 info: Link xen-a1.fra1.mailcluster:eth0
> up.
> heartbeat[3005]: 2009/06/23_08:32:38 WARN: Late heartbeat: Node
> xen-a1.fra1.mailcluster: interval 1810310 ms
> heartbeat[3005]: 2009/06/23_08:32:38 info: Status update for node
> xen-a1.fra1.mailcluster: status active
> harc[12380]:    2009/06/23_08:32:38 info: Running /etc/ha.d/rc.d/status
> status
> heartbeat[3005]: 2009/06/23_08:32:40 info: all clients are now paused
> heartbeat[3005]: 2009/06/23_08:32:40 info: Heartbeat shutdown in progress.
> (3005)
> heartbeat[12396]: 2009/06/23_08:32:40 info: Giving up all HA resources.
> ResourceManager[12409]: 2009/06/23_08:32:40 info: Releasing resource group:
> xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
> ResourceManager[12409]: 2009/06/23_08:32:40 info: Running
> /etc/ha.d/resource.d/xen backend-B1 stop
> ResourceManager[12409]: 2009/06/23_08:32:40 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend_2 stop
> ResourceManager[12480]: 2009/06/23_08:32:40 info: Releasing resource group:
> xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
> ResourceManager[12480]: 2009/06/23_08:32:40 info: Running
> /etc/ha.d/resource.d/xen backend-A1 stop
> ResourceManager[12480]: 2009/06/23_08:32:40 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend stop
> heartbeat[12396]: 2009/06/23_08:32:40 info: All HA resources relinquished.
> heartbeat[3005]: 2009/06/23_08:32:40 info: all clients are now resumed
> heartbeat[3005]: 2009/06/23_08:32:42 info: killing HBWRITE process 12323
> with signal 15
> heartbeat[3005]: 2009/06/23_08:32:42 info: killing HBREAD process 12324 with
> signal 15
> heartbeat[3005]: 2009/06/23_08:32:42 info: killing HBFIFO process 3041 with
> signal 15
> heartbeat[3005]: 2009/06/23_08:32:42 info: killing HBWRITE process 12325
> with signal 15
> heartbeat[3005]: 2009/06/23_08:32:42 info: killing HBREAD process 12326 with
> signal 15
> heartbeat[3005]: 2009/06/23_08:32:42 info: Core process 12324 exited. 5
> remaining
> heartbeat[3005]: 2009/06/23_08:32:42 info: Core process 12325 exited. 4
> remaining
> heartbeat[3005]: 2009/06/23_08:32:42 info: Core process 3041 exited. 3
> remaining
> heartbeat[3005]: 2009/06/23_08:32:42 info: Core process 12323 exited. 2
> remaining
> heartbeat[3005]: 2009/06/23_08:32:42 info: Core process 12326 exited. 1
> remaining
> heartbeat[3005]: 2009/06/23_08:32:42 info: xen-b1.fra1.mailcluster Heartbeat
> shutdown complete.
> heartbeat[3005]: 2009/06/23_08:32:42 info: Heartbeat restart triggered.
> heartbeat[3005]: 2009/06/23_08:32:42 info: Restarting heartbeat.
> heartbeat[3005]: 2009/06/23_08:32:42 info: Performing heartbeat restart
> exec.
> heartbeat[3005]: 2009/06/23_08:32:53 info: Version 2 support: false
> heartbeat[3005]: 2009/06/23_08:32:53 WARN: Logging daemon is disabled
> --enabling logging daemon is recommended
> heartbeat[3005]: 2009/06/23_08:32:53 info: **************************
> heartbeat[3005]: 2009/06/23_08:32:53 info: Configuration validated. Starting
> heartbeat 2.1.3
> heartbeat[12596]: 2009/06/23_08:32:53 info: heartbeat: version 2.1.3
> heartbeat[12596]: 2009/06/23_08:32:53 info: Heartbeat generation: 1202824479
> heartbeat[12596]: 2009/06/23_08:32:53 info: glib: ucast: write socket
> priority set to IPTOS_LOWDELAY on eth0
> heartbeat[12596]: 2009/06/23_08:32:53 info: glib: ucast: bound send socket
> to device: eth0
> heartbeat[12596]: 2009/06/23_08:32:53 info: glib: ucast: bound receive
> socket to device: eth0
> heartbeat[12596]: 2009/06/23_08:32:53 info: glib: ucast: started on port 694
> interface eth0 to 172.20.1.1
> heartbeat[12596]: 2009/06/23_08:32:53 info: glib: ucast: write socket
> priority set to IPTOS_LOWDELAY on eth0
> heartbeat[12596]: 2009/06/23_08:32:53 info: glib: ucast: bound send socket
> to device: eth0
> heartbeat[12596]: 2009/06/23_08:32:53 info: glib: ucast: bound receive
> socket to device: eth0
> heartbeat[12596]: 2009/06/23_08:32:53 info: glib: ucast: started on port 694
> interface eth0 to 172.20.2.1
> heartbeat[12596]: 2009/06/23_08:32:53 info: G_main_add_TriggerHandler: Added
> signal manual handler
> heartbeat[12596]: 2009/06/23_08:32:53 info: G_main_add_TriggerHandler: Added
> signal manual handler
> heartbeat[12596]: 2009/06/23_08:32:53 info: G_main_add_SignalHandler: Added
> signal handler for signal 17
> heartbeat[12596]: 2009/06/23_08:32:53 info: Local status now set to: 'up'
> heartbeat[12596]: 2009/06/23_08:32:54 info: Link
> xen-a1.fra1.mailcluster:eth0 up.
> heartbeat[12596]: 2009/06/23_08:32:54 info: Status update for node
> xen-a1.fra1.mailcluster: status active
> harc[12659]:    2009/06/23_08:32:54 info: Running /etc/ha.d/rc.d/status
> status
> heartbeat[12596]: 2009/06/23_08:32:55 info: Comm_now_up(): updating status
> to active
> heartbeat[12596]: 2009/06/23_08:32:55 info: Local status now set to:
> 'active'
> heartbeat[12596]: 2009/06/23_08:33:05 info: local resource transition
> completed.
> heartbeat[12596]: 2009/06/23_08:33:05 info: Initial resource acquisition
> complete (T_RESOURCES(us))
> heartbeat[12701]: 2009/06/23_08:33:06 info: Local Resource acquisition
> completed.
> harc[12743]:    2009/06/23_08:33:06 info: Running
> /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> ip-request-resp[12743]: 2009/06/23_08:33:06 received ip-request-resp
> drbddisk::drbd_backend_2 OK yes
> ResourceManager[12764]: 2009/06/23_08:33:06 info: Acquiring resource group:
> xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
> ResourceManager[12764]: 2009/06/23_08:33:06 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend_2 start
> ResourceManager[12764]: 2009/06/23_08:33:18 ERROR: Return code 1 from
> /etc/ha.d/resource.d/drbddisk
> ResourceManager[12764]: 2009/06/23_08:33:18 CRIT: Giving up resources due to
> failure of drbddisk::drbd_backend_2
> ResourceManager[12764]: 2009/06/23_08:33:18 info: Releasing resource group:
> xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
> ResourceManager[12764]: 2009/06/23_08:33:18 info: Running
> /etc/ha.d/resource.d/xen backend-B1 stop
> ResourceManager[12764]: 2009/06/23_08:33:18 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend_2 stop
> hb_standby[12992]:      2009/06/23_08:33:48 Going standby [foreign].
> heartbeat[12596]: 2009/06/23_08:33:48 WARN: standby message [me] from
> xen-b1.fra1.mailcluster ignored.  Other side is in flux.
> heartbeat[12596]: 2009/06/23_08:33:49 info: Received shutdown notice from
> 'xen-a1.fra1.mailcluster'.
> heartbeat[12596]: 2009/06/23_08:33:49 info: Resources being acquired from
> xen-a1.fra1.mailcluster.
> heartbeat[13006]: 2009/06/23_08:33:49 info: acquire local HA resources
> (standby).
> ResourceManager[13033]: 2009/06/23_08:33:49 info: Acquiring resource group:
> xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
> heartbeat[13007]: 2009/06/23_08:33:49 info: Local Resource acquisition
> completed.
> ResourceManager[13033]: 2009/06/23_08:33:49 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend_2 start
> ResourceManager[13033]: 2009/06/23_08:33:49 info: Running
> /etc/ha.d/resource.d/xen backend-B1 start
> heartbeat[13006]: 2009/06/23_08:33:51 info: local HA resource acquisition
> completed (standby).
> heartbeat[12596]: 2009/06/23_08:33:51 info: Standby resource acquisition
> done [all].
> harc[13402]:    2009/06/23_08:33:51 info: Running /etc/ha.d/rc.d/status
> status
> mach_down[13418]:       2009/06/23_08:33:52 info: Taking over resource group
> drbddisk::drbd_backend
> ResourceManager[13444]: 2009/06/23_08:33:52 info: Acquiring resource group:
> xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
> ResourceManager[13444]: 2009/06/23_08:33:52 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend start
> ResourceManager[13444]: 2009/06/23_08:33:52 info: Running
> /etc/ha.d/resource.d/xen backend-A1 start
> mach_down[13418]:       2009/06/23_08:33:54 info:
> /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
> mach_down[13418]:       2009/06/23_08:33:54 info: mach_down takeover
> complete for node xen-a1.fra1.mailcluster.
> heartbeat[12596]: 2009/06/23_08:33:54 info: mach_down takeover complete.
> harc[13721]:    2009/06/23_08:33:54 info: Running
> /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> ip-request-resp[13721]: 2009/06/23_08:33:54 received ip-request-resp
> drbddisk::drbd_backend_2 OK yes
> ResourceManager[13742]: 2009/06/23_08:33:54 info: Acquiring resource group:
> xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
> heartbeat[12596]: 2009/06/23_08:34:00 info: Link
> xen-a1.fra1.mailcluster:eth0 dead.
> heartbeat[12596]: 2009/06/23_08:34:03 info: Heartbeat restart on node
> xen-a1.fra1.mailcluster
> heartbeat[12596]: 2009/06/23_08:34:03 info: Link
> xen-a1.fra1.mailcluster:eth0 up.
> heartbeat[12596]: 2009/06/23_08:34:03 info: Status update for node
> xen-a1.fra1.mailcluster: status init
> heartbeat[12596]: 2009/06/23_08:34:03 info: Status update for node
> xen-a1.fra1.mailcluster: status up
> harc[13875]:    2009/06/23_08:34:03 info: Running /etc/ha.d/rc.d/status
> status
> harc[13891]:    2009/06/23_08:34:03 info: Running /etc/ha.d/rc.d/status
> status
> heartbeat[12596]: 2009/06/23_08:34:04 info: Status update for node
> xen-a1.fra1.mailcluster: status active
> harc[13907]:    2009/06/23_08:34:04 info: Running /etc/ha.d/rc.d/status
> status
> heartbeat[12596]: 2009/06/23_08:34:05 info: remote resource transition
> completed.
> heartbeat[12596]: 2009/06/23_08:34:05 info: xen-b1.fra1.mailcluster wants to
> go standby [foreign]
> heartbeat[12596]: 2009/06/23_08:34:05 info: standby: xen-a1.fra1.mailcluster
> can take our foreign resources
> heartbeat[13923]: 2009/06/23_08:34:05 info: give up foreign HA resources
> (standby).
> ResourceManager[13936]: 2009/06/23_08:34:06 info: Releasing resource group:
> xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
> ResourceManager[13936]: 2009/06/23_08:34:06 info: Running
> /etc/ha.d/resource.d/xen backend-A1 stop
> ResourceManager[13936]: 2009/06/23_08:34:11 info: Running
> /etc/ha.d/resource.d/drbddisk drbd_backend stop
> heartbeat[13923]: 2009/06/23_08:34:11 info: foreign HA resource release
> completed (standby).
> heartbeat[12596]: 2009/06/23_08:34:11 info: Local standby process completed
> [foreign].
> heartbeat[12596]: 2009/06/23_08:34:13 WARN: 1 lost packet(s) for
> [xen-a1.fra1.mailcluster] [15:17]
> heartbeat[12596]: 2009/06/23_08:34:13 info: remote resource transition
> completed.
> heartbeat[12596]: 2009/06/23_08:34:13 info: No pkts missing from
> xen-a1.fra1.mailcluster!
> heartbeat[12596]: 2009/06/23_08:34:13 info: Other node completed standby
> takeover of foreign resources.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> Our setup is CentOS 5 with the packages from the distribution repository.
> heartbeat is version 2.1.3
> 
> I hope you can help me.
> 
> 
> greetings


