[Linux-HA] Re: heartbeat shuts down all VM machines
rupert
rupertt at gmail.com
Mon Mar 3 09:40:21 MST 2008
On Fri, Feb 29, 2008 at 5:19 PM, rupert <rupertt at gmail.com> wrote:
> I googled the ucast errors, but not much information turned up.
>
> What can be the cause of this? I rebooted and/or restarted the
> machines, but on both machines the log always fills with the following:
>
>
> Feb 29 16:17:15 xen-B1 heartbeat: [2974]: ERROR: write failure on ucast eth0.: No such device
> Feb 29 16:17:17 xen-B1 heartbeat: [2974]: ERROR: glib: Unable to send [-1] ucast packet: No such device
> Feb 29 16:17:17 xen-B1 heartbeat: [2974]: ERROR: write failure on ucast eth0.: No such device
> Feb 29 16:17:19 xen-B1 heartbeat: [2974]: ERROR: glib: Unable to send [-1] ucast packet: No such device
> Feb 29 16:17:19 xen-B1 heartbeat: [2974]: ERROR: write failure on ucast eth0.: No such device
> --
> Feb 29 16:18:39 xen-A1 heartbeat: [2936]: ERROR: glib: Unable to send [-1] ucast packet: No such device
> Feb 29 16:18:39 xen-A1 heartbeat: [2936]: ERROR: write failure on ucast eth0.: No such device
>
> --are these related?
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:149) Waiting for 2050.
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:476) hotplugStatusCallback /local/domain/0/backend/vbd/2/2050/hotplug-status.
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:490) hotplugStatusCallback 1.
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices irq.
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices vkbd.
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices vfb.
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices pci.
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices ioports.
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices tap.
> [2008-02-29 09:31:26 xend 3575] DEBUG (DevController:143) Waiting for devices vtpm.
>
>
> On both machines a couple of network services run fine
> through eth0,
> so the device is up. Could this be because Xen created some iptables rules?
>
>
iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere            udp dpt:bootps
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:bootps

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             192.168.122.0/24    state RELATED,ESTABLISHED
ACCEPT     all  --  192.168.122.0/24     anywhere
ACCEPT     all  --  anywhere             anywhere
REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
REJECT     all  --  anywhere             anywhere            reject-with icmp-port-unreachable
ACCEPT     all  --  mx2.mailcluster.solvians.com  anywhere   PHYSDEV match --physdev-in vif2.1
ACCEPT     udp  --  anywhere             anywhere            PHYSDEV match --physdev-in vif2.1 udp spt:bootpc dpt:bootps
ACCEPT     all  --  mx2.fra1.mailcluster anywhere            PHYSDEV match --physdev-in vif2.0
ACCEPT     udp  --  anywhere             anywhere            PHYSDEV match --physdev-in vif2.0 udp spt:bootpc dpt:bootps
ACCEPT     all  --  anywhere             anywhere            PHYSDEV match --physdev-in vif3.0

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
Why does Xen create rules for the 192.168.122.0/24 network? I have never used it here!
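
For the "No such device" errors quoted above, one quick check is whether the kernel still has an interface named eth0 at the moment heartbeat tries to send. Below is a minimal Python sketch, assuming Python 3 on Linux. The helper name interface_exists is made up for illustration; socket.if_nameindex() is the standard-library call that lists the interfaces currently present.

```python
import socket


def interface_exists(name, names=None):
    """Return True if an interface called `name` is currently present.

    `names` can be passed in explicitly (useful for testing); by default
    the list is read from the kernel via socket.if_nameindex().
    """
    if names is None:
        names = [ifname for _index, ifname in socket.if_nameindex()]
    return name in names


if __name__ == "__main__":
    # eth0 is the device named in the ucast directive in ha.cf
    for dev in ("eth0", "lo"):
        print(dev, "present" if interface_exists(dev) else "missing")
```

If eth0 is reported missing while heartbeat is running, the ucast socket is bound to a device name that no longer exists, which would be consistent with the error text. That could happen, for example, if something renamed or recreated the interface after heartbeat started.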
> thx for your help
>
> Heiko
>
>
>
> On Fri, Feb 29, 2008 at 9:33 AM, rupert <rupertt at gmail.com> wrote:
> > It works much better now. Both systems rebooted (I don't know why),
> > and now both VMs are running on the first server. How can I get the
> > second server to take back the second VM?
> >
> >
> >
> > On Thu, Feb 28, 2008 at 1:19 PM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> > > Hi,
> > >
> > >
> > >
> > > On Thu, Feb 28, 2008 at 12:11:31PM +0100, rupert wrote:
> > > > Hmm, I just restarted the second server to check whether heartbeat moves the VM
> > > > to server1.
> > > > I couldn't find any information about that in the log files on the first
> > > > server, such as a message about taking over backend-B1,
> > > > and one VM did not start. But some time after the reboot of server2,
> > > > it correctly started backend-B1.
> > > >
> > > > heartbeat[4959]: 2008/02/28_10:36:19 WARN: Logging daemon is disabled --enabling logging daemon is recommended
> > > > heartbeat[4959]: 2008/02/28_10:36:19 info: **************************
> > > > heartbeat[4959]: 2008/02/28_10:36:19 info: Configuration validated. Starting heartbeat 2.1.2
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: heartbeat: version 2.1.2
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: Heartbeat generation: 1202824451
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: G_main_add_TriggerHandler: Added signal manual handler
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: G_main_add_TriggerHandler: Added signal manual handler
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: bound send socket to device: eth0
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: bound receive socket to device: eth0
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: glib: ucast: started on port 694 interface eth0 to 172.20.2.1
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > > heartbeat[4960]: 2008/02/28_10:36:19 info: Local status now set to: 'up'
> > > > heartbeat[4960]: 2008/02/28_10:38:20 WARN: node xen-a1.fra1.mailcluster: is dead
> > > > heartbeat[4960]: 2008/02/28_10:38:20 info: Comm_now_up(): updating status to active
> > > > heartbeat[4960]: 2008/02/28_10:38:20 info: Local status now set to: 'active'
> > > > heartbeat[4960]: 2008/02/28_10:38:20 WARN: No STONITH device configured.
> > > > heartbeat[4960]: 2008/02/28_10:38:20 WARN: Shared disks are not protected.
> > > > heartbeat[4960]: 2008/02/28_10:38:20 info: Resources being acquired from xen-a1.fra1.mailcluster.
> > > > harc[4989]: 2008/02/28_10:38:20 info: Running /etc/ha.d/rc.d/status status
> > > > heartbeat[4990]: 2008/02/28_10:38:20 info: Local Resource acquisition completed.
> > > > mach_down[5019]: 2008/02/28_10:38:20 info: Taking over resource group drbddisk::drbd_backend
> > > > ResourceManager[5073]: 2008/02/28_10:38:20 info: Acquiring resource group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
> > > > ResourceManager[5073]: 2008/02/28_10:38:20 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend start
> > > > heartbeat[4960]: 2008/02/28_10:38:30 info: Local Resource acquisition completed. (none)
> > > > heartbeat[4960]: 2008/02/28_10:38:30 info: local resource transition completed.
> > > > ResourceManager[5073]: 2008/02/28_10:38:32 ERROR: Return code 1 from /etc/ha.d/resource.d/drbddisk
> > > > ResourceManager[5073]: 2008/02/28_10:38:32 CRIT: Giving up resources due to failure of drbddisk::drbd_backend
> > >
> > > You have to find out why drbddisk is failing.
> > >
> > >
> > >
> > > > ResourceManager[5073]: 2008/02/28_10:38:32 info: Releasing resource group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
> > > > ResourceManager[5073]: 2008/02/28_10:38:32 info: Running /etc/ha.d/resource.d/xen backend-A1 stop
> > > > ResourceManager[5073]: 2008/02/28_10:38:33 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend stop
> > > > mach_down[5019]: 2008/02/28_10:38:33 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
> > > > mach_down[5019]: 2008/02/28_10:38:33 info: mach_down takeover complete for node xen-a1.fra1.mailcluster.
> > > > heartbeat[4960]: 2008/02/28_10:38:33 info: mach_down takeover complete.
> > > > heartbeat[4960]: 2008/02/28_10:38:33 info: Initial resource acquisition complete (mach_down)
> > > > harc[5232]: 2008/02/28_10:38:33 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> > > > ip-request-resp[5232]: 2008/02/28_10:38:33 received ip-request-resp drbddisk::drbd_backend_2 OK yes
> > > > ResourceManager[5253]: 2008/02/28_10:38:33 info: Acquiring resource group: xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
> > > > ResourceManager[5253]: 2008/02/28_10:38:33 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend_2 start
> > > > ResourceManager[5253]: 2008/02/28_10:38:33 info: Running /etc/ha.d/resource.d/xen backend-B1 start
> > > > hb_standby[5588]: 2008/02/28_10:39:03 Going standby [foreign].
> > > > heartbeat[4960]: 2008/02/28_10:39:03 info: xen-b1.fra1.mailcluster wants to go standby [foreign]
> > > > heartbeat[4960]: 2008/02/28_10:39:13 WARN: No reply to standby request. Standby request cancelled
> > > >
> > > > But after a reboot a few minutes earlier, the log file was flooded with
> > > > this message:
> > > >
> > > > heartbeat[2966]: 2008/02/28_10:15:34 ERROR: glib: Unable to send [-1] ucast packet: No such device
> > > > heartbeat[2966]: 2008/02/28_10:15:34 ERROR: write failure on ucast eth0.: No such device
> > > > heartbeat[2966]: 2008/02/28_10:15:34 ERROR: glib: Unable to send [-1] ucast packet: No such device
> > > > heartbeat[2966]: 2008/02/28_10:15:34 ERROR: write failure on ucast eth0.: No such device
> > >
> > > Well, looks like eth0 doesn't exist.
> > >
> > >
> > > > I stopped iptables, but the error didn't go away until another reboot.
> > > > What is the reason for this
> > > > error?
> > > >
> > > > Should both nodes have a "ucast eth0 172.20.2.1" entry in ha.cf?
> > >
> > > No. It should be ucast eth0 node2-ipaddress on node1 and vice
> > > versa on node2. To simplify management, you can put both ucast
> > > directives on both nodes. I believe that this is well documented
> > > in ha.cf.
> > >
> > > Thanks,
> > >
> > > Dejan
> > >
> > >
> > >
> > > > thx
> > > >
> > > > On Thu, Feb 28, 2008 at 11:18 AM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> > > > > Hi,
> > > > >
> > > > >
> > > > > On Thu, Feb 28, 2008 at 08:36:33AM +0100, rupert wrote:
> > > > > > Does no one have any ideas on this matter?
> > > > >
> > > > > This is a drbd related issue. You should be better off in a drbd
> > > > > forum.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Dejan
> > > > >
> > > > >
> > > > >
> > > > > > thx
> > > > > >
> > > > > > On Tue, Feb 26, 2008 at 12:10 PM, rupert <rupertt at gmail.com> wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > I set up a cluster with 2 drbd devices and 2 VMs on each server.
> > > > > > > When one server goes down, the other should take over the part of the failed one.
> > > > > > > The drbd replication goes like this:
> > > > > > > a -> a
> > > > > > > b <- b
> > > > > > >
> > > > > > > The other machines are not on drbd devices, just some loopback VMs which
> > > > > > > carry no data.
> > > > > > > Can they be in the config for heartbeat?
> > > > > > >
> > > > > > > in my haresources I have the following entries on both servers
> > > > > > >
> > > > > > > xen-A1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1 xen::MX1-A1
> > > > > > > xen-B1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1 xen::MX2-B1
> > > > > > >
> > > > > > > in ha.cf on the first server I set ucast to
> > > > > > > ucast eth0 172.20.1.1
> > > > > > > and
> > > > > > > ucast eth0 172.20.2.1
> > > > > > > on the second server
> > > > > > >
> > > > > > > When I restart the HA daemon, it powers down all the VMs and makes
> > > > > > > all the drbd devices primary on the first server,
> > > > > > > but they should be on the first server
> > > > > > >
> > > > > > > GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by
> > > > > > > buildsvn at c5-x8664-build, 2008-02-13 19:17:43
> > > > > > > 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
> > > > > > > ns:135995280 nr:0 dw:779680 dr:135790386 al:224 bm:8602 lo:0 pe:0 ua:0 ap:0
> > > > > > > resync: used:0/31 hits:8442668 misses:8308 starving:0 dirty:0
> > > > > > > changed:8308
> > > > > > > act_log: used:0/257 hits:136296 misses:224 starving:0 dirty:0
> > > > > > > changed:224
> > > > > > > 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
> > > > > > > ns:0 nr:663968 dw:663968 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
> > > > > > > resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
> > > > > > >
> > > > > > >
> > > > > > > On my first start, heartbeat told me that the drbddisk is active and
> > > > > > > shouldn't be,
> > > > > > > but it is the main drbddisk on each server; the other is
> > > > > > > the backup
> > > > > > > for failovers.
> > > > > > >
> > > > > > > Resource drbddisk::drbd_backend_2 is active, and should not be!
> > > > > > > 2008/02/26_07:42:58 CRITICAL: Non-idle resources can affect data integrity!
> > > > > > > 2008/02/26_07:42:58 info: If you don't know what this means, then get help!
> > > > > > > 2008/02/26_07:42:58 info: Read the docs and/or source to /usr/share/heartbeat/ResourceManager for more details.
> > > > > > > CRITICAL: Resource drbddisk::drbd_backend_2 is active, and should not be!
> > > > > > > CRITICAL: Non-idle resources can affect data integrity!
> > > > > > > info: If you don't know what this means, then get help!
> > > > > > > info: Read the docs and/or the source to /usr/share/heartbeat/ResourceManager for more details.
> > > > > > > 2008/02/26_07:42:58 CRITICAL: Non-idle resources will affect resource takeback!
> > > > > > > 2008/02/26_07:42:58 CRITICAL: Non-idle resources may affect data integrity!
> > > > > > >
> > > > > > >
> > > > > > > thx for your help
> > > > > > >
> > > > > > _______________________________________________
> > > > > > Linux-HA mailing list
> > > > > > Linux-HA at lists.linux-ha.org
> > > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > > > See also: http://linux-ha.org/ReportingProblems
> > > > >
> > > > > --
> > > > > Dejan
> > >
> >
>
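
Dejan's ucast advice quoted above (each node points at the other node's address, or both directives are placed on both nodes) can be sketched in Python. This is only an illustration: the function name ucast_lines is made up, the two addresses are the ones from the original post, and the thread does not say which node owns which address, so the "local" address in the example is an assumption.

```python
def ucast_lines(device, addresses, local_address=None):
    """Build ha.cf ucast directives as a list of strings.

    With local_address=None every address is listed, so the same lines
    can go into ha.cf on both nodes. With local_address set, only the
    peer addresses are listed.
    """
    peers = [addr for addr in addresses if addr != local_address]
    return ["ucast %s %s" % (device, addr) for addr in peers]


if __name__ == "__main__":
    addrs = ["172.20.1.1", "172.20.2.1"]
    # Same file on both nodes:
    print("\n".join(ucast_lines("eth0", addrs)))
    # Per-node file, assuming (hypothetically) this node is 172.20.1.1:
    print("\n".join(ucast_lines("eth0", addrs, local_address="172.20.1.1")))
```

The first form trades a redundant line for a configuration that is identical on both nodes, which is the management simplification Dejan mentions.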