[Linux-HA] how to do fencing with drbd+heartbeat

Guochun Shi gshi at ncsa.uiuc.edu
Mon Jul 18 14:29:15 MDT 2005


Hi,

You are using ssh stonith; however when you unplug the cable, fibonacci cannot shoot euclides because these two machines are disconnected. 

If you want to test with ssh stonith, you need a second physical cable, so that when you unplug the first one, the second keeps the two machines connected and the ssh command
can go through.
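
As a rough sketch only, the additions to /etc/ha.d/ha.cf on fibonacci could look something like the fragment below, next to the existing eth0 line. The eth1 interface and the 10.0.0.2 address are assumptions for a hypothetical crossover cable, not something taken from your setup, and the stonith_host line assumes the ssh plugin accepts a list of hosts it is allowed to reset.

```
# assumed second heartbeat path: dedicated crossover cable on eth1
# (10.0.0.2 is a hypothetical address for euclides on that link)
ucast eth1 10.0.0.2
# list both nodes so that either one can be shot
# (assumption: the ssh plugin takes a host list as its parameter)
stonith_host * ssh fibonacci euclides
```

For the ssh command to actually travel over the second cable, the peer's hostname would also have to resolve to an address reachable on that link, for example through /etc/hosts. With two heartbeat paths, unplugging a single cable should also no longer make the peer look dead in the first place.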

-Guochun



At 05:09 PM 7/18/2005 -0300, you wrote:
>Hi ppl,
>
>I installed an HA cluster using drbd+heartbeat, but I'm having trouble solving a fencing issue.
>
>1) SCENARIO DESCRIPTION
>
>I have two identical servers, named 'fibonacci' and 'euclides', where fibonacci is set to own the resources.
>For while, I want only 
>I've configured ssh with RSA key to test stonith. Also, I set PingNode to another machine in my local network.
>
>Software versions:
>kernel: Linux 2.6.12 
>drbd0.7-utils  0.7.10-3
>heartbeat      1.2.3-9 
>stonith        1.2.3-9 
>
>
>
>2) PROBLEM DESCRIPTION
>
>Everything is working correctly. Let's suppose that fibonacci is running as primary.
>When I unplug the network cable of euclides, both servers go to primary (drbd) and keep waiting for a connection.
>Since both servers are primary, a conflict occurs when the network connection returns, and drbd drops the connection due to incompatible states.
>
>I would like to know how to prevent both machines from becoming primary, or how to keep the machine that is returning to the network from rejoining in an invalid state.
>
>
>
>3) CONFIG FILES
>
>FILE /etc/drbd.conf:
>resource drbd11 {
>  protocol B;
>  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
>
>  startup {
>    wfc-timeout 60;
>    degr-wfc-timeout 60;
>  }
>
>  disk {
>    on-io-error   detach;
>  }
>
>  net {
>    timeout       60;
>    connect-int   10;
>    ping-int      10;
>    max-buffers     2048;
>    max-epoch-size  2048;
>  }
>
>  syncer {
>    rate 100M;
>    group 1;
>  }
>
>  on fibonacci {
>    device     /dev/drbd11;
>    disk       /dev/md11;
>    address    192.168.0.147:7788;
>    meta-disk  internal;
>  }
>
>  on euclides {
>    device    /dev/drbd11;
>    disk      /dev/md11;
>    address   192.168.0.148:7788;
>    meta-disk internal;
>  }
>}
>
>
>FILE /etc/ha.d/ha.cf:
>debugfile /var/log/ha-debug
>logfile /var/log/ha-log
>logfacility     local0
>keepalive 2
>deadtime 30
>udpport 694
>ucast eth0 192.168.0.147 192.168.0.148
>auto_failback on
>stonith_host * ssh euclides
>node fibonacci euclides
>ping 192.168.0.1
>respawn hacluster /usr/lib/heartbeat/ipfail
>
>
>
>FILE /etc/ha.d/haresources:
>fibonacci IPaddr::192.168.0.145/24/eth0/192.168.0.255 drbddisk::drbd11 Filesystem::/dev/drbd11::/backup::xfs
>
>
>
>4) LOG FILES
>
>/var/log/ha-log (these logs are from the secondary server, in this case fibonacci):
>** here I unplugged the secondary server.
>heartbeat: 2005/07/18_14:25:00 WARN: node 192.168.0.1: is dead
>heartbeat: 2005/07/18_14:25:00 info: Link 192.168.0.1:192.168.0.1 dead.
>heartbeat: 2005/07/18_14:25:00 info: Running /etc/ha.d/rc.d/status status
>heartbeat: 2005/07/18_14:25:02 WARN: node euclides: is dead
>heartbeat: 2005/07/18_14:25:02 info: Dead node euclides gave up resources.
>heartbeat: 2005/07/18_14:25:02 info: Resources being acquired from euclides.
>heartbeat: 2005/07/18_14:25:02 info: Link euclides:eth0 dead.
>heartbeat: 2005/07/18_14:25:02 info: Local Resource acquisition completed.
>heartbeat: 2005/07/18_14:25:02 info: Initial resource acquisition complete (T_RESOURCES(us))
>heartbeat: 2005/07/18_14:25:02 info: Local Resource acquisition completed.
>heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/rc.d/status status
>heartbeat: 2005/07/18_14:25:02 info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
>heartbeat: 2005/07/18_14:25:02 info: mach_down takeover complete.
>heartbeat: 2005/07/18_14:25:02 info: mach_down takeover complete for node euclides.
>heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
>heartbeat: 2005/07/18_14:25:02 received ip-request-resp IPaddr::192.168.0.145/24/eth0/192.168.0.255 OK yes
>heartbeat: 2005/07/18_14:25:02 info: Acquiring resource group: fibonacci IPaddr::192.168.0.145/24/eth0/192.168.0.255 drbddisk::drbd11
>heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.145/24/eth0/192.168.0.255 start
>heartbeat: 2005/07/18_14:25:02 info: /sbin/ifconfig eth0:0 192.168.0.145  netmask 255.255.255.0 broadcast 192.168.0.255
>heartbeat: 2005/07/18_14:25:02 info: Sending Gratuitous Arp for 192.168.0.145 on eth0:0 [eth0]
>heartbeat: 2005/07/18_14:25:02 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.0.145 eth0 192.168.0.145 auto 192.168.0.145 ffffffffffff
>heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/resource.d/drbddisk drbd11 start
>heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
>heartbeat: 2005/07/18_14:25:02 received ip-request-resp IPaddr::192.168.0.145/24/eth0/192.168.0.255 OK yes
>heartbeat: 2005/07/18_14:25:02 info: Acquiring resource group: fibonacci IPaddr::192.168.0.145/24/eth0/192.168.0.255 drbddisk::drbd11
>heartbeat: 2005/07/18_14:25:13 info: Local Resource acquisition completed. (none)
>heartbeat: 2005/07/18_14:25:13 info: local resource transition completed.
>heartbeat: 2005/07/18_14:25:24 info: Link 192.168.0.1:192.168.0.1 up.
>heartbeat: 2005/07/18_14:25:24 WARN: Late heartbeat: Node 192.168.0.1: interval 54030 ms
>heartbeat: 2005/07/18_14:25:24 info: Status update for node 192.168.0.1: status ping
>heartbeat: 2005/07/18_14:25:24 WARN: Cluster node euclides returning after partition.
>heartbeat: 2005/07/18_14:25:24 WARN: Deadtime value may be too small.
>heartbeat: 2005/07/18_14:25:24 info: See documentation for information on tuning deadtime.
>heartbeat: 2005/07/18_14:25:24 WARN: 26 lost packet(s) for [euclides] [114:141]
>heartbeat: 2005/07/18_14:25:24 info: Link euclides:eth0 up.
>heartbeat: 2005/07/18_14:25:24 WARN: Late heartbeat: Node euclides: interval 54020 ms
>heartbeat: 2005/07/18_14:25:24 info: Status update for node euclides: status active
>heartbeat: 2005/07/18_14:25:24 info: Running /etc/ha.d/rc.d/status status
>heartbeat: 2005/07/18_14:25:26 info: Heartbeat shutdown in progress. (1921)
>heartbeat: 2005/07/18_14:25:26 info: Giving up all HA resources.
>heartbeat: 2005/07/18_14:25:29 info: Releasing resource group: fibonacci IPaddr::192.168.0.145/24/eth0/192.168.0.255 drbddisk::drbd11
>heartbeat: 2005/07/18_14:25:29 info: Running /etc/ha.d/resource.d/drbddisk drbd11 stop
>heartbeat: 2005/07/18_14:25:30 info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.145/24/eth0/192.168.0.255 stop
>heartbeat: 2005/07/18_14:25:30 info: /sbin/route -n del -host 192.168.0.145
>heartbeat: 2005/07/18_14:25:30 info: /sbin/ifconfig eth0:0 down
>heartbeat: 2005/07/18_14:25:30 info: IP Address 192.168.0.145 released
>heartbeat: 2005/07/18_14:25:30 info: killing /usr/lib/heartbeat/ipfail process group 1931 with signal 15
>heartbeat: 2005/07/18_14:25:30 info: All HA resources relinquished.
>heartbeat: 2005/07/18_14:25:30 info: killing /usr/lib/heartbeat/ipfail process group 1931 with signal 15
>heartbeat: 2005/07/18_14:25:31 info: killing HBREAD process 1925 with signal 15
>heartbeat: 2005/07/18_14:25:31 info: killing HBWRITE process 1926 with signal 15
>heartbeat: 2005/07/18_14:25:31 info: killing HBREAD process 1927 with signal 15
>heartbeat: 2005/07/18_14:25:31 info: killing HBFIFO process 1923 with signal 15
>heartbeat: 2005/07/18_14:25:31 info: killing HBWRITE process 1924 with signal 15
>heartbeat: 2005/07/18_14:25:31 info: Core process 1923 exited. 5 remaining
>heartbeat: 2005/07/18_14:25:31 info: Core process 1924 exited. 4 remaining
>heartbeat: 2005/07/18_14:25:31 info: Core process 1925 exited. 3 remaining
>heartbeat: 2005/07/18_14:25:31 info: Core process 1926 exited. 2 remaining
>heartbeat: 2005/07/18_14:25:31 info: Core process 1927 exited. 1 remaining
>heartbeat: 2005/07/18_14:25:31 info: Heartbeat shutdown complete.
>heartbeat: 2005/07/18_14:25:31 info: Heartbeat restart triggered.
>heartbeat: 2005/07/18_14:25:31 info: Restarting heartbeat.
>heartbeat: 2005/07/18_14:25:31 info: Performing heartbeat restart exec.
>heartbeat: 2005/07/18_14:26:02 info: **************************
>heartbeat: 2005/07/18_14:26:02 info: Configuration validated. Starting heartbeat 1.2.3
>heartbeat: 2005/07/18_14:26:02 info: heartbeat: version 1.2.3
>heartbeat: 2005/07/18_14:26:02 info: Heartbeat generation: 130
>heartbeat: 2005/07/18_14:26:02 info: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
>heartbeat: 2005/07/18_14:26:02 info: ucast: bound send socket to device: eth0
>heartbeat: 2005/07/18_14:26:02 info: ucast: bound receive socket to device: eth0
>heartbeat: 2005/07/18_14:26:02 info: ucast: started on port 694 interface eth0 to 192.168.0.147
>heartbeat: 2005/07/18_14:26:02 info: ping heartbeat started.
>heartbeat: 2005/07/18_14:26:02 info: pid 3288 locked in memory.
>heartbeat: 2005/07/18_14:26:02 info: Local status now set to: 'up'
>heartbeat: 2005/07/18_14:26:03 info: pid 3290 locked in memory.
>heartbeat: 2005/07/18_14:26:03 info: pid 3291 locked in memory.
>heartbeat: 2005/07/18_14:26:03 info: pid 3292 locked in memory.
>heartbeat: 2005/07/18_14:26:03 info: Link euclides:eth0 up.
>heartbeat: 2005/07/18_14:26:03 info: Status update for node euclides: status active
>heartbeat: 2005/07/18_14:26:03 info: pid 3293 locked in memory.
>heartbeat: 2005/07/18_14:26:03 info: pid 3294 locked in memory.
>heartbeat: 2005/07/18_14:26:03 info: Link 192.168.0.1:192.168.0.1 up.
>heartbeat: 2005/07/18_14:26:03 info: Status update for node 192.168.0.1: status ping
>heartbeat: 2005/07/18_14:26:03 info: Local status now set to: 'active'
>heartbeat: 2005/07/18_14:26:03 info: Starting child client "/usr/lib/heartbeat/ipfail" (1001,104)
>heartbeat: 2005/07/18_14:26:03 info: Running /etc/ha.d/rc.d/status status
>heartbeat: 2005/07/18_14:26:03 info: Checking status of STONITH device [ssh STONITH device]
>heartbeat: 2005/07/18_14:26:03 info: Exiting STONITH-stat process 3296 returned rc 0.
>heartbeat: 2005/07/18_14:26:03 info: Starting "/usr/lib/heartbeat/ipfail" as uid 1001  gid 104 (pid 3297)
>
>
>
>/var/log/kernel.log
>Jul 18 14:24:32 fibonacci kernel: tg3: eth0: Link is down.
>Jul 18 14:24:35 fibonacci kernel: drbd11: PingAck did not arrive in time.
>Jul 18 14:24:35 fibonacci kernel: drbd11: drbd11_asender [1917]: cstate Connected --> NetworkFailure
>Jul 18 14:24:35 fibonacci kernel: drbd11: asender terminated
>Jul 18 14:24:35 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate NetworkFailure --> BrokenPipe
>Jul 18 14:24:35 fibonacci kernel: drbd11: short read expecting header on sock: r=-512
>Jul 18 14:24:35 fibonacci kernel: drbd11: worker terminated
>Jul 18 14:24:35 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate BrokenPipe --> Unconnected
>Jul 18 14:24:35 fibonacci kernel: drbd11: Connection lost.
>Jul 18 14:24:35 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate Unconnected --> WFConnection
>Jul 18 14:25:02 fibonacci kernel: drbd11: Secondary/Unknown --> Primary/Unknown
>Jul 18 14:25:23 fibonacci kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
>Jul 18 14:25:23 fibonacci kernel: tg3: eth0: Flow control is on for TX and on for RX.
>Jul 18 14:25:26 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate WFConnection --> WFReportParams
>Jul 18 14:25:26 fibonacci kernel: drbd11: Handshake successful: DRBD Network Protocol version 74
>Jul 18 14:25:26 fibonacci kernel: drbd11: incompatible states (both Primary!)
>Jul 18 14:25:26 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate WFReportParams --> StandAlone
>Jul 18 14:25:26 fibonacci kernel: drbd11: error receiving ReportParams, l: 72!
>Jul 18 14:25:26 fibonacci kernel: drbd11: worker terminated
>Jul 18 14:25:26 fibonacci kernel: drbd11: asender terminated
>Jul 18 14:25:26 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate StandAlone --> StandAlone
>Jul 18 14:25:26 fibonacci kernel: drbd11: Connection lost.
>Jul 18 14:25:26 fibonacci kernel: drbd11: receiver terminated
>Jul 18 14:25:29 fibonacci kernel: drbd11: Primary/Unknown --> Secondary/Unknown
>
>
>
>Thanks in advance, and sorry if I wasn't clear about my problem.
>
>
>-- 
>Djalma Fadel Junior
>Technical Director
>Ferasoft Corporation Ltda
>fadel at ferasoft.com.br
>
>_______________________________________________
>Linux-HA mailing list
>Linux-HA at lists.linux-ha.org
>http://lists.linux-ha.org/mailman/listinfo/linux-ha



