[Linux-HA] how to do fencing with drbd+heartbeat

Djalma Fadel Junior dfadel at ferasoft.com.br
Mon Jul 18 11:42:18 MDT 2005


Hi ppl,

I have set up an HA cluster using drbd+heartbeat, but I'm having trouble solving a fencing issue.

1) SCENARIO DESCRIPTION

I have two identical servers, named 'fibonacci' and 'euclides', where fibonacci is set to own the resources.
For while, I want only 
I've configured ssh with an RSA key to test stonith. I also set a ping node pointing to another machine on my local network.

Software versions:
kernel: Linux 2.6.12 
drbd0.7-utils  0.7.10-3
heartbeat      1.2.3-9 
stonith        1.2.3-9 



2) PROBLEM DESCRIPTION

Everything works correctly at first. Let's suppose that fibonacci is running as primary.
When I unplug the network cable of euclides, both servers go to primary (drbd) and keep waiting for a connection.
Since both servers are primary, a conflict occurs when the network connection returns, and drbd drops the connection due to incompatible states.

I would like to know how to prevent both machines from becoming primary, or how to keep the machine that is returning to the network from rejoining in an invalid state.
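
To make the failure mode above concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the function name, its parameters, and the "Off" state are hypothetical and do not reproduce heartbeat's or drbd's real decision logic. It only illustrates why a two-node cluster ends up with two primaries when a takeover is not gated on a confirmed fence.

```python
# Toy model only: NOT heartbeat's or drbd's actual logic.
# All names here are hypothetical and exist purely for illustration.

def roles_after_partition(roles, fence_required, fenced_nodes):
    """Return each node's role after the nodes lose sight of each other.

    roles:          dict of node name -> "Primary" or "Secondary" before the cut.
    fence_required: if True, a secondary promotes only once its peer is off.
    fenced_nodes:   set of nodes that were successfully powered off.
    """
    result = {}
    for node, role in roles.items():
        if node in fenced_nodes:
            # A fenced node is powered off and holds no role at all.
            result[node] = "Off"
        elif role == "Primary":
            # The old primary has no reason to step down on its own.
            result[node] = "Primary"
        else:
            peers = [n for n in roles if n != node]
            peers_off = all(p in fenced_nodes for p in peers)
            if not fence_required or peers_off:
                # The peer looks dead, so the secondary takes over.
                result[node] = "Primary"
            else:
                # Fence not confirmed: stay secondary rather than risk two primaries.
                result[node] = "Secondary"
    return result


before = {"fibonacci": "Primary", "euclides": "Secondary"}

no_fencing = roles_after_partition(before, fence_required=False, fenced_nodes=set())
fence_failed = roles_after_partition(before, fence_required=True, fenced_nodes=set())
fence_worked = roles_after_partition(before, fence_required=True, fenced_nodes={"fibonacci"})

print(no_fencing)    # both nodes end up Primary
print(fence_failed)  # euclides stays Secondary
print(fence_worked)  # fibonacci is Off, euclides is Primary
```

In this model the only outcome with two primaries is the one where promotion does not wait for a confirmed fence, which is the situation described above.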



3) CONFIG FILES

FILE /etc/drbd.conf:
resource drbd11 {
  protocol B;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    wfc-timeout 60;
    degr-wfc-timeout 60;
  }

  disk {
    on-io-error   detach;
  }

  net {
    timeout       60;
    connect-int   10;
    ping-int      10;
    max-buffers     2048;
    max-epoch-size  2048;
  }

  syncer {
    rate 100M;
    group 1;
  }

  on fibonacci {
    device     /dev/drbd11;
    disk       /dev/md11;
    address    192.168.0.147:7788;
    meta-disk  internal;
  }

  on euclides {
    device    /dev/drbd11;
    disk      /dev/md11;
    address   192.168.0.148:7788;
    meta-disk internal;
  }
}


FILE /etc/ha.d/ha.cf:
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility     local0
keepalive 2
deadtime 30
udpport 694
ucast eth0 192.168.0.147 192.168.0.148
auto_failback on
stonith_host * ssh euclides
node fibonacci euclides
ping 192.168.0.1
respawn hacluster /usr/lib/heartbeat/ipfail
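
As a rough cross-check of the timing values in ha.cf above, the following Python sketch compares keepalive and deadtime with four timestamps copied from the log excerpts in section 4. The helper name seconds_between is hypothetical. The link between deadtime and the observed takeover delay is an interpretation of the logs, not something heartbeat reports itself.

```python
from datetime import datetime

# Values copied from ha.cf above.
KEEPALIVE_S = 2
DEADTIME_S = 30

# How many consecutive heartbeats can go missing before the peer is declared dead.
missed_heartbeats = DEADTIME_S // KEEPALIVE_S


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Whole seconds between two same-day HH:MM:SS timestamps (hypothetical helper)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps copied from the kernel log in section 4.
link_down = "14:24:32"     # tg3: eth0: Link is down.
promoted = "14:25:02"      # Secondary/Unknown --> Primary/Unknown
link_up = "14:25:23"       # tg3: eth0: Link is up
both_primary = "14:25:26"  # incompatible states (both Primary!)

takeover_delay = seconds_between(link_down, promoted)
outage = seconds_between(link_down, link_up)
detection_delay = seconds_between(link_up, both_primary)

print(missed_heartbeats, takeover_delay, outage, detection_delay)
```

The takeover delay comes out equal to deadtime, which suggests the promotion was triggered as soon as heartbeat declared the peer dead.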



FILE /etc/ha.d/haresources:
fibonacci IPaddr::192.168.0.145/24/eth0/192.168.0.255 drbddisk::drbd11 Filesystem::/dev/drbd11::/backup::xfs



4) LOG FILES

/var/log/ha-log (these logs are from the secondary server, in this case fibonacci):
** here I unplugged the secondary server.
heartbeat: 2005/07/18_14:25:00 WARN: node 192.168.0.1: is dead
heartbeat: 2005/07/18_14:25:00 info: Link 192.168.0.1:192.168.0.1 dead.
heartbeat: 2005/07/18_14:25:00 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2005/07/18_14:25:02 WARN: node euclides: is dead
heartbeat: 2005/07/18_14:25:02 info: Dead node euclides gave up resources.
heartbeat: 2005/07/18_14:25:02 info: Resources being acquired from euclides.
heartbeat: 2005/07/18_14:25:02 info: Link euclides:eth0 dead.
heartbeat: 2005/07/18_14:25:02 info: Local Resource acquisition completed.
heartbeat: 2005/07/18_14:25:02 info: Initial resource acquisition complete (T_RESOURCES(us))
heartbeat: 2005/07/18_14:25:02 info: Local Resource acquisition completed.
heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2005/07/18_14:25:02 info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
heartbeat: 2005/07/18_14:25:02 info: mach_down takeover complete.
heartbeat: 2005/07/18_14:25:02 info: mach_down takeover complete for node euclides.
heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
heartbeat: 2005/07/18_14:25:02 received ip-request-resp IPaddr::192.168.0.145/24/eth0/192.168.0.255 OK yes
heartbeat: 2005/07/18_14:25:02 info: Acquiring resource group: fibonacci IPaddr::192.168.0.145/24/eth0/192.168.0.255 drbddisk::drbd11
heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.145/24/eth0/192.168.0.255 start
heartbeat: 2005/07/18_14:25:02 info: /sbin/ifconfig eth0:0 192.168.0.145  netmask 255.255.255.0 broadcast 192.168.0.255
heartbeat: 2005/07/18_14:25:02 info: Sending Gratuitous Arp for 192.168.0.145 on eth0:0 [eth0]
heartbeat: 2005/07/18_14:25:02 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.0.145 eth0 192.168.0.145 auto 192.168.0.145 ffffffffffff
heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/resource.d/drbddisk drbd11 start
heartbeat: 2005/07/18_14:25:02 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
heartbeat: 2005/07/18_14:25:02 received ip-request-resp IPaddr::192.168.0.145/24/eth0/192.168.0.255 OK yes
heartbeat: 2005/07/18_14:25:02 info: Acquiring resource group: fibonacci IPaddr::192.168.0.145/24/eth0/192.168.0.255 drbddisk::drbd11
heartbeat: 2005/07/18_14:25:13 info: Local Resource acquisition completed. (none)
heartbeat: 2005/07/18_14:25:13 info: local resource transition completed.
heartbeat: 2005/07/18_14:25:24 info: Link 192.168.0.1:192.168.0.1 up.
heartbeat: 2005/07/18_14:25:24 WARN: Late heartbeat: Node 192.168.0.1: interval 54030 ms
heartbeat: 2005/07/18_14:25:24 info: Status update for node 192.168.0.1: status ping
heartbeat: 2005/07/18_14:25:24 WARN: Cluster node euclides returning after partition.
heartbeat: 2005/07/18_14:25:24 WARN: Deadtime value may be too small.
heartbeat: 2005/07/18_14:25:24 info: See documentation for information on tuning deadtime.
heartbeat: 2005/07/18_14:25:24 WARN: 26 lost packet(s) for [euclides] [114:141]
heartbeat: 2005/07/18_14:25:24 info: Link euclides:eth0 up.
heartbeat: 2005/07/18_14:25:24 WARN: Late heartbeat: Node euclides: interval 54020 ms
heartbeat: 2005/07/18_14:25:24 info: Status update for node euclides: status active
heartbeat: 2005/07/18_14:25:24 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2005/07/18_14:25:26 info: Heartbeat shutdown in progress. (1921)
heartbeat: 2005/07/18_14:25:26 info: Giving up all HA resources.
heartbeat: 2005/07/18_14:25:29 info: Releasing resource group: fibonacci IPaddr::192.168.0.145/24/eth0/192.168.0.255 drbddisk::drbd11
heartbeat: 2005/07/18_14:25:29 info: Running /etc/ha.d/resource.d/drbddisk drbd11 stop
heartbeat: 2005/07/18_14:25:30 info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.145/24/eth0/192.168.0.255 stop
heartbeat: 2005/07/18_14:25:30 info: /sbin/route -n del -host 192.168.0.145
heartbeat: 2005/07/18_14:25:30 info: /sbin/ifconfig eth0:0 down
heartbeat: 2005/07/18_14:25:30 info: IP Address 192.168.0.145 released
heartbeat: 2005/07/18_14:25:30 info: killing /usr/lib/heartbeat/ipfail process group 1931 with signal 15
heartbeat: 2005/07/18_14:25:30 info: All HA resources relinquished.
heartbeat: 2005/07/18_14:25:30 info: killing /usr/lib/heartbeat/ipfail process group 1931 with signal 15
heartbeat: 2005/07/18_14:25:31 info: killing HBREAD process 1925 with signal 15
heartbeat: 2005/07/18_14:25:31 info: killing HBWRITE process 1926 with signal 15
heartbeat: 2005/07/18_14:25:31 info: killing HBREAD process 1927 with signal 15
heartbeat: 2005/07/18_14:25:31 info: killing HBFIFO process 1923 with signal 15
heartbeat: 2005/07/18_14:25:31 info: killing HBWRITE process 1924 with signal 15
heartbeat: 2005/07/18_14:25:31 info: Core process 1923 exited. 5 remaining
heartbeat: 2005/07/18_14:25:31 info: Core process 1924 exited. 4 remaining
heartbeat: 2005/07/18_14:25:31 info: Core process 1925 exited. 3 remaining
heartbeat: 2005/07/18_14:25:31 info: Core process 1926 exited. 2 remaining
heartbeat: 2005/07/18_14:25:31 info: Core process 1927 exited. 1 remaining
heartbeat: 2005/07/18_14:25:31 info: Heartbeat shutdown complete.
heartbeat: 2005/07/18_14:25:31 info: Heartbeat restart triggered.
heartbeat: 2005/07/18_14:25:31 info: Restarting heartbeat.
heartbeat: 2005/07/18_14:25:31 info: Performing heartbeat restart exec.
heartbeat: 2005/07/18_14:26:02 info: **************************
heartbeat: 2005/07/18_14:26:02 info: Configuration validated. Starting heartbeat 1.2.3
heartbeat: 2005/07/18_14:26:02 info: heartbeat: version 1.2.3
heartbeat: 2005/07/18_14:26:02 info: Heartbeat generation: 130
heartbeat: 2005/07/18_14:26:02 info: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
heartbeat: 2005/07/18_14:26:02 info: ucast: bound send socket to device: eth0
heartbeat: 2005/07/18_14:26:02 info: ucast: bound receive socket to device: eth0
heartbeat: 2005/07/18_14:26:02 info: ucast: started on port 694 interface eth0 to 192.168.0.147
heartbeat: 2005/07/18_14:26:02 info: ping heartbeat started.
heartbeat: 2005/07/18_14:26:02 info: pid 3288 locked in memory.
heartbeat: 2005/07/18_14:26:02 info: Local status now set to: 'up'
heartbeat: 2005/07/18_14:26:03 info: pid 3290 locked in memory.
heartbeat: 2005/07/18_14:26:03 info: pid 3291 locked in memory.
heartbeat: 2005/07/18_14:26:03 info: pid 3292 locked in memory.
heartbeat: 2005/07/18_14:26:03 info: Link euclides:eth0 up.
heartbeat: 2005/07/18_14:26:03 info: Status update for node euclides: status active
heartbeat: 2005/07/18_14:26:03 info: pid 3293 locked in memory.
heartbeat: 2005/07/18_14:26:03 info: pid 3294 locked in memory.
heartbeat: 2005/07/18_14:26:03 info: Link 192.168.0.1:192.168.0.1 up.
heartbeat: 2005/07/18_14:26:03 info: Status update for node 192.168.0.1: status ping
heartbeat: 2005/07/18_14:26:03 info: Local status now set to: 'active'
heartbeat: 2005/07/18_14:26:03 info: Starting child client "/usr/lib/heartbeat/ipfail" (1001,104)
heartbeat: 2005/07/18_14:26:03 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2005/07/18_14:26:03 info: Checking status of STONITH device [ssh STONITH device]
heartbeat: 2005/07/18_14:26:03 info: Exiting STONITH-stat process 3296 returned rc 0.
heartbeat: 2005/07/18_14:26:03 info: Starting "/usr/lib/heartbeat/ipfail" as uid 1001  gid 104 (pid 3297)



/var/log/kernel.log
Jul 18 14:24:32 fibonacci kernel: tg3: eth0: Link is down.
Jul 18 14:24:35 fibonacci kernel: drbd11: PingAck did not arrive in time.
Jul 18 14:24:35 fibonacci kernel: drbd11: drbd11_asender [1917]: cstate Connected --> NetworkFailure
Jul 18 14:24:35 fibonacci kernel: drbd11: asender terminated
Jul 18 14:24:35 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate NetworkFailure --> BrokenPipe
Jul 18 14:24:35 fibonacci kernel: drbd11: short read expecting header on sock: r=-512
Jul 18 14:24:35 fibonacci kernel: drbd11: worker terminated
Jul 18 14:24:35 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate BrokenPipe --> Unconnected
Jul 18 14:24:35 fibonacci kernel: drbd11: Connection lost.
Jul 18 14:24:35 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate Unconnected --> WFConnection
Jul 18 14:25:02 fibonacci kernel: drbd11: Secondary/Unknown --> Primary/Unknown
Jul 18 14:25:23 fibonacci kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Jul 18 14:25:23 fibonacci kernel: tg3: eth0: Flow control is on for TX and on for RX.
Jul 18 14:25:26 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate WFConnection --> WFReportParams
Jul 18 14:25:26 fibonacci kernel: drbd11: Handshake successful: DRBD Network Protocol version 74
Jul 18 14:25:26 fibonacci kernel: drbd11: incompatible states (both Primary!)
Jul 18 14:25:26 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate WFReportParams --> StandAlone
Jul 18 14:25:26 fibonacci kernel: drbd11: error receiving ReportParams, l: 72!
Jul 18 14:25:26 fibonacci kernel: drbd11: worker terminated
Jul 18 14:25:26 fibonacci kernel: drbd11: asender terminated
Jul 18 14:25:26 fibonacci kernel: drbd11: drbd11_receiver [456]: cstate StandAlone --> StandAlone
Jul 18 14:25:26 fibonacci kernel: drbd11: Connection lost.
Jul 18 14:25:26 fibonacci kernel: drbd11: receiver terminated
Jul 18 14:25:29 fibonacci kernel: drbd11: Primary/Unknown --> Secondary/Unknown
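
To pick the relevant events out of a kernel log like the one above, a small Python filter can help. This is a minimal sketch that assumes the message formats are exactly those shown in the excerpt. The function name summarize and the event tuples are hypothetical, and other drbd versions may word these messages differently.

```python
import re

# Matches drbd role transitions such as "Secondary/Unknown --> Primary/Unknown".
# cstate transitions contain no "/" pair, so they are not matched.
ROLE_CHANGE = re.compile(r"drbd\d+: (\w+)/(\w+) --> (\w+)/(\w+)")
BOTH_PRIMARY = "incompatible states (both Primary!)"


def summarize(lines):
    """Return a list of (kind, old_role, new_role) events found in the log lines."""
    events = []
    for line in lines:
        match = ROLE_CHANGE.search(line)
        if match:
            # Groups 1 and 3 are the local role before and after the change.
            events.append(("role", match.group(1), match.group(3)))
        elif BOTH_PRIMARY in line:
            events.append(("both-primary", None, None))
    return events


# Lines copied from the kernel log above.
sample = [
    "Jul 18 14:24:35 fibonacci kernel: drbd11: drbd11_asender [1917]: cstate Connected --> NetworkFailure",
    "Jul 18 14:25:02 fibonacci kernel: drbd11: Secondary/Unknown --> Primary/Unknown",
    "Jul 18 14:25:26 fibonacci kernel: drbd11: incompatible states (both Primary!)",
    "Jul 18 14:25:29 fibonacci kernel: drbd11: Primary/Unknown --> Secondary/Unknown",
]

events = summarize(sample)
for event in events:
    print(event)
```

On the sample lines this reports the promotion, the both-primary conflict, and the later demotion, in that order.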



Thanks in advance, and sorry if I wasn't clear about my problem.


-- 
Djalma Fadel Junior
Diretor Técnico
Ferasoft Corporation Ltda
fadel at ferasoft.com.br



