[Linux-HA] HA + DRBD problems...

Gael Pourriel gael.pourriel at gmail.com
Wed Mar 23 01:42:18 MST 2005


Dear all, I'm trying to configure linux-ha along with drbd and I'm
running into problems which I cannot solve.
I've got an active/passive configuration and followed the instructions on:
http://wiki.linux-ha.org/GettingStarted_2fDRBD

I've configured everything for drbd: compiled the module, added the
devices in /dev, and set the startup script to run at boot.
Everything seems to work. I've formatted my shared partition with
reiserfs and I can mount it without any problem on both nodes.
However, when I try to get HA to do this itself through the
haresources file, it doesn't work.

haresources:
node1  IPaddr::192.168.200.10 drbddisk::r0
Filesystem::/dev/drbd0::/spare::reseirfs monit
node2

PS: I use monit to then start and stop services such as httpd, smbd,
etc., and that works fine.
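
For reference, here is a minimal sketch of how a haresources entry like
the one above is usually understood to expand: each resource is split on
"::" into a script name plus arguments, the scripts are called with
"start" from left to right on takeover, and with "stop" from right to
left on release. The function names are hypothetical and the entry is
copied verbatim from this post (joined onto one line). Note that the
last field of the Filesystem resource is passed through unchanged as the
filesystem type, so its spelling matters.

```python
# Sketch only: expand a heartbeat v1 haresources entry into the
# start/stop calls it implies. Function names are illustrative.

def expand_entry(entry):
    """Split 'node res1::arg1 res2 ...' into (node, [[script, args...], ...])."""
    node, *resources = entry.split()
    calls = [res.split("::") for res in resources]
    return node, calls

def start_commands(entry):
    """Commands in start order (left to right)."""
    _, calls = expand_entry(entry)
    return [" ".join(call + ["start"]) for call in calls]

def stop_commands(entry):
    """Commands in stop order (right to left)."""
    _, calls = expand_entry(entry)
    return [" ".join(call + ["stop"]) for call in reversed(calls)]

# Entry copied verbatim from the post, joined onto one line.
ENTRY = ("node1 IPaddr::192.168.200.10 drbddisk::r0 "
         "Filesystem::/dev/drbd0::/spare::reseirfs monit")

if __name__ == "__main__":
    for cmd in start_commands(ENTRY):
        print(cmd)
```

Printing the start commands makes it easy to compare what HA would run
against the commands typed by hand.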

ha.cf:
logfacility     local0
debug 0
deadtime 15
warntime 10
initdead 30
ucast eth0 10.0.1.1
ucast eth0 10.0.2.1
auto_failback on
node    node1
node    node2
ping 192.168.200.254
respawn nobody /lib/heartbeat/ipfail
apiauth ipfail gid=nobody uid=nobody

drbd.conf:
resource r0 {
  protocol C;
  startup {
    wfc-timeout  0;
    degr-wfc-timeout 120;    # 2 minutes.
  }
  disk {
    on-io-error   detach;
  }
  syncer {
    rate 10M;
    group 1;
    al-extents 257;
  }
  on node1 {
    device     /dev/drbd0;
    disk       /dev/hdb;
    address    10.0.1.1:7788;
    meta-disk  internal;
  }
  on node22 {
    device    /dev/drbd0;
    disk      /dev/hdb;
    address   10.0.2.1:7788;
    meta-disk internal;
  }
}
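
Since both heartbeat and drbd identify hosts by name (as far as I know,
both expect the output of "uname -n"), a quick consistency check between
the two files can rule out a naming mismatch. The sketch below is only
an illustration: the helper names are hypothetical, and the embedded
snippets are trimmed copies of the configs as quoted in this post, which
may of course contain transcription slips.

```python
import re

def hacf_nodes(text):
    """Node names from 'node <name> [<name> ...]' lines of ha.cf."""
    names = set()
    for line in text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop trailing comments
        if parts and parts[0] == "node":
            names.update(parts[1:])
    return names

def drbd_hosts(text):
    """Host names from 'on <name> {' sections of drbd.conf."""
    # 'on-io-error' does not match: the pattern needs whitespace after 'on'.
    return set(re.findall(r"^\s*on\s+(\S+)\s*\{", text, flags=re.MULTILINE))

def mismatches(hacf_text, drbd_text):
    """Return (hosts only in drbd.conf, nodes only in ha.cf), both sorted."""
    nodes, hosts = hacf_nodes(hacf_text), drbd_hosts(drbd_text)
    return sorted(hosts - nodes), sorted(nodes - hosts)

# Trimmed copies of the files as quoted in the post.
HA_CF = """\
node    node1
node    node2
"""
DRBD_CONF = """\
resource r0 {
  disk {
    on-io-error   detach;
  }
  on node1 {
    device     /dev/drbd0;
  }
  on node22 {
    device    /dev/drbd0;
  }
}
"""

if __name__ == "__main__":
    only_drbd, only_ha = mismatches(HA_CF, DRBD_CONF)
    print("only in drbd.conf:", only_drbd)
    print("only in ha.cf:", only_ha)
```

Two empty lists would mean the names agree in both files.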

Basically, when HA takes over, nothing happens: the network is not
configured properly, /spare isn't mounted and my services aren't
started. However, if I take out the resources
drbddisk::r0 Filesystem::/dev/drbd0::/spare::reseirfs then everything
works as expected.
I cannot see any obvious error message in the logs (attached), but I do see this:

drbd0: Secondary/Unknown --> Primary/Unknown
then 10 seconds later:
drbd0: Primary/Unknown --> Secondary/Unknown

This looks like the drbddisk start command runs correctly, hence the
first transition (Secondary/Unknown --> Primary/Unknown), but then the
stop command is issued, giving the second transition
(Primary/Unknown --> Secondary/Unknown). When the system then tries to
mount the device using the Filesystem resource, it fails, which I guess
aborts the takeover.
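
To pick this pattern out of a longer log, a small filter along the
following lines could help. It is a sketch under assumptions: the
regular expression only covers role lines in the exact form quoted
above ("drbdN: Local/Peer --> Local/Peer"), and the function names are
made up for illustration.

```python
import re

# Matches role changes such as 'drbd0: Secondary/Unknown --> Primary/Unknown'.
# Connection-state lines ('cstate A --> B') have no '/' pair and are skipped.
TRANSITION = re.compile(r"^(drbd\d+): (\w+)/(\w+) --> (\w+)/(\w+)$")

def role_transitions(log_text):
    """Return [(device, old_local_role, new_local_role), ...] in log order."""
    found = []
    for line in log_text.splitlines():
        m = TRANSITION.match(line.strip())
        if m:
            dev, old_local, _, new_local, _ = m.groups()
            found.append((dev, old_local, new_local))
    return found

def promoted_then_demoted(transitions):
    """True if some device became Primary and was later set back to Secondary."""
    promoted = set()
    for dev, old, new in transitions:
        if (old, new) == ("Secondary", "Primary"):
            promoted.add(dev)
        elif (old, new) == ("Primary", "Secondary") and dev in promoted:
            return True
    return False

# Last lines of the attached log, copied from the post.
LOG = """\
heartbeat: [200]: info: local resource transition completed.
drbd0: Secondary/Unknown --> Primary/Unknown
drbd0: Primary/Unknown --> Secondary/Unknown
"""

if __name__ == "__main__":
    steps = role_transitions(LOG)
    print(steps)
    print("promoted then demoted:", promoted_then_demoted(steps))
```

This only confirms the sequence of role changes; it says nothing about
which process asked for them.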

The weird thing is that if I run these commands manually once drbddisk
is stopped i.e.:

/etc/ha.d/resource.d/drbddisk r0 start
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /spare reiserfs

Then it does work, /spare is fscked and mounted.

Why would HA start drbddisk::r0 and then stop it 10 seconds later?

Gael

Versions:
drbd: Version: 0.7.10 (api:77)
linux-ha: 1.99.3
kernel: 2.4.9 (RH AS 2.1)
-------------- next part --------------
Using /lib/modules/2.4.9-e.3custom/kernel/drivers/block/drbd.o
drbd: initialised. Version: 0.7.10 (api:77/proto:74)
drbd: SVN Revision: 1743 build by root at OSBUILDER, 2005-03-13 18:20:05
drbd: registered as block device major 147
Starting DRBD resources:    [ d0 drbd0: resync bitmap: bits=223232 words=6976
drbd0: size = 872 MB (892928 KB)
drbd0: 872 MB marked out-of-sync by on disk bit-map.
drbd0: Found 6 transactions (28 active extents) in activity log.
drbd0: drbdsetup [62]: cstate Unconfigured --> StandAlone
s0 n0 drbd0: drbdsetup [75]: cstate StandAlone --> Unconnected
drbd0: drbd0_receiver [76]: cstate Unconnected --> WFConnection
].
heartbeat: [199]: info: **************************
heartbeat: [199]: info: Configuration validated. Starting heartbeat 1.99.3
heartbeat: [200]: info: heartbeat: version 1.99.3
heartbeat: [200]: WARN: No Previous generation - starting at 1
heartbeat: [200]: info: Heartbeat generation: 1
heartbeat: [200]: info: No uuid found - generating an uuid
heartbeat: [200]: info: Creating FIFO /var/lib/heartbeat/fifo.
heartbeat: [200]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
heartbeat: [200]: info: glib: ucast: bound send socket to device: eth0
heartbeat: [200]: info: glib: ucast: bound receive socket to device: eth0
heartbeat: [200]: info: glib: ucast: started on port 694 interface eth0 to 10.0.1.1
heartbeat: [200]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
heartbeat: [200]: info: glib: ucast: bound send socket to device: eth0
heartbeat: [200]: info: glib: ucast: bound receive socket to device: eth0
heartbeat: [200]: info: glib: ucast: started on port 694 interface eth0 to 10.0.2.1
heartbeat: [200]: info: glib: ping heartbeat started.
heartbeat: [209]: info: pid 209 locked in memory.
heartbeat: [210]: info: pid 210 locked in memory.
heartbeat: [212]: info: pid 212 locked in memory.
heartbeat: [213]: info: pid 213 locked in memory.
heartbeat: [200]: info: pid 200 locked in memory.
heartbeat: [200]: info: Local status now set to: 'up'
heartbeat: [214]: info: pid 214 locked in memory.
heartbeat: [215]: info: pid 215 locked in memory.
heartbeat: [200]: info: Link 192.168.200.254:192.168.200.254 up.
heartbeat: [200]: info: Status update for node 192.168.200.254: status ping
heartbeat: [211]: info: pid 211 locked in memory.
heartbeat: [200]: WARN: node node2: is dead
heartbeat: [200]: info: Local status now set to: 'active'
heartbeat: [200]: info: Starting child client "/lib/heartbeat/ipfail" (99,99)
heartbeat: [216]: info: Starting "/lib/heartbeat/ipfail" as uid 99  gid 99 (pid 216)
heartbeat: [200]: WARN: No STONITH device configured.
heartbeat: [200]: WARN: Shared disks are not protected.
heartbeat: [200]: info: Resources being acquired from node2.
heartbeat: [217]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
heartbeat: [200]: debug: SO_PEERCRED returned [219, (99:99)]
heartbeat: [200]: debug: Verifying authentication: cred.uid=99 cred.gid=99
heartbeat: [200]: debug: Verifying authentication: uidptr=0x811c9cc gidptr=0x0
heartbeat: [200]: debug: SO_PEERCRED returned [219, (99:99)]
heartbeat: [200]: debug: Verifying authentication: cred.uid=99 cred.gid=99
heartbeat: [200]: debug: Verifying authentication: uidptr=0x0 gidptr=0x811ca5c
heartbeat: [200]: debug: SO_PEERCRED returned [219, (99:99)]
heartbeat: [200]: debug: Verifying authentication: cred.uid=99 cred.gid=99
heartbeat: [200]: debug: Verifying authentication: uidptr=0x80fa10c gidptr=0x80fa144
heartbeat: [200]: info: mach_down takeover complete.
heartbeat: [200]: info: Initial resource acquisition complete (mach_down)
heartbeat: [200]: debug: StartNextRemoteRscReq(): child count 1
heartbeat: [200]: debug: StartNextRemoteRscReq(): child count 1
heartbeat: [218]: info: Local Resource acquisition completed.
heartbeat: [336]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
heartbeat: [200]: info: Local Resource acquisition completed. (none)
heartbeat: [200]: info: local resource transition completed.
drbd0: Secondary/Unknown --> Primary/Unknown
drbd0: Primary/Unknown --> Secondary/Unknown

