[Linux-HA] NFS active-active failover OCF RA.
btimby at gmail.com
Wed Mar 24 10:28:19 MDT 2010
I would like some opinions on the OCF RA I wrote. I needed to support
an active-active setup for NFS, and googling found me no working
solution, so I put one together. I have read these list archives and
various resources around the 'net when putting this together. My
testing is favorable so far, but I would like to ask the experts. I
wrote up a description of my solution on my blog; the RA is linked
from there. I will copy the text and link in this email. I am using
Heartbeat 3 and Pacemaker on CentOS 5.4.
I have need for an active-active NFS cluster. For review, an
active-active cluster is two boxes that export two resources (one
each). Each box acts as a backup for the other box’s resource. This
way, both boxes actively serve clients (albeit for different NFS
exports).
The first problem I ran into with this setup is that the nfsserver OCF
resource agent that comes with Heartbeat is not suitable. This is
because it works by stopping/starting the NFS server via its init
script. In my situation, NFS will always be running; I just want to
add/remove exports on failover.
Adding and removing exports is fairly easy under Linux; you use the
exportfs command:
$ exportfs -o rw,sync,mp 192.168.1.0/24:/mnt/fs/to/export
The options correspond to those you would place into /etc/exports, and
the rest is the host:/path portion, also as it would go into
/etc/exports. To remove an export, you specify the following:
$ exportfs -u 192.168.1.0/24:/mnt/fs/to/export
Therefore what I needed was an OCF RA that managed NFS exports using
exportfs. I wrote one and it is available at the link below.
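As a rough illustration of the idea (not the actual RA), the start and stop actions of such an agent could be thin wrappers around the two exportfs invocations shown above. The function names and the EXPORTFS override variable are hypothetical and exist only to make the sketch easy to dry-run:

```shell
#!/bin/sh
# Sketch only: function names and the EXPORTFS override are hypothetical,
# not taken from the RA described in this post.
EXPORTFS=${EXPORTFS:-exportfs}

# export_start DIR CLIENTSPEC OPTIONS
# Adds the export, mirroring "exportfs -o OPTIONS CLIENTSPEC:DIR".
export_start() {
    "$EXPORTFS" -o "$3" "$2:$1"
}

# export_stop DIR CLIENTSPEC
# Removes the export, mirroring "exportfs -u CLIENTSPEC:DIR".
export_stop() {
    "$EXPORTFS" -u "$2:$1"
}
```

Setting EXPORTFS to a command that merely prints its arguments lets you check the generated command lines without touching the real export table.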
However there are two remaining issues.
The first is that when you export a file system via NFS, a unique fsid
is generated for that file system. The client machines that mount the
exported file system use this id to generate handles to
directories/files. This fsid is generated using the major/minor of the
device being exported. This is a problem for me, as the device being
exported is a DRBD volume with LVM on top of it. This means that when
the LVM OCF RA fails over the LVM volgroup, the major/minor will
change. In fact, the first device on my system had a minor of 4. This
was true of both nodes. If a resource migrates, it receives a
different minor, as the existing volgroup on the destination node
already occupies 4. This means that the fsid will change for the
exported file system and all client file handles are stale after
failover.
To fix this, each exported file system needs a unique fsid option
passed to exportfs:
$ exportfs -o rw,sync,mp,fsid=1 192.168.1.0/24:/mnt/fs/to/export
Note that fsid=0 has special meaning in NFSv4, so avoid it unless you
read the docs and understand its special use. I have taken care of
this in my RA by generating a random fsid in case one is not already
assigned. This random fsid is then written to the DRBD device, and
used on the other node when the file system is exported. This way the
fsid is both unique and persistent (it remains the same on the other
node after failover).
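A minimal sketch of that fsid handling might look like the following. It assumes the fsid is kept in a small file at the top of the exported file system; the file name and function names are hypothetical, and the actual RA may store the value differently. A random value is unlikely, but not guaranteed, to be distinct from other exports' fsids.

```shell
#!/bin/sh
# Sketch only: the file name and function names are hypothetical.
FSID_FILE_NAME=${FSID_FILE_NAME:-.exportfs_fsid}

# get_fsid DIR
# Prints the fsid stored under DIR, generating and saving one if absent.
get_fsid() {
    fsid_file="$1/$FSID_FILE_NAME"
    if [ ! -s "$fsid_file" ]; then
        # Two random bytes give 0..65535; adding 1 avoids fsid=0,
        # which has special meaning in NFSv4.
        fsid=$(( $(od -An -N2 -tu2 /dev/urandom | tr -d ' ') + 1 ))
        echo "$fsid" > "$fsid_file"
    fi
    cat "$fsid_file"
}

# add_fsid OPTIONS DIR
# Prints OPTIONS unchanged if they already set an fsid,
# otherwise appends the persistent one for DIR.
add_fsid() {
    case ",$1," in
        *,fsid=*) echo "$1" ;;
        *)        echo "$1,fsid=$(get_fsid "$2")" ;;
    esac
}
```

Because the file lives on the replicated device, whichever node mounts the file system reads back the same value.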
The other problem is that the /var/lib/nfs/rmtab file needs to be
synchronized. This file lists the clients that have mounted the
exported file system. Again, I handle this in my RA by saving the
relevant rmtab entries onto the DRBD device, and restoring them to the
other node’s rmtab file. I also remove these entries from the node on
which the resource is stopped.
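The rmtab handling could be sketched roughly as below. This assumes the usual host:directory:count line format of rmtab and a backup file kept at the top of the exported file system; the backup file name, the RMTAB override and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: names are hypothetical. rmtab lines are assumed to look
# like "host:/exported/dir:0x00000001".
RMTAB=${RMTAB:-/var/lib/nfs/rmtab}
RMTAB_BACKUP_NAME=${RMTAB_BACKUP_NAME:-.rmtab}

# backup_rmtab DIR: save this node's rmtab entries for DIR onto DIR itself.
backup_rmtab() {
    grep -F ":$1:" "$RMTAB" > "$1/$RMTAB_BACKUP_NAME" || true
}

# restore_rmtab DIR: merge saved entries into this node's rmtab,
# without creating duplicates.
restore_rmtab() {
    [ -f "$1/$RMTAB_BACKUP_NAME" ] || return 0
    touch "$RMTAB"
    tmp=$(mktemp)
    sort -u "$RMTAB" "$1/$RMTAB_BACKUP_NAME" > "$tmp"
    cat "$tmp" > "$RMTAB"
    rm -f "$tmp"
}

# clean_rmtab DIR: drop the entries for DIR from this node's rmtab.
clean_rmtab() {
    tmp=$(mktemp)
    grep -vF ":$1:" "$RMTAB" > "$tmp" || true
    cat "$tmp" > "$RMTAB"
    rm -f "$tmp"
}
```

In this sketch a stop action would call backup_rmtab and then clean_rmtab, while a start action on the other node would call restore_rmtab before exporting.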
This gives me a smooth failover of NFS from one node to the other and
back again. To use my RA, simply install it onto your cluster nodes
Then you can create a resource using that RA. It requires three parameters:
1. exportfs_dir - the directory to export.
2. exportfs_clientspec - the client specification to export to.
3. exportfs_options - the options as you would specify in /etc/exports.
If you provide an fsid in the exportfs_options param, that value will
be honored; the random fsid is only generated when fsid is absent.
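For illustration, creating such a resource with the crm shell might look like the sketch below. The three parameter names come from the list above; the resource ID, the provider/agent name ocf:custom:exportfs and the CRM override variable are assumptions, so substitute whatever matches where you installed the RA.

```shell
#!/bin/sh
# Sketch only: "ocf:custom:exportfs" is a hypothetical provider/agent name
# and CRM is a hypothetical override used for dry runs.
CRM=${CRM:-crm}

# create_export_resource ID DIR CLIENTSPEC OPTIONS
create_export_resource() {
    "$CRM" configure primitive "$1" ocf:custom:exportfs \
        params exportfs_dir="$2" \
               exportfs_clientspec="$3" \
               exportfs_options="$4"
}

# Example call (commented out so that sourcing this file changes nothing):
# create_export_resource nfs_export_a /mnt/fs/to/export 192.168.1.0/24 rw,sync,mp
```

An active-active setup would define one such primitive per export, each preferring a different node.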
This seems to work perfectly on my cluster running CentOS 5.4; I
tested using an Ubuntu 9.10 client.
** Update **
I posted a new version of the OCF RA. The problem was that it only
backed up rmtab when the resource was being stopped. Needless to
say, this only covers the graceful failover scenario. If the service
dies, the backup is never made. I have remedied this by spawning a
process that continually backs up rmtab. This process is then killed
when the resource is stopped. This should cover resource failures as
well as graceful failovers.
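The background backup could be sketched as follows. This is an illustration under stated assumptions rather than the code in the RA: the interval, the pidfile handling, the backup file name and the function names are all hypothetical.

```shell
#!/bin/sh
# Sketch only: names, interval and pidfile handling are hypothetical.
RMTAB=${RMTAB:-/var/lib/nfs/rmtab}
BACKUP_INTERVAL=${BACKUP_INTERVAL:-2}   # seconds between backups

# start_backup_loop DIR PIDFILE
# Spawns a background loop that keeps copying DIR's rmtab entries onto DIR.
start_backup_loop() {
    (
        while :; do
            # Write to a temporary file first, then rename, so a crash
            # mid-write never leaves a truncated backup behind.
            grep -F ":$1:" "$RMTAB" > "$1/.rmtab.tmp"
            mv -f "$1/.rmtab.tmp" "$1/.rmtab"
            sleep "$BACKUP_INTERVAL"
        done
    ) >/dev/null 2>&1 &
    echo $! > "$2"
}

# stop_backup_loop PIDFILE
# Kills the loop started above and removes its pidfile.
stop_backup_loop() {
    [ -f "$1" ] || return 0
    kill "$(cat "$1")" 2>/dev/null
    rm -f "$1"
}
```

With this arrangement the backup on the replicated device is at most one interval old when a node fails, so the surviving node can still restore the client list.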