[Linux-HA] raw disk heartbeat

hm at seneca.muc.de
Fri Mar 11 14:39:39 MST 2005

Howdy folks,

I'm interested in contributing the raw disk thing. IMHO this could serve as
a general plugin for heartbeat in SCSI-like environments, like external
SCSI RAID, SAN or iSCSI. As long as it's a block device, it'll work.

I've been thinking about the architecture, and the best I've come up with
so far is this (and this is where I expect your constructive flames):

Two nodes share a partition, e.g. /dev/sdk1. The "first" node uses the
lower half, the "second" one uses the upper half. Who is first and second
is determined by this line in ha.cf:

hblun /dev/sdk1 node1 node2

This line has to be identical on both nodes, so both know where to read
and where to write. The size of the ring buffer is determined by the size
of the partition itself. Bigger partition -> bigger buffer. The lower bound
is the minimum partition size the hardware allows, I guess. Normally that
will be 1 cylinder, roughly 4 to 8 Mbyte. In a SAN, the minimum LUN size
may be much larger in some cases.
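
To make the lower/upper half rule concrete, here is a minimal sketch in C of
how a node could work out its own half from the hblun line. The helper
hblun_my_region() and struct hblun_region are made-up names for illustration;
nothing like them exists in heartbeat today, and the partition size is
assumed to be known already.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical: one node's half of the shared partition. */
struct hblun_region {
	off_t	start;	/* byte offset of this node's half */
	off_t	len;	/* length of the half in bytes */
};

/*
 * Parse an "hblun <device> <first> <second>" line and work out which half
 * belongs to the node named `me`. Returns 0 on success, -1 if the line is
 * malformed or `me` is not named in it.
 */
static int
hblun_my_region(const char *line, const char *me, off_t part_size,
    char *dev, size_t devlen, struct hblun_region *out)
{
	char d[128], first[64], second[64];

	if (sscanf(line, "hblun %127s %63s %63s", d, first, second) != 3)
		return -1;
	if (strcmp(first, second) == 0)
		return -1;		/* both halves would collide */

	out->len = part_size / 2;
	if (strcmp(me, first) == 0)
		out->start = 0;			/* "first" node: lower half */
	else if (strcmp(me, second) == 0)
		out->start = part_size / 2;	/* "second" node: upper half */
	else
		return -1;

	snprintf(dev, devlen, "%s", d);
	return 0;
}
```

A node would write into its own region and read from the other one, which
is why the line has to be identical on both nodes.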

(Question: do we really need a ring buffer? We can lose messages over
UDP or serial too. What makes the plugins check their input by the way?
They are not interrupt driven as far as I can see.) (The source code is
full of funny remarks but it explains little... )

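The on-disk data structure for an individual host will look like this:

struct ringbuffer {
	off_t	readp;		/* offset of the next byte to read */
	off_t	writep;		/* offset of the next byte to write */
	char	buffer[];	/* rest of this host's half of the partition */
};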

Does mmap make sense here? Don't think so - if the buffer is, say, 2
megabytes then... Otherwise simple read() and write() will do. raw(8) makes
no sense here because we don't want a sector-aligned character device. 
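
A rough sketch of the read()/write() variant, using pread() and pwrite() so
no seeking is needed. Everything here is an assumption for discussion: the
names rb_hdr, rb_put() and rb_get(), the fixed-width offsets instead of
off_t, the rule that only the owner updates writep and only the peer updates
readp, and the convention that one byte stays unused so readp == writep
means "empty". It also ignores caching, which a real shared-disk version
would have to deal with.

```c
#define _XOPEN_SOURCE 700	/* for pread() and pwrite() */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed on-disk header at the start of one node's half. */
struct rb_hdr {
	uint64_t readp;		/* next byte the peer will read */
	uint64_t writep;	/* next byte the owner will write */
};

/* Owner side: append len bytes; returns -1 on error or if they do not fit. */
static int
rb_put(int fd, off_t base, uint64_t cap, const void *msg, uint64_t len)
{
	struct rb_hdr h;
	const char *p = msg;
	off_t data = base + (off_t)sizeof(h);
	uint64_t first;

	if (pread(fd, &h, sizeof(h), base) != (ssize_t)sizeof(h))
		return -1;
	if (h.readp >= cap || h.writep >= cap)
		return -1;	/* header not initialised or corrupt */
	if (len >= cap - (h.writep + cap - h.readp) % cap)
		return -1;	/* not enough free space */

	/* Write up to the end of the buffer, then wrap to its start. */
	first = cap - h.writep < len ? cap - h.writep : len;
	if (pwrite(fd, p, first, data + (off_t)h.writep) != (ssize_t)first)
		return -1;
	if (len > first &&
	    pwrite(fd, p + first, len - first, data) != (ssize_t)(len - first))
		return -1;

	/* Publish the new write pointer only after the payload is written. */
	h.writep = (h.writep + len) % cap;
	if (pwrite(fd, &h.writep, sizeof(h.writep),
	    base + (off_t)offsetof(struct rb_hdr, writep)) !=
	    (ssize_t)sizeof(h.writep))
		return -1;
	return 0;
}

/* Peer side: fetch up to buflen pending bytes; returns the count or -1. */
static ssize_t
rb_get(int fd, off_t base, uint64_t cap, void *buf, uint64_t buflen)
{
	struct rb_hdr h;
	char *p = buf;
	off_t data = base + (off_t)sizeof(h);
	uint64_t avail, first;

	if (pread(fd, &h, sizeof(h), base) != (ssize_t)sizeof(h))
		return -1;
	if (h.readp >= cap || h.writep >= cap)
		return -1;

	avail = (h.writep + cap - h.readp) % cap;
	if (avail > buflen)
		avail = buflen;
	first = cap - h.readp < avail ? cap - h.readp : avail;
	if (pread(fd, p, first, data + (off_t)h.readp) != (ssize_t)first)
		return -1;
	if (avail > first &&
	    pread(fd, p + first, avail - first, data) != (ssize_t)(avail - first))
		return -1;

	/* Tell the owner how far we have read. */
	h.readp = (h.readp + avail) % cap;
	if (pwrite(fd, &h.readp, sizeof(h.readp),
	    base + (off_t)offsetof(struct rb_hdr, readp)) !=
	    (ssize_t)sizeof(h.readp))
		return -1;
	return (ssize_t)avail;
}
```

Here base is the start of one node's half and cap is the size of the data
area behind the header. Note that in this sketch the reader has to write
readp into the other node's half, which may or may not be what we want.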

I hope this architecture allows for more than two nodes in a cluster at a
later date, like:

hblun /dev/sdk1 node1 node2
hblun /dev/sdk2 node1 node3
hblun /dev/sdk3 node2 node3
# and if we're attaching to multiple SAN boxes, we could add 
hblun /dev/sdp1 node1 node2
hblun /dev/sdp2 node1 node3
hblun /dev/sdp3 node2 node3
# and supervise all storage access paths this way
# or if using multipath
hblun /dev/md3 node1 node2
# or something

This way we get a circular structure like with serial. 
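
With one partition per pair of nodes, n nodes need n*(n-1)/2 partitions per
storage path. A small sketch that generates the lines for every pair; the
function name hblun_print_pairs() and the "prefix plus partition number"
device naming are made up for the example.

```c
#include <stdio.h>

/*
 * Write one "hblun <dev> <a> <b>" line per pair of nodes into `out`.
 * Partitions are numbered from 1 in the order the pairs are generated.
 * Returns the number of lines written, or -1 if `out` is too small.
 */
static int
hblun_print_pairs(char *out, size_t outlen, const char *devprefix,
    const char *const nodes[], int n)
{
	int i, j, part = 1;
	size_t used = 0;

	if (outlen == 0)
		return -1;
	out[0] = '\0';

	for (i = 0; i < n; i++) {
		for (j = i + 1; j < n; j++, part++) {
			int w = snprintf(out + used, outlen - used,
			    "hblun %s%d %s %s\n",
			    devprefix, part, nodes[i], nodes[j]);

			if (w < 0 || (size_t)w >= outlen - used)
				return -1;	/* line did not fit */
			used += (size_t)w;
		}
	}
	return part - 1;
}
```

For node1, node2 and node3 with the prefix /dev/sdk this gives the three
sdk lines shown above.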

In an environment with high loads on the storage cables we may need to tune
the heartbeat rate down, so we may also need a specific keepalive parameter
for hblun. How can that be done? I guess the master invokes all of the
plugins every keepalive seconds irrespective of the media, right? 

Comments please... 

I think I understood serial.c and ucast.c well enough to start actual
coding as soon as the architecture is agreed upon. 

There was a young poet named Dan,
Whose poetry never would scan.
	When told this was so,
	He said, "Yes, I know.
