[Linux-HA] raw disk heartbeat
Lars Marowsky-Bree
lmb at suse.de
Sat Mar 12 07:14:11 MST 2005
On 2005-03-11T22:39:39, hm at seneca.muc.de wrote:
> Howdy folks,
>
> I'm interested in contributing the raw disk thing. IMHO this could serve as
> a general plugin for heartbeat in SCSI-like environments, like external
> SCSI RAID, SAN or iSCSI. As long as it's a block device, it'll work.
Ok, I saw this list mail after I answered your mail in private ;-) I
don't feel like translating those comments now, but I hope you'll find
them useful.
Some direct feedback to your mail here:
> Two nodes share a partition, e.g. /dev/sdk1. The "first" node uses the
> lower half, the "second" one uses the upper half. Who is first and second
> is determined by this line in ha.cf
It needs a meta-superblock at the beginning of the block device to tell
us how many node buckets there are and how big they are. Do not limit
this to two nodes.
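To make the meta-superblock idea concrete, here is a minimal sketch of
what such a header could look like. Every name in it (hbd_meta_sb,
HBD_MAGIC, the field names) is made up for illustration, and it ignores
on-disk endianness and padding, which a real format would have to pin
down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HBD_MAGIC 0x48424431u   /* hypothetical magic value */

/* Hypothetical meta-superblock stored at the start of the block device. */
struct hbd_meta_sb {
    uint32_t magic;            /* marks a formatted heartbeat device */
    uint32_t version;          /* layout version */
    uint32_t nbuckets;         /* how many node buckets there are */
    uint32_t bucket_size;      /* size of each bucket in bytes */
    uint64_t first_bucket_off; /* byte offset of bucket 0 */
};

/* Byte offset of bucket idx, or -1 if the superblock or index is invalid. */
static int64_t hbd_bucket_offset(const struct hbd_meta_sb *sb, uint32_t idx)
{
    assert(sb != NULL);
    if (sb->magic != HBD_MAGIC || idx >= sb->nbuckets)
        return -1;
    return (int64_t)(sb->first_bucket_off + (uint64_t)idx * sb->bucket_size);
}
```

With the bucket count and size stored on disk, nothing in ha.cf has to
say who is "first" or "second", and the layout is not tied to two nodes.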
> This line has to be identical on both nodes, so both know where to read
> and where to write. The size of the ring buffer is determined by the size
> of the partition itself. Bigger partition -> bigger buffer. Depends on
> the minimum partition size on any given hardware I guess. Normally, it will
> be 1 cylinder, like 4..8 Mbyte. In a SAN, a minimum LUN may be much larger
> in some cases.
>
> (Question: do we really need a ring buffer? We can lose messages over
> UDP or serial too.
Sure, but packet loss results in retransmission, and retransmission is
costly. If we can make retransmissions rare, this is better. If we have
a buffer which we can fill up with a bunch of messages in one go, and then
have the other nodes pick them up, that helps a LOT compared to just
having space for a single message (the other extreme).
> What makes the plugins check their input by the way?
> They are not interrupt driven as far as I can see.) (The source code is
> full of funny remarks but it explains little... )
You need to poll every so often with disks, of course, because we don't
get notified when the data on disk changes.
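A rough sketch of one polling step, assuming the writer keeps a 64-bit
cursor at a known offset (hb_poll_cursor and that layout are my
assumptions, not existing heartbeat code). The plugin would call this
once per poll interval. It does not deal with caching on the reading
node, which a real implementation would have to consider.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the writer's cursor stored at cursor_off
 * and compare it with the value seen on the previous poll.
 * Returns 1 if the cursor changed (and updates *last_seen),
 * 0 if nothing changed, -1 on I/O error.
 */
static int hb_poll_cursor(int fd, off_t cursor_off, uint64_t *last_seen)
{
    uint64_t cur;
    ssize_t n;

    assert(last_seen != NULL);
    n = pread(fd, &cur, sizeof cur, cursor_off);
    if (n != (ssize_t)sizeof cur) {
        if (n < 0)
            perror("pread");
        return -1;
    }
    if (cur == *last_seen)
        return 0;           /* nothing new since the last visit */
    *last_seen = cur;
    return 1;               /* new data: go and read the messages */
}
```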
> The on-disk data structure for an individual host will be like
>
> struct ringbuffer {
>     off_t readp;    /* read position */
>     off_t writep;   /* write position */
>     char buffer[];  /* size of partition / 2 */
> };
Almost. I'd give each node its own bucket superblock, containing the node
name / uuid, whether a writer task is currently active, the current
cursor it's writing at, maybe a merely informational timestamp of when
the last message was written, et cetera.
This would allow a reader task, which knows the node buffer is circular,
to pick up anything since its last "visit" of that buffer quickly.
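As an in-memory sketch of that per-node bucket (the names hb_bucket,
bucket_put and bucket_get, the 1-byte length prefix and the tiny buffer
size are all assumptions for illustration): the writer appends several
messages and advances its cursor, and a reader that remembers its own
cursor picks up everything written since its last visit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 64u                /* tiny, for illustration only */

struct hb_bucket {                  /* hypothetical per-node bucket */
    char     node_name[32];         /* node name / uuid */
    uint32_t writer_active;         /* is a writer task running? */
    uint64_t write_cursor;          /* total bytes written so far */
    uint64_t last_write_time;       /* informational only */
    unsigned char buf[BUF_SIZE];    /* circular message area */
};

static void ring_in(struct hb_bucket *b, uint64_t pos, const void *src, uint32_t len)
{
    const unsigned char *p = src;
    for (uint32_t i = 0; i < len; i++)
        b->buf[(pos + i) % BUF_SIZE] = p[i];
}

static void ring_out(const struct hb_bucket *b, uint64_t pos, void *dst, uint32_t len)
{
    unsigned char *p = dst;
    for (uint32_t i = 0; i < len; i++)
        p[i] = b->buf[(pos + i) % BUF_SIZE];
}

/* Writer: append one message as a 1-byte length followed by the payload. */
static int bucket_put(struct hb_bucket *b, const char *msg, uint8_t len)
{
    if (len == 0 || (uint32_t)len + 1 > BUF_SIZE)
        return -1;
    ring_in(b, b->write_cursor, &len, 1);
    ring_in(b, b->write_cursor + 1, msg, len);
    b->write_cursor += (uint64_t)len + 1;
    return 0;
}

/*
 * Reader: fetch the next message after *read_cursor.
 * Returns the message length, 0 if nothing is new, or -1 if messages
 * were lost (writer lapped the reader, or out is too small).
 */
static int bucket_get(const struct hb_bucket *b, uint64_t *read_cursor,
                      char *out, size_t outsz)
{
    uint8_t len;

    if (*read_cursor == b->write_cursor)
        return 0;
    if (b->write_cursor - *read_cursor > BUF_SIZE) {
        *read_cursor = b->write_cursor;     /* overrun: resynchronise */
        return -1;
    }
    ring_out(b, *read_cursor, &len, 1);
    *read_cursor += (uint64_t)len + 1;
    if (len >= outsz)
        return -1;                          /* skipped: buffer too small */
    ring_out(b, *read_cursor - len, out, len);
    out[len] = '\0';
    return len;
}
```

A lapped reader simply loses messages here, which is the case where
heartbeat would fall back to retransmission.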
> Does mmap make sense here? Don't think so - if the buffer is, say, 2
> megabytes then... Otherwise simple read() and write() will do. raw(8) makes
> no sense here because we don't want a sector-aligned character device.
O_SYNC is still needed, of course; write caching on the node would be
deadly.
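For illustration, opening the device with O_SYNC and writing at a fixed
offset could look roughly like this (hb_open_sync and hb_write_at are
hypothetical helper names). O_SYNC only covers the writing side; what
the reading node sees is a separate question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the shared device so that writes complete synchronously. */
static int hb_open_sync(const char *path)
{
    return open(path, O_RDWR | O_SYNC);
}

/*
 * Write len bytes at byte offset off, retrying short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is left set by pwrite).
 */
static int hb_write_at(int fd, off_t off, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}
```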
> hblun /dev/sdk1 node1 node2
> hblun /dev/sdk2 node1 node3
> hblun /dev/sdk3 node2 node3
Too complex to set up; the nodes ought to autoconfigure which bucket on
disk they use. And two-way communication like this doesn't scale at
all. Make it broadcast, because that's essentially what heartbeat
assumes as the underlying topology.
Please see my private mail about more thoughts on this.
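One way the autoconfiguration could work, sketched on an in-memory copy
of the bucket headers (bucket_hdr and find_or_claim_bucket are invented
names): a node looks for a bucket already carrying its name and
otherwise claims the first free one. This assumes unused headers are
zero-filled and ignores two nodes racing for the same free bucket,
which a real implementation would need to arbitrate.

```c
#include <assert.h>
#include <string.h>

#define NODE_NAME_MAX 32

struct bucket_hdr {                     /* hypothetical bucket header */
    char node_name[NODE_NAME_MAX];      /* empty string means "free" */
};

/*
 * Return the index of the bucket owned by node me. If none exists,
 * claim the first free bucket. Returns -1 if all buckets are taken.
 */
static int find_or_claim_bucket(struct bucket_hdr *hdrs, int n, const char *me)
{
    int free_idx = -1;

    assert(me[0] != '\0' && strlen(me) < NODE_NAME_MAX);
    for (int i = 0; i < n; i++) {
        if (strcmp(hdrs[i].node_name, me) == 0)
            return i;                   /* already ours */
        if (free_idx < 0 && hdrs[i].node_name[0] == '\0')
            free_idx = i;               /* remember first free slot */
    }
    if (free_idx >= 0) {
        strncpy(hdrs[free_idx].node_name, me, NODE_NAME_MAX - 1);
        hdrs[free_idx].node_name[NODE_NAME_MAX - 1] = '\0';
    }
    return free_idx;
}
```

Every node then writes only to its own bucket and reads all the others,
which gives the broadcast behaviour without any per-pair configuration.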
> This way we get a circular structure like with serial.
Bad.
> In an environment with high loads on the storage cables we may need to tune
> the heartbeat rate down, so we may also need a specific keepalive parameter
> for hblun. How can that be done? I guess the master invokes all of the
> plugins every keepalive seconds irrespective of the media, right?
For writing, yes. (Not only every heartbeat interval, but also whenever
it has something to send, of course.) But the readers send messages they
pick up to the MCP.
Sincerely,
Lars Marowsky-Brée <lmb at suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business