odr draftv2
Jerome Etienne
jetienne@arobas.net
Mon, 3 Apr 2000 04:18:21 -0400
this is the second draft of odr, please comment.
the differences from v1 are:
- ability to have several writers at the same time.
- ability to read/write the block device during a resync phase.
- bugs in log fixed based on Stephen Tweedie's suggestions.
- strip all the fancy mechanisms to acknowledge blocks.
- the scope of odr is more clearly defined. (no quorum/lifemonitor)
ps: sorry for the delay
rough draft (v2) about a disk mirrored via network:
----------------------------------------------------
General:
- project name: online disk replicator (aka odr)
- Aim: odr allows sharing a block device (read/write) among several nodes.
if the fs does the locking, several nodes can mount the device at the
same time. if nodes fail, the remaining ones try not to interrupt the service.
- this doc is about the protocol part, not the integration in the linux
kernel. i'm a protocol guy and need more work to fully understand
the possible issues of a kernel implementation.
- the main loop will be close to the following steps:
1. build: copy the whole disk to the mirrors' disks.
2. init: switch on all the nodes and start the life monitor.
3. run: the writers write on the disk and the mirrors mirror.
4. fail: the life monitor detects the master's death and triggers
the recovery.
5. recovery: resync, then go back to 'run'.
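
the five steps above can be read as a small state machine. a minimal sketch in python, assuming that 'run' is only left when a failure is detected. the names Phase, NEXT_PHASE and phases are made up for illustration and are not part of the draft.

```python
from enum import Enum


class Phase(Enum):
    BUILD = "build"
    INIT = "init"
    RUN = "run"
    FAIL = "fail"
    RECOVERY = "recovery"


# transitions taken from the five steps; 'run' is only left on a detected failure
NEXT_PHASE = {
    Phase.BUILD: Phase.INIT,
    Phase.INIT: Phase.RUN,
    Phase.RUN: Phase.FAIL,
    Phase.FAIL: Phase.RECOVERY,
    Phase.RECOVERY: Phase.RUN,
}


def phases(start, count):
    """Return the `count` phases visited when starting from `start`."""
    seen, phase = [], start
    for _ in range(count):
        seen.append(phase)
        phase = NEXT_PHASE[phase]
    return seen
```

after the first pass, the loop only cycles through run, fail and recovery. build and init are done once.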
----------------------------------------------------------------
Requirement:
- it should be as flexible as possible because the features needed in odr
vary a lot from one application to another.
- possible to have several mirrors and several writers.
- possible to mount the device read/write on several cpus, assuming the upper
layer (i.e. fs) does the proper locking.
- the main design criterion is safety. a fs corruption MUST be avoided when
possible. efficiency comes after (even if the flexibility of the options
may allow changing that order).
- must not assume that packets are received in the order they were sent. this
is required to mirror across the internet.
----------------------------------------------------------------
Limitations:
- in case of several writers, fs coherency can't be assured at the
block device level only. the required lock mechanism is assumed to be
handled by another protocol and is out of the scope of this document.
- in 'odr shares a block device across a network', network mostly means
LAN. the protocol sends routable ip packets, so mirroring across the
internet is fully supported but not very practical.
----------------------------------------------------------------
Terminology:
- writer: a sender of data. possibly several in the odr system.
a writer can be insync or outofsync. a newly elected master is said to be
outofsync. it becomes insync as soon as its local image of the block
device is synchronized with the 'common image'.
- mirror: the receiver of the data. one or more in the odr system.
nMirror is the number of mirrors available.
- pool: the set of the writers and the mirrors
- node: a member of the pool. it can be either a writer or a mirror.
- resyncher: a node currently in resynch phase
- takeover: when the writer dies and a former mirror becomes a writer.
- life monitor: the software/protocol which monitors the nodes' life in
the pool. if the writer dies, a takeover is triggered.
- local disk: the disk physically in the node box
- network disk: the kind of virtual disk shared through the network
- nodeid: a unique identifier for each node. currently it is the ip address.
advantage: it is easy to gather (the source ip address of each packet).
----------------------------------------------------------------
How to build a mirror from scratch:
- the mirror disk is initialized from scratch: copy the whole block device
by requesting each and every block from an insync disk.
- if the source disk is kept busy by new operations during the copy, the copy
can virtually take forever (e.g. the writer renews the blocks faster than
the mirror copies them). solution: slow down the master with a source quench.
- copying the whole disk can take long. it should be avoided when possible.
sometimes it is required, e.g. the crash lasted so long that the required
write operations are no longer in the log. moreover, as we aren't aware of
the fs layout, even the unused blocks will be requested.
possible 'outofband' solution: write a special script for this.
freezing all writers, then building and exchanging a .gz of the device,
will be more effective because of the compression (especially in case
of an 'almost empty' disk).
-----------------------------------
way to log the write operations:
- a log record is a nodeid (32 bits for ipv4), a wseq (write sequence
number, 64 bits, wrap around handled via 'tcp sequence number arithmetic'),
a blknb (32 bits) and a log state (1 bit: intend/committed).
- a single log file in which the log records are recorded in the order
the blocks are written on the local drive. the current wseq of all
the writers is periodically written/flushed in the log file.
- IMPORTANT: the log records may be in a different order on each
host.
- the flush of all the wseq is used in case of resynch: the resyncher
sends a request describing the current wseq of all the writers, the source
finds the matching position in the log (all the missing records are after
this position) and rolls the log from there. if the resyncher receives a
wseq operation already done, it just discards it. this may happen because
the log order isn't the same on each host.
- the block is recorded in the log in an 'intend' state before all
the confirmations (from network and local disk) arrive. once all
confirmations are received, the log record is updated to the 'committed'
state.
- the log is a 'normal' file stored on a fs, so to avoid a recursive
loop, it must not be stored on the odr device.
- the log doesn't store the block data itself, only metadata about the
block, so a block may have been rewritten since. e.g. log (wseq:1, blknb:40)
and (wseq:2, blknb:40): you can't get the block as it was at wseq:1. when
you replay the log, wseq:1 will give you the block written in wseq:2.
a direct consequence: if you replay the log, you must replay all of it.
you can't stop in the middle of the log and mount the disk.
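
a minimal sketch of such a log record in python, with the wrap-around comparison for wseq. the draft gives the field widths but no on-disk layout, so the field order, the network byte order, storing the 1-bit state in a whole byte, and all the names (RECORD, pack_record, unpack_record, wseq_before) are assumptions.

```python
import struct

# assumed layout: nodeid 32 bits, wseq 64 bits, blknb 32 bits,
# log state kept in one byte (the draft only needs 1 bit)
RECORD = struct.Struct("!IQIB")
INTEND, COMMITTED = 0, 1
WSEQ_MOD = 1 << 64


def pack_record(nodeid, wseq, blknb, state):
    """Serialize one log record (metadata only, no block data)."""
    return RECORD.pack(nodeid, wseq % WSEQ_MOD, blknb, state)


def unpack_record(raw):
    """Return (nodeid, wseq, blknb, state) from a packed record."""
    return RECORD.unpack(raw)


def wseq_before(a, b):
    """True if a precedes b, using tcp-style sequence number arithmetic."""
    # (a - b) modulo 2**64 lands in the upper half exactly when the
    # signed 64-bit difference is negative
    return (a - b) % WSEQ_MOD >= WSEQ_MOD // 2
```

with this comparison, a wseq that has just wrapped to 0 is still seen as newer than 2**64 - 1, as long as the two values are less than 2**63 apart.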
-----------------------------------
how to resync after being off:
- the resyncher reads its own log file to find the last wseq for
each writer. it chooses a peer in the pool (WORK: how ? see read from
network) and sends it the list of all the wseq. the number of blocks
to send is likely to be large.
- the resynchee is the node which provides the blocks to the resyncher.
- the resynchee finds its oldest log record which matches the request sent
by the resyncher. it is important to get the oldest one because hosts
may have log records in different orders. some blocks might be transmitted
even if the resyncher already got them, but it will discard them because
of their older wseq. it is believed that there will be only a few 'out of
order' blocks. WORK: duplicated with 'way to log'
- the resynchee reads a given number of blocks from the log and sends them to
the resyncher.
- the resyncher is in charge of retransmitting the request if there is a packet
loss. the lost packet may be the request or one of the replied blocks.
- during the resynch, the resyncher doesn't listen to current write
operations.
- if a read request occurs for a wseq too old to be in the log, there are 2
possible actions: copy the whole block device or just declare it dead.
in any case, we want to avoid it, so tune the log size accordingly.
WORK: explain the criteria to choose a good log size
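
a minimal sketch of both sides of the resynch in python. it assumes the log is a list of (nodeid, wseq, blknb) tuples, that the request is a dict mapping nodeid to the last wseq known by the resyncher, and that the wseq of a given writer only increase along the log. wrap-around and packet loss are ignored here. the names find_start, roll_log and apply_records are made up for illustration.

```python
def find_start(log, known):
    """Index of the oldest record the resyncher is missing, or None."""
    for pos, (nodeid, wseq, _blknb) in enumerate(log):
        if nodeid not in known or wseq > known[nodeid]:
            return pos
    return None


def roll_log(log, known, batch):
    """Resynchee side: up to `batch` records from the oldest missing one."""
    start = find_start(log, known)
    return [] if start is None else log[start:start + batch]


def apply_records(records, known):
    """Resyncher side: keep the new records, discard those already applied."""
    wanted = []
    for nodeid, wseq, blknb in records:
        if nodeid in known and wseq <= known[nodeid]:
            continue  # already done here: log order differs between hosts
        known[nodeid] = wseq
        wanted.append((nodeid, wseq, blknb))
    return wanted
```

because the resynchee rolls from its oldest missing record, some of the records it sends may already be known to the resyncher. apply_records drops those instead of fetching the block again.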
-----------------------------------
How to write a block on the block device:
- the higher layer is assumed to take the proper locks to assure fs coherency.
- the writer sends the block on the network and waits for an ack from all
nodes. the block request is completed when all the acks have been received.
if, after a delay, at least one ack is missing, the block is retransmitted.
- the block is written on the local disk after sending it to the network,
to reduce the latency. the local write is likely to be completed before
the writer receives all the acks.
- there is a delay to wait before completing the write, even if all the
acks are received. this delay is used to slow down the writer when needed.
the default value is 0.
- an ack packet is sent back in unicast.
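
a minimal sketch in python of the bookkeeping a writer could keep for one block in flight. the class name PendingWrite, the 0.2s retransmit delay and the explicit `now` argument (instead of a real clock) are assumptions. the local disk confirmation can be modelled by adding the writer's own nodeid to `nodes`.

```python
class PendingWrite:
    """Tracks one block write until every node has acknowledged it."""

    def __init__(self, wseq, blknb, nodes, now, rexmit_after=0.2, min_delay=0.0):
        self.wseq = wseq
        self.blknb = blknb
        self.missing = set(nodes)      # nodes that have not acked yet
        self.started = now
        self.last_sent = now
        self.rexmit_after = rexmit_after
        self.min_delay = min_delay     # slow-down delay, 0 by default

    def on_ack(self, nodeid):
        """Record an ack; duplicate or unknown acks are ignored."""
        self.missing.discard(nodeid)

    def needs_rexmit(self, now):
        """True if an ack is still missing after the retransmit delay."""
        return bool(self.missing) and now - self.last_sent >= self.rexmit_after

    def mark_resent(self, now):
        """Restart the retransmit delay after the block was sent again."""
        self.last_sent = now

    def is_complete(self, now):
        """Complete once all acks are in and the minimal delay has elapsed."""
        return not self.missing and now - self.started >= self.min_delay
```

several PendingWrite objects can be alive at the same time, one per wseq, which is what keeps the throughput acceptable while waiting for every ack.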
-----------------------------------
How to read a block from the block device:
- the higher layer is assumed to take the proper locks to assure fs coherency.
- if the disk is insync, the block is read from the local drive.
- if the local disk is outofsync, the read is made on the network without
local cache, so slower but still possible (ha spirit).
even a diskless station can use the network disk, doing reads and writes
through the network. this feature is used mainly during the disk resync.
- if the disk is outofsync, the reader sends a request (WORK: to whom ?)
with the blknb and waits for the result, i.e. block or error code.
if, after a timeout, no result has been received, it resends the request.
if the destination of the request is no longer alive, it chooses another one.
- NOTE: reading a block here is different from getting a block from the
network to resynch. here the block is requested by blknb, whereas to resynch
the blocks are requested by wseq.
- WORK: to whom to send the request ? obviously not to everybody (answer
storm). if i know the source, request it from the source, else pick
a node at random ? an anycast mechanism ?
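
a minimal sketch of this read path in python. how to pick the peer is still an open WORK item, so the sketch simply tries the peers in the order given. treating a peer as dead after max_tries timeouts is also an assumption, as are the names read_block, read_local and request.

```python
def read_block(blknb, insync, read_local, peers, request, max_tries=3):
    """Read one block locally if insync, else from the network."""
    if insync:
        return read_local(blknb)
    for peer in peers:
        for _ in range(max_tries):
            result = request(peer, blknb)   # None stands for a timeout
            if result is not None:
                return result
        # no answer after max_tries: consider this peer dead, try the next
    raise OSError("no peer answered for block %d" % blknb)
```

here `request` is any callable which sends the request to one peer and returns the block, or None on timeout. an error code from the peer would need its own return value.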
-----------------------------------
How to receive a block from the network:
- each node keeps a list of the current wseq for each writer.
- if the received wseq is <= the current wseq for this nodeid,
the block is already there, simply discard it (it is a rexmit for
another mirror which hasn't acknowledged).
- the ack is sent when the block has been successfully written on the disk.
the ack must not be sent as soon as the block is received by the mirror. if the
mirror does so, the block write may still fail (disk error or crash of the
mirror before the write is completed) and the writer won't be aware of it.
- WORK: if the wseq isn't exactly the next one, what to do ? discard ? queue it
for later use ?
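
a minimal sketch of the receive path in python. the open WORK item (a wseq which isn't exactly the next one) is not solved here: the sketch accepts any newer wseq, which is an assumption. wrap-around is ignored, and the names on_block, write_block and send_ack are made up for illustration.

```python
def on_block(current, nodeid, wseq, write_block, send_ack):
    """Handle one received block; returns 'duplicate', 'failed' or 'acked'."""
    if nodeid in current and wseq <= current[nodeid]:
        return "duplicate"      # rexmit meant for another mirror: discard
    if not write_block():
        return "failed"         # no ack, so the writer will retransmit
    current[nodeid] = wseq      # only advanced once the block is on disk
    send_ack(nodeid, wseq)
    return "acked"
```

the ack is sent after write_block reports success, never on reception, so a disk error or a crash before the write completes can't be mistaken for a good copy.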
-----------------------------------
How to slow down a writer:
- in some cases, it is useful to slow down a writer. for example: one
node tries to resync and the writer renews the blocks faster than the
resyncher is able to copy them.
- each writer has a minimal delay to perform a write operation. this
delay is 0 by default. each time a writer receives a source quench,
it increases this delay and sets a timer. if this timer expires, the
delay is decreased.
- the source quench packet is sent in unicast to the writer which triggered
it.
- WORK: how is the delay increased/decreased ? how long is the timer ?
which condition triggers a source quench ? maybe use a better
metric to slow down (more global, not per block...). would be tuned
by experience.
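
a minimal sketch of the delay handling in python. the draft leaves the increase/decrease rule and the timer length open, so the fixed step of 10ms, the 1s timer and the linear decrease are only placeholders. the names WriteDelay, on_quench and tick are made up for illustration.

```python
class WriteDelay:
    """Minimal per-write delay of one writer, driven by source quenches."""

    def __init__(self, step_ms=10, hold=1.0):
        self.delay_ms = 0          # 0 by default, as in the draft
        self.step_ms = step_ms     # assumed fixed step for increase/decrease
        self.hold = hold           # assumed timer length, in seconds
        self.expires = None        # no timer armed while the delay is 0

    def on_quench(self, now):
        """A source quench was received: slow down and (re)arm the timer."""
        self.delay_ms += self.step_ms
        self.expires = now + self.hold

    def tick(self, now):
        """Decrease the delay if the timer expired; return the current delay."""
        if self.expires is not None and now >= self.expires:
            self.delay_ms = max(0, self.delay_ms - self.step_ms)
            self.expires = now + self.hold if self.delay_ms > 0 else None
        return self.delay_ms
```

as long as quenches keep arriving faster than the timer expires, the delay only grows. once they stop, it decays back to 0 one step per timer period.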
------------------------------------------------------------
Transport:
- read/write operations require retransmitting packets after a timeout.
how to tune the timeout ? an automatic round trip time estimation a la tcp ?
a manual configuration ? automatic is better but harder.
- the usual ethernet mtu is 1500 bytes, so a bigger block will be fragmented
across ethernet. packet fragmentation wastes bandwidth (especially in case
of rexmit) and increases the overhead and the packet loss probability. it
should be avoided when possible. set up a link with jumbo packets when
possible.
- either plain IP or UDP. both are routable. UDP is easily handled from
userspace (register a well known port). IP doesn't have the overhead of the
udp headers (is it significant when most packets are around 4k ?) and requires
privileged rights to access from userspace. if the packet is bigger than
the MTU, do the fragmentation at the ip level.
- some communications will use multicast (e.g. writer to mirrors). the
multicast ip address will be INADDR_ODR_GROUP (register a nonroutable
multicast address). in case of mirroring across the internet, the address
will be configured by the administrator (out of scope of this document).
- IPSec is able to provide encryption, authentication and integrity.
no security is integrated in the protocol itself.
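
for the automatic option, a minimal sketch in python of a round trip time estimator a la tcp. it follows the usual tcp rule (smoothed rtt plus 4 times the variance, gains of 1/8 and 1/4, timeout doubled on expiry). the draft doesn't choose an algorithm, so using this one for odr is an assumption, and so are the bounds and the names RttEstimator, sample and on_timeout.

```python
class RttEstimator:
    """Retransmit timeout computed from measured round trip times."""

    def __init__(self, initial_rto=1.0, min_rto=0.2, max_rto=60.0):
        self.srtt = None        # smoothed round trip time
        self.rttvar = None      # smoothed deviation of the round trip time
        self.rto = initial_rto
        self.min_rto = min_rto  # assumed bounds, to be tuned
        self.max_rto = max_rto

    def sample(self, rtt):
        """Feed one measured rtt (not from a retransmitted packet)."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        rto = self.srtt + 4 * self.rttvar
        self.rto = min(self.max_rto, max(self.min_rto, rto))
        return self.rto

    def on_timeout(self):
        """Back off: double the timeout after an expiry."""
        self.rto = min(self.max_rto, self.rto * 2)
        return self.rto
```

a sample must come from a packet which was sent only once, otherwise it isn't known which transmission the ack answers.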
----------------------------------------------------------------
Differences between draft v1 and v2:
- ability to have several writers at the same time.
- ability to read/write the block device even during a resync phase,
suggested by Marcelo Tosatti.
- bugs in log fixed based on Stephen Tweedie's suggestions.
- strip all the fancy mechanisms to acknowledge blocks and keep only
Wait_N_Ack, i.e. each mirror acknowledges each block. the performance
isn't that bad because there are several pending writes at the same
time. moreover it became too complex with several writers.
(WORK: blabla write by chunk in ext3 WORK: blabla race between fs
lock and possible packet loss)
- the scope of odr is more clearly defined. odr doesn't handle the quorum
problem and the lifemonitor.
---------------------------------------------------------------
Credits:
- the idea of an online disk replicator comes from drbd, made by Philipp Reisner.
- thanks to Stephen Tweedie and Marcelo Tosatti for their comments.
- design of the odr protocol by jerome etienne.