odr draftv2

Jerome Etienne jetienne@arobas.net
Mon, 3 Apr 2000 04:18:21 -0400


this is the second draft of odr, please comment.

the differences with v1 are:
	- ability to have several writers at the same time.
	- ability to read/write the block device during a resync phase.
	- bugs in log fixed based on Stephen Tweedie's suggestions.
	- strip all the fancy mechanisms to acknowledge blocks.
	- the scope of odr is more clearly defined. (no quorum/lifemonitor)

ps: sorry for the delay


rough draft (v2) about a disk mirrored via network:
----------------------------------------------------
General:
- project name: online disk replicator (aka odr)
- Aim: odr allows several nodes to share (read/write) a block device.
  if the fs does the locking, several nodes can mount the device at the
  same time. if nodes fail, the remaining ones try not to interrupt the
  service.
- this doc is about the protocol part, not the integration in the linux
  kernel. i'm a protocol guy and need more work to fully understand
  the possible issues of a kernel implementation.
- the main loop will be close to the following steps:
	1. build: copy the whole disk to the mirrors' disks.
	2. init: switch on all the nodes and start the life monitor.
	3. run: the writers write on the disk and the mirrors mirror.
	4. fail: the life monitor detects the writer's death and triggers
	   the recovery.
	5. recovery: sync it and go to 'run'
----------------------------------------------------------------
Requirement:
- it should be as flexible as possible because the features needed in odr
  vary a lot from one application to another.
- possible to have several mirrors and several writers.
- possible to mount the device read/write on several nodes, assuming the
  upper layer (i.e. fs) does the proper locking.
- the main design criterion is safety. a fs corruption MUST be avoided
  when possible. efficiency comes after (even if the flexibility of the
  options may allow changing that order).
- must not assume that packets are received in the order they were sent.
  this is required to mirror across the internet.
----------------------------------------------------------------
Limitations:
- in case of several writers, the fs coherency can't be assured at the
  block device level only. the required lock mechanism is assumed to be
  handled by another protocol and is out of the scope of this document.
- in 'odr shares a block device across a network', network mostly means
  LAN. the protocol sends routable ip packets, so mirroring across the
  internet is fully supported but not very practical.
----------------------------------------------------------------
Terminology:
- writer: a sender of data. possibly several in the odr system.
  a writer can be insync or outofsync. a newly elected writer is said to
  be outofsync. it becomes insync as soon as its local image of the block
  device is synchronized with the 'common image'.
- mirror: the receiver of the data. one or more in the odr system.
  nMirror is the number of mirrors available.
- pool: the set of the writers and the mirrors
- node: a member of the pool. it can be either a writer or a mirror.
- resyncher: a node currently in the resynch phase
- takeover: when the writer dies and a former mirror becomes a writer.
- life monitor: the software/protocol which monitors the nodes' life in
  the pool. if the writer dies, a takeover is triggered.
- local disk: the disk physically in the node box
- network disk: the kind of virtual disk shared through the network
- nodeid: a unique identifier for each node. currently it is the ip address.
  advantage: it is easy to gather (the source ip address of each packet).
----------------------------------------------------------------
How to build a mirror from scratch:
- the mirror disk is initialized from scratch: copy the whole block device
  by requesting each and every block from an insync disk.
- if the source disk is kept busy by new operations during the copy, the
  copy can virtually take forever (e.g. the writer renews the blocks faster
  than the mirror copies them). solution: slow down the writer with a
  source quench.
- copying the whole disk can take long. it should be avoided when possible.
  sometimes it is required, e.g. the crash lasted long and the resync
  requires write operations which are no longer in the log. moreover, as
  we aren't aware of the fs layout, even the unused blocks will be
  requested.
  possible 'outofband' solution: a special script for this. freeze all
  writers, then build and exchange a .gz of the device. it will be more
  efficient because of the compression (especially in case of an 'almost
  empty' disk).
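
to give an idea of why the 'outofband' .gz exchange pays off on an almost
empty disk, here is a small python sketch. the 4k block size, the toy
device size and the helper name make_image are assumptions made for the
illustration, they are not part of odr.

```python
import gzip

BLOCK_SIZE = 4096   # assumed block size, the draft does not fix one
NUM_BLOCKS = 1024   # toy device of 4 MiB


def make_image(used_blocks):
    """Build an in-memory device image where only `used_blocks` hold data."""
    image = bytearray(BLOCK_SIZE * NUM_BLOCKS)  # zero filled = unused blocks
    for blknb in used_blocks:
        start = blknb * BLOCK_SIZE
        # fill the block with a non-zero byte so it differs from unused ones
        image[start:start + BLOCK_SIZE] = bytes([blknb % 251 + 1]) * BLOCK_SIZE
    return bytes(image)


# writers are assumed to be frozen while the image is taken
image = make_image(used_blocks=[0, 1, 2, 500])
packed = gzip.compress(image)          # what would be exchanged
restored = gzip.decompress(packed)     # what the new mirror writes to disk

print(len(image), "bytes raw,", len(packed), "bytes compressed")
```

on a real device the gain depends on how much of the disk holds data that
the compressor can't shrink.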
-----------------------------------
way to log the write operations:
- a log record is a nodeid (32 bits for ipv4), a wseq (64 bits, wrap around
  handled via 'tcp sequence number arithmetic'), a blknb (32 bits) and a
  log state (1 bit: intend/committed).
- a single log file in which the log records are recorded in the order
  the blocks are written on the local drive. the current wseq of each
  writer is periodically written/flushed in the log file.
- IMPORTANT: the log records may be in a different order on each host.
- the flush of all the wseq is used in case of resynch: the resyncher
  sends a request describing the current wseq of every writer, the source
  finds the matching position in the log (all the missing records are
  above this position) and rolls the log from there. if the resyncher
  receives a wseq operation it has already done, it just discards it. this
  may happen because the log order isn't the same on each host.
- the record is written in the log in an 'intend' state before all the
  confirmations (from network and local disk) are received. once all
  confirmations are received, the log record is updated to a 'committed'
  state.
- the log is a 'normal' file stored on a fs, so to avoid a recursive
  loop, it must not be stored on the odr device.
- the log doesn't store the block data itself, only metadata about the
  block, so a block may have been rewritten since. e.g. the log holds
  (wseq:1, blknb:40) and (wseq:2, blknb:40). the block data written for
  wseq:1 is gone: when you replay the log, wseq:1 will give you the block
  written in wseq:2.
  a direct consequence: if you replay the log, you must replay it all,
  you can't stop in the middle of the log and mount the disk.
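
a minimal python sketch of the log record described above, and of a wseq
comparison which survives the wrap around. the field order, the network
byte order and the names (RECORD, pack_record, wseq_before) are
assumptions made for the example. the state is stored in a whole byte
here although 1 bit is enough.

```python
import ipaddress
import struct

WSEQ_MOD = 1 << 64    # wseq is a 64 bit counter
WSEQ_HALF = 1 << 63   # half of the sequence space

INTEND, COMMITTED = 0, 1

# nodeid (32 bits), wseq (64 bits), blknb (32 bits), state (1 byte here)
RECORD = struct.Struct("!IQIB")


def pack_record(nodeid, wseq, blknb, state):
    """Serialize one log record (17 bytes with this assumed layout)."""
    return RECORD.pack(nodeid, wseq % WSEQ_MOD, blknb, state)


def unpack_record(raw):
    """Return (nodeid, wseq, blknb, state) from a serialized record."""
    return RECORD.unpack(raw)


def wseq_before(a, b):
    """True if wseq a is strictly older than wseq b, modulo 2**64.

    same idea as the tcp sequence number comparison: b is newer than a
    when the forward distance from a to b is less than half the space.
    """
    diff = (b - a) % WSEQ_MOD
    return 0 < diff < WSEQ_HALF


# the nodeid is the ipv4 address of the writer seen as a 32 bit integer
nodeid = int(ipaddress.IPv4Address("192.0.2.1"))
raw = pack_record(nodeid, 1, 40, INTEND)
print(len(raw), unpack_record(raw))
print(wseq_before(WSEQ_MOD - 1, 0))   # the counter just wrapped
```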
-----------------------------------
how to resync after being off:
- the resyncher reads its own log file to know the last wseq of each
  writer. it chooses a peer among the pool (WORK: how ? see read from
  network) and sends it the list of all the wseq. the amount of blocks
  to send is likely to be large.
- the resynchee is the node which provides the blocks to the resyncher.
- the resynchee finds its oldest log record which matches the request sent
  by the resyncher. it is important to get the oldest one because hosts
  may have log records in different orders. some blocks might be
  transmitted even if the resyncher already got them, but it will discard
  them because of their older wseq. it is believed that there will be only
  a few 'out of order' blocks. WORK: duplicated with 'way to log'
- the resynchee reads a given number of blocks from the log and sends them
  to the resyncher.
- the resyncher is in charge of retransmitting the request if there is a
  packet loss. the lost packet may be a request or one of the replied
  blocks.
- during the resynch, the resyncher doesn't listen to current write
  operations.
- if a resync request occurs for a wseq too old to be in the log, there
  are 2 possible actions: copy the whole block device or just declare it
  dead. in any case, we want to avoid it, so tune the log size accordingly.
  WORK: explain the criteria to choose a good log size
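
the two sides of the resync can be sketched as follows in python. this is
only an illustration under assumptions: wseq starts at 1 (0 means 'nothing
received yet'), the wrap around is ignored, the records of a given writer
appear in increasing wseq order in a log, and the names find_replay_start
and apply_replay are invented for the example.

```python
def find_replay_start(log, known_wseq):
    """Resynchee side: index of the oldest record the resyncher misses.

    log is a list of (nodeid, wseq, blknb) in local log order.
    known_wseq maps nodeid -> last wseq held by the resyncher.
    Returns None when the resyncher is already up to date.
    """
    for index, (nodeid, wseq, _blknb) in enumerate(log):
        if wseq > known_wseq.get(nodeid, 0):
            return index
    return None


def apply_replay(records, known_wseq):
    """Resyncher side: keep the new records, discard the ones already done."""
    accepted = []
    for nodeid, wseq, blknb in records:
        if wseq <= known_wseq.get(nodeid, 0):
            continue  # older wseq: transmitted only because log orders differ
        known_wseq[nodeid] = wseq
        accepted.append((nodeid, wseq, blknb))
    return accepted


# log of the resynchee, two writers A and B interleaved
log = [("A", 1, 10), ("B", 1, 20), ("A", 2, 11), ("B", 2, 21), ("A", 3, 12)]

# the resyncher already has A up to wseq 1 and B up to wseq 2
known = {"A": 1, "B": 2}
start = find_replay_start(log, known)
accepted = apply_replay(log[start:], known)
print(start, accepted)
```

in this run the record ("B", 2, 21) is sent again and discarded by the
resyncher, which is the 'out of order' case mentioned above.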
-----------------------------------
How to write a block on the block device:
- the higher layer is assumed to take the proper locks to ensure fs
  coherency.
- the writer sends the block on the network and waits for an ack from all
  nodes. the block request is completed when all the acks have been
  received. if, after a delay, at least one ack is missing, the block is
  retransmitted.
- the block is written on the local disk after sending it to the network
  to reduce the latency. the local write is likely to be completed before
  the writer receives all the acks.
- a minimal delay is enforced before completing the write, even if all
  the acks are received. this delay is used to slow down the writer when
  needed. the default value is 0.
- an ack packet is sent back in unicast.
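
the ack bookkeeping of one pending write could look like the python sketch
below. the class name PendingWrite, its methods and the explicit 'now'
argument (used instead of a real clock to keep the example deterministic)
are assumptions, the draft doesn't define them.

```python
class PendingWrite:
    """One block sent on the network and waiting for the ack of every node."""

    def __init__(self, wseq, blknb, nodes, now, timeout):
        self.wseq = wseq
        self.blknb = blknb
        self.missing = set(nodes)       # nodes which haven't acked yet
        self.timeout = timeout
        self.deadline = now + timeout   # when to retransmit
        self.transmissions = 1

    def on_ack(self, nodeid):
        """Record an ack, return True once the write is completed."""
        self.missing.discard(nodeid)
        return not self.missing

    def needs_rexmit(self, now):
        """True if at least one ack is missing after the delay.

        The caller then resends the block and the timer is rearmed.
        """
        if self.missing and now >= self.deadline:
            self.deadline = now + self.timeout
            self.transmissions += 1
            return True
        return False


pw = PendingWrite(wseq=1, blknb=40, nodes={"m1", "m2"}, now=0.0, timeout=1.0)
print(pw.on_ack("m1"))         # m2 still missing
print(pw.needs_rexmit(1.0))    # delay elapsed, retransmit
print(pw.on_ack("m2"))         # all acks received
```

several PendingWrite objects can exist at the same time, one per block in
flight.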
-----------------------------------
How to read a block from the block device:
- the higher layer is assumed to take the proper locks to ensure fs
  coherency.
- if the disk is insync, the block is read from the local drive.
- if the local disk is outofsync, the read is made on the network without
  local cache, so slower but still possible (ha spirit).
  even a diskless station can use the network disk, doing reads and writes
  through the network. this feature is used mainly during the disk resync.
- if the disk is outofsync, the reader sends a request (WORK: to whom ?)
  with the blknb and waits for the result, i.e. block or error code.
  if, after a timeout, no result has been received, it resends the request.
  if the destination of the request is no longer alive, it chooses another
  one.
- NOTE: reading a block here is different from getting a block from the
  network to resynch. here the block is requested by blknb, to resynch
  the blocks are requested by wseq.
- WORK: to whom to send the request ? obviously not to everybody (answer
  storm). if i know the source, request it from the source, else pick
  a node at random ? an anycast mechanism ?
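
a python sketch of the read with timeout, resend and change of destination.
the choice of the destination is still an open WORK item, so the ordered
list of peers, the limit of 3 tries before giving up on a peer and the
'request' callable (which returns None on a timeout) are all assumptions
made for the example.

```python
def network_read(blknb, peers, request, max_tries_per_peer=3):
    """Read a block by blknb from the network.

    request(peer, blknb) sends one request and returns the block data,
    or None if nothing came back before the timeout.
    A peer which stays silent max_tries_per_peer times is assumed to be
    no longer alive and the next one is tried.
    """
    for peer in peers:
        for _ in range(max_tries_per_peer):
            reply = request(peer, blknb)
            if reply is not None:
                return peer, reply
    raise OSError("no node answered for block %d" % blknb)


# fake transport: peer "a" is dead, peer "b" loses the first request
attempts = []


def fake_request(peer, blknb):
    attempts.append(peer)
    if peer == "b" and attempts.count("b") == 2:
        return b"\x00" * 4
    return None


peer, data = network_read(40, ["a", "b"], fake_request)
print(peer, data, attempts)
```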
-----------------------------------
How to receive a block from the network:
- each node has a list of the current wseq of each writer.
- if the received wseq is <= the current wseq for this nodeid, the block
  is already there, simply discard it (it is a rexmit for another mirror
  which hasn't acknowledged).
- the ack is sent when the block has been successfully written on the disk.
  the ack must not be sent as soon as the block is received by the mirror.
  if the mirror does so, the block write may still fail (disk error or
  crash of the mirror before the write is completed) and the writer won't
  be aware of it.
- WORK: if the wseq isn't exactly the next one, what to do ? discard ?
  queue it for later use ?
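
the receive rules above, as a python sketch. the Receiver class, the
write_block callable and the returned strings are invented for the
example. the open WORK question isn't answered here: the sketch simply
accepts any wseq newer than the current one, the wrap around is ignored
and wseq is assumed to start at 1.

```python
class Receiver:
    """Mirror side handling of the blocks received from the writers."""

    def __init__(self, write_block):
        self.write_block = write_block   # write_block(blknb, data)
        self.current_wseq = {}           # nodeid -> last wseq written

    def on_block(self, nodeid, wseq, blknb, data):
        if wseq <= self.current_wseq.get(nodeid, 0):
            return "discard"   # already there, rexmit for another mirror
        try:
            self.write_block(blknb, data)
        except OSError:
            return "no-ack"    # the write failed, the writer must know
        # the ack is decided only after the write succeeded
        self.current_wseq[nodeid] = wseq
        return "ack"


disk = {}
rx = Receiver(disk.__setitem__)
first = rx.on_block("w1", 1, 40, b"data")
again = rx.on_block("w1", 1, 40, b"data")
print(first, again)


def broken_disk(blknb, data):
    raise OSError("disk error")


bad = Receiver(broken_disk)
print(bad.on_block("w1", 1, 40, b"data"))
```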
-----------------------------------
How to slow down a writer:
- in some cases, it is useful to slow down a writer. for example: one
  node tries to resync and the writer renews the blocks faster than the
  resyncher is able to copy them.
- each writer has a minimal delay to perform a write operation. this
  delay is 0 by default. each time a writer receives a source quench,
  it increases this delay and sets a timer. if this timer expires, the
  delay is decreased.
- the source quench packet is sent in unicast to the writer which
  triggered it.
- WORK: how is the delay increased/decreased ? how long is the timer ?
  which condition triggers a source quench ? maybe use a better
  metric to slow down (more global, not per block...). to be tuned
  by experience.
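
one possible shape for the delay handling, as a python sketch. the policy
is exactly what the WORK item leaves open, so everything here is an
assumption: the delay grows by a fixed step per source quench, shrinks by
the same step each time the timer expires, and the 10ms step and 1s timer
are arbitrary values.

```python
class WriteThrottle:
    """Minimal delay applied by a writer to each write operation."""

    def __init__(self, step_ms=10, hold_ms=1000):
        self.delay_ms = 0        # 0 by default
        self.step_ms = step_ms   # assumed increase/decrease step
        self.hold_ms = hold_ms   # assumed timer length
        self.expires_ms = None   # no timer armed

    def on_source_quench(self, now_ms):
        """Increase the delay and (re)arm the timer."""
        self.delay_ms += self.step_ms
        self.expires_ms = now_ms + self.hold_ms

    def tick(self, now_ms):
        """Decrease the delay if the timer expired, return the delay."""
        if self.expires_ms is not None and now_ms >= self.expires_ms:
            self.delay_ms = max(0, self.delay_ms - self.step_ms)
            # keep decreasing later on until the delay is back to 0
            self.expires_ms = now_ms + self.hold_ms if self.delay_ms else None
        return self.delay_ms


throttle = WriteThrottle()
throttle.on_source_quench(now_ms=0)
throttle.on_source_quench(now_ms=100)
history = [throttle.tick(t) for t in (500, 1100, 2100, 9999)]
print(history)
```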
------------------------------------------------------------
Transport:
- read/write operations require retransmitting packets after a timeout.
  how to tune the timeout ? an automatic round trip time estimation ala
  tcp ? a manual configuration ? automatic is better but harder.
- the usual ethernet mtu is 1500 bytes, so a bigger block will be
  fragmented across ethernet. packet fragmentation wastes bandwidth
  (especially in case of rexmit) and increases the overhead and the packet
  loss probability. it should be avoided when possible. set up a link
  with jumbo packets when possible.
- either plain IP or UDP. both are routable. UDP is easily handled from
  userspace (register a well known port). IP doesn't have the overhead of
  the udp headers (is it significant when most packets are around 4k?) but
  requires privileged rights to be accessed from userspace. if the packet
  is bigger than the MTU, do the fragmentation at the ip level.
- some communications will use multicast (e.g. writer to mirrors). the 
  multicast ip address will be INADDR_ODR_GROUP (register a nonroutable
  multicast address). in case of mirroring across the internet, the address
  will be configured by the administrator (out of scope of this document).
- IPSec is able to provide encryption, authentication and integrity.
  no security is integrated in the protocol itself.
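
for the 'automatic round trip time estimation ala tcp' option, here is a
python sketch of the smoothed estimator used by tcp (the constants 1/8,
1/4 and 4 are the ones from RFC 6298). the class name, the 0.2s floor and
the 60s ceiling are assumptions chosen for a LAN, odr doesn't specify
them.

```python
class RtoEstimator:
    """Retransmission timeout computed from measured round trip times."""

    ALPHA = 1 / 8   # gain of the smoothed rtt
    BETA = 1 / 4    # gain of the rtt variation
    K = 4

    def __init__(self, initial_rto=1.0, min_rto=0.2, max_rto=60.0):
        self.srtt = None      # smoothed round trip time
        self.rttvar = None    # round trip time variation
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.rto = initial_rto

    def sample(self, rtt):
        """Feed one rtt measurement (seconds), return the new timeout.

        Measurements taken on a retransmitted request should be skipped
        by the caller since the reply can't be matched to one send.
        """
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            delta = abs(self.srtt - rtt)
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * delta
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + self.K * self.rttvar
        self.rto = min(max(rto, self.min_rto), self.max_rto)
        return self.rto

    def backoff(self):
        """Double the timeout after a retransmission."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto


est = RtoEstimator()
r1 = est.sample(0.1)
r2 = est.sample(0.1)
r3 = est.backoff()
print(r1, r2, r3)
```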
----------------------------------------------------------------
Differences between draft v1 and v2:
- ability to have several writers at the same time.
- ability to read/write the block device even during a resync phase,
  suggested by Marcelo Tosatti.
- bugs in log fixed based on Stephen Tweedie's suggestions.
- strip all the fancy mechanisms to acknowledge blocks to keep only
  Wait_N_Ack, i.e. each mirror acknowledges each block. the performance
  isn't that bad because there are several pending writes at the same
  time. moreover it became too complex with several writers.
  (WORK: blabla write by chunk in ext3 WORK: blabla race between fs
  lock and possible packet loss)
- the scope of odr is more clearly defined. odr doesn't handle the quorum
  problem or the life monitor.
---------------------------------------------------------------
Credits: 
- idea of an online disk replicator comes from drbd made by Philipp Reisner.
- thanks to Stephen Tweedie and Marcelo Tosatti for their comments.
- design of the odr protocol by jerome etienne.