[Linux-HA] Documentation of heartbeat protocol

Lars Ellenberg lars.ellenberg at linbit.com
Thu Oct 14 02:41:32 MDT 2010

On Wed, Oct 13, 2010 at 10:42:33PM +0200, Lars Marowsky-Bree wrote:
> On 2010-10-13T20:36:59, Steve Davies <davies147 at gmail.com> wrote:
> > In particular I am interested to understand the meaning of the various
> > sequence numbers and so forth, and their implications when hosts
> > fail-over, die, return to active status etc. Basically the sort of
> > thing you would find in a protocol RFC.
> > 
> > Many thanks for any pointers.
> The heartbeat protocol and the CCM are not extremely well documented.
> For corosync, there's the protocol specification for totem at least. But
> I guess most of the things would still be in code ;-)
> Regards,
>     Lars
> --
> Architect Storage/HA, OPS Engineering, Novell, Inc.


So if you are going to spend time improving the cluster communications
code, that time would be better spent understanding and improving
corosync.  There is enough work to do: (automatic recovery of) redundant
rings; membership when starting with all cluster comm down, to allow for
a two-node tiebreaker and stonith of the other node to make progress;
probably a few other interesting higher-level issues; and certainly a
few not-so-interesting janitor-level things.

Corosync (or at least the algorithms it implements) is much better
documented, or should we say: documented at all, beyond reading the code.

Even though I myself spent some time in the heartbeat ipc messaging and
cluster communication layer lately, I'd not recommend anyone _starting_
on this to do so. I sometimes have to, as that's part of being the
appointed maintainer of the heartbeat stuff.

If you have the choice, go understand corosync,
and the algorithms involved, and improve it.
Steve Dake would be the guy to ask for advice on corosync.
I'm sure he won't be opposed to someone helping out with corosync
maintenance and development.

If you happen to be somehow target-locked on heartbeat, tell us why,
and what you are trying to achieve, and we will figure something out.

If you are "just" homing in on cluster communications,
please go for corosync.

Why should I say so, even if I currently still advocate the use of
heartbeat/pacemaker over corosync/pacemaker in production setups?

Because heartbeat is legacy.  It works (most of the time), but actually
no one really knows anymore how exactly, or why, it works.

Corosync may not work as well as we would like it to in various
scenarios. But at least we know what exactly it tries to do, and why,
as the algorithms involved are documented.
And thus time spent improving corosync, identifying and overcoming its
limitations, would be time well spent.

Whereas time spent figuring out what exactly heartbeat does, and why it
may do it the way it does, by reverse engineering the code, is probably
not exactly wasted, but possibly close, sometimes, even though it may be
an interesting and educational experience.

: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
