Web page for Linux-HA

Michael Rowan mtr@cutaway.com
Mon, 19 Oct 1998 12:14:31 -0400


alanr@bell-labs.com wrote:

> Understood.  But I would propose a little different methodology.  I would
> propose that we use a tool like Jim Trocki's "Mon" to diagnose various
> cluster subsystems.  For example, "Can I ping my router/switch"?, and "Can I
> read from my disk?" are two questions that one can ask using a system like
> Mon.  I can imagine an HA system having 4 or 5 interfaces, and you should be
> able to test and diagnose each separately.  That is difficult to do if
> you're using something like a heartbeat which when it comes back on any of
> your networks tells you only that at least one is still working.
> 

Yea, I saw you talk about mon in the archives.  Remember,
you heartbeat over every available network, with a
non-intrusive scheme, and each heartbeat carries the same
node-based information with it.  As state migrates based on
network functionality and node health, the heartbeat packets
change to track these state transitions. 
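
A minimal Python sketch of the per-network bookkeeping this implies: if
heartbeats arrive on every network, the receiver can record the last arrival
per network and report each link separately, not just "at least one works".
The class name, the network labels, and the timeout are hypothetical and not
part of any existing Linux-HA code.

```python
from dataclasses import dataclass, field


@dataclass
class PeerLinks:
    """Last heartbeat arrival time from one peer, kept per network."""
    timeout: float                      # seconds of silence before a link counts as down (assumed value)
    last_seen: dict = field(default_factory=dict)

    def record(self, network, now):
        """Note that a heartbeat from the peer arrived on `network` at time `now`."""
        self.last_seen[network] = now

    def status(self, networks, now):
        """Return {network: True/False}, True if a heartbeat arrived within the timeout."""
        return {
            net: net in self.last_seen and now - self.last_seen[net] <= self.timeout
            for net in networks
        }


links = PeerLinks(timeout=3.0)
links.record("eth0", now=10.0)
links.record("eth1", now=5.0)
# eth0 is fresh, eth1 has gone quiet, ttyS0 was never heard from:
# {'eth0': True, 'eth1': False, 'ttyS0': False}
report = links.status(["eth0", "eth1", "ttyS0"], now=11.0)
```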

I don't know enough about mon to comment intelligently, but
I do know how intertwined resource issues can be, and how
you can't really totally separate the issues, although you
can have separate modules that track each issue
independently. 


> > The serial lines can not support this
> > because of speed and complexity in their connectivity (think
> > about connecting serial lines outside a 2 node cluster).
> > And one of the keys is that we must assess the
> > functionability (I love that word) of the LAN lines being
> > used by the clients.  It doesn't do the cluster much good if
> > the outside world can't connect to the black-box we call a
> > cluster.
> 
> The purpose of the serial port in the diagram is ONLY so that the cluster
> can communicate its internal status with itself, including diagnostic
> status.  For example, if two machines cannot communicate with each other,
> there are at least three possible failed components.  M1's ethernet, M2's
> ethernet, or the hub/switch that connects them.  If you use diagnostics like
> "I can ping my router/switch" you can gracefully take down the truly
> isolated node with both sides knowing and agreeing on it.

I understand what you were saying.  I am saying that serial
is only useful when it augments heartbeating on usable
networks. 
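
As a rough illustration of serial augmenting the LAN heartbeats, here is a
Python sketch that combines the two to tell a network failure from a node
failure.  The function name and the result labels are hypothetical, and a real
cluster manager would need far more than this one decision.

```python
def classify_peer(lan_alive, serial_alive):
    """Guess what has failed, given heartbeat status per LAN and on the serial line.

    lan_alive    -- dict of {network name: True if heartbeats are arriving}
    serial_alive -- True if heartbeats are arriving on the serial line
    """
    if lan_alive and all(lan_alive.values()):
        return "healthy"
    if any(lan_alive.values()):
        # Some client-facing networks work and some do not.
        return "partial network failure"
    if serial_alive:
        # The node still talks over serial, so the node itself is up.
        return "network failure, node up"
    # Silence everywhere: nothing distinguishes this from a dead node.
    return "node presumed down"
```

Without the serial line the last two cases collapse into one, which is the
node-versus-network ambiguity discussed in this thread.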

> 
> > However, when a network and, in particular, a NIC starts
> > behaving badly, it is important to know the true state of
> > the machines.  Often the only way to determine the
> > difference between a node failure and a network failure is
> > through serial line connections.  It also allows us to know
> > the state of the IP stack on these machines, assuming the
> > user didn't use SLIP for the serial connections.  And
> > knowing the difference is a big deal in clustering since you
> > can end up with (particularly depending on the disk hardware
> > being used) data divergence and or corruption.
> 
> Agreed.  (see above)
> 
> > On the messaging side, another key field that needs to be
> > added is version number, and some kind of simple packet
> > verification stamp.  As we move forward you will find that
> > the messages grow and change rapidly over versions, and the
> > size of the packet can only be determined through version
> > stamps.
> 
> Great observation.  That had somehow slipped my mind.
> 
> > Also, a star configuration isn't really optimal, in terms of
> > availability and scaling.  There are two other approaches
> > (which can also be mixed) that achieve these things a bit
> > better: the first is the neighbor scheme, the second a tree
> > scheme.
> 
> I assumed a ring topology.  I think you call it a neighbor topology.

Sorry -- I read too much into the name "ring
topology".  Usually a ring implies that A passes a packet
to B, which forwards it around the ring till it gets back to
A (or to A-1). With the neighbor scheme, D never sees a
packet from A except at key configuration state points, like
broadcasts on boot, intervening neighbor failure and the
like.
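
A small Python sketch of the neighbor idea: each node heartbeats only with the
nodes next to it in an agreed ordering, and picks new neighbors when one in
between fails.  The function name, the node names, and the wrap-around at the
ends of the list are assumptions made for illustration.

```python
def neighbors(nodes, me, alive):
    """Return (previous, next) live neighbors of `me` in the agreed node order.

    nodes -- every configured node, in cluster order
    alive -- set of nodes currently believed to be up
    """
    live = [n for n in nodes if n in alive or n == me]
    if len(live) < 2:
        return (None, None)          # nobody else is up
    i = live.index(me)
    # Wrap around at the ends of the list (an assumption of this sketch).
    return (live[i - 1], live[(i + 1) % len(live)])


cluster = ["A", "B", "C", "D", "E", "F"]
# With everything up, D exchanges heartbeats only with C and E.
normal = neighbors(cluster, "D", set(cluster))              # ('C', 'E')
# If C fails, D picks up B as its new neighbor.
after_failure = neighbors(cluster, "D", set(cluster) - {"C"})   # ('B', 'E')
```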

> >
> > On the files in /etc/ha.d/heartbeat issue, I think this is
> > probably a mistake if I understand it properly.  The state
> > machine that will evolve to track the cluster is incredibly
> > complex, and requires synchronized movement of the whole
> > cluster between states, particularly when you start throwing
> > resource migration into the mix.  To place some of that
> > dependency into the filesystem seems dangerous to me.
> > Clusters are difficult enough for knowledgable [sic] SI
> > people to get working, and to add a very simple thing that
> > can get tromped and create a horrid mess of the state
> > transitions (meaning, you need to understand some god awful
> > number of possibilities in the mis-configuration checks on
> > the state transitions) scares me.  I'm not easily scared ;-)
> 
> I'm not sure I follow your arguments here, but I only read index.html last
> night, and the others I wrote months ago.  I'll try and make the
> communication clearer (but might end up making it more confused).
> 
> I intend the configuration file in /etc/ha.d to only document the inter-node
> communication configuration, NOT the current state.

I read a few things into your web page document.  That for
each message type there was a related script to run the
first time that message type is sent -- I may have
mis-interpreted that.  If I didn't, you don't want that --
the message types you have are good in that they are
simple.  The state transitions are not simple, and trying
to have a direct relationship between message types and
states would be a mistake.

The second thing I took from this discussion was that the
order a cluster moves through states or message types was
dependent on what it found in the directory
(/etc/ha.d/whatever).  That was what I was arguing against.


> 
> > Also, doing live reconfig is a mess.  It's doable, but
> > talking about a simple hanewconfig message seems to
> > trivialize the process, and may mis-represent the complexity
> > of this problem at this stage in development.  Or not -- I
> > guess I am interested in what you and others say about it.
> 
> My thinking is that what I would *like* is one that doesn't have to be taken
> down for anything.  That's a great goal.  My guess is that at first it will
> have to be taken down and restarted for everything, then after some releases
> there will be a few types of reconfigs that it can survive, etc.  I may now
> understand the comment on the filesystem (maybe?).  Cluster reconfigurations
> ARE potentially very complex.  The big deal about them is that you HAVE to
> be able to at least take a node out of service (quiesce it), or you have no
> HA at all.  The right way to view the messages on that page is that except
> for ALIVE and DEAD messages, everything else is "thinking out loud".
> 
> Maybe I should prioritize my goals for "What we can do", "What we are trying
> to do right now", and "What we are thinking about doing next", and "What we
> are still thinking about".  Cluster reconfiguration is by any measure in
> this last category.  I put it into the document because I wanted to see if
> it (in any sense) could be thought of as "fitting in".

You got it -- reconfig is tough, and best approached in a
piecemeal fashion.  I wasn't afraid so much that you didn't
understand this, but that the presence of the reconfig
message along with the description might mislead folks about
it -- your message is needed, to be sure, but it's the tip of
the iceberg in this respect, and there are lots of things to
do aside from fire the new config off to all the other
nodes.  

I think we are in agreement here...

> 
> > I also get the feeling that this description ties together
> > too tightly the message types and the possible state
> > transitions within the cluster management software.  They
> > are separate and distinct, and need to be left so since the
> > state transitions will be a fairly hairy mess compared to
> > the message types -- more state within message types is what
> > will evolve if we build something that is going to work well
> > without needing to be completely rewritten.
> 
> My assumption is that a complete rewrite is inevitable.  However, I didn't
> want it to be needed for a year or so.  What I was attempting to describe
> was an infrastructure for administrative communication, and one application
> of it, the "alive/dead" heartbeat messages.  It is not my intent to limit the
> infrastructure as to message types, nor restrict message content.  I may
> have done that.  Suggestions on another message format are welcome.  I left
> the last field with completely undefined semantics for that reason.  Maybe I
> need three fields or five fields (potentially structured in practice) for
> growth.  But since they can be compound fields (with the right pack/unpack
> conventions), maybe only one more is needed.

There are definitely other fields needed, but your approach
to having a base message with extensions, much like X does
it, is the right way.  I think we know enough here to avoid
a massive rewrite anytime soon.  And, for the record, having
a rewrite in a year isn't all that great since getting this
whole thing right could take that long.  It has taken
dedicated teams who get paid to do this all day longer, and
at this point I don't know (although you or others might)
how much help there will be on this thing at both the design
phase as well as writing code.  I expect we will really have
an advantage in the testing phase (which is all the more
reason to design in some reasonable, if not great, debugging
facilities, BTW).  But on the other hand, we are dealing
with a different beast than most HA shops.  Most are running
on a mid-range platform that has a fairly short list of
hardware/software possibilities compared to the Intel
platform (or even, Alpha).  I am not using MS's wolfpack as
an example (but for the record, they have taken, I think,
almost 2 years *after* they got a working cluster from DEC).
-- I hesitated to mention that since I didn't want to start
a long diatribe on what the state of wolfpack is, where it
came from, and all that other nonsense.  Just using it as an
example of complexity.

> 
> > Cluster management is probably one of the most complex
> > things you can tackle.  I have *never* known anyone to be
> > able to solve this whole problem in less than 3x or 4x of
> > the time they professed on their most conservative
> > schedules.  And I have worked with and known some pretty
> > fucking good distributed systems folks ;-)
> 
> At this point, I'm trying to define an infrastructure which will make this a
> possible goal.  Although it no doubt appeared that I thought I knew what I
> was doing, I am not under that illusion :-)
> 
> > Anyway, there are some top layer comments.  Let's see what
> > develops in the way of opinion and discussion and move
> > forward.  Thanks for kick starting this Alan.
> 
> Thanks for your thoughtful comments.

Not a problem.  I like these kinds of discussions...

> 
> OK -- Let me say a couple of things in summary:
> 
> 1)  I'm trying to define a communications infrastructure that will support
>     cluster management, not actually trying to outline or define how that
>     scheme will work at this point.

Understood.  But a lot of this ties together, and we need to
discuss the things we need to do, and in some cases how we
do them, in order to make the right choices up front.  There
are a lot of dependencies, I just wanna at least voice some
of them I have played with.

> 
> 2)  I want to support the "keep alive" message infrastructure in the very
>     near term.  I do want to define these message semantics soon.

Fair enough, although you don't want to do that in a vacuum
with respect to the system that will be built around it.  


> 
> 3)  I'm especially open to suggestions on how to change/improve the basic
>     message structure and format with this goal in mind.  I want to:
>         - Keep to ASCII messages, and allow heterogeneous clusters
>         - Allow for future growth in "structured content"

Heterogeneous clusters don't require ASCII messages, but the
chore of debugging this system will benefit from ASCII
messages. 
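
To make the idea concrete, here is one possible shape for an ASCII message
that carries a version number and a simple verification stamp, sketched in
Python.  The field order, the "|" separator, and the use of CRC-32 are
assumptions made for illustration, not the format from the web page.

```python
import zlib

VERSION = 1  # bumped whenever the message layout changes (hypothetical value)


def pack(msg_type, node, fields):
    """Build one ASCII line: version|type|node|extra fields...|crc32."""
    body = "|".join([str(VERSION), msg_type, node, *fields])
    crc = zlib.crc32(body.encode("ascii"))
    return f"{body}|{crc:08x}\n"


def unpack(line):
    """Verify the stamp, then split the line back into its fields.

    Raises ValueError if the stamp is malformed or does not match the body.
    """
    body, _, crc_hex = line.rstrip("\n").rpartition("|")
    if int(crc_hex, 16) != zlib.crc32(body.encode("ascii")):
        raise ValueError("verification stamp does not match")
    version, msg_type, node, *fields = body.split("|")
    return {"version": int(version), "type": msg_type, "node": node, "fields": fields}


line = pack("ALIVE", "node1", ["eth0=up", "eth1=down"])
msg = unpack(line)
```

Because the version comes first, a receiver can decide how to read the rest of
the line before touching it, and trailing fields leave room for extensions.
Field values must not contain the separator in this sketch.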


> 
> 4)  I'm open to rewrites when they become necessary.  I want them to be
>     no more often than once a year or so.  I want to be able to isolate
>     the layers of the infrastructure from each other so that it doesn't
>     become a complete *system* rewrite each time.
> 
We have enough experience in this list to prevent massive
rewrites in the near term, as long as we don't barrel into
this with too little planning.

Who else on this list has cluster experience (and is willing
to be involved in design, at least, and possibly coding)? 
It isn't a requirement or anything, but understanding what
kind of knowledge centers are present or even lurking would
help out.

> 5)  I have a bias for small clusters since you get HA with 2 or 3 nodes,
>     and that above that you get increased capacity.  I suppose that we
>     should think about supporting a Beowulf-sized cluster, but that
>     seems a lot harder

In reality, a huge percentage of clusters out there are 2-4
node.  But having the benefit of not inventing this
from scratch, putting a system in place that does scale
seems an intelligent move.

> 6)  My personal communication style includes "thinking out loud", and
>     in this case that means "out loud in email and/or web pages".  If
>     I sound like I don't have any idea what I'm talking about, well,
>     I very well may not have any.  But I expect to learn quickly :-)
> 

Well, the same goes for me.  Anyone who can't live with
being wrong, get out now ;-)