[Linux-HA] The Heartbeat of learning

Digimer lists at alteeve.ca
Wed Oct 16 08:44:51 MDT 2013


On 15/10/13 22:36, 邢立明 wrote:
> Hello dear Heartbeat team:
> 
> Thank you very much for your reply. I still have the following two
> questions:
> 
> 1. When the heartbeat link is disconnected, how can I get the event
>    that Heartbeat triggers?
> 2. When the heartbeat link is disconnected, how do I make sure only
>    one machine provides the service?

Corosync uses the totem protocol for "heartbeat"-like monitoring of the
other nodes' health. A token is passed around to each node; the node
does some work (like acknowledging old messages and sending new ones) and
then passes the token on to the next node. This goes around and around
all the time. Should a node not pass the token on within a short timeout
period, the token is declared lost, an error count goes up and a new
token is sent. If too many tokens are lost in a row, the node is
declared lost/dead.
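
As a rough illustration of the "too many lost in a row" counting
described above, here is a toy sketch in Python. It is not corosync's
actual code; the function name and the threshold of 4 are made up for
the example.

```python
def node_is_lost(token_results, max_consecutive_losses=4):
    """Toy model: decide whether a node should be declared lost.

    token_results is a sequence of booleans, one per token rotation:
    True if the node passed the token on in time, False if it timed out.
    max_consecutive_losses is a hypothetical threshold, not a real
    corosync default.
    """
    consecutive = 0
    for passed in token_results:
        if passed:
            # A successful pass resets the error count.
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= max_consecutive_losses:
                return True
    return False
```

With this model, occasional lost tokens are tolerated; only an
uninterrupted run of losses gets the node declared dead.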

Once the node is declared lost, the remaining nodes reform a new
cluster. If enough nodes are left to form quorum (simple majority), then
the new cluster will continue to provide services. In two-node clusters,
quorum is disabled so each node can work on its own.
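
The simple-majority rule can be written down in a couple of lines. This
is only a sketch of the arithmetic, with a made-up function name, not
how corosync implements it:

```python
def has_quorum(surviving_nodes, total_nodes):
    """Simple majority: strictly more than half of all nodes must remain."""
    return surviving_nodes > total_nodes // 2
```

For example, 3 of 5 nodes have quorum and 2 of 5 do not. With 2 nodes
in total, a lone survivor (1 of 2) can never have a majority, which is
why quorum has to be disabled in two-node clusters.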

Corosync itself only cares about cluster membership, message passing and
quorum (as of corosync v2+). What happens after the cluster reforms is
up to the cluster resource manager. In this case, that would be pacemaker.

When pacemaker is told that membership has changed because a node died,
it looks to see what services might have been lost. Once it knows what
was lost, it looks at the rules it's been given and decides what to do.

Generally, the first thing it does is "stonith" the lost node. This is a
process where the lost node is powered off, called power fencing, or cut
off from the network/storage, called fabric fencing. In either case, the
idea is to make sure that the lost node is in a known state. If this is
skipped, the node could recover later and try to provide cluster
services, not having realized that it was removed from the cluster. This
could cause problems ranging from confused switches to corrupted data.

In two-node clusters, there is also a chance of a "split-brain". Because
quorum has to be disabled, it is possible for both nodes to think the
other node is dead and both try to provide the same cluster services. By
using stonith, after the nodes break from one another (which could
happen with a network failure, for example), neither node will offer
services until one of them has stonith'ed the other. The faster node
will win and the slower node will shut down (or be isolated). The
survivor can then run services safely without risking a split-brain.
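
The "faster node wins" race can be pictured with a toy model like the
one below. The node names and timings are invented, and real fencing
goes through fence agents rather than a function like this; it only
shows the outcome of the race.

```python
def fence_race(fence_times):
    """Toy model of a stonith race after the nodes lose contact.

    fence_times maps each node name to the (hypothetical) number of
    seconds it needs to complete its fence call against its peer.
    The node that finishes first survives; the others are fenced.
    Ties are broken by dict insertion order, purely for the example.
    """
    survivor = min(fence_times, key=fence_times.get)
    fenced = sorted(node for node in fence_times if node != survivor)
    return survivor, fenced


# Example: node1 gets its fence call through first, so node2 is shut off.
survivor, fenced = fence_race({"node1": 1.2, "node2": 1.9})
```

Only after this race is settled does the survivor start offering
services.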

Once the dead node has been stonithed, pacemaker then decides what to do
with the lost services. Generally, this means "restart the service here
that had been running on the dead node". The details of this, though,
are decided by you when you configure the resources in pacemaker.
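
To make the ordering explicit (fence first, recover second), here is a
small sketch. The function, resource names and the "move everything to
the first survivor" rule are all assumptions for illustration; in
pacemaker the real decision comes from the constraints and rules you
configure.

```python
def plan_recovery(placement, dead_node, survivors, fence_confirmed):
    """Toy model: decide where to restart services lost with dead_node.

    placement maps resource name -> node it was running on.
    Nothing is recovered until the dead node is confirmed fenced.
    """
    if not fence_confirmed or not survivors:
        return {}
    target = survivors[0]  # hypothetical rule: pick the first survivor
    return {
        resource: target
        for resource, node in placement.items()
        if node == dead_node
    }


placement = {"virtual_ip": "node2", "webserver": "node2", "db": "node1"}
plan = plan_recovery(placement, "node2", ["node1"], fence_confirmed=True)
```

Here `plan` moves `virtual_ip` and `webserver` to `node1` and leaves
`db` alone, since it was never lost. With `fence_confirmed=False` the
plan stays empty.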

Hope this helps! It's pretty high-level and simplifies a few things, but
hopefully it helps you understand the mechanics. :)

digimer

PS - Please reply to the mailing list. Discussions like this can help
others by being public and stored in archives.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

