[Linux-ha-dev] Feature: Fast node failure detection in heartbeat
Zou, Yixiong
yixiong.zou at intel.com
Tue Feb 8 13:03:27 MST 2005
Hi,
I am experimenting with a patch that implements fast node failure
detection in heartbeat, and the results are very promising. I am
going to post the patch to the mailing list soon. Big thanks to gshi
for all his help.
Below is a readme file in case you are wondering what this feature is about.
---------------------------------------------------------------------------
1. What is nodefail?
Nodefail is a sample utility that can be used to notify heartbeat about node
failures. Used correctly in combination with certain special hardware, it
MIGHT be able to reduce the node failure detection time and provide faster
failovers.
2. How fast can heartbeat detect node failures by itself?
We've done quite extensive testing of heartbeat clusters on different
platforms. The results are very consistent and do not vary much from platform
to platform. Heartbeat by itself can 'reliably' detect a node failure in around
400ms. Yes, you can set the deadtime in ha.cf to be as short as 200ms
(heartbeat has problems running if it is set lower than that), but the
results will not be consistent. You might get a 170ms failure detection in
one test and 450ms in the next, and if you repeat the test enough times,
the average is in the 350ms - 400ms range. You can try the FastFailover test in
CTS to verify this. Don't worry about the overhead that logging to the syslog
daemon over the network might add. My tests show that the overhead of CTS
is actually negligible, even down to the millisecond range.
Of course, 400ms is still plenty fast for lots of people. But if you are
planning on using heartbeat in some mission critical environment where every
millisecond counts, and you set the deadtime to 200ms, it would not be very
satisfying to know that most of the time it takes longer than that to detect
node failures. Another problem is the inconsistency. If you set the deadtime
to 200ms, the actual time that it takes to detect the node failure varies
significantly from case to case. In 2 out of 10 runs it detected the failure
within 220ms, but in the rest it took 300ms, 400ms, even 450ms. Thus it would
be nice if we could improve upon this situation somehow.
3. How does nodefail work?
Let's first examine how heartbeat normally detects a node failure.
Basically, heartbeat runs in a loop that periodically checks, for every node,
whether a 'heartbeat' message from that node was received within the 'deadtime'.
If not, the node is immediately marked as "dead" and the failover process
is triggered. So as long as the 'deadtime' has not expired, the node is
always considered alive, even though it could be dead already, and once the
'deadtime' has expired, the node is always considered dead, even though it
could still be alive.
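
As a rough illustration of the deadtime check described above, here is a
minimal Python sketch. The function name, the dictionary of last-seen
timestamps, and the node names are all made up for this example; this is not
heartbeat's actual internal code, only the idea of the comparison.

```python
# Hypothetical sketch of a deadtime check. All names here are illustrative
# and do not come from the heartbeat sources.

def check_nodes(last_seen_ms, now_ms, deadtime_ms):
    """Return the nodes whose last heartbeat is older than deadtime_ms."""
    dead = []
    for node, seen_ms in sorted(last_seen_ms.items()):
        # A node is declared dead only once the deadtime has expired,
        # regardless of whether it actually failed earlier.
        if now_ms - seen_ms > deadtime_ms:
            dead.append(node)
    return dead


last_seen = {"nodeA": 1000, "nodeB": 1150}
print(check_nodes(last_seen, now_ms=1190, deadtime_ms=200))
print(check_nodes(last_seen, now_ms=1210, deadtime_ms=200))
```

In the first call no node has been silent for more than 200ms, so nothing is
reported. In the second call only nodeA has exceeded the deadtime.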
However, IF we could be informed by someone or something after a node failure
occurs, theoretically we could then tell heartbeat to mark that node as dead
immediately without waiting for the lengthy 'deadtime'. This is exactly how
nodefail works. It signs on to heartbeat as a client and tells heartbeat that a
particular node is dead and that it should start the failover sequence. This
assumes, of course, that the network itself does not fail and that the time
from the node failure to the other node receiving the notification is less
than the 'deadtime' itself.
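
The condition above can be put as a small model: the failure is detected by
whichever comes first, the external notification or the normal deadtime
check. The sketch below is only that model, with a hypothetical function name
and illustrative numbers taken from the ranges mentioned in this readme.

```python
# Hypothetical model of the effective detection time. Not heartbeat code.

def detection_time_ms(deadtime_detect_ms, notify_latency_ms=None):
    """Time from node failure until the surviving node marks it dead.

    deadtime_detect_ms: when the normal deadtime check would fire.
    notify_latency_ms:  delay from failure to arrival of the external
                        notification, or None if it never arrives
                        (for example because the network failed).
    """
    if notify_latency_ms is None:
        return deadtime_detect_ms
    # The notification only helps if it arrives before the deadtime check.
    return min(deadtime_detect_ms, notify_latency_ms)


print(detection_time_ms(400, 30))    # notification wins
print(detection_time_ms(400, None))  # notification lost, deadtime fallback
print(detection_time_ms(400, 600))   # notification too slow to help
```

The point of the model is that a late or lost notification costs nothing
beyond the normal deadtime detection.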
4. What hardware platform can it support?
So the million dollar question is: what is that certain something that can
notify us about a node failure event? Unfortunately I do not have a
definite answer to that question. In theory, however, what is needed is a
platform that is capable of sending out a notification when a certain
event occurs, plus a configurable watchdog timer.
Here's how they tie together:
Assume we have a regular two node heartbeat cluster, node A and node
B. We configure heartbeat to tickle the watchdog timer on each
system. When the watchdog timer on node A is not tickled for a
pre-configured slice of time, instead of rebooting the box, the
watchdog timer logs an event to the (hardware or software) platform,
and the platform then sends out an event notification to a pre-configured
destination. Upon receiving this notification, a handler can be
triggered to inform heartbeat on node B about the node failure
event that just occurred on node A. Node B then marks node A as dead
and starts the failover sequence, which hopefully involves a STONITH of node
A as the first step. Although both a hardware platform and a software
platform will work, a hardware based platform can certainly provide
better availability than a software platform in this case.
One example of such a platform is Intel's Langley platform, with IPMI version
1.5 and the IPMI over LAN feature enabled. The Langley has a watchdog timer that
can be configured to log an event to the System Event Log in the BMC (Baseboard
Management Controller) when it has not been tickled for a certain amount of
time. The IPMI in the BMC is capable of sending out an SNMP v1 alert to a
pre-configured destination when a certain event is logged. In this case, we can
configure the BMC in node A to send the SNMP alert to node B. On node
B, we can have the snmptrapd daemon invoke the nodefail utility to inform the
heartbeat daemon about the node failure event.
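
To make the snmptrapd step more concrete, here is a hedged Python sketch of a
trap handler. It relies on the way snmptrapd feeds a handler program: the
sender's hostname on the first line of standard input, its address on the
second, and then one "OID value" pair per line. Everything else is an
assumption for illustration: the nodefail command line shown is hypothetical
(its arguments are not described in this readme), and mapping the trap sender
directly to a heartbeat node name is a simplification, since a BMC usually
has its own address.

```python
import io

# Hypothetical handler sketch. Only the stdin layout (hostname, address,
# varbinds) follows snmptrapd; the nodefail invocation is an assumption.

def parse_trap(stream):
    """Parse the text that snmptrapd passes to a trap handler on stdin."""
    lines = [line.rstrip("\n") for line in stream]
    if len(lines) < 2:
        raise ValueError("expected a hostname line and an address line")
    hostname, address = lines[0], lines[1]
    varbinds = {}
    for line in lines[2:]:
        if not line.strip():
            continue
        # First token is the OID, the rest of the line is its value.
        oid, _, value = line.partition(" ")
        varbinds[oid] = value
    return hostname, address, varbinds


def build_nodefail_command(node_name):
    """Build a made-up nodefail command line for the failed node."""
    return ["nodefail", node_name]


sample = io.StringIO(
    "nodeA\n"
    "UDP: [192.0.2.10]:161->[192.0.2.11]:162\n"
    "SNMPv2-MIB::sysUpTime.0 0:0:05:12.34\n"
)
host, addr, binds = parse_trap(sample)
print(build_nodefail_command(host))
```

A real handler would read sys.stdin instead of the sample buffer and would
run the command, for example with subprocess.run.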
If you have a set of Langleys and you need more information on how to set
them up, you can find all the tools you need on this project page:
http://ipmiutil.sourceforge.net/
One other possible platform is OpenHPI, or rather any hardware platform
that supports OpenHPI (http://openhpi.sourceforge.net/). OpenHPI should be
capable of delivering these watchdog timer events using whatever underlying
hardware technology is available. So instead of running the SNMP trap daemon,
you might need an OpenHPI application that periodically polls for HPI events.
I am not familiar with OpenHPI, however, so this might not be accurate. Input
on this is more than welcome.
5. How is this node failure notification implemented?
The node failure notification is implemented by extending the heartbeat API with
this call:
int (*alert_notify)(ll_cluster_t*, const char * node, int action);
Besides the cluster handle, the notifier passes two parameters to this function:
the name of the node on which the failure just occurred, and the action for
heartbeat to take. Currently, two types of actions are defined.
a) ACTION_FASTRECOVERY
This tells heartbeat to mark the node as dead and start the recovery
process immediately, including the STONITH, if configured.
b) ACTION_BROADCAST
This tells heartbeat to just broadcast a forged 'deadstatus' message to
the cluster on behalf of the dead node. To all the nodes in the
cluster, this message looks as if it came from the failed node.
This will trigger the recovery, but not necessarily the STONITH.
Note: the actions can be combined with a bitwise OR, but you have to be very
careful about this. If the broadcast message is received after the node is
marked as dead, heartbeat would think that a dead node came back from a
partition and it would restart itself, which is probably not desirable. So it
is advised not to combine these two actions when making this call.
Also, only a client that signs on as "alertnotifier" can make this call. You
should put 'apiauth alertnotifier uid=the_alert_user_id' in the ha.cf file.
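
The advice about not combining the two actions can be sketched as a small
guard that a notifier might apply before making the call. This is a Python
model only. The numeric flag values are assumptions, since this readme names
the actions but does not give their values, and check_action is a made-up
helper, not part of the heartbeat API.

```python
from enum import IntFlag

# Hypothetical model of the action flags. The numeric values are assumed.

class Action(IntFlag):
    FASTRECOVERY = 0x1
    BROADCAST = 0x2


def check_action(action):
    """Return the action if it is safe to send, otherwise raise ValueError."""
    action = Action(action)
    if action == Action(0):
        raise ValueError("no action requested")
    # Combining both risks the forged broadcast arriving after the node
    # was already marked dead, which would make heartbeat restart.
    if Action.FASTRECOVERY in action and Action.BROADCAST in action:
        raise ValueError("FASTRECOVERY and BROADCAST should not be combined")
    return action


print(check_action(Action.FASTRECOVERY).name)
```

A caller would pass the checked value on as the action argument of the
notification call.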
6. Any test results?
I wrote the faildetection utility to test how fast we can detect a node
failure with nodefail. A CTS test case could be written, but I don't know
anything about python, so I will do that later. The basic idea is the same
though. The faildetection utility signs on to heartbeat on node A as a client.
It then uses ssh to log into node B and issues a 'killall -9 heartbeat'
command, and at the same time sends out an SNMP trap from node B to node A.
The snmptrapd running on node A is configured to invoke the nodefail utility.
Once the faildetection client on node A gets the NodeStatus callback from
heartbeat, it calculates the elapsed time since heartbeat on node B was killed.
The results are quite good. I am able to consistently get a failure detection
time within 100ms. Many times I am even able to get the failure detection time
down to merely 20ms, which is amazingly fast compared to the 400ms. And I did
all that without stressing my system. Here are the settings in my ha.cf file:
keepalive 50ms
deadtime 500ms
warntime 250ms
Because of nodefail, I can set the deadtime to be fairly long and still get
consistently fast node failure detection. During my tests, a couple of times
the failover time increased significantly, from under 100ms to more than 400ms.
I quickly did a "/etc/init.d/snmptrapd status" check and, sure enough,
snmptrapd was dead. After restarting snmptrapd, the time dropped back down to
the 30ms range. So my test environment is really not that highly available.
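
For anyone who wants to write the CTS-style test, the measurement idea
(record when the remote heartbeat is killed, then stop the clock in the node
status callback) might look roughly like this in Python. The class and method
names are invented for this sketch and the clock is injectable so the logic
can be checked without a cluster. It is not the faildetection utility itself.

```python
import time

# Hypothetical sketch of the elapsed-time measurement. Names are made up.

class FailureTimer:
    """Measures the time between killing a node and the status callback."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # returns seconds as a float
        self._killed_at = None
        self.elapsed_ms = None

    def node_killed(self):
        """Call right when the remote 'killall -9 heartbeat' is issued."""
        self._killed_at = self._clock()

    def on_node_status(self, node, status):
        """Call from the node status callback."""
        if status == "dead" and self._killed_at is not None:
            self.elapsed_ms = (self._clock() - self._killed_at) * 1000.0


# Demonstration with a fake clock that advances by 25ms between calls.
ticks = iter([10.0, 10.025])
timer = FailureTimer(clock=lambda: next(ticks))
timer.node_killed()
timer.on_node_status("nodeB", "dead")
print(round(timer.elapsed_ms))
```

Using a monotonic clock avoids skew from wall-clock adjustments during a
test run.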
I do not have access to a pair of Langleys right now to test this on a real
production system, but I think the idea is proven. In real situations, there
could be extra overhead for the platform to generate the alert notification.
Assuming that costs an extra 100ms, we would have a node failure detection
time in the 200ms range, which is still a significant improvement.
-----------------------------------
Yixiong Zou (yixiong.zou at intel.com)
Open Source Technology Center
Intel Corp.
All views expressed in this email are those of the individual sender.