[Linux-HA] [patch] clock_t wrapped around causingfalse resourcestart failure

Tavanyar, Simon Simon.Tavanyar at stratus.com
Wed Jul 1 08:22:58 MDT 2009


Aahh ... so it's measured after reboot. That makes sense.
Thanks, Lars.

Should I wait for you to log a bugzilla on this?




-----Original Message-----
From: linux-ha-bounces at lists.linux-ha.org
[mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Lars Ellenberg
Sent: Wednesday, July 01, 2009 6:28 AM
To: linux-ha at lists.linux-ha.org
Subject: Re: [Linux-HA] [patch] clock_t wrapped around causingfalse
resourcestart failure

On Tue, Jun 30, 2009 at 10:03:29AM -0400, Tavanyar, Simon wrote:
> Hi Dejan,
> 
> The bug looks like a one-off occurrence. We run hundreds of hours of
> system stress tests in a week, moving resources between main and
standby
> systems, and we haven't seen this error in a couple years. (There was
a
> longclock error back in 2007 found by my colleague Simon Graham).
> 
> The longclock wrap occurred within 2:45 of a reboot. 
> The apparent coincidence seems to be that we were starting resources
on
> a back-up node around 165 seconds after the node had been rebooted and
> hearbeat restarted. As I expect you know, somewhere between 160 and
175
> seconds after a heartbeat start, the longclock is configured to wrap.

Nonono.
times() is coupled to uptime,
uptime is measured in jiffies,
and initial jiffies in linux kernel is
include/linux/jiffies.h:
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

so the first wrap occurs 5 minutes after boot.
nothing to do with heartbeat start, or longclock.

> The rareness of this makes me think we hit a really obscure window... 

yes.
but.

it is totally broken.
some grepping in my mercurial checkouts suggest that you are using
heartbeat, which is using its "plumbing" wrappers.

[skip this paragraph unless you are heartbeat coder]
in particular, the killing is done by
TrackedProcTimeoutFunction(),
and the timeout for that is set by SetTrackedProcTimeouts(),
which calls into Gmain_timeout_add -> Gmain_timeout_add_full
(still in the heartbeat plumbing wrappers),
which creates a new event source of type GTimoutAppend
(also heartbeat plumbing) with methods defined by Gmain_timeout_funcs.

The Gmain_timeout_prepare does a simple check on
"is the next time this event shall trigger in the past?",
and if so, triggers immediately.

Now, that check is broken.
in lib/clplumbing/GSource.c, Gmain_timeout_prepare():
	if (cmp_longclock(lnow, append->nexttime) >= 0)
assuming that nexttime was set correctly, and lnow is correct, too,
and further assuming your clock_t is only 32 bit,
longclock_t is defined as unsigned long long,
and that thing becomes:

	if ((unsigned long long)(lnow) >= (unsigned long
long)(append-nexttime))

which exactly does _NOT_ care for wrap around :(

cmp_longclock macro and functions are broken.  conceivably some other
longclock related macros/functions may be broken as well.

it should probably be more like
(caution, not even compile tested)

diff -r 731f8f7b5450 include/clplumbing/longclock.h
--- a/include/clplumbing/longclock.h	Tue Jun 30 12:02:16 2009 +0200
+++ b/include/clplumbing/longclock.h	Wed Jul 01 12:27:27 2009 +0200
@@ -127,9 +127,9 @@
 	((longclock_t)(l) - (longclock_t)(r))
 
 #	define	cmp_longclock(l,r)			\
-	(((longclock_t)(l) < (longclock_t)(r))		\
+	((((long long)(l) - (long long)(r)) < 0)	\
 	?	-1					\
-	: (((longclock_t)(l) > (longclock_t)(r))	\
+	: ((((long long)(l) - (long long)(r)) > 0)	\
 	?	+1 : 0))
 #endif
 
diff -r 731f8f7b5450 lib/clplumbing/longclock.c
--- a/lib/clplumbing/longclock.c	Tue Jun 30 12:02:16 2009 +0200
+++ b/lib/clplumbing/longclock.c	Wed Jul 01 12:27:27 2009 +0200
@@ -259,12 +259,7 @@
 int
 cmp_longclock(longclock_t l, longclock_t r)
 {
-	if (l < r) {
-		return -1;
-	}
-	if (l > r) {
-		return 1;
-	}
-	return 0;
+	return (((long long)(l) - (long long)(r)) < 0) ? -1
+	     : (((long long)(l) - (long long)(r)) > 0) ? +1 : 0;
 }
 #endif /* CLOCK_T_IS_LONG_ENOUGH */



-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD(r) and LINBIT(r) are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
Linux-HA at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems



More information about the Linux-HA mailing list