[Linux-HA] [patch] clock_t wrapped around causingfalse resourcestart failure

Lars Ellenberg lars.ellenberg at linbit.com
Wed Jul 1 09:13:42 MDT 2009


On Wed, Jul 01, 2009 at 04:59:24PM +0200, Dejan Muhamedagic wrote:
> Hi Lars,
> 
> On Wed, Jul 01, 2009 at 04:44:06PM +0200, Lars Ellenberg wrote:
> > On Wed, Jul 01, 2009 at 10:22:58AM -0400, Tavanyar, Simon wrote:
> > > Aahh ... so it's measured after reboot. That makes sense.
> > > Thanks, Lars.
> > > 
> > > Should I wait for you to log a bugzilla on this?
> > 
> > bugzilla and me don't go along very well, really ;-)
> > so please feel free to open that bugzilla,
> > and attach my proposed patch there.
> > 
> > just to reiterate:
> > the problem is not the uptime wrap or resulting longclock wrap,
> > as that is handled correctly in cl_times() and time_longclock.
> > 
> > but the cmp_longclock comparing unsigned values
> > without checking for wrap.
> 
> The wrap is stuffed into upper bits of the longclock_t variables:
> 
>     return (lc_wrapcount | (longclock_t)timesval);
> 
> Or am I missing something? The only explanation I can think of is
> that this is actually wrong:

absolutely.

>     if (cmp_longclock(lnow, append->nexttime) >= 0) {
> 
> i.e. it should be strictly greater. I guess that what happened is
> that somehow the prepare function got called fast enough, so that
> both timestamps ended equal.

absolutely not.
 ;)

Much worse!
as far as I see, cmp_longclock is broken (does not care for wrap)
even on 64bit archs, and the time window for this to happen is about
as large as the larges action (or else) timeout you set somewhere in
crm/lrm/whatever.

did you have a look at my explanation ealier in this thread,
and the patch? (part of that message pasted here for reference)


        if (cmp_longclock(lnow, append->nexttime) >= 0)
assuming that nexttime was set correctly, and lnow is correct, too,
and further assuming your clock_t is only 32 bit,
longclock_t is defined as unsigned long long,
and that thing becomes:

        if ((unsigned long long)(lnow) >= (unsigned long long)(append->nexttime))

which exactly does _NOT_ care for wrap around :(


example:
	say, you start with a _large_ lnow (e.g. the equivalent of "-15
	seconds"), and add an intervall of 30 seconds, resulting
	"nexttime" become the equivalent of +15 seconds.
	comparing absolute _unsigned_ means that test ends up comparing
	if (something verry large >= something very small).
	which is always true :(

	you need instead cast both values to _signed_, and compare the
	_difference_ against zero, as below:


cmp_longclock macro and functions are broken.  conceivably some other
longclock related macros/functions may be broken as well.

it should probably be more like
(caution, not even compile tested)
((and this time incompletely copy'n'pasted,
  so whitespace will be wrong))

diff -r 731f8f7b5450 include/clplumbing/longclock.h
--- a/include/clplumbing/longclock.h    Tue Jun 30 12:02:16 2009 +0200
+++ b/include/clplumbing/longclock.h    Wed Jul 01 12:27:27 2009 +0200
@@ -127,9 +127,9 @@
        ((longclock_t)(l) - (longclock_t)(r))

 #      define  cmp_longclock(l,r)                      \
-       (((longclock_t)(l) < (longclock_t)(r))          \
+       ((((long long)(l) - (long long)(r)) < 0)        \
        ?       -1                                      \
-       : (((longclock_t)(l) > (longclock_t)(r))        \
+       : ((((long long)(l) - (long long)(r)) > 0)      \
        ?       +1 : 0))

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



More information about the Linux-HA mailing list