
Peter Dimov wrote:
> Phil Endecott:
>> Peter's comment about evil spin-waits in lock() is interesting as I was considering implementing something like that. I wonder about this instead:
>>
>> void mutex::lock() {
>>   if (atomic-attempt-to-lock) {
>>     // uncontended
>>     return;
>>   }
>>
>>   // first wait, try yielding since it's fast:
>>   sched_yield();
>>
>>   if (atomic-attempt-to-lock) {
>>     // got lock on 2nd attempt
>>     return;
>>   }
>>
>>   // We failed again.  Maybe there are lots of threads waiting; in that
>>   // case it will be more efficient to futex_wait() since the kernel will
>>   // then only wake us when our lock has been woken.
>>   futex_wait(....);
>> }
>
> This isn't very good, either. Consider a scenario where your time slice is 10 ms, but the mutex is unlocked 3 ms after you attempt a lock.
So some other thread gets to do some (useful) work for 7 ms. But I take your point.
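For reference, here is roughly how that trylock / sched_yield / futex_wait sequence could be fleshed out.  This is an untested sketch rather than the code I actually use: it assumes Linux futexes reached via syscall(SYS_futex), and it adds the usual 0/1/2 (unlocked / locked / locked-with-waiters) lock-word bookkeeping from Ulrich Drepper's "Futexes Are Tricky" paper, which the pseudo-code above glosses over.

  #include <atomic>
  #include <sched.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/futex.h>

  class futex_mutex {
      // 0 = unlocked, 1 = locked, 2 = locked and there may be waiters.
      std::atomic<int> state_{0};

      static long futex(std::atomic<int>* addr, int op, int val) {
          // Relies on std::atomic<int> being address-compatible with int,
          // which holds on Linux/glibc.
          return ::syscall(SYS_futex, reinterpret_cast<int*>(addr),
                           op, val, nullptr, nullptr, 0);
      }

  public:
      void lock() {
          int expected = 0;
          if (state_.compare_exchange_strong(expected, 1))
              return;                           // uncontended

          ::sched_yield();                      // first wait: just yield
          expected = 0;
          if (state_.compare_exchange_strong(expected, 1))
              return;                           // got the lock on the 2nd attempt

          // Still contended: mark the lock word as "waiters present" and
          // sleep in the kernel until unlock() wakes us.
          while (state_.exchange(2) != 0)
              futex(&state_, FUTEX_WAIT, 2);
      }

      void unlock() {
          if (state_.exchange(0) == 2)          // someone may be waiting
              futex(&state_, FUTEX_WAKE, 1);
      }
  };

The cost of this scheme is the occasional spurious FUTEX_WAKE when the lock word says 2 but nobody is actually asleep; Drepper's paper discusses the trade-offs.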
> In my opinion, mutex::lock should never do fancy stuff behind the user's back in an attempt to achieve "better performance". If a trylock/sched_yield/lock pattern is better for a specific workload, it can be implemented in user space. If it isn't, there is no way to undo the damage.
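That pattern is indeed straightforward to express in user space on top of any mutex that exposes a try_lock(); a hypothetical helper, purely for illustration and not something from my code:

  #include <sched.h>

  // Try once, yield and try once more, then fall back to a blocking lock.
  // Works with any mutex type that provides lock() and try_lock().
  template <typename Mutex>
  void try_yield_lock(Mutex& m) {
      if (m.try_lock())
          return;
      ::sched_yield();
      if (m.try_lock())
          return;
      m.lock();
  }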
Yes, I agree emphatically with Peter here. I was pleased, for example, to see that Howard's implementation of the multiple locking algorithm doesn't need access to any "internals" to work efficiently.

The approach that I've been experimenting with is to supply a template parameter to the mutex class that determines its behaviour, i.e.

template <typename WaitWake>
class mutex {
public:
  void lock();    // tries to lock with an atomic operation, and
                  // calls WaitWake.wait() if it fails.
  void unlock();  // calls WaitWake.wake().
  // etc. - an N2447-like interface
};

The "WaitWake" concept is something with a futex-like interface:

class Futex {
  int& var;
public:
  Futex(int& var_);
  void wait(int expected);  // returns immediately if var != expected,
                            // else blocks until wake() is called
  void wake();
  // etc.
};

I originally factored this out so that I could substitute a yield loop or spinlock using these degenerate classes:

class Yield {
public:
  Yield(int& var_) {}
  void wait(int expected) { ::sched_yield(); }
  void wake() {}
};

class Spin {
public:
  Spin(int& var_) {}
  void wait(int expected) {}
  void wake() {}
};

Since then I've written a condition-variable class that also takes a "WaitWake" template parameter.

But there are a couple of things that I can't do with this design:

- The yield-loop and spin-loop are compatible, in the sense that different locks on the same mutex could use different implementations (based on the expectation of blocking in each case) and it would still work. The futex could be made compatible. So perhaps the WaitWake should be a parameter of the lock, not of the mutex.

- But whether a lock is better off spinning or yielding is actually a function of what the _other_ thread that currently holds the lock is doing, not what this thread is going to do. So maybe we want a scheme like this (pseudo-code):

mutex m;
...
{
  lock l(m, 1e-6);
  var++;                            // lock is only held for a microsecond or so
}
...
{
  lock l(m, 10);
  var = result_of_http_request();   // lock is held for seconds
}

class lock {
  mutex& m;
public:
  lock(mutex& m_, float likely_hold_time): m(m_) {
    if (m.likely_hold_time < 1e-3) {
      m.spin_lock();
    } else {
      m.yield_lock();
    }
    m.likely_hold_time = likely_hold_time;
  }
  ~lock() { m.unlock(); }
  .....
};

My thought process gets about this far and then I think "this is too complicated". Simpler is better.

- My current scheme also doesn't allow for the sort of spin-then-wait or yield-then-wait design at the top of this message. There are lots of variations on this, like "if (m.lock_holder_cpu != this_cpu) { spin for a bit }".

In practice, my real code doesn't need most of these features; I have maybe a dozen threads, not thousands, and lock contention seems to be very rare. I got a worthwhile speedup by bypassing glibc's pthread_mutex and pthread_cond implementation and going straight to futexes (for the fast uncontended case), and I try to do as much work as possible between context switches. If we had a set of concurrency benchmarks available then I'd investigate a bit further.

Regards,

Phil.