
Howard Hinnant wrote:
On Feb 8, 2008, at 6:19 PM, Phil Endecott wrote:
If you can share the code, I'll see how it performs on my Linux systems.
Thanks Phil! Sorry for the delay. I wanted to update my tests to use the latest std:: syntax and rerun them. First, here's the code:
OK, I have made some minimal mods to work with my atomic-ops-and-futex mutex, and I get the following results. All of these results should be treated as provisional, since I don't have 100% confidence that my mutex implementation is actually correct.

For my 1 GHz uniprocessor VIA C3 system:

    Lock-in-order:    8 s
    Retry:            7 s
    Retry and yield:  7 s

For a 3.6 GHz Intel Pentium 4 dual-core system:

    Lock-in-order:    still running.....
    Retry:            13 s
    Retry and yield:  10 s

(For anyone else following the thread, I've put a rough sketch of the "retry and yield" strategy just below your quote.)

So the most obvious initial observation is that you really don't want to run this sort of code on a multiprocessor system; this agrees with your observation:
Interesting side note: During the first run my disk backup utility kicked in. This extra competition for resources actually made the test case run faster!
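Here is that sketch. I've written it with the proposed std:: names rather than lifting the actual test code, so treat it as an illustration of the shape of the algorithm, not the real implementation:

    #include <mutex>
    #include <thread>
    #include <utility>

    // Acquire two mutexes with no fixed global order: lock one, try the
    // other, and on failure back off, yield, and start again with the
    // one we missed.
    void lock_both(std::mutex& a, std::mutex& b)
    {
        std::mutex* first  = &a;
        std::mutex* second = &b;
        for (;;)
        {
            first->lock();
            if (second->try_lock())
                return;                   // got both chopsticks
            first->unlock();              // back off so a neighbour can eat
            std::this_thread::yield();    // the "and yield" part
            std::swap(first, second);     // retry, starting with the missed one
        }
    }

(Plain "retry" is the same loop without the yield; "lock-in-order" simply acquires the two mutexes in a fixed global order, e.g. by address, and never needs to back off.)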
I imagine that the allocation of threads to processors could also explain things like:
I have no explanation for the longer time in the third run.
I know that there is a certain amount of theoretical work about affinity of threads to CPUs, but I don't know what OSes actually do in practice; do they migrate threads based on any run-time measures?

Something that you might find interesting is that I modified the chopstick as follows:

    struct Chopstick
    {
        Chopstick() : random_next_(1) {}
        Mutex mut_;
        unsigned random_next_;
        char padding[8000];    // <<<============
    };

My aim was to prevent the chopsticks from being close together in the cache. This only works if your Mutex object actually contains the shared state itself, rather than being a pointer or some other sort of handle. By doing this, I increased the performance on the dual-core system with the retry-and-yield locking strategy to 6 seconds! (Finally better than the system with a third of the clock speed and a tenth of the price! Yay!) I think that this sort of cache-friendly design is important; the effect that I've measured here would occur in real programs, not just in artificial benchmarks like this. Perhaps the Mutex needs its own allocator, or something?

I got a further improvement, down to 5 s on the dual-core system, by using my yield-based mutex (i.e. no futex at all, just a yield-in-a-loop to lock). So the scale of this benchmark (5 threads) is sufficiently small that the benefit of the futex (i.e. the thread is not woken prematurely) is outweighed by the overhead of recording which thread is waiting for what. My guess is that your additional yield() is part of the same phenomenon: if the number of philosophers were increased until my Mutex<Futex> became better than my Mutex<Yield>, then the benefit of the additional yield in the multilock would also go away.

I plan to do some more measurements with different systems and different mutex implementations (e.g. the evil spin-then-wait and friends). But we should remember that this is only an artificial benchmark.

Something that might be interesting to measure is the total number of context switches (per CPU) while the benchmark runs. I can see that with /usr/bin/time -f '%c %w'. If you have any similar facilities, you might like to see how correlated the total number of context switches is with the total runtime.

Regards,
Phil
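P.S. In case it is useful to anyone, by "yield-based mutex" I mean essentially the following; again this is a sketch in the proposed std:: style, not my actual implementation:

    #include <atomic>
    #include <thread>

    // A mutex that never sleeps in the kernel: on contention it simply
    // gives up its timeslice and tries again.  There is no futex and no
    // record of who is waiting for what.
    class YieldMutex
    {
        std::atomic<bool> locked_{false};
    public:
        void lock()
        {
            while (locked_.exchange(true, std::memory_order_acquire))
                std::this_thread::yield();
        }
        void unlock()
        {
            locked_.store(false, std::memory_order_release);
        }
    };

The bookkeeping this omits (recording who is waiting for what) is exactly the overhead I was referring to; with only 5 threads it evidently costs more than it saves in premature wake-ups.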