
Anthony Williams wrote:
> My spreadsheet with the results in can be downloaded from http://www.justsoftwaresolutions.co.uk/files/dining_results.ods
Excellent work!
> Not all of Andrey's runs are in there, but enough to get a reasonable mean and standard deviation for most cases.
> This shows some interesting observations. Firstly, the padding (8k between entries) is hugely beneficial for the small (8-byte) mutexes (boost 1.35 and my BTS-based variant), but the benefit is smaller for the CRITICAL_SECTION-based mutex, which is 24 bytes. If you consider that the data is only 4 bytes and each cache line is 64 bytes, this is unsurprising: with a 24-byte mutex and 4-byte data you can only fit about 2.5 entries in a cache line, so only adjacent entries clash, whereas with an 8-byte mutex and 4-byte data you get 5 entries per cache line, so there is much more conflict. This makes me wonder if it's worth padding the mutex to 64 bytes so that it occupies an entire cache line, or maybe adding a 64-byte alignment requirement.
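For concreteness, the two options at the end of that paragraph might look something like this (a sketch; 64 is an assumed cache line size, the attribute is GCC's spelling, and I believe the C++0x proposal would give a portable way to say the same thing):

    #include <boost/thread/mutex.hpp>

    // Option 1: pad the mutex out to a whole cache line by hand
    // (assumes sizeof(boost::mutex) < 64, true of the 8-byte 1.35 mutex).
    struct padded_mutex {
        boost::mutex m;
        char pad[64 - sizeof(boost::mutex)];
    };

    // Option 2: request 64-byte alignment from the compiler (GCC spelling).
    struct aligned_mutex {
        boost::mutex m;
    } __attribute__((aligned(64)));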
I think that the mutex should be grouped together with the data that it's protecting, e.g.

    // BAD:
    mutex mut[64];
    Thing things[64];

    // BETTER:
    pair<mutex, Thing> protected_things[64];

The presence of the "things" will separate out the mutexes, avoiding the cache-thrashing that you'd get if they were packed together; furthermore, if you're lucky, the "thing" will be pre-fetched into your cache as soon as you lock its mutex. You could even do this:

    template <typename T>
    class Locked: public T {
        mutex mut;
    public:
        void lock()   { mut.lock(); }
        void unlock() { mut.unlock(); }
    };

    Locked<Thing> locked_things[64];

For small "things" it would still help to pad to the cache line size. Various ways to do this come to mind: add a compiler-specific alignment attribute (I think there's a proposal for a standard way to do this in C++0x, right?); add an explicit padding field dependent on sizeof(T); or, for dynamically allocated objects, write a Locked::operator new that rounds up the requested size (sketched below). Of course it's wasteful to pad mutexes that are rarely contended, and in many cases mutexes are rarely contended. Wouldn't it be great if our compilers could do profile-driven memory layout optimisation? Sounds like a good PhD topic to me. In the meantime, I wonder what else could be done at the library level to help programs not thrash their caches?
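To make the operator new idea concrete, here is the same Locked extended with a rounding allocation function (a sketch; the 64 is an assumed line size, and note the caveat about allocator alignment in the comment):

    #include <cstddef>
    #include <new>
    #include <boost/thread/mutex.hpp>

    static const std::size_t cache_line = 64;  // assumed line size

    template <typename T>
    class Locked: public T {
        boost::mutex mut;
    public:
        void lock()   { mut.lock(); }
        void unlock() { mut.unlock(); }

        // Round heap allocations up to a whole number of cache lines so
        // that consecutively allocated Locked objects never share a line.
        // (Assumes the allocator hands back cache-line-aligned blocks,
        // which is not guaranteed in general.)
        static void* operator new(std::size_t size) {
            return ::operator new((size + cache_line - 1)
                                  & ~(cache_line - 1));
        }
        static void operator delete(void* p) { ::operator delete(p); }
    };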
> Secondly, the yield is quite beneficial in some cases (e.g. 58% improvement), but often detrimental (up to 33% degradation). Overall, I think it is not worth adding.
I still find that a mutex that ONLY uses yield gives very good performance. This is because I don't yet have a benchmark where there are large numbers of _unrelated_ threads running on the system. "Dining Philosophers" is not representative of many real-world problems, unfortunately.
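By a mutex that only uses yield, I mean something along these lines (a sketch; the atomic builtins are GCC's, and on Windows the equivalents would be InterlockedExchange and SwitchToThread):

    #include <boost/thread/thread.hpp>

    // Spin on a flag, giving up the timeslice after every failed
    // attempt and never blocking in the kernel.
    class yield_mutex {
        volatile int flag_;  // 0 = unlocked, 1 = locked
    public:
        yield_mutex(): flag_(0) {}
        void lock() {
            while (__sync_lock_test_and_set(&flag_, 1))
                boost::this_thread::yield();
        }
        void unlock() { __sync_lock_release(&flag_); }
    };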
> The next thing I noticed was quite surprising. On the machines with 16 hardware threads, the 32-thread versions often ran faster than the 16-thread versions. I can only presume that this is because the increased thread count meant less contention, and thus fewer blocking waits.
Yes.
> Finally, though the BTS-based mutex is faster than the boost 1.35 one in most cases, the CRITICAL_SECTION-based version is actually very competitive, and fastest in many cases. I found this surprising because it didn't match my prior experience, but it might be a consequence of the (generally undesirable) properties of CRITICAL_SECTIONs: some threads end up getting all the locks and running to completion very quickly, thus reducing the running thread count for the remainder of the test.
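(For anyone who missed the earlier posts: "BTS-based" means the lock is taken with the x86 lock bts, bit-test-and-set, instruction. A very rough sketch of just the fast path, using the MSVC intrinsic; this is not Anthony's actual code, which blocks on a kernel object under contention rather than sleeping:)

    #include <windows.h>
    #include <intrin.h>

    class bts_mutex {
        long word_;  // bit 0 is the lock bit
    public:
        bts_mutex(): word_(0) {}
        void lock() {
            // _interlockedbittestandset compiles to "lock bts" and
            // returns the previous value of the bit.
            while (_interlockedbittestandset(&word_, 0))
                Sleep(0);  // stand-in for a real blocking wait
        }
        void unlock() { _interlockedbittestandreset(&word_, 0); }
    };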
I think fairness is over-rated. An unfair schedule will reduce mean time-to-completion, and get better cache hit rates.

Cheers,
Phil.