Re: [boost] [thread] best practice to lock multiple mutexes

29 Feb 2008

Anthony Williams <anthony_w.geo <at> yahoo.com> writes:
...
Kowalke Oliver (QD IT PA AS <Oliver.Kowalke <at> qimonda.com> writes:
...
...
Andrey Tcherepanov <moyt63c02 <at> sneakemail.com> writes:
...
I sent you results to anthony_w.geo<at>yahoo.com, please let me know if you
received them or not - these corp. filters play strange games sometimes...
Received, thank you.
Would you share the data?
My spreadsheet of the results can be downloaded from
http://www.justsoftwaresolutions.co.uk/files/dining_results.ods

Not all of Andrey's runs are in there, but enough to get a reasonable mean and
standard deviation for most cases.

The results show several interesting things. Firstly, the padding (8k between
entries) is hugely beneficial for the small (8-byte) mutexes (boost 1.35, and my
BTS-based variant), but the benefit is smaller for the CRITICAL_SECTION-based
mutex, which is 24 bytes. If you consider that the data is only 4 bytes, and
each cache line is 64 bytes, this is unsurprising: with a 24-byte mutex and
4-byte data, an entry is 28 bytes, so just over 2 entries fit in a cache line
and it's only adjacent entries that clash, whereas with an 8-byte mutex and
4-byte data an entry is 12 bytes, so you get 5 entries per cache line and there
is much more conflict. This makes me wonder if it's worth padding the mutex to
64 bytes, so it occupies an entire cache line, or maybe adding a 64-byte
alignment requirement.
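
To illustrate the alignment idea, here is a minimal sketch. It is not the boost
implementation: std::mutex stands in for the mutex types measured above, and
the names cache_line_size, PackedEntry, PaddedEntry and entries_per_line are
made up for this example. The 64-byte line size is the figure assumed above
and is hardware-specific.

```cpp
#include <cstddef>
#include <mutex>

// Assumed cache-line size (hardware-specific; 64 bytes as discussed above).
constexpr std::size_t cache_line_size = 64;

// Unpadded entry: neighbouring array elements may share a cache line.
struct PackedEntry {
    std::mutex m;
    int data;
};

// alignas forces each entry to start on a cache-line boundary; sizeof is
// rounded up to a multiple of the alignment, so array elements never share
// a line with each other.
struct alignas(cache_line_size) PaddedEntry {
    std::mutex m;
    int data;
};

static_assert(alignof(PaddedEntry) == cache_line_size,
              "entry must be cache-line aligned");
static_assert(sizeof(PaddedEntry) % cache_line_size == 0,
              "entry must occupy whole cache lines");

// Number of whole entries of the given size that fit in one cache line.
constexpr std::size_t entries_per_line(std::size_t entry_size) {
    return cache_line_size / entry_size;
}
```

With the sizes quoted above, entries_per_line(24 + 4) is 2 and
entries_per_line(8 + 4) is 5. The cost of the padded layout is memory: every
entry takes at least a full cache line regardless of how small the mutex is.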

Secondly, the yield is quite beneficial in some cases (e.g. 58% improvement),
but often detrimental (up to 33% degradation). Overall, I think it is not worth
adding.
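
For readers who have not followed the thread, the sketch below shows roughly
where such a yield would sit, assuming a try-and-back-off scheme for locking
two mutexes. It is an illustration only, not the code that was benchmarked:
lock_both, yield_on_retry and run_demo are hypothetical names, and std::mutex
is used in place of the mutex types under test.

```cpp
#include <functional>
#include <mutex>
#include <thread>

// Lock both mutexes without deadlock: block on one, try the other, and on
// failure release everything and retry starting with the other mutex.
// yield_on_retry controls whether the thread yields before each retry.
template <typename M1, typename M2>
void lock_both(M1& a, M2& b, bool yield_on_retry) {
    for (;;) {
        {
            std::unique_lock<M1> first(a);
            if (b.try_lock()) {
                first.release();  // keep a locked; caller now owns both
                return;
            }
        }  // a is unlocked here because the try_lock failed
        if (yield_on_retry) std::this_thread::yield();
        {
            std::unique_lock<M2> first(b);
            if (a.try_lock()) {
                first.release();  // keep b locked; caller now owns both
                return;
            }
        }  // b is unlocked here because the try_lock failed
        if (yield_on_retry) std::this_thread::yield();
    }
}

// Hypothetical demo: two threads take the same pair of mutexes in opposite
// orders and increment a shared counter while holding both.
int run_demo(bool yield_on_retry) {
    std::mutex m1, m2;
    int counter = 0;
    auto worker = [&](std::mutex& x, std::mutex& y) {
        for (int i = 0; i < 10000; ++i) {
            lock_both(x, y, yield_on_retry);
            ++counter;
            y.unlock();
            x.unlock();
        }
    };
    std::thread t1(worker, std::ref(m1), std::ref(m2));
    std::thread t2(worker, std::ref(m2), std::ref(m1));
    t1.join();
    t2.join();
    return counter;
}
```

Whether the yield helps depends on the scheduler and the contention pattern,
which is consistent with the mixed results above.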

The next thing I noticed was quite surprising. On the machines with 16 hardware
threads, the 32-thread versions often ran faster than the 16-thread versions. I
can only presume that this is because the increased thread count meant less
contention, and thus fewer blocking waits.

Finally, though the BTS-based mutex is faster than the boost 1.35 one in most
cases, the CRITICAL_SECTION-based version is actually very competitive, and
fastest in many cases. I found this surprising, because it didn't match my prior
experience, but it might be a consequence of the (generally undesirable)
properties of CRITICAL_SECTIONs: some threads end up getting all the locks, and
running to completion very quickly, thus reducing the running thread count for
the remainder of the test.

Anthony
--
Anthony Williams
Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL