
Anthony Williams: ...
Finally, though the BTS-based mutex is faster than the Boost 1.35 one in most cases, the CRITICAL_SECTION-based version is actually very competitive, and fastest in many cases. I found this surprising, because it didn't match my prior experience, but it might be a consequence of the (generally undesirable) properties of CRITICAL_SECTIONs: some threads end up getting all the locks and running to completion very quickly, thus reducing the running thread count for the remainder of the test.
This is a general property of artificial benchmarks that do nothing outside of the critical region. In such a case, running the same thread to completion is indeed the most efficient way to accomplish the overall task, since there is no work left that could proceed in parallel. Sensible user code will generally try to keep its critical regions short and move as much of the code as possible outside them, to increase the amount of available parallelism.
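To make the distinction concrete, here is a minimal sketch using std::mutex and std::thread rather than any of the mutex implementations being benchmarked. All names in it (expensive, sum_long_critical_region, sum_short_critical_region) are hypothetical and exist only for illustration. The first function does all of its work under the lock, like the artificial benchmark; the second does the work outside and locks only to publish the result.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for per-item work that touches no shared state.
// Unsigned arithmetic wraps, so the result is well defined.
inline std::uint64_t expensive(std::uint64_t x) {
    for (int i = 0; i < 100; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

// Benchmark-style: the work itself runs inside the critical region,
// so at most one thread makes progress at any time.
std::uint64_t sum_long_critical_region(unsigned threads, std::uint64_t items_per_thread) {
    std::mutex m;
    std::uint64_t total = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (std::uint64_t i = 0; i < items_per_thread; ++i) {
                std::lock_guard<std::mutex> lock(m);
                total += expensive(t * items_per_thread + i);
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}

// Short critical region: each thread accumulates privately and takes
// the lock once, only to merge its partial sum into the shared total.
std::uint64_t sum_short_critical_region(unsigned threads, std::uint64_t items_per_thread) {
    std::mutex m;
    std::uint64_t total = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < items_per_thread; ++i)
                local += expensive(t * items_per_thread + i);
            std::lock_guard<std::mutex> lock(m);
            total += local;
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

Both functions compute the same total. In the first, the lock serializes everything, so a mutex that lets one thread run to completion loses nothing; in the second, the threads can actually overlap, and how the lock behaves under contention matters far less.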