Re: [boost] [thread] best practice to lock multiple mutexes

8 Feb 2008

      On Feb 8, 2008, at 2:42 PM, Anthony Williams wrote:
...
Howard Hinnant <hinnant <at> twcny.rr.com> writes:
...
template <class L1, class L2>
void
lock(L1& l1, L2& l2)
{
    while (true)
    {
        {
        unique_lock<L1> __u1(l1);
        if (l2.try_lock())
        {
            __u1.release();
            break;
        }
        }
        std::this_thread::yield();
snipped rest of multi-lock code.
...
3.  The yields are absolutely necessary to get the high performance
mentioned in the previous note.
I find this comment intriguing. Is this always the case, or does it  
depend on
the specifics of how a mutex is implemented on a given platform?
That is a question I've been asking and myself for a long time,  
unfortunately without a satisfactory answer.  Here's my experience,  
and a few guesses...

On 2 core Macs and 2 core Windows, I've measured improved speeds with  
the yields, using a "dining philosophers" style simulation.  I haven't  
tested on Linux.

Here are results (dating around 3 years ago):

In order:  63 seconds.
try/back off/without yield:  47 seconds.
try/back off/with yield:  34 seconds.

These numbers come from 5 threads each wanting two mutexes each  
(classic dining philosopher setup).  The same runs on single core  
machines looked like:

In order:  293 seconds.
try/back off/without yield:  not tested.
try/back off/with yield:  278 seconds.

The single core results were measured on different machines than the  
dual core results and so aren't comparable to the dual core results.

I set up an additional "dining philosophers" test where each   
philosopher set at the corner of a cube (8) and attempted to attain a  
mutex associated with the edge of the cube adjacent to the corner (12  
mutexes).  That is, each diner required "3 forks" to eat.  The results  
were:

In order:  226 seconds.
try/back off/without yield:  not tested.
try/back off/with yield:  77 seconds.

Unfortunately I didn't run the last two tests without the yields.

I believe the try/back off algorithms are faster than the in order  
algorithm because in the latter a thread sits on a mutex while blocked  
for another.  Thus a "fork" that somebody else could be using is  
idled.  But in the try/back off algorithms the diners are "more  
polite".  If they can not eat, they offer their fork up so that  
someone else may be able to eat.

Now what about the yields...  Blocking on a mutex should be very  
similar to a yield.  But the first test is showing a significant speed  
advantage to the yields.  The above isn't noise, the tests were run  
many times and the averages are reported.  On running the tests on a  
Mac I viewed the CPU activity in the "Activity Monitor" application.   
When running without the yields, the activity monitor looked like:

(I hope you can see the graphic).  The red in the graphic indicates  
"system time" while the green indicates "user time".  In case you  
can't see the graphic, it looks like around 20% to 25% of the time is  
spent in "system", with the rest spent in "user".  When running with  
the yields, the monitor looks like:

Now the "system time" is almost completely gone (maybe 7% or 8% system  
time).  So more "user time" translates to faster application (not so  
much "system time").  But this still doesn't answer your precise  
question:  Is this behavior really portable?  I'm afraid I don't  
really know for sure.  It has just been my experience on Mac and  
Windows.  The Windows tests were done 3 years ago, and I asked someone  
to do them for me.  I didn't have the machine available to investigate  
myself.  The Mac tests were done at the same time, but I've replicated  
the results more recently.  I haven't replicated the Windows results  
more recently.

As long as I'm posting graphics (I hope nobody minds), here's another  
interesting snapshot of the 8 diner/ 12 fork / 3 forks at a time  
case.  Both snapshots are taken as the test is ending:

Lock in order:

try/back off/with yield:

The graphs show that the lock-in-order algorithm isn't making full use  
of the cpu's in the final seconds of the test.  The reason for this is  
that some diners "hogged" the system and got thru eating while other  
diners sat around not getting their work done.  In the end there  
weren't enough diners left to keep both cpu's busy (I don't recall if  
we are looking at 1 or 2 diners at the end of this test).

try/back off/with yield graphic shows the same effect as described  
above, but over a *much* shorter time period.  Both cpu's are fully  
busy in "user time" for nearly the entire test.

I hope this gives you some insights, despite the lack of a clear  
answer to your question.  Perhaps you or others can present a good  
explanation for this behavior regarding the yields. :-)

-Howard