
I've been reviewing this thread, and synchronization primitives with the PPC instruction set in general, looking especially at Alexander's comments. I've also done stress testing and performance testing on Mac OS X, on both the CFM and Mach-O platforms. After reviewing an old PowerPC 601 User's Manual, which has an appendix of synchronization examples that does take multiprocessor architectures into account, I've made slight modifications to Martin's/Howard's posted code, namely the addition of isync to the constructor and sync to the destructor:

    inline
    lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex& m)
        : m_(m)
    {
        register volatile int* p = &m_.a_;
        register int f;
        register int one = 1;
        asm
        {
        loop:
            lwarx   f, 0, p     // load a_ and set the reservation
            cmpwi   f, 0
            bne-    yield       // already locked, go yield
            stwcx.  one, 0, p   // store 1 iff the reservation held
            beq+    done        // store succeeded, lock acquired
        }                       // store failed: fall through and retry
    yield:
        yield_thread();
        goto loop;
        asm
        {
        done:
            isync               // acquire barrier: nothing that follows
        }                       //   executes until the lock is held
    }

    inline
    lightweight_mutex::scoped_lock::~scoped_lock()
    {
        asm
        {
            sync                // release barrier: all prior stores
        }                       //   complete before the lock is dropped
        m_.a_ = 0;
    }

The 601 UM has example code that looks very much like this.

To test this code I created a volatile global int that a test function both increments and decrements, testing the value before and after every change, and of course locking a mutex so that only one thread can enter the test function at a time. Ten threads are set up to loop over the test function a million times each. Optimizations are fully on, so as to collect timing information as well. I used Metrowerks::threads to create the threads and to provide OS-supported mutexes for comparison. Metrowerks::threads is very similar to boost::threads. On the Mach-O platform Metrowerks::threads is a thin inlined wrapper over the native pthread library (BSD).

    volatile int global = 0;

    #ifdef NDEBUG
    #undef NDEBUG
    #endif

    #include <cassert>
    #include <msl_thread>

    #ifdef LWM
        lightweight_mutex mut;
    #else
        Metrowerks::mutex mut;
    #endif

    void test()
    {
    #ifdef LWM
        lightweight_mutex::scoped_lock lock(mut);
    #else
        Metrowerks::mutex::scoped_lock lock(mut);
    #endif
        assert(global == 0);
        ++global;
        assert(global == 1);
        --global;
        assert(global == 0);
    }

    void run_test()
    {
        for (unsigned i = 0; i < 0x000FFFFF; ++i)
            test();
    }

    #include <iostream>
    #include <time.h>

    int main()
    {
        const int num_thread = 10;
        clock_t t = clock();
    #ifndef UNCONTESTED
        Metrowerks::thread_group g;
        for (int i = 0; i < num_thread; ++i)
            g.create_thread(run_test);
        g.join_all();
    #else
        for (int i = 0; i < num_thread; ++i)
            run_test();
    #endif
        t = clock() - t;
        std::cout << "Done in " << t/(double)CLOCKS_PER_SEC << " seconds\n";
    }

The test can be run in "contested" or "uncontested" mode. The contested mode creates 10 threads to run run_test() simultaneously. The uncontested mode simply calls run_test() 10 times in sequence. In uncontested mode the mutexes are still locked and unlocked the same number of times, but the lock is guaranteed never to be held by another thread.

Also, as a sanity check, I created a "naive" scoped_lock:

    inline
    lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex& m)
        : m_(m)
    {
        while (m_.a_)         // unguarded test-and-set: no reservation,
            yield_thread();   //   no barriers, racy by design
        m_.a_ = 1;
    }

    inline
    lightweight_mutex::scoped_lock::~scoped_lock()
    {
        m_.a_ = 0;
    }

The purpose of this version was to ensure that the test is actually testing what I intended. Sure enough, this variant consistently asserted on the Mach-O platform, and hung or crashed on the CFM platform, when run in contested mode.

All testing was done on a dual 500 MHz G4 running Mac OS X 10.3. An activity monitor was checked periodically during the tests and showed that both processors were being utilized to run the test.
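For anyone who wants to rebuild the test: the code above also assumes a lightweight_mutex declaration and a yield_thread() function that I haven't reposted. Roughly, they look like the sketch below (this is the shape, not the exact source; yield_thread is shown for the pthread-based Mach-O build only, and the CFM build would yield through Multiprocessing Services instead):

    #include <sched.h>

    // Mach-O (pthread) build: yielding the thread is just sched_yield.
    inline void yield_thread() { sched_yield(); }

    // The shape of the mutex the scoped_lock code above operates on.
    // a_ is the spin word: 0 == unlocked, 1 == locked.
    class lightweight_mutex
    {
    public:
        lightweight_mutex() : a_(0) {}

        class scoped_lock
        {
        public:
            scoped_lock(lightweight_mutex& m);   // defined above
            ~scoped_lock();                      // defined above
        private:
            lightweight_mutex& m_;
        };

    private:
        volatile int a_;
        friend class scoped_lock;   // scoped_lock spins on a_ directly
    };

With these pieces in place, the four timing configurations below are just the cross product of defining or not defining LWM and UNCONTESTED.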
Performance Data:

    OS:    OS-supplied mutexes (MP tasks on CFM, pthreads on Mach-O)
    LWM:   lightweight_mutex
    naive: see the "naive" description above

All times are in seconds; lower is faster/better.

    +--------------------+-------+--------+---------+
    |                    |  OS   |  LWM   |  naive  |
    +--------------------+-------+--------+---------+
    | CFM contested      |  248  |  1.86  |  crash  |
    +--------------------+-------+--------+---------+
    | CFM uncontested    | 28.5  |  1.79  |  0.994  |
    +--------------------+-------+--------+---------+
    | Mach-O contested   |  202  |  3.03  | assert  |
    +--------------------+-------+--------+---------+
    | Mach-O uncontested | 3.26  |  1.69  |  0.800  |
    +--------------------+-------+--------+---------+

From the timings there appears to be sufficient motivation to explore the lightweight_mutex on these platforms. From a correctness standpoint I cannot find any faults in the lwm through testing. I even bumped the number of threads up to 100, and the number of test cycles up to 256 million, for a contested LWM run (Mach-O only). The run took 146 minutes on my 2 x 500 MHz machine and indicated absolutely zero problems (I did not have the patience to make such a run with the OS-based synchronization). I monitored the test to ensure that all 100 worker threads were active for the great majority of this time. I also noted that the test consumed approximately 1.85 processors on average (up to 1.90 processors while I was reading the paper, down to 1.15 processors while I was surfing the web).

Alexander, I have no doubt that you will see problems with this implementation, and I appreciate your comments. Could you please comment specifically on how the lwarx, stwcx., isync and sync instructions do not address your concerns? Or could you perhaps suggest a test I could realistically run to expose problems.

-Howard