
I've been reviewing this thread, and synchronization primitives with the PPC instruction set in general, looking especially at Alexander's comments. I've also done stress testing and performance testing on Mac OS X, on both the CFM and Mach-O platforms. After reviewing an old PowerPC 601 User's Manual, which has an appendix of synchronization examples that does take multiprocessor architectures into account, I've made slight modifications to Martin's/Howard's posted code, namely the addition of isync to the constructor and sync to the destructor:

    inline
    lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex& m)
        : m_(m)
    {
        register volatile int* p = &m_.a_;
        register int f;
        register int one = 1;
        asm
        {
        loop:
            lwarx   f, 0, p     // load a_ and set the reservation
            cmpwi   f, 0
            bne-    yield       // already locked, go yield
            stwcx.  one, 0, p   // store 1 iff the reservation held
            beq+    done        // store succeeded, lock acquired
        }                       // store failed: fall through and retry
    yield:
        yield_thread();
        goto loop;
        asm
        {
        done:
            isync               // acquire barrier: nothing that follows
        }                       //   executes until the lock is held
    }

    inline
    lightweight_mutex::scoped_lock::~scoped_lock()
    {
        asm
        {
            sync                // release barrier: all prior stores
        }                       //   complete before the lock is dropped
        m_.a_ = 0;
    }

The 601 UM has example code that looks very much like this.

To test this code I created a volatile global int that a test function both increments and decrements, testing the value before and after every change, and of course locking a mutex so that only one thread can enter the test function at a time. Ten threads are set up to loop over the test function a million times each. Optimizations are fully on, so as to collect timing information as well. I used Metrowerks::threads to create the threads and to provide OS-supported mutexes for comparison. Metrowerks::threads is very similar to boost::threads. On the Mach-O platform Metrowerks::threads is a thin inlined wrapper over the native pthread library (BSD).

    volatile int global = 0;

    #ifdef NDEBUG
    #undef NDEBUG
    #endif

    #include <cassert>
    #include <msl_thread>

    #ifdef LWM
        lightweight_mutex mut;
    #else
        Metrowerks::mutex mut;
    #endif

    void test()
    {
    #ifdef LWM
        lightweight_mutex::scoped_lock lock(mut);
    #else
        Metrowerks::mutex::scoped_lock lock(mut);
    #endif
        assert(global == 0);
        ++global;
        assert(global == 1);
        --global;
        assert(global == 0);
    }

    void run_test()
    {
        for (unsigned i = 0; i < 0x000FFFFF; ++i)
            test();
    }

    #include <iostream>
    #include <time.h>

    int main()
    {
        const int num_thread = 10;
        clock_t t = clock();
    #ifndef UNCONTESTED
        Metrowerks::thread_group g;
        for (int i = 0; i < num_thread; ++i)
            g.create_thread(run_test);
        g.join_all();
    #else
        for (int i = 0; i < num_thread; ++i)
            run_test();
    #endif
        t = clock() - t;
        std::cout << "Done in " << t/(double)CLOCKS_PER_SEC << " seconds\n";
    }

The test can be run in "contested" or "uncontested" mode. The contested mode creates 10 threads to run run_test() simultaneously. The uncontested mode simply calls run_test() 10 times in sequence. In uncontested mode the mutexes are still locked and unlocked the same number of times, but the lock is guaranteed never to be held by another thread.

Also, as a sanity check, I created a "naive" scoped_lock:

    inline
    lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex& m)
        : m_(m)
    {
        while (m_.a_)         // unguarded test-and-set: no reservation,
            yield_thread();   //   no barriers, racy by design
        m_.a_ = 1;
    }

    inline
    lightweight_mutex::scoped_lock::~scoped_lock()
    {
        m_.a_ = 0;
    }

The purpose of this version was to ensure that the test is actually testing what I intended. Sure enough, this variant consistently asserted on the Mach-O platform, and hung or crashed on the CFM platform, when run in contested mode.

All testing was done on a dual 500 MHz G4 running Mac OS X 10.3. An activity monitor was checked periodically during the tests and showed that both processors were being utilized to run the test.
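For anyone who wants to rebuild the test: the code above also assumes a lightweight_mutex declaration and a yield_thread() function that I haven't reposted. Roughly, they look like the sketch below (this is the shape, not the exact source; yield_thread is shown for the pthread-based Mach-O build only, and the CFM build would yield through Multiprocessing Services instead):

    #include <sched.h>

    // Mach-O (pthread) build: yielding the thread is just sched_yield.
    inline void yield_thread() { sched_yield(); }

    // The shape of the mutex the scoped_lock code above operates on.
    // a_ is the spin word: 0 == unlocked, 1 == locked.
    class lightweight_mutex
    {
    public:
        lightweight_mutex() : a_(0) {}

        class scoped_lock
        {
        public:
            scoped_lock(lightweight_mutex& m);   // defined above
            ~scoped_lock();                      // defined above
        private:
            lightweight_mutex& m_;
        };

    private:
        volatile int a_;
        friend class scoped_lock;   // scoped_lock spins on a_ directly
    };

With these pieces in place, the four timing configurations below are just the cross product of defining or not defining LWM and UNCONTESTED.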
Performance Data:

    OS:    OS-supplied mutexes (MP tasks on CFM, pthreads on Mach-O)
    LWM:   lightweight_mutex
    naive: see the "naive" description above

All times are in seconds; lower is faster/better.

    +--------------------+-------+--------+---------+
    |                    |  OS   |  LWM   |  naive  |
    +--------------------+-------+--------+---------+
    | CFM contested      |  248  |  1.86  |  crash  |
    +--------------------+-------+--------+---------+
    | CFM uncontested    | 28.5  |  1.79  |  0.994  |
    +--------------------+-------+--------+---------+
    | Mach-O contested   |  202  |  3.03  | assert  |
    +--------------------+-------+--------+---------+
    | Mach-O uncontested | 3.26  |  1.69  |  0.800  |
    +--------------------+-------+--------+---------+

From the timings there appears to be sufficient motivation to explore the lightweight_mutex on these platforms. From a correctness standpoint I cannot find any faults in the lwm through testing. I even bumped the number of threads up to 100, and the number of test cycles up to 256 million, for a contested LWM run (Mach-O only). The run took 146 minutes on my 2 x 500 MHz machine and indicated absolutely zero problems (I did not have the patience to make such a run with the OS-based synchronization). I monitored the test to ensure that all 100 worker threads were active for the great majority of this time. I also noted that the test consumed approximately 1.85 processors on average (up to 1.90 processors while I was reading the paper, down to 1.15 processors while I was surfing the web).

Alexander, I have no doubt that you will see problems with this implementation, and I appreciate your comments. Could you please comment specifically on how the lwarx, stwcx., isync and sync instructions do not address your concerns? Or could you perhaps suggest a test I could realistically run to expose problems.

-Howard