lightweight mutex for MacOS

Hi, I would like to put forward an implementation of a lightweight mutex for MacOS. I have tested it with shared_ptr_mt_test.cpp without problem. It's fairly short, so I hope you don't mind me posting it directly here for review. I should also add that I didn't really write this code (that's not a disclaimer - Howard Hinnant cooked it up after I asked the Codewarrior newsgroup for help!) Anyway, here's the code:

#if __MACH__
#  include <sched.h>
#else
#  include <Multiprocessing.h>
#endif

namespace boost
{

namespace detail
{

class lightweight_mutex
{
private:

    volatile int a_;

    lightweight_mutex(lightweight_mutex const &);
    lightweight_mutex & operator=(lightweight_mutex const &);

public:

    lightweight_mutex(): a_(0)
    {
    }

    class scoped_lock;
    friend class scoped_lock;

    class scoped_lock
    {
    private:

        lightweight_mutex & m_;

        scoped_lock(scoped_lock const &);
        scoped_lock & operator=(scoped_lock const &);

        void yield_thread()
        {
#if __MACH__
            sched_yield();
#else
            MPYield();
#endif
        }

    public:

        explicit scoped_lock(lightweight_mutex & m);

        ~scoped_lock()
        {
            m_.a_ = 0;
        }
    };
};

inline lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex & m)
    : m_(m)
{
    register volatile int * p = &m_.a_;
    register int f;
    register int one = 1;

    asm
    {
    loop:
        lwarx   f, 0, p     // load a_ with reservation
        cmpwi   f, 0        // is it already locked?
        bne-    yield       // yes: yield the thread and retry
        stwcx.  one, 0, p   // try to store 1, checking the reservation
        beq+    done        // store succeeded: lock acquired
    }                       // stwcx. failed: fall through to yield and retry
yield:
    yield_thread();
    goto loop;
done:
    ;
}

} // namespace detail

} // namespace boost

Martin
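For reference, the kind of use this is meant for looks like the following -- a hypothetical reference-count guard (illustrative only, not code from shared_ptr_mt_test.cpp):

// Hypothetical usage sketch: a reference count guarded by the mutex.
// The names here (ref_counted, add_ref, release) are made up for
// illustration; only lightweight_mutex and scoped_lock come from the
// posted code.
class ref_counted
{
private:

    boost::detail::lightweight_mutex mtx_;
    long count_;

public:

    ref_counted(): count_(1) {}

    void add_ref()
    {
        boost::detail::lightweight_mutex::scoped_lock lock(mtx_);
        ++count_;
    }

    long release()  // returns the new count; destroy *this on 0
    {
        boost::detail::lightweight_mutex::scoped_lock lock(mtx_);
        return --count_;
    }
};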

Martin Taylor wrote:
Hi
I would like to put forward an implementation of a lightweight mutex for MacOS. I have tested it with shared_ptr_mt_test.cpp without problem.
You don't have proper memory barriers for MP-safety. Abusing the scheduler is also not good. regards, alexander.

On Jan 27, 2004, at 10:38 AM, Alexander Terekhov wrote:
Martin Taylor wrote:
Hi
I would like to put forward an implementation of a lightweight mutex for MacOS. I have tested it with shared_ptr_mt_test.cpp without problem.
You don't have proper memory barriers for MP-safety. Abusing the scheduler is also not good.
Do you have any specific suggestions which would make this design at least as good as that in lwm_gcc.hpp? -Howard

Howard Hinnant wrote:
On Jan 27, 2004, at 10:38 AM, Alexander Terekhov wrote:
Martin Taylor wrote:
Hi
I would like to put forward an implementation of a lightweight mutex for MacOS. I have tested it with shared_ptr_mt_test.cpp without problem.
You don't have proper memory barriers for MP-safety. Abusing the scheduler is also not good.
Do you have any specific suggestions which would make this design at least as good as that in lwm_gcc.hpp?
http://www.boost.org/boost/detail/lwm_gcc.hpp is also broken. Do you have "futexes" on Mac? If so, you might want to take a look at

http://listman.redhat.com/archives/phil-list/2003-October/msg00029.html
http://listman.redhat.com/archives/phil-list/2003-October/msg00030.html

And, perhaps, also

http://groups.google.com/groups?selm=3FA61F96.223FDE85%40web.de

regards,
alexander.

P.S. http://groups.google.com/groups?selm=3FE2E91D.E600BC7E%40web.de
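The lock those posts describe comes down to roughly the following -- a sketch assuming Linux and a compiler with GCC's __sync atomic builtins, not the code from the links:

// Rough sketch of a futex-based lock. State of the lock word:
// 0 = free, 1 = locked, 2 = locked with (possible) waiters.
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

void futex_lock(int * f)
{
    int c = __sync_val_compare_and_swap(f, 0, 1); // uncontended fast path
    if (c != 0)
    {
        if (c != 2)
            c = __sync_lock_test_and_set(f, 2);   // atomic exchange: mark contended
        while (c != 0)
        {
            // sleep in the kernel until the word changes from 2
            syscall(SYS_futex, f, FUTEX_WAIT, 2, 0, 0, 0);
            c = __sync_lock_test_and_set(f, 2);
        }
    }
}

void futex_unlock(int * f)
{
    if (__sync_fetch_and_sub(f, 1) != 1)          // old value was 2: waiters possible
    {
        *f = 0;
        syscall(SYS_futex, f, FUTEX_WAKE, 1, 0, 0, 0); // wake one waiter
    }
}

The point is that a contended waiter sleeps in the kernel on the lock word itself, instead of spinning against the scheduler.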

Alexander Terekhov <terekhov@web.de> writes:
| Howard Hinnant wrote:
On Jan 27, 2004, at 10:38 AM, Alexander Terekhov wrote:
Martin Taylor wrote:
Hi
I would like to put forward an implementation of a lightweight mutex for MacOS. I have tested it with shared_ptr_mt_test.cpp without problem.
You don't have proper memory barriers for MP-safety. Abusing the scheduler is also not good.
Do you have any specific suggestions which would make this design at least as good as that in lwm_gcc.hpp?
| http://www.boost.org/boost/detail/lwm_gcc.hpp is also broken.

Because of the sched_yield or something else?

Perhaps the lwm_gcc should just be removed? (and lwm_linux)

In the lwm_linux case futexes could be used if new enough glibc and kernel. (OTOH in that case it is likely that pthread's mutex is implemented with a futex)

-- Lgb

Lars Gullik Bjønnes wrote: [...]
| http://www.boost.org/boost/detail/lwm_gcc.hpp is also broken.
Because of the sched_yield or something else?
Yes (plus the issue of memory barriers).
Perhaps the lwm_gcc should just be removed? (and lwm_linux)
Yes, definitely.
In the lwm_linux case futexes could be used if new enough glibc and kernel.
No need for new enough glibc; only new enough kernel (one with sensible and correct futex impl -- probably 3.2.something ;-) ), I think.
(OTOH in that case it is likely that pthread's mutex is implemented with a futex)
Current impl aside for a moment, pthread_mutex_t does some dispatching because it represents multiple mutex types. A "lightweight mutex" doesn't really need that. Well, I guess that boost simply needs a standard atomic<> template -- something along the lines of

www.terekhov.de/pthread_refcount_t/experimental/refcount.cpp

regards,
alexander.
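What such an atomic<> would wrap, on PowerPC, is roughly the following -- a sketch in GCC inline asm, a reconstruction rather than refcount.cpp itself:

// Rough sketch (GCC inline asm, 32-bit PowerPC): the kind of
// lock-free counter operation an atomic<> template would wrap.
inline int atomic_increment(volatile int * p)
{
    int t;
    __asm__ __volatile__
    (
        "1: lwarx %0,0,%1\n"    // load with reservation
        "   addi %0,%0,1\n"     // increment
        "   stwcx. %0,0,%1\n"   // store conditionally
        "   bne- 1b"            // lost the reservation: retry
        : "=&r" (t)
        : "r" (p)
        : "cc", "memory"
    );
    return t;                   // the incremented value
}

A decrement that drops the last reference would additionally need the release/acquire barriers discussed later in this thread.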

Alexander Terekhov wrote:
Lars Gullik Bjønnes wrote: [...]
http://www.boost.org/boost/detail/lwm_gcc.hpp is also broken.
Because of the sched_yield or something else?
Yes (plus the issue of memory barriers).
Perhaps the lwm_gcc should just be removed? (and lwm_linux)
Yes, definitely.
These two are off by default. I'll probably remove them as soon as I switch the Windows shared_ptr to the atomic version.
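For concreteness, the "atomic version" presumably means something along these lines -- a sketch built on the Win32 Interlocked API (an assumption, not the actual change):

// Sketch: the count becomes lock-free instead of mutex-protected.
// The class name and shape here are illustrative.
#include <windows.h>

class atomic_count
{
public:

    explicit atomic_count(long v): value_(v) {}

    long operator++() { return InterlockedIncrement(&value_); }  // returns new value
    long operator--() { return InterlockedDecrement(&value_); }  // returns new value

    operator long() const { return value_; }

private:

    atomic_count(atomic_count const &);
    atomic_count & operator=(atomic_count const &);

    volatile long value_;
};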
Current impl aside for a moment, pthread_mutex_t does some dispatching because it represents multiple mutex types. A "lightweight mutex" doesn't really need that.
True, and it doesn't need CV support either. That said, a pthread_mutex is close enough; the most important requirement for lightweight_mutex is "header-only" (which is why boost::mutex can't be used).
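Something like this sketch, presumably -- a header-only wrapper over pthread_mutex, along the lines of what an lwm_pthreads.hpp would contain (illustrative, untested):

#include <pthread.h>

namespace boost
{

namespace detail
{

class lightweight_mutex
{
private:

    pthread_mutex_t m_;

    lightweight_mutex(lightweight_mutex const &);
    lightweight_mutex & operator=(lightweight_mutex const &);

public:

    lightweight_mutex() { pthread_mutex_init(&m_, 0); }
    ~lightweight_mutex() { pthread_mutex_destroy(&m_); }

    class scoped_lock;
    friend class scoped_lock;

    class scoped_lock
    {
    private:

        pthread_mutex_t & m_;

        scoped_lock(scoped_lock const &);
        scoped_lock & operator=(scoped_lock const &);

    public:

        scoped_lock(lightweight_mutex & m): m_(m.m_) { pthread_mutex_lock(&m_); }
        ~scoped_lock() { pthread_mutex_unlock(&m_); }
    };
};

} // namespace detail

} // namespace boost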
Well, I guess that boost simply needs a standard atomic<> template
For shared_ptr yes, but quick_allocator still needs an "ordinary" header-only mutex.

I've been reviewing this thread, and synchronization primitives with the PPC instruction set in general. I've looked especially at Alexander's comments. And I've done stress testing and performance testing on Mac OS X, on both the CFM and Mach-O platforms. After reviewing an old PowerPC 601 User's Manual, which has an appendix of synchronization examples that does take multi-processor architectures into account, I've made slight modifications to Martin's/Howard's posted code, namely the addition of isync to the constructor, and sync to the destructor:

inline lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex & m)
    : m_(m)
{
    register volatile int * p = &m_.a_;
    register int f;
    register int one = 1;

    asm
    {
    loop:
        lwarx   f, 0, p
        cmpwi   f, 0
        bne-    yield
        stwcx.  one, 0, p
        beq+    done
    }
yield:
    yield_thread();
    goto loop;
    asm
    {
    done:
        isync               // acquire barrier
    }
}

inline lightweight_mutex::scoped_lock::~scoped_lock()
{
    asm { sync }            // release barrier before unlocking
    m_.a_ = 0;
}

The 601 UM has example code that looks very much like this.

To test this code I created a volatile global int that a test function would both increment and decrement, testing values before and after every change, and of course locking a mutex so that only one thread could enter the test function at a time. 10 threads are set up to loop over the test function a million times. Optimizations are full on, so as to collect timing information as well. I used Metrowerks::threads to create threads, and to provide OS-supported mutexes for comparison. Metrowerks::threads is very similar to boost::threads. On the Mach-O platform Metrowerks::threads is a thin inlined wrapper over the native pthread library (BSD).

volatile int global = 0;

#ifdef NDEBUG
#undef NDEBUG
#endif

#include <cassert>
#include <msl_thread>

#ifdef LWM
    lightweight_mutex mut;
#else
    Metrowerks::mutex mut;
#endif

void test()
{
#ifdef LWM
    lightweight_mutex::scoped_lock lock(mut);
#else
    Metrowerks::mutex::scoped_lock lock(mut);
#endif
    assert(global == 0);
    ++global;
    assert(global == 1);
    --global;
    assert(global == 0);
}

void run_test()
{
    for (unsigned i = 0; i < 0x000FFFFF; ++i)
        test();
}

#include <iostream>
#include <time.h>

int main()
{
    const int num_thread = 10;
    clock_t t = clock();
#ifndef UNCONTESTED
    Metrowerks::thread_group g;
    for (int i = 0; i < num_thread; ++i)
        g.create_thread(run_test);
    g.join_all();
#else
    for (int i = 0; i < num_thread; ++i)
        run_test();
#endif
    t = clock() - t;
    std::cout << "Done in " << t/(double)CLOCKS_PER_SEC << " seconds\n";
}

The test can be run in "contested" or "uncontested" mode. The contested mode creates 10 threads to run run_test() simultaneously. The uncontested mode simply calls run_test() 10 times. In the uncontested mode the mutexes are still locked and unlocked the same number of times, but the lock is guaranteed never to be held by another thread.

Also as a sanity check I created a "naive" scoped_lock:

inline lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex & m)
    : m_(m)
{
    while (m_.a_)           // not atomic with the store below: a race
        yield_thread();
    m_.a_ = 1;
}

inline lightweight_mutex::scoped_lock::~scoped_lock()
{
    m_.a_ = 0;
}

The purpose of this version was to ensure that the test actually tests what I intended. Sure enough, this variant consistently asserted on the Mach-O platform, and hung or crashed on the CFM platform, when run in contested mode.

All testing was done on a dual 500 MHz G4 running Mac OS X 10.3. An activity monitor was checked periodically during the tests and showed that both processors were being utilized to run the test.
Performance Data:

OS:    Used MP tasks on CFM, and pthreads on Mach-O
LWM:   lightweight_mutex
naive: see the "naive" description above

All times in seconds; lower is faster/better.

+--------------------+-------+--------+---------+
|                    |  OS   |  LWM   |  naive  |
+--------------------+-------+--------+---------+
| CFM contested      |  248  |  1.86  |  crash  |
+--------------------+-------+--------+---------+
| CFM uncontested    | 28.5  |  1.79  |  0.994  |
+--------------------+-------+--------+---------+
| Mach-O contested   |  202  |  3.03  | assert  |
+--------------------+-------+--------+---------+
| Mach-O uncontested | 3.26  |  1.69  |  0.800  |
+--------------------+-------+--------+---------+

From the timings there appears to be sufficient motivation to explore the lightweight_mutex on these platforms. From a correctness standpoint I cannot find any faults with the lwm through testing. I even bumped the number of threads up to 100, and the number of test cycles up to 256 million, for a contested LWM run (Mach-O only). The run took 146 minutes on my 2 x 500 MHz machine and indicated absolutely zero problems (I did not have the patience to make such a run with the OS-based synchronization). I monitored the test to ensure that all 100 worker threads were active for the great majority of this time. I also noted that the test consumed approximately 1.85 processors on average (up to 1.90 processors when I was reading the paper, down to 1.15 processors when I was surfing the web).

Alexander, I have no doubt that you will see problems with this implementation, and I appreciate your comments. Could you please comment specifically on how the lwarx, stwcx., isync and sync instructions do not address your concerns? Thanks. Or could you perhaps suggest a test I could realistically run to expose problems.

-Howard

Howard Hinnant wrote: [...]
Alexander, I have no doubt that you will see problems with this implementation, and I appreciate your comments.
Get rid of yield. With yield you basically have a spinlock, not a general purpose mutex. Think of a uniprocessor [under priority scheduling rules] with some higher priority thread contending for a lock currently owned by some lower priority thread: the yield never gives the processor to the lower priority owner, so the waiter can spin forever.
Could you please comment specifically on how the lwarx, stwcx., isync and sync instructions do not address your concerns? Thanks.
Use of op+isync for "op.hoist_load+hoist_store" (aka op.acquire) seems to be what everyone is doing out there (looks a bit strange if you ask me). For "op.sink_load+sink_store" (aka op.release) you can use lwsync+op (a lightweight, oh well, form of sync).

www-106.ibm.com/developerworks/eserver/articles/powerpc.html
listman.redhat.com/archives/phil-list/2003-August/msg00039.html

regards,
alexander.
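In GCC inline asm (CodeWarrior spells this differently, and lacks lwsync), the placement would look roughly like this -- a sketch drawn from the articles above, untested:

// Illustrative fence helpers; names are made up.
inline void ppc_acquire()   // run right after the lwarx/stwcx. loop succeeds
{
    __asm__ __volatile__ ("isync" ::: "memory");
}

inline void ppc_release()   // run right before the store that unlocks
{
    __asm__ __volatile__ ("lwsync" ::: "memory");
}

// unlock then becomes: ppc_release(); m_.a_ = 0;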

On 29 Jan 2004, at 14:28, Alexander Terekhov wrote:
Howard Hinnant wrote: [...]
Alexander, I have no doubt that you will see problems with this implementation, and I appreciate your comments.
Get rid of yield. With yield you basically have a spinlock, not a general purpose mutex. Think of a uniprocessor [under priority scheduling rules] with some higher priority thread contending for a lock currently owned by some lower priority thread.
I thought that the lightweight_mutex didn't have to be a general purpose mutex - in fact it explicitly says just that in the comments of lightweight_mutex.hpp:

//  boost::detail::lightweight_mutex meets a subset of the Mutex concept
//  requirements: http://www.boost.org/libs/thread/doc/mutex_concept.html#Mutex
...
//  * Not a general purpose mutex, use boost::mutex, CRITICAL_SECTION or
//    pthread_mutex instead.
...
//  The current implementation can use a pthread_mutex, a CRITICAL_SECTION,
//  or a platform-specific spinlock.

Regards, Martin

And the lwm_win32 is just a spinlock. Isn't it something like this (I haven't looked at the code in a while)?

while (InterlockedExchange(...))
{
    Sleep(1);
}

Thanks,
Russell

Martin Taylor wrote:
On 29 Jan 2004, at 14:28, Alexander Terekhov wrote:
Howard Hinnant wrote: [...]
Alexander, I have no doubt that you will see problems with this implementation, and I appreciate your comments.
Get rid of yield. With yield you basically have a spinlock, not a general purpose mutex. Think of a uniprocessor [under priority scheduling rules] with some higher priority thread contending for a lock currently owned by some lower priority thread.
I thought that the lightweight_mutex didn't have to be a general purpose mutex - in fact it explicitly says just that in the comments of lightweight_mutex.hpp:
//  boost::detail::lightweight_mutex meets a subset of the Mutex concept
//  requirements: http://www.boost.org/libs/thread/doc/mutex_concept.html#Mutex
...
//  * Not a general purpose mutex, use boost::mutex, CRITICAL_SECTION or
//    pthread_mutex instead.
...
//  The current implementation can use a pthread_mutex, a CRITICAL_SECTION,
//  or a platform-specific spinlock.
Regards, Martin

Russell Hind wrote: [...]
while (InterlockedExchange(...)) { Sleep(1);
http://aspn.activestate.com/ASPN/Mail/Message/1551222 regards, alexander.

Martin Taylor wrote: [...]
// or a platform-specific spinlock.
http://google.com/groups?selm=vnU47.809%24rc5.60740%40news.cpqcorp.net ("... you should never use a spinlock unless ...") regards, alexander.

On Jan 29, 2004, at 9:28 AM, Alexander Terekhov wrote:
Get rid of yield.
Thanks for the suggestion! New timing information:

+--------------------+-------+--------+--------------+---------+
|                    |  OS   |  LWM   | LWM/no yield |  naive  |
+--------------------+-------+--------+--------------+---------+
| CFM contested      |  248  |  1.86  |     9.25     |  crash  |
+--------------------+-------+--------+--------------+---------+
| CFM uncontested    | 28.5  |  1.79  |     1.77     |  0.994  |
+--------------------+-------+--------+--------------+---------+
| Mach-O contested   |  202  |  3.03  |     15.4     | assert  |
+--------------------+-------+--------+--------------+---------+
| Mach-O uncontested | 3.26  |  1.69  |     1.76     |  0.800  |
+--------------------+-------+--------+--------------+---------+

Removing the yield slows the contested variant of this particular test down roughly fivefold on these platforms. I understand the potential for priority inversion, but for these performance gains, and for a "cheap fast mutex", the yield looks like a good deal to me. I agree that for a more general purpose mutex other decisions might be made.

Thanks for the lwsync tip! Alas, CodeWarrior does not currently support this instruction, else I would have included timings for that change too.

-Howard
participants (6):

- Alexander Terekhov
- Howard Hinnant
- larsbj@gullik.net
- Martin Taylor
- Peter Dimov
- Russell Hind