
Hello Phil,

On Mon, 30 Nov 2009, Phil Endecott wrote:
Helge Bahmann wrote:
if you don't believe me, just disassemble pthread_mutex_lock/unlock on any linux system.
When I last looked at the glibc source, I believe I found quite a lot of code that needed to test whether the mutex was e.g. recursive or not, on top of that core functionality.
Sure, there is a lot of code, but the fast-path check for a normal mutex (properly marked with __builtin_expect) sits right at the top of the function and drops straight into the CAS, so with the compiler doing its job it *should* not matter. (However, the fetch of the thread id from TLS should be moved out of the fast path as well -- the compiler is probably not clever enough to move it out on its own.)
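To make the point concrete, the shape I mean is roughly the following -- just a sketch with made-up names (my_mutex, slow_path_lock), not the actual glibc code:

    // Sketch of the fast-path shape described above; field and function
    // names are placeholders, not glibc's.
    struct my_mutex {
        int futex_word;   // 0 = unlocked, 1 = locked (possibly contended)
        int kind;         // 0 = normal, otherwise recursive/errorcheck/...
    };

    // Out-of-line slow path: recursive/errorcheck handling, owner tracking
    // (the thread-id fetch from TLS belongs here), and sleeping on the futex
    // under contention.  The spin below is only a placeholder.
    void slow_path_lock(my_mutex *m)
    {
        while (!__sync_bool_compare_and_swap(&m->futex_word, 0, 1))
            ;
    }

    inline void my_mutex_lock(my_mutex *m)
    {
        // Fast path: a plain mutex, currently unlocked -- one CAS and return.
        if (__builtin_expect(m->kind == 0, 1)
            && __builtin_expect(__sync_bool_compare_and_swap(&m->futex_word, 0, 1), 1))
            return;
        slow_path_lock(m);
    }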
As for the space overhead of a pthread_mutex_t... if you cannot pay that, just use hashed locks.
A hashed lock library would be welcome here, I'm sure.
Yes, this would be a really helpful addition to Boost.Thread -- implementing a fallback for atomic operations is just not feasible without it.
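To sketch the idea (the class name, pool size and hash are arbitrary choices for illustration, not a proposed interface):

    #include <pthread.h>
    #include <stdint.h>
    #include <cstddef>

    // Hashed lock pool: the address of the object to be protected is hashed
    // into a small, fixed array of mutexes, so no per-object pthread_mutex_t
    // is needed.  Distinct objects may share a mutex; that is harmless as
    // long as each operation holds only one pool lock at a time.
    class hashed_lock_pool {
    public:
        hashed_lock_pool()
        {
            for (std::size_t i = 0; i < pool_size; ++i)
                pthread_mutex_init(&locks_[i], 0);
        }

        void lock(const volatile void *addr)   { pthread_mutex_lock(&mutex_for(addr)); }
        void unlock(const volatile void *addr) { pthread_mutex_unlock(&mutex_for(addr)); }

    private:
        enum { pool_size = 64 };   // arbitrary; keep it a power of two

        pthread_mutex_t &mutex_for(const volatile void *addr)
        {
            // Discard the low (alignment) bits, then mask into the pool.
            uintptr_t h = reinterpret_cast<uintptr_t>(addr) >> 4;
            return locks_[h & (pool_size - 1)];
        }

        pthread_mutex_t locks_[pool_size];
    };

An atomic-operation fallback would then just do pool.lock(&object), perform the plain read-modify-write, and pool.unlock(&object).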
Last note: calling "sched_yield" on contention is about the *worst* thing you can do -- on contention, linux/glibc will instead call futex(..., FUTEX_WAIT, ...), which can properly suspend the thread exactly until the lock is released, *and* is about 1/2 to 2/3 the cost of sched_yield in the case that the lock is released before the thread could be put to sleep.
Whatever gains you think you may achieve, you won't.
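(For concreteness, the contention path described here is roughly the well-known three-state futex lock sketched below -- placeholder names, an illustration rather than the actual glibc code:)

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Lock word states: 0 = free, 1 = locked, 2 = locked with waiters.
    static long sys_futex(int *addr, int op, int val)
    {
        return syscall(SYS_futex, addr, op, val, 0, 0, 0);
    }

    void futex_lock(int *state)
    {
        int c = __sync_val_compare_and_swap(state, 0, 1);
        while (c != 0) {
            // Mark the lock contended, then sleep until it is released;
            // FUTEX_WAIT returns immediately if *state is no longer 2.
            if (c == 2 || __sync_val_compare_and_swap(state, 1, 2) != 0)
                sys_futex(state, FUTEX_WAIT, 2);
            // Take the lock in the contended state (2); a possible spurious
            // wake at unlock is harmless.
            c = __sync_val_compare_and_swap(state, 0, 2);
        }
    }

    void futex_unlock(int *state)
    {
        // Only if someone may be sleeping (state was 2) do we enter the kernel.
        if (__sync_fetch_and_sub(state, 1) != 1) {
            *state = 0;
            sys_futex(state, FUTEX_WAKE, 1);
        }
    }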
My work on this was backed up with extensive benchmarking, disassembly of the generated code, and other evaluation. You can find some of the results in the list archive from about two years ago. There are many different types of system with different characteristics (uniprocessor vs multiprocessor, two threads vs 10000 threads, etc etc). Two particular cases that I'll mention are:
I guess this is the code you used for testing? https://svn.chezphil.org/mutex_perf/trunk

I would say that your conclusions are valid for ARM only (I don't know that architecture or its libc peculiarities); for x86 there are some subtleties which IMHO invalidate the comparison.

Your spinlock implementation defers to __sync_lock_test_and_set, which in turn generates an "xchgl" instruction, and NOT a "lock xchgl" instruction (yes, these gcc primitives are tricky, which is why I avoid them). Section 8.2.2 of Intel's architecture manual (Volume 3B) states that only the "lock" variants of the atomic instructions serialize reads with respect to writes, so using a non-locked variant may allow reads to be scheduled out of the critical section and see invalid data. Additionally, __sync_synchronize just does nothing on x86, so this does not help either.

Your VIA C3 will be classified as "i586" (not i686, although it lacks only a few instructions) and will therefore not use NPTL/futex at all, but old LinuxThreads. (Yes, this border case is badly supported by glibc.) I guess that if you force linking to the "wrong" i686 pthread library, the pthread_mutex_* numbers will improve drastically.

Your futex implementation is however correct, and your P4 numbers, for example, match my numbers quite well -- and pthread_mutex_* is pretty close.
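If you want to verify this on your own toolchain, a one-function translation unit is enough to see what gets emitted -- the file and symbol names below are just for the example:

    /* check_tas.cpp -- compile with "g++ -O2 -S check_tas.cpp" (or objdump -d
       the object file) and look at whether the builtins expand to "xchgl",
       to a lock-prefixed instruction, or to nothing at all on your
       gcc/target combination. */
    static volatile int lock_word;

    bool try_acquire()
    {
        /* Atomically store 1 and return whether the old value was 0. */
        return __sync_lock_test_and_set(&lock_word, 1) == 0;
    }

    void release()
    {
        __sync_lock_release(&lock_word);   /* store 0 with release semantics */
    }

    void full_barrier()
    {
        __sync_synchronize();              /* see whether any instruction is emitted */
    }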
- An inline spin lock is the only thing that doesn't involve a function call, so leaf functions remain leaf functions and are themselves more likely to be inlined or otherwise optimised. On systems with small caches or small flash chips where code size is important, this is a significant benefit.
I'm not sure I'm following here -- for small cache sizes, inlining is *not* preferable, right?
- sched_yield is a great deal easier to port between operating systems than anything else, other than a spin lock.
sched_yield has one very valid use case: SCHED_RR tasks of the same priority yielding to each other. Using it on lock contention is just calling into the scheduler saying "I'm waiting for something, but I won't tell you what -- please have a guess at what is best to do next". And when you have already gone to the trouble of detecting contention, that is a relatively poor thing to do. Sometimes you may do this as an ugly hack, but I'm not sure it should be promoted as "good practice" by codifying it in a library.

A home-grown futex-based implementation is of course valid and useful, but on most architectures it will not be faster, and when it is not, I fail to see why it would not be preferable to fix the problems at the libc level instead.

Regards,
Helge