Performance of various mutex implementations on Linux

Dear All,

I have been benchmarking different mutex implementations on Linux. The variations that I have considered are the following:

NO_LOCKING - no mutex at all, as a baseline for comparison.

SPINLOCK - uses an atomic operation and a tight loop.

YIELD - uses an atomic operation like SPINLOCK, and calls sched_yield() if the lock is held by another thread.

PTHREADS - uses the pthread_mutex_*() functions; this is what the current Boost mutexes do (as I understand it). The pthread_mutex functions are implemented using the futex() system call.

FUTEX - this is my own implementation of the same algorithm that PTHREADS uses internally, based on the futex() system call, but it uses inlining to avoid a function call in the uncontended case.

Apart from the PTHREADS method, these should all be considered unproven code; if anyone would like to review them for any subtle concurrency issues or other problems you are very welcome to do so. For background details of futexes, I suggest Ulrich Drepper's paper "Futexes Are Tricky".

All of these results were obtained with g++ 4.1.2 with -O3.
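To make the descriptions above concrete, here is a rough sketch of what the YIELD and FUTEX variants do, loosely following Drepper's paper. This is not the code that was benchmarked (that is in the repository linked at the end of this message); it is only an illustration, and it assumes GCC's __sync atomic builtins and the raw futex() system call.

// Sketch only: YIELD and FUTEX style mutexes, roughly following Drepper's
// "Futexes Are Tricky".  Assumes GCC __sync builtins; see the paper for the
// memory-ordering caveats glossed over here.
#include <sched.h>          // sched_yield()
#include <unistd.h>         // syscall()
#include <sys/syscall.h>    // SYS_futex
#include <linux/futex.h>    // FUTEX_WAIT, FUTEX_WAKE

static long futex(int* uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, 0, 0, 0);
}

struct yield_mutex {                 // the YIELD variant: one 4-byte word
    int v;                           // 0 = free, 1 = held
    yield_mutex() : v(0) {}
    void lock()   { while (__sync_lock_test_and_set(&v, 1)) sched_yield(); }
    void unlock() { __sync_lock_release(&v); }
};

struct futex_mutex {                 // the FUTEX variant: one 4-byte word
    int v;                           // 0 = free, 1 = held, 2 = held with waiters
    futex_mutex() : v(0) {}
    void lock() {
        int c = __sync_val_compare_and_swap(&v, 0, 1);
        if (c == 0) return;          // fast path: no system call at all
        do {
            // Mark the lock as contended, then sleep in the kernel.
            if (c == 2 || __sync_val_compare_and_swap(&v, 1, 2) != 0)
                futex(&v, FUTEX_WAIT, 2);
        } while ((c = __sync_val_compare_and_swap(&v, 0, 2)) != 0);
    }
    void unlock() {
        if (__sync_fetch_and_sub(&v, 1) != 1) {  // old value was 2: waiters
            v = 0;
            futex(&v, FUTEX_WAKE, 1);            // wake exactly one waiter
        }
    }
};

The point to note is that both lock() and unlock() of the FUTEX variant only make a system call when the lock is actually contended, so the fast path can be inlined.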
Data Structure Sizes
--------------------

The size of the mutex object, in bytes, is as follows:

  NO_LOCKING:   0
  SPINLOCK:     4 [*]
  YIELD:        4 [*]
  FUTEX:        4 [*]
  PTHREADS:    24

[*] in all of these cases I believe that a single byte, or perhaps even a single bit in the case of SPINLOCK and YIELD, would be sufficient; there may be some obscure alignment issue though.

I believe that the large size of the pthreads mutex object is because it supports a number of variants such as recursive and checking mutexes, timeouts, etc. all in the same structure. My other implementations are all very basic, with none of these features.

Execution Time, Non-Contended Case
----------------------------------

In many uses mutexes are rarely contended, so optimising the speed of the uncontended locking and unlocking operations is a common objective. I have measured this overhead on three machines:

  egypt: 1GHz VIA C3 x86, uniprocessor, linux-2.6.19
  ghana: 266MHz Intel XScale ARM, uniprocessor, linux-2.6.19
  atum:  3.6GHz Intel Pentium 4 x86, 2 processors, linux-2.6.17

The times in nanoseconds for one lock and one unlock are:

              egypt   ghana   atum
  NO_LOCKING      0       0    0
  SPINLOCK       17      30    5.7
  YIELD          13      38    6.4
  FUTEX          50      95   26
  PTHREADS      115     986   39

Execution Time, Contended Case
------------------------------

See my detailed results, linked below, for numbers (the runtimes given are for 10^8 operations). The execution time in the contended case includes both the execution time of the lock and unlock functions and the time spent waiting for the lock to become available; it is not straightforward to distinguish these and I have not tried to. The broad pattern is that SPINLOCK, naturally, is very slow, and the other methods are relatively fast. In my benchmark, YIELD does better than FUTEX, but this may well be a feature of the structure of my benchmark; I would expect FUTEX to do better with large numbers of threads. PTHREADS gets worse more quickly than the other methods as the number of mutexes and threads increases, perhaps because the larger size of its structures is less cache-friendly.

Code Size
---------

The code size overhead in bytes for each method is as follows. (Note that this is with -O3, not -Os. My impression with -Os is that it does even less inlining than I expected, e.g. even a static function that's only called once is not inlined. I may investigate this further.)

              x86   ARM
  NO_LOCKING    0     0
  SPINLOCK     42    32
  YIELD        31    56
  FUTEX        66    96
  PTHREADS     25    28

Observations
------------

I suggest that we seriously consider using an inline futex implementation of mutexes as the default choice on Linux, rather than calling the pthread_mutex functions. This is about twice as fast (on x86) and saves a lot of data size. On the other hand, it increases code size.

The code size and speed can be improved further by using a simpler method that just calls sched_yield(). However, this method is less suitable for use as a default implementation because it will have worse complexity in applications with contention: futex() knows which thread to wake up next, while sched_yield() doesn't.

ARM Observations
----------------

Most ARM processors have only one atomic instruction, swap. My implementations of SPINLOCK, YIELD and FUTEX make use of this. PTHREADS has abysmal performance on ARM; I suspect, but have not yet checked, that it somehow emulates the atomic instructions that it doesn't have by calling into the kernel.

You can find all of the mutex implementations, benchmarking scripts and the full results here:

https://svn.chezphil.org/mutex_perf/trunk/

I hope this is of interest.

Regards, Phil.
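P.S. For anyone who wants to reproduce the uncontended figures, the measurement boils down to a loop like the sketch below. The gettimeofday() timer and the iteration count here are just illustrative assumptions; the actual scripts are in the repository above.

// Rough illustration of an uncontended lock/unlock timing loop, not the
// real benchmark harness.
#include <sys/time.h>
#include <cstdio>

template <class Mutex>
double time_uncontended(long iterations)
{
    Mutex m;
    timeval start, end;
    gettimeofday(&start, 0);
    for (long i = 0; i < iterations; ++i) {
        m.lock();                        // never contended: single thread
        m.unlock();
    }
    gettimeofday(&end, 0);
    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_usec - start.tv_usec) * 1e3;
    return elapsed_ns / iterations;      // ns per lock+unlock pair
}

// e.g.  std::printf("futex: %.1f ns\n", time_uncontended<futex_mutex>(10000000));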

Thanks for doing some useful benchmarks!
The code size and speed can be improved further by using a simpler method that just calls sched_yield(). However, this method is less suitable for use as a default implementation because it will have worse complexity in applications with contention: futex() knows which thread to wake up next, while sched_yield() doesn't.
My (limited) understanding from recent linux kernel mailing list discussions regarding supposed regressions in the scheduler is that sched_yield does not have well-defined behaviour for anything other than SCHED_FIFO tasks, and as such its implementation and performance may vary greatly depending on the particular scheduler implementation. The recent regression noted for a network benchmark called iperf involved sched_yield and saw major changes in performance. The advice was not to rely on sched_yield other than in simple 'don't care' apps. (and that iperf wasn't

Perhaps try your sched_yield tests on the most recent kernel and play with the different scheduler sched_yield strategies to see if that influences your non-contended/contended settings. You probably shouldn't end up relying on it anyway, so the point may be moot...

http://lkml.org/lkml/2007/10/1/279 might be a start
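If it helps, one way to put a yield-based test on better-defined ground is to run the benchmark threads as SCHED_FIFO, where sched_yield() does have specified semantics. A minimal sketch (my own illustration, not anything from Phil's benchmark; it needs root or CAP_SYS_NICE):

// Illustration only: run the calling thread under SCHED_FIFO so that
// sched_yield() means "go to the back of the run queue for my priority".
#include <pthread.h>
#include <sched.h>
#include <cstdio>

bool use_fifo_scheduling(int priority = 1)   // 1..99 on Linux
{
    sched_param sp;
    sp.sched_priority = priority;
    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setschedparam failed (%d)\n", rc);
        return false;                        // probably lacking privileges
    }
    return true;
}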

Hi Paul,

Paul Baxter wrote:
My (limited) understanding from recent linux kernel mailing list discussions regarding supposed regressions in the scheduler is that sched_yield does not have a well-defined behaviour for anything other than SCHED_FIFO tasks and as such its implementation and performance may vary greatly depending on a particular scheduler implementation.
Yes, undoubtedly. I would not suggest using a yield-based lock if you expect that it will actually be contended more than a very small proportion of the time. (Unless perhaps the total number of threads/processes in the system is small).
The recent regression noted for a network benchmark called iperf involved sched_yield and saw major changes in performance. http://lkml.org/lkml/2007/10/1/279 might be a start
And interestingly it was on an ARM system (a NAS box), where moving from yield to pthreads might not help much if my analysis is accurate! The fix that was suggested removed the need for the lock altogether, if I understand it correctly.
The advice was not to rely on sched_yield other than in simple 'don't care' apps. (and that iperf wasn't
Perhaps try your sched_yield tests on the most recent kernel and play with the different scheduler sched_yield strategies to see if that influences your non-contended/contended settings. You probably shouldn't end up relying on it anyway, so the point may be moot...
There are too many different access patterns for me to investigate them all, so I'm not going to make any claims except for the non-contended case. I keep hoping that someone has already written a benchmark suite for studying this sort of thing. The important thing is that my futex code will have the same algorithmic behaviour as pthreads in all cases, but with better performance, and I suggest using it for applications where the probability of contention is measurable.

Regards, Phil.

"Phil Endecott" <spam_from_boost_dev@chezphil.org> writes:
I have been benchmarking different mutex implementations on Linux. The variations that I have considered are the following:
Thanks for the benchmarks. It may well be worth replacing the pthread_mutex_t-based mutex with a futex-based one in some cases.

Anthony

--
Anthony Williams
Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Anthony Williams wrote:
"Phil Endecott" <spam_from_boost_dev@chezphil.org> writes:
I have been benchmarking different mutex implementations on Linux. The variations that I have considered are the following:
Thanks for the benchmarks. It may well be worth replacing the pthread_mutex_t-based mutex with a futex-based one in some cases.
I'd just like to encourage this effort. Where I currently work, lock performance is a big issue, and no, we don't have any small repeatable tests - yet. Most of the tests have been real system tests that don't lend themselves to simple comparisons, and they also predate my time here... What I would say is that I will certainly be looking at Phil's code and trying it out. From what I know of other parts of our code base, where performance really counts we use futexes. So I'd really like to see this in boost::thread, even if it is a preprocessor option.

Jamie
participants (4)
- Anthony Williams
- Jamie Allsop
- Paul Baxter
- Phil Endecott