Performance of various mutex implementations on Linux

Dear All,

I have been benchmarking different mutex implementations on Linux. The variations that I have considered are the following:

NO_LOCKING - no mutex at all, as a baseline for comparison.

SPINLOCK - uses an atomic operation and a tight loop.

YIELD - uses an atomic operation like SPINLOCK, and calls sched_yield() if the lock is held by another thread.

PTHREADS - uses the pthread_mutex_*() functions; this is what the current Boost mutexes do (as I understand it). The pthread_mutex functions are implemented using the futex() system call.

FUTEX - this is my own implementation of the same algorithm that PTHREADS uses internally, based on the futex() system call, but it uses inlining to avoid a function call in the uncontended case.

Apart from the PTHREADS method, these should all be considered unproven code; if anyone would like to review them for any subtle concurrency issues or other problems you are very welcome to do so. For background details of futexes, I suggest Ulrich Drepper's paper "Futexes Are Tricky".

All of these results were obtained with g++ 4.1.2 with -O3.
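To make the descriptions above concrete, here is a rough sketch of what the YIELD and FUTEX variants do, loosely following Drepper's paper. This is not the code that was benchmarked (that is in the repository linked at the end of this message); it is only an illustration, and it assumes GCC's __sync atomic builtins and the raw futex() system call.

// Sketch only: YIELD and FUTEX style mutexes, roughly following Drepper's
// "Futexes Are Tricky".  Assumes GCC __sync builtins; see the paper for the
// memory-ordering caveats glossed over here.
#include <sched.h>          // sched_yield()
#include <unistd.h>         // syscall()
#include <sys/syscall.h>    // SYS_futex
#include <linux/futex.h>    // FUTEX_WAIT, FUTEX_WAKE

static long futex(int* uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, 0, 0, 0);
}

struct yield_mutex {                 // the YIELD variant: one 4-byte word
    int v;                           // 0 = free, 1 = held
    yield_mutex() : v(0) {}
    void lock()   { while (__sync_lock_test_and_set(&v, 1)) sched_yield(); }
    void unlock() { __sync_lock_release(&v); }
};

struct futex_mutex {                 // the FUTEX variant: one 4-byte word
    int v;                           // 0 = free, 1 = held, 2 = held with waiters
    futex_mutex() : v(0) {}
    void lock() {
        int c = __sync_val_compare_and_swap(&v, 0, 1);
        if (c == 0) return;          // fast path: no system call at all
        do {
            // Mark the lock as contended, then sleep in the kernel.
            if (c == 2 || __sync_val_compare_and_swap(&v, 1, 2) != 0)
                futex(&v, FUTEX_WAIT, 2);
        } while ((c = __sync_val_compare_and_swap(&v, 0, 2)) != 0);
    }
    void unlock() {
        if (__sync_fetch_and_sub(&v, 1) != 1) {  // old value was 2: waiters
            v = 0;
            futex(&v, FUTEX_WAKE, 1);            // wake exactly one waiter
        }
    }
};

The point to note is that both lock() and unlock() of the FUTEX variant only make a system call when the lock is actually contended, so the fast path can be inlined.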
Data Structure Sizes
--------------------

The size of the mutex object, in bytes, is as follows:

  NO_LOCKING:   0
  SPINLOCK:     4 [*]
  YIELD:        4 [*]
  FUTEX:        4 [*]
  PTHREADS:    24

[*] in all of these cases I believe that a single byte, or perhaps even a single bit in the case of SPINLOCK and YIELD, would be sufficient; there may be some obscure alignment issue though.

I believe that the large size of the pthreads mutex object is because it supports a number of variants such as recursive and checking mutexes, timeouts, etc. all in the same structure. My other implementations are all very basic, with none of these features.

Execution Time, Non-Contended Case
----------------------------------

In many uses mutexes are rarely contended, so optimising the speed of the uncontended locking and unlocking operations is a common objective. I have measured this overhead on three machines:

  egypt: 1GHz VIA C3 x86, uniprocessor, linux-2.6.19
  ghana: 266MHz Intel XScale ARM, uniprocessor, linux-2.6.19
  atum:  3.6GHz Intel Pentium 4 x86, 2 processors, linux-2.6.17

The times in nanoseconds for one lock and one unlock are:

              egypt   ghana   atum
  NO_LOCKING      0       0    0
  SPINLOCK       17      30    5.7
  YIELD          13      38    6.4
  FUTEX          50      95   26
  PTHREADS      115     986   39

Execution Time, Contended Case
------------------------------

See my detailed results, linked below, for numbers (the runtimes given are for 10^8 operations). The execution time in the contended case includes both the execution time of the lock and unlock functions and the time spent waiting for the lock to become available; it is not straightforward to distinguish these and I have not tried to. The broad pattern is that SPINLOCK, naturally, is very slow, and the other methods are relatively fast. In my benchmark, YIELD does better than FUTEX, but this may well be a feature of the structure of my benchmark; I would expect FUTEX to do better with large numbers of threads. PTHREADS gets worse more quickly than the other methods as the number of mutexes and threads increases, perhaps because the larger size of its structures is less cache-friendly.

Code Size
---------

The code size overhead in bytes for each method is as follows. (Note that this is with -O3, not -Os. My impression with -Os is that it does even less inlining than I expected, e.g. even a static function that's only called once is not inlined. I may investigate this further.)

              x86   ARM
  NO_LOCKING    0     0
  SPINLOCK     42    32
  YIELD        31    56
  FUTEX        66    96
  PTHREADS     25    28

Observations
------------

I suggest that we seriously consider using an inline futex implementation of mutexes as the default choice on Linux, rather than calling the pthread_mutex functions. This is about twice as fast (on x86) and saves a lot of data size. On the other hand, it increases code size.

The code size and speed can be improved further by using a simpler method that just calls sched_yield(). However, this method is less suitable for use as a default implementation because it will have worse complexity in applications with contention: futex() knows which thread to wake up next, while sched_yield() doesn't.

ARM Observations
----------------

Most ARM processors have only one atomic instruction, swap. My implementations of SPINLOCK, YIELD and FUTEX make use of this. PTHREADS has abysmal performance on ARM; I suspect, but have not yet checked, that it somehow emulates the atomic instructions that it doesn't have by calling into the kernel.

You can find all of the mutex implementations, benchmarking scripts and the full results here:

https://svn.chezphil.org/mutex_perf/trunk/

I hope this is of interest.

Regards, Phil.
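P.S. For anyone who wants to reproduce the uncontended figures, the measurement boils down to a loop like the sketch below. The gettimeofday() timer and the iteration count here are just illustrative assumptions; the actual scripts are in the repository above.

// Rough illustration of an uncontended lock/unlock timing loop, not the
// real benchmark harness.
#include <sys/time.h>
#include <cstdio>

template <class Mutex>
double time_uncontended(long iterations)
{
    Mutex m;
    timeval start, end;
    gettimeofday(&start, 0);
    for (long i = 0; i < iterations; ++i) {
        m.lock();                        // never contended: single thread
        m.unlock();
    }
    gettimeofday(&end, 0);
    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_usec - start.tv_usec) * 1e3;
    return elapsed_ns / iterations;      // ns per lock+unlock pair
}

// e.g.  std::printf("futex: %.1f ns\n", time_uncontended<futex_mutex>(10000000));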

Thanks for doing some useful benchmarks!
The code size and speed can be improved further by using a simpler method that just calls sched_yield(). However, this method is less suitable for use as a default implementation because it will have worse complexity in applications with contention: futex() knows which thread to wake up next, while sched_yield() doesn't.
My (limited) understanding from recent linux kernel mailing list discussions regarding supposed regressions in the scheduler is that sched_yield does not have well-defined behaviour for anything other than SCHED_FIFO tasks, and as such its implementation and performance may vary greatly depending on the particular scheduler implementation. The recent regression noted for a network benchmark called iperf involved sched_yield and saw major changes in performance. The advice was not to rely on sched_yield other than in simple 'don't care' apps. (and that iperf wasn't

Perhaps try your sched_yield tests on the most recent kernel and play with the different scheduler sched_yield strategies to see if that influences your non-contended/contended settings. You probably shouldn't end up relying on it anyway, so the point may be moot...

http://lkml.org/lkml/2007/10/1/279 might be a start
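If it helps, one way to put a yield-based test on better-defined ground is to run the benchmark threads as SCHED_FIFO, where sched_yield() does have specified semantics. A minimal sketch (my own illustration, not anything from Phil's benchmark; it needs root or CAP_SYS_NICE):

// Illustration only: run the calling thread under SCHED_FIFO so that
// sched_yield() means "go to the back of the run queue for my priority".
#include <pthread.h>
#include <sched.h>
#include <cstdio>

bool use_fifo_scheduling(int priority = 1)   // 1..99 on Linux
{
    sched_param sp;
    sp.sched_priority = priority;
    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setschedparam failed (%d)\n", rc);
        return false;                        // probably lacking privileges
    }
    return true;
}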

Hi Paul,

Paul Baxter wrote:
My (limited) understanding from recent linux kernel mailing list discussions regarding supposed regressions in the scheduler is that sched_yield does not have a well-defined behaviour for anything other than SCHED_FIFO tasks and as such its implementation and performance may vary greatly depending on a particular scheduler implementation.
Yes, undoubtedly. I would not suggest using a yield-based lock if you expect that it will actually be contended more than a very small proportion of the time. (Unless perhaps the total number of threads/processes in the system is small).
The recent regression noted for a network benchmark called iperf involved sched_yield and saw major changes in performance. http://lkml.org/lkml/2007/10/1/279 might be a start
And interestingly it was on an ARM system (a NAS box), where moving from yield to pthreads might not help much if my analysis is accurate! The fix that was suggested removed the need for the lock altogether, if I understand it correctly.
The advice was not to rely on sched_yield other than in simple 'don't care' apps. (and that iperf wasn't
Perhaps try your sched_yield tests on the most recent kernel and play with the different scheduler sched_yield strategies to see if that influences your non-contended/contended settings. You probably shouldn't end up relying on it anyway, so the point may be moot...
There are too many different access patterns for me to investigate them all, so I'm not going to make any claims except for the non-contended case. I keep hoping that someone has already written a benchmark suite for studying this sort of thing. The important thing is that my futex code will have the same algorithmic behaviour as pthreads in all cases, but with better performance, and I suggest using it for applications where the probability of contention is measurable.

Regards, Phil.

"Phil Endecott" <spam_from_boost_dev@chezphil.org> writes:
I have been benchmarking different mutex implementations on Linux. The variations that I have considered are the following:
Thanks for the benchmarks. It may well be worth replacing the pthread_mutex_t-based mutex with a futex-based one in some cases.

Anthony

--
Anthony Williams
Just Software Solutions Ltd - http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Anthony Williams wrote:
"Phil Endecott" <spam_from_boost_dev@chezphil.org> writes:
I have been benchmarking different mutex implementations on Linux. The variations that I have considered are the following:
Thanks for the benchmarks. It may well be worth replacing the pthread_mutex_t-based mutex with a futex-based one in some cases.
I'd just like to encourage this effort. Where I currently work, lock performance is a big issue, and no, we don't have any small repeatable tests - yet. Most of the tests have been real system tests that don't lend themselves to simple comparisons, and they also predate my time here... What I would say is that I will certainly be looking at Phil's code and trying it out. From what I know of other parts of our code base, where performance really counts we use futexes. So I'd really like to see this in boost::thread, even if it is a preprocessor option.

Jamie
participants (4)
- Anthony Williams
- Jamie Allsop
- Paul Baxter
- Phil Endecott