
Hello Phil,

On Mon, 30 Nov 2009, Phil Endecott wrote:
Helge Bahmann wrote:
if you don't believe me, just disassemble pthread_mutex_lock/unlock on any linux system.
When I last looked at the glibc source, I believe I found quite a lot of code that needed to test whether the mutex was e.g. recursive or not, on top of that core functionality.
Sure, there is a lot of code, but the fast-path check for a normal mutex (properly marked with __builtin_expect) sits right at the top of the function and drops straight into the CAS, so with the compiler doing its job it *should* not matter. (However, the fetch of the thread id from TLS should be moved out of the fast path as well -- the compiler is probably not clever enough to move it out on its own.)
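To make the point concrete, the shape I mean is roughly the following -- just a sketch with made-up names (my_mutex, slow_path_lock), not the actual glibc code:

    // Sketch of the fast-path shape described above; field and function
    // names are placeholders, not glibc's.
    struct my_mutex {
        int futex_word;   // 0 = unlocked, 1 = locked (possibly contended)
        int kind;         // 0 = normal, otherwise recursive/errorcheck/...
    };

    // Out-of-line slow path: recursive/errorcheck handling, owner tracking
    // (the thread-id fetch from TLS belongs here), and sleeping on the futex
    // under contention.  The spin below is only a placeholder.
    void slow_path_lock(my_mutex *m)
    {
        while (!__sync_bool_compare_and_swap(&m->futex_word, 0, 1))
            ;
    }

    inline void my_mutex_lock(my_mutex *m)
    {
        // Fast path: a plain mutex, currently unlocked -- one CAS and return.
        if (__builtin_expect(m->kind == 0, 1)
            && __builtin_expect(__sync_bool_compare_and_swap(&m->futex_word, 0, 1), 1))
            return;
        slow_path_lock(m);
    }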
As for the space overhead of a pthread_mutex_t... if you cannot pay that, just use hashed locks.
A hashed lock library would be welcome here, I'm sure.
Yes, this would be a really helpful addition to Boost.Thread -- implementing a fallback for atomic operations is just not feasible without it.
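To sketch the idea (the class name, pool size and hash are arbitrary choices for illustration, not a proposed interface):

    #include <pthread.h>
    #include <stdint.h>
    #include <cstddef>

    // Hashed lock pool: the address of the object to be protected is hashed
    // into a small, fixed array of mutexes, so no per-object pthread_mutex_t
    // is needed.  Distinct objects may share a mutex; that is harmless as
    // long as each operation holds only one pool lock at a time.
    class hashed_lock_pool {
    public:
        hashed_lock_pool()
        {
            for (std::size_t i = 0; i < pool_size; ++i)
                pthread_mutex_init(&locks_[i], 0);
        }

        void lock(const volatile void *addr)   { pthread_mutex_lock(&mutex_for(addr)); }
        void unlock(const volatile void *addr) { pthread_mutex_unlock(&mutex_for(addr)); }

    private:
        enum { pool_size = 64 };   // arbitrary; keep it a power of two

        pthread_mutex_t &mutex_for(const volatile void *addr)
        {
            // Discard the low (alignment) bits, then mask into the pool.
            uintptr_t h = reinterpret_cast<uintptr_t>(addr) >> 4;
            return locks_[h & (pool_size - 1)];
        }

        pthread_mutex_t locks_[pool_size];
    };

An atomic-operation fallback would then just do pool.lock(&object), perform the plain read-modify-write, and pool.unlock(&object).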
Last note: calling "sched_yield" on contention is about the *worst* thing you can do -- on contention, linux/glibc will instead call futex(..., FUTEX_WAIT, ...), which can properly suspend the thread exactly until the lock is released, *and* is about 1/2 to 2/3 the cost of sched_yield in the case that the lock is released before the thread could be put to sleep.
Whatever gains you think you may achieve, you won't.
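(For concreteness, the contention path described here is roughly the well-known three-state futex lock sketched below -- placeholder names, an illustration rather than the actual glibc code:)

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Lock word states: 0 = free, 1 = locked, 2 = locked with waiters.
    static long sys_futex(int *addr, int op, int val)
    {
        return syscall(SYS_futex, addr, op, val, 0, 0, 0);
    }

    void futex_lock(int *state)
    {
        int c = __sync_val_compare_and_swap(state, 0, 1);
        while (c != 0) {
            // Mark the lock contended, then sleep until it is released;
            // FUTEX_WAIT returns immediately if *state is no longer 2.
            if (c == 2 || __sync_val_compare_and_swap(state, 1, 2) != 0)
                sys_futex(state, FUTEX_WAIT, 2);
            // Take the lock in the contended state (2); a possible spurious
            // wake at unlock is harmless.
            c = __sync_val_compare_and_swap(state, 0, 2);
        }
    }

    void futex_unlock(int *state)
    {
        // Only if someone may be sleeping (state was 2) do we enter the kernel.
        if (__sync_fetch_and_sub(state, 1) != 1) {
            *state = 0;
            sys_futex(state, FUTEX_WAKE, 1);
        }
    }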
My work on this was backed up with extensive benchmarking, disassembly of the generated code, and other evaluation. You can find some of the results in the list archive from about two years ago. There are many different types of system with different characteristics (uniprocessor vs multiprocessor, two threads vs 10000 threads, etc etc). Two particular cases that I'll mention are:
I guess this is the code you used for testing? https://svn.chezphil.org/mutex_perf/trunk

I would say that your conclusions are valid for ARM only (I don't know that architecture or its libc peculiarities); for x86 there are some subtleties which IMHO invalidate the comparison.

Your spinlock implementation defers to __sync_lock_test_and_set, which in turn generates an "xchgl" instruction, and NOT a "lock xchgl" instruction (yes, these gcc primitives are tricky, which is why I avoid them). Section 8.2.2 of Intel's architecture manual (Volume 3B) states that only the "lock" variants of the atomic instructions serialize reads with respect to writes, so using a non-locked variant may allow reads to be scheduled out of the critical section and see invalid data. Additionally, __sync_synchronize just does nothing on x86, so this does not help either.

Your VIA C3 will be classified as "i586" (not i686, although it lacks only a few instructions) and will therefore not use NPTL/futex at all, but old LinuxThreads. (Yes, this border case is badly supported by glibc.) I guess that if you force linking to the "wrong" i686 pthread library, the pthread_mutex_* numbers will improve drastically.

Your futex implementation is however correct, and your P4 numbers, for example, match my numbers quite well -- and pthread_mutex_* is pretty close.
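If you want to verify this on your own toolchain, a one-function translation unit is enough to see what gets emitted -- the file and symbol names below are just for the example:

    /* check_tas.cpp -- compile with "g++ -O2 -S check_tas.cpp" (or objdump -d
       the object file) and look at whether the builtins expand to "xchgl",
       to a lock-prefixed instruction, or to nothing at all on your
       gcc/target combination. */
    static volatile int lock_word;

    bool try_acquire()
    {
        /* Atomically store 1 and return whether the old value was 0. */
        return __sync_lock_test_and_set(&lock_word, 1) == 0;
    }

    void release()
    {
        __sync_lock_release(&lock_word);   /* store 0 with release semantics */
    }

    void full_barrier()
    {
        __sync_synchronize();              /* see whether any instruction is emitted */
    }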
- An inline spin lock is the only thing that doesn't involve a function call, so leaf functions remain leaf functions and are themselves more likely to be inlined or otherwise optimised. On systems with small caches or small flash chips where code size is important, this is a significant benefit.
I'm not sure I'm following here -- for small cache sizes, inlining is *not* preferable, right?
- sched_yield is a great deal easier to port between operating systems than anything else, other than a spin lock.
sched_yield has one very valid use case: SCHED_RR tasks of the same priority yielding to each other. Using it on lock contention is just calling into the scheduler saying "I'm waiting for something, but I won't tell you what -- please have a guess at what is best to do next". And when you have already gone to the trouble of detecting contention, that is a relatively poor thing to do. Sometimes you may do this as an ugly hack, but I'm not sure it should be promoted as "good practice" by codifying it in a library.

A home-grown futex-based implementation is of course valid and useful, but on most architectures it will not be faster, and when it is not, I fail to see why it would not be preferable to fix the problems at the libc level instead.

Regards,
Helge