
On Tue, 1 Dec 2009, Anthony Williams wrote:
Helge Bahmann <hcb@chaoticmind.net> writes:
On Mon, 30 Nov 2009, Phil Endecott wrote:
My work on this was backed up with extensive benchmarking, disassembly of the generated code, and other evaluation. You can find some of the results in the list archive from about two years ago. There are many different types of system with different characteristics (uniprocessor vs multiprocessor, two threads vs 10000 threads, etc.). Two particular cases that I'll mention are:
I guess this is the code you used for testing?
https://svn.chezphil.org/mutex_perf/trunk
I would say that your conclusions are valid for ARM only (I don't know the architecture or libc peculiarities); for x86 there are some subtleties which IMHO invalidate the comparison.
Your spinlock implementation defers to __sync_lock_test_and_set, which in turn generates an "xchgl" instruction, and NOT a "lock xchgl" instruction (yes, these gcc primitives are tricky, which is why I avoid them).
On x86 these are equivalent: the LOCK prefix is automatically asserted for XCHG. See the XCHG instruction docs in the Intel manual, volume 2B.
Yes, you're right, I forgot this odd one :/ Which still makes me wonder what is going on: it's the first time I have seen "lock xchgl" being noticeably faster than "lock cmpxchgl".
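
For reference, here is a minimal sketch of the two lock-acquire primitives being compared, assuming GCC's __sync builtins on x86. The type and function names (demo_spinlock, demo_trylock_xchg, demo_trylock_cas, demo_unlock) are made up for illustration and are not taken from the mutex_perf code.

```c
/* Hypothetical minimal spinlock, for illustration only. */
typedef struct {
    volatile int flag;   /* 0 = free, 1 = held */
} demo_spinlock;

/* Try to take the lock with an atomic exchange.
 * On x86, GCC typically compiles this builtin to "xchgl" with a memory
 * operand, which asserts LOCK implicitly (no explicit prefix needed). */
static int demo_trylock_xchg(demo_spinlock *l)
{
    /* Returns the previous value; 0 means we acquired the lock. */
    return __sync_lock_test_and_set(&l->flag, 1) == 0;
}

/* Try to take the lock with compare-and-swap.
 * On x86, GCC typically compiles this builtin to "lock cmpxchgl". */
static int demo_trylock_cas(demo_spinlock *l)
{
    /* Succeeds only if flag was 0, and sets it to 1 atomically. */
    return __sync_bool_compare_and_swap(&l->flag, 0, 1);
}

/* Release the lock: stores 0 with release semantics. */
static void demo_unlock(demo_spinlock *l)
{
    __sync_lock_release(&l->flag);
}
```

Compiling this with "gcc -O2 -S" on x86 and reading the assembly should show an unprefixed xchgl for the first function and lock cmpxchgl for the second, which is the difference discussed above. The exact output depends on the compiler version and flags.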