Re: [boost] [fiber] new version in vault

1 Dec 2009


      On Tue, 1 Dec 2009, Anthony Williams wrote:
...
Helge Bahmann <hcb@chaoticmind.net> writes:
...
On Mon, 30 Nov 2009, Phil Endecott wrote:
...
My work on this was backed up with extensive benchmarking,
disassembly of the generated code, and other evaluation.  You can
find some of the results in the list archive from about two years
ago.  There are many different types of system with different
characteristics (uniprocessor vs multiprocessor, two threads vs
10000 threads, etc etc).  Two particular cases that I'll mention
are:
I guess this is the code you used for testing?
https://svn.chezphil.org/mutex_perf/trunk
I would say that your conclusions are valid for ARM only (I don't know
the architecture or libc peculiarities), for x86 there are some
subtleties which IMHO invalidate the comparison.
Your spinlock implementation defers to __sync_lock_test_and_set, which
in turn generates an "xchgl" instruction, and NOT a "lock xchgl"
instruction (yes, these gcc primitives are tricky which is why I avoid
them).
On x86 these are equivalent --- the LOCK prefix is automatically
asserted for XCHG. See the XCHG instruction docs in the Intel manual
volume 2B.
Yes, you're right, I forgot this odd one :/ That still makes me wonder what
is going on -- it's the first time I have seen "lock xchgl" be noticeably
faster than "lock cmpxchgl".
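
For anyone following along, here is a rough sketch of the two acquisition
styles being compared. It is not the code from the mutex_perf repository;
the type and function names (demo_spinlock, xchg_try_lock, cas_try_lock,
and so on) are made up for illustration. With GCC on x86,
__sync_lock_test_and_set is normally emitted as XCHG with a memory operand
(implicitly locked, as noted above), and __sync_bool_compare_and_swap as
LOCK CMPXCHG; other compilers or targets may differ.

```c
/* Illustrative only: two minimal spinlocks built on GCC's legacy __sync
   builtins. All names below are hypothetical. */

typedef struct {
    volatile int locked;   /* 0 = free, 1 = held */
} demo_spinlock;

/* Exchange-based attempt: unconditionally stores 1 and returns the old
   value, so the lock was acquired only if the old value was 0. */
int xchg_try_lock(demo_spinlock *l)
{
    return __sync_lock_test_and_set(&l->locked, 1) == 0;
}

/* Compare-and-swap attempt: stores 1 only if the current value is 0. */
int cas_try_lock(demo_spinlock *l)
{
    return __sync_bool_compare_and_swap(&l->locked, 0, 1);
}

/* Naive spin loops; a real lock would add a pause/backoff or a
   fallback such as yielding or a futex wait. */
void xchg_lock(demo_spinlock *l)
{
    while (!xchg_try_lock(l)) { /* spin */ }
}

void cas_lock(demo_spinlock *l)
{
    while (!cas_try_lock(l)) { /* spin */ }
}

/* Release: writes 0 with release semantics. */
void spin_unlock(demo_spinlock *l)
{
    __sync_lock_release(&l->locked);
}
```

Comparing the disassembly of xchg_lock and cas_lock (for example with
gcc -O2 -S) is a quick way to check which instruction each builtin
actually produces on a given target.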