
Hi Phil!

Thanks for your interest. I appreciate any help with ARM, as I don't have this architecture available.

On Monday 30 November 2009 17:02:14, Phil Endecott wrote:
[snip]
Architecture v6 introduced 32-bit load-locked/store-conditional instructions. Architecture v7 introduced 16- and 8-bit versions.
The library already has infrastructure in place to emulate 8- and 16-bit atomics by "embedding" them into a properly aligned 32-bit atomic (created "on the fly" through appropriate pointer casts). FWIW ppc and Alpha require this already, as they do not have 8/16-bit ll/sc. This is of course slower than native 8-/16-bit versions, but is workable. I will shortly be adding a small howto on adding platform support to the library.
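To illustrate the "embedding" idea, here is a minimal sketch of an 8-bit compare-and-swap built on top of a 32-bit one. It is not the library's actual code: it uses std::atomic instead of pointer casts, the helper name byte_compare_exchange is hypothetical, and byte 0 is simply taken to be the least significant byte (a version working through pointer casts would have to account for endianness).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: compare-and-swap on one byte (index 0..3) that lives
// inside a 32-bit atomic word. Byte 0 is the least significant byte here.
inline bool byte_compare_exchange(std::atomic<std::uint32_t>& word,
                                  unsigned index,
                                  std::uint8_t& expected,
                                  std::uint8_t desired,
                                  std::memory_order order = std::memory_order_seq_cst)
{
    assert(index < 4);
    const unsigned shift = index * 8;
    const std::uint32_t mask = std::uint32_t(0xff) << shift;

    std::uint32_t old_word = word.load(std::memory_order_relaxed);
    for (;;) {
        const std::uint8_t current =
            static_cast<std::uint8_t>((old_word & mask) >> shift);
        if (current != expected) {
            // The embedded byte differs: report the value seen and fail.
            expected = current;
            return false;
        }
        // Replace only the embedded byte, keep the three neighbouring bytes.
        const std::uint32_t new_word =
            (old_word & ~mask) | (std::uint32_t(desired) << shift);
        if (word.compare_exchange_weak(old_word, new_word, order,
                                       std::memory_order_relaxed))
            return true;
        // Either a neighbouring byte changed or the CAS failed spuriously;
        // old_word now holds the fresh value, so just retry.
    }
}
```

The retry loop is where the extra cost comes from: a concurrent update to any of the three neighbouring bytes forces another round, even though the byte of interest did not change.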
ARM Linux has kernel support that provides compare-and-swap even on processors that don't support it by guaranteeing to not interrupt code in certain address ranges. This has the cost of a function call, i.e. it's slower than inline assembler but a lot faster than a system call. Kernels that don't support this are now sufficiently old that I think they can be ignored. Newer versions of gcc may use this mechanism when the atomic builtins are used, but versions of gcc that don't do this are sufficiently widespread that they should still be supported efficiently.
Are these functions part of libc, glibc, or the vdso?
I believe that OS X on ARM (i.e. the iPhone) always runs on architecture v6 or newer. However Apple supply a version of gcc that is too old to support ARM atomics via the builtins. The "recommended" way to do atomics is via a set of function calls described here: http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man3/atomic.3.html I have not looked at what these functions do or tried to benchmark them. They are also available on other OS X platforms.
These should be easy to use, but:
- the *Barrier versions are still stronger than what is required (see below)
- there are no "Load with Barrier" and "Store with Barrier" operations; these would have to be emulated with compare_exchange
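A rough sketch of what such an emulation could look like, assuming the only available primitive is a compare-and-swap with barrier that reports success or failure. The function cas_barrier below is a hypothetical stand-in modelled with std::atomic so that the example compiles anywhere; it is not one of the OS X calls, and load_barrier / store_barrier are illustrative names.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for a platform "compare-and-swap with barrier"
// primitive that only reports whether the swap happened.
inline bool cas_barrier(std::int32_t old_value, std::int32_t new_value,
                        std::atomic<std::int32_t>& target)
{
    return target.compare_exchange_strong(old_value, new_value,
                                          std::memory_order_seq_cst);
}

// "Load with barrier": read a candidate value, then confirm it by swapping
// the value with itself. A successful swap carries the barrier.
inline std::int32_t load_barrier(std::atomic<std::int32_t>& target)
{
    for (;;) {
        const std::int32_t seen = target.load(std::memory_order_relaxed);
        if (cas_barrier(seen, seen, target))
            return seen;
    }
}

// "Store with barrier": retry the swap until the new value is installed
// over whatever value was observed.
inline void store_barrier(std::atomic<std::int32_t>& target,
                          std::int32_t value)
{
    for (;;) {
        const std::int32_t seen = target.load(std::memory_order_relaxed);
        if (cas_barrier(seen, value, target))
            return;
    }
}
```

Note that the emulated load performs a write to the cache line, so it is considerably more expensive than a plain load followed by the appropriate barrier would be.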
I note that you don't seem to use the gcc atomic builtins even on platforms where they have worked for a while e.g. x86. Any reason for that?
On x86 it would not matter; on all other platforms, the intrinsics have the unfortunate side effect of always acting as (usually bi-directional) memory barriers. There are however legitimate use cases for operations without such barriers. For example, the following operation (equivalent to __sync_fetch_and_add):

  atomic<int>::fetch_add(1, memory_order_acq_rel)

is 2 to 3 times slower on ppc than the version not enforcing memory ordering:

  atomic<int>::fetch_add(1, memory_order_relaxed)

If you always use fully-fenced versions, then any lock-free algorithm will usually be noticeably *slower* than the platform's native mutex lock/unlock operations (which use only the weakest barriers necessary), making the whole exercise rather pointless.

Cheers
Helge