Notice: Boost.Atomic (atomic operations library)

Hello,

as promised I have started extracting an atomic operations library. Current state is available at: http://www.chaoticmind.net/~hcb/projcets/boost.atomic

It implements boost::atomic<TYPE>, which faithfully mimics std::atomic<TYPE> as specified in the C++0x draft standard. As allowed by the standard, operations transparently fall back to locking when the underlying architecture does not support the requested operation, so the library already contains a "fallback" implementation that works on all platforms (using mutex from boost::thread).

It currently natively supports gcc/x86, gcc/powerpc and gcc/alpha (I can vouch for the correctness of the implementations on these targets). It contains some entirely untested support for building implementations from CAS operations on other systems (e.g. _InterlockedCompareExchange on Windows), so I would greatly appreciate any feedback on whether it works on any particular platform.

There is some preliminary documentation, but not in boostdoc format -- after unsuccessfully struggling with bjam/boostbook & friends for a few hours I simply gave up and reverted to trusty old doxygen :(

Is there any step-by-step guide on how to create, build and document a new library? I could really use that, as the boost build and documentation system is pretty alien to an autotools-accustomed guy like me.

Best regards
Helge
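A minimal usage sketch of the interface described above (header path and enum spelling assumed to mirror the draft standard; untested):

    #include <boost/atomic.hpp>

    boost::atomic<int> counter(0);

    int main()
    {
        counter.fetch_add(1);  // defaults to memory_order_seq_cst
        int v = counter.load(boost::memory_order_acquire);
        return v == 1 ? 0 : 1;
    }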

On Sunday 29 November 2009 23:49:52, Helge Bahmann wrote:
Is there any step-by-step guide on how to create, build and document a new library? I could really use that as the boost build and documentation system is pretty alien to an autotools-accustomed guy like me.
There is a guide on how to set up the build system: http://www.boost.org/doc/libs/1_41_0/doc/html/boostbook/getting/started.html

There is none on how to write documentation; this is how I got started:

- try whether you can successfully bjam inside libs/intrusive/doc/
- copy intrusive/doc/Jamfile.v2 and intrusive/doc/intrusive.qbk to your library dir (must be inside the boost tree!)
- change anything that looks "intrusive" in the jamfile to your library
- open the .qbk file and delete all sections except the one containing "xinclude autodoc.xml"
- bjam

You should get an html documentation built that contains the typical boost reference documentation for your library.

Helge Bahmann wrote:
Hello,
as promised I have started extracting an atomic operations library. Current state is available at:
Here is the correct link: http://www.chaoticmind.net/~hcb/projects/boost.atomic/ [snip]

Helge Bahmann wrote:
as promised I have started extracting an atomic operations library.
Hi Helge,

This will be a very useful contribution to Boost - thanks for proposing it. I have worked on a number of different ARM platforms with their own idiosyncrasies and I would be happy to help add support for them. The challenge is that there are numerous combinations of processor version, compiler version and OS to think about, and it's not clear what the best choice is, especially if binary compatibility is needed. Here is the situation as I understand it (and maybe other ARM users can confirm/deny this):

Architecture v6 introduced 32-bit load-locked/store-conditional instructions. Architecture v7 introduced 16- and 8-bit versions. Earlier architecture versions are still sufficiently widespread that efficient support is still desirable. I've never found a gcc macro that indicates the target architecture version passed to -march. Newer versions of gcc may generate these instructions when the atomic builtins are used, but versions of gcc that don't do this are sufficiently widespread that they should still be supported efficiently.

ARM Linux has kernel support that provides compare-and-swap even on processors that don't support it, by guaranteeing not to interrupt code in certain address ranges. This has the cost of a function call, i.e. it's slower than inline assembler but a lot faster than a system call. Kernels that don't support this are now sufficiently old that I think they can be ignored. Newer versions of gcc may use this mechanism when the atomic builtins are used, but versions of gcc that don't do this are sufficiently widespread that they should still be supported efficiently.

I believe that OS X on ARM (i.e. the iPhone) always runs on architecture v6 or newer. However, Apple supply a version of gcc that is too old to support ARM atomics via the builtins. The "recommended" way to do atomics is via a set of function calls described here: http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPag... I have not looked at what these functions do or tried to benchmark them. They are also available on other OS X platforms.

I note that you don't seem to use the gcc atomic builtins even on platforms where they have worked for a while, e.g. x86. Any reason for that?

Cheers, Phil.

Hi Phil!

Thanks for your interest, and I appreciate any help for ARM, as I don't have this architecture available.

On Monday 30 November 2009 17:02:14, Phil Endecott wrote: [snip]
Architecture v6 introduced 32-bit load-locked/store-conditional instructions. Architecture v7 introduced 16- and 8-bit versions.
The library already has infrastructure in place to emulate 8- and 16-bit atomics by "embedding" them into a properly aligned 32-bit atomic (created "on the fly" through appropriate pointer casts). FWIW ppc and Alpha require this already, as they do not have 8/16-bit ll/sc. This is of course slower than native 8-/16-bit versions, but is workable. I will shortly be adding a small howto on adding platform support to the library.
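To sketch the embedding idea (not the library's actual code; cas32 here is a hypothetical 32-bit compare-and-swap that updates *expected on failure, and a little-endian layout is assumed):

    #include <stdint.h>

    /* hypothetical 32-bit CAS: updates *expected on failure, true on success */
    bool cas32(volatile uint32_t *p, uint32_t *expected, uint32_t desired);

    uint8_t exchange8_via_cas32(volatile uint8_t *addr, uint8_t desired)
    {
        /* locate the aligned 32-bit word containing the byte */
        volatile uint32_t *word =
            (volatile uint32_t *)((uintptr_t)addr & ~(uintptr_t)3);
        unsigned shift = ((uintptr_t)addr & 3) * 8;
        uint32_t mask = (uint32_t)0xff << shift;

        uint32_t expected = *word, replacement;
        do {
            /* splice the new byte into the current word value */
            replacement = (expected & ~mask) | ((uint32_t)desired << shift);
        } while (!cas32(word, &expected, replacement));
        return (uint8_t)((expected >> shift) & 0xff);
    }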
ARM Linux has kernel support that provides compare-and-swap even on processors that don't support it by guaranteeing to not interrupt code in certain address ranges. This has the cost of a function call, i.e. it's slower than inline assembler but a lot faster than a system call. Kernels that don't support this are now sufficiently old that I think they can be ignored. Newer versions of gcc may use this mechanism when the atomic builtins are used, but versions of gcc that don't do this are sufficiently widespread that they should still be supported efficiently.
are these functions part of libc, glibc or the vdso?
I believe that OS X on ARM (i.e. the iPhone) always runs on architecture v6 or newer. However Apple supply a version of gcc that is too old to support ARM atomics via the builtins. The "recommended" way to do atomics is via a set of function calls described here: http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man3/atomic.3.html I have not looked at what these functions do or tried to benchmark them. They are also available on other OS X platforms.
these should easily be usable, but:
- the *Barrier versions are still stronger than what is required (see below)
- there are no "Load with Barrier" and "Store with Barrier" operations; these would have to be emulated with compare_exchange
I note that you don't seem to use the gcc atomic builtins even on platforms where they have worked for a while e.g. x86. Any reason for that?
on x86 it would not matter; on all other platforms, the intrinsics have the unfortunate side-effect of always acting as (usually bi-directional) memory barriers. There are, however, legitimate use cases; for example the following operation (equivalent to __sync_fetch_and_add):

atomic<int>::fetch_add(1, memory_order_acq_rel)

is 2 to 3 times slower on ppc than the version not enforcing memory ordering:

atomic<int>::fetch_add(1, memory_order_relaxed)

If you always use the fully-fenced versions, then any lock-free algorithm will usually be noticeably *slower* than the platform's native mutex lock/unlock operations (which use only the weakest barriers necessary), making the whole exercise rather pointless.

Cheers
Helge
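For a concrete instance of such a legitimate use case, a plain event counter only needs atomicity, not ordering (sketch):

    boost::atomic<unsigned int> hits(0);

    // atomicity is needed, but no ordering with respect to other data:
    void count_hit(void)
    {
        hits.fetch_add(1, boost::memory_order_relaxed);
    }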

Helge Bahmann wrote:
Hi Phil!
Thanks for your interest, and I appreciate any help for Arm, as I don't have this architecture available.
Currently my ARM v4 (XScale) dev system is a bit broken, but I might be able to fix it. I have working v6/v7 systems.
On Monday 30 November 2009 17:02:14, Phil Endecott wrote: [snip]
Architecture v6 introduced 32-bit load-locked/store-conditional instructions. Architecture v7 introduced 16- and 8-bit versions.
The library already has infrastructure in place to emulate 8- and 16-bit atomics by "embedding" them into a properly aligned 32-bit atomic (created "on the fly" through appropriate pointer casts). FWIW ppc and Alpha require this already, as they do not have 8/16-bit ll/sc. This is of course slower than native 8-/16-bit versions, but is workable.
I will shortly be adding a small howto on adding platform support to the library.
That will be useful.
ARM Linux has kernel support that provides compare-and-swap even on processors that don't support it by guaranteeing to not interrupt code in certain address ranges. This has the cost of a function call, i.e. it's slower than inline assembler but a lot faster than a system call. Kernels that don't support this are now sufficiently old that I think they can be ignored. Newer versions of gcc may use this mechanism when the atomic builtins are used, but versions of gcc that don't do this are sufficiently widespread that they should still be supported efficiently.
are these functions part of libc, glibc or the vdso?
It's something provided by the kernel in a vdso-like way; I'm not sure if it's actually vdso. For the details google for __kernel_cmpxchg and/or look at entry-armv.S in the kernel source.
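For reference, the helper lives at a fixed address in the ARM vector page rather than behind an ordinary symbol; per the kernel's documented kuser-helper ABI it is reached roughly like this (sketch):

    /* returns 0 iff *ptr contained oldval and newval was stored */
    typedef int (arm_kernel_cmpxchg_t)(int oldval, int newval, volatile int *ptr);
    #define arm_kernel_cmpxchg (*(arm_kernel_cmpxchg_t *)0xffff0fc0)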
I believe that OS X on ARM (i.e. the iPhone) always runs on architecture v6 or newer. However Apple supply a version of gcc that is too old to support ARM atomics via the builtins. The "recommended" way to do atomics is via a set of function calls described here: http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man3/atomic.3.html I have not looked at what these functions do or tried to benchmark them. They are also available on other OS X platforms.
these should easily be usable, but:
- the *Barrier versions are still stronger than what is required (see below)
- there are no "Load with Barrier" and "Store with Barrier" operations; these would have to be emulated with compare_exchange
Since these devices are (currently) all uniprocessor, many of these issues are (currently) unimportant.
I note that you don't seem to use the gcc atomic builtins even on platforms where they have worked for a while e.g. x86. Any reason for that?
on x86 it would not matter; on all other platforms, the intrinsics have the unfortunate side-effect of always acting as (usually bi-directional) memory barriers. There are, however, legitimate use cases; for example the following operation (equivalent to __sync_fetch_and_add):
atomic<int>::fetch_add(1, memory_order_acq_rel)
is 2 to 3 times slower on ppc than the version not enforcing memory ordering:
atomic<int>::fetch_add(1, memory_order_relaxed)
If you always use the fully-fenced versions, then any lock-free algorithm will usually be noticeably *slower* than the platform's native mutex lock/unlock operations (which use only the weakest barriers necessary), making the whole exercise rather pointless.
Right. Cheers, Phil.

Helge Bahmann wrote:
I will shortly be adding a small howto on adding platform support to the library.
Please let me know when this is ready. I currently have a bit of time to spend on this. In summary, the cases that I need to handle are:

1. Linux kernel provided memory-barrier and CAS operations (only);
2. Asm load-locked/store-conditional for words (only);
3. As 2 but also for smaller types.

I could probably work this out from the source, but it would save some time to have some hints....

Phil.

Hi Phil,

On Thursday 03 December 2009 15:58:03, Phil Endecott wrote:
Helge Bahmann wrote:
I will shortly be adding a small howto on adding platform support to the library.
Please let me know when this is ready. I currently have a bit of time to spend on this.
There is one planned internal API change (switching to the four-parameter compare_exchange_*) still pending, but that should be straightforward afterwards. If you want to try, I have started writing things up; the current state is at: http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/architecture_suppo... (the tarball also contains the generated docs; I am still busy reworking them to use boostdoc)
In summary, the cases that I need to handle are:
1. Linux kernel provided memory-barrier and CAS operations (only);
do any of these ARM platforms (these are pre-v6, probably?) actually support SMP? If not, then the barriers will probably be NOPs
2. Asm load-locked/store-conditional for words (only);
3. As 2 but also for smaller types.
sounds like this is going to be one of the most complicated platforms, so I really appreciate your experience here...
I could probably work this out from the source, but it would save some time to have some hints....
It's probably still necessary to delve into the source, but I hope that the write-up provides a good entry point (and of course don't hesitate to comment; there are lots of things to improve).

Best regards
Helge

Helge Bahmann wrote:
Hi Phil,

On Thursday 03 December 2009 15:58:03, Phil Endecott wrote:
Helge Bahmann wrote:
I will shortly be adding a small howto on adding platform support to the library.
Please let me know when this is ready. I currently have a bit of time to spend on this.
There is one planned internal API change (switching to the four-parameter compare_exchange_*) still pending, but that should be straightforward afterwards. If you want to try, I have started writing things up; the current state is at:
http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/architecture_suppo...
Thanks, just what I need.
(the tarball also contains the generated docs; I am still busy reworking them to use boostdoc)
In summary, the cases that I need to handle are:
1. Linux kernel provided memory-barrier and CAS operations (only);
do any of these ARM platforms (these are pre-v6, probably?) actually support SMP? If not, then the barriers will probably be NOPs
I think the barrier is a DMB instruction, but in principle the kernel could put nothing there on uniprocessors. There's still the small overhead of the call, which we could consider omitting if we were certain that it was a uniprocessor.

Anyway, in this case I think I need to implement load, store and compare_exchange_weak using the kernel-provided functions and add your __build_atomic_from_minimal and __build_atomic_from_larger_type on top.

(BTW, why do you use leading __s? I was under the impression that such identifiers were reserved.)
2. Asm load-locked/store-conditional for words (only);
3. As 2 but also for smaller types.
sounds like this is going to be one of the most complicated platforms, so I really appreciate your experience here...
Hmmm.... Would it be possible to add another set of builders that could use load-locked and store-conditional functions from a lower layer? This could reduce the amount of assembler needed. Cheers, Phil.

On Thu, 3 Dec 2009, Phil Endecott wrote:
1. Linux kernel provided memory-barrier and CAS operations (only);
do any of these ARM platforms (these are pre-v6, probably?) actually support SMP? If not, then the barriers will probably be NOPs
I think the barrier is a DMB instruction, but in principle the kernel could put nothing there on uniprocessors. There's still the small overhead of the call, which we could consider omitting if we were certain that it was a uniprocessor.
out of curiosity -- does DMB also enforce ordered MMIO access? That would be stronger than required. If this is always an "emulated" CAS, then I don't think DMB would be required under any circumstances -- if the system is uni-processor, then obviously no barrier is required. If it is multi-processor, then the emulation requires an internal spin-lock in the kernel, which must itself already include sufficient memory barriers.
Anyway, in this case I think I need to implement load, store and compare_exchange_weak using the kernel-provided functions and add your __build_atomic_from_minimal and __build_atomic_from_larger_type on top.
I'm not sure whether the kernel-provided CAS is restarted or aborted on interruption; if it is restarted, then it will not fail spuriously and qualifies as compare_exchange_strong -- in that case I would recommend additionally implementing "exchange" by hand, having c_ex_weak call c_ex_strong, and using __build_atomic_from_exchange (yes, it's not that well-named).
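Deriving "exchange" from a strong CAS is mechanical; in terms of the std-style members it comes down to roughly (sketch):

    T exchange(T desired, memory_order order) volatile
    {
        T expected = load(memory_order_relaxed);
        /* on failure, expected is updated to the current value; just retry */
        while (!compare_exchange_strong(expected, desired, order))
            ;
        return expected;
    }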
(BTW, why do you use leading __s? I was under the impression that such identifiers were reserved.)
it's a habit of mine to name really internal stuff that way; I can change it if it collides with the boost coding style
2. Asm load-locked/store-conditional for words (only);
3. As 2 but also for smaller types.
sounds like this is going to be one of the most complicated platforms, so I really appreciate your experience here...
Hmmm....
Would it be possible to add another set of builders that could use load-locked and store-conditional functions from a lower layer? This could reduce the amount of assembler needed.
The problem is that ll/sc are quite constrained on the architectures that I know of -- most processors will clear the reservation established by ll when there is a memory reference to the same cacheline before the sc, and some will do this for _any_ memory reference, so the ll/sc loop could effectively live-lock. I don't think it is possible to constrain the compiler sufficiently to prevent it from accidentally inserting such memory references if you allow C++ code between these instructions (either -O0 builds not inlining the wrapper functions, or -O2 with very aggressive inlining moving code in between), so I fear that exposing ll/sc would be rather brittle.

Best regards,
Helge
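To illustrate the constraint: on ARMv6, a 32-bit CAS has to keep the entire ll/sc sequence inside a single asm block, e.g. (untested sketch):

    bool cas32(volatile int *ptr, int &expected, int desired)
    {
        int observed, failed;
        __asm__ __volatile__(
            "1: ldrex   %0, [%2]\n"      /* observed = *ptr, set reservation */
            "   cmp     %0, %3\n"
            "   bne     2f\n"            /* value mismatch: bail out */
            "   strex   %1, %4, [%2]\n"  /* failed = store-conditional status */
            "   cmp     %1, #0\n"
            "   bne     1b\n"            /* reservation lost: retry */
            "2:\n"
            : "=&r" (observed), "=&r" (failed)
            : "r" (ptr), "r" (expected), "r" (desired)
            : "cc", "memory");
        bool success = (observed == expected);
        expected = observed;
        return success;
    }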

Helge Bahmann wrote:
On Thu, 3 Dec 2009, Phil Endecott wrote:
(BTW, why do you use leading __s? I was under the impression that such identifiers were reserved.)
it's a habit of mine to name really internal stuff that way; I can change it if it collides with the boost coding style
It collides with the standard. The compiler/stdlib is explicitly allowed to declare keywords and macros containing __ (or starting with _ and an uppercase letter) and break your code.

Helge Bahmann wrote:
On Thu, 3 Dec 2009, Phil Endecott wrote:
1. Linux kernel provided memory-barrier and CAS operations (only);
do any of these ARM platforms (these are pre-v6, probably?) actually support SMP? If not, then the barriers will probably be NOPs
I think the barrier is a DMB instruction, but in principle the kernel could put nothing there on uniprocessors. There's still the small overhead of the call, which we could consider omitting if we were certain that it was a uniprocessor.
out of curiosity -- does DMB also enforce ordered MMIO access? That would be stronger than required.
I don't know.
If this is always an "emulated" CAS
It could be an ll/sc sequence on systems that have those instructions. I don't think that counts as "emulated" in this sense, so memory barriers are needed - right?
then I don't think DMB would be required under any circumstances -- if the system is uni-processor, then obviously no barrier is required. If it is multi-processor, then the emulation requires an internal spin-lock in the kernel, which must itself already include sufficient memory barriers.
Anyway, in this case I think I need to implement load, store and compare_exchange_weak using the kernel-provided functions and add your __build_atomic_from_minimal and __build_atomic_from_larger_type on top.
I'm not sure if the kernel-provided CAS is restarted or aborted on interruption
I'm pretty sure that currently it's restarted, but that may not be guaranteed.
if it is restarted, then it will not fail spuriously and qualifies as compare_exchange_strong -- in that case I would recommend additionally implementing "exchange" by hand, having c_ex_weak call c_ex_strong, and using __build_atomic_from_exchange (yes, it's not that well-named).
I believe you, but I'm getting out of my depth here.
(BTW, why do you use leading __s? I was under the impression that such identifiers were reserved.)
it's a habit of mine to name really internal stuff that way; I can change it if it collides with the boost coding style
I think that would be a good idea - "namespace detail" sufficiently identifies these things as being internal. While you're at it, I suggest adding some license/copyright headers.
2. Asm load-locked/store-conditional for words (only);
3. As 2 but also for smaller types.
sounds like this is going to be one of the most complicated platforms, so I really appreciate your experience here...
Hmmm....
Would it be possible to add another set of builders that could use load-locked and store-conditional functions from a lower layer? This could reduce the amount of assembler needed.
The problem is that ll/sc are quite constrained on the architectures that I know of -- most processors will clear the reservation established by ll when there is a memory reference to the same cacheline before the sc, and some will do this for _any_ memory reference, so the ll/sc loop could effectively live-lock. I don't think it is possible to constrain the compiler sufficiently to prevent it from accidentally inserting such memory references if you allow C++ code between these instructions (either -O0 builds not inlining the wrapper functions, or -O2 with very aggressive inlining moving code in between), so I fear that exposing ll/sc would be rather brittle.
I suppose that's true, but it's unfortunate. Maybe someone can think of a trick to help us. Phil.

On Fri, 4 Dec 2009, Phil Endecott wrote:
Helge Bahmann wrote:
If this is always an "emulated" CAS
It could be an ll/sc sequence on systems that have those instructions. I don't think that counts as "emulated" in this sense, so memory barriers are needed - right?
Ah, I didn't think it would be used at all if the platform could do ll/sc... you're right; in that case you will definitely need the barriers
(BTW, why do you use leading __s? I was under the impression that such identifiers were reserved.)
it's a habit of mine to name really internal stuff that way; I can change it if it collides with the boost coding style
I think that would be a good idea - "namespace detail" sufficiently identifies these things as being internal. While you're at it, I suggest adding some license/copyright headers.
okay, as Peter Dimov pointed out it collides with the stdlib, so I will change that.

Helge

Hi Helge,

In your load and store methods, you have code something like:

    T i;
    T v = *reinterpret_cast<volatile const T *>(&i);
    *reinterpret_cast<volatile T *>(&i) = v;

Shouldn't this be const_cast? I.e. something like:

    T v = const_cast<volatile const T &>(i);
    const_cast<volatile T &>(i) = v;

(Comments from language lawyers welcome!)

Phil.

Hi Helge,

In your x86 code you have:

    bool compare_exchange_strong(T &e, T d, memory_order order=memory_order_seq_cst) volatile
    {
        T prev=e;
        __asm__ __volatile__("lock cmpxchgb %1, %2\n"
            : "=a" (prev)
            : "q" (d), "m" (i), "a" (e)
            : "memory");
        bool success=(prev==e);
        e=prev;
        return success;
    }

Can you explain why 'e' is a reference and why you assign back to it? Maybe it would help if you could write out what that asm does in pseudo-code. The kernel_cmpxchg that I have does:

    if (*ptr == oldval) {
        *ptr = newval;
        return 0;
    } else {
        return !0;
    }

I think I can just write:

    bool compare_exchange_strong(T &e, T d, memory_order order=memory_order_seq_cst) volatile
    {
        return kernel_cmpxchg(e,d,&i) == 0;
    }

but the extra stuff in your x86 version makes me suspect there is more to it.

Phil.

On Friday 04 December 2009 18:06:31, Phil Endecott wrote:
Hi Helge,
In your x86 code you have:
bool compare_exchange_strong(T &e, T d, memory_order order=memory_order_seq_cst) volatile
{
    T prev=e;
    __asm__ __volatile__("lock cmpxchgb %1, %2\n"
        : "=a" (prev)
        : "q" (d), "m" (i), "a" (e)
        : "memory");
    bool success=(prev==e);
    e=prev;
    return success;
}
Can you explain why 'e' is a reference and why you assign back to it?
the standard requires the "found" value to be passed back, so you don't need to perform another "load" on retrying the operation
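in pseudo-code, the whole operation behaves as if the following were executed atomically:

    // compare_exchange_strong(expected, desired):
    if (value == expected) {
        value = desired;
        return true;
    } else {
        expected = value;   // pass the "found" value back to the caller
        return false;
    }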
Maybe it would help if you could write out what that asm does in pseudo-code. The kernel_cmpxchg that I have does:
there is a pseudo-code explanation in the boost::atomic class documentation

Helge

Helge Bahmann wrote:
On Friday 04 December 2009 18:06:31, Phil Endecott wrote:
Hi Helge,
In your x86 code you have:
bool compare_exchange_strong(T &e, T d, memory_order order=memory_order_seq_cst) volatile
{
    T prev=e;
    __asm__ __volatile__("lock cmpxchgb %1, %2\n"
        : "=a" (prev)
        : "q" (d), "m" (i), "a" (e)
        : "memory");
    bool success=(prev==e);
    e=prev;
    return success;
}
Can you explain why 'e' is a reference and why you assign back to it?
the standard requires the "found" value to be passed back, so you don't need to perform another "load" on retrying the operation
Maybe it would help if you could write out what that asm does in pseudo-code. The kernel_cmpxchg that I have does:
there is a pseudo-code explanation in the boost::atomic class documentation
Ah sorry, I didn't realise that compare_exchange_* were in the external interface; for some reason I thought they were internal.

OK, so since my kernel_cmpxchg() function doesn't return the old value but only a flag, I need something like:

    bool success = kernel_cmpxchg(e,d,&i);
    if (!success) e = i;

But i may have changed again between the kernel_cmpxchg() and that assignment. Is that OK? Should I use load() there? If so, what memory order is needed?

Cheers, Phil.

On Fri, 4 Dec 2009, Phil Endecott wrote:
Helge Bahmann wrote:
there is a pseudo-code explanation in the boost::atomic class documentation
Ah sorry, I didn't realise that compare_exchange_* were in the external interface; for some reason I thought they were internal.
they are external, but I have now changed the implementation so that backends must provide the "four-operand" compare_exchange_*, which takes two memory_order operands (one for a successful exchange, one for a failed one) -- but give it a day until it is committed into the public repo. The three-operand version is then provided by the front-end.
OK, so since my kernel_cmpxchg() function doesn't return the old value but only a flag, I need something like:
bool success = kernel_cmpxchg(e,d,&i);
if (!success) e = i;
But i may have changed again between the kernel_cmpxchg() and that assignment. Is that OK?
that's okay -- there are two cases:

- the "new" value (loaded after the failed cmpxchg) does not match "expected": the caller receives a more up-to-date value and retries with this one; after all it does not matter whether it failed with the "old" or the "new" mismatching value
- the "new" value (loaded after the failed cmpxchg) matches "expected": compare_exchange_weak is allowed to fail "spuriously"
Should I use load() there? If so, what memory order is needed?
the load is internal, so load it "relaxed"; you only have to maintain the memory order for the "total" cmpxchg, successful or failed.

Best regards,
Helge
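Putting both answers together, the wrapper could end up roughly as follows (sketch; it uses the kernel_cmpxchg convention from earlier in the thread where 0 means success, and the memory_order arguments go unused because the kernel helper already fences):

    bool compare_exchange_strong(T &expected, T desired,
        memory_order, memory_order) volatile
    {
        if (kernel_cmpxchg(expected, desired, &i) == 0)
            return true;
        expected = load(memory_order_relaxed);  /* internal reload; relaxed suffices */
        return false;
    }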

On Sun, Nov 29, 2009 at 4:49 PM, Helge Bahmann <hcb@chaoticmind.net> wrote:
It contains some entirely untested support for building implementations from CAS operations on other systems (e.g. _InterlockedCompareExchange on win), so I would greatly appreciate any feedback if it works/doesn't on any particular platform.
Just to clarify: delegating to the Interlocked* API on Windows is going to be an officially supported feature, but currently just isn't tested well enough, right? I'm just unclear whether it's going to be officially supported, or if this is more experimental and maybe not officially supported in the first release.

Zach

On Monday 30 November 2009 17:33:03, Zachary Turner wrote:
On Sun, Nov 29, 2009 at 4:49 PM, Helge Bahmann <hcb@chaoticmind.net> wrote:
It contains some entirely untested support for building implementations from CAS operations on other systems
okay, this was badly worded -- the infrastructure for building atomic operations from just a single platform-specific CAS operation is finished and well tested; it is just that I do not have every conceivable compiler/os combination available to test whether some particular platform-CAS works (or even compiles). In particular, I have no Windows system available to me for testing, so things will take some time, as I constantly have to ask others to help me out with implementation and/or testing.

If however, by sheer luck, I have managed to hit the right ifdef/include combination required for _InterlockedCompareExchange on the first attempt and without any compile-testing, then it will compile and run correctly already.
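To illustrate the kind of derivation that infrastructure performs, e.g. building exchange from the bare CAS (untested sketch; _InterlockedCompareExchange returns the previous value of *Destination):

    long exchange(volatile long *target, long desired)
    {
        long expected = *target;
        for (;;) {
            long prev = _InterlockedCompareExchange(target, desired, expected);
            if (prev == expected)
                return prev;   /* CAS succeeded; prev is the old value */
            expected = prev;   /* lost a race; retry with the fresh value */
        }
    }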
(e.g. _InterlockedCompareExchange on win), so I would greatly appreciate any feedback if it works/doesn't on any particular platform.
Just to clarify: delegating to the Interlocked* API on Windows is going to be an officially supported feature, but currently just isn't tested well enough, right? I'm just unclear whether it's going to be officially supported, or if this is more experimental and maybe not officially supported in the first release.
I would not dare call something a "release" that didn't support the _Interlocked* family of operations ;) But I'm optimistic that this will probably be sorted out by the end of the week -- and there will be a properly optimized implementation using all of the _Interlocked* functions instead of always falling back to _InterlockedCompareExchange.

Best regards,
Helge

On Mon, Nov 30, 2009 at 11:25 AM, Helge Bahmann <hcb@chaoticmind.net> wrote:
On Monday 30 November 2009 17:33:03, Zachary Turner wrote:
Just to clarify: delegating to the Interlocked* API on Windows is going to be an officially supported feature, but currently just isn't tested well enough, right? I'm just unclear whether it's going to be officially supported, or if this is more experimental and maybe not officially supported in the first release.
I would not dare call something a "release" that didn't support the _Interlocked* family of operations ;)
That's how I felt too upon first reading the original post, but I didn't want to word it quite so strongly, so I'm glad we're in agreement :)

Zach

Helge Bahmann wrote:
On Monday 30 November 2009 17:33:03, Zachary Turner wrote:
On Sun, Nov 29, 2009 at 4:49 PM, Helge Bahmann <hcb@chaoticmind.net> wrote:
It contains some entirely untested support for building implementations from CAS operations on other systems
okay, this was badly worded -- the infrastructure for building atomic operations from just a single platform-specific CAS operation is finished and well tested; it is just that I do not have every conceivable compiler/os combination available to test whether some particular platform-CAS works (or even compiles). In particular, I have no Windows system available to me for testing, so things will take some time, as I constantly have to ask others to help me out with implementation and/or testing.
If however, by sheer luck, I have managed to hit the right ifdef/include combination required for _InterlockedCompareExchange on the first attempt and without any compile-testing, then it will compile and run correctly already.
FYI, there is boost/detail/interlocked.hpp, which adds quite a bit of compatibility among different Windows versions. It supports _Interlocked intrinsics, as well as native API calls.

Helge Bahmann wrote:
Hello,
as promised I have started extracting an atomic operations library. Current state is available at:
Thanks a lot for this proposal. It was one of the most awaited additions (at least, awaited by me :)). I haven't looked at the implementation yet, but the doc doesn't state that enums are valid template parameters. Is it really so?

On Monday 30 November 2009 18:54:20, Andrey Semashev wrote:
Helge Bahmann wrote:
Hello,
as promised I have started extracting an atomic operations library. Current state is available at:
Thanks a lot for this proposal. It was one of the most awaited additions (at least, awaited by me :)).
I haven't looked at the implementation yet, but the doc doesn't state that enums are valid template parameters. Is it really so?
Currently yes, but after checking the C++0x draft I will relax that requirement shortly (anything that is a POD will be a valid template argument for "atomic<TYPE>") -- I am still trying to figure out the easiest way of "killing" the fetch_add & similar members.

Regards,
Helge
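One conventional C++03 way of "killing" those members is to move them into a base class that is only selected for integral types (structural sketch, not necessarily what the library will do):

    #include <boost/memory_order.hpp>
    #include <boost/type_traits/is_integral.hpp>

    template<typename T, bool IsIntegral>
    class atomic_arithmetic {};   // no fetch_add & friends for non-integrals

    template<typename T>
    class atomic_arithmetic<T, true> {
    public:
        T fetch_add(T v, boost::memory_order order = boost::memory_order_seq_cst) volatile;
        /* fetch_sub, fetch_and, fetch_or, fetch_xor ... */
    };

    template<typename T>
    class atomic : public atomic_arithmetic<T, boost::is_integral<T>::value> {
        /* load, store, exchange, compare_exchange_* for every POD T */
    };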

Hello Helge,

I am trying to use Boost.Atomic in my project but am experiencing the following two problems:

- boost/atomic/memory_order.hpp: enum memory_order is redeclared from boost/memory_order.hpp (1.37)
  + however, boost/memory_order.hpp does not define memory_order_consume; from a quick glance that enum value does not seem to be used (it appears in some switch statements but only as a fallthrough), so I simply removed boost/atomic/memory_order.hpp and removed all uses of memory_order_consume
- boost::atomic<T>::atomic() and boost::atomic<T>::atomic(T v) declare the variable "verify_valid_atomic_integral" but never use it, which breaks -Werror
  + to fix this I simply added the line "(void)verify_valid_atomic_integral;" to both constructors

I have not tested the changes beyond checking that they compile. A patch is attached.

With kind regards,
Mikael Olenfalk

On Sun, Nov 29, 2009 at 11:49 PM, Helge Bahmann <hcb@chaoticmind.net> wrote:
Hello,
as promised I have started extracting an atomic operations library. Current state is available at:
http://www.chaoticmind.net/~hcb/projcets/boost.atomic
It implements boost::atomic<TYPE> which faithfully mimics std::atomic<TYPE> as specified in the C++0x draft standard. As allowed by the standard, operations transparently fall back to locking when the underlying architecture does not support the requested operation, so the library already contains a "fallback" implementation that works on all platforms (using mutex from boost::thread).
It currently natively supports gcc/x86, gcc/powerpc and gcc/alpha (I can vouch for the correctness of the implementations on these targets). It contains some entirely untested support for building implementations from CAS operations on other systems (e.g. _InterlockedCompareExchange on win), so I would greatly appreciate any feedback if it works/doesn't on any particular platform.
There is some preliminary documentation, but not in boostdoc format -- after unsuccessfully struggling with bjam/boostbook & friends for a few hours I simply gave up and reverted to trusty old doxygen :(
Is there any step-by-step guide on how to create, build and document a new library? I could really use that as the boost build and documentation system is pretty alien to an autotools-accustomed guy like me.
Best regards
Helge

Hello Mikael, On Tue, 1 Dec 2009, Mikael Olenfalk wrote:
Hello Helge,
I am trying to use Boost.Atomic in my project but am experiencing the following two problems:
- boost/atomic/memory_order.hpp: enum memory_order is redeclared from boost/memory_order.hpp (1.37)
Yes, I realized that today; the definition in boost/memory_order.hpp is outdated. It should be replaced with boost/atomic/memory_order.hpp or augmented to include "memory_order_consume" (it is contained in more "recent" proposals of the C++0x standard, see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2643.html).
+ however, boost/memory_order.hpp does not define memory_order_consume; from a quick glance that enum value does not seem to be used (it appears in some switch statements but only as a fallthrough), so I simply removed boost/atomic/memory_order.hpp and removed all uses of memory_order_consume
it *is* used, to force a memory barrier on Alpha (in places where most other architectures don't need one). Don't remove it; this is wrong.
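The canonical consume use case is a data-dependent load through a freshly loaded pointer, e.g. (sketch, hypothetical node type):

    struct node { int payload; };
    boost::atomic<node *> head(0);

    int reader(void)
    {
        node *n = head.load(boost::memory_order_consume);
        /* data-dependent load: Alpha needs a real barrier here,
           most other architectures do not */
        return n ? n->payload : 0;
    }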
- boost::atomic<T>::atomic() and boost::atomic<T>::atomic(T v) declare the variable "verify_valid_atomic_integral" but never use it, which breaks -Werror
  + to fix this I simply added the line "(void)verify_valid_atomic_integral;" to both constructors
This is a placeholder to verify that the passed type is in fact an integral type (lots of things break if it is not) -- this is not the best way to achieve it; I plan to refactor the code and remove it entirely (but this will not be in the repository until this evening).

Thanks for testing & best regards
Helge

Helge Bahmann wrote:
Hello Mikael,
On Tue, 1 Dec 2009, Mikael Olenfalk wrote:
Hello Helge,
I am trying to use Boost.Atomic in my project but am experiencing the following two problems:
- boost/atomic/memory_order.hpp: enum memory_order is redeclared from boost/memory_order.hpp (1.37)
Yes, I realized that today; the definition in boost/memory_order.hpp is outdated. It should be replaced with boost/atomic/memory_order.hpp or augmented to include "memory_order_consume" (it is contained in more "recent" proposals of the C++0x standard, see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2643.html).
I will add _consume to boost/memory_order.hpp. FWIW, the enum values in it are chosen so that one can use

    if( mo & memory_order_acquire )
    {
        // insert trailing fence
    }

and

    if( mo & memory_order_release )
    {
        // insert leading fence
    }

instead of a switch.

I think that your PPC trailing fence (isync) is wrong for loads. isync should only be used after a conditional jump (if one wants acquire semantics). For loads, you need either a trailing lwsync, or a fake never-taken branch + isync.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html

Your use of a single lock for seq_cst operations has given me pause, but now that I've thought about it some more, I think that this is not necessary. Per-location locks also provide sequential consistency.

There is already boost/smart_ptr/detail/spinlock_pool.hpp that you may use for the fallback - if you like.
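One assignment of enum values consistent with that bit test would be (sketch; see boost/memory_order.hpp for the actual values):

    enum memory_order
    {
        memory_order_relaxed = 0,
        memory_order_acquire = 1,
        memory_order_release = 2,
        memory_order_acq_rel = 3, // acquire | release
        memory_order_seq_cst = 7  // acq_rel plus a distinguishing bit
    };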

On Tuesday 01 December 2009 15:30:12, Peter Dimov wrote:
I will add _consume to boost/memory_order.hpp.
thanks, memory_order.hpp is going to be removed from Boost.Atomic
I think that your PPC trailing fence (isync) is wrong for loads. isync should only be used after a conditional jump (if one wants acquire semantics). For loads, you need either a trailing lwsync, or a fake never-taken branch + isync.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
this is very helpful indeed; I will fix the implementation
Your use of a single lock for seq_cst operations has given me pause, but now that I've thought about it some more, I think that this is not necessary. Per-location locks also provide sequential consistency.
yes you're right -- I guess this was guided by the mistaken idea that the actual memory accesses must be serialized (instead of just considering the observable behaviour)
There is already boost/smart_ptr/detail/spinlock_pool.hpp that you may use for the fallback - if you like.
this looks like what I need, but perhaps such a thing should live under "thread" instead of smart_ptr?

Thanks for the feedback!
Helge

Hi, Helge Bahmann wrote:
Hello,
as promised I have started extracting an atomic operations library. Current state is available at:
Glad to hear the Boost atomic library is started. This is a major library needed by a lot of other Boost libraries and of course also by the users. Unfortunately I don't have the competence to help you with the implementation, but I'm sure you will have good support from the Boost community.

BTW, is it your intention to implement the complete interface from the C++0x draft standard atomic proposal?

I've added your library to the LibrariesUnderConstruction wiki page https://svn.boost.org/trac/boost/wiki/LibrariesUnderConstruction#Boost.Atomi.... Let me know if you want to add something to this wiki.

Best,
Vicente

On Wed, 2 Dec 2009, Vicente Botet Escriba wrote:
BTW, is it your intention to implement the complete interface from the C++0x draft standard atomic proposal?
The full interface of the "atomic<T>" template for all permissible built-in and user-defined types, mapping them to platform-specific atomic operations when possible (almost[1] done), plus atomic_flag and typedefs for all "useful" integral types.

I'm not sure the free-standing functions are of much value (personally I dislike them for C++); I will certainly add them if someone wants them, but it would probably be preferable for them not to live in the root namespace "boost".

There are some subtle limitations to what can be done with "memory_order_consume" without compiler support, but for all the use cases I can currently imagine (e.g. the double-checked singleton pattern) it will fully work as advertised even without.

[1] except for: pointer arithmetic and four-operand compare_exchange_*
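For illustration, the double-checked pattern mentioned above could look like this with the proposed interface (sketch; widget and init_mutex are hypothetical, and the mutex only guards construction):

    #include <boost/thread/mutex.hpp>

    struct widget { /* ... */ };

    boost::mutex init_mutex;
    boost::atomic<widget *> instance(0);

    widget *get_instance()
    {
        widget *p = instance.load(boost::memory_order_consume);
        if (!p) {
            boost::mutex::scoped_lock guard(init_mutex);
            p = instance.load(boost::memory_order_relaxed); // ordered by the mutex
            if (!p) {
                p = new widget;
                instance.store(p, boost::memory_order_release);
            }
        }
        return p;
    }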

Helge Bahmann wrote:
On Wed, 2 Dec 2009, Vicente Botet Escriba wrote:
I'm not sure the free-standing functions are of much value (personally I dislike them for C++); I will certainly add them if someone wants them, but it would probably be preferable for them not to live in the root namespace "boost".
I think free-standing functions can be useful if one needs to operate on a POD type (which atomic<T> isn't). For example, one could safely use the functions with local statics.

Andrey Semashev wrote:
Helge Bahmann wrote:
On Wed, 2 Dec 2009, Vicente Botet Escriba wrote:
I'm not sure the free-standing functions are of much value (personally I dislike them for C++); I will certainly add them if someone wants them, but it would probably be preferable for them not to live in the root namespace "boost".
I think free-standing functions can be useful if one needs to operate on a POD type (which atomic<T> isn't). For example, one could safely use the functions with local statics.

I believe Helge's plan was to provide an implementation of the C++0x atomics - PODs are not part of the 'atomic' interface - so I would vote against PODs.

On Thursday 03 December 2009 19:18:10, Andrey Semashev wrote:
Oliver Kowalke wrote:
I believe Helge's plan was to provide an implementation of the C++0x atomics - PODs are not part of the 'atomic' interface - so I would vote against PODs.
My bad. I was under the impression that these functions operate on integral types rather than classes. Sorry.
Supporting PODs is probably not possible in a portable fashion -- there may be architectures that simply cannot read/write bytes individually (for example the Cell SPU always reads/writes 16 bytes at a time), so an "atomic" uint8_t placed in the same word as a "non-atomic" uint8_t simply won't work (I don't think atomic<uint8_t> is required to be one byte in size, so it may provide the requisite "padding" itself).

Helge
participants (10)
- Andrey Semashev
- Helge Bahmann
- Jamie Allsop
- Mikael Olenfalk
- Oliver Kowalke
- Peter Dimov
- Phil Endecott
- Stefan Strasser
- Vicente Botet Escriba
- Zachary Turner