boost.atomic and boost.lockfree backports

Hartmut and I have been using modified versions of Boost.Lockfree and Boost.Atomic in our codebase at work. I've put git repositories of our versions of both libraries up on github, so that the lockfree/atomic authors can integrate our bug fixes and/or feature extensions if they wish. The code is under the BSL.

In addition to a number of bug fixes and tweaks, our variants contain the following enhancements:

* boost.atomic: 128-bit lockfree atomics for MSVC 2008+ (using _InterlockedCompareExchange128 and SSE2 intrinsics) and GCC-compatible POSIX compilers (using 16-byte __sync intrinsics). This code is x86-64 specific and disabled by default. Define BOOST_ATOMIC_HAVE_SSE2 to enable it in x86-64 MSVC 2008+ environments. Define BOOST_ATOMIC_HAVE_SSE2, BOOST_ATOMIC_HAVE_GNU_SYNC_16 and BOOST_ATOMIC_HAVE_GNU_ALIGNED_16 to enable it for GCC-compatible compilers. The following command-line flags are also needed for GCC-compatible compilers: -mcx16 (enables the 16-byte sync intrinsics) and -msse2. (A sketch of such a 16-byte CAS follows this message.)

* boost.lockfree: A growable lockfree deque, based on algorithms by M. M. Michael. Backed by 128-bit atomics. This is the first lockfree data structure I've written, so take it with a grain of salt. There's one bug that I'm aware of; unfortunately, it occurs infrequently, so I haven't had much luck tracking it down.

Links:

https://github.com/brycelelbach/boost.atomic
https://github.com/brycelelbach/boost.lockfree

P.S. I don't have a list of the specific bug fixes on hand; I'll have to go through the SVN logs for that. The only non-trivial one that I know of is the problem with intel-linux that I found and fixed (the Intel compiler doesn't define the *amd64* macros, and they were used for x86-64 detection).

--
Bryce Lelbach aka wash
boost-spirit.com
px.cct.lsu.edu
github.com/lll-project
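For reference, here is a minimal sketch of what the 16-byte compare-and-swap underneath these atomics can look like; the cas128 name and the exact wrapper shape are illustrative, not Boost.Atomic's actual code. On MSVC it relies on _InterlockedCompareExchange128; on GCC-compatible x86-64 compilers it relies on the 16-byte __sync intrinsics, which require -mcx16.

    // hypothetical sketch of a 16-byte CAS on x86-64; both variants compile
    // down to lock cmpxchg16b, and the destination must be 16-byte aligned
    #if defined(_MSC_VER)
    #include <intrin.h>

    // returns true on success; on failure, 'expected' (laid out as
    // { low, high }) is updated with the value observed in memory
    inline bool cas128(__int64 volatile* dest,
                       __int64 exchange_high, __int64 exchange_low,
                       __int64* expected)
    {
        return _InterlockedCompareExchange128(dest, exchange_high,
                                              exchange_low, expected) != 0;
    }
    #else
    // build with -mcx16 so the compiler emits the 16-byte __sync intrinsics
    inline bool cas128(volatile __uint128_t* dest,
                       __uint128_t expected, __uint128_t desired)
    {
        return __sync_bool_compare_and_swap(dest, expected, desired);
    }
    #endif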

hello bryce, i will have a look at the code and possibly integrate the deque into the review version of boost.lockfree ... some comments in advance:
* boost.atomic: 128-bit lockfree atomics for MSVC 2008+ (using _InterlockedCompareExchange128 and SSE2 intrinsics) and GCC-compatible POSIX compilers (using 16-byte __sync intrinsics). This code is x86-64 specific and disabled by default. Define BOOST_ATOMIC_HAVE_SSE2 to enable it in x86-64 MSVC 2008+ environments.
... atomics and sse2? how do these work together? afair, sse2 reads/writes are not even guaranteed to be atomic, and the two concepts are quite orthogonal ...
* boost.lockfree: A growable lockfree deque, based on algorithms by M. M. Michael. Backed by 128-bit atomics. This is the first lockfree data structure I've written, so take it with a grain of salt. There's one bug that I'm aware of; unfortunately, it occurs infrequently, so I haven't had much luck tracking it down.
will try to review the code and get back to you ... finding bugs in lockfree code via testing is almost impossible; one needs to analyse the code very carefully ...
P.S. I don't have a list of the specific bug fixes on hand; I'll have to go through the SVN logs for that. The only non-trivial one that I know of is the problem with intel-linux that I found and fixed (the Intel compiler doesn't define the *amd64* macros, and they were used for x86-64 detection).
it would be very helpful to see the specific diffs. only getting a tarball (or a repository without history) without knowing the branching point makes it quite time-consuming to find the actual differences between your version and helge's/mine. importing your private svn repo into git via git-svn, then rewriting the resulting git repository with git filter-branch to strip the parts that are not related to atomic or lockfree, would be the cleanest way to go ...

cheers, tim

--
tim@klingt.org
http://tim.klingt.org

May music never just become another way of making money.
    Keith Tippett

* boost.atomic: 128-bit lockfree atomics for MSVC 2008+ (using _InterlockedCompareExchange128 and SSE2 intrinsics) and GCC-compatible POSIX compilers (using 16-byte __sync intrinsics). This code is x86-64 specific and disabled by default. Define BOOST_ATOMIC_HAVE_SSE2 to enable it in x86-64 MSVC 2008+ environments.
... atomics and sse2? how do these work together? afair, sse2 reads/writes are not even guaranteed to be atomic, and the two concepts are quite orthogonal ...
double-checked with intel's software developer's manual, volume 3a, section 8.1.1: there is no way to read or write 128-bit words atomically with plain loads and stores. afaik, the only way to implement them correctly is to use cmpxchg16b ...

--
tim@klingt.org
http://tim.klingt.org

Art is either a complaint or do something else.
    John Cage quoting Jasper Johns
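In other words, even a plain 128-bit read has to go through the CAS. A minimal sketch of an atomic 128-bit load, assuming GCC-style intrinsics on x86-64 with -mcx16 (the function name is made up):

    // CAS with expected == desired == 0: if *addr happens to be 0 it is
    // swapped with itself, otherwise the CAS fails and returns the current
    // value -- either way we obtain an atomic snapshot of *addr
    inline __uint128_t atomic_load128(volatile __uint128_t* addr)
    {
        return __sync_val_compare_and_swap(addr, 0, 0);
    }

Note that this turns every load into a write-for-ownership of the cache line, which is one reason 128-bit atomics are comparatively expensive.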

On 2011-05-29 12:29:34, Tim Blechmann wrote:
double-checked with intel's software developer's manual, volume 3a, section 8.1.1: there is no way to read or write 128-bit words atomically with plain loads and stores. afaik, the only way to implement them correctly is to use cmpxchg16b ...
Thanks for catching this, Tim; I'll fix it in our code. I'll get you a list of the specific changes later today.

--
Bryce Lelbach aka wash
boost-spirit.com
px.cct.lsu.edu
github.com/lll-project

double-checked with intel's software developer's manual, volume 3a, section 8.1.1: there is no way to read or write 128-bit words atomically with plain loads and stores. afaik, the only way to implement them correctly is to use cmpxchg16b ...
That's the answer we got from Intel and AMD architects a year ago. Unlike 8-byte aligned memory accesses on x86, there are no atomicity guarantees for 16-byte memory accesses on x64 (except for cmpxchg16b), though I believe all implementations at the time did provide it in practice.

A few notes, after a one-minute look over one file only: Some of the functions seem pretty weird. E.g., why is the 128-bit fetch_add implemented in terms of a packed add? Don't the individual components of the vector just wrap around on overflow without carrying over? And __declspec(align) takes a number of bytes, not bits.

BTW, VC 9 had a fairly random set of supported intrinsics for various locked operations on x86/x64. I added a couple more to make this a bit more consistent across bit widths in VC10, to support my implementation of <atomic>.

Also, alignment is not really alignment in x86 VC++. Sadly, VC++ for x86 has no strong stack alignment: by default, there is no stronger guarantee than that ESP is 4-byte aligned on function entry. Hence, locals with stronger alignment requirements need dynamic stack alignment (interprocedural optimizations can sometimes elide that setup). Here doubles and long longs differ from types with explicit alignment requirements (i.e. __declspec(align)): only the latter are really guaranteed to produce aligned addresses when declared as locals. I'm not sure what aligned_storage and friends do, but just because __alignof reports some value does not mean that locals will be properly aligned.

-hg
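A small self-contained illustration of the packed-add point above: _mm_add_epi64 adds two independent 64-bit lanes, so a carry out of the low lane is silently dropped rather than propagated into the high lane, which is why it cannot implement a 128-bit fetch_add. (Compiled for x86-64, where _mm_cvtsi128_si64 is available.)

    #include <emmintrin.h>  // SSE2
    #include <iostream>

    int main()
    {
        __m128i a = _mm_set_epi64x(0, 0xFFFFFFFFFFFFFFFFull); // hi = 0, lo = 2^64 - 1
        __m128i b = _mm_set_epi64x(0, 1);
        __m128i r = _mm_add_epi64(a, b); // per-lane add, no carry between lanes

        unsigned long long lo = (unsigned long long)_mm_cvtsi128_si64(r);
        unsigned long long hi = (unsigned long long)_mm_cvtsi128_si64(_mm_srli_si128(r, 8));

        // prints lo=0 hi=0; a true 128-bit add would yield hi=1
        std::cout << "lo=" << lo << " hi=" << hi << "\n";
        return 0;
    }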

hello bryce, i went through your implementation and the paper in detail and have a few comments:

* the deque should contain an atomic<anchor> instead of an anchor wrapping an atomic<pair>, as it makes the code way more expressive. currently there is no way to see when/why/which operations are actually atomic ...
* related to that, it is not really clear which memory barriers are required where and why ... specifying them explicitly instead of using the default memory_order_seq_cst may be very useful in terms of performance
* you can use compare_exchange_weak instead of compare_exchange_strong if you check the result (a sketch follows this message)
* it might make sense to add padding between pool and anchor, and in the node between left and right
* // FIXME: Should we check both pointers here? ... both pointers are updated atomically ... so i do not think so ...
* line 417/478 ... no need to increment the ABA tag: you change the pointer from a non-zero value to NULL, so no ABA problem can occur
* line 440/501 ... no need to increment the ABA tag here, either: you change r or l to prev, which cannot be identical, so no ABA problem can occur
* afaict, the deque will only be lockfree on non-ancient x86-64 architectures supporting 16-byte CAS ... no ppc/x86 ... probably no mips/alpha. therefore a fixed-sized version of the algorithm will probably be very useful, using integer indices instead of pointers ...
* tagged_ptr_pair could simply be implemented as std::pair<tagged_ptr, tagged_ptr>, or as struct { tagged_ptr left, right; }, no? that should make the implementation more generic and not assume any platform specifics ...
* an integration into the boost.lockfree testsuite would be nice ...

cheers, tim
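To illustrate the atomic<anchor> and compare_exchange_weak points, here is a sketch of the shape the update loop could take; anchor_t, its packing, and the memory orders shown are placeholders to be checked against the algorithm, not the deque's actual code:

    #include <boost/atomic.hpp>
    #include <stdint.h>

    // two packed pointer/tag words, 16 bytes total, so boost::atomic<anchor_t>
    // can be lockfree where a 128-bit CAS is available; the 3-state status
    // can live in spare tag bits
    struct anchor_t
    {
        uint64_t left;
        uint64_t right;
    };

    boost::atomic<anchor_t> anchor;

    void update_anchor()
    {
        anchor_t expected = anchor.load(boost::memory_order_acquire);
        anchor_t desired;
        do {
            desired = expected;
            // ... compute the new anchor value in 'desired' here ...

            // the weak CAS may fail spuriously, which is harmless inside a
            // retry loop and can be cheaper than compare_exchange_strong;
            // on failure, 'expected' is refreshed with the observed value
        } while (!anchor.compare_exchange_weak(expected, desired,
                                               boost::memory_order_release,
                                               boost::memory_order_acquire));
    }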

On Mon, May 30, 2011 at 6:26 AM, Tim Blechmann <tim@klingt.org> wrote:
* afaict, the deque will only be lockfree on non-ancient x86-64 architectures supporting 16-byte CAS ... no ppc/x86 ... probably no mips/alpha. therefore a fixed-sized version of the algorithm will probably be very useful, using integer indices instead of pointers ...
At least on Windows, you can use the high 20 bits of a 64-bit pointer as an ABA counter. This is what Windows' own lock-free stack (InterlockedPushEntrySList) does. Even so, a fixed-size version of the algorithm would still have its uses.

--
Cory Nelson
http://int64.org
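A sketch of the packing trick Cory describes; he mentions 20 spare bits on Windows, while this illustration conservatively assumes 48-bit canonical x86-64 user-space addresses and steals the top 16 bits for the tag (all names here are made up):

    #include <stdint.h>

    static const unsigned tag_bits = 16;
    static const unsigned ptr_bits = 64 - tag_bits;                 // 48
    static const uint64_t ptr_mask = (uint64_t(1) << ptr_bits) - 1;

    // pack a user-space pointer (high bits zero) with an ABA tag (tag < 2^16)
    inline uint64_t pack(void* p, uint64_t tag)
    {
        return (reinterpret_cast<uint64_t>(p) & ptr_mask) | (tag << ptr_bits);
    }

    inline void* unpack_ptr(uint64_t v)
    {
        return reinterpret_cast<void*>(v & ptr_mask);
    }

    inline uint64_t unpack_tag(uint64_t v)
    {
        return v >> ptr_bits;
    }

The tag still wraps after 2^16 updates, so this only makes the ABA window small, not impossible.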

* afaict, the deque will only be lockfree on non-ancient x86-64 architectures supporting 16-byte CAS ... no ppc/x86 ... probably no mips/alpha. therefore a fixed-sized version of the algorithm will probably be very useful, using integer indices instead of pointers ...
At least on Windows, you can use the high 20 bits of a 64-bit pointer as an ABA counter. This is what Windows' own lock-free stack (InterlockedPushEntrySList) does. Even so, a fixed-size version of the algorithm would still have its uses.
:) ... this algorithm requires two pointer/tag pairs and a 3-state enum to be accessed atomically ...

--
tim@klingt.org
http://tim.klingt.org

There's no such entity as "most people". These are generalities. All generalities are meaningless. You've got to pin it down to a specific person doing a specific thing at a specific time and space.
    William S. Burroughs
participants (4)
- Bryce Lelbach
- Cory Nelson
- Holger Grund
- Tim Blechmann