[lock-free] CDS - yet another lock-free library

Hi, I'd like to present to the boost community the CDS (Concurrent Data Structure) C++ library at libcds.sourceforge.net. It is an attempt to generalize well-known lock-free (and a few lock-based) data structures developed by famous researchers (M. Michael, N. Shavit and others). The library is mostly header-based; only the kernel of the light-weight memory managers resides in a small dynamic library. The CDS library supports a set of modern processor architectures: x86, amd64, ia64, sparc; the supported compilers are GCC 4 and MS Visual C++ 2008. The latest CDS release, 0.6.0, is developed close to the C++ Memory Model proposal. I need recommendations from the boost community about:
1. Are the atomic primitives implemented correctly for each of the supported processors?.. I'm not an expert in this area...
2. Is the memory order constraint (from the C++1x proposal) chosen correctly for the lock-free data structure implementations?.. I know it is a very large and magic question.
Maybe somebody could analyze the library code and find the errors. And I would be very glad if anybody joined the CDS project. It is a very interesting area of computer science, with nonstandard algorithms and crazy implementation (and notably debugging).

I'd like to present to the boost community the CDS (Concurrent Data Structure) C++ library at libcds.sourceforge.net. It is an attempt to generalize well-known lock-free (and a few lock-based) data structures developed by famous researchers (M. Michael, N. Shavit and others). The library is mostly header-based; only the kernel of the light-weight memory managers resides in a small dynamic library.
... interesting ... unfortunately hazard pointers are patented ... i guess it is similar with the pass-the-buck algorithm :/ i am currently quite busy with other things, but maybe i can have a look at your implementation in the next few days ... tim -- tim@klingt.org http://tim.klingt.org Just what the hell is the experimental tradition? Morton Feldman

Tim Blechmann
... interesting ... unfortunately hazard pointers are patented ... i guess it is similar with the pass-the-buck algorithm :/
Yes, Hazard Pointer is patented :-(. IIRC, the pass-the-buck algorithm requires double-CAS... Of course, Harris & Fraser's MCAS would help in this case. Experiments are needed...
i am currently quite busy with other things, but maybe i can have a look at your implementation in the next few days ...
I would be very much obliged to you. Regards, Max

Hi Maxim, Khiszinsky, Maxim wrote:
Hi, I'd like to present to the boost community the CDS (Concurrent Data Structure) C++ library at libcds.sourceforge.net. It is an attempt to generalize well-known lock-free (and a few lock-based) data structures developed by famous researchers (M. Michael, N. Shavit and others). The library is mostly header-based; only the kernel of the light-weight memory managers resides in a small dynamic library. The CDS library supports a set of modern processor architectures: x86, amd64, ia64, sparc; the supported compilers are GCC 4 and MS Visual C++ 2008. The latest CDS release, 0.6.0, is developed close to the C++ Memory Model proposal. I need recommendations from the boost community about: 1. Are the atomic primitives implemented correctly for each of the supported processors?.. I'm not an expert in this area...
Have you considered using the proposed Boost.Atomic? This should support more platforms. If there is something missing from Boost.Atomic that is needed for this purpose, it would be useful to know about it. Regards, Phil.

I'd like to present to the boost community the CDS (Concurrent Data Structure) C++ library at libcds.sourceforge.net. It is an attempt to generalize well-known lock-free (and a few lock-based) data structures developed by famous researchers (M. Michael, N. Shavit and others). The library is mostly header-based; only the kernel of the light-weight memory managers resides in a small dynamic library. The CDS library supports a set of modern processor architectures: x86, amd64, ia64, sparc; the supported compilers are GCC 4 and MS Visual C++ 2008. The latest CDS release, 0.6.0, is developed close to the C++ Memory Model proposal. I need recommendations from the boost community about: 1. Are the atomic primitives implemented correctly for each of the supported processors?.. I'm not an expert in this area...
Have you considered using the proposed Boost.Atomic? This should support more platforms.
If there is something missing from Boost.Atomic that is needed for this purpose, it would be useful to know about it.
Hi Phil, I searched the sandbox to see if there was a folder for Boost.Atomic, just to make sure that I didn't miss it. Are you referring to the contents of <boost/interprocess/detail/atomic.hpp>? A while ago I was working on a project that required some new atomic primitives which were missing from that file, and I think that they would be nice additions to an atomic operations library: 1. `atomic_dec_if_nz` - Atomically decrements an integer if it is not zero 2. `atomic_inc_if_nmax` - Atomically increments an integer if it is not the maximum possible value for its type 3. `atomic_inc_if_nz_nmax` - Atomically increments an integer if it is neither zero nor the maximum possible value for its type I have attached a file that contains my (g++, x86) implementations of these functions for `uint8_t` and a basic unit test. Do you think that these would be good to include in Boost.Atomic? If so, I can work on x86 versions for more standard int types as well as different compilers. Daniel
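[Editorial note: Daniel's attachment is not reproduced in this archive. As a rough illustration of the semantics of the first primitive, a CAS retry loop along these lines would do it; the function name follows Daniel's list, but the body and return convention are my guesses, not his actual code (GCC __sync builtins, x86).]

    #include <stdint.h>

    // Atomically decrement *p only if it is currently non-zero.
    // Returns the value observed just before the decrement (0 means nothing was done).
    static inline uint8_t atomic_dec_if_nz( volatile uint8_t * p )
    {
        uint8_t cur = *p ;
        while ( cur != 0 ) {
            // try to replace cur with cur - 1; on failure, retry with the freshly observed value
            uint8_t prev = __sync_val_compare_and_swap( p, cur, (uint8_t)( cur - 1 ) ) ;
            if ( prev == cur )
                break ;      // CAS succeeded, decrement done
            cur = prev ;     // another thread changed *p, try again
        }
        return cur ;
    }

The other two primitives would differ only in the condition tested before attempting the CAS.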

Daniel Trebbien wrote:
I searched the sandbox to see if there was a folder for Boost.Atomic, just to make sure that I didn't miss it. Are you referring to the contents of <boost/interprocess/detail/atomic.hpp>?
No. I'm referring to the proposed Boost.Atomic by Helge Bahmann. See http://www.chaoticmind.net/~hcb/projects/boost.atomic/ Phil.

Have you considered using the proposed Boost.Atomic? This should support more platforms.
If there is something missing from Boost.Atomic that is needed for this purpose, it would be useful to know about it.
Yes, I considered Boost.Atomic some time ago and I decided to implement separate atomics in the CDS library:
1. Not all processor architectures that I need are implemented in Boost.Atomic (maybe this is no longer true today).
2. A function-based implementation of atomics produces non-optimal code in some cases. Consider the usual implementation of an atomic store with explicit memory ordering:
static inline void store( atomic_t * pDest, atomic_t nVal, memory_order order ) { switch ( order ) { case memory_order_relaxed: *pDest = nVal; break ; case ... case ... } }
The problem is that the compiler (in some cases) generates case-based code even when the 'order' parameter is constant for the caller: store( &myAtomic, 10, memory_order_relaxed ) ; in this case, instead of ONE assembler store instruction, the compiler may generate many branch instructions. It is not optimal :-(. And 99% of code with atomic primitives has a *constant* memory_order parameter. A more optimized implementation of the atomic is template-based:
template <memory_order ORDER> void store( atomic_t * pDest, atomic_t nVal ) ; template <> void store<memory_order_relaxed>( pDest, nVal ) { *pDest = nVal ; } ... And so on for each memory_order constant.
The template-based implementation is more optimized: the memory_order constant is a *compile-time selector* that chooses the appropriate implementation. No compiler optimization is required. I think the C++ Memory Model proposal implies significant support from the C++ compiler for the best result. In fact, std::atomic should be a compiler intrinsic to generate optimal code. Regards, Max
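[Editorial note: to make the template scheme concrete, a minimal self-contained sketch for x86/MSVC might look like the following. The type names, intrinsics and exact fence placement here are an illustration, not the actual CDS code.]

    #include <intrin.h>       // _ReadWriteBarrier
    #include <emmintrin.h>    // _mm_mfence

    enum memory_order { memory_order_relaxed, memory_order_release, memory_order_seq_cst /* ... */ };
    typedef long atomic_t ;

    // The memory_order constant is a template argument, so each call site
    // instantiates exactly the specialization it needs - no run-time branch remains.
    template <memory_order ORDER>
    void store( atomic_t volatile * pDest, atomic_t nVal ) ;

    template <>
    inline void store<memory_order_relaxed>( atomic_t volatile * pDest, atomic_t nVal )
    {
        *pDest = nVal ;            // plain store, no ordering guarantees
    }

    template <>
    inline void store<memory_order_release>( atomic_t volatile * pDest, atomic_t nVal )
    {
        _ReadWriteBarrier() ;      // compiler barrier only; x86 keeps earlier loads/stores before this store
        *pDest = nVal ;
    }

    template <>
    inline void store<memory_order_seq_cst>( atomic_t volatile * pDest, atomic_t nVal )
    {
        *pDest = nVal ;
        _mm_mfence() ;             // full fence after the store for sequential consistency
    }

    // usage: store<memory_order_relaxed>( &myAtomic, 10 ) ; - compiles down to a single mov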

AMDG Khiszinsky, Maxim wrote:
2. A function-based implementation of atomics produces non-optimal code in some cases. Consider the usual implementation of an atomic store with explicit memory ordering: static inline void store( atomic_t * pDest, atomic_t nVal, memory_order order ) { switch ( order ) { case memory_order_relaxed: *pDest = nVal; break ; case ... case ... } } The problem is that the compiler (in some cases) generates case-based code even when the 'order' parameter is constant for the caller: store( &myAtomic, 10, memory_order_relaxed ) ; in this case, instead of ONE assembler store instruction, the compiler may generate many branch instructions. It is not optimal :-(. And 99% of code with atomic primitives has a *constant* memory_order parameter.
Have you actually observed this? On which compilers? In Christ, Steven Watanabe

On Tue, 30 Mar 2010, Khiszinsky, Maxim wrote:
Steven Watanabe wrote:
The problem is that the compiler (in some cases) generates case-based code even when the 'order' parameter is constant for the caller
Have you actually observed this? On which compilers?
MS Visual C++ 2008, GCC 4.3
I can't second this. Sure, if you turn optimization off, the code looks ridiculous. Both gcc (tested with various versions ranging from 3.3 to 4.4) and msvc (visual studio express) optimize the switch/case with a constant operand into nothingness. Whether templating helps is somewhat besides the point if the C++0x interface for atomic operations is specified to use parameters. I would however be very interested in sparc and ia64 atomic operations, would you be interested in implementing (or lending me a hand with) them for boost.atomic? Best regards, Helge

Helge Bahmann wrote:
Sure, if you turn optimization off, code looks ridiculous. Both gcc (tested with various versions ranging from 3.3 to 4.4) and msvc (visual studio express) optimize the switch/case with constant operand into nothingness.
Sorry, the problem was observed on MS VC++ 2008 x86 with full optimization and the SSE instruction set. See my previous answer to Phil with explanations. I have not studied GCC.
I would however be very interested in sparc and ia64 atomic operations, would you be interested in implementing (or lending me a hand) them for boost.atomic?
You may try the CDS implementation of sparc and ia64 atomics:
cds\compiler\gcc\ia64\atomic.h - for Itanium ia64
cds\compiler\gcc\sparc\atomic.h - for sparc 64-bit mode
These files contain template and common versions of the 32-bit and 64-bit atomic primitives. If you need, I can extract it from CDS for you, no problem. It works well with the CDS tests but I'm not sure whether it is correct...

On Tue, 30 Mar 2010, Khiszinsky, Maxim wrote:
Helge Bahmann wrote:
Sure, if you turn optimization off, code looks ridiculous. Both gcc (tested with various versions ranging from 3.3 to 4.4) and msvc (visual studio express) optimize the switch/case with constant operand into nothingness.
Sorry, the problem was observed on MS VC++ 2008 x86 with full optimization and the SSE instruction set. See my previous answer to Phil with explanations.
I see, and I would wager a guess that the problem with the following: atomic64_t v ; v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ; is the funky type-casts confusing aliasing analysis sufficiently that the compiler believes that "order" might be modified by the above statements. You could try changing that to something like: __m128i i = _mm_loadl_epi64( (__m128i * _Restrict) pMem); if (order == ...) ... return i.m128i_i64[0]; I can assure you that MSVC does not have an "intrinsic" problem with making the desired optimization (at least I know of no such problem with Boost.Atomic).
I would however be very interested in sparc and ia64 atomic operations, would you be interested in implementing (or lending me a hand) them for boost.atomic?
You may try the CDS implementation of sparc and ia64 atomics: cds\compiler\gcc\ia64\atomic.h - for Itanium ia64 cds\compiler\gcc\sparc\atomic.h - for sparc 64-bit mode These files contain template and common versions of the 32-bit and 64-bit atomic primitives.
If you need, I can extract it from CDS for you, no problem. It works well with the CDS tests but I'm not sure whether it is correct...
Thanks, will look into that and send you a version for testing. Best regards, Helge

Helge Bahmann wrote:
You could try changing that to something like: __m128i i = _mm_loadl_epi64( (__m128i * _Restrict) pMem); if (order == ...) ... return i.m128i_i64[0];
Thanks, Helge! My bug
Thanks, will look into that and send you a version for testing.
Ok, and I'll try transforming CDS to use Boost.Atomic. It will be safer, I hope. Regards, Max

On Tue, 30 Mar 2010, Khiszinsky, Maxim wrote:
Helge Bahmann wrote:
You could try changing that to something like: __m128i i = _mm_loadl_epi64( (__m128i * _Restrict) pMem); if (order == ...) ... return i.m128i_i64[0];
Thanks, Helge! My bug
I'm not sure this works, it is just an attempt based on a bold suspicion of what the actual problem is :) But it would sure be interesting if you could try that out or something similar Regards Helge

Helge Bahmann wrote:
I'm not sure this works, it is just an attempt based on a bold suspicion of what the actual problem is :) But it would sure be interesting if you could try that out or something similar
__restrict didn't solve the problem on MS Visual C++ 2008 :-(
static inline atomic64_t load64( atomic64_t volatile const * pMem )
{
    // Atomically loads 64bit value by SSE intrinsics
    __m128i v = _mm_loadl_epi64( (__m128i const * __restrict) pMem ) ;
    return v.m128i_i64[0] ;
}
- infinite loop in the spinlock on atomic64_t.
However, Peter Dimov's advice corrected the error - volatile! Yes! This code
static inline atomic64_t load64( atomic64_t volatile const * pMem )
{
    // Atomically loads 64bit value by SSE intrinsics
    __m128i volatile v = _mm_loadl_epi64( (__m128i const *) pMem ) ;
    return v.m128i_i64[0] ;
}
works well. Thanks, Peter! Regards, Max

On Tue, Mar 30, 2010 at 4:19 PM, Khiszinsky, Maxim <Maxim.Khiszinsky@billing.ru> wrote:
Helge Bahmann wrote:
[snip]
- infinite loop in spinlock on atomic64_t.
However, Peter Dimov's advice corrected the error - volatile! Yes! This code
static inline atomic64_t load64( atomic64_t volatile const * pMem )
{
    // Atomically loads 64bit value by SSE intrinsics
    __m128i volatile v = _mm_loadl_epi64( (__m128i const *) pMem ) ;
    return v.m128i_i64[0] ;
}
Shouldn't volatile go in the cast and not in v? It is fine if v is stored in a register, but you want the read of pMem not to be cached. What you are doing with the cast right now is casting away volatile-ness in addition to changing the type of the pointed-to object. The fact that it is now working is probably just due to the fact that the compiler is confused: the compiler is still free to hoist the loads from pMem, as long as all stores and loads to v (which is on the stack) are done. -- gpd

Giovanni Piero Deretta wrote
Shouldn't volatile go in the cast and not in v? It is fine if v is stored in a register, but you want the read of pMem not to be cached. What you are doing with the cast right now is casting away volatile-ness in addition to changing the type of the pointed-to object. The fact that it is now working is probably just due to the fact that the compiler is confused: the compiler is still free to hoist the loads from pMem, as long as all stores and loads to v (which is on the stack) are done.
Yes, I agree. However, the parameter of _mm_loadl_epi64 is declared as non-volatile, and casting _mm_loadl_epi64( (__m128i volatile const *) pMem ) generates an error. The solution from the CDS library is a workaround for the MS Visual C++ compiler. I cannot invent something better now. Second solution: use 64-bit CAS to load/store 64-bit values on x86. It seems too heavy for just loading/storing, doesn't it? Regards, Max

On Wed, 31 Mar 2010, Khiszinsky, Maxim wrote:
Giovanni Piero Deretta wrote
Shouldn't volatile go in the cast and not in v? It is fine if v is stored in a register, but you want the read of pMem not to be cached. What you are doing with the cast right now is casting away volatile-ness in addition to changing the type of the pointed-to object. The fact that it is now working is probably just due to the fact that the compiler is confused: the compiler is still free to hoist the loads from pMem, as long as all stores and loads to v (which is on the stack) are done.
Yes, I agree. However, the parameter of _mm_loadl_epi64 is declared as non-volatile, and casting _mm_loadl_epi64( (__m128i volatile const *) pMem ) generates an error. The solution from the CDS library is a workaround for the MS Visual C++ compiler. I cannot invent something better now.
Second solution: use 64-bit CAS to load/store 64-bit values on x86. It seems too heavy for just loading/storing, doesn't it?
this is actually what I do in Boost.Atomic; I *think* it is cheaper than shuffling around the values between SSE and general purpose registers (it sure is cheaper than MMX considering you also have to issue emms) Regards, Helge

Helge Bahmann wrote
Second solution: use 64-bit CAS to load/store 64-bit values on x86. It seems too heavy for just loading/storing, doesn't it?
this is actually what I do in Boost.Atomic; I *think* it is cheaper than shuffling around the values between SSE and general purpose registers (it sure is cheaper than MMX considering you also have to issue emms)
It's easy to test! Quick test: CDS's RecursiveSpinLock<atomic64_t> (load64 is used actively by the TATAS algorithm for busy waiting when the CAS acquiring the lock fails).
Equipment: WinXP, Intel Core2 (3GHz, 2 cores, no HT), MSVC++ 2008, release build with full optimization.
SSE2 load64:
static inline atomic64_t load64( atomic64_t volatile const * pMem )
{
    __m128i volatile v = _mm_loadl_epi64( (__m128i const *) pMem ) ;
    return v.m128i_i64[0] ;
}
result (one of, average): Spinlock_MT::recursiveSpinLock64 Lock test, thread count=8 loop per thread=1000000... Duration=2.21852
CAS64 load64 (no CAS loop):
static inline atomic64_t load64( atomic64_t volatile const * pMem )
{
    atomic64_t cur = 0 ;
    return _InterlockedCompareExchange64( const_cast<atomic64_t volatile *>(pMem), cur, cur ) ;
}
result (one of, average): Spinlock_MT::recursiveSpinLock64 Lock test, thread count=8 loop per thread=1000000... Duration=2.79662
+20% performance for SSE2. Not so bad, though I expected more :) Unfortunately, I have no access to a multi-processor Win32 server for testing now. Note that boost.atomic uses a CAS-based loop for load64, so I think the performance gain could be bigger. Regards, Max

static inline atomic64_t load64( atomic64_t volatile const * pMem )
{
    // Atomically loads 64bit value by SSE intrinsics
    __m128i v = _mm_loadl_epi64( (__m128i const * __restrict) pMem ) ;
    return v.m128i_i64[0] ;
}
are you sure, that _mm_loadl_epi64 is actually atomic in your case? according to the intel manual, the value should be 64bit aligned or should not cross a cache line boundary. are you sure, that this cannot occur? cheers, tim -- tim@klingt.org http://tim.klingt.org The price an artist pays for doing what he wants is that he has to do it. William S. Burroughs

Tim Blechmann wrote:
are you sure, that _mm_loadl_epi64 is actually atomic in your case? according to the intel manual, the value should be 64bit aligned or should not cross a cache line boundary. are you sure, that this cannot occur?
Yes, atomic64_t is declared as
typedef long long __declspec( align(8) ) atomic64_t ;
for x86 win32 (for x86 gcc there is a similar declaration). So, I hope it is aligned and atomic. In addition, I check for proper alignment in calls by
assert( is_aligned<8>( pAddr )) ;
where is_aligned is:
template <int ALIGN, typename T> static inline bool is_aligned(T const * p) { return (((uptr_atomic_t)p) & (ALIGN - 1)) == 0 ; }
Regards, Max

Khiszinsky, Maxim wrote:
Tim Blechmann wrote:
are you sure, that _mm_loadl_epi64 is actually atomic in your case? according to the intel manual, the value should be 64bit aligned or should not cross a cache line boundary. are you sure, that this cannot occur?
Yes, atomic64_t is declared as typedef long long __declspec( align(8) ) atomic64_t ;
you may run into issues, when using aligned members in heap-allocated structs: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15795 this hit me once a few years ago... cheers, tim -- tim@klingt.org http://tim.klingt.org When you open windows, bugs get in.

Tim Blechmann wrote:
you may run into issues, when using aligned members in heap-allocated structs
Hmm... You are right. There is an alloc_aligned.h header for each supported OS in the CDS lib. However, it seems they are not used now. I need to inspect the CDS code for this problem more carefully. I have a suspicion... Thanks, Tim! Regards, Max

you may run into issues, when using aligned members in heap-allocated structs
Hmm... You are right. There is an alloc_aligned.h header for each supported OS in the CDS lib. However, it seems they are not used now. I need to inspect the CDS code for this problem more carefully. I have a suspicion...
using an aligned storage allocator won't help if the struct is constructed via placement new. one could over-allocate memory to force alignment at run-time, but that may cause a higher overhead than using cmpxchg ... cheers, tim -- tim@klingt.org http://tim.klingt.org Every word is like an unnecessary stain on silence and nothingness Samuel Beckett
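[Editorial note: the over-allocation trick Tim mentions is roughly the following sketch, not something taken from CDS; it trades one extra pointer plus up to align-1 bytes per allocation for guaranteed alignment.]

    #include <stdlib.h>
    #include <stdint.h>

    // Allocate 'size' bytes aligned to 'align' (a power of two) by over-allocating
    // and stashing the original malloc pointer just below the aligned block.
    inline void * alloc_aligned( size_t size, size_t align )
    {
        void * raw = malloc( size + align - 1 + sizeof(void *) ) ;
        if ( !raw )
            return 0 ;
        uintptr_t p = (uintptr_t) raw + sizeof(void *) ;
        p = ( p + align - 1 ) & ~( (uintptr_t) align - 1 ) ;   // round up to the alignment boundary
        ((void **) p)[-1] = raw ;                              // remember the original pointer
        return (void *) p ;
    }

    inline void free_aligned( void * p )
    {
        if ( p )
            free( ((void **) p)[-1] ) ;
    }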

Tim Blechmann wrote:
using an aligned storage allocator won't help, if the struct is constructed via placement new.
Yes, Tim, you are right. The lib actively uses placement new in dynamic data structures. There are many assertions in the CDS code about unaligned new'ed data, the marks of my fight with misalignment. At first glance, CDS's implementation of Tagged Pointer memory management is particularly suspicious. I should inspect the lib more carefully.
one could over-allocate memory to force alignment at run-time, but that may cause a higher overhead than using cmpxchg ...
I don't like over-allocation and run-time alignment either :( However, using CAS for loading only (or for storing only)... IMHO it is an extreme solution, and CAS has completely different semantics. Thanks, Tim, for focusing my attention on the alignment problem. Regards, Max

Steven Watanabe wrote:
The problem is that the compiler (in some cases) generates case-based code even when the 'order' parameter is constant for the caller: store( &myAtomic, 10, memory_order_relaxed ) ; in this case, instead of ONE assembler store instruction, the compiler may generate many branch instructions. It is not optimal :-(. And 99% of code with atomic primitives has a *constant* memory_order parameter.
Have you actually observed this? On which compilers?
Sorry, my previous post was incorrect. The problem was observed on an MS VC++ 2008 x86 build with full optimization. GCC 4.3.3 is my compiler for Unix systems; I did not study the problem on GCC.

Khiszinsky, Maxim wrote:
Have you considered using the proposed Boost.Atomic? This should support more platforms.
If there is something missing from Boost.Atomic that is needed for this purpose, it would be useful to know about it.
Yes, I considered Boost.Atomic some time ago and I decided to implement separate atomics in the CDS library:
1. Not all processor architectures that I need are implemented in Boost.Atomic (maybe this is no longer true today).
You listed x86, amd64, ia64, sparc; Boost.Atomic currently supports (I think) x86, amd64, alpha, ppc and arm. Of course ia64 and sparc implementations for Boost.Atomic would be useful.
2. A function-based implementation of atomics produces non-optimal code in some cases. Consider the usual implementation of an atomic store with explicit memory ordering: static inline void store( atomic_t * pDest, atomic_t nVal, memory_order order ) { switch ( order ) { case memory_order_relaxed: *pDest = nVal; break ; case ... case ... } } The problem is that the compiler (in some cases) generates case-based code even when the 'order' parameter is constant for the caller: store( &myAtomic, 10, memory_order_relaxed ) ; in this case, instead of ONE assembler store instruction, the compiler may generate many branch instructions. It is not optimal :-(. And 99% of code with atomic primitives has a *constant* memory_order parameter.
I would like to think that all modern compilers could get this right, at least if the right level of optimisation were enabled. Can you please tell us in what case you have observed this?
More optimized implementation of atomic is template-based:
template <memory_order ORDER> void store( atomic_t * pDest, atomic_t nVal ) ; template <> void store<memory_order_relaxed>( pDest, nVal ) { *pDest = nVal ; } ... And so on for each memory_order constant. The template-based implementation is more optimized: the memory_order constant is a *compile-time selector* that chooses the appropriate implementation. No compiler optimization is required.
If it's true that compilers get this wrong then an approach like your suggestion should be considered. However it's not just Boost.Atomic that would need to be re-done but also the spec for the proposed c++0x features, which would be more difficult (!). Hopefully Helge will notice this thread soon... Phil.

Have you considered using the proposed Boost.Atomic? This should support more platforms.
If there is something missing from Boost.Atomic that is needed for this purpose, it would be useful to know about it.
Yes, I considered Boost.Atomic some time ago and I decided to implement separate atomics in the CDS library:
1. Not all processor architectures that I need are implemented in Boost.Atomic (maybe this is no longer true today).
You listed x86, amd64, ia64, sparc; Boost.Atomic currently supports (I think) x86, amd64, alpha, ppc and arm. Of course ia64 and sparc implementations for Boost.Atomic would be useful.
there seem to be some issues with boost.atomic on msvc (at least two people had issues when using my proposed lockfree library with boost.atomic). for ia64 some algorithms would need to be adapted, since iirc, there is no true double-word compare-and-exchange, but rather a compare8byte / exchange16byte instruction ... tim -- tim@klingt.org http://tim.klingt.org Contrary to general belief, an artist is never ahead of his time but most people are far behind theirs. Edgar Varèse

Tim Blechmann wrote
for ia64 some algorithms would need to be adapted, since iirc, there is no true double-word compare-and-exchange, but rather a compare8byte / exchange16byte instruction ...
cmp8xchg16 ... I have not found any practical application of Itanium's cmp8xchg16, at least in my tasks. Regards, Max

for ia64 some algorithms would need to be adapted, since iirc, there is no true double-word compare-and-exchange, but rather a compare8byte / exchange16byte instruction ...
cmp8xchg16 ... I have not found any practical application of Itanium's cmp8xchg16, at least in my tasks.
the idea is to compare only the tag -- tim@klingt.org http://tim.klingt.org Lesser artists borrow, great artists steal. Igor Stravinsky

Tim Blechmann wrote:
cmp8xchg16 ... I have not found any practical application of Itanium's cmp8xchg16, at least in my tasks.
the idea is to compare only the tag
Hmm... Interesting. I use another implementation of tagged pointers, where each pointer has its own incremental tag. Comparing only the tag means that there is a global counter incremented atomically; as a result, contention on this global counter may be encountered... On the other hand, cmp8xchg16 gives me the possibility of implementing tagged pointers on Itanium... Thanks, Tim! Your advice is really interesting to me. Regards, Max

On Tue, Mar 30, 2010 at 3:01 AM, Khiszinsky, Maxim <Maxim.Khiszinsky@billing.ru> wrote:
Tim Blechmann wrote:
cmp8xchg16 ... I have not found any practical application of Itanium's cmp8xchg16, at least in my tasks.
the idea is to compare only the tag
Hmm... Interesting. I use another implementation of tagged pointers, where each pointer has its own incremental tag. Comparing only the tag means that there is a global counter incremented atomically; as a result, contention on this global counter may be encountered... On the other hand, cmp8xchg16 gives me the possibility of implementing tagged pointers on Itanium...
You should be able to use the exact same algos, but compare only the tag. No global counter needed. -- Cory Nelson

Cory Nelson wrote:
Comparing only the tag means that there is a global counter incremented atomically; as a result, contention on this global counter may be encountered...
You should be able to use the exact same algos, but compare only the tag. No global counter needed.
I don't understand :-( Formally: let P be the set of all tagged pointers of our application, where p is struct { atomic64_t tag ; void * ptr ; };. To resolve the ABA problem with cmp8xchg16, the following condition must hold: for each {p, q} from P: p.tag != q.tag. Then, when we change a tagged pointer p, we must change its tag. How? Simply increment ++p.tag?.. In this case the condition may be violated... Another question: how should p.tag be initialized? It seems a global counter C appears: p.tag = atomic_inc( &C ) ; The global counter keeps the condition true. Maybe my reasoning is wrong and you know a better solution. Regards, Max
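[Editorial note: the scheme Cory seems to describe is the classic per-location version counter: the tag only has to change on every update of that particular location, not be unique across all pointers, so no global counter is involved. A self-contained illustration with a 32-bit pointer and a 32-bit tag packed into one 64-bit word (so an ordinary CAS suffices; with cmp8xchg16 the tag would occupy the compared 8 bytes and the pointer the other 8) might look like the sketch below - my illustration, not CDS code.]

    #include <stdint.h>

    struct tagged_ptr {
        uint32_t ptr ;   // pointer value (truncated to 32 bits for this illustration)
        uint32_t tag ;   // per-location version counter, bumped on every successful update
    };

    static inline uint64_t pack( tagged_ptr tp )
    {
        return ( (uint64_t) tp.tag << 32 ) | tp.ptr ;
    }

    // Replace the pointer stored at *loc; fails (caller retries) if another thread
    // updated the location in the meantime - even back to the same pointer value -
    // because that update bumped the tag.
    static inline bool try_update( volatile uint64_t * loc, uint32_t newPtr )
    {
        uint64_t cur = *loc ;
        tagged_ptr next ;
        next.ptr = newPtr ;
        next.tag = (uint32_t)( cur >> 32 ) + 1 ;
        return __sync_bool_compare_and_swap( loc, cur, pack( next ) ) ;
    }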

Phil Endecott wrote
You listed x86, amd64, ia64, sparc; Boost.Atomic currently supports (I think) x86, amd64, alpha, ppc and arm. Of course ia64 and sparc implementations for Boost.Atomic would be useful.
Yes, indeed.
The problem is that the compiler (in some cases) generates case-based code even when the 'order' parameter is constant for the caller: store( &myAtomic, 10, memory_order_relaxed ) ; in this case, instead of ONE assembler store instruction, the compiler may generate many branch instructions. It is not optimal :-(. And 99% of code with atomic primitives has a *constant* memory_order parameter.
I would like to think that all modern compilers could get this right, at least if the right level of optimisation were enabled. Can you please tell us in what case you have observed this?
I observed this when I analyzed a performance decrease on x86 with MS Visual C++ 2008 (with full optimization, of course). Code like this:
static inline atomic64_t load64( atomic64_t volatile const * pMem, memory_order order )
{
    atomic64_t v ;
    v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
    if ( order == memory_order_seq_cst )
        fence( memory_order_seq_cst) ;
    return v ;
}
and the call:
atomic64_t n = load64( &myVar, memory_order_relaxed ) ;
produced asm code with a spurious compare and branch [if ( order == memory_order_seq_cst )]. After that I reorganized the code using templates (membar_xxx are wrappers around the memory_order_xxx constants):
template <typename ORDER>
static inline atomic64_t load64( atomic64_t volatile const * pMem )
{
    // Atomically loads 64bit value by SSE intrinsics
    atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
    return v ;
}
template <>
static inline atomic64_t load64<membar_seq_cst>( atomic64_t volatile const * pMem )
{
    // Atomically loads 64bit value by SSE intrinsics
    atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
    fence<membar_seq_cst>() ;
    return v ;
}
atomic64_t n = load64<membar_relaxed>( &myVar ) ;
And this implementation works well - no spurious branch instructions. As a result, I reorganized the CDS library to use template atomic functions. Note that it is not a global problem: in other calls of load64 I found that the generated code was fine. I cannot find any regularity :-( Maybe this is an MSVC-only problem - bad optimization for SSE instructions.
Another example of an MSVC optimization issue for x86 - see my class cds::lock::RecursiveSpinT:
void acquireLock()
{
    // TATAS algorithm
    while ( !tryAcquireLock() ) {
        while ( m_spin.template load<membar_relaxed>() ) {
            backoff() ;
            // VC++ 2008 bug: the compiler generates an infinite loop for int64 m_spin.
            // It seems it is the result of aggressive optimization: the generated code reads the value saved in registers
            // instead of re-reading the volatile int64 atomic in the loop.
            // The following compiler barrier prevents this bug
            CDS_COMPILER_RW_BARRIER ; // _ReadWriteBarrier() for MSVC
        }
    }
}
If it's true that compilers get this wrong then an approach like your suggestion should be considered. However it's not just Boost.Atomic that would need to be re-done but also the spec for the proposed c++0x features, which would be more difficult (!).
I think it is a task for compiler developers. The template-based implementation is just a workaround. However, I like templates :-) Regards, Max

On Tue, 30 Mar 2010, Khiszinsky, Maxim wrote:
Another example of an MSVC optimization issue for x86 - see my class cds::lock::RecursiveSpinT: void acquireLock() { // TATAS algorithm while ( !tryAcquireLock() ) { while ( m_spin.template load<membar_relaxed>() ) { backoff() ; // VC++ 2008 bug: the compiler generates an infinite loop for int64 m_spin.
this is not a bug; the compiler is simply allowed to assume that the value of a variable does not change intermittently. Regards, Helge
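[Editorial note: a stripped-down illustration of this point, not CDS code: with a plain member the compiler may legally keep the value in a register and never re-read it; reading through a volatile lvalue - or issuing a compiler barrier inside the loop, as CDS does - forces a re-read each iteration. Atomicity of the 64-bit read on 32-bit x86 is a separate question.]

    struct spin_wait_bad {
        long long m_spin ;
        void wait() const {
            while ( m_spin )      // load may be hoisted: value cached in a register -> possible infinite loop
                ;
        }
    };

    struct spin_wait_ok {
        long long volatile m_spin ;
        void wait() const {
            while ( m_spin )      // volatile read: re-loaded from memory on every iteration
                ;
        }
    };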

Khiszinsky, Maxim wrote:
static inline atomic64_t load64( atomic64_t volatile const * pMem, memory_order order )
{
    atomic64_t v ;
    v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
    if ( order == memory_order_seq_cst )
        fence( memory_order_seq_cst) ;
    return v ;
}
I don't think that this is correct. First, the seq_cst fence, if needed, must come before the load, not after it; second, a fence on load is not needed for seq_cst semantics on x86, since loads have acquire semantics, but stores must be locked (LOCK XCHG, or in this case LOCK CMPXCHG8). It depends on the implementation of store64 though; I could be missing something. I'm also not quite positive on whether SSE loads are guaranteed to acquire, but if they don't, you need a trailing fence for memory_order_acquire as well.
Maybe this is an MSVC-only problem - bad optimization for SSE instructions. Another example of an MSVC optimization issue for x86 - see my class cds::lock::RecursiveSpinT: void acquireLock() { // TATAS algorithm while ( !tryAcquireLock() ) { while ( m_spin.template load<membar_relaxed>() ) { backoff() ; // VC++ 2008 bug: the compiler generates an infinite loop for int64 m_spin.
You're casting away volatile in load64.
participants (9)
-
Cory Nelson
-
Daniel Trebbien
-
Giovanni Piero Deretta
-
Helge Bahmann
-
Khiszinsky, Maxim
-
Peter Dimov
-
Phil Endecott
-
Steven Watanabe
-
Tim Blechmann