
Phil Endecott wrote:
You listed x86, amd64, ia64, sparc; Boost.Atomic currently supports (I think) x86, amd64, alpha, ppc and arm. Of course ia64 and sparc implementations for Boost.Atomic would be useful.
Yes, indeed.
The problem is that the compiler (in some cases) generates compare-and-branch code even when the 'order' parameter is a compile-time constant at the call site: store( &myAtomic, 10, memory_order_relaxed ) ; - in this case, instead of ONE assembler store instruction, the compiler may generate several branch instructions. This is not optimal :-(. And 99% of code using atomic primitives passes a *constant* memory_order parameter.
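For illustration, here is a minimal sketch of the pattern in question (hypothetical store32 helper and local memory_order enum, not the actual Boost.Atomic code):

    #include <emmintrin.h>                   // _mm_mfence (SSE2)

    enum memory_order { memory_order_relaxed, memory_order_acquire,
                        memory_order_release, memory_order_seq_cst } ;
    typedef int atomic32_t ;

    static inline void store32( atomic32_t volatile * pMem, atomic32_t val, memory_order order )
    {
        *pMem = val ;                        // a plain aligned 32bit store is atomic on x86
        if ( order == memory_order_seq_cst )
            _mm_mfence() ;                   // full fence only for seq_cst
    }

    // Caller:
    //     store32( &myAtomic, 10, memory_order_relaxed ) ;
    // With constant propagation this is a single mov; otherwise a compare and
    // branch on 'order' survives into the generated code.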
I would like to think that all modern compilers could get this right, at least if the right level of optimisation were enabled. Can you please tell us in what case you have observed this?
I observed this while analyzing a performance regression on x86 with MS Visual C++ 2008 (with full optimization, of course). Code like this:

    static inline atomic64_t load64( atomic64_t volatile const * pMem, memory_order order )
    {
        atomic64_t v ;
        v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
        if ( order == memory_order_seq_cst )
            fence( memory_order_seq_cst ) ;
        return v ;
    }

called as:

    atomic64_t n = load64( &myVar, memory_order_relaxed ) ;

produced asm code with a spurious compare and branch for [if ( order == memory_order_seq_cst )]. After that I reorganized the code using templates (the membar_xxx types are wrappers around the memory_order_xxx constants):

    template <typename ORDER>
    static inline atomic64_t load64( atomic64_t volatile const * pMem )
    {
        // Atomically loads a 64bit value using SSE intrinsics
        atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
        return v ;
    }

    template <>
    static inline atomic64_t load64<membar_seq_cst>( atomic64_t volatile const * pMem )
    {
        // Atomically loads a 64bit value using SSE intrinsics
        atomic64_t v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
        fence<membar_seq_cst>() ;
        return v ;
    }

    atomic64_t n = load64<membar_relaxed>( &myVar ) ;

This implementation works well - no spurious branch instructions. As a result, I reorganized the CDS library to use template-based atomic functions.

Note that it is not a global problem: in other calls of load64 the generated code was fine. I cannot find any regularity :-( Maybe it is an MSVC-only problem - poor optimization of SSE instructions.

Another example of an MSVC optimization error on x86 - see my class cds::lock::RecursiveSpinT:

    void acquireLock()
    {
        // TATAS algorithm
        while ( !tryAcquireLock() ) {
            while ( m_spin.template load<membar_relaxed>() ) {
                backoff() ;
                // VC++ 2008 bug: the compiler generates an infinite loop for an int64 m_spin.
                // It seems to be the result of aggressive optimization: the generated code
                // re-reads the value cached in registers instead of re-reading the volatile
                // int64 atomic in the loop.
                // The following compiler barrier prevents this bug:
                CDS_COMPILER_RW_BARRIER ;   // _ReadWriteBarrier() for MSVC
            }
        }
    }
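For reference, the same tag-dispatch idea in a simplified, compiler-neutral sketch (hypothetical load32 and membar types, plain loads instead of the SSE intrinsics used in CDS):

    #include <emmintrin.h>                   // _mm_mfence (SSE2)

    struct membar_relaxed {} ;
    struct membar_seq_cst {} ;
    typedef int atomic32_t ;

    template <typename Order>
    inline atomic32_t load32( atomic32_t volatile const * pMem )
    {
        return *pMem ;                       // relaxed: a plain aligned 32bit load
    }

    template <>
    inline atomic32_t load32<membar_seq_cst>( atomic32_t volatile const * pMem )
    {
        atomic32_t v = *pMem ;
        _mm_mfence() ;                       // seq_cst: follow the load with a full fence
        return v ;
    }

    // Usage - the ordering is a template argument, so the fence is present or
    // absent at compile time and there is nothing for the optimizer to fold away:
    //     atomic32_t n = load32<membar_relaxed>( &myVar ) ;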
If it's true that compilers get this wrong, then an approach like the one you suggest should be considered. However, it's not just Boost.Atomic that would need to be redone, but also the spec for the proposed C++0x features, which would be more difficult (!).
I think this is a task for the compiler developers. The template-based implementation is only a workaround. However, I like templates :-)

Regards,
Max