
Khiszinsky, Maxim wrote:
static inline atomic64_t load64( atomic64_t volatile const * pMem, memory_order order )
{
    atomic64_t v ;
    v = _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;
    if ( order == memory_order_seq_cst )
        fence( memory_order_seq_cst ) ;
    return v ;
}
I don't think this is correct. First, the seq_cst fence, if needed, must come before the load, not after it. Second, on x86 a fence on the load is not needed for seq_cst semantics, since loads already have acquire semantics; it is the stores that must be locked (LOCK XCHG, or in this case LOCK CMPXCHG8B). It depends on the implementation of store64, though; I could be missing something. I'm also not quite positive whether SSE loads are guaranteed to acquire, but if they aren't, you need a trailing fence for memory_order_acquire as well.
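For what it's worth, here is a minimal sketch of how I read the conventional x86-32 mapping: the seq_cst cost is paid on the store side with a LOCK'ed CMPXCHG8B loop, so the load needs no trailing fence. The names (load64_seq_cst, store64_seq_cst) and the CAS-loop store are my own illustration, not your store64:

    #include <emmintrin.h>   // _mm_loadl_epi64 (SSE2)
    #include <intrin.h>      // _InterlockedCompareExchange64

    typedef __int64 atomic64_t ;

    // Load: a single MOVQ is an atomic 64-bit read on x86 and already has
    // acquire semantics, so no fence is needed even for seq_cst - provided
    // the store side below is LOCK'ed.
    static inline atomic64_t load64_seq_cst( atomic64_t volatile const * pMem )
    {
        return _mm_loadl_epi64( (__m128i *) pMem ).m128i_i64[0] ;  // cast still drops cv-qualifiers
    }

    // Store: the LOCK'ed RMW (LOCK CMPXCHG8B on x86-32) makes the store
    // globally ordered, which is what gives the pair seq_cst semantics.
    static inline void store64_seq_cst( atomic64_t volatile * pMem, atomic64_t val )
    {
        atomic64_t cur = *pMem ;   // possibly torn snapshot; the CAS validates it
        for ( ;; ) {
            atomic64_t prev = _InterlockedCompareExchange64( pMem, val, cur ) ;
            if ( prev == cur )
                break ;
            cur = prev ;
        }
    }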
Maybe this is an MSVC-only problem - bad optimization of SSE instructions. Another example of an MSVC optimization error for x86 - see my class cds::lock::RecursiveSpinT:

    void acquireLock()
    {
        // TATAS algorithm
        while ( !tryAcquireLock() ) {
            while ( m_spin.template load<membar_relaxed>() ) {
                backoff() ;
                // VC++ 2008 bug: the compiler generates an infinite loop for int64 m_spin
            }
        }
    }
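If membar_relaxed compiles down to a plain, non-volatile read, the compiler is allowed to hoist it out of the loop, which would explain the "infinite loop". The usual workaround (just a guess on my part; I don't have your membar_relaxed implementation in front of me, and load_relaxed/spin_t below are only stand-in names) is to force the re-read through a volatile access:

    #include <intrin.h>   // _ReadWriteBarrier (MSVC compiler-only barrier)

    typedef __int64 spin_t ;   // stand-in for the real m_spin field

    static inline spin_t load_relaxed( spin_t volatile const * p )
    {
        // The volatile access is what stops VC++ from caching the value in a
        // register and collapsing the TATAS inner loop into an infinite loop.
        spin_t v = *p ;          // not a single atomic 64-bit read on x86-32,
        _ReadWriteBarrier() ;    // but good enough for a zero / non-zero spin test
        return v ;
    }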
You're casting away volatile in load64.
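If you want to avoid the cast altogether (at the price of a LOCK'ed RMW and write access to the cache line on every read), the usual trick is to read through CMPXCHG8B with equal exchange/comparand values; load64_via_cas is just an illustrative name, not your code:

    #include <intrin.h>   // _InterlockedCompareExchange64

    typedef __int64 atomic64_t ;

    static inline atomic64_t load64_via_cas( atomic64_t volatile const * pMem )
    {
        // Compare against 0 and "exchange" with 0: if *pMem != 0 nothing is
        // written, and either way the current value is returned atomically.
        // Only const has to be cast away for the intrinsic; volatile is kept.
        return _InterlockedCompareExchange64( const_cast<atomic64_t volatile *>( pMem ), 0, 0 ) ;
    }

The obvious downsides are that every read now contends for exclusive ownership of the line and cannot target read-only memory.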