
On Wed, Dec 19, 2012 at 2:36 PM, Tim Blechmann <tim@klingt.org> wrote:
hi all,
i need another pair of eyes regarding boost.atomic on x86:
the implementation of the memory barrier is merely a compiler barrier, not a CPU barrier: it uses code like __asm__ __volatile__ ("" ::: "memory");
afaict, one should use a "real" CPU barrier like "mfence" or "lock; addl $0,0(%%esp)". is this correct?
apart from that, i've seen that compare_exchange issues explicit memory barriers before/after the "cmpxchg" instruction. i thought that cmpxchg (and the 8b/16b variants) implicitly acts as a full memory barrier, so the resulting code would contain redundant barriers.
can someone with some insights in the x86 architecture confirm this?
I'm not claiming to be a specialist in IA32, but here's my understanding. Boost.Atomic uses several groups of functions to implement operations. The platform_fence_before/after functions prevent the compiler from reordering the generated code across the atomic op; they are also used to enforce hardware fences when required by the memory order argument. The platform_fence_after_load/platform_fence_before_store/platform_fence_after_store functions serve the same purpose specifically for load and store ops. There are also the platform_cmpxchg32_strong/platform_cmpxchg32 functions that perform CAS; these functions are only used by the generic CAS-based implementations (which is no longer the case for x86, as I rewrote the implementation for Windows).

Now, the "before" functions only need to implement write barriers, to prevent stores from traveling below the atomic op. Similarly, the "after" functions only need to implement read barriers. Of course, the barriers are only required when the appropriate memory order is requested by the user. On x86 the memory view is almost always synchronized (AFAIK, the only exception is non-temporal stores, which are usually finalized with an explicit mfence anyway), so unless the user requests memory_order_seq_cst, a compiler barrier will suffice.

As for memory_order_seq_cst, it requires global sequencing, and here's the part I'm not sure about. Lock-prefixed ops on x86 are full fences themselves, so it looks like no special hardware fence is needed in this case either. So unless I'm missing something, the mfence could be removed in this case as well. Could somebody confirm that?