[atomic, x86] need another pair of eyes

hi all, i need another pair of eyes regarding boost.atomic on x86: the implementation of the memory barrier is merely a compiler barrier, not a CPU barrier, as it uses code like:

    __asm__ __volatile__ ("" ::: "memory");

afaict, one should use a `real' CPU barrier like "mfence" or "lock; addl $0,0(%%esp)". is this correct?

apart from that, i've seen that compare_exchange uses explicit memory barriers before/after the "cmpxchg" instruction. i somehow thought that cmpxchg and its 8b/16b variants implicitly issue a memory barrier, so the resulting code would contain multiple memory barriers. can someone with some insight into the x86 architecture confirm this?

thanks, tim
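For reference, the two kinds of barriers Tim contrasts can be sketched in GCC-style inline assembly on x86-64 (the function names here are illustrative, not Boost.Atomic's; the thread's `%%esp` form is the 32-bit equivalent of the `%%rsp` variant below):

```cpp
#include <cstdint>

// Compiler-only barrier: prevents the compiler from reordering memory
// accesses across this point, but emits no machine instruction at all.
inline void compiler_barrier() {
    __asm__ __volatile__ ("" ::: "memory");
}

// Full hardware barrier via mfence: orders loads and stores in the CPU
// as well as in the compiler.
inline void cpu_barrier_mfence() {
    __asm__ __volatile__ ("mfence" ::: "memory");
}

// Full hardware barrier via a locked no-op read-modify-write on the
// stack; on x86 any lock-prefixed instruction is a full fence, and this
// form is often cheaper than mfence in practice.
inline void cpu_barrier_locked() {
    __asm__ __volatile__ ("lock; addl $0, (%%rsp)" ::: "memory", "cc");
}
```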

Tim Blechmann wrote:
Probably not, but can you please be more specific? In, f.ex., gcc-x86.hpp there are several functions that fit the description. Which one do you think is wrong?
The locked instructions (and XCHG, which is implicitly locked) are indeed "memory barriers" from the CPU's point of view, but not necessarily from the compiler's point of view. The calls to platform_fence_before and platform_fence_after are there for the compiler barrier, as far as I can see.

On Wed, Dec 19, 2012 at 2:36 PM, Tim Blechmann <tim@klingt.org> wrote:
I'm not claiming to be a specialist in IA32, but here's my understanding. There are several groups of functions that Boost.Atomic uses to implement operations. The platform_fence_before/after functions are used to prevent the compiler from reordering the generated code across the atomic op. These functions are also used to enforce hardware fences when required by the memory order argument. The platform_fence_after_load/platform_fence_before_store/platform_fence_after_store functions are used specifically for load and store ops in a similar way. There are also the platform_cmpxchg32_strong/platform_cmpxchg32 functions that perform CAS; these are only used by the generic CAS-based implementations (which is no longer the case for x86, as I rewrote the implementation for Windows).

Now, the "before" functions only need to implement write barriers to prevent stores from travelling below the atomic op. Similarly, the "after" functions only need to implement read barriers. Of course, the barriers are only required when the appropriate memory order is requested by the user. On x86 the memory view is almost always synchronized (AFAIK, the only exception is non-temporal stores, which are usually finalized with an explicit mfence anyway), so unless the user requests memory_order_seq_cst, a compiler barrier alone will suffice. As for memory_order_seq_cst, it requires global sequencing, and here's the part I'm not sure about. Lock-prefixed ops on x86 are full fences themselves, so it looks like no special hardware fence is needed in this case either. So unless I'm missing something, the mfence could be removed in this case as well. Could somebody confirm that?
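The dispatch Andrey describes could be sketched as follows (illustrative only, not the actual Boost.Atomic source; on x86 every order except seq_cst reduces to a compiler barrier, and the seq_cst mfence is exactly the part he questions):

```cpp
#include <atomic>

// Sketch of platform_fence_before: a "before" fence only needs to keep
// preceding stores from sinking below the atomic op; on x86 a compiler
// barrier is enough for release semantics.
inline void platform_fence_before(std::memory_order order) {
    switch (order) {
    case std::memory_order_release:
    case std::memory_order_acq_rel:
    case std::memory_order_seq_cst:
        __asm__ __volatile__ ("" ::: "memory"); // compiler barrier only
        break;
    default:
        break; // relaxed/consume/acquire: nothing needed before the op
    }
}

// Sketch of platform_fence_after: an "after" fence only needs to keep
// subsequent loads from hoisting above the atomic op.
inline void platform_fence_after(std::memory_order order) {
    switch (order) {
    case std::memory_order_acquire:
    case std::memory_order_acq_rel:
        __asm__ __volatile__ ("" ::: "memory"); // compiler barrier only
        break;
    case std::memory_order_seq_cst:
        // Andrey's open question: since lock-prefixed ops are already
        // full fences on x86, this mfence may be removable too.
        __asm__ __volatile__ ("mfence" ::: "memory");
        break;
    default:
        break;
    }
}
```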

Andrey Semashev wrote: On Wed, Dec 19, 2012 at 3:27 PM, Peter Dimov <lists@pdimov.com> wrote:
It doesn't seem necessary, but it's only used in one place, in atomic_flag::clear. All uses of READ_BARRIER or WRITE_BARRIER seem wrong, though. READ_WRITE_BARRIER is needed in all the places where these are used.

On Wed, Dec 19, 2012 at 4:00 PM, Peter Dimov <lists@pdimov.com> wrote:
Ok, thank you, I will correct that.
All uses of READ_BARRIER or WRITE_BARRIER seem wrong, though. READ_WRITE_BARRIER is needed in all places where these are used.
Why? The intention was to prevent outer reads and writes being reordered with the atomic op. Did I miss something?

On Wed, Dec 19, 2012 at 4:28 PM, Peter Dimov <lists@pdimov.com> wrote:
I suppose you're right. I was under the impression that the _ReadBarrier and _WriteBarrier intrinsics were equivalent to acquire and release compiler fences, but reading their docs now, this doesn't seem to be correct. Only _ReadWriteBarrier has the comment that the compiler doesn't reorder operations across it. Thank you, I will correct the code.

It's been quite a while since I had my hands on VC's codegen, but I'm fairly certain the sole purpose of these intrinsics is to prevent reordering in the compiler. They don't result in any code being emitted directly. I guess in real-world cases it's rather unlikely to make a difference, as there's usually some expression with side effects in between memory operations, or there are volatile accesses. However, these intrinsics are needed to avoid things like redundant-load-after-load optimizations. E.g.:

    // address-taken globals
    extern int x;
    extern int y;

    void foo() {
        int lx = x;
        int ly = y;
        use(lx);
        sync_operation();
        lx = x;
        use(lx);
    }

If use and sync_operation have no side effects, the second load of x would usually be optimized out, and the initial load might be ordered after the sync_operation. This is where the _*Barrier intrinsics come in. -hg
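Holger's point can be made concrete with a portable sketch (the GCC fallback is my addition for illustration; on MSVC the macro maps to the _ReadWriteBarrier intrinsic the thread discusses):

```cpp
#include <cassert>

#if defined(_MSC_VER)
#include <intrin.h>
// MSVC: a pure compiler fence, emits no instructions.
#define COMPILER_BARRIER() _ReadWriteBarrier()
#else
// GCC/Clang equivalent: empty asm with a "memory" clobber.
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")
#endif

volatile int flag = 1;

// The barrier stops the compiler from folding the second load of flag
// into the first (redundant-load-after-load elimination).
int read_flag_twice() {
    int a = flag;
    COMPILER_BARRIER(); // the second load must be re-issued
    int b = flag;
    return a + b;
}
```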

hi andrey and peter, to be more specific: one of the test machines at sandia sometimes has test failures for boost.lockfree, which i cannot reproduce on my machine. the algorithm in question (the spsc_queue) only makes use of atomic reads and writes and only requires two memory barriers. the sandia machine is a 4-cpu machine with 8 cores per cpu, while my machines are all single-cpu multi-core.

peter, the compare_exchange is in boost/atomic/detail/gcc-x86.hpp; the code in question looks like:

    bool compare_exchange_strong(value_type & expected, value_type desired,
                                 memory_order success_order,
                                 memory_order failure_order) volatile
    {
        value_type previous = expected;
        platform_fence_before(success_order);
        __asm__ (
            "lock ; cmpxchgl %2, %1"
            : "+a" (previous), "+m" (v_)
            : "r" (desired)
        );
        bool success = (previous == expected);
        if (success)
            platform_fence_after(success_order);
        else
            platform_fence_after(failure_order);
        expected = previous;
        return success;
    }

imo, the platform_fence_before/after is not necessary ...

tim

On Wed, Dec 19, 2012 at 3:40 PM, Tim Blechmann <tim@klingt.org> wrote:
andrey & peter,
could you review this patch?
As I understand it, the hardware fences in platform_fence_before and platform_fence_after are not needed. lfence is a no-op for normal (write-back) memory, as far as I know, so it can be removed as well. The hardware fences in platform_fence_before_store and platform_fence_after_store are not needed, considering how store() operations are implemented: if the user requests seq_cst, the store is implemented as an atomic exchange(), which is a full fence itself. Otherwise the store only needs to be synchronized with the atomic variable itself, and on x86 a regular store with a compiler barrier is sufficient.
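The store() scheme Andrey describes could be sketched like this (illustrative, not the Boost.Atomic source; XCHG with a memory operand is implicitly lock-prefixed and therefore a full fence):

```cpp
#include <atomic>

// Sketch: seq_cst stores go through XCHG; weaker stores are plain MOVs
// bracketed by compiler barriers so the compiler cannot reorder them.
inline void store32(volatile int* p, int v, std::memory_order order) {
    if (order == std::memory_order_seq_cst) {
        // xchg reg, mem is implicitly locked: store + full fence in one.
        __asm__ __volatile__ ("xchgl %0, %1"
                              : "+r" (v), "+m" (*p)
                              :
                              : "memory");
    } else {
        __asm__ __volatile__ ("" ::: "memory"); // compiler barrier
        *p = v;                                 // plain MOV
        __asm__ __volatile__ ("" ::: "memory");
    }
}
```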

On Wed, Dec 19, 2012 at 3:34 PM, Tim Blechmann <tim@klingt.org> wrote:
imo, the platform_fence_before/after is not necessary ...
The cmpxchgl asm statement is not treated as a compiler barrier, so the compiler can reorder the preceding or subsequent operations with it. But if all asm statements are marked as memory-clobbering, it looks like the before/after functions become useless and can be removed.
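A sketch of what Andrey is suggesting (illustrative, not the patched Boost.Atomic code): with `__volatile__` and a `"memory"` clobber, the CAS statement is itself a compiler barrier, and since lock-prefixed instructions are full hardware fences on x86, no separate fence calls are needed around it.

```cpp
// Standalone CAS on a 32-bit value; the "memory" clobber makes this asm
// a compiler barrier, and the lock prefix makes it a full CPU fence.
inline bool cas32(volatile int* p, int& expected, int desired) {
    int previous = expected;
    __asm__ __volatile__ ("lock; cmpxchgl %2, %1"
                          : "+a" (previous), "+m" (*p)
                          : "r" (desired)
                          : "cc", "memory");
    bool success = (previous == expected);
    expected = previous; // on failure, report the observed value
    return success;
}
```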

On Wed, Dec 19, 2012 at 6:34 AM, Tim Blechmann <tim@klingt.org> wrote:
...
imo, the platform_fence_before/after is not necessary ...
tim
If there is a bug, _removing_ barriers isn't going to solve it. Or do I misunderstand the situation? (I agree that the barriers may be unnecessary, but I don't see how it helps with the test failures.) Tony

Tim Blechmann wrote:
unfortunately this test still failed on one of the sandia machines (gcc-4.7.1_0x) ...
I don't know if this is the cause of the failure, but the test still has a race condition: if get() calls get_elements(), which returns false, and the writer then pushes the last elements and sets running to false, get() will return without consuming them. I suppose it needs to be something along the lines of:

    void get(void)
    {
        while (running) {
            get_elements();
        }
        while (get_elements());
    }

or equivalent.

Incidentally, the mfence in gcc-x86.hpp line 86, in platform_fence_after_load, is unnecessary as well, if I'm not mistaken. The semantics of memory_order_seq_cst are (intentionally) such that it doesn't require fences on the load side, as long as all seq_cst stores or modifications are locked or done with xchg (which has an implicit lock prefix). This also seems to apply to windows.hpp, line 203.
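Peter's observation matches the usual x86 mapping, which could be sketched as (illustrative names; sequential consistency is paid for on the store side, so the load side stays cheap):

```cpp
// Sketch of a seq_cst load on x86: a plain MOV plus a compiler barrier
// is enough, provided every seq_cst store in the program is done with
// XCHG or a lock-prefixed instruction (which are full fences).
inline int load32_seq_cst(const volatile int* p) {
    int v = *p;                              // plain MOV
    __asm__ __volatile__ ("" ::: "memory");  // compiler barrier only
    return v;
}
```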

Tim Blechmann wrote:
i need another pair of eyes regarding boost.atomic on x86:
It's not only x86 that needs more eyes. I wrote the ARM code, and it is full of FIXMEs. Although I believe that it basically works, I don't know anything like enough about e.g. memory barriers and the current code is little more than a guess. This really should be reviewed by someone more knowledgeable before it is released. One thing that definitely needs attention is the ARM architecture versions detection stuff, which needs to cope with "ARM v7S" (related: https://svn.boost.org/trac/boost/ticket/7599). Also, how do people feel about using Boost.Atomic underneath shared_ptr? (https://svn.boost.org/trac/boost/ticket/5625 - again, I've been using this for a long time and it seems to work.) Maybe as a non-default initially if we're nervous. It might help to find any bugs in Boost.Atomic... Regards, Phil.

On Saturday 22 December 2012 14:44:40 Phil Endecott wrote:
In general I welcome this suggestion. However, shared_ptr probably should not require linking to Boost.Atomic, so some header-only mode should be provided by Boost.Atomic first.

On Saturday 22 December 2012 16:06:33 Tim Blechmann wrote:
I was thinking about splitting the implementation into a guaranteed-to-be-lock-free part and a lock-pool-based part (the latter requires linking to the library). The lock-free part should be accessible through a separate public header, and it should not include the lock-based part. boost/atomic.hpp should still include both parts, thus providing the complete atomic<> implementation. It's rather straightforward: basically, the base_atomic<> implementation should be moved from base.hpp into a separate header, and base.hpp should only contain forward declarations. The auto-linking code from config.hpp should be extracted as well and only included when the lock-pool-based implementation is included.
participants (6)
- Andrey Semashev
- Gottlob Frege
- Holger Grund
- Peter Dimov
- Phil Endecott
- Tim Blechmann