[atomic] Help understanding consume order
Hi,

I'm reviewing (again) the Boost.Atomic code and struggling to understand the consume memory order, in particular what it should mean on architectures other than DEC Alpha. I read the explanation here:

http://en.cppreference.com/w/cpp/atomic/memory_order

but the point eludes me. Take ARM, for example, and the explanation in the "Release-Consume ordering" section. The producer thread allocates the string and stores the pointer with a release operation, so that the pointer, the string contents and the 'data' integer are visible to other threads.

Now the consumer thread reads the pointer with a consume operation. According to the explanation in the article, on ARM the consume operation need not issue any specific fences to be able to use the pointer and the string body. In that case, the consume operation becomes equivalent to relaxed (plus prohibiting certain compiler optimizations). But is there a guarantee that the string body will be visible to the consumer? Shouldn't the consume operation be promoted to acquire instead?

I guess that's the ultimate question: how should consume ordering be handled on conventional architectures?
hi andrey,
[snip original question]
The Alpha can speculatively load a memory region; the consume semantics issue a memory barrier to make sure that the memory is read only after the atomic pointer is read. So, from my understanding, on ARM and x86 relaxed has the same guarantee as consume ... would be great if someone could confirm this, though.

tim
On Monday 02 June 2014 09:45:46 Tim Blechmann wrote:
[snip original question]
The Alpha can speculatively load a memory region; the consume semantics issue a memory barrier to make sure that the memory is read only after the atomic pointer is read. So, from my understanding, on ARM and x86 relaxed has the same guarantee as consume ...
would be great if someone could confirm this, though.
There is some discussion in this proposal: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm

In particular, it mentions that ARM and PPC reorder data-dependent operations, so I assume that means a fence is required? BTW, gcc 4.8 generates fences for consume loads, similarly to acquire, on both ARM and PPC. It may be that the gcc devs just took the simplest and safest route when implementing consume, though.
On Sun, Jun 1, 2014 at 11:58 AM, Andrey Semashev <andrey.semashev@gmail.com> wrote:
[snip]
Now the consumer thread reads the pointer with a consume operation. According to the explanation in the article, on ARM the consume operation need not issue any specific fences to be able to use the pointer and the string body. In that case, the consume operation becomes equivalent to relaxed (plus prohibiting compiler optimizations). But is there a guarantee that the string body will be visible to the consumer? Shouldn't the consume operation be promoted to acquire instead?
ARM and many other RMO architectures (like PPC and unlike Alpha) guarantee that a load and a load it depends on won't be reordered, so, together with the release operation on the writer side, the load_consume guarantees the visibility of the string body.

The exact definition of a load dependency (basically, the address of the dependent load is computed as a function of the value returned by the depending load) is defined at the instruction level and is quite tricky to recover at the level of a high-level language like C++. C++11 tried to do it, but according to some the current wording is both very hard to implement and, in some cases, either not strong enough or too strict. In the meantime GCC (and a few other compilers) punts on load_consume and simply promotes it to load_acquire.

Note that x86, a TSO machine, has even stronger guarantees: every load is a load_acquire.
I guess that's the ultimate question: how should consume ordering be handled on conventional architectures?
That's hard to do without compiler help, unfortunately. Compilers have started doing some quite aggressive optimizations (like value speculation and PGO) that can break load dependencies. The Linux kernel, for example, gets by by explicitly disabling those optimizations, not doing PGO and targeting a specific compiler. See n2664 and the recent epic thread on gcc-dev.

HTH,
-- gpd
On Monday 02 June 2014 12:42:04 Giovanni Piero Deretta wrote:
On Sun, Jun 1, 2014 at 11:58 AM, Andrey Semashev <andrey.semashev@gmail.com> wrote:
ARM and many other RMO architectures (like PPC and unlike Alpha), guarantee that a load and the load it depends on won't be reordered, so, together with the release operation on the writer side, the load_consume guarantees the visibility of the string body.
Ok, so basically, on ARM and PPC consume operation should be equivalent to relaxed + a compiler barrier to prevent possible code reordering, am I right?
The exact definition of a load dependency (basically, the address of the dependent load is computed as a function of the value returned by the depending load) is defined at the instruction level and is quite tricky to recover at the level of a high-level language like C++. C++11 tried to do it, but according to some the current wording is both very hard to implement and, in some cases, either not strong enough or too strict.
In the meantime GCC (and a few other compilers) punts on load_consume and simply promotes it to load_acquire.
Yes, I noticed that. MSVC seems to follow the same route as well.
I guess, that's the ultimate question: how should consume ordering be handled on conventional architectures.
That's hard to do without compiler help unfortunately. Compilers have started doing some quite aggressive optimisations (like value speculation and PGO) that can break loads dependencies. The linux kernel for example gets by by explicitly disabling those optimisations, not doing PGO and targeting a specific compiler.
I hope speculative loads and stores can be prevented with compiler barriers. But it looks like PGO/IPO/LTO may ruin code involving this memory ordering. I suppose even if I put a compiler barrier (which in the case of gcc is an empty asm statement with "memory" in the clobber list), it will not be in effect during the optimization phase?
See n2664 and the recent epic thread on gcc-dev.
Is it this thread? https://gcc.gnu.org/ml/gcc/2014-02/msg00052.html I'm reading it now, very interesting, thank you for the tip.
On Mon, Jun 2, 2014 at 6:11 PM, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On Monday 02 June 2014 12:42:04 Giovanni Piero Deretta wrote:
On Sun, Jun 1, 2014 at 11:58 AM, Andrey Semashev <andrey.semashev@gmail.com> wrote:
ARM and many other RMO architectures (like PPC and unlike Alpha), guarantee that a load and the load it depends on won't be reordered, so, together with the release operation on the writer side, the load_consume guarantees the visibility of the string body.
Ok, so basically, on ARM and PPC consume operation should be equivalent to relaxed + a compiler barrier to prevent possible code reordering, am I right?
I'm afraid things are a bit more complex. It is not just a problem of reordering, so memory barriers are not enough. See below.
[snip]
I hope, speculative loads and stores can be prevented with compiler barriers. But it looks like PGO/IPO/LTO may ruin the code involving this memory ordering. I suppose, even if I put a compiler barrier (which in case of gcc is an empty asm statement with memory in the clobber list) it will not be in effect during the optimization phase?
Modulo bugs, GCC will respect memory barriers for all optimizations. Let me show an example that can break.

Suppose you have a pair of intrusive single-producer/single-consumer queues. Two threads use the two queues for bidirectional communication and most of the time can enqueue nodes popped from one queue onto the other; if the threads are able to run in lock step, each queue always holds a single node, and the same node is reused forever. Occasionally, when there is some contention, one thread might be slowed down and the other thread might need to allocate a new node.

    struct queue {
        void push_back(node * p); // release
        node * pop_front();       // consume
    };

The head and tail pointers of the queue are atomic<node*>, but the node content is only ever accessed by a single thread at a time, so the nodes contain no atomic objects. A consumer can safely access the content of a node because the node address depends on the load from the queue.

Let's not go too much into implementation details, but assume that push_back can be implemented with just release stores and pop_front with consume loads; in particular pop_front will look something like this:

    [[carries_dependency]] node * pop_front()
    {
        node * x = head.load(memory_order_consume);
        // magic to update head and node goes here
        return x;
    }

Note the carries_dependency; we will get back to it. Now, let's say that the implementation of load_consume in your compiler is not optimal, so you attempt to roll your own semi-portable implementation out of relaxed loads and compiler barriers:

    node * pop_front()
    {
        node * x = head.load(memory_order_relaxed);
        // sort-of-portable compiler memory barrier
        atomic_signal_fence(std::memory_order_acquire);
        // magic to update head here
        return x;
    }

Alternate implementations that use volatile and compiler-specific fences are also possible. There is no acquire in sight, so carries_dependency would have no effect. This will work fine most of the time.
Then one day you enable PGO and your compiler sees that pop_front always misses the cache (obviously, as it is doing cross-thread communication) and figures out that the address of the node is almost always the same (likely if it comes from a custom allocator):

    node * x = queue.pop_front(); // cache miss, tens or hundreds of cycles
    // x is usually 0x12345600
    frob(x->payload);

The pop and the use of x are in a critical, latency-bound piece of code, so the value speculation optimization kicks in and trades a high-latency load for a low-latency, quite predictable branch:

    node * x = queue.pop_front();
    // compiler-inserted branch, 1-2 cycles
    if (x == 0x12345600)
        x = 0x12345600;
    frob(x->payload);

Thanks to the magic of OoO execution and branch prediction, in the majority of cases frob need not wait for the high-latency load of x and can be speculated ahead. If the address turns out to be different, the OoO machinery rolls back the speculation. On average this can potentially be a big win.

Except that it breaks our code: in the successful-speculation case, the load of payload no longer depends on the consume pop_front: the branch explicitly broke the dependency. Note that no load reordering is done by the compiler, so no amount of compiler memory barriers will help. Also, the code may be broken in completely non-obvious places: between the load from head and the load of payload there might be an arbitrary number of function calls (there is at least one here: pop_front), and they could even be in different translation units.

This is of course only one (the one I'm familiar with) of the optimizations that can break seemingly correct code, but there are others (like aggressive expression reduction) that don't even require PGO. The authors of the C++ memory model knew about this issue and added quite complex rules to prevent the compiler from breaking dependencies.
Compilers that may perform dependency-breaking optimizations are required to preserve consume ordering at function call boundaries by issuing explicit fences and upgrading the consumes to acquires. The alternative would be to require whole-program compilation, which is untenable. Alternatively, carefully sprinkling the code with [[carries_dependency]] can supposedly prevent the compiler from issuing the fence. In practice, implementing the required dependency-tracking machinery in existing compilers has proven too hard, so they issue the fence immediately when performing the initial load; it is therefore possible that the committee will change this part of the standard: see N4036 in the latest mailing. For the time being we are stuck with either load_acquire or very fragile workarounds.
See n2664 and the recent epic thread on gcc-dev.
Is it this thread?
yes. HTH, -- gpd
On Tue, Jun 3, 2014 at 1:38 AM, Giovanni Piero Deretta <gpderetta@gmail.com> wrote: [snip a very helpful example]
This is of course only one (the one I'm familiar with) of the optimizations that can break seemingly correct code, but there are others (like aggressive expression reduction) that don't even require PGO.
I see. So there's basically no way I can protect myself from such code injection. The problem is further complicated by the fact that not all code is "mine" in the case of the atomic<> implementation.
The authors of the C++ memory model knew about this issue and added quite complex rules to prevent the compiler from breaking dependencies. Compilers that may perform dependency-breaking optimizations are required to preserve consume ordering at function call boundaries by issuing explicit fences and upgrading the consumes to acquires. The alternative would be to require whole-program compilation, which is untenable. Alternatively, carefully sprinkling the code with [[carries_dependency]] can supposedly prevent the compiler from issuing the fence.
In practice, implementing the required dependency-tracking machinery in existing compilers has proven too hard, so they issue the fence immediately when performing the initial load; it is therefore possible that the committee will change this part of the standard: see N4036 in the latest mailing.
N4036 looks interesting, although I suspect adding another type qualifier to the language has the potential to break code that currently only accounts for const and volatile. This is more specific to C++. Also, it's not quite clear how atomic::load(consume) could return a dependency-preserving-qualified type while atomic::load(other_memory_order) does not.
For the time being we are stuck with either load_acquire or very fragile workarounds.
Ok, that's what I did in Boost.Atomic. Thank you again for the most helpful comments.
participants (3)

- Andrey Semashev
- Giovanni Piero Deretta
- Tim Blechmann