Re: [boost] [atomic] Help understanding consume order

2 Jun 2014

      On Mon, Jun 2, 2014 at 6:11 PM, Andrey Semashev <andrey.semashev@gmail.com>
wrote:
...
On Monday 02 June 2014 12:42:04 Giovanni Piero Deretta wrote:
...
On Sun, Jun 1, 2014 at 11:58 AM, Andrey Semashev <
andrey.semashev@gmail.com>
wrote:
ARM and many other RMO architectures (like PPC and unlike Alpha),
guarantee
that a load and the load it depends on won't be reordered, so, together
with the release operation on the writer side, the load_consume
guarantees
the visibility of the string body.
Ok, so basically, on ARM and PPC consume operation should be equivalent to
relaxed + a compiler barrier to prevent possible code reordering, am I
right?
I'm afraid things are a bit more complex. It is not just a problem of
reordering, so memory barriers are not enough. See below.
...
...
...
I guess, that's the ultimate question: how should consume ordering be
handled
on conventional architectures.
That's hard to do without compiler help unfortunately. Compilers have
started doing some quite aggressive optimisations (like value speculation
and PGO) that can break loads dependencies. The linux kernel for example
gets by by explicitly disabling those optimisations, not doing PGO and
targeting a specific compiler.
I hope, speculative loads and stores can be prevented with compiler
barriers.
But it looks like PGO/IPO/LTO may ruin the code involving this memory
ordering. I suppose, even if I put a compiler barrier (which in case of
gcc is
an empty asm statement with memory in the clobber list) it will not be in
effect during the optimization phase?
Modulo bugs, GCC will respect memory barriers for all optimizations. Lemme
show an example that can break. Suppose you have a pair of intrusive single
consumer single producer memory queues.

Two threads use the two queues for bidirectional access and most of the
time can enqueue nodes popped from one queue to the other queue; if the
threads are able to run in lock step, the queue size will always be of size
1, the same node is always reused. Occasionally, when there is some
contention, one thread might be slowed down and the other thread might need
to allocate a new node.
struct queue
{
      void push_back(node * p); // release
      node * pop_front();            // consume
};

The head and tail pointers of the queue are atomic<node*>, but the node
content is ever accessed by a single thread at a time, so the nodes contain
no atomic objects. A consumer can safely access the content of the node as
the node address depends on the load from the queue back

Let's not go in too much in implementation details, but assume that
push_back can be implemented via just store_release stores and pop_front
with load_consume loads; in particular pop_front will look something like
this:

[[carries_dependency]]
node * pop_front()
{
   node * x = head.load(memory_order_consume);

   // magic to update head and node goes here

   return x;
}

Note the "carries dependency", we will get back to it.

Now, let's say that the implementation of load_consume for your compiler is
not optimal, so you attempt to roll your own semi portable implementation
with of relaxed loads and compiler barriers:

node * pop_front()
{
   node * x = head.load(memory_order_relaxed);
   // sort-of-portable compiler memory barrier
   atomic_signal_fence(std::memory_order_acquire);

   // magic to update head here
   return x;
}

Alternate implementations that use volatile and compiler specific fences
are also possible. There is no acquire in sight, so "carries dependency"
would have no effect.

This will work fine most of the time. Then one day you enable PGO and your
compiler sees that the pop_front always misses the cache (obviously as it
is doing cross thread communication) and figures out that the address of
the node is almost always the same (likely if it comes from a custom
allocator):

node * x = queue.pop_front(); // cache miss, tens or hundreds of cycles

// x is usually 0x12345600

frob(x->payload);

pop and use of x are on a critical piece of code which is latency bound, so
the speculative value optimization kicks in and trades an high latency load
with a low latency and quite predictable branch:

node * x = queue.pop_front();

// compiler inserted branch, 1-2 cycles
if (x == 0x12345600)
    x = 0x12345600;

frob(x->payload);

Thanks to the magic of OoO execution and branch prediction, in the majority
of cases frob needs not to wait for the high latency load of x and can be
speculated ahead.  If the address turned out to be different, the OoO
machinery will roll back the speculation. On average this can potentially
be a big win.

Except that it breaks our code: on the successful speculation case, the
load of payload no longer depends on the consume pop_front: the branch
explicitly broke the dependency. Note that no load reordering is done by
the compiler, so no amount of compiler memory barriers will help.  Also the
code may be broken in completely non obvious places: between the load from
head and the load of payload there might be an arbitrary amount of function
calls (there is at least one here: pop_front) and they could even be on
different translation units.

This is of course only one (the one that I'm familiar with) of the
optimizations that can break seemly correct code, but there are others
(like aggressive expression reduction) that not even require PGO.

The authors of the C++ memory model knew about this issue and added quite
complex rules to prevent the compiler from breaking dependencies. Compilers
that may perform dependency breaking optimizations, are required to
preserve consume ordering at function calls boundaries by issuing explicit
fences and upgrading the consumes to acquires. The alternative would be to
require whole program compilation which is untenable. Alternatively,
sprinkling the code carefully with [[carries_dependency]]supposedly can
prevent the compiler from issuing the fence.

In practice implementing the required dependency tracking machinery on
existing compilers has proven too hard so they issue the fence immediately
when performing the initial load, so it is possible that the committee will
change this part of the standard: see N4036 on the last mailing.

For the time being we are stuck with either load_acquire or very fragile
workarounds.
...
...
See n2664 and the recent epic thread on gcc-dev.
Is it this thread?
https://gcc.gnu.org/ml/gcc/2014-02/msg00052.html
yes.

HTH,

-- gpd