[atomic, x86] need another pair of eyes

hi all, i need another pair of eyes regarding boost.atomic on x86: the implementation of the memory barrier is merely a compiler barrier, not a CPU barrier, as it uses code like:

    __asm__ __volatile__ ("" ::: "memory");

afaict, one should use a `real' CPU barrier like "mfence" or "lock; addl $0,0(%%esp)". is this correct?

apart from that, i've seen that compare_exchange uses explicit memory barriers before/after the "cmpxchg" instruction. i somehow thought that cmpxchg and its 8b/16b variants implicitly issue a memory barrier, so the resulting code would contain multiple memory barriers. can someone with some insight into the x86 architecture confirm this?

thanks, tim
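For reference, the two kinds of barriers Tim contrasts can be sketched in GCC-style inline assembly on x86-64 (the function names here are illustrative, not Boost.Atomic's; the thread's `%%esp` form is the 32-bit equivalent of the `%%rsp` variant below):

```cpp
#include <cstdint>

// Compiler-only barrier: prevents the compiler from reordering memory
// accesses across this point, but emits no machine instruction at all.
inline void compiler_barrier() {
    __asm__ __volatile__ ("" ::: "memory");
}

// Full hardware barrier via mfence: orders loads and stores in the CPU
// as well as in the compiler.
inline void cpu_barrier_mfence() {
    __asm__ __volatile__ ("mfence" ::: "memory");
}

// Full hardware barrier via a locked no-op read-modify-write on the
// stack; on x86 any lock-prefixed instruction is a full fence, and this
// form is often cheaper than mfence in practice.
inline void cpu_barrier_locked() {
    __asm__ __volatile__ ("lock; addl $0, (%%rsp)" ::: "memory", "cc");
}
```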

Tim Blechmann wrote:
Probably not, but can you please be more specific? In, f.ex., gcc-x86.hpp there are several functions that fit the description. Which one do you think is wrong?
The locked instructions (and XCHG, which is implicitly locked) are indeed "memory barriers" from the CPU's point of view, but not necessarily from the compiler's point of view. The calls to platform_fence_before and platform_fence_after are there for the compiler barrier, as far as I can see.

On Wed, Dec 19, 2012 at 2:36 PM, Tim Blechmann <tim@klingt.org> wrote:
I'm not claiming to be a specialist in IA32, but here's my understanding. There are several groups of functions that Boost.Atomic uses to implement operations. The platform_fence_before/after functions are used to prevent the compiler from reordering the generated code across the atomic op. These functions are also used to enforce hardware fences when required by the memory order argument. The platform_fence_after_load/platform_fence_before_store/platform_fence_after_store functions are used specifically for load and store ops in a similar way. There are also the platform_cmpxchg32_strong/platform_cmpxchg32 functions that perform CAS; these are only used by the generic CAS-based implementations (which is no longer the case for x86, as I rewrote the implementation for Windows).

Now, the "before" functions only need to implement write barriers to prevent stores from travelling below the atomic op. Similarly, the "after" functions only need to implement read barriers. Of course, the barriers are only required when the appropriate memory order is requested by the user. On x86 the memory view is almost always synchronized (AFAIK, the only exception is non-temporal stores, which are usually finalized with an explicit mfence anyway), so unless the user requests memory_order_seq_cst, a compiler barrier alone will suffice. As for memory_order_seq_cst, it requires global sequencing, and here's the part I'm not sure about. Lock-prefixed ops on x86 are full fences themselves, so it looks like no special hardware fence is needed in this case either. So unless I'm missing something, the mfence could be removed in this case as well. Could somebody confirm that?
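The dispatch Andrey describes could be sketched as follows (illustrative only, not the actual Boost.Atomic source; on x86 every order except seq_cst reduces to a compiler barrier, and the seq_cst mfence is exactly the part he questions):

```cpp
#include <atomic>

// Sketch of platform_fence_before: a "before" fence only needs to keep
// preceding stores from sinking below the atomic op; on x86 a compiler
// barrier is enough for release semantics.
inline void platform_fence_before(std::memory_order order) {
    switch (order) {
    case std::memory_order_release:
    case std::memory_order_acq_rel:
    case std::memory_order_seq_cst:
        __asm__ __volatile__ ("" ::: "memory"); // compiler barrier only
        break;
    default:
        break; // relaxed/consume/acquire: nothing needed before the op
    }
}

// Sketch of platform_fence_after: an "after" fence only needs to keep
// subsequent loads from hoisting above the atomic op.
inline void platform_fence_after(std::memory_order order) {
    switch (order) {
    case std::memory_order_acquire:
    case std::memory_order_acq_rel:
        __asm__ __volatile__ ("" ::: "memory"); // compiler barrier only
        break;
    case std::memory_order_seq_cst:
        // Andrey's open question: since lock-prefixed ops are already
        // full fences on x86, this mfence may be removable too.
        __asm__ __volatile__ ("mfence" ::: "memory");
        break;
    default:
        break;
    }
}
```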

Andrey Semashev wrote: On Wed, Dec 19, 2012 at 3:27 PM, Peter Dimov <lists@pdimov.com> wrote:
It doesn't seem necessary, but it's only used in one place, in atomic_flag::clear. All uses of READ_BARRIER or WRITE_BARRIER seem wrong, though. READ_WRITE_BARRIER is needed in all the places where these are used.

On Wed, Dec 19, 2012 at 4:00 PM, Peter Dimov <lists@pdimov.com> wrote:
Ok, thank you, I will correct that.
All uses of READ_BARRIER or WRITE_BARRIER seem wrong, though. READ_WRITE_BARRIER is needed in all places where these are used.
Why? The intention was to prevent outer reads and writes being reordered with the atomic op. Did I miss something?

On Wed, Dec 19, 2012 at 4:28 PM, Peter Dimov <lists@pdimov.com> wrote:
I suppose you're right. I was under the impression that the _ReadBarrier and _WriteBarrier intrinsics were equivalent to acquire and release compiler fences, but reading their docs now, this doesn't seem to be correct. Only _ReadWriteBarrier has the comment that the compiler doesn't reorder operations across it. Thank you, I will correct the code.

It's been quite a while since I had my hands on VC's codegen, but I'm fairly certain the sole purpose of these intrinsics is to prevent reordering in the compiler. They don't result in any code being emitted directly. I guess in real-world cases it's rather unlikely to make a difference, as there's usually some expression with side effects in between memory operations, or there are volatile accesses. However, these intrinsics are needed to avoid things like redundant-load-after-load optimizations. E.g.:

    // address-taken globals
    extern int x;
    extern int y;

    void foo() {
        int lx = x;
        int ly = y;
        use(lx);
        sync_operation();
        lx = x;
        use(lx);
    }

If use and sync_operation have no side effects, the second load of x would usually be optimized out, and the initial load might be ordered after the sync_operation. This is where the _*Barrier intrinsics come in. -hg
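Holger's point can be made concrete with a portable sketch (the GCC fallback is my addition for illustration; on MSVC the macro maps to the _ReadWriteBarrier intrinsic the thread discusses):

```cpp
#include <cassert>

#if defined(_MSC_VER)
#include <intrin.h>
// MSVC: a pure compiler fence, emits no instructions.
#define COMPILER_BARRIER() _ReadWriteBarrier()
#else
// GCC/Clang equivalent: empty asm with a "memory" clobber.
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")
#endif

volatile int flag = 1;

// The barrier stops the compiler from folding the second load of flag
// into the first (redundant-load-after-load elimination).
int read_flag_twice() {
    int a = flag;
    COMPILER_BARRIER(); // the second load must be re-issued
    int b = flag;
    return a + b;
}
```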

hi andrey and peter, to be more specific: one of the test machines at sandia sometimes has test failures for boost.lockfree, which i cannot reproduce on my machine. the algorithm in question (the spsc_queue) only makes use of atomic reads and writes and only requires two memory barriers. the sandia machine is a 4-cpu machine with 8 cores per cpu, while my machines are all single-cpu multi-core.

peter, the compare_exchange is in boost/atomic/detail/gcc-x86.hpp; the code in question looks like:

    bool compare_exchange_strong(value_type & expected, value_type desired,
                                 memory_order success_order,
                                 memory_order failure_order) volatile
    {
        value_type previous = expected;
        platform_fence_before(success_order);
        __asm__ (
            "lock ; cmpxchgl %2, %1"
            : "+a" (previous), "+m" (v_)
            : "r" (desired)
        );
        bool success = (previous == expected);
        if (success)
            platform_fence_after(success_order);
        else
            platform_fence_after(failure_order);
        expected = previous;
        return success;
    }

imo, the platform_fence_before/after is not necessary ...

tim

On Wed, Dec 19, 2012 at 3:40 PM, Tim Blechmann <tim@klingt.org> wrote:
andrey & peter,
could you review this patch?
As I understand it, the hardware fences in platform_fence_before and platform_fence_after are not needed. lfence is a no-op for normal (write-back) memory, as far as I know, so it can be removed as well. The hardware fences in platform_fence_before_store and platform_fence_after_store are not needed, considering how store() operations are implemented: if the user requests seq_cst, the store is implemented as an atomic exchange(), which is a full fence itself. Otherwise the store only needs to be synchronized with the atomic variable itself, and on x86 a regular store with a compiler barrier is sufficient.
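The store() scheme Andrey describes could be sketched like this (illustrative, not the Boost.Atomic source; XCHG with a memory operand is implicitly lock-prefixed and therefore a full fence):

```cpp
#include <atomic>

// Sketch: seq_cst stores go through XCHG; weaker stores are plain MOVs
// bracketed by compiler barriers so the compiler cannot reorder them.
inline void store32(volatile int* p, int v, std::memory_order order) {
    if (order == std::memory_order_seq_cst) {
        // xchg reg, mem is implicitly locked: store + full fence in one.
        __asm__ __volatile__ ("xchgl %0, %1"
                              : "+r" (v), "+m" (*p)
                              :
                              : "memory");
    } else {
        __asm__ __volatile__ ("" ::: "memory"); // compiler barrier
        *p = v;                                 // plain MOV
        __asm__ __volatile__ ("" ::: "memory");
    }
}
```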

On Wed, Dec 19, 2012 at 3:34 PM, Tim Blechmann <tim@klingt.org> wrote:
imo, the platform_fence_before/after is not necessary ...
The cmpxchgl asm statement is not treated as a compiler barrier, so the compiler can reorder the preceding or subsequent operations with it. But if all asm statements are marked as memory-clobbering, it looks like the before/after functions become useless and can be removed.
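A sketch of what Andrey is suggesting (illustrative, not the patched Boost.Atomic code): with `__volatile__` and a `"memory"` clobber, the CAS statement is itself a compiler barrier, and since lock-prefixed instructions are full hardware fences on x86, no separate fence calls are needed around it.

```cpp
// Standalone CAS on a 32-bit value; the "memory" clobber makes this asm
// a compiler barrier, and the lock prefix makes it a full CPU fence.
inline bool cas32(volatile int* p, int& expected, int desired) {
    int previous = expected;
    __asm__ __volatile__ ("lock; cmpxchgl %2, %1"
                          : "+a" (previous), "+m" (*p)
                          : "r" (desired)
                          : "cc", "memory");
    bool success = (previous == expected);
    expected = previous; // on failure, report the observed value
    return success;
}
```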

On Wed, Dec 19, 2012 at 6:34 AM, Tim Blechmann <tim@klingt.org> wrote:
...
imo, the platform_fence_before/after is not necessary ...
tim
If there is a bug, _removing_ barriers isn't going to solve it. Or do I misunderstand the situation? (I agree that the barriers may be unnecessary, but I don't see how it helps with the test failures.) Tony

Tim Blechmann wrote:
unfortunately this test still failed on one of the sandia machines (gcc-4.7.1_0x) ...
I don't know if this is the cause of the failure, but the test still has a race condition: if get() calls get_elements(), which returns false, and the writer then pushes the last elements and sets running to false, get() will return without consuming them. I suppose it needs to be something along the lines of:

    void get(void)
    {
        while (running) {
            get_elements();
        }
        while (get_elements());
    }

or equivalent.

Incidentally, the mfence in gcc-x86.hpp line 86, in platform_fence_after_load, is unnecessary as well, if I'm not mistaken. The semantics of memory_order_seq_cst are (intentionally) such that it doesn't require fences on the load side, as long as all seq_cst stores or modifications are locked or done with xchg (which has an implicit lock prefix). This also seems to apply to windows.hpp, line 203.
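Peter's observation matches the usual x86 mapping, which could be sketched as (illustrative names; sequential consistency is paid for on the store side, so the load side stays cheap):

```cpp
// Sketch of a seq_cst load on x86: a plain MOV plus a compiler barrier
// is enough, provided every seq_cst store in the program is done with
// XCHG or a lock-prefixed instruction (which are full fences).
inline int load32_seq_cst(const volatile int* p) {
    int v = *p;                              // plain MOV
    __asm__ __volatile__ ("" ::: "memory");  // compiler barrier only
    return v;
}
```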

Tim Blechmann wrote:
i need another pair of eyes regarding boost.atomic on x86:
It's not only x86 that needs more eyes. I wrote the ARM code, and it is full of FIXMEs. Although I believe that it basically works, I don't know anything like enough about e.g. memory barriers and the current code is little more than a guess. This really should be reviewed by someone more knowledgeable before it is released. One thing that definitely needs attention is the ARM architecture versions detection stuff, which needs to cope with "ARM v7S" (related: https://svn.boost.org/trac/boost/ticket/7599). Also, how do people feel about using Boost.Atomic underneath shared_ptr? (https://svn.boost.org/trac/boost/ticket/5625 - again, I've been using this for a long time and it seems to work.) Maybe as a non-default initially if we're nervous. It might help to find any bugs in Boost.Atomic... Regards, Phil.

On Saturday 22 December 2012 14:44:40 Phil Endecott wrote:
In general I welcome this suggestion. However, shared_ptr probably should not require linking to Boost.Atomic, so some header-only mode should be provided by Boost.Atomic first.

On Saturday 22 December 2012 16:06:33 Tim Blechmann wrote:
I was thinking about splitting the implementation into a guaranteed-to-be-lock-free part and a lock-pool-based part (the latter requires linking to the library). The lock-free part should be accessible through a separate public header, and it should not include the lock-based part. boost/atomic.hpp should still include both parts, thus providing the complete atomic<> implementation. It's rather straightforward: basically, the base_atomic<> implementation should be moved from base.hpp into a separate header, and base.hpp should only contain forward declarations. The auto-linking code from config.hpp should be extracted as well and only included when the lock-pool-based implementation is included.
participants (6)
- Andrey Semashev
- Gottlob Frege
- Holger Grund
- Peter Dimov
- Phil Endecott
- Tim Blechmann