
Hi Tim,

Hopefully I'll have at least a partial review. I only know enough about the subject to be a little dangerous. (One course in concurrent data structures, one successfully implemented lockfree hash table - except for the memory reclamation. Unspecified barriers and ABA problems. :p )

Considering the difficulty of the subject, you seem to have carefully chosen a small set of algorithms where at least there is consensus that the sources you are using have solved the problem correctly.

I apologize if this has already been covered, but I don't see anything in the documentation about which platforms and compilers this has been tested on. These algorithms are extremely susceptible to compiler and processor optimizations.

Anyway, I thought I would try to run the tests, and maybe try changing the testing methodology a little, but I got a few compiler errors, below. Defining uint fixed most of these, but bench_1.cpp also appears to be out of date? Of course I also have my lack of understanding of boost build library linking to contend with. Hope to try again later.

Cheers, Gordon

~/lib/boost_lockfree/libs/lockfree/test ~/src/boost-trunk
grodo-mcwoohu:test gordon$ bjam
...found 141 targets...
...updating 21 targets...
darwin.compile.c++ ../../../bin.v2/libs/lockfree/test/bench_1.test/darwin-4.5.3/debug/threading-multi/bench_1.o
g++: unrecognized option '-no-cpp-precomp'
bench_1.cpp: In function 'void test_fifo_pop()':
bench_1.cpp:67:23: error: no matching function for call to 'boost::lockfree::fifo<long int>::dequeue(long int*)'
../../../boost/lockfree/fifo.hpp:214:10: note: candidate is: bool boost::lockfree::detail::fifo<T, freelist_t, Alloc>::dequeue(T&) [with T = long int, freelist_t = boost::lockfree::caching_freelist_t, Alloc = std::allocator<long int>]
bench_1.cpp: In function 'void test_fifo_pop_unsafe()':
bench_1.cpp:77:30: error: no matching function for call to 'boost::lockfree::fifo<long int>::dequeue_unsafe(long int*)'
../../../boost/lockfree/fifo.hpp:255:10: note: candidate is: bool boost::lockfree::detail::fifo<T, freelist_t, Alloc>::dequeue_unsafe(T&) [with T = long int, freelist_t = boost::lockfree::caching_freelist_t, Alloc = std::allocator<long int>]
bench_1.cpp: In function 'void test_stack_pop()':
bench_1.cpp:87:19: error: no matching function for call to 'boost::lockfree::stack<long int>::pop(long int*)'
../../../boost/lockfree/stack.hpp:179:10: note: candidate is: bool boost::lockfree::stack<T, freelist_t, Alloc>::pop(T&) [with T = long int, freelist_t = boost::lockfree::caching_freelist_t, Alloc = std::allocator<long int>]
bench_1.cpp: In function 'void test_stack_pop_unsafe()':
bench_1.cpp:97:26: error: no matching function for call to 'boost::lockfree::stack<long int>::pop_unsafe(long int*)'
../../../boost/lockfree/stack.hpp:207:10: note: candidate is: bool boost::lockfree::stack<T, freelist_t, Alloc>::pop_unsafe(T&) [with T = long int, freelist_t = boost::lockfree::caching_freelist_t, Alloc = std::allocator<long int>]
"g++" -ftemplate-depth-128 -O0 -fno-inline -Wall -g -dynamic -no-cpp-precomp -gdwarf-2 -fexceptions -fPIC -I"../../.." -c -o "../../../bin.v2/libs/lockfree/test/bench_1.test/darwin-4.5.3/debug/threading-multi/bench_1.o" "bench_1.cpp"
...failed darwin.compile.c++ ../../../bin.v2/libs/lockfree/test/bench_1.test/darwin-4.5.3/debug/threading-multi/bench_1.o...
...skipped <p../../../bin.v2/libs/lockfree/test/bench_1.test/darwin-4.5.3/debug/threading-multi>bench_1 for lack of <p../../../bin.v2/libs/lockfree/test/bench_1.test/darwin-4.5.3/debug/threading-multi>bench_1.o...
...skipped <p../../../bin.v2/libs/lockfree/test/bench_1.test/darwin-4.5.3/debug/threading-multi>bench_1.run for lack of <p../../../bin.v2/libs/lockfree/test/bench_1.test/darwin-4.5.3/debug/threading-multi>bench_1...
darwin.compile.c++ ../../../bin.v2/libs/lockfree/test/fifo_test.test/darwin-4.5.3/debug/threading-multi/fifo_test.o
g++: unrecognized option '-no-cpp-precomp'
fifo_test.cpp:103:18: error: 'uint' does not name a type
fifo_test.cpp: In member function 'void fifo_tester<freelist_t>::add()':
fifo_test.cpp:116:14: error: 'uint' was not declared in this scope
fifo_test.cpp:116:19: error: expected ';' before 'i'
fifo_test.cpp:116:26: error: 'i' was not declared in this scope
fifo_test.cpp:116:31: error: 'nodes_per_thread' was not declared in this scope
fifo_test.cpp: In member function 'void fifo_tester<freelist_t>::run()':
fifo_test.cpp:187:1: error: 'nodes_per_thread' was not declared in this scope
"g++" -ftemplate-depth-128 -O0 -fno-inline -Wall -g -dynamic -no-cpp-precomp -gdwarf-2 -fexceptions -fPIC -I"../../.." -c -o "../../../bin.v2/libs/lockfree/test/fifo_test.test/darwin-4.5.3/debug/threading-multi/fifo_test.o" "fifo_test.cpp"
...failed darwin.compile.c++ ../../../bin.v2/libs/lockfree/test/fifo_test.test/darwin-4.5.3/debug/threading-multi/fifo_test.o...
...skipped <p../../../bin.v2/libs/lockfree/test/fifo_test.test/darwin-4.5.3/debug/threading-multi>fifo_test for lack of <p../../../bin.v2/libs/lockfree/test/fifo_test.test/darwin-4.5.3/debug/threading-multi>fifo_test.o...
...skipped <p../../../bin.v2/libs/lockfree/test/fifo_test.test/darwin-4.5.3/debug/threading-multi>fifo_test.run for lack of <p../../../bin.v2/libs/lockfree/test/fifo_test.test/darwin-4.5.3/debug/threading-multi>fifo_test...
darwin.link ../../../bin.v2/libs/lockfree/test/freelist_test.test/darwin-4.5.3/debug/threading-multi/freelist_test
ld: library not found for -lboost_thread
collect2: ld returned 1 exit status
"g++" -o "../../../bin.v2/libs/lockfree/test/freelist_test.test/darwin-4.5.3/debug/threading-multi/freelist_test" "../../../bin.v2/libs/lockfree/test/freelist_test.test/darwin-4.5.3/debug/threading-multi/freelist_test.o" -lboost_thread -lboost_unit_test_framework -g
...failed darwin.link ../../../bin.v2/libs/lockfree/test/freelist_test.test/darwin-4.5.3/debug/threading-multi/freelist_test...
...skipped <p../../../bin.v2/libs/lockfree/test/freelist_test.test/darwin-4.5.3/debug/threading-multi>freelist_test.run for lack of <p../../../bin.v2/libs/lockfree/test/freelist_test.test/darwin-4.5.3/debug/threading-multi>freelist_test...
darwin.compile.c++ ../../../bin.v2/libs/lockfree/test/ringbuffer_test.test/darwin-4.5.3/debug/threading-multi/ringbuffer_test.o
g++: unrecognized option '-no-cpp-precomp'
ringbuffer_test.cpp:209:14: error: 'uint' does not name a type
ringbuffer_test.cpp: In member function 'void ringbuffer_tester::add()':
ringbuffer_test.cpp:225:14: error: 'uint' was not declared in this scope
ringbuffer_test.cpp:225:19: error: expected ';' before 'i'
ringbuffer_test.cpp:225:26: error: 'i' was not declared in this scope
ringbuffer_test.cpp:225:31: error: 'nodes_per_thread' was not declared in this scope
ringbuffer_test.cpp: In member function 'void ringbuffer_tester::run()':
ringbuffer_test.cpp:284:1: error: 'nodes_per_thread' was not declared in this scope
ringbuffer_test.cpp: In member function 'void ringbuffer_tester_buffering::add()':
ringbuffer_test.cpp:315:14: error: 'uint' was not declared in this scope
ringbuffer_test.cpp:315:19: error: expected ';' before 'i'
ringbuffer_test.cpp:315:26: error: 'i' was not declared in this scope
ringbuffer_test.cpp:315:31: error: 'nodes_per_thread' was not declared in this scope
ringbuffer_test.cpp: In member function 'void ringbuffer_tester_buffering::run()':
ringbuffer_test.cpp:386:1: error: 'nodes_per_thread' was not declared in this scope

hi gordon,
I apologize if this has already been covered, but I don't see anything in the documentation about which platforms and compilers this has been tested on. These algorithms are extremely susceptible to compiler and processor optimizations.
boost.lockfree should NOT be susceptible to compiler/processor reordering, mainly because it relies on boost.atomic for atomic operations and memory barriers. i will include a page about tested platforms/compilers.
Defining uint fixed most of these, but bench_1.cpp also appears to be out of date?
ouch ... bench_1.cpp should have been removed before the review. and i should replace uint with boost::uint32_t.
Of course I also have my lack of understanding of boost build library linking to contend with. Hope to try again later.
btw, the tarball also includes a cmake-based build system. cheers, tim

Hi Tim, Hartmut,

I apologize that this is late and not an in-depth review. I think the library should be accepted.

IMO the big question with a library like this is not so much the design as the correctness. What can we do as reviewers to verify the correctness, assuming we are not lockfree gurus ourselves? Go back to the original sources. So I did; see below. The other important point is to verify that the author is going to maintain the library in case problems do crop up. Tim seems to have a good attitude about suggestions, so I have no doubts here.

- What is your evaluation of the design?

I am not hugely concerned about naming but it's important that the naming scheme scales as more algorithms are added. What if there are multiple good algorithms for the same functionality, with different tradeoffs? I think the most accurate thing would be to catalogue the algorithms instead of the "data structures", which I find to be a confusing term in this domain. However, it would be painful for users to have to refer to the michael_and_scott_queue_algo or whatever, so I'm not sure what the best solution is. I like Tony's suggestion, but I'm not clear if the attributes are just going to keep multiplying.

Could the freelist be moved into the public interface? This is a useful data structure and a simple way to avoid calls to the memory allocator.

- What is your evaluation of the implementation?

I verified the queue implementation against the Michael/Scott pseudo-code. It matches pretty much line-for-line. Since mathematical proofs are difficult to grasp and these algorithms are usually proved by the test of time, being able to compare against known-good code is essential.

There's one exception: an extra check on next_ptr is commented:

* however we reuse the tagged_ptr part for the and clear the next part during node
* allocation. we can observe a null-pointer here.

There's a word missing there. Are you saying that next_ptr can be null unlike in the original algo, because null is used to mark a new node? Does this affect any other part of the algo? This should also get a paragraph in the Rationale.

- What is your evaluation of the documentation?

It seems rather light. I'd like to see references for the other two algorithms, which the doc currently ascribes to "folklore" - if someone would like to verify them in the way I did for the queue, where would they go? What could they read to understand how it works?

I'd like to see some performance comparisons on a couple of architectures. It is okay to have a big disclaimer saying these results may not be typical and the user should do their own benchmarks.

I am not experienced enough to evaluate whether the use of barriers and the c++ memory model is correct. I would like to see something in the rationale describing how these choices are made. I also thought I recalled Tim saying that one of the algorithms might require more barriers for untested architectures, but I couldn't find anything about that in the documentation.

As Phil mentioned, the documentation should have a section about when/when not to use lock-free. It doesn't have to be authoritative and can refer to other web resources. This is really just an expansion of the Performance section of the intro.

- What is your evaluation of the potential usefulness of the library?

Lock-free can perform better than locked data structures, and what's more important, it can have more predictable performance. I am surprised by Julien's finding that the fifo actually slows down with more processors; in my limited experience I would think this would happen with locked but not lockfree data structures. Anyway, it is always good to have more choices to compare, and in realtime applications, these structures can be essential.

- Did you try to use the library? With what compiler? Did you have any problems?

Unfortunately, I did not.

- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

I spent about 5-6 hours reading the docs, comparing code against pseudo-code, and writing this.

- Are you knowledgeable about the problem domain?

As well as having done a good amount of debugging concurrent programs, I took an excellent Multicore class with Alberto Lerner at NYU. For the final project, my team successfully implemented a lock-free hash table, the Split-Ordered-List of Shalev/Shavit, which uses the lock-free list-based set of Michael. It proved to have far superior performance to the naive implementation of simply putting a read-write lock on an unordered_set. (Near-linear speedup versus horrible slowdowns for the locked version. We did not compare against TBB.)

However, without a double-width compare-and-swap on all the compilers we were using (i.e. no Boost.Atomic), we were forced to use Michael's Safe Memory Reclamation instead of freelists, and as the paper says, "memory barriers may be needed" - but where? We ran into horrible ABA problems. I guess that algo is patented anyway.

We could clean up the split-ordered list code and adapt it to freelists for donation to Lockfree, if the library is accepted.

Cheers, Gordon

hi gordon, thanks for the review!
IMO the big question with a library like this is not so much the design as the correctness. What can we do as reviewers to verify the correctness, assuming we are not lockfree gurus ourselves? Go back to the original sources. So I did; see below.
well, there is one problem with the publications by the lock-free gurus: they don't care about the implementation, but assume a sequentially consistent memory model, so they don't need to care about the required memory barriers or the like ...
Could the freelist be moved into the public interface? This is a useful data structure and a simple way to avoid calls to the memory allocator.
i need to think about this. i have changed the freelist implementation a bit in order to use array indices as an alternative to pointers (for ll/sc platforms without support for double-word CAS). the new interface is rather complex because i need to resolve from a `handle' (pointer or 16bit index) to a node pointer (and back). maybe tagged_pointer and freelist can be exposed in the public interface at a later point, but for now i hesitate to expose the full freelist interface. maybe a simplified version could be exposed?
- What is your evaluation of the implementation?
I verified the queue implementation against the Michael/Scott pseudo-code. It matches pretty much line-for-line. Since mathematical proofs are difficult to grasp and these algorithms are usually proved by the test of time, being able to compare against known-good code is essential.
There's one exception: an extra check on next_ptr is commented:
* however we reuse the tagged_ptr part for the and clear the next part during node
* allocation. we can observe a null-pointer here.
There's a word missing there. Are you saying that next_ptr can be null unlike in the original algo, because null is used to mark a new node? Does this affect any other part of the algo? This should also get a paragraph in the Rationale.
good eye, the missing word is `freelist'. the michael/scott paper mentions that they use a freelist, but they simply omit that part from the pseudocode. in my implementation, the same memory location is used for linking the nodes of the queue and the nodes of the freelist.
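(for illustration, a minimal sketch of the node layout this describes - the names are made up, not boost.lockfree's actual code:)

    #include <atomic>

    // the same 'next' field links a node into the queue while it is
    // enqueued and into the freelist while it is free; allocation
    // clears it, so a concurrent dequeue can legitimately observe a
    // null next pointer - hence the extra check discussed above
    template <typename T>
    struct node
    {
        std::atomic<node*> next{nullptr}; // queue link OR freelist link
        T data;
    };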
- What is your evaluation of the documentation?
It seems rather light.
I'd like to see references for the other two algorithms, which the doc currently ascribes to "folklore" - if someone would like to verify them in the way I did for the queue, where would they go? What could they read to understand how it works?
i should probably add a recommendation to `the art of multiprocessor programming'. iirc it also covers the michael/scott queue ...
I'd like to see some performance comparisons on a couple of architectures. It is okay to have a big disclaimer saying these results may not be typical and the user should do their own benchmarks.
i've started to implement a set of simple benchmarks. to include results, i will need some help (my access to multiprocessor hardware is limited to my workstation and my laptop)
I am not experienced enough to evaluate whether the use of barriers and the c++ memory model is correct. I would like to see something in the rationale describing how these choices are made. I also thought I recalled Tim saying that one of the algorithms might require more barriers for untested architectures, but I couldn't find anything about that in the documentation.
hopefully this is not the case anymore: all memory barriers are abstracted in the memory model of atomic<>. no memory barrier needs to be issued manually, but all barriers are issued via atomic<> members.
As Phil mentioned, the documentation should have a section about when/when not to use lock-free. It doesn't have to be authoritative and can refer to other web resources. This is really just an expansion of the Performance section of the intro.
ok
- What is your evaluation of the potential usefulness of the library?
Lock-free can perform better than locked data structures, and what's more important, it can have more predictable performance. I am surprised by Julien's finding that the fifo actually slows down with more processors; in my limited experience I would think this would happen with locked but not lockfree data structures.
lock-free data structures cannot scale linearly: when multiple threads try to perform the same operation, one compare_exchange will fail and one of the threads has to retry. the more threads, the more retries.
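(to make the retry behavior concrete, a minimal sketch of the usual compare_exchange loop - illustrative, not the library's internals:)

    #include <atomic>

    std::atomic<int> counter(0);

    // under contention, only one thread's compare_exchange succeeds per
    // round; every other thread observes a changed value and must retry
    void increment()
    {
        int expected = counter.load();
        while (!counter.compare_exchange_weak(expected, expected + 1))
        {
            // on failure, 'expected' is reloaded with the current value,
            // so the loop simply retries with fresh state
        }
    }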
- Are you knowledgeable about the problem domain?
As well as having done a good amount of debugging concurrent programs, I took an excellent Multicore class with Alberto Lerner at NYU.
For the final project, my team successfully implemented a lock-free hash table, the Split-Ordered-List of Shalev/Shavit, which uses the lock-free list-based set of Michael. It proved to have far superior performance to the naive implementation of simply putting a read-write lock on an unordered_set. (Near-linear speedup versus horrible slowdowns for the locked version. We did not compare against TBB.)
However, without a double-width compare-and-swap on all the compilers we were using (i.e. no Boost.Atomic), we were forced to use Michael's Safe Memory Reclamation instead of freelists, and as the paper says, "memory barriers may be needed" - but where? We ran into horrible ABA problems. I guess that algo is patented anyway.
hazard pointers are patented, so they are not an option for boost. finding the right positions for memory barriers can be quite hard and usually requires a full understanding of the data structure. atomic<> will make it way easier to write a correct algorithm, since its member functions default to a sequentially consistent memory model.
We could clean up the split-ordered list code and adapt it to freelists for donation to Lockfree, if the library is accepted.
as a workaround for the dcas requirement one can use array indices (the reason for my current freelist refactoring). if i feel comfortable with your implementation, i am happy to adopt it! cheers, tim
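(for illustration, a hedged sketch of the array-index workaround tim mentions - made-up names and field widths, not boost.lockfree's actual freelist: packing a 16-bit node index and a 16-bit ABA tag into one 32-bit word lets a single-width compare_exchange update both at once.)

    #include <atomic>
    #include <cstdint>

    struct tagged_index
    {
        std::uint32_t bits; // low 16 bits: node index, high 16: ABA tag

        std::uint16_t index() const { return bits & 0xffffu; }
        std::uint16_t tag()   const { return std::uint16_t(bits >> 16); }

        static tagged_index make(std::uint16_t index, std::uint16_t tag)
        {
            return tagged_index{ std::uint32_t(index) |
                                (std::uint32_t(tag) << 16) };
        }
    };

    // the freelist head then fits in one machine word:
    // std::atomic<std::uint32_t> freelist_head;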

Hi Tim! On Jul 29, 2011, at 3:00 PM, Tim Blechmann wrote:
IMO the big question with a library like this is not so much the design as the correctness. What can we do as reviewers to verify the correctness, assuming we are not lockfree gurus ourselves? Go back to the original sources. So I did; see below.
well, there is one problem with the publications by the lock-free gurus: they don't care about the implementation, but assume a sequentially consistent memory model, so they don't need to care about the required memory barriers or the like ...
Yes. Your rationale that the default options for atomic operations should ensure sequential consistency makes sense to me (from my limited experience).
Could the freelist be moved into the public interface? This is a useful data structure and a simple way to avoid calls to the memory allocator.

i need to think about this. i have changed the freelist implementation a bit in order to use array indices as an alternative to pointers (for ll/sc platforms without support for double-word CAS).
It's better to say double-width CAS - there was a double-word CAS (on two separate memory locations) that died out a long time ago.
the new interface is rather complex because i need to resolve from a `handle' (pointer or 16bit index) to a node pointer (and back). maybe tagged_pointer and freelist can be exposed in the public interface at a later point, but for now i hesitate to expose the full freelist interface. maybe a simplified version could be exposed?
An interesting problem. If my team adapts split-ordered list to freelists, it will give another basis for the abstraction.
good eye, the missing word is `freelist'. the michael/scott paper mentions that they use a freelist, but they simply omit that part from the pseudocode. in my implementation, the same memory location is used for linking the nodes of the queue and the nodes of the freelist.
Aha. I'd like to see this kind of stuff in the documentation, either on the pages describing each algo or in the Rationale.
I'd like to see references for the other two algorithms, which the doc currently ascribes to "folklore" - if someone would like to verify them in the way I did for the queue, where would they go? What could they read to understand how it works?
i should probably add a recommendation to `the art of multiprocessor programming'. iirc it also covers the michael/scott queue ...
Oh cool, I have that but I haven't looked at it. If the 'folklore' is via there, please say so.

So I looked at Herlihy's stack versus yours, and there is one mismatch in push():

    bool push(T const & v)
    {
        node * newnode = pool.construct(v);

        if (newnode == 0)
            return false;

        tagged_node_ptr old_tos = tos.load(detail::memory_order_relaxed);

        for (;;) {
            tagged_node_ptr new_tos (newnode, old_tos.get_tag());
            newnode->next.set_ptr(old_tos.get_ptr());

            if (tos.compare_exchange_weak(old_tos, new_tos))
                return true;
        }
    }

In the Herlihy implementation, old_tos is fetched effectively at the beginning of the loop. It seems to me that with your code, if someone else pushes or pops between when old_tos is fetched and the CAS, it will keep on looping until it sees the same node that used to be at the top is on top again. In other words, you want to verify that top still points to what you just put in new->next, not that top again points to whatever it pointed to before you started looping.

I have not compared the 'folkloric' queue yet.
I'd like to see some performance comparisons on a couple of architectures. It is okay to have a big disclaimer saying these results may not be typical and the user should do their own benchmarks.
i've started to implement a set of simple benchmarks. to include results, i will need some help (my access to multiprocessor hardware is limited to my workstation and my laptop)
Yes, I'm sure myself and others can find resources for that.
I am not experienced enough to evaluate whether the use of barriers and the c++ memory model is correct. I would like to see something in the rationale describing how these choices are made. I also thought I recalled Tim saying that one of the algorithms might require more barriers for untested architectures, but I couldn't find anything about that in the documentation.
hopefully this is not the case anymore: all memory barriers are abstracted in the memory model of atomic<>. no memory barrier needs to be issued manually, but all barriers are issued via atomic<> members.
Aha, the new memory model is starting to make a little sense to me!
lock-free data structures cannot scale linearly: when multiple threads try to perform the same operation, one compare_exchange will fail and one of the threads has to retry. the more threads, the more retries.
Well, you happen to have chosen the data structures with the tightest memory access patterns. :-D Since you are only accessing the beginning and end of the stack/queue, you're seeing massive contention. A hash table can scale linearly. I didn't believe it until I saw it with my own eyes!
hazard pointers are patented, so they are not an option for boost.
I didn't like them anyway. (sour grapes, actually they seem the safest way. sigh.)
finding the right positions for memory barriers can be quite hard and usually requires a full understanding of the data structure. atomic<> will make it way easier to write a correct algorithm, since its member functions default to a sequentially consistent memory model.
Thank you for the illumination! I now feel ready to try reading Anthony Williams' book again. The memory model stuff in there gave me the willies at first glance; hopefully it has improved or I have.
We could clean up the split-ordered list code and adapt it to freelists for donation to Lockfree, if the library is accepted.
as a workaround for the dcas requirement one can use array indices (the reason for my current freelist refactoring).
Ideally the freelist could have one consistent interface with different underlying behavior (i.e. size of return type) depending on what's supported. I don't know if this is possible. I assume your freelist has the problem that the size of the counter tag determines the probability of some horrible silent ABA failure? This should also be documented.
if i feel comfortable with your implementation, i am happy to adopt it!
I'm checking with my team to see if they're still willing to put in the effort to bring this to Boost. Obviously I'd also put in good work and be ultimately responsible for it if it happens. Cheers, Gordon

on Fri Jul 29 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 29, 2011, at 3:00 PM, Tim Blechmann wrote:
IMO the big question with a library like this is not so much the design as the correctness. What can we do as reviewers to verify the correctness, assuming we are not lockfree gurus ourselves? Go back to the original sources. So I did; see below.
well, there is one problem with the publications by the lock-free gurus: they don't care about the implementation, but assume a sequentially consistent memory model, so they don't need to care about the required memory barriers or the like ...
Yes. Your rationale that the default options for atomic operations should ensure sequential consistency makes sense to me (from my limited experience).
Personally I never understood how sequential consistency could be much of a help in reasoning about multithreaded programs... but that's just me. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Jul 29, 2011, at 7:36 PM, Dave Abrahams <dave@boostpro.com> wrote:
Personally I never understood how sequential consistency could be much of a help in reasoning about multithreaded programs... but that's just me.
True, it's bad enough with sequential consistency, but without it there's no hope of understanding at all IMO.

on Fri Jul 29 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 29, 2011, at 7:36 PM, Dave Abrahams <dave@boostpro.com> wrote:
Personally I never understood how sequential consistency could be much of a help in reasoning about multithreaded programs... but that's just me.
True, it's bad enough with sequential consistency, but without it there's no hope of understanding at all IMO.
I think I meant the opposite, or nearly so. How does not having sequential consistency make reasoning worse? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Jul 30, 2011, at 8:40 AM, Dave Abrahams <dave@boostpro.com> wrote:
on Fri Jul 29 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 29, 2011, at 7:36 PM, Dave Abrahams <dave@boostpro.com> wrote:
Personally I never understood how sequential consistency could be much of a help in reasoning about multithreaded programs... but that's just me.
True, it's bad enough with sequential consistency, but without it there's no hope of understanding at all IMO.
I think I meant the opposite, or nearly so. How does not having sequential consistency make reasoning worse?
I must be missing something. Without it you can't depend on the order of operations even within one block and one thread, right? E.g. we were talking about translating concurrent pseudocode to real code. Without sequential consistency it might require inserting memory barriers and it's never clear where, or if some other architecture is going to need different ones. How is it not better to at least understand that a single piece of code will do what it says in the same order?

on Sat Jul 30 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 30, 2011, at 8:40 AM, Dave Abrahams <dave@boostpro.com> wrote:
on Fri Jul 29 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 29, 2011, at 7:36 PM, Dave Abrahams <dave@boostpro.com> wrote:
Personally I never understood how sequential consistency could be much of a help in reasoning about multithreaded programs... but that's just me.
True, it's bad enough with sequential consistency, but without it there's no hope of understanding at all IMO.
I think I meant the opposite, or nearly so. How does not having sequential consistency make reasoning worse?
I must be missing something. Without it you can't depend on the order of operations even within one block and one thread, right?
Doesn't the (C++03) standard make some guarantees about sequence points that make ordering well-defined in the single-threaded case, without ever introducing the notion of sequential consistency?
E.g. we were talking about translating concurrent pseudocode to real code. Without sequential consistency it might require inserting memory barriers and it's never clear where, or if some other architecture is going to need different ones.
How is it not better to at least understand that a single piece of code will do what it says in the same order?
Could you show me an example of some single-threaded code that can't be understood on the basis of the C++03 standard alone, and give some examples of the multiple possible interpretations of its semantics? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Jul 30, 2011, at 5:10 PM, Dave Abrahams wrote:
on Sat Jul 30 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
E.g. we were talking about translating concurrent pseudocode to real code. Without sequential consistency it might require inserting memory barriers and it's never clear where, or if some other architecture is going to need different ones.
How is it not better to at least understand that a single piece of code will do what it says in the same order?
Could you show me an example of some single-threaded code that can't be understood on the basis of the C++03 standard alone, and give some examples of the multiple possible interpretations of its semantics?
No. Sorry, I meant that it's difficult or impossible to reason about a multithreaded program unless the sequence of operations in the individual threads is clear. It's still very difficult to reason with sequential consistency but it's better than having to consider the higglety-pigglety reordering optimizations of the compiler and processor. I'm surprised that you don't think it will help - care to shed any light (or darkness)?

on Sat Jul 30 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 30, 2011, at 5:10 PM, Dave Abrahams wrote:
on Sat Jul 30 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
E.g. we were talking about translating concurrent pseudocode to real code. Without sequential consistency it might require inserting memory barriers and it's never clear where, or if some other architecture is going to need different ones.
How is it not better to at least understand that a single piece of code will do what it says in the same order?
Could you show me an example of some single-threaded code that can't be understood on the basis of the C++03 standard alone, and give some examples of the multiple possible interpretations of its semantics?
No. Sorry, I meant that it's difficult or impossible to reason about a multithreaded program unless the sequence of operations in the individual threads is clear.
It's still very difficult to reason with sequential consistency but it's better than having to consider the higglety-pigglety reordering optimizations of the compiler and processor.
I'm surprised that you don't think it will help - care to shed any light (or darkness)?
No program of substantial size can be understood "dynamically." You can't think about code paths and timing and control flow; there are just too many combinations and too much to keep track of. You have to think of static things like preconditions, postconditions, and invariants. That's why purely functional languages are easier to reason about, and why const and value semantics are important in languages with mutation. That's also why it's always seemed to me that thinking in terms of interleavings of instructions doesn't lead to any fundamentally improved ability to understand parallelism: it doesn't change anything from a static point of view. The basic issue is still one of protecting one thread from observing the broken invariants of another. Of course, I could well be missing something. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Jul 30, 2011, at 9:44 PM, Dave Abrahams <dave@boostpro.com> wrote:
No program of substantial size can be understood "dynamically." You can't think about code paths and timing and control flow; there are just too many combinations and too much to keep track of. You have to think of static things like preconditions, postconditions, and invariants. That's why purely functional languages are easier to reason about, and why const and value semantics are important in languages with mutation.
Right, sequential consistency in itself is too low level to help understanding --
That's also why it's always seemed to me that thinking in terms of interleavings of instructions doesn't lead to any fundamentally improved ability to understand parallelism: it doesn't change anything from a static point of view. The basic issue is still one of protecting one thread from observing the broken invariants of another.
-- but without sequential consistency, the compiler may be rearranging instructions in ways that wouldn't affect a single thread (as allowed by c++03), and threads may see each others' actions in the wrong order, so a weak memory model can actually break invariants even if the code is good.

on Sun Jul 31 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 30, 2011, at 9:44 PM, Dave Abrahams <dave@boostpro.com> wrote:
That's also why it's always seemed to me that thinking in terms of interleavings of instructions doesn't lead to any fundamentally improved ability to understand parallelism: it doesn't change anything from a static point of view. The basic issue is still one of protecting one thread from observing the broken invariants of another.
-- but without sequential consistency, the compiler may be rearranging instructions in ways that wouldn't affect a single thread (as allowed by c++03),
Yes.
and threads may see each others' actions in the wrong order, so a weak memory model can actually break invariants even if the code is good.
I don't get it. Please, show me an example. Anyway, what's the "wrong order?" If one thread is depending on the order of things that happen in another thread without both threads using the synchronization necessary to ensure that order, it's a race condition (a.k.a. a bug), right? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Jul 31, 2011, at 1:05 PM, Dave Abrahams wrote:
and threads may see each others' actions in the wrong order, so a weak memory model can actually break invariants even if the code is good.
I don't get it. Please, show me an example.
I am definitely not an expert. I'll quote a diagram from this fascinating paper by Adve and Boehm [1], using Dekker's mutual exclusion algorithm as an example:

    Initially X = Y = 0

    Thread 1:        Thread 2:
    X = 1;           Y = 1;
    r1 = Y;          r2 = X;

    Can r1 = r2 = 0?

And they give various reasons why the processor might reorder the instructions. After all, with each thread considered its own program, there is no dependency between X and Y that should mean it's incorrect to reverse the operations.
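(For reference, the same example written out with C++11 atomics - a sketch, not from the paper. With the default seq_cst ordering the r1 == r2 == 0 outcome is forbidden; with plain ints, or memory_order_relaxed, it is allowed:)

    #include <atomic>
    #include <thread>

    std::atomic<int> X(0), Y(0);
    int r1, r2;

    void thread1() { X.store(1); r1 = Y.load(); } // seq_cst by default
    void thread2() { Y.store(1); r2 = X.load(); }

    int main()
    {
        std::thread t1(thread1), t2(thread2);
        t1.join(); t2.join();
        // under seq_cst, at least one of r1, r2 is 1 here
    }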
Anyway, what's the "wrong order?" If one thread is depending on the order of things that happen in another thread without both threads using the synchronization necessary to ensure that order, it's a race condition (a.k.a. a bug), right?
All of the classic lock-free algorithms, including those that are the basis for the synchronization primitives we should usually use, have data races. At a low level it is sometimes necessary to use a variable for synchronization.

The same paper argues that we should be writing our programs race-free except with specially marked synchronization variables, e.g. Java/C# "volatile" or IIUC C++11 atomics with the default options. "To write Figure 1 correctly under data-race-free, we need simply identify the shared variables X and Y as synchronization variables. This would require the implementation to do whatever is necessary to ensure sequential consistency, in spite of those simultaneous accesses." And then the implementation is allowed to do almost anything it wants with the rest of our code, since we programmers have agreed to keep it race-free.

IMO using variables for synchronization does make for confusing programs and should be reserved only for very tiny components or very simple purposes. Plus we must remind ourselves over and over how expensive atomic operations and the now-implicit memory barriers are. Use Locks.

Gordon (going back to metaprogramming where the syntax is awful but the semantics are clear as water)

[1] http://rsim.cs.illinois.edu/Pubs/10-cacm-memory-models.pdf

[quick post, have some more in my outbox, but i'm at a conference this week, where they cut imap/smtp ports]
Anyway, what's the "wrong order?" If one thread is depending on the order of things that happen in another thread without both threads using the synchronization necessary to ensure that order, it's a race condition (a.k.a. a bug), right?
All of the classic lock-free algorithms, including those that are the basis for the synchronization primitives we should usually use, have data races. At a low level it is sometimes necessary to use a variable for synchronization.
the fifo code for example contains code like:

    atomic<> head_, tail_;

    head  = head_.load();
    tail  = tail_.load();
    head2 = head_.load();

    if (head == head2)
        do something

the correctness of this algorithm depends on the order of the loads: load head before tail and load tail before head2. without a memory model, both the compiler and the CPU are allowed to reorder loads (or head2 could simply use the value of head without actually reloading it)

cheers, tim
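(the same pattern spelled out with C++11 atomics - a sketch with illustrative names and orderings, not the library's actual code. acquire loads may not be reordered with later loads, and the acquire load of tail_ in between keeps head2 from being satisfied by the first load of head_, which is exactly the guarantee the algorithm needs:)

    #include <atomic>

    struct node;
    std::atomic<node*> head_(nullptr), tail_(nullptr);

    bool consistent_snapshot(node *& head, node *& tail)
    {
        head         = head_.load(std::memory_order_acquire);
        tail         = tail_.load(std::memory_order_acquire);
        node * head2 = head_.load(std::memory_order_acquire);

        return head == head2; // head did not move while we read tail
    }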

on Mon Aug 01 2011, Tim Blechmann <tim-AT-klingt.org> wrote:
[quick post, have some more in my outbox, but i'm at a conference this week, where they cut imap/smtp ports]
Anyway, what's the "wrong order?" If one thread is depending on the order of things that happen in another thread without both threads using the synchronization necessary to ensure that order, it's a race condition (a.k.a. a bug), right?
All of the classic lock-free algorithms, including those that are the basis for the synchronization primitives we should usually use, have data races. At a low level it is sometimes necessary to use a variable for synchronization.
the fifo code for example contains code like:
atomic<> head_, tail_;
head = head_.load(); tail = tail_.load();
head2 = head_.load();
if (head == head2) do something
the correctness of this algorithm depends on the order of the loads: load head before tail and load tail before head2. without a memory model, both the compiler and the CPU are allowed to reorder loads (or head2 could simply use the value of head without actually reloading it)
But I never said we don't need a memory model! Of course, we do need rules that allow us to ensure temporal sequencing of operations. All I said was that sequential consistency doesn't seem to make parallelism much easier.

In particular, in C++ we're only talking about sequential consistency for data-race-free programs, which is kind of a funny hedge. It means that your atomics and the code around them as a whole blob respect some plausible interleaving of the instructions in single threads, but it doesn't say anything about the interaction of most variables, which (I hope) are not atomic. In other words, sequential consistency only helps those people already working in the rarified air of atomics and lock-free algorithms, where if you're going to play you'd better have a firm grasp of even lower-level concepts like memory barriers.

OK, I guess this all scales up to those of us using locks... if you don't have sequential consistency among sections using the same mutex, the mutex can hardly be said to be working. So let me try saying what I mean differently: even a hypothetical programming model where *everything* (including non-atomics) is guaranteed to be sequentially consistent still seems insufficient to make parallel programming tractable.

-- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Mon, Aug 1, 2011 at 11:00 AM, Dave Abrahams <dave@boostpro.com> wrote:
OK, I guess this all scales up to those of us using locks... if you don't have sequential consistency among sections using the same mutex, the mutex can hardly be said to be working. So let me try saying what I mean differently: even a hypothetical programming model where *everything* (including non-atomics) is guaranteed to be sequentially consistent still seems insufficient to make parallel programming tractable.
It would mean that we could use our "traditional" reasoning. But yes, as soon as you get more than a couple of variables, it quickly becomes exponentially harder to keep track of things. Even the r1, r2 example, with only 2 shared variables, has an uncomfortable number of possible outcomes to consider.

In lockfree programming you need to look at what is in your data structure, what are the possible states. At any 'atomic sequence point' or 'publication point', which known good state are you in? This is somewhat the same as any program - "what states can you be in, they better all be correct". Lockfree just tends to have more states in a small amount of code, and less (ie 0) "internal blocks" where you can temporarily break the invariants. Your code needs to be somewhat convoluted so that:

1. temporary internal spots exist where invariants can be broken
2. publication/restoration of the invariant state is a single atomic operation

Tony

on Sat Aug 06 2011, Gottlob Frege <gottlobfrege-AT-gmail.com> wrote:
In lockfree programming you need to look at what is in your data structure, what are the possible states. At any 'atomic sequence point' or 'publication point', which known good state are you in? This is somewhat the same as any program - "what states can you be in, they better all be correct". Lockfree just tends to have more states in a small amount of code, and less (ie 0) "internal blocks" where you can temporarily break the invariants. Your code needs to be somewhat convoluted so that:

1. temporary internal spots exist where invariants can be broken
2. publication/restoration of the invariant state is a single atomic operation
This is really not all that different from single-threaded programming. If I do:

    void X::f()
    {
        this->a = 1;  // break invariant
        y->g();
        this->a = 0;  // restore invariant
    }

I had better be sure that y->g() can't observe my broken invariant. The problems are not all that different.

-- Dave Abrahams BoostPro Computing http://www.boostpro.com

on Mon Aug 01 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 31, 2011, at 1:05 PM, Dave Abrahams wrote:
and threads may see each others' actions in the wrong order, so a weak memory model can actually break invariants even if the code is good.
I don't get it. Please, show me an example.
I am definitely not an expert. I'll quote a diagram from this fascinating paper by Adve and Boehm [1], using Dekker's mutual exclusion algorithm as an example
Initially X = Y = 0

Thread 1: X = 1; r1 = Y;
Thread 2: Y = 1; r2 = X;
Can r1 = r2 = 0?
Sure they can; that's a race condition: Thread 1 modifies X while Thread 2 reads it. Anything could happen. Why would anyone expect otherwise?
And they give various reasons why the processor might reorder the instructions. After all, with each thread considered its own program, there is no dependency between X and Y that should mean it's incorrect to reverse the operations.
Exactly.
Anyway, what's the "wrong order?" If one thread is depending on the order of things that happen in another thread without both threads using the synchronization necessary to ensure that order, it's a race condition (a.k.a. a bug), right?
All of the classic lock-free algorithms, including those that are the basis for the synchronization primitives we should usually use, have data races.
IIUC, accesses to atomics are not in and of themselves data races. But it doesn't surprise me a bit to learn (again) that lock-free algorithms are hard to think about.
At a low level it is sometimes necessary to use a variable for synchronization.
Sure, an atomic variable. And along with those come associated memory barrier effects, right?
The same paper argues that we should be writing our programs race-free except with specially marked synchronization variables, e.g. Java/C# "volatile" or IIUC C++11 atomics with the default options.
"To write Figure 1 correctly under data-race-free, we need simply identify the shared variables X and Y as synchronization variables. This would require the implementation to do whatever is necessary to ensure sequential consistency, in spite of those simultaneous accesses."
Meh. Isn't this a bogus example? What is "correctly" supposed to mean? They don't even describe the intended semantics (or you haven't quoted that description).
And then the implementation is allowed to do almost anything it wants with the rest of our code, since we programmers have agreed to keep it race-free.
IMO using variables for synchronization does make for confusing programs and should be reserved only for very tiny components or very simple purposes. Plus we must remind ourselves over and over how expensive atomic operations and the now-implicit memory barriers are. Use Locks.
Exactly. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Aug 1, 2011, at 10:49 AM, Dave Abrahams wrote:
on Mon Aug 01 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
All of the classic lock-free algorithms, including those that are the basis for the synchronization primitives we should usually use, have data races.
IIUC, accesses to atomics are not in and of themselves data races. But it doesn't surprise me a bit to learn (again) that lock-free algorithms are hard to think about.
This is the part of the theory that seems like a cop-out to me. You can declare certain variables to be atomic/for synchronization and then *by definition* it's not a data race? Excuse me? No. It's still a race, but it's one that's condoned because you've told the implementation that you want it to be there. Of limited usefulness maybe. But the brilliant insight is that there is finally a way to tell the implementation which parts of the program order must be obeyed.
At a low level it is sometimes necessary to use a variable for synchronization.
Sure, an atomic variable. And along with those come associated memory barrier effects, right?
Yes. The nice thing about the new system is that you describe the barrier effect that you want for the accesses, with a default of sequential consistency, instead of having plain atomic operations and having to guess where to put the barriers.
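(A small sketch of what that looks like - illustrative, with the non-default orderings spelled out on the accesses themselves:)

    #include <atomic>

    std::atomic<bool> ready(false);
    int payload = 0;

    // writer: publish the data, then release-store the flag
    void produce()
    {
        payload = 42;
        ready.store(true, std::memory_order_release);
    }

    // reader: the acquire load pairs with the release store, so once
    // ready is seen true, the write to payload is visible as well
    int consume()
    {
        while (!ready.load(std::memory_order_acquire))
            ; // spin
        return payload;
    }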
"To write Figure 1 correctly under data-race-free, we need simply identify the shared variables X and Y as synchronization variables. This would require the implementation to do whatever is necessary to ensure sequential consistency, in spite of those simultaneous accesses."
Meh. Isn't this a bogus example? What is "correctly" supposed to mean? They don't even describe the intended semantics (or you haven't quoted that description).
Definitely not bogus. This is the essence of a two-thread busy waiting mutual-exclusion algorithm. If you pry apart pthread locks they may have something like this inside. Each thread is trying to say "you first" by setting X and Y. You only have to add another variable or two and a little logic to actually enter and leave a critical section. Look for Peterson's "Myths about the mutual exclusion problem" for a concise description of a full algorithm (extended to N). (Thanks to Alberto Lerner for the fine references!)
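(For the curious, a hedged sketch of Peterson's two-thread lock with C++11 atomics, using the default seq_cst operations; with weaker orderings or plain ints the algorithm breaks, which is the paper's point:)

    #include <atomic>

    std::atomic<bool> flag[2] = { {false}, {false} }; // "I want in"
    std::atomic<int>  turn(0);                        // who waits

    void lock(int me) // me is 0 or 1
    {
        int other = 1 - me;
        flag[me].store(true); // announce intent
        turn.store(other);    // politely let the other go first
        while (flag[other].load() && turn.load() == other)
            ; // spin until the other thread is out or defers
    }

    void unlock(int me)
    {
        flag[me].store(false);
    }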
IMO using variables for synchronization does make for confusing programs and should be reserved only for very tiny components or very simple purposes. Plus we must remind ourselves over and over how expensive atomic operations and the now-implicit memory barriers are. Use Locks.
Exactly.
C++ is a multi-paradigm language. Just as the world needs people of all different temperaments and abilities, C++ needs to support both really abstract, safe things, and really low-level stuff you'd build an operating system out of. I wouldn't advise metaprogramming for anything that's not a library. I wouldn't advise lock-free for anything beyond the most low-level (and tiny) libraries. But these techniques are priceless when you need them.

There may be a really intuitive programming model for lockfree that isn't well-known yet, a way to describe invariants across many threads. Right now these algorithms rely on many-page-long mathematical proofs. I heard there is a proof that CAS or LL/SC is all you need for any concurrent algorithm. Who knows what the future holds. But most of us should be thinking about how to make our code truly data-race free, with very few atomics.

Cheers, Gordon

on Mon Aug 01 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Aug 1, 2011, at 10:49 AM, Dave Abrahams wrote:
on Mon Aug 01 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
All of the classic lock-free algorithms, including those that are the basis for the synchronization primitives we should usually use, have data races.
IIUC, accesses to atomics are not in and of themselves data races. But it doesn't surprise me a bit to learn (again) that lock-free algorithms are hard to think about.
This is the part of the theory that seems like a cop-out to me. You can declare certain variables to be atomic/for synchronization and then *by definition* it's not a data race? Excuse me? No. It's still a race, but it's one that's condoned because you've told the implementation that you want it to be there.
It's not really different than locking. If you want to write to shared data, you need some way of making it not-a-race. It's just that when the data structure is small enough (like an int) you can make it atomic instead of putting a lock around it. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote: [... memory model ...]
It's not really different than locking. If you want to write to shared data, you need some way of making it not-a-race. It's just that when the data structure is small enough (like an int) you can make it atomic instead of putting a lock around it.
No. See: http://www.cl.cam.ac.uk/~pes20/cppppc/

Note that the proposed MM is still incomplete in that it (currently) does not support atomic RMW operations (load-reserve/store-conditional), which are essential for locking.

regards, alexander.

P.S. I don't like C++11 MM atomics. I think that atomic loads and stores ought to support the following 'modes':

- Whether load/store is competing (default) or not. A competing load means that there might be a concurrent store (to the same object). A competing store means that there might be a concurrent load or store. A non-competing load/store can be performed non-atomically.

- Whether a competing load/store needs remote write atomicity (default is no remote write atomicity). A remote-write-atomicity-yes load triggers undefined behavior in the case of a concurrent remote-write-atomicity-no store.

- Whether a load/store has a specified reordering constraint (default is no constraint specified) in terms of the following reordering modes:
  - Whether preceding loads (in program order) can be reordered across it (can by default).
  - Whether preceding stores (in program order) can be reordered across it (can by default).
  - Whether subsequent loads (in program order) can be reordered across it (can by default). For a load, the set of constrained subsequent loads can be limited to only dependent loads (aka 'consume' mode).
  - Whether subsequent stores (in program order) can be reordered across it (can by default). For a load, there is an implicit reordering constraint regarding dependent stores (no need to specify it).

A fence/barrier operation can be used to specify reordering constraints using basically the same modes.

Re C++11 MM, I'm still missing more fine-grained memory order labels such as in the pseudo-C++ example below. (I mean the mo::noncompeting, mo::ssb/ssb_t (sink store barrier, a release not affecting preceding loads), and mo::slb/slb_t (a release not affecting preceding stores) below, and somesuch for a relaxed acquire.)

    // Introspection (for bool argument below) aside for a moment
    template<typename T, bool copy_ctor_or_dtor_can_mutate_object>
    class mutex_and_condvar_free_single_producer_single_consumer
    {
        typedef isolated< aligned_storage< T > > ELEM;

        size_t m_size;           // > 1
        ELEM * m_elem;           // array of elements, init'ed by ctor
        atomic< ELEM * > m_head; // initially == m_elem
        atomic< ELEM * > m_tail; // initially == m_elem

        ELEM * advance(ELEM * elem) const
        {
            return (++elem < m_elem + m_size) ? elem : m_elem;
        }

    public:
        mutex_and_condvar_free_single_producer_single_consumer();  // ctor
        ~mutex_and_condvar_free_single_producer_single_consumer(); // dtor

        void producer(const T & value)
        {
            ELEM * tail = m_tail.load(mo::noncompeting); // may be nonatomic
            ELEM * next = advance(tail);

            while (next == m_head.load(mo::relaxed))
                usleep(1000);

            new(tail) T(value);          // placement copy ctor (make queued copy)
            m_tail.store(next, mo::ssb); // cheaper than mo::release
        }

        T consumer()
        {
            ELEM * head = m_head.load(mo::noncompeting); // may be nonatomic

            while (head == m_tail.load(mo::consume))
                usleep(1000);

            T value(*head); // T's copy ctor (make a copy to return)
            head->~T();     // T's dtor (cleanup for queued copy)

            m_head.store(advance(head),
                         type_list< mo::slb_t, mo::rel_t >::
                             element<copy_ctor_or_dtor_can_mutate_object>::type());

            return value; // return copied T
        }
    };

Note also that, since the example above presumes that no more than one thread reads from the relevant atomic locations while they are written concurrently, there is definitely no need to pay the price of remote write atomicity even if it is run on a 3+ way multiprocessor...
IOW, hwsync is unneeded even if all the mo::* above are changed to SC... but the upcoming C++11 MM doesn't allow one to express no-need-for-remote-write-atomicity for SC atomics.

on Tue Aug 23 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote:
[... memory model ...]
It's not really different than locking. If you want to write to shared data, you need some way of making it not-a-race. It's just that when the data structure is small enough (like an int) you can make it atomic instead of putting a lock around it.
No.
All I'm saying here is that a C++11 default atomic int is equivalent to an int, coupled with an associated mutex, where accesses to the int are always protected by locking the associated mutex. If you're seriously disagreeing with *that*, please say explicitly on what grounds. If you're not disagreeing with that, we have no disagreement and you're just grinding some irrelevant axe. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote:
on Tue Aug 23 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote:
[... memory model ...]
It's not really different than locking. If you want to write to shared data, you need some way of making it not-a-race. It's just that when the data structure is small enough (like an int) you can make it atomic instead of putting a lock around it.
No.
All I'm saying here is that a C++11 default atomic int is equivalent to an int, coupled with an associated mutex, where accesses to the int are always protected by locking the associated mutex. If you're seriously disagreeing with *that*, please say explicitly on what grounds.
int i; mutex mi;
int j; mutex mj;

mi.lock(); i = 1; mi.unlock();
mj.lock(); j = 2; mj.unlock();

can be transformed to

multi_lock(mi, mj); // deadlock free
j = 2;
i = 1;
mi.unlock();
mj.unlock();

and thus result in reordering i = 1 and j = 2. With C++11 default atomics (SC) for i and j such reordering is prohibited. regards, alexander.
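(In standard C++11 terms, std::lock plays the role of the multi_lock above; a compilable sketch of the transformed form, with all names illustrative:)

    #include <mutex>

    int i = 0, j = 0;
    std::mutex mi, mj;

    void transformed() {
        std::lock(mi, mj); // deadlock-free multi-lock
        j = 2;             // the two protected stores can now be
        i = 1;             // emitted in either order
        mi.unlock();
        mj.unlock();
    }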

Alexander Terekhov wrote:
Dave Abrahams wrote:
on Tue Aug 23 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote:
[... memory model ...]
It's not really different than locking. If you want to write to shared data, you need some way of making it not-a-race. It's just that when the data structure is small enough (like an int) you can make it atomic instead of putting a lock around it.
No.
All I'm saying here is that a C++11 default atomic int is equivalent to an int, coupled with an associated mutex, where accesses to the int are always protected by locking the associated mutex. If you're seriously disagreeing with *that*, please say explicitly on what grounds.
int i; mutex mi; int j; mutex mj;
mi.lock(); i = 1; mi.unlock();
mj.lock(); j = 2; mj.unlock();
can be transformed to
multi_lock(mi, mj); // deadlock free
j = 2;
i = 1;
mi.unlock();
mj.unlock();
and thus result in reordering i = 1 and j = 2.
With C++11 default atomics (SC) for i and j such reordering is prohibited.
See also: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html regards, alexander.

on Wed Aug 24 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote:
All I'm saying here is that a C++11 default atomic int is equivalent to an int, coupled with an associated mutex, where accesses to the int are always protected by locking the associated mutex. If you're seriously disagreeing with *that*, please say explicitly on what grounds.
int i; mutex mi; int j; mutex mj;
mi.lock(); i = 1; mi.unlock();
mj.lock(); j = 2; mj.unlock();
can be transformed to
multi_lock(mi, mj); // deadlock free
j = 2;
i = 1;
mi.unlock();
mj.unlock();
and thus result in reordering i = 1 and j = 2.
With C++11 default atomics (SC) for i and j such reordering is prohibited.
As it is with C++11 mutexes, IIUC. Another thread that locks the same mutexes is not allowed to observe the write to j before the write to i. If you have contrary evidence, please point to it. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote: [...]
mi.lock(); i = 1; mi.unlock();
mj.lock(); j = 2; mj.unlock();
can be transformed to
multi_lock(mi, mj); // deadlock free
j = 2;
i = 1;
mi.unlock();
mj.unlock();
and thus result in reordering i = 1 and j = 2.
With C++11 default atomics (SC) for i and j such reordering is prohibited.
As it is with C++11 mutexes, IIUC.
Mutexes need only *acquire-release* semantics coupled with RMW(s) to result in *SC* for associated race-free data, as observed by other thread(s). C++11 default atomics are *SC* without mutexes (lockfree and all that stuff)... *SC* is more expensive than *acquire-release* in terms of sync on platforms such as:
If you have contrary evidence, please point to it.
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html (PowerPC)

I mean:

Load SC: hwsync; ld; cmp; bc; isync
Store SC: hwsync; st

alternative mappings:

Load SC: ld; hwsync
Store SC: lwsync; st; hwsync

Note that *acquire-release* does NOT need *hw*sync:

Load Acquire: ld; cmp; bc; isync
Store Release: lwsync; st

regards, alexander.
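(A sketch of what this difference looks like from the C++11 side; per the mapping table cited above, only the seq_cst load pays for a hwsync on POWER. Names are illustrative:)

    #include <atomic>

    std::atomic<int> flag(0);

    // per the cited PowerPC mappings: hwsync; ld; cmp; bc; isync
    int load_sc() { return flag.load(std::memory_order_seq_cst); }

    // per the cited PowerPC mappings: ld; cmp; bc; isync -- no hwsync
    int load_acquire() { return flag.load(std::memory_order_acquire); }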

on Wed Aug 24 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote: [...]
mi.lock(); i = 1; mi.unlock();
mj.lock(); j = 2; mj.unlock();
can be transformed to
multi_lock(mi, mj); // deadlock free
j = 2;
i = 1;
mi.unlock();
mj.unlock();
and thus result in reordering i = 1 and j = 2.
With C++11 default atomics (SC) for i and j such reordering is prohibited.
As it is with C++11 mutexes, IIUC.
Mutexes need only *acquire-release* semantics coupled with RMW(s) to result in *SC* for associated race-free data, as observed by other thread(s).
I think I know that...
C++11 default atomics are *SC* without mutexes (lockfree and all that stuff)... *SC* is more expensive than *acquire-release* in terms of sync
...and I think I know that...
on platforms such as:
If you have contrary evidence, please point to it.
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html (PowerPC)
I mean:
Load SC: hwsync; ld; cmp; bc; isync
Store SC: hwsync; st
alternative mappings:
Load SC: ld; hwsync
Store SC: lwsync; st; hwsync
Note that *acquire-release* does NOT need *hw*sync:
Load Acquire: ld; cmp; bc; isync
Store Release: lwsync; st
I'm pretty sure I didn't know that part, but I don't yet see what it has to do with your assertion. IIUC, using mutexes in the implementation of C++0x atomics has always been considered an available choice for implementors. If that reordering were allowed for mutexes, wouldn't that weaken the semantics of atomics? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote: [...]
IIUC, using mutexes in the implementation of C++0x atomics has always been considered an available choice for implementors. ...
Lock-free atomics cannot be implemented using mutexes. Lock-free SC atomics need a heavy-weight 'hwsync' on PowerPC/PPC (mutexes don't need 'hwsync'). Nobody needs atomics implemented using mutexes, because locking can be done more efficiently at a higher level. regards, alexander.

on Fri Aug 26 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote: [...]
IIUC, using mutexes in the implementation of C++0x atomics has always been considered an available choice for implementors. ...
Lock-free atomics cannot be implemented using mutexes. Lock-free SC atomics need a heavy-weight 'hwsync' on PowerPC/PPC (mutexes don't need 'hwsync').
Thanks for educating me.
Nobody needs atomics implemented using mutexes, because locking can be done more efficiently at a higher level.
Well, the idea is to allow code using atomics to run at all (if suboptimally) on platforms that don't support hardware atomics for the data type in question. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Fri, Aug 26, 2011 at 2:26 PM, Dave Abrahams <dave@boostpro.com> wrote:
on Fri Aug 26 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote: [...]
IIUC, using mutexes in the implementation of C++0x atomics has always been considered an available choice for implementors. ...
Lock-free atomics cannot be implemented using mutexes. Lock-free SC atomics need a heavy-weight 'hwsync' on PowerPC/PPC (mutexes don't need 'hwsync').
Thanks for educating me.
Nobody needs atomics implemented using mutexes, because locking can be done more efficiently at a higher level.
Well, the idea is to allow code using atomics to run at all (if suboptimally) on platforms that don't support hardware atomics for the data type in question.
-- Dave Abrahams BoostPro Computing http://www.boostpro.com
I think you are trying to highlight an important point here, and I agree with you (that it is important and needs to be looked into - not that you said that specifically...). The idea in the standard is that atomic<Foo> foo; is supposed to work for any Foo, I think. If Foo is too big to be handled atomically by the CPU, then a mutex is to be used inside the atomic<>. And that should have the same semantics as atomic<int>. So my first question is: is the above true? Second is whether that is a problem. Note that in:
mi.lock(); i = 1; mi.unlock();
mj.lock(); j = 2; mj.unlock();
can be transformed to
multi_lock(mi, mj); // deadlock free
j = 2;
i = 1;
mi.unlock();
mj.unlock();
There is no "synchronizes with" (standard's wording) because synch-with requires thread 2 to *read* the value written into the atomic by thread 1. We have no reads here - no if statement - so reordering doesn't matter. I don't think the standard disallows the reordering of i and j, even if they are atomics. Tony

on Sat Aug 27 2011, Gottlob Frege <gottlobfrege-AT-gmail.com> wrote:
There is no "synchronizes with" (standard's wording) because synch-with requires thread 2 to *read* the value written into the atomic by thread 1. We have no reads here - no if statement - so reordering doesn't matter. I don't think the standard disallows the reordering of i and j, even if they are atomics.
IIUC from the point of view of programs without undefined behavior, the standard disallows reordering in both cases. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Sat, Aug 27, 2011 at 1:52 PM, Dave Abrahams <dave@boostpro.com> wrote:
on Sat Aug 27 2011, Gottlob Frege <gottlobfrege-AT-gmail.com> wrote:
There is no "synchronizes with" (standard's wording) because synch-with requires thread 2 to *read* the value written into the atomic by thread 1. We have no reads here - no if statement - so reordering doesn't matter. I don't think the standard disallows the reordering of i and j, even if they are atomics.
IIUC from the point of view of programs without undefined behavior, the standard disallows reordering in both cases.
-- Dave Abrahams BoostPro Computing http://www.boostpro.com
But without a read, there is no visible behaviour. There is probably no such thing as "reordering" in the standard. Just whether a second thread reads what a first thread wrote, and if it does, what can be assumed by that (you can assume synch-with, if you follow the rules). I guess I haven't read very closely the SC parts of the standard, where it talks about data races. I concentrated on the synch-with parts. Maybe we need a better example? Tony

on Sat Aug 27 2011, Gottlob Frege <gottlobfrege-AT-gmail.com> wrote:
But without a read, there is no visible behaviour.
Right. Surely it's possible to read in this scenario without introducing a data race?
There is probably no such thing as "reordering" in the standard.
Right.
I guess I haven't read very closely the SC parts of the standard, where it talks about data races. I concentrated on the synch-with parts.
Maybe we need a better example?
In order to illustrate/understand what? I confess to being a little unsure what you want to achieve. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Sun, Aug 28, 2011 at 2:42 PM, Dave Abrahams <dave@boostpro.com> wrote:
on Sat Aug 27 2011, Gottlob Frege <gottlobfrege-AT-gmail.com> wrote:
But without a read, there is no visible behaviour.
Right. Surely it's possible to read in this scenario without introducing a data race?
There is probably no such thing as "reordering" in the standard.
Right.
I guess I haven't read very closely the SC parts of the standard, where it talks about data races. I concentrated on the synch-with parts.
Maybe we need a better example?
In order to illustrate/understand what? I confess to being a little unsure what you want to achieve.
Good question. I think the question at hand (or in my mind) is whether atomic<Foo>, implemented with an internal mutex, is valid. Given

A) the reordering of

m1.lock(); x1++; m1.unlock();
m2.lock(); x2++; m2.unlock();

to

m1.lock(); m2.lock(); x2++; x1++; m1.unlock(); m2.unlock();

and

B) that atomics are supposed to be "SC", where maybe the above reordering doesn't seem possible (if x1 and x2 were atomics implemented with locks),

there seems to be a disconnect. I'm fine with the reordering, and I tend not to think about SC, so I'm not too concerned. I'm also unsure if the standard actually says "atomics are SC" or something equivalent, or something only close to SC. But there may be something worth understanding better here. To understand it with "synch-with" (the part of the standard I'm more comfortable with) we need a read/if-statement example. Tony

P.S. Yes, I should now add an example here. But I haven't been feeling well, so it's hard to think clearly for long enough. Particularly about threading, which requires a clear head.
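(On the first question: the standard does let an implementation fall back to an internal lock, and it exposes the choice via is_lock_free(). A small sketch, with Foo as a hypothetical "too big" type:)

    #include <atomic>
    #include <cstdio>

    struct Foo { char payload[64]; }; // hypothetical type, too big for hardware atomics

    int main() {
        std::atomic<int> a;
        std::atomic<Foo> f; // typically implemented with an internal lock
        std::printf("atomic<int> lock-free: %d\n", (int) a.is_lock_free());
        std::printf("atomic<Foo> lock-free: %d\n", (int) f.is_lock_free());
    }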

Gottlob Frege wrote:
Good question. I think the question at hand (or in my mind) is whether atomic<Foo>, implemented with an internal mutex, is valid.
It is.
Given
A) the reordering of
m1.lock(); x1++; m1.unlock();
m2.lock(); x2++; m2.unlock();
to
m1.lock(); m2.lock(); x2++; x1++; m1.unlock(); m2.unlock();
This reordering is invalid, by the way, which is why Alexander has been careful to use "multi_lock". If you add a second example that locks/unlocks m2 first, then m1, and reorder it in the same way, the reordered example can deadlock, and the original cannot.

It somewhat depends on what you mean by reordering though. It's probably true that an observer thread can see the memory effects of the operations in the order m1.lock/m2.lock/x2++/x1++ by using acquire loads (or maybe try_lock, assuming that it doesn't fail spuriously, which it's allowed to do) on the control variables of m1 and m2 and on x1 and x2, but this is not the same as reordering on thread 1's side, whether by the compiler or by the hardware.

Either way, the C++MM doesn't "think" in terms of reorderings, it specifies behavior of multithreaded programs. If you really think that atomic<Foo>, implemented with a mutex, is not sequentially consistent, there must exist an example (a multithreaded program) containing atomic<Foo> for which the memory model allows an outcome which is not SC.

On Tue, Aug 30, 2011 at 4:56 AM, Peter Dimov <pdimov@pdimov.com> wrote:
Gottlob Frege wrote:
Good question. I think the question at hand (or in my mind) is whether atomic<Foo>, implemented with an internal mutex, is valid.
It is.
Given
A) the reordering of
m1.lock(); x1++; m1.unlock();
m2.lock(); x2++; m2.unlock();
to
m1.lock(); m2.lock(); x2++; x1++; m1.unlock(); m2.unlock();
This reordering is invalid, by the way, which is why Alexander has been careful to use "multi_lock". If you add a second example that locks/unlocks m2 first, then m1, and reorder it in the same way, the reordered example can deadlock, and the original cannot.

It somewhat depends on what you mean by reordering though. It's probably true that an observer thread can see the memory effects of the operations in the order m1.lock/m2.lock/x2++/x1++ by using acquire loads (or maybe try_lock, assuming that it doesn't fail spuriously, which it's allowed to do) on the control variables of m1 and m2 and on x1 and x2, but this is not the same as reordering on thread 1's side, whether by the compiler or by the hardware.
Sorry, I didn't mean to disappear for a month just because I was wrong :-) (I did mention in that email that I wasn't feeling well. Still not 100%, so you are forewarned!)

I was overlapping the "critical sections" that way and wondering if that was allowed, based on, for example, Herb Sutter's http://drdobbs.com/cpp/201804238 and the usual descriptions of code floating into a critical section, but not out. (Of course I missed his footnote mentioning that maybe the acquire/release of 2 blocks couldn't overlap.) Of course, the reordering happens in the data reads/writes, so it is not the lock() and potential wait-for-lock-to-be-free that would be reordered but, as you mentioned, the memory effects. I'm sure that's what I meant :-)

The part that surprised me was that the effects of x2++ could be *seen* before x1++, but of course this could only be seen by another thread that did NOT use the locks, but instead read x2,x1 directly. Which means they are no longer atomics-implemented-with-locks. So, in the normal case, when thread 2 tries to read x2, it first grabs the m2 lock, which sets up the synchronize-with with the m2.unlock() in thread 1, meaning *everything* before that is seen first - both x2++ and x1++. Which one "happened" first doesn't matter; they can't be seen before m2.lock(), at which point they are both visible.

OK, I'm fine now with atomics-via-individual-locks. I was worried for a second there... Tony
Either way, the C++MM doesn't "think" in terms of reorderings, it specifies behavior of multithreaded programs. If you really think that atomic<Foo>, implemented with a mutex, is not sequentially consistent, there must exist an example (a multithreaded program) containing atomic<Foo> for which the memory model allows an outcome which is not SC.
P.S. I found thinking of atomics-by-locks hard wrt "sync-with". I had to imagine an atomic<BigObject> which is implemented with a lock which is in turn implemented with an atomic<int>. ie I'm too much in the habit of thinking about atomics to be able to think of locks in any other way than their atomic implementation, so it was hard to avoid the recursion :-)

Dave Abrahams wrote:
on Sat Aug 27 2011, Gottlob Frege <gottlobfrege-AT-gmail.com> wrote:
There is no "synchronizes with" (standard's wording) because synch-with requires thread 2 to *read* the value written into the atomic by thread 1. We have no reads here - no if statement - so reordering doesn't matter. I don't think the standard disallows the reordering of i and j, even if they are atomics.
IIUC from the point of view of programs without undefined behavior, the standard disallows reordering in both cases.
The standard prescribes UB in the case of lock-free non-atomic observer loads (the program is well-defined if relaxed lock-free atomic stores and loads are used to observe the reordering of the two stores). In reality, the generated code is the same, irrespective of whether the code uses relaxed atomic stores/loads or ordinary stores/loads. The standard allows the reordering in both cases; the only difference is UB imposed by fiat in one case and no UB in the other. regards, alexander.
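(A sketch of Alexander's point: replace the plain ints with relaxed atomics and the observing thread becomes well-defined, yet may still see the stores out of order. Names are illustrative:)

    #include <atomic>

    std::atomic<int> i(0), j(0);

    void writer() {
        i.store(1, std::memory_order_relaxed);
        j.store(2, std::memory_order_relaxed);
    }

    void observer() {
        int b = j.load(std::memory_order_relaxed);
        int a = i.load(std::memory_order_relaxed);
        // b == 2 && a == 0 is an allowed, well-defined outcome:
        // relaxed ordering permits observing the stores reordered, with no UB.
        (void) a; (void) b;
    }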

On Wed, Aug 24, 2011 at 4:27 PM, Dave Abrahams <dave@boostpro.com> wrote:
on Wed Aug 24 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote:
All I'm saying here is that a C++11 default atomic int is equivalent to an int, coupled with an associated mutex, where accesses to the int are always protected by locking the associated mutex. If you're seriously disagreeing with *that*, please say explicitly on what grounds.
int i; mutex mi; int j; mutex mj;
mi.lock(); i = 1; mi.unlock();
mj.lock(); j = 2; mj.unlock();
can be transformed to
multi_lock(mi, mj); // deadlock free
j = 2;
i = 1;
mi.unlock();
mj.unlock();
and thus result in reordering i = 1 and j = 2.
With C++11 default atomics (SC) for i and j such reordering is prohibited.
As it is with C++11 mutexes, IIUC. Another thread that locks the same mutexes is not allowed to observe the write to j before the write to i. If you have contrary evidence, please point to it.
FWIW, I agree with Alexander here. It is also a case I've thought about in the past. I'm not exactly sure about C++ mutexes, but typically mutexes have acquire on entry/lock and release on exit/unlock, which allows the lock regions to overlap. I would hope C++ mutexes match the semantics of posix, etc, mutexes.

I think from the C++11 standard's perspective, it comes down to the definition of "happens-before", and the lack of a good single synchronization point (ie mi and mj don't synchronize with each other). But I'd need to read that section again closely to be able to prove how it works. Instead I'm relying on my understanding of other mutexes and critical sections (basically, code can drift in, but not drift out). And my understanding has been that C++ is the same.

C++ atomics (with default SC) are more like having all atomics use *the same mutex*, not one mutex per atomic. I'm not sure it is exactly the same, ie whether it is bidirectional, but you do need something stronger than one-mutex-per-atomic.

In relation to Alexander's comments in general, and the C++11 MM in general:

1. I wish that C++11 atomics defaulted to acquire-on-read and release-on-write, and that's how I will almost always use them. It is also what java and C# use, I believe (on volatiles).

2. They could have gone with more fine-grained details and options, but I think it is complicated enough. If you really want, you can still always resort to asm. Also, I suspect they would always be adding new options (write-through cache vs blah blah blah), which blows away any sense of an abstract model.

3. I didn't go to the meetings and voice my concerns, or even email my national rep (well, maybe once), so I'm not complaining.

Tony

[Gottlob Frege]
1. I wish that C++11 atomics defaulted to acquire-on-read and release-on-write, and that's how I will almost always use them. It is also what java and C# use, I believe (on volatiles).
That doesn't solve the "independent reads, independent writes" problem, which is why the non-Standard semantics that VC8 gave volatile aren't really useful. (volatile: a multithreaded programmer's worst enemy.) The Standard's focus on sequential consistency by default was carefully thought out. Stephan T. Lavavej Visual C++ Libraries Developer

On Thu, Aug 25, 2011 at 12:13 AM, Stephan T. Lavavej <stl@exchange.microsoft.com> wrote:
[Gottlob Frege]
1. I wish that C++11 atomics defaulted to acquire-on-read and release-on-write, and that's how I will almost always use them. It is also what java and C# use, I believe (on volatiles).
That doesn't solve the "independent reads, independent writes" problem, which is why the non-Standard semantics that VC8 gave volatile aren't really useful. (volatile: a multithreaded programmer's worst enemy.)
What's the "independent reads, independent writes" problem? I've probably heard, seen, or mistakenly had a bug due to it once, but don't know it by name.
The Standard's focus on sequential consistency by default was carefully thought out.
I find that most lockfree development uses acquire/release, not SC, so if the focus was to standardize existing patterns (at least in lockfree), then in that sense it seems odd. But I do believe it was carefully thought out, and by people who know better than I. I have a slight worry that SC was chosen so that it was easier to understand in general, instead of for those who will probably use it most, but again, I wasn't there. And really, I think that the memory ordering of an atomic operation is important enough that I might argue it should always be specified, so I don't at all mind that I need to be explicit to use acquire/release. Tony

Gottlob Frege wrote:
On Thu, Aug 25, 2011 at 12:13 AM, Stephan T. Lavavej <stl@exchange.microsoft.com> wrote:
[Gottlob Frege]
1. I wish that C++11 atomics defaulted to acquire-on-read and release-on-write, and that's how I will almost always use them. It is also what java and C# use, I believe (on volatiles).
That doesn't solve the "independent reads, independent writes" problem, which is why the non-Standard semantics that VC8 gave volatile aren't really useful. (volatile: a multithreaded programmer's worst enemy.)
What's the "independent reads, independent writes" problem? I've probably heard, seen, or mistakenly had a bug due to it once, but don't know it by name.
IRIW is an example that highlights remote write atomicity:

T1: X = 1;
T2: Y = 1;
T3: A = X; B = Y;
T4: C = Y; D = X;

A == 1 && B == 0 && C == 1 && D == 0 is possible without remote write atomicity.

Another example is

T1: X = 1;
T2: if (X) Y = 1;
T3: A = Y; B = X;

A == 1 && B == 0 is possible without remote write atomicity.

SC is basically "no local reordering" plus remote write atomicity. In the 'modes' model, remote write atomicity is a separate mode apart from reordering.

regards, alexander.
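(The same IRIW shape in C++11, as a sketch; with the default seq_cst ordering the outcome above is forbidden, which is exactly the remote-write-atomicity guarantee that acquire/release alone would not give on some hardware. Names are illustrative:)

    #include <atomic>

    std::atomic<int> X(0), Y(0);
    int A, B, C, D;

    void t1() { X.store(1); } // seq_cst by default
    void t2() { Y.store(1); }
    void t3() { A = X.load(); B = Y.load(); }
    void t4() { C = Y.load(); D = X.load(); }

    // Run t1..t4 concurrently: A == 1 && B == 0 && C == 1 && D == 0
    // is forbidden under seq_cst, but would be possible under
    // acquire/release on hardware without remote write atomicity.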

On Thu, Aug 25, 2011 at 7:38 AM, Alexander Terekhov <terekhov@web.de> wrote:
Gottlob Frege wrote:
What's the "independent reads, independent writes" problem? I've probably heard, seen, or mistakenly had a bug due to it once, but don't know it by name.
IRIW is an example that highlights remote write atomicity:
T1: X = 1; T2: Y = 1; T3: A = X; B = Y; T4: C = Y; D = X;
A == 1 && B == 0 && C == 1 && D == 0 is possible without remote write atomicity.
I would not be surprised by those resultant values.
Another example is
T1: X = 1; T2: if (X) Y = 1; T3: A = Y; B = X;
A == 1 && B == 0 is possible without remote write atomicity.
I'm comfortable with that as well.
SC is basically "no local reordering" plus remote write atomicity.
In 'modes' model, remote write atomicity is a separate mode apart from reordering.
regards, alexander.
Thanks Tony

Gottlob Frege wrote:
On Thu, Aug 25, 2011 at 12:13 AM, Stephan T. Lavavej <stl@exchange.microsoft.com> wrote:
[Gottlob Frege]
1. I wish that C++11 atomics defaulted to acquire-on-read and release-on-write, and that's how I will almost always use them. It is also what java and C# use, I believe (on volatiles).
That doesn't solve the "independent reads, independent writes" problem, which is why the non-Standard semantics that VC8 gave volatile aren't really useful. (volatile: a multithreaded programmer's worst enemy.)
What's the "independent reads, independent writes" problem? I've probably heard, seen, or mistakenly had a bug due to it once, but don't know it by name.
The Standard's focus on sequential consistency by default was carefully thought out.
I find that most lockfree development uses acquire/release, not SC, so
Note also that according to http://msdn.microsoft.com/en-us/library/ms235435(v=vs.90).aspx (Supported Platforms (Visual C++)), VC8 targets only x86/x64 and Itanium.

After Intel clarified the x86/x64 MM back in 2008, we know that the x86/x64 MM is basically TSO, that is, acquire/release + remote write atomicity. The Itanium MM explicitly defines WB release stores to have remote write atomicity: http://download.intel.com/design/itanium/downloads/25142901.pdf (3.3.7.1 Total Ordering of WB Releases)

So IRIW is not really a problem for VC8 acquire/release, because on VC8-supported platforms remote write atomicity is guaranteed for acquire/release. regards, alexander.

"Stephan T. Lavavej" wrote:
[Alexander Terekhov]
VC8 targets only x86/x64 and Itanium.
VC changes its targeted platforms from time to time. For example, Itanium will not be supported by VC11.
We have learned that granting non-Standard semantics to volatile for all platforms for all time was a mistake.
At the time of VC8, there were no standard semantics for atomics (apart from quasi-standard Java's SC volatiles). I agree that overloading C/C++ volatile with acquire/release was a mistake, but I don't think that VC8's choice of acquire/release for atomics was problematic (apart from a better, more relaxed 'modes' model being preferable), especially given that VC8 targeted only x86/x64 and Itanium (thus making IRIW irrelevant). regards, alexander.

[Gottlob Frege]
What's the "independent reads, independent writes" problem? I've probably heard, seen, or mistakenly had a bug due to it once, but don't know it by name.
Here's one explanation: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2177.html (I am not an expert here, but I try to be aware of what I don't know.) STL

"Stephan T. Lavavej" wrote:
[Gottlob Frege]
What's the "independent reads, independent writes" problem? I've probably heard, seen, or mistakenly had a bug due to it once, but don't know it by name.
Here's one explanation: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2177.html
Uhmm "...must not result in r1 = r3 = 1 and r1 = r3 = 0" sounds like a typo (the second r1 = r3 = 1 is just copy&paste lacking subsequent edits regarding numbers). ...must not result in r1 = r3 = 1 and r2 = r4 = 0 is what is meant, I suppose. regards, alexander.

On Thu, 25 Aug 2011 02:27:11 +0600, Dave Abrahams <dave@boostpro.com> wrote:
on Wed Aug 24 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote:
All I'm saying here is that a C++11 default atomic int is equivalent to an int, coupled with an associated mutex, where accesses to the int are always protected by locking the associated mutex. If you're seriously disagreeing with *that*, please say explicitly on what grounds.
int i; mutex mi; int j; mutex mj;
mi.lock(); i = 1; mi.unlock();
mj.lock(); j = 2; mj.unlock();
can be transformed to
multi_lock(mi, mj); // deadlock free
j = 2;
i = 1;
mi.unlock();
mj.unlock();
and thus result in reordering i = 1 and j = 2.
With C++11 default atomics (SC) for i and j such reordering is prohibited.
As it is with C++11 mutexes, IIUC. Another thread that locks the same mutexes is not allowed to observe the write to j before the write to i. If you have contrary evidence, please point to it.
AFAIK, in the example above another thread can't observe reordering without introducing UB. Here is a more correct example:

int x = 0; int y = 0;
int r1, r2;
mutex mx, my;

// Thread 1:
mx.lock(); x = 1; mx.unlock();
my.lock(); r1 = y; my.unlock();

// Thread 2:
my.lock(); y = 1; my.unlock();
mx.lock(); r2 = x; mx.unlock();

Now, it is possible that r1 == r2 == 0, which is not true for the original example at http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html#introexamp...

Thread 1: x = 1; r1 = y;
Thread 2: y = 1; r2 = x;

Ilya Sokolov wrote:
Here is a more correct example:
int x = 0; int y = 0; int r1, r2; mutex mx, my;
// Thread 1:
mx.lock(); x = 1; mx.unlock();
my.lock(); r1 = y; my.unlock();
// Thread 2:
my.lock(); y = 1; my.unlock();
mx.lock(); r2 = x; mx.unlock();
Now, it is possible that r1 == r2 == 0, ...
I don't think that it is, no matter how you reorder things. Hans Boehm has a proof that race-free programs containing only lock/unlock are sequentially consistent: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2392.html I haven't looked at it in detail though. :-)

Ilya Sokolov wrote:
On Thu, 25 Aug 2011 02:27:11 +0600, Dave Abrahams <dave@boostpro.com> wrote:
on Wed Aug 24 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Dave Abrahams wrote:
All I'm saying here is that a C++11 default atomic int is equivalent to an int, coupled with an associated mutex, where accesses to the int are always protected by locking the associated mutex. If you're seriously disagreeing with *that*, please say explicitly on what grounds.
int i; mutex mi; int j; mutex mj;
mi.lock(); i = 1; mi.unlock();
mj.lock(); j = 2; mj.unlock();
can be transformed to
multi_lock(mi, mj); // deadlock free
j = 2;
i = 1;
mi.unlock();
mj.unlock();
and thus result in reordering i = 1 and j = 2.
With C++11 default atomics (SC) for i and j such reordering is prohibited.
As it is with C++11 mutexes, IIUC. Another thread that locks the same mutexes is not allowed to observe the write to j before the write to i. If you have contrary evidence, please point to it.
AFAIK, in the example above another thread can't observe reordering without introducing UB.
I meant "observed" as in "check the sequence of writes in the generated code". But int can be simply changed to relaxed atomic<int> so that another thread can observe reordering without UB.
Here is a more correct example:
int x = 0; int y = 0; int r1, r2; mutex mx, my;
// Thread 1:
mx.lock(); x = 1; mx.unlock();
my.lock(); r1 = y; my.unlock();
// Thread 2:
my.lock(); y = 1; my.unlock();
mx.lock(); r2 = x; mx.unlock();
Now, it is possible that r1 == r2 == 0, ...
Nope, it is impossible. Same as with

// Thread 1:
multi_lock(mx, my); // deadlock free
r1 = y; x = 1;
mx.unlock(); my.unlock();

// Thread 2:
multi_lock(my, mx); // deadlock free
r2 = x; y = 1;
my.unlock(); mx.unlock();

regards, alexander.

On Thu, 25 Aug 2011 18:19:48 +0600, Alexander Terekhov <terekhov@web.de> wrote:>
Ilya Sokolov wrote:
int x = 0; int y = 0; int r1, r2; mutex mx, my;
// Thread 1:
mx.lock(); x = 1; mx.unlock();
my.lock(); r1 = y; my.unlock();
// Thread 2:
my.lock(); y = 1; my.unlock();
mx.lock(); r2 = x; mx.unlock();
Now, it is possible that r1 == r2 == 0, ...
Nope, it is impossible. Same as with
// Thread 1:
multi_lock(mx, my); // deadlock free
r1 = y; x = 1;
mx.unlock(); my.unlock();
// Thread 2:
multi_lock(my, mx); // deadlock free
r2 = x; y = 1;
my.unlock(); mx.unlock();
Doh! Should have done that myself. Sorry for the noise.

on Thu Aug 25 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Ilya Sokolov wrote:
AFAIK, in the example above another thread can't observe reordering without introducing UB.
I meant "observed" as in "check the sequence of writes in the generated code".
Do we care about what the generated code looks like? IIUC what counts is the sequence of memory states that another thread can see.
But int can be simply changed to relaxed atomic<int> so that another thread can observe reordering without UB.
Sure. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote:
on Thu Aug 25 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Ilya Sokolov wrote:
AFAIK, in the example above another thread can't observe reordering without introducing UB.
I meant "observed" as in "check the sequence of writes in the generated code".
Do we care about what the generated code looks like? IIUC what counts is the sequence of memory states that another thread can see.
But int can be simply changed to relaxed atomic<int> so that another thread can observe reordering without UB.
Sure.
I meant that, for the example upthread with an int instead of a relaxed atomic<int>, another thread's code triggering UB does not prevent us from observing the reordered writes in the writing thread's generated code, simply by looking at it. Changing int to relaxed atomic<int> merely removes the UB but does not change the relevant code in the writing thread. regards, alexander.

On Mon, Aug 1, 2011 at 10:49 AM, Dave Abrahams <dave@boostpro.com> wrote:
on Mon Aug 01 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 31, 2011, at 1:05 PM, Dave Abrahams wrote:
I don't get it. Please, show me an example.
I am definitely not an expert. I'll quote a diagram from this fascinating paper by Adve and Boehm [1], using Dekker's mutual exclusion algorithm as an example
Initially X = Y = 0

Thread 1: X = 1; r1 = Y;
Thread 2: Y = 1; r2 = X;
Can r1 = r2 = 0?
Sure they can, that's a race condition: Thread 1 modifies X while thread 2 reads it. Anything could happen. Why would anyone expect otherwise?
Most people are surprised that *both* r1 and r2 can be 0 at the same time. We have been trained that our code (single threaded) happens sequentially. So

Y = 1; r2 = X;

means "set Y **THEN** read X" - *in that order*. Now, really, we don't know or care about the true order (from the point of view of this CPU) until we later read it (from this CPU), but that's how we imagine it. In the single-threaded world we grew up in.

So imagine the race you mention - thread1 is writing X while thread2 is reading it. In the traditional view - ie assuming a thread of code happens in the same order that it was written - this means:

A. thread1 has not yet read Y (thread1 is currently writing X, and reading Y happens next)
B. thread2 has finished setting Y (because it is currently reading X, which happens after setting Y)
----
C. therefore thread1 *must* read Y after it was set, therefore it will see Y==1, therefore r1 == 1.

And then do the same logic for the opposite case (exchanging which thread goes first), and do the logic for any other order of the threads - as long as you assume a single thread happens in the order the code is listed in, you will find that one of the r's always comes out == 1 (if not both). But never both == 0. So our tradition of "code-line-N happens before code-line-N+1" plus our logic (that all programmers apply all day) concludes that r1 == 0 and r2 == 0 can't happen.

If you grew up coding in different circumstances (you once mentioned doing lots of music/sound stuff early on - which tends to be very lock-free-ish), then maybe you don't make the traditional "N happens before N+1" assumptions.

Having said all that, it doesn't mean thinking about re-orderings helps anything. That I agree with. The key isn't "here's more to think about" (reordering), it is "here's less to think about" - ie less to assume - stop thinking a single thread happens exactly one line after the other. But this seems to be hard to remember NOT to think about, since we've been thinking that way for years.

It is better to think about how writing shared variables exposes state, and whether that correctly exposes only invariants. You may already be thinking this way. eg: Imagine the invariant is assert(y >= x). Imagine p is local, and the invariant only needs to hold once p is exposed globally.

p->x = p->y = 0; // invariant OK
...
p->x = 10;       // invariant temporarily broken, but local, so OK
p->y = 20;       // invariant restored
// can now expose:
globalP = p;     // globally expose state here, after invariant is ensured.

Thinking of "invariant exposure" makes it simple to realize that the globalP = p is the point of exposure (or "publication"). For threading and our memory model, this is also exactly where the barrier is needed (if you want to think about barriers) and exactly where the atomic operation is needed (or where the mutex can be unlocked). So just make globalP atomic. No need to think about reordering.

But if people ask why the point of exposure is so important, or why globalP needs to be atomic, or how it could possibly fail if globalP was just a pointer, you need to explain that p->y = 20 can happen (or be seen) after globalP = p. Which is like the r1 == r2 == 0 example - it goes against tradition. Most people would think that once globalP has been set, then globalP->y == 20. Since that's the order the lines of code are written in. But that's only true if globalP is an atomic.
Once they believe and/or understand that "invariant exposure happens on shared variables, therefore shared variables must be atomic", then reordering can be forgotten. You can then think of your code as blocks that happen between reads/writes of atomics (ie atomics are the sequence points). Just like critical sections are blocks of code. You have blocks defined by published-invariant sequence points. Or something like that.

So I agree "sequential consistency" doesn't help much. Nor does reordering. Think about publishing invariants. (Actually, what "sequential consistency" of the atomics means is that you can now reason about the large code blocks defined by those sequence points, ignoring their details. So that is useful.)

Tony

(For the pedantic: in this example, sequential consistency was overkill; we only needed "release" semantics on globalP = p)
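(Tony's publication example, sketched in C++11; the Data type and all names are illustrative, and per his pedantic note the store only needs release:)

    #include <atomic>

    struct Data { int x, y; }; // invariant: y >= x

    std::atomic<Data *> globalP(nullptr);

    void publish() {
        Data * p = new Data;
        p->x = p->y = 0;  // invariant OK, still private
        p->x = 10;        // invariant temporarily broken, but local, so OK
        p->y = 20;        // invariant restored
        globalP.store(p, std::memory_order_release); // point of exposure
    }

    void observe() {
        Data * p = globalP.load(std::memory_order_acquire);
        if (p) {
            // the acquire load synchronizes with the release store,
            // so both writes to *p are visible: p->y >= p->x holds here
        }
    }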

on Sat Aug 06 2011, Gottlob Frege <gottlobfrege-AT-gmail.com> wrote:
On Mon, Aug 1, 2011 at 10:49 AM, Dave Abrahams <dave@boostpro.com> wrote:
on Mon Aug 01 2011, Gordon Woodhull <gordon-AT-woodhull.com> wrote:
On Jul 31, 2011, at 1:05 PM, Dave Abrahams wrote:
I don't get it. Please, show me an example.
I am definitely not an expert. I'll quote a diagram from this fascinating paper by Adve and Boehm [1], using Dekker's mutual exclusion algorithm as an example
Initially X = Y = 0

Thread 1: X = 1; r1 = Y;
Thread 2: Y = 1; r2 = X;
Can r1 = r2 = 0?
Sure they can, that's a race condition: Thread 1 modifies X while thread 2 reads it. Anything could happen. Why would anyone expect otherwise?
Most people are surprised that *both* r1 and r2 can be 0 at the same time.
"At the same time" is ill-defined until you put in some kind of synchronization. If you synchronize those two threads at the end I expect all their effects to have taken place. But they would still have a race condition so I wouldn't expect any particular result.
We have been trained that our code (single threaded) happens sequentially. So Y = 1; r2 = X;
means "set Y **THEN** read X" - *in that order*
But it doesn't mean that!
now, really, we don't know or care about the true order (from the point of view of this CPU) until we later read it (from this CPU), but that's how we imagine it. In the single-threaded world we grew up in.
Who is "we," Kimosabe? I *really* don't think about code that way. I generally try to avoid writing code that makes me think about the order in which things happen.
So imagine the race you mention - thread1 is writing X while thread2 is reading it. In the traditional view - ie assuming a thread of code happens in the same order that it was written - this means:
A. thread1 has not yet read Y (thread1 is currently writing X, and reading Y happens next)
B. thread2 has finished setting Y (because it is currently reading X, which happens after setting Y)
----
C. therefore thread1 *must* read Y after it was set, therefore it will see Y==1, therefore r1 == 1.
But my point is that even if this reasoning held, it is a hopelessly un-scalable way to think about parallel execution. I can't reason that way about the behavior of a complex system without having a nervous breakdown, and I don't believe it's tractable for most other people either. That's why I'm saying that sequential consistency is not a game-changer in terms of reasoning about parallelism.
If you grew up coding in different circumstances (you once mentioned doing lots of music/sound stuff early on - which tends to be very lock-free-ish), then maybe you don't make the traditional "N happens before N+1" assumptions.
I don't think I ever had to confront the issue.
Having said all that, it doesn't mean thinking about re-orderings helps anything. That I agree with. The key isn't "here's more to think about" (reordering), it is "here's less to think about" - ie less to assume - stop thinking a single thread happens exactly one line after the other. But this seems to be hard to remember NOT to think about, since we've been thinking that way for years.
It is better to think about how writing shared variables exposes state, and whether that correctly exposes only invariants. You may already be thinking this way.
Yes, that's exactly what I'm saying.
eg: Imagine the invariant is assert(y >= x). Imagine p is local, and the invariant only needs to hold once p is exposed globally.
p->x = p->y = 0; // invariant OK
...
p->x = 10;       // invariant temporarily broken, but local, so OK
p->y = 20;       // invariant restored
// can now expose:
globalP = p;     // globally expose state here, after invariant is ensured.
Thinking of "invariant exposure" makes it simple to realize that the globalP = p is the point of exposure (or "publication"). For threading and our memory model, this is also exactly where the barrier is needed (if you want to think about barriers) and exactly where the atomic operation is needed (or where the mutex can be unlocked).
Yes.
So just make globalP atomic. No need to think about reordering.
Only if the atomic comes with the barriers that ensure the writes to *p have happened before globalP is modified... which is what sequential consistency (the default for atomics in C++11) provides.
But if people ask why the point of exposure is so important, or why globalP needs to be atomic or how it could possibly fail if globalP was just a pointer, you need to explain that p->y = 20 can happen (or be seen) after globalP = p. Which is like the r1 == r2 == 0 example - it goes against tradition.
No you don't. All you need to say is that *every* datastructure (even a pointer like globalP) has an invariant that is broken during modification. The implication is that unless globalP is atomic the other thread might see a broken value of it and then all bets are off.
Most people would think that once globalP has been set, then globalP->y == 20. Since that's the order the lines of code are written in. But that's only true if globalP is an atomic.
You don't even need to think as far as globalP->y if you assume globalP can be totally broken unless access is synchronized.
Once they believe and/or understand that "invariant exposure happens on shared variables therefore shared variables must be atomic", then reordering can be forgotten. You can then think of your code as blocks that happen between read/writes of atomics (ie atomics are the sequence points). Just like critical sections are blocks of code. You have blocks defined by published-invariant sequence points. Or something like that.
Something like that. *If* programming with atomics is for mortals.
So I agree "sequential consistency" doesn't help much. Nor does reordering. Think about publishing invariants.
I think about "not exposing broken invariants."
(Actually what "sequential consistency" of the atomics means is that you can now reason about the large code blocks defined by those sequence points, ignoring their details. So that is useful.)
Tony
(For the pedantic: in this example, sequential consistency was overkill; we only needed "release" semantics on globalP = p)
Right. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Who is "we," Kimosabe?
Almost all other programmers, I think.
I *really* don't think about code that way. I generally try to avoid writing code that makes me think about the order in which things happen.
I believe you. But I expect you are the exception. So to understand some of the difficulties people have with lockfree, I think first you would need to get into the sequential habit, only to understand what it is like to break the habit. Like taking up smoking to know what it is like to quit.

If you instead think more about state/invariants, then the hard part might only be giving up areas of protection (member functions/locks) where you can break the invariants. And give up some things that would normally be made part of the invariant. Like a pointer managing a list. Allow the temporary state in the middle of an add to be exposed to another operation (in a way that all operations can still manage). Often the invariants become complicated conditions - "either the pointer is valid or the counter is odd or ..." - yet you can't check them all in a single operation. And then sometimes you do use order - ie if I read the counter, then the pointer, then re-read the counter, and the counter hasn't changed, then the pointer is valid (assuming proper barriers). It helps if you like logic puzzles (which are also often easier if you draw state diagrams of some kind - hmmm). So invariant management with very little room to temporarily break them.

Anyhow, I think I strayed off topic. And I'll just go further... It is interesting to me to understand how other programmers think. I don't think much about order, or invariants, or state; I spend most time thinking about how the pieces of code fit together - or more importantly, stay apart. Tony
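(A sketch of the counter-validation read Tony describes - essentially a seqlock-style read protocol. Names are illustrative, and the "proper barriers" are spelled here as acquire loads:)

    #include <atomic>

    std::atomic<unsigned> counter(0); // writer makes it odd while updating
    std::atomic<void *>   pointer(nullptr);

    void * read_validated() {
        for (;;) {
            unsigned c1 = counter.load(std::memory_order_acquire);
            if (c1 & 1)
                continue; // writer in progress, retry
            void * p = pointer.load(std::memory_order_acquire);
            unsigned c2 = counter.load(std::memory_order_acquire);
            if (c1 == c2)
                return p; // counter unchanged across the read: p is valid
        }
    }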

on Tue Aug 09 2011, Gottlob Frege <gottlobfrege-AT-gmail.com> wrote:
Who is "we," Kimosabe?
Almost all other programmers, I think.
I *really* don't think about code that way. I generally try to avoid writing code that makes me think about the order in which things happen.
I believe you. But I expect you are the exception. So to understand some of the difficulties people have with lockfree, I think first you would need to get into the sequential habit, only to understand what it is like to break the habit. Like taking up smoking to know what it is like to quit.
umm... no thanks.
If you instead think more about state/invariants, then the hard part might only be giving up areas of protection (member functions/locks) where you can break the invariants.
That's the hard part about lockfree programming. But most people shouldn't do that, right?
And give up some things that would normally be made part of the invariant. Like a pointer managing a list.
huh?
Allow the temporary state in the middle of an add to be exposed to another operation (in a way that all operations can still manage).
To put it more simply, states within the invariant can be exposed, and other states can't. Don't make this more complicated than it needs to be ;-)
Often the invariants become complicated conditions - "either the pointer is valid or the counter is odd or ..." yet you can't check them all in a single operation.
Checking is irrelevant, though, isn't it?
And then sometimes you do use order - ie if I read the counter, then the pointer, then re-read the counter, and the counter hasn't changed, then the pointer is valid (assuming proper barriers).
To put it more simply, states within the invariant can be exposed, and other states can't. [etc]
Anyhow, I think I strayed off topic. And I'll just go further... It is interesting to me to understand how other programmers think. I don't think much about order, or invariants, or state; I spend most time thinking about how the pieces of code fit together - or more importantly, stay apart.
I do that too, for architectural purposes. It's a useful way of thinking, but usually not very useful (to me) for determining correctness. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

I have not compared the 'folkloric' queue yet.
I meant ring buffer. Can't find it in Herlihy. Got a reference?
it is called `bounded queue'. it implements the same idea, but the algorithm looks different. taomp wasn't released when i wrote all this code, so i basically learned about lock-free programming the hard way (reading research papers and trying to implement them). the ringbuffer is adapted from other open-source projects (linux kernel, jack, portaudio, supercollider). cheers, tim

I have not compared the 'folkloric' queue yet.
I meant ring buffer. Can't find it in Herlihy. Got a reference?
it is called WaitFreeQueue (section 3). however the implementation looks quite different from the book: WaitFreeQueue simply increments read/write indices and uses the integer modulo to wrap the indices into the queue range. if the size is not a power of two, the implementation cannot use bitmasks for that, and would be prone to integer overflows ... however the ideas are the same ... cheers, tim
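(A sketch of the indexing difference tim describes; with a power-of-two capacity the wrap is a bitmask and a monotonically increasing unsigned index survives overflow, while plain modulo on an arbitrary size does not. Names are illustrative:)

    #include <cstddef>

    const std::size_t capacity = 1024; // power of two
    const std::size_t mask = capacity - 1;

    // power-of-two wrap: correct even when 'index' (an ever-increasing
    // unsigned counter) eventually overflows, because the capacity
    // divides the unsigned wrap-around modulus
    std::size_t slot_pow2(std::size_t index) { return index & mask; }

    // arbitrary-size wrap: breaks at unsigned overflow unless 'size'
    // divides the wrap-around modulus -- the overflow hazard mentioned above
    std::size_t slot_mod(std::size_t index, std::size_t size) { return index % size; }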

hi gordon,
the new interface is rather complex because i need to resolve from a `handle' (pointer or 16bit index) to a node pointer (and reverse). maybe tagged_pointer and freelist can be exposed to the public interface at a later point, but for now i hesitate to expose the full freelist interface. but maybe a simplified version could be exposed?
An interesting problem. If my team adapts split-ordered list to freelists, it will give another basis for the abstraction.
that would be useful!
good eye, the missing word is `freelist'. the michael/scott paper only mentions that they use a freelist, but they simply omit that part from the pseudocode. in my implementation, the same memory location is used for linking the nodes of the queue and the nodes of the freelist.
Aha. I'd like to see this kind of stuff in the documentation, either on the pages describing each algo or in the Rationale.
well, maybe a rationale section should be added, describing the implementation in detail ...
i should probably add a recommendation to `the art of multiprocessor programming'. iirc it also covers the michael/scott queue ...
Oh cool, I have that but I haven't looked at it. If the 'folklore' is via there, please say so.
So I looked at Herlihy's stack versus yours, and there is one mismatch in push():
bool push(T const & v)
{
    node * newnode = pool.construct(v);
    if (newnode == 0)
        return false;

    tagged_node_ptr old_tos = tos.load(detail::memory_order_relaxed);
    for (;;) {
        tagged_node_ptr new_tos (newnode, old_tos.get_tag());
        newnode->next.set_ptr(old_tos.get_ptr());
        if (tos.compare_exchange_weak(old_tos, new_tos))
            return true;
    }
}
In the Herlihy implementation, old_tos is fetched effectively at the beginning of the loop. It seems to me that with your code, if someone else pushes or pops between when old_tos is fetched and the CAS, it will keep on looping until it sees the same node that used to be at the top is on top again.
compare_exchange actually updates old_tos, so there is no need to load it again ... again, this is the difference between paper and actual implementation ;)
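(A sketch of the idiom, with a plain int standing in for the tagged pointer: on failure, compare_exchange_weak writes the freshly observed value back into its first argument, so the loop never needs an explicit reload. Names are illustrative:)

    #include <atomic>

    std::atomic<int> tos(0);

    void cas_loop_increment() {
        int old_tos = tos.load(std::memory_order_relaxed); // load once
        for (;;) {
            int new_tos = old_tos + 1; // build the new value from the snapshot
            // "if compare equal then set, else retrieve": on failure,
            // old_tos now holds the current value of tos
            if (tos.compare_exchange_weak(old_tos, new_tos))
                return;
        }
    }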
lock-free data structures cannot scale linearly: when multiple threads try to perform the same operation, one compare_exchange will fail and one of the threads has to retry. the more threads, the more retries.
Well, you happen to have chosen the data structures with the tightest memory access patterns. :-D
Since you are only accessing the beginning and end of the stack/queue, you're seeing massive contention. A hash table can scale linearly. I didn't believe it until I saw it with my own eyes!
sure, a hash table will have less contention, so it should scale linearly ...
We could clean up the split-ordered list code and adapt it to freelists for donation to Lockfree, if the library is accepted.
as a workaround for the dcas requirement one can use array indices (the reason for my current freelist refactoring).
Ideally the freelist could have one consistent interface with different underlying behavior (i.e. size of return type) depending on what's supported. I don't know if this is possible.
currently, i have to use two freelist implementations (one for tagged pointers and one for tagged indices). both share the same interface, but introduce the concept of a `handle' that can be converted to a pointer using a member function of the freelist (and a `tagged handle' for ABA prevention). i just hesitate to expose these classes, because the interface couldn't be changed anymore ...
I assume your freelist has the problem that the size of the counter tag determines the probability of some horrible silent ABA failure? This should also be documented.
actually it's the tagged pointer and the tagged index. i'll queue it for the `rationale' section of the docs.
if i feel comfortable with your implementation, i am happy to adopt it!
I'm checking with my team to see if they're still willing to put in the effort to bring this to Boost. Obviously I'd also put in good work and be ultimately responsible for it if it happens.
cool! i'll definitely try to help integrate it into boost.atomic cheers, tim

On Jul 30, 2011, at 5:11 AM, Tim Blechmann <tim@klingt.org> wrote:
compare_exchange actually updates old_tos, so there is no need to load it again
Oho, I get it. Instead of "if compare equal then exchange" it's more like "if compare equal then set else retrieve". Tricky.
again, this is the difference between paper and actual implementation ;)
In this case, looks like the book is taking an abstraction penalty for exposition. Way better than cases where a paper will leave out necessary details because they're too messy.
participants (8): Alexander Terekhov, Dave Abrahams, Gordon Woodhull, Gottlob Frege, Ilya Sokolov, Peter Dimov, Stephan T. Lavavej, Tim Blechmann