
[Dave Abrahams pointed me at this thread.] Alexander Terekhov <terekhov <at> web.de> writes:
Dave Abrahams wrote:
on Thu Aug 25 2011, Alexander Terekhov <terekhov-AT-web.de> wrote:
Ilya Sokolov wrote:
AFAIK, in the example above another thread can't observe reordering without introducing UB.
I meant "observed" as in "check the sequence of writes in the generated code".
Do we care about what the generated code looks like? IIUC what counts is the sequence of memory states that another thread can see.
But int can be simply changed to relaxed atomic<int> so that another thread can observe reordering without UB.
Sure.
I meant that for the example upthread, with an int instead of relaxed atomic<int>, another thread's code triggering UB does not prevent us from observing reordered writes in the writing thread's generated code, simply by looking at it.
Changing int to relaxed atomic<int> simply removes UB but does not change the relevant code in the writing thread.
regards, alexander.
I agree with Alexander and most of the recent postings. If I write

mx.lock(); x = 1; mx.unlock();
my.lock(); y = 1; my.unlock();

there is nothing to prevent the stores to x and y from becoming visible out of order at the hardware level. Enforcing ordering here would add needless and substantial cost. But a data-race-free program cannot observe such reordering. (One reference for that is Sarita Adve's and my PLDI 08 paper. Unofficial TR version at http://www.hpl.hp.com/techreports/2008/HPL-2008-56.html. I believe Mark Batty et al have a much more complete and formal proof.)

If we change x and y to be atomic variables, and the stores to be memory_order_relaxed, then an observer can tell the difference. I.e. if x and y are initially zero, and another thread does

r1 = y; r2 = x;

(or the acquire versions), it can see r1 = 1 and r2 = 0, since there are no synchronizes-with relationships between threads, and thus nothing to enforce ordering. If you use memory_order_relaxed, or even the acquire/release forms, things do get MUCH more complicated.

The equivalent rules for other languages are an interesting story. I believe Posix implementors assume the above rules. I have a PPoPP 2007 paper that explains why that's not exactly what Posix actually states. OpenMP 3.0 has stricter rules which are also not consistently obeyed, and are somewhat fixed in OpenMP 3.1. (See my MSPC 11 paper for some details.) Java has arguably the same rules as C++ in this respect.

In my mind, the main way in which sequential consistency differs from the cheaper acquire/release semantics that Alexander wants is that

a) Dekker's algorithm (which by itself isn't that practically interesting), plus a number of real lock-free algorithms (which do matter), don't work.

b) In my mind, acquire/release is a far more difficult model to explain to non-experts than the model for C++ with only sequentially consistent atomics. You can no longer reason about interleaving thread actions, either for purposes of determining the existence of a data race, or when defining what data-race-free programs mean.

This was extensively discussed in C++ meetings starting in about 2005. The current solution, with default-sequentially-consistent atomics but explicit support for weaker ordering, was the resulting committee compromise.

Hans
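[A minimal compilable sketch of Hans's relaxed-atomic example; the test harness around the two threads is an illustration, not part of the original message:]

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

void writer() {
    x.store(1, std::memory_order_relaxed);
    y.store(1, std::memory_order_relaxed);
}

void observer() {
    int r1 = y.load(std::memory_order_relaxed);
    int r2 = x.load(std::memory_order_relaxed);
    // r1 == 1 && r2 == 0 is a permitted outcome: no synchronizes-with
    // relationship exists, so nothing enforces the store order.
    if (r1 == 1 && r2 == 0)
        std::puts("stores observed out of order");
}

int main() {
    std::thread t1(writer), t2(observer);
    t1.join();
    t2.join();
}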

"Boehm, Hans" wrote: [...]
In my mind, the main way in which sequential consistency differs from the cheaper acquire/release semantics that Alexander wants is that
Actually I want an even more relaxed 'modes' model.
a) Dekker's algorithm (which by itself isn't that practically interesting), plus a number of real lock-free algorithms (which do matter) don't work.
b) In my mind, acquire/release is a far more difficult model to explain to non-experts than the model for C++ with only sequentially consistent atomics. [...]
Difficult model? I view it as job security for those who can understand.

Consider also that, per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html (the x86, including x86-64, part):

Load Seq_Cst: MOV (from memory)
Store Seq Cst: (LOCK) XCHG // alternative: MOV (into memory),MFENCE

is an overkill for typical use cases (e.g. see the recent "Double checked locking pattern article on aristeia" thread on comp.lang.c++.moderated), compared to:

Load Acquire: MOV (from memory)
Store Release: MOV (into memory)

regards, alexander.
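[The double-checked locking pattern Alexander alludes to is the classic case where the cheaper mapping suffices; a minimal sketch, with hypothetical names:]

#include <atomic>
#include <mutex>

struct Widget { /* ... */ };                 // hypothetical payload

std::atomic<Widget*> instance{nullptr};
std::mutex init_mutex;

Widget* get_instance() {
    // Fast path: on x86 this acquire load is a plain MOV.
    Widget* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> guard(init_mutex);
        p = instance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new Widget;
            // Publish: a release store, again a plain MOV on x86,
            // with no (LOCK) XCHG or MFENCE.
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}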

Alexander Terekhov wrote:
Consider also that
"Load Seq_Cst: MOV (from memory) Store Seq Cst: (LOCK) XCHG // alternative: MOV (into memory),MFENCE"
is an overkill for typical use cases...
But that's not a problem because everyone who understands should use explicit constraints, even if they happen to be memory_order_seq_cst. Relying on the SC default is bad practice because it can (and, to be on the safe side, should) be interpreted to mean that the author just hasn't figured out the minimum requirements.

Peter Dimov wrote:
Alexander Terekhov wrote:
Consider also that
"Load Seq_Cst: MOV (from memory) Store Seq Cst: (LOCK) XCHG // alternative: MOV (into memory),MFENCE"
is an overkill for typical use cases...
But that's not a problem because everyone who understands should use explicit constraints, even if they happen to be memory_order_seq_cst. Relying on the SC default is bad practice because it can (and, to be on the safe side, should) be interpreted to mean that the author just hasn't figured out the minimum requirements.
I think that atomic<>'s memory order default should express minimum requirements, not maximum (SC).

regards, alexander.

Peter Dimov <pdimov <at> pdimov.com> writes:
Alexander Terekhov wrote:
Consider also that
"Load Seq_Cst: MOV (from memory) Store Seq Cst: (LOCK) XCHG // alternative: MOV (into memory),MFENCE"
is an overkill for typical use cases...
But that's not a problem because everyone who understands should use explicit constraints, even if they happen to be memory_order_seq_cst. Relying on the SC default is bad practice because it can (and, to be on the safe side, should) be interpreted to mean that the author just hasn't figured out the minimum requirements.
I would have stated this differently, though probably with the same result. At least when writing application-level code, I would always rely on the default initially, and not worry about ordering. I would explicitly specify the ordering only when it turns out that memory_order_seq_cst introduces a performance problem. If nothing else, this would allow me to separate out debugging of memory model issues.

My experience is that very few people manage to get memory ordering right. My PPoPP 07 and MSPC 11 papers both have examples of commonly used mutex implementations getting it wrong in various interesting ways. We didn't understand what the specs actually required, but on top of that some of the implementations got it wrong in ways that were clearly independent of any misunderstanding of the spec. Given that the experts can't figure it out for what should be the easy cases, I'd much rather most people just stick with the sequentially consistent default.

This is entirely consistent with Peter's claim that using the sequentially consistent default means I haven't thought about it. But in many cases I really don't want to think about it, and that may be a fine state of affairs. For example, if I use an atomic counter, it's very likely that either:

1. It's not performance critical; I'm using atomics because they're more direct than mutexes in this case, or because I need the signal handler/interrupt safety, and the SC version is fine; or

2. It is performance critical, and I probably want to think hard about alternate solutions that keep thread-local counts.

In both cases, it's unlikely that memory ordering will significantly impact application performance. Of course this doesn't apply to all use cases.

Hans
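[Hans's atomic counter case, spelled out; a minimal sketch with a hypothetical counter:]

#include <atomic>

std::atomic<long> event_count{0};   // hypothetical counter

void record_event() {
    // The default memory order is memory_order_seq_cst, i.e. this is
    // event_count.fetch_add(1, std::memory_order_seq_cst): nothing to
    // reason about, and fine if the counter isn't performance critical.
    ++event_count;
}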

on Fri Aug 26 2011, "Boehm, Hans" <hans.boehm-AT-hp.com> wrote:
I agree with Alexander and most of the recent postings. If I write
mx.lock(); x = 1; mx.unlock();
my.lock(); y = 1; my.unlock();
There is nothing to prevent the stores to x and y from becoming visible out of order at the hardware level. Enforcing ordering here would add needless and substantial cost. But a data-race-free program cannot observe such reordering. (One reference for that is Sarita Adve's and my PLDI 08 paper. Unofficial TR version at http://www.hpl.hp.com/techreports/2008/HPL-2008-56.html. I believe Mark Batty et al have a much more complete and formal proof.)
And that actually supports what I've been saying. It doesn't really matter if a program with data races can observe such a reordering, because programs with data races invoke undefined behavior, so they can observe the moon being carved open to reveal green cheese.
If we change x and y to be atomic variables, and the stores to be memory_order_relaxed, then an observer can tell the difference.
Meaning a data-race-free observer.
I.e. if x and y are initially zero, and another thread does
r1 = y; r2 = x;
(or the acquire versions), it can see r1 = 1 and r2 = 0, since there are no synchronizes-with relationships between threads, and thus nothing to enforce ordering. If you use memory_order_relaxed, or even the acquire/release forms, things do get MUCH more complicated.
Right, but, just for the record, we were explicitly discussing the default behaviors.
In my mind, the main way in which sequential consistency differs from the cheaper acquire/release semantics that Alexander wants is that
a) Dekker's algorithm (which by itself isn't that practically interesting), plus a number of real lock-free algorithms (which do matter) don't work.
Don't work with SC or with acquire/release semantics?
b) In my mind, acquire/release is a far more difficult model to explain to non-experts than the model for C++ with only sequentially consistent atomics. You can no longer reason about interleaving thread actions, either for purposes of determining the existence of a data race, or when defining what data-race-free programs mean.
How *do* you reason about data races in that model?

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

Dave Abrahams <dave <at> boostpro.com> writes:
And that actually supports what I've been saying. It doesn't really matter if a program with data races can observe such a reordering, because programs with data races invoke undefined behavior, so they can observe the moon being carved open to reveal green cheese.
If we change x and y to be atomic variables, and the stores to be memory_order_relaxed, then an observer can tell the difference.
Meaning a data-race-free observer.

Yes.
on Fri Aug 26 2011, "Boehm, Hans" <hans.boehm-AT-hp.com> wrote:
In my mind, the main way in which sequential consistency differs from the cheaper acquire/release semantics that Alexander wants is that
a) Dekker's algorithm (which by itself isn't that practically interesting), plus a number of real lock-free algorithms (which do matter) don't work.
Don't work with SC or with acquire/release semantics?

Don't work with just acquire/release semantics. Sorry about being unclear.
b) In my mind, acquire/release is a far more difficult model to explain to non-experts than the model for C++ with only sequentially consistent atomics. You can no longer reason about interleaving thread actions, either for purposes of determining the existence of a data race, or when defining what data-race-free programs mean.
How *do* you reason about data races in that model?
You can clearly reason explicitly in terms of happens-before ordering, as in 1.10 of the standard. I think you can also informally reason in a way similar to the sequentially consistent model, but you have to consider that atomic (release) stores are indefinitely buffered, and only become visible to other threads at some later time, but in their original order. I haven't thought enough about the implications this would have on programming rules and interface specifications.

The canonical example of how this gets tricky, based on Dekker's example, is (thread1_busy, thread2_busy acquire/release atomics, z an ordinary variable, everything initially zero/false):

Thread 1:
thread1_busy = true;
if (!thread2_busy) { // thread 2 either not busy, or will see that we are
  z = 17;
}

Thread 2:
thread2_busy = true;
if (!thread1_busy) { // thread 1 either not busy, or will see that we are
  z = 42;
}

This DOES have a data race in the acquire/release model, i.e. the comments are incorrect. And that race can't be understood in terms of interleaving. But you can understand it if you explicitly reason about buffered stores.

Unfortunately, I think you also need to view unlock operations as getting buffered, and occurring at a later time, since the first and fourth operations in

x.store(1, memory_order_release);
m.unlock();
m2.lock();
r1 = y.load(memory_order_acquire);

can become visible out of order.

The acquire/release model is more-or-less what is used by C# volatiles, for better or worse. Some of the Microsoft Research folks have thought about it more than I have.

Hans
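[Hans's pseudocode transcribed into C++11 so the orderings are explicit; the transcription is mine:]

#include <atomic>

std::atomic<bool> thread1_busy{false}, thread2_busy{false};
int z = 0;   // ordinary, non-atomic variable

void thread1() {
    thread1_busy.store(true, std::memory_order_release);
    // Nothing orders the release store above before the acquire load
    // below; both threads can see the other's flag as false, and then
    // the two plain stores to z race.
    if (!thread2_busy.load(std::memory_order_acquire))
        z = 17;
}

void thread2() {
    thread2_busy.store(true, std::memory_order_release);
    if (!thread1_busy.load(std::memory_order_acquire))
        z = 42;
}

// With the default (seq_cst) stores and loads instead, at least one
// thread is guaranteed to see the other's flag, and z is race-free.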

On Mon, Aug 29, 2011 at 5:03 PM, Hans Boehm <Hans.Boehm@hp.com> wrote:
[...]
The canonical example of how this gets tricky, based on Dekker's example, is (thread1_busy, thread2_busy acquire/release atomics, z an ordinary variable, everything initially zero/false):
Thread 1:
thread1_busy = true;
if (!thread2_busy) { // thread 2 either not busy, or will see that we are
  z = 17;
}

Thread 2:
thread2_busy = true;
if (!thread1_busy) { // thread 1 either not busy, or will see that we are
  z = 42;
}
This DOES have a data race in the acquire/release model, i.e. the comments are incorrect. And that race can't be understood in terms of interleaving. But you can understand it if you explicitly reason about buffered stores.
Well, I don't know about interleaving, but reordering works, as long as you remember that thread1_busy = true and if (!thread2_busy) (and 1 <-> 2 vice versa) are fenced such that they can actually be reordered. Not sure if that is different than "interleaving". You can also understand it in terms of synchronizes-with. Or at least come to the conclusion that you can't find the right variable to synchronize with. See the discussion in the comments in http://bartoszmilewski.wordpress.com/2008/12/01/c-atomics-and-memory-orderin... and Anthony's explanation at http://www.justsoftwaresolutions.co.uk/threading/petersons_lock_with_C++0x_a...

Hans Boehm wrote: [...]
plus a number of real lock-free algorithms (which do matter) don't work.
I'm really interested to know the exact number/names of algorithms.

[... "based on Dekker's example" ...]

Yes, acquire/release is free from the implicit store-load barrier (same as with TSO and even IBM 370). Note that computer hardware vendors never provided and still do not provide an implicit store-load barrier in hardware.

In this respect I really like "Figure 8": http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf (Figure 8: Simple categorization of relaxed models. ...)

Why the heck is C++'s default mode for atomics so special in this respect?

regards, alexander.

Alexander Terekhov <terekhov <at> web.de> writes:
Hans Boehm wrote: [...]
plus a number of real lock-free algorithms (which do matter) don't work.
I'm really interested to know the exact number/names of algorithms.
I asked Doug Lea this question 3 or 4 years ago when this was still being actively discussed. He's more of an expert on practical concurrent algorithms. I got back a bunch of answers. One simple and common example is the algorithm for releasing a queued lock. I need to set the lock state to unlocked and then check the queue for waiters. If those are reordered, I get lost wakeup problems. His assessment was that such cases are surprisingly common, though there are clearly also a bunch of scenarios, like lazy initialization via double-checked locking, for which acquire-release atomics suffice.
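[A sketch of the queued-lock release path Hans describes; the names and the wake_one() call are hypothetical, and the point is only the store-then-load ordering:]

#include <atomic>

std::atomic<bool> locked{true};
std::atomic<int>  waiters{0};   // hypothetical queue-state summary

void unlock() {
    locked.store(false);         // default seq_cst store...
    if (waiters.load() > 0) {    // ...stays ordered before this load
        // wake_one();           // hypothetical wakeup call
    }
}

// If the store were memory_order_release and the load relaxed/acquire,
// the load could effectively be satisfied before the store became
// visible: a thread that still sees the lock held enqueues itself and
// is never woken, i.e. the lost-wakeup problem.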
[... "based on Dekker's example" ...]
Yes, acquire/release is free from the implicit store-load barrier (same as with TSO and even IBM 370).

Note that computer hardware vendors never provided and still do not provide an implicit store-load barrier in hardware.
Note that we're not actually requiring a full store-load barrier in the implementation. We only need a barrier that orders atomic stores and atomic loads, which could perhaps be appreciably cheaper, though on current architectures it isn't.
In this respect I really like "Figure 8":
http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf (Figure 8: Simple categorization of relaxed models. ...)
The situation is really quite different when you're talking about 1995 hardware ISAs as opposed to programming languages like C++. At the 1995 hardware level, you can no longer distinguish data and synchronization accesses, so you would also add the overhead to data accesses. And whatever you promise is unlikely to be visible to the programmer anyway, since compiler reordering is likely to invalidate whatever properties you tried to guarantee at the hardware level. Providing full sequential consistency at the hardware level without compiler changes is likely to buy you essentially nothing, and to do so at significant expense in hardware complexity and/or performance.

For what it's worth, Sarita Adve is both an author of the report you cite and the original and perhaps strongest advocate for the "sequential consistency for data-race-free programs" programming model.
Why the heck is C++'s default mode for atomics so special in this respect?
At the programming language level, this isn't at all unique. I believe Ada 83 atomic accesses were sequentially consistent. Certainly Java volatiles are.
regards, alexander.

Hans Boehm wrote: [...]
For what it's worth, Sarita Adve is both an author of the report you cite and the original and perhaps strongest advocate for the "sequential consistency for data-race-free programs" programming model.
I'm not contra the "sequential consistency for data-race-free programs" programming model for programs using locks. On PPC, for example, such programs don't even need hwsync. For programs with lock-free atomics, OTOH, the races (concurrent accesses to the same locations, with loads competing with concurrent stores) are a feature, not a bug, and SC is simply way too expensive (e.g. it needs hwsync on PPC) for use as the default mode for lock-free atomics: C/C++ is "you don't pay for what you don't need".

regards, alexander.

Alexander Terekhov <terekhov <at> web.de> writes:
Hans Boehm wrote: [...]
For what it's worth, Sarita Adve is both an author of the report you cite and the original and perhaps strongest advocate for the "sequential consistency for data-race-free programs" programming model.
I'm not contra the "sequential consistency for data-race-free programs" programming model for programs using locks. On PPC, for example, such programs don't even need hwsync. For programs with lock-free atomics, OTOH, the races (concurrent accesses to the same locations, with loads competing with concurrent stores) are a feature, not a bug, and SC is simply way too expensive (e.g. it needs hwsync on PPC) for use as the default mode for lock-free atomics: C/C++ is "you don't pay for what you don't need".
regards, alexander.
The question is when "sequential consistency for data-race-free programs" should extend to programs using atomic load, store, and RMW operations. The C++ committee, including me, came to the conclusion that the answer needs to be yes; there are many cases in which the use of atomics is fairly straightforward and useful, and it should be possible to use them without leaving this relatively simple programming model. By doing so, you get a safe programming model by default. Since we do have explicit ordering primitives, you have the option of only paying for what you need. But 90%, or probably 99%, of programmers will not know what they need here. And that's fine.

This is entirely consistent with many other C++ design decisions. The default operator new allocates memory that lives as long as the process, even though that's more expensive than allocating memory local to the current stack frame or thread, and often one of those latter two options would be sufficient. But it would be nasty to use that as default behavior.

The overhead of enforcing sequential consistency is unfortunately currently very platform-specific. On X86, it's increasingly minor, since it's possible to confine the added cost to stores and, as far as I can tell, the added cost is becoming much less than the cost of a coherence miss. And if your performance is limited by the cost of stores to shared variables, you are fairly likely to unavoidably see lots of coherence misses, so there is a hand-wavy argument that this is likely to be a minor perturbation. On other architectures, the costs are unfortunately larger. But my impression is that they are decreasing everywhere, as architects pay more attention to synchronization costs.

Hans
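[One way to read Hans's advice in code; a sketch with a hypothetical completion flag:]

#include <atomic>

std::atomic<bool> done{false};   // hypothetical completion flag

// First version: rely on the defaults (sequentially consistent).
void publish() { done.store(true); }
bool poll()    { return done.load(); }

// Only if profiling shows the seq_cst store on the hot path matters
// (e.g. the XCHG/MFENCE on x86) downgrade explicitly, and keep the
// annotation as documentation of the minimum requirement:
void publish_fast() { done.store(true, std::memory_order_release); }
bool poll_fast()    { return done.load(std::memory_order_acquire); }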

Hans Boehm wrote: [...]
I'm really interested to know the exact number/names of algorithms. I asked Doug Lea this question 3 or 4 years ago when this was still being actively discussed. He's more of an expert on practical concurrent algorithms. I got back a bunch of answers. One simple and common example is the algorithm for releasing a queued lock. I need to set the lock state to unlocked and then check the queue for waiters. If those are reordered, I get lost wakeup problems. ...
Dejavu...

http://www.mailarchive.ca/lists/comp.unix.solaris/2003-12/0937.html#start937

The relevant part is:
------
and subsequent store-load barrier + load to check whether you have a blocked waiter to unblock? Store-load barrier is known to be "the most expensive"...
The store-load barrier is avoided by carefully setting up the sleeping and blocking interactions so that the above code + the slow paths can never lose a waiter.
And you can do it without atomic read-{modify-}write on the fast path in mutex_exit()? Hmm. Please elaborate.
Casper already got this, but here:

1. As long as the thread is on-CPU, you spin. Blocking is a four-stage process:
2. atomically set the waiters bit,
3. re-scan all of the other CPUs on the system, checking to make sure the owner is still not running on them,
4. check that the lock hasn't changed while you were looking,
5. finally, block. If any of the above steps fail, you go back to spinning.
6. if a thread is in the "critical section" (i.e. after the load and before the clear) when we are resuming it (either from an interrupt or in a context switch), we back up the program counter to the beginning of mutex_exit.

There are further complications, but that's the crucial bit.

jonathan

Received on Tue Dec 09 2003 - 17:05:23 PST
------

regards, alexander.

Alexander Terekhov <terekhov <at> web.de> writes:
[...]
Casper already got this, but here:
1. As long as the thread is on-CPU, you spin. Blocking is a four-stage process:
2. atomically set the waiters bit,
3. re-scan all of the other CPUs on the system, checking to make sure the owner is still not running on them,
4. check that the lock hasn't changed while you were looking,
5. finally, block. If any of the above steps fail, you go back to spinning.
6. if a thread is in the "critical section" (i.e. after the load and before the clear) when we are resuming it (either from an interrupt or in a context switch), we back up the program counter to the beginning of mutex_exit.
There are further complications, but that's the crucial bit.
jonathan
Although the existence of such an algorithm is interesting, the point is that avoiding the equivalent of the sequentially consistent store here is incredibly complicated, appears to require scheduler help and, at least for me, would require careful study of the proof before I were willing to implement it. On the other hand, the algorithm using the equivalent of sequentially consistent atomics (which I think is still far more widely used) is orders of magnitude simpler. If you understand the complex algorithm, we assume you can add the ordering constraints correctly.

Interestingly, there is a 2011 POPL paper by Attiya et al that proves, IIRC, that store-load ordering or atomic RMW operations are unavoidable in some cases. I haven't looked into this enough to understand how to reconcile that with the above result. The kind of scheduler hack needed in step 6 above is probably outside their model.

Hans

Hans Boehm wrote: [...]
Although the existence of such an algorithm is interesting, the point is that avoiding the equivalent of the sequentially consistent store here is incredibly complicated, appears to require scheduler help and, at least for me, would require careful study of the proof before I were willing to implement it. On the other hand, the algorithm using the equivalent of sequentially consistent atomics (which I think is still far more widely used) is orders of magnitude simpler. If you understand the complex algorithm, we assume you can add the ordering constraints correctly.
With waiters bit (0x80000000) and atomic RMW, a mutex is simply (pseudo-code below):

WHILE atomic_bit_test_set_hlb(&lock, 1) // efficient acquire MO
  lock_queue_wait(&lock, 1) // wait if locked bit is set

for lock(), and

uintptr_t lock_queue
IF atomic_decrement_rel(lock_queue = &lock) // release MO
THEN lock_queue_wake(lock_queue, 1)

for unlock() (the lock_queue uintptr_t pointer may become indeterminate by the time of lock_queue_wake() and the implementation should handle it; spurious wakes are okay).

Similarly, for a semaphore:

sema_lock:
WHILE !atomic_decrement_if_binand_7FFFFFFF_is_not_zero_hlb(&lock)
  lock_queue_wait(&lock, 0) // wait if sema value is zero

sema_unlock:
uintptr_t lock_queue
IF atomic_increment_rel(lock_queue = &lock) > 0x80000000
THEN lock_queue_wake(lock_queue, 1)

regards, alexander.
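[A rough C++ rendering of the mutex above, as an illustration only: it is not Alexander's exact algorithm. It substitutes a plain exchange for his atomic_decrement_rel unlock and uses C++20 atomic wait/notify_one in place of the hypothetical lock_queue_wait/lock_queue_wake primitives:]

#include <atomic>
#include <cstdint>

class waiters_bit_mutex {
    static constexpr std::uint32_t LOCKED  = 1;
    static constexpr std::uint32_t WAITERS = 0x80000000u;
    std::atomic<std::uint32_t> word{0};

public:
    void lock() {
        std::uint32_t old = 0;
        // Fast path: 0 -> LOCKED with acquire ordering only.
        if (word.compare_exchange_strong(old, LOCKED,
                                         std::memory_order_acquire))
            return;
        for (;;) {
            old = word.load(std::memory_order_relaxed);
            if (!(old & LOCKED)) {
                if (word.compare_exchange_weak(old, old | LOCKED,
                                               std::memory_order_acquire))
                    return;
            } else if (word.compare_exchange_weak(
                           old, old | WAITERS, std::memory_order_relaxed)) {
                word.wait(old | WAITERS);   // block; spurious wakes are okay
            }
        }
    }

    void unlock() {
        // One RMW on the same word clears both bits; the race between
        // clearing the lock and recording a waiter is resolved by the
        // RMW order on this single location.
        if (word.exchange(0, std::memory_order_release) & WAITERS)
            word.notify_one();
    }
};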

Alexander Terekhov <terekhov <at> web.de> writes:
Hans Boehm wrote: [...]

[...]

With waiters bit (0x80000000) and atomic RMW, a mutex is simply (pseudo-code below):

WHILE atomic_bit_test_set_hlb(&lock, 1) // efficient acquire MO
  lock_queue_wait(&lock, 1) // wait if locked bit is set

for lock(), and

uintptr_t lock_queue
IF atomic_decrement_rel(lock_queue = &lock) // release MO
THEN lock_queue_wake(lock_queue, 1)

I'm not sure I understand your syntax here.
I don't see the correctness argument. What prevents the sequence:

1. In thread T0, performing unlock, the atomic_decr_rel update completes, sees no waiters, and puts the updated (released) lock word in the store buffer. (I'm informally viewing memory_order_release as just putting a value in a store buffer, which I think is reasonable here.)
2. Thread T1 tries to acquire the lock, sees it still held, since the store buffer hasn't drained. It enqueues itself.
3. Thread T0's store buffer drains.

Thread T1 is now waiting on an unlocked lock.

The real point here again is that such arguments are subtle. I don't want to have to deal with them unless performance considerations warrant that I do, especially since we have lots of evidence that even experts get these things wrong.

Hans

Hans Boehm wrote:
[...]
I don't see the correctness argument. What prevents the sequence:
1. In thread T0, performing unlock, the atomic_decr_rel update completes, sees no waiters, and puts the updated (released) lock word in the store buffer. (I'm informally viewing memory_order_release as just putting a value in a store buffer, which I think is reasonable here.)
2. Thread T1 tries to acquire the lock, sees it still held, since the store buffer hasn't drained. It enqueues itself.
3. Thread T0's store buffer drains.

Thread T1 is now waiting on an unlocked lock.
lock_queue_wait() is meant to block the calling thread and set the waiters bit only if the locked bit is actually set (an RMW operation).

The races for setting/clearing bits are resolved by RMWs on the same (&lock) memory location. The situation is no different from having the (&lock) memory location protected by a lock, except that with RMWs the program is lock-free on fast paths.

regards, alexander.

Alexander Terekhov <terekhov <at> web.de> writes:
[...]
lock_queue_wait() is meant to block the calling thread and set the waiters bit only if the locked bit is actually set (an RMW operation). The races for setting/clearing bits are resolved by RMWs on the same (&lock) memory location. The situation is no different from having the (&lock) memory location protected by a lock, except that with RMWs the program is lock-free on fast paths.
I think I see what you're doing. By embedding the state of the queue into the same atomic object as the lock bit, you can rely on single-location cache coherence to get the right ordering. That can be made to work for the low-level bit-twiddling implementation, but not for a more generic approach that keeps the queue and the lock bit itself separate.

I would still much rather see a default that works for the full solution space. I prefer to have people like us, who are sometimes willing to go through subtle reasoning about memory ordering, add the annotations, rather than insisting that the average programmer who has never heard of these issues add them to avoid subtle bugs.

Hans

On Sun, Oct 2, 2011 at 2:09 PM, Hans Boehm <Hans.Boehm@hp.com> wrote:
I think I see what you're doing. By embedding the state of the queue into the same atomic object as the lock bit, you can rely on single-location cache coherence to get the right ordering. That can be made to work for the low-level bit-twiddling implementation, but not for a more generic approach that keeps the queue and the lock bit itself separate. I would still much rather see a default that works for the full solution space. I prefer to have people like us, who are sometimes willing to go through subtle reasoning about memory ordering, add the annotations, rather than insisting that the average programmer who has never heard of these issues add them to avoid subtle bugs.
Hans
I think that is the essence of the discussion. I would say that Alexander is of the opinion that the new atomics should have been written for lock-free experts, whereas you might want them approachable by the average programmer.

I worry about giving atomics to the average programmer. Maybe they should have been harder to type than reinterpret_cast. I fully expect most uses (outside a library) to be incorrect.

Tony

Hans Boehm wrote: [... RMW/store-load barrier free (on fast path) release for lock with a queue ...]
Interestingly, there is a 2011 POPL paper by Attiya et al that proves, IIRC, that store-load ordering or atomic RMW operations are unavoidable in some cases. I haven't looked into this enough to understand how to reconcile that with the above result. The kind of scheduler hack needed in step 6 above is probably outside their model.
http://infoscience.epfl.ch/record/161286/files/popl168gf-attiya.pdf (Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated)

Nice title. ;-)

I think they are talking about the impossibility of RMW/store-load barrier free lock acquisition. RMW/store-load barrier free (on fast path) release for a queued lock is out of scope of their paper, AFAICS.

regards, alexander.
participants (6)
- Alexander Terekhov
- Boehm, Hans
- Dave Abrahams
- Gottlob Frege
- Hans Boehm
- Peter Dimov