
After trying to learn about atomics by reading through several articles, processor specs and mailing lists, I am more confused than before. I would be glad if we could (re)start a discussion about the topic. Perhaps I am not the only one to benefit from this. Following are some things I learned, but they might be wrong, and I would appreciate clarification. Also some questions:

1) Atomicity (in this specialized context) is about optimizing the pattern enter_critical_section; do_something; leave_critical_section; by making use of processor/platform-specific means, in particular in the presence of multiple processors. I.e. an atomic lib is primarily about performance.

2) Atomicity would better be addressed by the compiler, given a suitable memory model, than by a library.

3) Despite 2) it would be possible to write a library, but it will be hard to get processor-independent semantics. E.g. there is one concept of read/write/full memory barriers and another of acquire/release semantics for SMP.

4) Does there exist a canonical set of atomic primitives from which others can be built? E.g. given atomic_test_and_set, atomic_load and atomic_store, is it possible to emulate atomic_compare_and_swap? (Without use of system mutexes, of course.)

5) Is it worth the effort to create a library with processor-independent semantics, at the price of not being optimal? E.g. by doing away with the various kinds of barriers and instead simply requiring atomicity and full memory barrier semantics for each operation? Which operations, besides load and store, would be essential?

Sorry if this is not the perfect list to discuss the topic, but I think boost could possibly benefit from such a library, as previous discussions lead me to believe.

Regards, Roland

Roland Schwarz <roland.schwarz@chello.at> writes:
I would be glad if we could (re)start a discussion about the topic. Perhaps I am not the only one to benefit from this.
Sounds sensible.
Following are some things I learned, but this might be wrong, and I would appreciate clarification. Also some questions:
1) atomicity (in this specialized context) is about optimizing the pattern: enter_critical_section; do_something; leave_critical_section; by making use of processor/platform specific means.
Essentially, yes. Other CPUs/threads will either see the state before the atomic op, or after, but never a "partial" effect. On x86, normal reads and writes to suitably-aligned 32-bit values are atomic in this sense.
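As a modern aside (not part of the original 2006 thread): C++11 later standardized exactly this notion of a raw atomic load and store. The sketch below uses std::atomic with relaxed ordering, which on x86 typically compiles to plain MOV instructions; the names are illustrative.

```cpp
#include <atomic>
#include <cstdint>

// A suitably-aligned 32-bit value. std::atomic guarantees each load and
// store is indivisible; with memory_order_relaxed no ordering is implied,
// matching the "raw" atomicity described above.
std::atomic<std::int32_t> shared_value{0};

void raw_store(std::int32_t v) {
    shared_value.store(v, std::memory_order_relaxed);  // indivisible write
}

std::int32_t raw_load() {
    return shared_value.load(std::memory_order_relaxed);  // indivisible read
}
```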
In particular in presence of multiple processors. I.e. an atomic lib is primarily about performance.
Not just about performance. It also enables the construction of the higher-level primitives. Atomic instructions also affect visibility, which is addressed below.
2) atomicity better would be addressed by the compiler, given a suitable memory model, than as a library.
Yes.
3) Despite 2) it would be possible to write a library, but it will be hard to get processor independent semantics. E.g. there is one concept of read/write/full memory barriers or another of acquire/release semantics for SMP.
I think that the memory barrier and acquire/release semantics are just two ways of talking about the same thing.

As I understand it, on x86, the SFENCE instruction is a "Store Fence", which is a "Write Barrier", and has "Release Semantics". Any store instructions which happen before it on this CPU are made globally visible afterwards. No store instructions which occur after it on this CPU are permitted to be globally visible beforehand.

Again on x86, the LFENCE instruction is a "Load Fence", which is a "Read Barrier", and has "Acquire Semantics". Any read instructions which happen before it on this CPU must have already completed afterwards. No load instructions which occur after it on this CPU are permitted to be executed beforehand.

A full memory barrier, the MFENCE instruction on x86, does both.

There is also the concept of a "raw" atomic operation, which has no impact on memory visibility beyond being either done or not done. As described above, on x86 this applies to all suitably-aligned 32-bit reads and writes.

Some atomic operations also incorporate a full memory barrier. On x86, these are the ops that assert the LOCK# signal, which include XCHG (with or without the LOCK prefix), LOCK CMPXCHG, LOCK INC and LOCK ADD, amongst others.
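The fence vocabulary above maps onto C++11's standalone fences, a later portable spelling of the same idea. A minimal message-passing sketch (the names here are illustrative, not from the thread):

```cpp
#include <atomic>
#include <thread>

int payload = 0;                     // ordinary, non-atomic data
std::atomic<bool> published{false};

void producer() {
    payload = 99;                                        // plain store
    std::atomic_thread_fence(std::memory_order_release); // "write barrier"
    published.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!published.load(std::memory_order_relaxed)) {} // spin until flag set
    std::atomic_thread_fence(std::memory_order_acquire);  // "read barrier"
    return payload;  // the fence pair guarantees we observe 99
}

int run_fence_demo() {
    std::thread t(producer);
    int result = consumer();
    t.join();
    return result;
}
```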
4) Does there exist a canonical set of atomic primitives, from which others can be built?
Yes, I'm sure there is, but I'd have to think hard to work out what the minimal set is. I expect that there are several possible such sets.
5) Is it worth the effort to create a library with processor independent semantics, at the price of not being optimal? E.g. by doing away with the various kinds of barriers, instead simply requiring atomicity and full memory barrier semantics for the operation? Which operations, besides load and store would be essential?
I think it's worth the effort. For processor independence, you could just specify that the barriers are "at least" what is specified --- if you specify a read barrier, you might get a store barrier too, and vice versa.
Sorry if this is not the perfect list to discuss the topic, but I think boost could possibly benefit from such a library, as previous discussions let me believe.
The details of the memory model, atomics, and visibility, and how it applies to C++, are under discussion amongst C++ standards committee members. I would imagine that you'd be welcome to join such discussions. Anyway, this is important to boost, if we're going to provide a library that does it.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk

Anthony Williams wrote:
Roland Schwarz <roland.schwarz@chello.at> writes:
In particular in presence of multiple processors. I.e. an atomic lib is primarily about performance.
Not just about performance. It also enables the construction of the higher-level primitives.
As you might know, this is the route I have been following. But the primitives are not necessarily exposed to the user. To be more precise: from a user's perspective, an atomic lib is primarily about performance. Better?
I think that the memory barrier and acquire/release semantics are just two ways of talking about the same thing.
This is a point I am still confused about: acquire/release are "one way" ordering constraints, while memory barriers are "both way". This is how I understand it:

*) Memory barriers are primitives which have no effect other than ordering memory accesses, i.e. they do not store or load anything themselves. They affect only memory operations and have no effect on others. They should also constrain the compiler, disallowing reordering across the barrier. There are three: read_mb, write_mb and full_mb. read_mb orders read accesses, i.e. no read from before may be moved after the barrier and no read from after may be moved before it. write_mb does the same for writes. full_mb disallows moving any read or write across the barrier, and so establishes total order.

*) Acquire/release semantics, on the other hand, establish a conceptually different model of ordering. Acquire disallows moving any access (read or write) from after the primitive to before it, but still allows accesses from before to occur after it. This is a kind of one-way sign for accesses. Release semantics are the other way round. Also, acquire/release is bound to an operation, a kind of attribute of the operation, while memory barriers are operations in their own right.

So as I currently understand it, these two concepts address the same issue, but they are neither orthogonal nor can one be used to synthesize the other. I would be glad to be proved wrong.

Another observation: release/acquire semantics are closer to mutex behavior, since no harm is done when an operation from before mutex acquisition is moved inside the critical section. A barrier would not allow such a movement. True?
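Roland's distinction, barriers as standalone operations versus acquire/release as an attribute of an operation, later became explicit in C++11, where the constraint is passed to the load or store itself. A hedged sketch (illustrative names, not from the thread):

```cpp
#include <atomic>
#include <thread>

int data = 0;
std::atomic<bool> flag{false};

void releaser() {
    data = 7;  // must not be reordered below the release store
    flag.store(true, std::memory_order_release);  // release is an attribute of this store
}

int acquirer() {
    // acquire is an attribute of this load; later accesses cannot move above it
    while (!flag.load(std::memory_order_acquire)) {}
    return data;  // guaranteed to observe 7
}

int run_acqrel_demo() {
    std::thread t(releaser);
    int r = acquirer();
    t.join();
    return r;
}
```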
As I understand it, on x86, the SFENCE instruction is a "Store Fence", which is a "Write Barrier", and has "Release Semantics". Any store instructions which happen before it on this CPU are made globally visible afterwards. No store instructions which occur after it on this CPU are permitted to be globally visible beforehand.
This looks to me as if, in the acquire/release model, it is possible to separate the operation from the "fence". Or, viewed in the other direction, it is possible to optimize by attaching the (otherwise separate) fence to an instruction to save some CPU cycles. True?
Again on x86, the LFENCE instruction is a "Load Fence", which is a "Read Barrier", and has "Acquire Semantics". Any read instructions which happen before it on this CPU must have already completed afterwards.
Are you really sure about this one?
No load instructions which occur after it on this CPU are permitted to be executed beforehand.
This part of the statement makes sense to me. I omitted the rest of your post, since I think it depends on the acquire/release versus memory barriers getting clarified first.
The details of the memory model, atomics, and visibility, and how it applies to C++, are under discussion amongst C++ standards committee members. I would imagine that you'd be welcome to join such discussions.
Hmm, not sure how I could join other than posting to some lists. Do you mean comp.lang.c++.moderated? Roland

Roland Schwarz <roland.schwarz@chello.at> writes:
Anthony Williams wrote:
Roland Schwarz <roland.schwarz@chello.at> writes:
In particular in presence of multiple processors. I.e. an atomic lib is primarily about performance.
Not just about performance. It also enables the construction of the higher-level primitives.
As you might know, this is the route I have been following. But the primitives are not necessarily exposed to the user. To be more precise: from a user's perspective, an atomic lib is primarily about performance. Better?
Maybe. I'm not sure.
I think that the memory barrier and acquire/release semantics are just two ways of talking about the same thing.
This is a point I am still confused about: acquire/release are "one way" ordering constraints, while memory barriers are "both way".
acquire/release gets me in a twist. Re-reading my refs, you're right.
This is as I understand it:
*) Memory barriers are primitives which have no effect other than ordering memory accesses, i.e. they do not store or load anything themselves. They affect only memory operations and have no effect on others. They should also constrain the compiler, disallowing reordering across the barrier. There are three: read_mb, write_mb and full_mb. read_mb orders read accesses, i.e. no read from before may be moved after the barrier and no read from after may be moved before it. write_mb does the same for writes. full_mb disallows moving any read or write across the barrier, and so establishes total order.
Yes.
*) Acquire/release semantics, on the other hand, establish a conceptually different model of ordering. Acquire disallows moving any access (read or write) from after the primitive to before it, but still allows accesses from before to occur after it. This is a kind of one-way sign for accesses. Release semantics are the other way round.
Also, acquire/release is bound to an operation, a kind of attribute of the operation, while memory barriers are operations in their own right.
Yes. Acquire semantics tend to apply only to read operations, and release semantics to write ops.
So as I currently understand it, these two concepts are about the same issue, but are neither orthogonal nor can one be used to synthesize the other. I would be glad to be proved wrong.
You're right. I was getting confused.
Another observation: release/acquire semantics is closer to mutex behavior, since no harm is done when an operation from before mutex acquisition is moved inside the critical section. A barrier would not allow such an operation. True?
True.
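The mutex analogy can be made concrete with a minimal spinlock sketch using C++11 atomics (illustrative code, not from the thread): lock() has acquire semantics and unlock() has release semantics, so operations from before lock() may legally drift into the critical section, but nothing may leak out of it.

```cpp
#include <atomic>

class spinlock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    // Acquire: accesses after lock() cannot be moved above it.
    void lock()   { while (locked.test_and_set(std::memory_order_acquire)) {} }
    // Release: accesses before unlock() cannot be moved below it.
    void unlock() { locked.clear(std::memory_order_release); }
};

spinlock guard;
int protected_counter = 0;

void increment() {
    guard.lock();
    ++protected_counter;  // the critical section
    guard.unlock();
}
```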
As I understand it, on x86, the SFENCE instruction is a "Store Fence", which is a "Write Barrier", and has "Release Semantics". Any store instructions which happen before it on this CPU are made globally visible afterwards. No store instructions which occur after it on this CPU are permitted to be globally visible beforehand.
This looks to me as if, in the acquire/release model, it is possible to separate the operation from the "fence". Or, viewed in the other direction, it is possible to optimize by attaching the (otherwise separate) fence to an instruction to save some CPU cycles. True?
The SFENCE instruction is just a store fence. Some other (non-fence) instructions also have fence-like properties, but they tend to be full barriers.
Again on x86, the LFENCE instruction is a "Load Fence", which is a "Read Barrier", and has "Acquire Semantics". Any read instructions which happen before it on this CPU must have already completed afterwards.
Are you really sure about this one?
Yes, apart from the "acquire semantics". The intel spec says: "Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. This serializing operation guarantees that every load instruction that precedes in program order the LFENCE instruction is globally visible before any load instruction that follows the LFENCE instruction is globally visible."
No load instructions which occur after it on this CPU are permitted to be executed beforehand.
This part of the statement makes sense to me.
I omitted the rest of your post, since I think it depends on the acquire/release versus memory barriers getting clarified first.
The details of the memory model, atomics, and visibility, and how it applies to C++, are under discussion amongst C++ standards committee members. I would imagine that you'd be welcome to join such discussions.
Hmm, not sure how I could join other than posting to some lists. Do you mean comp.lang.c++.moderated?
No. There is a cpp-threads mailing list, and the C++ Standards committee reflectors. Peter Dimov told me how to get on cpp-threads. Ask your national body (here in the UK it's BSI, in Germany it's DIN, and in the USA it's ANSI) about joining their C++ panel, and speak to Andy Koenig about getting added to the committee reflectors. If your national body is unhelpful, speak to Lois Goldthwaite (standards@accu.org), and she'll probably let you join the BSI panel.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk

Roland Schwarz wrote:
There are 3: read_mb, write_mb and full_mb.
There are fifteen, in principle. The basic bidirectional ordering constraints are #LoadLoad, #LoadStore, #StoreLoad and #StoreStore, and a barrier can include any subset of them. In practice #StoreLoad usually implies a full barrier.

A load.acquire can be emulated as:

    load
    #LoadLoad | #LoadStore

A store.release can be emulated as:

    #LoadStore | #StoreStore
    store
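Peter's recipe corresponds to the fence-based idiom later standardized in C++11: a relaxed operation combined with a standalone fence supplying the listed constraints. A sketch (names are illustrative, not from the thread):

```cpp
#include <atomic>

std::atomic<int> cell{0};

// load.acquire emulated as: load, then a fence giving #LoadLoad | #LoadStore.
int load_acquire() {
    int v = cell.load(std::memory_order_relaxed);         // the raw load
    std::atomic_thread_fence(std::memory_order_acquire);  // #LoadLoad | #LoadStore
    return v;
}

// store.release emulated as: a fence giving #LoadStore | #StoreStore, then store.
void store_release(int v) {
    std::atomic_thread_fence(std::memory_order_release);  // #LoadStore | #StoreStore
    cell.store(v, std::memory_order_relaxed);             // the raw store
}
```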
full_mb disallows moving any read or write across the barrier, and so establishes total order.
Total order is a systemwide property of the memory model and can't necessarily be achieved with local barriers. It's possible for an SMP system to have full barriers and not deliver total order. As an example, if you have two CPUs execute

    // CPU #1
    op1
    full barrier
    op2

    // CPU #2
    op3
    full barrier
    op4

then it is guaranteed that all CPUs will see op1 before op2 and op3 before op4, but it is not guaranteed that a third CPU and a fourth CPU will see op1, op2, op3 and op4 in the same order. In principle, at least.
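The local guarantee that full barriers do deliver, namely that every observer sees op1 before op2, can be sketched with C++11 seq_cst fences. The "forbidden" observation checked below cannot occur (illustrative code, not from the thread):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> op1{0}, op2{0};

// "CPU #1": op1, full barrier, op2.
void cpu1() {
    op1.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
    op2.store(1, std::memory_order_relaxed);
}

// An observer: if it sees op2 == 1 it must also see op1 == 1,
// because the fences order the two stores for every CPU.
bool saw_forbidden_order() {
    int b = op2.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    int a = op1.load(std::memory_order_relaxed);
    return b == 1 && a == 0;  // would contradict the barrier guarantee
}

int run_order_demo() {
    std::thread t(cpu1);
    bool bad = false;
    for (int i = 0; i < 100000; ++i)
        if (saw_forbidden_order()) bad = true;
    t.join();
    return bad ? 1 : 0;
}
```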

Peter Dimov wrote:
Roland Schwarz wrote:
There are fifteen, in principle. The basic bidirectional ordering constraints are #LoadLoad, #LoadStore, #StoreLoad and #StoreStore, and a barrier can include any subset of them. In practice #StoreLoad usually implies a full barrier.
Hmm, these names are from the SPARC architecture, correct? I don't know enough about this, so I would be glad if you could tell me more. I was speaking (although I didn't make it explicit) about the three barriers used in the Linux kernel. What is the semantics of e.g. #LoadLoad? Does it mean loads may not be moved across the barrier? Or that loads of the same memory location may not be moved across? Since I don't know the semantics, I cannot understand your emulation example; could you please be even more explicit?
A load.acquire can be emulated as
load #LoadLoad | #LoadStore
What does the "|" bar mean? Is this the means to define the subsets? I.e., when # means unordered:

    1) # | #          | #          | #
    2) # | #          | #          | #StoreStore
    3) # | #          | #StoreLoad | #
    4) # | #          | #StoreLoad | #StoreStore
    5) # | #LoadStore | #          | #
    ...

Is this how you arrived at the number 15?
Total order is a systemwide property of the memory model and can't necessarily be achieved with local barriers. It's possible for an SMP system to have full barriers and not deliver total order. As an example, if you have two CPUs execute
<example snipped> Ok, thank you. Obviously I used the wrong name for what I meant, which is that neither operation type, read or write, may be reordered. What would this be called then? Full ordering? This brings up the question: is there an established naming? Can you suggest a book or certain articles I should study to learn about the common aspects of memory visibility on multiprocessor systems? Roland

Roland Schwarz wrote:
Peter Dimov wrote:
Roland Schwarz wrote:
There are fifteen, in principle. The basic bidirectional ordering constraints are #LoadLoad, #LoadStore, #StoreLoad and #StoreStore, and a barrier can include any subset of them. In practice #StoreLoad usually implies a full barrier.
Hmm, these names are from the SPARC architecture, correct? I don't know enough about this, so I would be glad if you could tell me more. I was speaking (although I didn't make it explicit) about the three barriers used in the Linux kernel.
What is the semantics of e.g. #LoadLoad?
You are right, this is Sun SPARC terminology. #LoadLoad means that loads that precede the barrier cannot be reordered with loads that follow the barrier. The four basic primitives can be combined freely, so for example #LoadLoad | #LoadStore means that preceding loads cannot be reordered with subsequent loads or stores.

Peter Dimov wrote:
#LoadLoad means that loads that precede the barrier cannot be reordered with loads that follow the barrier.
I.e. an instruction sequence

    load(A)
    #LoadLoad
    load(B)

cannot be executed by the processor as

    load(B)
    load(A)

but nothing prevents the processor from executing the sequence

    load(A)
    #LoadLoad
    store(B)

as

    store(B)
    load(A)

right? Then

    load(A)
    load(B)
    #LoadLoad | #LoadStore
    store(C)

cannot be reordered as

    load(B)
    store(C)
    load(A)

Yes? But in the acquire/release world

    load(A)
    load(B).acquire
    store(C)

can be reordered as

    load(B)
    store(C)
    load(A)

or am I wrong in this assumption? If I am not wrong, in which respect is your example then an emulation of acquire semantics? If I am wrong, what am I missing?
The four basic primitives can be combined freely, so for example #LoadLoad | #LoadStore means that preceding loads cannot be reordered with subsequent loads or stores.
I see, thank you. Roland

Roland Schwarz wrote:
Peter Dimov wrote:
#LoadLoad means that loads that precede the barrier cannot be reordered with loads that follow the barrier.
I.e. an instruction sequence

    load(A)
    #LoadLoad
    load(B)

cannot be executed by the processor as

    load(B)
    load(A)

but nothing prevents the processor from executing the sequence

    load(A)
    #LoadLoad
    store(B)

as

    store(B)
    load(A)
right?
Right.
Then

    load(A)
    load(B)
    #LoadLoad | #LoadStore
    store(C)

cannot be reordered as

    load(B)
    store(C)
    load(A)
Yes?
Yes.
But in the acquire/release world

    load(A)
    load(B).acquire
    store(C)

can be reordered as

    load(B)
    store(C)
    load(A)
or am I wrong in this assumption?
Right again.
If I am not wrong, in which respect is your example then an emulation of acquire semantics? If I am wrong, what am I missing?
Nothing; it's an emulation, not an exact mapping. It achieves results that are valid executions under an acquire/release model. The emulation is stricter since it doesn't allow all reorderings that are possible in an acquire/release model. But it's a way to implement acquire/release on hardware that provides bidirectional barriers. When the (physical or conceptual) hardware only offers the more limited #LoadLoad (read barrier), #StoreStore (write barrier) and full barrier subset, the acquire/release emulation is even stricter, since it needs to use the full barriers.

Peter Dimov wrote:
Roland Schwarz wrote:
If I am not wrong, in which respect is your example then an emulation of acquire semantics? If I am wrong, what am I missing?
Nothing; it's an emulation, not an exact mapping. It achieves results that are valid executions under an acquire/release model.
I understand.
The emulation is stricter since it doesn't allow all reorderings that are possible in an acquire/release model. But it's a way to implement acquire/release on hardware that provides bidirectional barriers.
I was under the (seemingly wrong) impression that an emulation has to preserve all of the semantics, as opposed to a simulation, which is the term I would have used for your example. But I got the point nevertheless. Thanks. Roland

Roland Schwarz wrote:
2) atomicity better would be addressed by the compiler, given a suitable memory model, than as a library.
3) Despite 2) it would be possible to write a library, but it will be hard to get processor independent semantics. E.g. there is one concept of read/write/full memory barriers or another of acquire/release semantics for SMP.
5) Is it worth the effort to create a library with processor independent semantics, at the price of not being optimal? E.g. by doing away with the various kinds of barriers, instead simply requiring atomicity and full memory barrier semantics for the operation? Which operations, besides load and store would be essential?
The current working paper for the Standard Library contains chapter 29, "Atomic operations library". This chapter currently contains only a forward declaration (hehe) to the draft proposal by Hans Boehm:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html

Its sister paper, "Sequencing and the concurrency memory model", is:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2052.htm

-- Pedro Lamarão

Pedro Lamarão wrote:
This chapter currently contains only a forward declaration (hehe) to the draft proposal by Hans Boehm.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html
I have read it, and tried to implement something along this paper. However I have run into some problems. Can you please give me a hint how and where I could contribute to this discussion or ask specific questions? Roland

Roland Schwarz wrote:
Pedro Lamarão wrote:
This chapter currently contains only a forward declaration (hehe) to the draft proposal by Hans Boehm.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html
I have read it, and tried to implement something along this paper. However I have run into some problems. Can you please give me a hint how and where I could contribute to this discussion or ask specific questions?
Usually the people involved with the Standard gather in the comp.std.c++ newsgroup. You can access it through Google Groups here:

http://groups.google.com/group/comp.std.c++

I don't know if the author of the above papers reads it too, but in the header of both papers there is an email address for contact. My experience is that paper authors really appreciate this kind of feedback.

-- Pedro Lamarão

On Friday 01 December 2006 11:33, Roland Schwarz wrote:
4) Does there exist a canonical set of atomic primitives, from which others can be built?
[1]
E.g. given atomic_test_and_set, atomic_load and atomic_store, is it possible to emulate atomic_compare_and_swap? (Without use of system mutexes, of course.) [2]
The answer to [1] is yes. The answer to [2] is no. The reference work which proved this is M. Herlihy (1991), "Wait-free synchronization", ACM Trans. Program. Lang. Syst. 13 (1), 124-149. <http://www.cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf>

This paper is not easy. One of the things he proves is that there is an 'ordering' (using the 'consensus number') of atomic operations, where an operation with a higher consensus number can be used to implement one with a lower consensus number. Only compare-and-swap and memory-to-memory move have a consensus number of infinity. Atomic read/write has consensus number 1, and atomic fetch-and-add only has 2. (The ordering collapses on uniprocessor systems, though.) See page 126 of the above link.

The main idea is: if your processor does not have a compare-and-swap instruction, it is not very useful for multicore or SMP systems. I believe 'lock-free' algorithms (which use CAS) have been discussed earlier on this list.

Peter
--
Peter Soetens -- FMTC -- <http://www.fmtc.be>
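As an illustration of Herlihy's hierarchy, an operation with consensus number 2 (fetch-and-add) can be built from compare-and-swap, here sketched with C++11's compare_exchange_weak (names are illustrative, not from the thread):

```cpp
#include <atomic>

std::atomic<int> accumulator{0};

// fetch-and-add built from compare-and-swap: retry until no other
// thread modified the target between our read and our CAS.
int fetch_and_add(std::atomic<int>& target, int delta) {
    int observed = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(observed, observed + delta,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
        // on failure, compare_exchange_weak reloads 'observed' for us
    }
    return observed;  // the previous value, like x86 LOCK XADD
}
```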
participants (5):
- Anthony Williams
- Pedro Lamarão
- Peter Dimov
- Peter Soetens
- Roland Schwarz