Re: [boost] Boost.Threads, N2178, N2184, et al

23 Mar 2007

Anthony Williams wrote:
...
"Peter Dimov" <pdimov@gmail.com> writes:
...
On x86 all loads already have acquire semantics by default, and all stores
have release semantics.
On Itanium, sure.

<quote source=Intel Itanium Architecture Software Developer's Manual>

6.3.4 Memory Ordering Interactions

IA-32 instructions are mapped into the Itanium memory ordering model as
follows:

- All IA-32 stores have release semantics

- All IA-32 loads have acquire semantics

- All IA-32 read-modify-write or lock instructions have release and
  acquire semantics (fully fenced).

</quote>
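
To make the acquire/release vocabulary of the quoted mapping concrete, here is a minimal sketch written against C++11 std::atomic. Two assumptions to flag: std::atomic and std::memory_order were standardized after this thread was written, and the names run_message_passing, payload, ready and seen are hypothetical, chosen only for illustration. The release store publishes payload, and the acquire load that observes the flag is then guaranteed to see the published value.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one producer publishes a value, one consumer reads it.
int run_message_passing() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // flag that guards the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                   // plain store
        ready.store(true, std::memory_order_release);   // release store
    });

    std::thread consumer([&] {
        // Acquire load: spin until the release store becomes visible.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        // Ordered after the producer's write to payload.
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

On x86, compilers typically implement such an acquire load and release store as plain MOV instructions, with no extra fence, which is the practical content of the "all loads are acquire, all stores are release" reading discussed here.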
...
Not according to the Intel specs. 25366818.pdf (IA-32 Software Developer's
Manual, Volume 3A), section 7.7.2:
The thing is that native x86 doesn't have an officially defined memory
model (the Itanium mapping may well be stronger than native x86).

Note that what you quote below was written for testers with scopes on
the "system bus".
...
"1. Reads can be carried out speculatively and in any order."
However...

http://www.well.com/~aleks/CompOsPlan9/0005.html 

<quote author=an architect at Intel> 

The PPro does speculative and out-of-order loads.  However, 
it has a mechanism called the "memory order buffer" to ensure 
that the above memory ordering model is not violated.  Load 
and store instructions do not get retired until the processor 
can prove there are no memory ordering violations in the actual 
order of execution that was used.  Stores do not get sent to 
memory until they are ready to be retired.  If the processor 
detects a memory ordering violation, it discards all unretired 
operations (including the offending memory operation) and 
restarts execution at the oldest unretired instruction. 

</quote> 

Consider also:

Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Two techniques to 
enhance the performance of memory consistency models. In Proceedings of 
the 1991 International Conference on Parallel Processing (Vol. I 
Architecture), pages I-355-364, August 1991.

<quote> 

The speculative-load buffer provides the detection mechanism by signaling 
when the speculated result is incorrect. The buffer works as follows. 
Loads that are retired from the reservation station are put into the 
buffer in addition to being issued to the memory system. There are four 
fields per entry (as shown in Figure 4): load address, acq, done, and 
store tag. The load address field holds the physical address for the load. 
The acq field is set if the load is considered an acquire access. For SC, 
all loads are treated as acquires. The done field is set when the load is 
performed. If the consistency constraints require the load to be delayed 
for a previous store, the store tag uniquely identifies that store. A 
null store tag specifies that the load depends on no previous stores. 
When a store completes, its corresponding tag in the speculative-load 
buffer is nullified if present. Entries are retired in a FIFO manner. Two 
conditions need to be satisfied before an entry at the head of the buffer 
is retired. First, the store tag field should equal null. Second, the 
done field should be set if the acq field is set. Therefore, for SC, an 
entry remains in the buffer until all previous load and store accesses 
complete and the load access it refers to completes. Appendix A describes 
how an atomic read-modify-write can be incorporated in the above 
implementation. 

We now describe the detection mechanism. The following coherence 
transactions are monitored by the speculative-load buffer: invalidations
(or ownership requests), updates, and replacements. The load addresses
in the buffer are associatively checked for a match with the address of
such transactions. Multiple matches are possible. We assume the match
closest to the head of the buffer is reported. A match in the buffer for 
an address that is being invalidated or updated signals the possibility 
of an incorrect speculation. A match for an address that is being 
replaced signifies that future coherence transactions for that address 
will not be sent to the processor. In either case, the speculated value 
for the load is assumed to be incorrect.

Guaranteeing the constraints for release consistency can be done in a
similar way to SC. The
conventional way to provide RC is to delay a release access until its 
previous accesses complete and to delay accesses following an acquire 
until the acquire completes. Let us first consider delays for stores. 
The mechanism that provides precise interrupts by holding back store 
accesses in the store buffer is sufficient for guaranteeing that stores 
are delayed for the previous acquire. Although the mechanism described 
is stricter than what RC requires, the conservative implementation is 
required for providing precise interrupts. The same mechanism also 
guarantees that a release (which is simply a special store access) is 
delayed for previous load accesses. To guarantee a release is also 
delayed for previous store accesses, the store buffer delays the issue 
of the release operation until all previously issued stores are 
complete. In contrast to SC, however, ordinary stores are issued in a 
pipelined manner. 

</quote> 
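
The buffer described in the quote above can be modeled in a few lines of C++. This is only a sketch of the quoted text, not the paper's hardware design: the type and function names (Entry, SpeculativeLoadBuffer, snoop, kNullTag and so on) are hypothetical, store tags are modeled as plain integers, and what the processor does after a reported match (reissue or squash) is left out. It shows the four fields per entry, the two FIFO retirement conditions, and the associative address check that reports the match closest to the head.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

const int kNullTag = -1;  // hypothetical encoding of the "null" store tag

// One buffer entry with the four fields named in the quoted text.
struct Entry {
    std::uint64_t load_address;  // physical address of the load
    bool acq;                    // load is treated as an acquire (always true for SC)
    bool done;                   // load has been performed
    int store_tag;               // store the load must wait for, or kNullTag
};

class SpeculativeLoadBuffer {
public:
    void push(const Entry& e) { entries_.push_back(e); }

    // A store completed: nullify its tag wherever it is present.
    void store_completed(int tag) {
        for (std::size_t i = 0; i < entries_.size(); ++i)
            if (entries_[i].store_tag == tag) entries_[i].store_tag = kNullTag;
    }

    // The load held at position index (0 = head) has been performed.
    void load_performed(std::size_t index) {
        assert(index < entries_.size());
        entries_[index].done = true;
    }

    // Retire entries from the head in FIFO order while both conditions hold:
    // the store tag is null, and done is set if acq is set.
    int retire() {
        int retired = 0;
        while (!entries_.empty()) {
            const Entry& head = entries_.front();
            if (head.store_tag != kNullTag) break;
            if (head.acq && !head.done) break;
            entries_.pop_front();
            ++retired;
        }
        return retired;
    }

    // Coherence transaction (invalidation, update or replacement) for addr:
    // return the position of the match closest to the head, or -1 if none.
    int snoop(std::uint64_t addr) const {
        for (std::size_t i = 0; i < entries_.size(); ++i)
            if (entries_[i].load_address == addr) return static_cast<int>(i);
        return -1;
    }

private:
    std::deque<Entry> entries_;
};
```

A non-negative result from snoop corresponds to the case where the speculated value for that load is assumed to be incorrect; a load whose entry has already retired can no longer be matched.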

and, also somewhat related:

http://www.cs.wisc.edu/~cain/pubs/micro01_correct_vp.pdf 

regards,
alexander.

Alexander Terekhov