_ReadWriteBarrier as memory fence, or not?

Hi folks,

Quick question about the use of _ReadWriteBarrier (Win32 MSVC):

According to http://msdn.microsoft.com/en-us/library/ms684208(VS.85).aspx , this compiler intrinsic doesn't generate any memory fencing instructions at all, it's strictly a compiler optimization barrier. I was wondering if someone had any documentation to contradict this, or if boost is relying on the new volatile semantics in VC2005 to ensure the fence?

It sure looks like it's being used as if it were a real memory fence, but I am not sure that's the case. I'm not an expert though.

Cheers,
Simon.

Simon Thornington:
Hi folks,
Quick question about the use of _ReadWriteBarrier (Win32 MSVC):
According to http://msdn.microsoft.com/en-us/library/ms684208(VS.85).aspx , this compiler intrinsic doesn't generate any memory fencing instructions at all, it's strictly a compiler optimization barrier. I was wondering if someone had any documentation to contradict this, or if boost is relying on the new volatile semantics in VC2005 to ensure the fence?
It sure looks like it's being used as if it were a real memory fence, but I am not sure that's the case. I'm not an expert though.
What specific _ReadWriteBarrier use do you find suspicious?

thread\win32\interlocked_read.hpp bothers me because I don't believe x86 guarantees ordering of loads mixed with stores, but interlocked_read_acquire to my reading implies no upward migration of loads (hardware reordering). I don't believe _ReadWriteBarrier() helps at all in this case.

asio\detail\indirect_handler_queue.hpp bothers me because on non-MSVC platforms, it's using _GLIBCXX_WRITE_MEM_BARRIER, which is a real memory barrier. The code might be fine (I'm sure I'm not qualified to comment on it), but the fact that one is a real memory barrier and one is not tweaks me.

detail_w32.hpp explicitly declares them all BOOST_COMPILER_FENCE, which I take to mean the author is not assuming a memory fence. Also, I don't see off-hand any situations in which ordering would matter there, given the use of the interlocked primitives.

Cheers,
Simon.

2008/8/21 Peter Dimov <pdimov@pdimov.com>
Simon Thornington:
Hi folks,
Quick question about the use of _ReadWriteBarrier (Win32 MSVC):
According to http://msdn.microsoft.com/en-us/library/ms684208(VS.85).aspx , this compiler intrinsic doesn't generate any memory fencing instructions at all, it's strictly a compiler optimization barrier. I was wondering if someone had any documentation to contradict this, or if boost is relying on the new volatile semantics in VC2005 to ensure the fence?
It sure looks like it's being used as if it were a real memory fence, but I am not sure that's the case. I'm not an expert though.
What specific _ReadWriteBarrier use do you find suspicious?

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

"Simon Thornington" <simon.thornington@gmail.com> writes:
thread\win32\interlocked_read.hpp bothers me because I don't believe x86 guarantees ordering of loads mixed with stores, but interlocked_read_acquire to my reading implies no upward migration of loads (hardware reordering). I don't believe _ReadWriteBarrier() helps at all in this case.
On x86, an aligned integer load (MOV instruction) is an acquire operation. The _ReadWriteBarrier is needed to stop the compiler reordering the instruction.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Ok, thanks. Is it fair to say that these MSVC macros are assuming x86 architecture? Would this code still be safe on IA? (I don't personally have to deal with Itanium, I'm just curious now).

Simon.

On Thu, Aug 21, 2008 at 11:55 PM, Anthony Williams <anthony.ajw@gmail.com> wrote:
"Simon Thornington" <simon.thornington@gmail.com> writes:
thread\win32\interlocked_read.hpp bothers me because I don't believe x86 guarantees ordering of loads mixed with stores, but interlocked_read_acquire to my reading implies no upward migration of loads (hardware reordering). I don't believe _ReadWriteBarrier() helps at all in this case.
On x86, an aligned integer load (MOV instruction) is an acquire operation. The _ReadWriteBarrier is needed to stop the compiler reordering the instruction.
Anthony

"Simon Thornington" <simon.thornington@gmail.com> writes:
Ok, thanks. Is it fair to say that these MSVC macros are assuming x86 architecture? Would this code still be safe on IA? (I don't personally have to deal with Itanium, I'm just curious now).
The Microsoft docs for the latest versions of their compiler say that volatile reads have acquire semantics, so it should be fine. At the assembly language level, you do need a ld.acq instruction though, rather than a simple ld. I haven't checked whether the compiler does this.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Simon Thornington:
thread\win32\interlocked_read.hpp bothers me because I don't believe x86 guarantees ordering of loads mixed with stores, but interlocked_read_acquire to my reading implies no upward migration of loads (hardware reordering). I don't believe _ReadWriteBarrier() helps at all in this case.
x86 allows a store to be reordered with a following load. An acquire read should prevent following operations from being reordered with the read, but there is no problem with a preceding operation being reordered.
asio\detail\indirect_handler_queue.hpp bothers me because on non-MSVC platforms, it's using _GLIBCXX_WRITE_MEM_BARRIER which is a real memory barrier. The code might be fine (I'm sure I'm not qualified to comment on it), but the fact that one is a real memory barrier and one is not tweaks me.
This is the most suspicious example but it only needs to prevent the two stores following the barrier from being reordered with the preceding instructions, and on x86 stores never migrate upwards.
detail_w32.hpp explicitly declares them all BOOST_COMPILER_FENCE which I take to mean the author is not assuming a memory fence.
Indeed. :-)

On Thu, Aug 21, 2008 at 5:23 PM, Simon Thornington <simon.thornington@gmail.com> wrote:
Hi folks,
Quick question about the use of _ReadWriteBarrier (Win32 MSVC):
According to http://msdn.microsoft.com/en-us/library/ms684208(VS.85).aspx , this compiler intrinsic doesn't generate any memory fencing instructions at all, it's strictly a compiler optimization barrier. I was wondering if someone had any documentation to contradict this, or if boost is relying on the new volatile semantics in VC2005 to ensure the fence?
It sure looks like it's being used as if it were a real memory fence, but I am not sure that's the case. I'm not an expert though.
Correct, it only prevents the compiler from reordering - it doesn't emit any fence instruction itself. You must use _mm_lfence/_mm_sfence/_mm_mfence if you want that. It just happens that for many things, fence instructions aren't needed on current x86 hardware.

--
Cory Nelson

Correct, it only prevents the compiler from reordering - it doesn't emit any fence instruction itself. You must use _mm_lfence/_mm_sfence/_mm_mfence if you want that. It just happens that for many things, fence instructions aren't needed on current x86 hardware.
This is correct as far as the take-home message is concerned, but it is not exactly right.

The "Wintel" way of doing shared-memory multiprocessing, as reflected in both the x86 instruction set architecture and the implementation of all but the most exotic x86-based systems, is based on a cache-coherent, write-through memory model. Current x86 hardware realizes this model -- the instruction set architecture -- using a number of tricks at the microarchitecture level. Those details change from one generation to the next but the ISA stays (mostly) the same.

Under the coherent-cache model, changes to memory in one instruction stream are visible to all other instruction streams on the system immediately. There is no need for a fence *instruction*; it could have no effect in this model. But there is still a need to prevent the compiler from reordering the memory operations in an instruction stream.

Now, there are extensions to the instruction set, notably with SSE2+, that provide an escape from the cache-coherent memory model. With those come fence instructions, because you need them once you leave the cache-coherent model.

The following is quoted from http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf

Streaming instructions include the non-temporal stores MOVNTDQ, MOVNTI, MOVNTPS, MOVNTPD, MOVNTSD, MOVNTSS and the MMX instruction MOVNTQ. However, unlike regular stores, non-temporal stores are weakly ordered relative to other loads and stores. If strong ordering of stores is required, an SFENCE instruction should be used between the non-temporal stores and any succeeding normal stores. See Section 11.4, "Memory Barrier Operations" on page 196 for further recommendations on memory barrier instructions.

Streaming instructions can dramatically improve memory-write performance. They write data directly to memory through write-combining buffers, bypassing the cache. This is faster than PREFETCHW because data does not need to be initially read from memory to fill the cache lines, only to be completely overwritten shortly thereafter. The new data is simply written to memory, replacing the old data in memory, so no memory read is performed. One application where streaming is useful, often in conjunction with prefetch instructions, is in copying large blocks of memory.

Note: The streaming instructions are not recommended or necessary for write-combined memory regions, since the processor automatically combines writes for those regions. Write-combine memory types are indicated through the MTRRs and the page-attribute table (PAT).

Note: For best performance, do not mix streaming instructions on a cache line with non-streaming store instructions.
participants (5)
- Anthony Williams
- Cory Nelson
- Peter Dimov
- Simon Thornington
- Stephen Nuchia