
Correct, it only prevents the compiler from reordering - it doesn't emit any fence instruction itself. You must use _mm_lfence/_mm_sfence/_mm_mfence if you want that. It just happens that for many things, fence instructions aren't needed on current x86 hardware.
This is correct as far as the take-home message is concerned, but it is not exactly right.

The "Wintel" way of doing shared-memory multiprocessing, as reflected in both the x86 instruction set architecture and the implementation of all but the most exotic x86-based systems, is based on a cache-coherent, write-through memory model. Current x86 hardware realizes this model (that is, the instruction set architecture) using a number of tricks at the microarchitecture level. Those details change from one generation to the next, but the ISA stays (mostly) the same.

Under the coherent-cache model, changes to memory made by one instruction stream are visible to all other instruction streams on the system immediately. There is no need for a fence *instruction*; it could have no effect in this model. But there is still a need to prevent the compiler from reordering the memory operations in an instruction stream.

Now, there are extensions to the instruction set, notably SSE2 and later, that provide an escape from the cache-coherent memory model. With those come fence instructions, because you need them once you leave the cache-coherent model.

The following is quoted from http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf:

Streaming instructions include the non-temporal stores MOVNTDQ, MOVNTI, MOVNTPS, MOVNTPD, MOVNTSD, MOVNTSS and the MMX instruction MOVNTQ.

However, unlike regular stores, non-temporal stores are weakly ordered relative to other loads and stores. If strong ordering of stores is required, an SFENCE instruction should be used between the non-temporal stores and any succeeding normal stores. See Section 11.4, "Memory Barrier Operations" on page 196 for further recommendations on memory barrier instructions.

Streaming instructions can dramatically improve memory-write performance. They write data directly to memory through write-combining buffers, bypassing the cache.
This is faster than PREFETCHW because data does not need to be initially read from memory to fill the cache lines, only to be completely overwritten shortly thereafter. The new data is simply written to memory, replacing the old data in memory, so no memory read is performed. One application where streaming is useful, often in conjunction with prefetch instructions, is in copying large blocks of memory.

Note: The streaming instructions are not recommended or necessary for write-combined memory regions since the processor automatically combines writes for those regions. Write-combine memory types are indicated through the MTRRs and the page-attribute table (PAT).

Note: For best performance, do not mix streaming instructions on a cache line with non-streaming store instructions.
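
To make the distinction concrete, here is a minimal sketch in C of the case the AMD text describes: a run of non-temporal stores followed by an ordinary store that tells another thread the data is ready. The helper name stream_fill_and_publish, the ready flag, and the COMPILER_BARRIER macro are my own illustrations and appear in neither the post above nor the AMD document. The macro uses GCC/Clang inline-asm syntax, which is an assumption about the toolchain; other compilers spell a compiler-only barrier differently.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Compiler-only barrier (GCC/Clang syntax, an assumption): stops the
   compiler from moving memory accesses across it, emits no instruction. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Hypothetical helper: fill `blocks` 16-byte blocks at `dst` with the byte
   `value` using non-temporal stores, then set `*ready` with a normal store. */
static void stream_fill_and_publish(void *dst, size_t blocks, int value,
                                    volatile int *ready)
{
    /* _mm_stream_si128 requires a 16-byte aligned destination. */
    assert(((uintptr_t)dst & 15u) == 0);

    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;

    for (size_t i = 0; i < blocks; ++i)
        _mm_stream_si128(p + i, v);  /* MOVNTDQ: weakly ordered store */

    /* SFENCE: a real instruction. It orders the streaming stores above
       before the normal store below, as the AMD text recommends. */
    _mm_sfence();

    /* Added explicitly so the sketch does not depend on whether the
       intrinsic also constrains the compiler on a given toolchain. */
    COMPILER_BARRIER();

    *ready = 1;  /* ordinary store that publishes the data */
}
```

If the loop used ordinary stores instead of _mm_stream_si128, the _mm_sfence() call would be the part you could drop for this store-then-store pattern, and the compiler barrier would be the part you keep. That is the split the original poster was getting at.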