
Double-checked with Intel's Software Developer's Manual, Volume 3A, Section 8.1.1: there is no way to access 128-bit words atomically. AFAIK, the only way to implement them correctly is to use cmpxchg16b ...
That's the answer we got from Intel & AMD architects a year ago. Unlike 8-byte aligned memory accesses on x86, 16-byte memory accesses on x64 carry no atomicity guarantee (except for cmpxchg16b), though I believe all implementations at that time did perform them atomically.

A few notes, after a one-minute look over one file only:

Some of the functions seem pretty weird. E.g. why is a 128-bit fetch_add implemented in terms of a packed add? Don't the individual components of the vector just wrap around on overflow without carrying over into the next component?

And __declspec(align) takes a number of bytes, not bits.

BTW, VC 9 had a fairly random set of supported intrinsics for various locked operations on x86/x64. I added a couple more in VC10 to make this a bit more consistent across bit widths -- to support my implementation of <atomic>.

Also, alignment is not really alignment in x86 VC. Sadly, VC++ for x86 has no strong stack alignment. By default, the only guarantee is that ESP is 4-byte aligned on function entry. Hence, locals with stronger alignment requirements need dynamic stack alignment (interprocedural optimizations can sometimes elide that setup). Here doubles and long longs are treated differently from types with explicit alignment requirements (i.e. __declspec(align)); only the latter are really guaranteed to produce aligned addresses when declared as locals. I'm not sure what aligned_storage and friends do, but just because __alignof reports some value does not mean that locals will be properly aligned.

-hg
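
To make the fetch_add point above concrete, here is a minimal sketch in portable C++ that models a packed add on two 64-bit lanes next to a real 128-bit add. The names U128, packed_add, wide_add and check_carry are made up for illustration and are not taken from the file under review. The packed version only mimics lane-wise wrap-around (as an SSE2 PADDQ would do); it does not use any intrinsics and is not atomic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 128-bit value stored as two 64-bit halves.
// alignas takes a number of BYTES, just like __declspec(align(16)) in VC++.
struct alignas(16) U128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

static_assert(alignof(U128) == 16, "alignment is 16 bytes, not 16 bits");

// Models a packed 2 x 64-bit add: each lane wraps around independently,
// so a carry out of the low lane is silently dropped.
inline U128 packed_add(U128 a, U128 b) {
    return U128{a.lo + b.lo, a.hi + b.hi};
}

// A true 128-bit add: the carry out of the low half is fed into the high half.
inline U128 wide_add(U128 a, U128 b) {
    const std::uint64_t lo = a.lo + b.lo;
    const std::uint64_t carry = (lo < a.lo) ? 1u : 0u;  // unsigned wrap means carry
    return U128{lo, a.hi + b.hi + carry};
}

// Self-check: add 1 to 2^64 - 1 and compare the two results.
inline void check_carry() {
    const U128 a{UINT64_MAX, 0};
    const U128 one{1, 0};
    const U128 p = packed_add(a, one);
    const U128 w = wide_add(a, one);
    assert(p.lo == 0 && p.hi == 0);  // packed: carry lost, result is 0
    assert(w.lo == 0 && w.hi == 1);  // wide: carry kept, result is 2^64
}
```

So a packed add only matches a 128-bit fetch_add as long as the low half never overflows. A correct atomic version would presumably have to run something like wide_add inside a cmpxchg16b retry loop, which this sketch deliberately leaves out. Note also that the static_assert only checks what the type reports; as said above, that alone does not guarantee a local on the x86 stack actually ends up 16-byte aligned.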