
Anyhow, I think this should be straightforward. I didn't look at it in detail, but it looks like you only need a specialization for 8-byte types that uses _InterlockedCompareExchange64 for almost everything. Efficient loads and stores are a bit tricky, since SSE2 is not a requirement for 32-bit Windows. Without it, I think we need to resort to FILD/FISTP, which is a pain.
IIRC, SSE2 intrinsics are not guaranteed to be atomic, so sometimes memory access has to be emulated via CAS.
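To make the "emulate everything via CAS" idea concrete, here is a rough sketch of 64-bit load, store and fetch-add built on a single compare-and-swap primitive. The names cas64, load64, store64 and fetch_add64 are hypothetical helpers invented for illustration, not part of any existing library. The sketch assumes _InterlockedCompareExchange64 on MSVC and the __sync_val_compare_and_swap builtin on GCC-compatible compilers, and it assumes the target is 8-byte aligned.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hypothetical wrapper around the platform's 64-bit CAS.
// Returns the value that was stored at *dest before the operation.
inline std::int64_t cas64(volatile std::int64_t* dest,
                          std::int64_t expected,
                          std::int64_t desired) {
#if defined(_MSC_VER)
    // Note the argument order: destination, new value, comparand.
    return _InterlockedCompareExchange64(
        reinterpret_cast<volatile __int64*>(dest), desired, expected);
#else
    return __sync_val_compare_and_swap(dest, expected, desired);
#endif
}

// Atomic load emulated via CAS: if *dest is 0 we write 0 back (no change),
// otherwise the CAS fails; either way we get the current value.
inline std::int64_t load64(volatile std::int64_t* dest) {
    return cas64(dest, 0, 0);
}

// Atomic store emulated via CAS: retry until our guess of the old value
// matches what is actually in memory.
inline void store64(volatile std::int64_t* dest, std::int64_t desired) {
    std::int64_t observed = load64(dest);
    for (;;) {
        const std::int64_t previous = cas64(dest, observed, desired);
        if (previous == observed) return;
        observed = previous;
    }
}

// Fetch-and-add built from the same loop; returns the old value.
inline std::int64_t fetch_add64(volatile std::int64_t* dest,
                                std::int64_t delta) {
    std::int64_t observed = load64(dest);
    for (;;) {
        const std::int64_t previous = cas64(dest, observed, observed + delta);
        if (previous == observed) return previous;
        observed = previous;
    }
}
```

The downside is that even a plain load becomes a locked read-modify-write, which is why a cheaper load/store path would still be worth having where it is safe.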
Well, I hope that at least atomic<> gets it right. Non-aligned accesses are not guaranteed to be atomic. In this case, I guess the atomic<> implementation ends up using a uint64_t. In VC++, __alignof(uint64_t) == 8, but that doesn't mean that automatic variables of type uint64_t are necessarily aligned. In our case I would assume that memory obtained from the OS is at least 8- or 16-byte aligned, so the alignment of class types should be fine.
This is the reason why atomic<>::is_lock_free() is not a static member function.
E.g., consider this on x86 with an older GCC:

    void foo() { atomic<tagged_node_ptr<X> > x; }

Is x aligned here? I don't recall the ABI, but I believe it doesn't guarantee anything beyond 4-byte alignment for ESP on entry. So to align x properly in the stack frame, the stack must be dynamically aligned (or some interprocedural optimization may help) -- but I don't think older GCCs do that.
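Whether a particular object ended up suitably aligned can at least be checked at run time. The following is a minimal sketch; is_aligned and is_suitable_for_cas64 are hypothetical helper names, and the 8-byte requirement is an assumption based on the 64-bit CAS discussed in this thread.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: assumes a 64-bit CAS needs 8-byte alignment
// for the access to be atomic.
inline bool is_suitable_for_cas64(const void* p) {
    return is_aligned(p, 8);
}
```

A per-object is_lock_free() could consult a check like this. The same test applies to any address handed to placement new, for example an address 4 bytes into an 8-byte-aligned buffer, which would fail it.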
Dynamic memory allocation makes it even worse, because you can use placement new to put the data structure at virtually any memory location :/

tim