
Helge Bahmann wrote
Second solution: use 64-bit CAS to load/store 64-bit values on x86. It seems too heavy for just loading/storing, doesn't it?
this is actually what I do in Boost.Atomic; I *think* it is cheaper than shuffling around the values between SSE and general purpose registers (it sure is cheaper than MMX considering you also have to issue emms)
It's easy to test!

Express test: CDS's RecursiveSpinLock<atomic64_t> (load64 is used heavily by the TATAS algorithm for busy-waiting when the CAS that acquires the lock fails).

Equipment: WinXP, Intel Core2 (3GHz, 2 cores, no HT), MSVC++ 2008, release build with full optimization.

SSE2 load64:

    static inline atomic64_t load64( atomic64_t volatile const * pMem )
    {
        // Load the low 64 bits through an SSE2 register
        __m128i volatile v = _mm_loadl_epi64( (__m128i const *) pMem ) ;
        return v.m128i_i64[0] ;
    }

Result (one of several runs, about average):

    Spinlock_MT::recursiveSpinLock64
    Lock test, thread count=8 loop per thread=1000000...
    Duration=2.21852

CAS64 load64 (no CAS loop):

    static inline atomic64_t load64( atomic64_t volatile const * pMem )
    {
        // Compare with 0 and exchange with 0: memory is never changed,
        // and the intrinsic returns the value it found there
        atomic64_t cur = 0 ;
        return _InterlockedCompareExchange64( const_cast<atomic64_t volatile *>(pMem), cur, cur ) ;
    }

Result (one of several runs, about average):

    Spinlock_MT::recursiveSpinLock64
    Lock test, thread count=8 loop per thread=1000000...
    Duration=2.79662

So SSE2 is about 20% faster (2.21852 vs 2.79662). Not so bad, but I expected more :)

Unfortunately, I have no access to a multi-processor Win32 server for testing right now.

Note that Boost.Atomic uses a CAS-based loop for load64, so I think the performance gain there could be even larger.

Regards,
Max
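
P.S. For readers without MSVC, here is a rough portable sketch of the same single-CAS load and of a TATAS-style wait loop. It assumes C++11 std::atomic, which the MSVC++ 2008 setup above does not have. The names load64_cas and TatasLock are made up for illustration and are not taken from CDS or Boost.Atomic, and the lock is a plain non-recursive one, unlike RecursiveSpinLock.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: read a 64-bit value with a single CAS, no retry loop.
// If the stored value equals `expected` (0), the CAS writes 0 back, which
// changes nothing; otherwise the CAS fails and copies the current value
// into `expected`. Either way `expected` holds the value that was read.
inline std::int64_t load64_cas(std::atomic<std::int64_t>& mem) {
    std::int64_t expected = 0;
    mem.compare_exchange_strong(expected, 0);
    return expected;
}

// Hypothetical TATAS (test-and-test-and-set) lock: try the CAS once,
// then spin on a plain load until the lock looks free before retrying.
struct TatasLock {
    std::atomic<std::int64_t> state{0};   // 0 = free, 1 = held

    void lock() {
        for (;;) {
            std::int64_t expected = 0;
            if (state.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire))
                return;                   // lock acquired
            // Busy wait on the load; this is the hot path measured above.
            while (state.load(std::memory_order_relaxed) != 0) { }
        }
    }

    void unlock() { state.store(0, std::memory_order_release); }
};
```

Swapping the state.load() in the wait loop for load64_cas(state) gives the CAS-based variant, so the two can be timed against each other on whatever compiler and hardware are at hand.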