
At 11:11 AM +0200 9/9/07, Corrado Zoccolo wrote:
When performance tuning a really tight loop in a simple program, I found that reading the value of an atomic count is unexpectedly slow on a gcc platform.
[...]
Is there a compelling reason to use the locked operation with gcc, or can a simpler volatile access serve the same purpose?
[...]
Do you see any drawback in changing the access to the counter to a simple volatile access, at least when the platform is known to be IA32?
Don't do that. It won't work properly on a multi-processor system. Memory barriers are needed to ensure correct operation on such systems, and gcc (x86) does not generate a memory barrier for a volatile load.
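To make the ordering requirement concrete, here is a minimal sketch of a two-thread hand-off. It assumes C++11 std::atomic and std::thread, which post-date this message, and every name in it (payload, ready, producer, consumer, run_handoff) is hypothetical. The release store paired with the acquire load is what guarantees the reader sees the payload written before the flag; a plain or volatile flag gives no such guarantee under the language rules.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names throughout; an illustration, not code from any library.
int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready(false);   // flag used to publish the payload

void producer() {
    payload = 42;                                  // 1. write the data
    ready.store(true, std::memory_order_release);  // 2. publish it (release)
}

int consumer() {
    // Spin until the flag is observed with acquire semantics. Once it is,
    // every write made before the matching release store is visible here.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one hand-off between two threads and returns what the reader saw.
int run_handoff() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    return seen;
}
```

Calling run_handoff() starts both threads and returns the value the reader observed after the flag became true.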
[... quoting existing implementation for gcc ...]
    operator long() const { return __exchange_and_add(&value_, 0); }
The use of __exchange_and_add here is a way to perform a load-acquire operation. It is a somewhat clumsy way, presumably necessary in the absence of a more direct, and possibly better-performing, mechanism. The "acquire" qualifier indicates the kind of memory barrier needed.
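For comparison, here is a sketch of the two ways of reading the counter, written with C++11 std::atomic. That facility did not exist when this was written, so treat it as an assumption about what a "more direct mechanism" could look like. The class atomic_count_sketch and its member names are hypothetical and are not the library's actual implementation.

```cpp
#include <atomic>

// Hypothetical stand-in for the counter under discussion.
class atomic_count_sketch {
public:
    explicit atomic_count_sketch(long v) : value_(v) {}

    // Atomic increment and decrement, each returning the new value.
    long operator++() { return value_.fetch_add(1, std::memory_order_acq_rel) + 1; }
    long operator--() { return value_.fetch_sub(1, std::memory_order_acq_rel) - 1; }

    // Analogue of __exchange_and_add(&value_, 0): a read-modify-write that
    // adds zero and returns the value it found.
    long load_via_rmw() { return value_.fetch_add(0, std::memory_order_acq_rel); }

    // The direct form: a load with acquire ordering and no write at all.
    long load_acquire() const { return value_.load(std::memory_order_acquire); }

private:
    std::atomic<long> value_;
};
```

Both readers return the same value. The difference is that load_acquire() states the intended ordering directly and does not write to the counter; whether it is actually cheaper depends on the compiler and the target.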
I checked other implementations, and for example Solaris has the much lighter:
    operator uint32_t() const { return static_cast<uint32_t const volatile &>( value_ ); }
Because the (current) standard does not address threads at all, different implementations have associated different semantics with "volatile" in the presence of threads. I expect that *on Solaris* one would find a memory barrier generated for this code sequence.