[atomic_count] Discrepancy between gcc and solaris implementation

Hi boosters,

When performance tuning a really tight loop in a simple program, I found that reading the value of an atomic count is unexpectedly slow on a gcc platform. I checked the code (I have a checkout of boost HEAD from a few months ago, when CVS was still in place), and the implementation used is:

    operator long() const
    {
        return __exchange_and_add(&value_, 0);
    }

I checked other implementations, and for example solaris has the much lighter:

    operator uint32_t() const
    {
        return static_cast<uint32_t const volatile &>( value_ );
    }

Is there a compelling reason to use the locked operation with gcc, or can a simpler volatile access serve the same purpose? A simple test program shows a large overhead for the locked operation:

    #include <boost/timer.hpp>
    #include <boost/detail/atomic_count.hpp>
    #include <iostream>

    namespace cz {

    using __gnu_cxx::__atomic_add;
    using __gnu_cxx::__exchange_and_add;

    class atomic_count
    {
    public:
        explicit atomic_count(long v) : value_(v) {}

        void operator++()
        {
            __atomic_add(&value_, 1);
        }

        long operator--()
        {
            return __exchange_and_add(&value_, -1) - 1;
        }

        operator long() const
        {
            return static_cast<_Atomic_word volatile const &>(value_);
        }

    private:
        atomic_count(atomic_count const &);
        atomic_count & operator=(atomic_count const &);

        mutable _Atomic_word value_;
    };

    } // namespace cz

    int main(int argc, char **argv)
    {
        {
            boost::timer t;
            for(boost::detail::atomic_count i(0); i < 10000000; ++i);
            std::cout << t.elapsed() << std::endl;
        }
        {
            boost::timer t;
            for(cz::atomic_count i(0); i < 10000000; ++i);
            std::cout << t.elapsed() << std::endl;
        }
    }

Compiled with g++ -O3, and run on a Linux box equipped with a Pentium IV 2.8GHz (HT disabled), I obtain the following result:

    [corrado@et2 test]$ ./a.out
    1.4
    0.75

i.e. the "for" loop that uses the volatile access is almost twice as fast as the one that uses the locked operation.

Do you see any drawback in changing the access to the counter to a simple volatile access, at least when the platform is known to be IA32?

Corrado

--
__________________________________________________________________________
dott. Corrado Zoccolo                           mailto:zoccolo@di.unipi.it
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

At 11:11 AM +0200 9/9/07, Corrado Zoccolo wrote:
when performance tuning a really tight loop in a simple program, I found that reading the value of an atomic count is unexpectedly slow on a gcc platform.
[...]
Is there a compelling reason to use the locked operation with gcc, or a simpler volatile access can serve the same purpose?
[...]
Do you see any drawback in changing the access to the counter to a simple volatile access, at least when the platform is known to be an IA32?
Don't do that. It won't work properly on a multi-processor system. Memory barriers are needed to ensure correct operation on such systems, and gcc (x86) does not generate a memory barrier for a volatile load.
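To make the concern concrete, here is a minimal sketch of the publish-then-read pattern that breaks without acquire/release ordering. It is written with the C++11 std::atomic interface (which postdates this thread) rather than the Boost code under discussion, and the names payload, ready, producer and consumer are hypothetical. The point is that an acquire load orders the subsequent data read; a plain volatile load promises nothing of the sort to either the compiler or a multi-processor machine.

    // Minimal illustration, assuming C++11 <atomic>; hypothetical names throughout.
    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;            // ordinary data published by one thread
    std::atomic<int> ready{0};  // flag playing the role of the shared counter/flag

    void producer()
    {
        payload = 42;                               // write the data first
        ready.store(1, std::memory_order_release);  // then publish; release pairs with acquire
    }

    void consumer()
    {
        // The acquire load keeps the payload read ordered after the flag read.
        // A plain volatile load gives no such inter-thread ordering guarantee.
        while (ready.load(std::memory_order_acquire) == 0) { /* spin */ }
        assert(payload == 42);  // holds only because of the acquire/release pairing
    }

    int main()
    {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }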
[... quoting existing implementation for gcc ...]

    operator long() const
    {
        return __exchange_and_add(&value_, 0);
    }
The use of __exchange_and_add here is a way to perform a load-acquire operation (a somewhat clumsy way, presumably necessary in the absence of a more direct (and possibly better performing) mechanism). The "acquire" qualifier indicates the kind of memory barrier needed.
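For what a "more direct mechanism" could look like: below is a minimal sketch using GCC's __atomic builtins, which only appeared in GCC 4.7 and so were not available when this thread was written. The class name cz_atomic_count and the particular ordering choices are illustrative assumptions, not the Boost implementation.

    // Hypothetical counter using GCC __atomic builtins (GCC >= 4.7).
    class cz_atomic_count
    {
    public:
        explicit cz_atomic_count(long v) : value_(v) {}

        void operator++()
        {
            // Atomic increment; the result is unused, so relaxed ordering suffices.
            __atomic_add_fetch(&value_, 1, __ATOMIC_RELAXED);
        }

        long operator--()
        {
            // The decrement that may trigger destruction needs acquire-release ordering.
            return __atomic_sub_fetch(&value_, 1, __ATOMIC_ACQ_REL);
        }

        operator long() const
        {
            // A direct load-acquire: keeps the ordering that __exchange_and_add(&value_, 0)
            // implied, without paying for a locked read-modify-write.
            return __atomic_load_n(&value_, __ATOMIC_ACQUIRE);
        }

    private:
        mutable long value_;
    };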
I checked other implementations, and for example solaris has the much lighter:

    operator uint32_t() const
    {
        return static_cast<uint32_t const volatile &>( value_ );
    }
Because the (current) standard does not address threads and such at all, different implementations have associated different semantics with "volatile" in the presence of threads. I expect that *on solaris* one would find a memory barrier generated for this code sequence.
participants (2):
- Corrado Zoccolo
- Kim Barrett