Shared pointer atomic assembler for ARM?

Dear All, I note that shared_ptr uses architecture-specific assembler for the atomic operations needed for thread safe operations, on x86, ia64 and ppc; it falls back to pthreads for other architectures. Has anyone quantified the performance benefit of the assembler? Assuming that the benefit is significant, I'd like to implement it for ARM. Has anyone else looked at this? ARM has a swap instruction. I have a (very vague) recollection that perhaps some of the newer chips have some other locked instructions e.g. test-and-set, but I would want to code to the lowest common denominator i.e. swap only. Is this sufficient for what shared_ptr wants? I note that since 4.1, gcc has provided built-in functions for atomic operations. But it says that "Not all operations are supported by all target processors", and the list doesn't include swap; so maybe this isn't so useful after all. Another thought is that gcc now ships with a tr1/shared_ptr which is based on shared_ptr from boost 1.32; maybe it uses the gcc built-in atomics... checks... no, it calls something called __exchange_and_add, the implementation of which in libstdc++ (disassembled) doesn't look atomic to me! Anyway: Is this worthwhile? Has it already been done? Is swap sufficient? Regards, Phil.

Phil Endecott:
Dear All,
I note that shared_ptr uses architecture-specific assembler for the atomic operations needed for thread safe operations, on x86, ia64 and ppc; it falls back to pthreads for other architectures. Has anyone quantified the performance benefit of the assembler?
Assuming that the benefit is significant, I'd like to implement it for ARM. Has anyone else looked at this?
ARM has a swap instruction. I have a (very vague) recollection that perhaps some of the newer chips have some other locked instructions e.g. test-and-set, but I would want to code to the lowest common denominator i.e. swap only. Is this sufficient for what shared_ptr wants?
I note that since 4.1, gcc has provided built-in functions for atomic operations. But it says that "Not all operations are supported by all target processors", and the list doesn't include swap; so maybe this isn't so useful after all.
Can you try the SVN trunk version of shared_ptr and look at the assembly? detail/sp_counted_base.hpp should choose sp_counted_base_sync.hpp for g++ 4.1 and higher and take advantage of the built-ins. To answer your question: no, a mere swap instruction is not enough for shared_ptr, it needs atomic increment, decrement and compare and swap.
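For concreteness, the three operations Peter lists map directly onto the GCC >= 4.1 builtins on targets that provide them. A minimal sketch; the `__sync_*` builtins are real GCC, but the wrapper names are illustrative, not taken from Boost:

```cpp
#include <cassert>

// Illustrative wrappers (names are mine); the __sync_* builtins are the
// real GCC >= 4.1 interface that sp_counted_base_sync.hpp can build on.
inline long atomic_increment( long * p )
{
    return __sync_add_and_fetch( p, 1 );   // atomically ++*p, return new value
}

inline long atomic_decrement( long * p )
{
    return __sync_add_and_fetch( p, -1 );  // atomically --*p, return new value
}

inline bool atomic_cas( long * p, long expected, long desired )
{
    // atomically: if( *p == expected ) { *p = desired; return true; } return false;
    return __sync_bool_compare_and_swap( p, expected, desired );
}
```

On targets without hardware support for an operation, gcc emits a call to an external function of the same name instead, which is exactly the `undefined reference` failure mode discussed later in the thread.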

Hi Peter, Peter Dimov wrote:
Phil Endecott:
I note that shared_ptr uses architecture-specific assembler for the atomic operations needed for thread safe operations, on x86, ia64 and ppc; it falls back to pthreads for other architectures. Has anyone quantified the performance benefit of the assembler?
Assuming that the benefit is significant, I'd like to implement it for ARM. Has anyone else looked at this?
ARM has a swap instruction. I have a (very vague) recollection that perhaps some of the newer chips have some other locked instructions e.g. test-and-set, but I would want to code to the lowest common denominator i.e. swap only. Is this sufficient for what shared_ptr wants?
I note that since 4.1, gcc has provided built-in functions for atomic operations. But it says that "Not all operations are supported by all target processors", and the list doesn't include swap; so maybe this isn't so useful after all.
Can you try the SVN trunk version of shared_ptr and look at the assembly? detail/sp_counted_base.hpp should choose sp_counted_base_sync.hpp for g++ 4.1 and higher and take advantage of the built-ins.
Well it's quicker for me to try this:

    int x;
    int main(int argc, char* argv[]) { __sync_fetch_and_add(&x,1); }

    $ arm-linux-gnu-g++ --version
    arm-linux-gnu-g++ (GCC) 4.1.2 20061028 (prerelease) (Debian 4.1.1-19)
    $ arm-linux-gnu-g++ -W -Wall check_sync_builtin.cc
    check_sync_builtin.cc:3: warning: unused parameter ‘argc’
    check_sync_builtin.cc:3: warning: unused parameter ‘argv’
    /tmp/ccwWxfsT.o: In function `main':
    check_sync_builtin.cc:(.text+0x20): undefined reference to `__sync_fetch_and_add_4'
    collect2: ld returned 1 exit status

(It does compile on x86, and the disassembly includes a "lock addl" instruction.)

As I mentioned before, gcc doesn't implement these atomic builtins on all platforms, i.e. it doesn't implement them on platforms where the hardware doesn't provide them. I don't fully understand how this all works in libstdc++ (there are too many levels of #include and #if for me to follow) but there seems to be a __gnu_cxx::__mutex that they can use in those cases.
To answer your question: no, a mere swap instruction is not enough for shared_ptr, it needs atomic increment, decrement and compare and swap.
Well, I think you can implement a spin-lock mutex with swap:

    int mutex = 0; // 0 = unlocked, 1 = locked

    void lock()
    {
        int n;
        do {
            n = 1;
            swap(mutex, n);  // atomic swap instruction
        } while (n == 1);    // if n is 1 after the swap, the mutex was already locked
    }

    void unlock()
    {
        mutex = 0;
    }

So you could use something like that to protect the reference counts, rather than falling back to the pthread method. Or alternatively, could you use a sentinel value (say -1) in the reference count to indicate that it's locked:

    int refcount;

    int read_refcount()
    {
        int r;
        do {
            r = refcount;
        } while (r == -1);
        return r;
    }

    void adj_refcount(int adj)
    {
        int r = -1;
        do {
            swap(refcount, r);  // atomic swap instruction
        } while (r == -1);
        refcount = r + adj;
    }

(BTW, for gcc >= 4.1 on x86, would you plan to use the gcc builtins or the existing Boost asm?)

Regards, Phil.

Phil Endecott wrote:
Peter Dimov wrote:
Phil Endecott:
I note that shared_ptr uses architecture-specific assembler for the atomic operations needed for thread safe operations, on x86, ia64 and ppc; it falls back to pthreads for other architectures. Has anyone quantified the performance benefit of the assembler?
(I'm still curious about this.)
Assuming that the benefit is significant, I'd like to implement it for ARM. Has anyone else looked at this?
ARM has a swap instruction. I have a (very vague) recollection that perhaps some of the newer chips have some other locked instructions e.g. test-and-set, but I would want to code to the lowest common denominator i.e. swap only. Is this sufficient for what shared_ptr wants?
To answer your question: no, a mere swap instruction is not enough for shared_ptr, it needs atomic increment, decrement and compare and swap.
Well, I think you can implement a spin-lock mutex with swap:

    int mutex = 0; // 0 = unlocked, 1 = locked

    void lock()
    {
        int n;
        do {
            n = 1;
            swap(mutex, n);  // atomic swap instruction
        } while (n == 1);    // if n is 1 after the swap, the mutex was already locked
    }

    void unlock()
    {
        mutex = 0;
    }

So you could use something like that to protect the reference counts, rather than falling back to the pthread method. Or alternatively, could you use a sentinel value (say -1) in the reference count to indicate that it's locked:

    int refcount;

    int read_refcount()
    {
        int r;
        do {
            r = refcount;
        } while (r == -1);
        return r;
    }

    void adj_refcount(int adj)
    {
        int r = -1;
        do {
            swap(refcount, r);  // atomic swap instruction
        } while (r == -1);
        refcount = r + adj;
    }
Right, I have had a go at implementing this. The following code seems to generate plausible assembler but has not been tested. Maybe someone has a multi-threaded shared pointer test program that does zillions of shared pointer accesses to see if they ever mess up? But I would prefer to think that this sort of code can be "proved correct" by the "many eyes" method! So let me know if you can see any faults. In particular, I have never used weak pointers, so I don't really know what that stuff is doing. This code can also be found at: https://svn.chezphil.org/boost/trunk/include/boost/detail/

Regards, Phil.

    #ifndef BOOST_DETAIL_SP_COUNTED_BASE_GCC_ARM_HPP_INCLUDED
    #define BOOST_DETAIL_SP_COUNTED_BASE_GCC_ARM_HPP_INCLUDED

    //
    //  detail/sp_counted_base_gcc_arm.hpp
    //
    //  Copyright (c) 2001, 2002, 2003 Peter Dimov and Multi Media Ltd.
    //  Copyright 2004-2005 Peter Dimov
    //  Copyright 2007 Phil Endecott
    //
    //  Distributed under the Boost Software License, Version 1.0. (See
    //  accompanying file LICENSE_1_0.txt or copy at
    //  http://www.boost.org/LICENSE_1_0.txt)
    //
    //  ARM version using swap by Phil Endecott, based on sp_counted_base_nt.hpp
    //

    #include <typeinfo>

    namespace boost
    {
    namespace detail
    {

    // All ARM processors since ARM3 (circa 1989) have an atomic swap instruction,
    // but apart from a few very recent chips (the v6 architecture, not to be
    // confused with the ARM6) this is the only atomic operation that they support.
    //
    // Implementing shared pointers using only swap requires a different approach
    // from that used on other platforms: we store a sentinel value, -1, in the
    // count variables to indicate that they are locked. If an attempt to read
    // a count gets -1, we spin until a non-negative value is returned.
    //
    // The following bit of assembler wraps the swap instruction. It takes two
    // register operands and one memory operand, i.e.
    //   swp r1, r2, [r3]
    // is equivalent to
    //   ldr r1, [r3]
    //   str r2, [r3]

    static inline long atomic_read_and_lock(long& mem)
    {
        long val;
        do {
            // Equivalent to:
            //   val = mem;
            //   mem = -1;
            asm volatile ("swp\t%0, %1, [%2]"
                          : "=&r" (val)
                          : "r" (-1), "r" (&mem)
                          : "memory");
        } while (val == -1);
        return val;
    }

    static inline long atomic_inc(long& mem, long inc)
    {
        return mem = atomic_read_and_lock(mem) + inc;
    }

    class sp_counted_base
    {
    private:

        sp_counted_base( sp_counted_base const & );
        sp_counted_base & operator= ( sp_counted_base const & );

        long use_count_;   // #shared
        long weak_count_;  // #weak + (#shared != 0)

    public:

        sp_counted_base(): use_count_( 1 ), weak_count_( 1 )
        {
        }

        virtual ~sp_counted_base() // nothrow
        {
        }

        // dispose() is called when use_count_ drops to zero, to release
        // the resources managed by *this.
        virtual void dispose() = 0; // nothrow

        // destroy() is called when weak_count_ drops to zero.
        virtual void destroy() // nothrow
        {
            delete this;
        }

        virtual void * get_deleter( std::type_info const & ti ) = 0;

        void add_ref_copy()
        {
            atomic_inc( use_count_, +1 );
        }

        bool add_ref_lock() // true on success
        {
            long c = atomic_read_and_lock( use_count_ );
            if( c == 0 )
            {
                use_count_ = 0;
                return false;
            }
            use_count_ = c + 1;
            return true;
        }

        void release() // nothrow
        {
            if( atomic_inc( use_count_, -1 ) == 0 )
            {
                dispose();
                weak_release();
            }
        }

        void weak_add_ref() // nothrow
        {
            atomic_inc( weak_count_, +1 );
        }

        void weak_release() // nothrow
        {
            if( atomic_inc( weak_count_, -1 ) == 0 )
            {
                destroy();
            }
        }

        long use_count() const // nothrow
        {
            while (1)
            {
                long c = use_count_;
                if (c >= 0) return c;
            }
        }
    };

    } // namespace detail
    } // namespace boost

    #endif  // #ifndef BOOST_DETAIL_SP_COUNTED_BASE_GCC_ARM_HPP_INCLUDED
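A sketch of the kind of stress test Phil asks for. It uses std::shared_ptr as a stand-in for boost::shared_ptr (both rely on atomic count updates), so it exercises the same failure mode without needing the ARM header above; the thread and iteration counts are arbitrary:

```cpp
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Each thread repeatedly copies and drops the same shared pointer; if the
// count updates were not atomic, the final use_count would be wrong (or the
// object would be disposed twice and the program would crash).
inline long stress_shared_ptr( int nthreads, int iterations )
{
    std::shared_ptr<int> p( new int( 42 ) );

    std::vector<std::thread> threads;
    for( int t = 0; t < nthreads; ++t )
    {
        threads.emplace_back( [p, iterations]()   // each thread holds one copy
        {
            for( int i = 0; i < iterations; ++i )
            {
                std::shared_ptr<int> q( p );      // atomic increment
            }                                     // atomic decrement on scope exit
        } );
    }
    for( std::thread & th : threads ) th.join();

    return p.use_count();  // all copies gone: should be exactly 1 again
}
```

Scaling `iterations` up to the "zillions" Phil mentions turns this into a soak test; with a broken atomic the assertion fails quickly under contention.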

I don't like the spin-lock approach for the atomic variables.

If the mutex is already locked by another thread, spinning just wastes CPU cycles on a single-processor machine until the time-slice for the current thread comes to an end. That's usually a 10-millisecond waste of CPU in the worst-case scenario. It might have some benefit on a multi-processor machine, but only if the thread that currently owns the mutex lock has not already been pre-empted. I don't see much benefit of the spin-lock approach over a system mutex, since the system mutex at least allows other threads to continue running while the mutex is being held.

A hybrid of a spin-lock and mutex sounds better to me than a simple spin-lock. What I mean is that if the spin-lock does not terminate its loop in short order, the lock loop is going to last for a very long time (possibly up to 10 milliseconds). If the loop does not exit quickly, then yield the processor to some other thread (possibly even the thread that's holding the lock in the first place).

There must be another 'better' method than the one Phil describes. I don't have any good answers right now myself, but I hope that my concern was understood.

-Sid

Hello, 2007/9/16, Sid Sacek <ssacek@appsecinc.com>:
I don't see much benefit of the spin-lock approach over a system mutex, since the system mutex at least allows other threads to continue running while the mutex is being held.
In our company we use spin locks instead of system mutexes, because we usually have close to one thousand objects that are used by different threads. Creating such a large number of system mutexes is often not supported.
A hybrid of a spin-lock and mutex sounds better to me than a simple spin-lock. What I mean is that if the spin-lock does not terminate its loop in short order, that means the lock loop is going last for a very long time ( possibly up to 10-milliseconds. ) If the loop does not exit quickly, then yield the processor to some other thread, (possibly even the thread that's holding the lock in the first place.)
Yes, we used that approach, and on systems like MIPS with LL/SC we also loop a few times before giving up. According to the documentation, many false failures of the conditional store seem to be possible; hence trying a few times seemed reasonable. Giving up is usually done by sleeping. Doesn't ARM support LL/SC-like operations? Regards, Andreas Pokorny
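The "spin a bounded number of times, then yield" hybrid Andreas describes can be sketched with GCC's compare-and-swap builtin standing in for MIPS LL/SC. The names and the retry count below are illustrative, not from Boost:

```cpp
#include <cassert>
#include <sched.h>

static long hybrid_lock_word = 0;  // 0 = unlocked, 1 = locked

inline void hybrid_lock()
{
    for( ;; )
    {
        // Bounded spin: retry a few times before giving up the time slice,
        // which also absorbs spurious LL/SC-style failures.
        for( int i = 0; i < 16; ++i )
        {
            if( __sync_bool_compare_and_swap( &hybrid_lock_word, 0, 1 ) )
                return;  // acquired
        }
        sched_yield();  // let the current holder (or anyone else) run
    }
}

inline void hybrid_unlock()
{
    __sync_lock_release( &hybrid_lock_word );  // store 0 with release semantics
}
```

On a single processor the inner loop almost always succeeds on the first pass when the lock is free, and the `sched_yield()` bounds the waste when it is not.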

Sid Sacek wrote:
I don't like the spin-lock approach for the atomic variables.
If the mutex is already locked by another thread, spinning just wastes CPU cycles on a single-processor machine until the time-slice for the current thread comes to an end. That's usually a 10-millisecond waste of CPU in the worst case scenario.
Hi Sid,

Yes, but because the variable is locked for just a handful of instructions during the atomic increment/decrement, the probability of being pre-empted is small. So spinning is rare, and the average slowdown is small. It is more important to optimise the common case where the variable is not locked.

In particular, spinning is much better than using pthreads, which is what boost::shared_ptr does at present on ARM Linux; the function call overhead alone will dwarf the time spent spinning. It also requires that each shared pointer is larger, since it must also store the mutex.

To quantify this, I've timed a simple single-threaded loop that does 100 million shared pointer increments and decrements. With pthreads it takes 180 seconds, while with my spinning implementation it takes 22 seconds. That's about an additional microsecond per atomic operation. Comparing with the 10 ms that is wasted by spinning when an atomic operation is interrupted, the balance point is when one in every 10 000 atomic operations is interrupted by a thread that will access the same variable. On my hardware, the window during which the variable stores the sentinel value is about 15 ns, so with scheduling at 100 times per second the probability of interrupting an atomic operation is one in 666 667. That's a much lower probability than one in 10 000, so the spinning solution is better. [Do let me know if you can see any flaws in my logic.]

An alternative is to simply yield when the variable is found to be locked:

    #include <sched.h>

    static inline long atomic_read_and_lock(long& mem)
    {
        long val;
        do {
            // Equivalent to:
            //   val = mem;
            //   mem = -1;
            asm volatile ("swp\t%0, %1, [%2]"
                          : "=&r" (val)
                          : "r" (-1), "r" (&mem)
                          : "memory");
            if (val != -1) {
                break;
            }
            sched_yield();
        } while (1);
        return val;
    }

In the typical case this executes no more instructions than the spin code, and my test program runs at the same speed. But it does increase the code size, which matters to me on my embedded system with not much cache. It is also OS-specific.

The benefits of each approach will depend on the workload, i.e. how often shared pointers are actually used by more than one thread, and I suggest benchmarking your application with each of the alternatives.

Regards, Phil.
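Phil's break-even arithmetic can be restated mechanically. The constants below are his stated assumptions from the post (10 ms lost per pre-emption, roughly 1 µs extra per pthread-based operation, a 15 ns locked window, 100 Hz scheduling), not new measurements:

```cpp
#include <cassert>
#include <cmath>

// Spinning loses if a pre-emption (costing ~10 ms) lands inside the locked
// window more often than once per (10 ms / 1 us) = 10 000 operations.
inline double spin_break_even()   // ops per interruption at which spinning breaks even
{
    const double spin_penalty = 10e-3;  // seconds lost if pre-empted while locked
    const double extra_cost   = 1e-6;   // extra seconds per pthread-based atomic op
    return spin_penalty / extra_cost;   // 10 000
}

inline double interruption_odds()  // expected ops between interruptions
{
    const double slice  = 10e-3;   // time slice at 100 Hz scheduling
    const double window = 15e-9;   // time the sentinel value is visible
    return slice / window;         // ~666 667
}
```

Since the expected interval between interruptions (~666 667 ops) is far larger than the break-even interval (10 000 ops), spinning wins under these assumptions.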

I don't like the spin-lock approach for the atomic variables.
If the mutex is already locked by another thread, spinning just wastes CPU cycles on a single-processor machine until the time-slice for the current thread comes to an end. That's usually a 10-millisecond waste of CPU in the worst-case scenario.

Hi Phil,

I had a chance to think about this issue with spin-locks a bunch more.

Okay, while I agree with you that the possibility of a thread being pre-empted in the middle of the assembly lock routine is small, there is still the possibility that it will happen, and depending on the application, it might happen noticeably too often.

Knowing that, the programmer should be given the option whether to use (or not use) the spin-locking mechanism in their code. I'm certain that most applications will be fine using the locks in this manner, but there will be cases where the possibility of wasting CPU cycles will be unacceptable. This will all depend on the application and how often locking is performed in that program.

I know that I personally would want the option of being able to turn that feature on and off at my discretion, and I'm sure that others would too. If you make it so that ARM coders are forced to always use the spin-locks, they may then be forced into not using boost::shared_ptr and coming up with their own implementations. If I had a say in this matter, I would request that the underlying locking mechanism be configurable.

Regards,
-Sid Sacek

Sid Sacek wrote:
Hi Phil,
I had a chance to think about this issue with spin-locks a bunch more.
Okay, while I agree with you that the possibility of a thread being pre-empted in the middle of the assembly lock-routine is small, there is still that possibility that it will happen, and depending on the application, it might happen noticeably too often.
Some numbers would be nice :-)
Knowing that, the programmer should be given the option whether to use ( or not use ) the spin-locking mechanism in their code. I'm certain that most applications will be fine using the locks in this manner, but there will be cases where the possibility of wasting CPU cycles will be unacceptable. This will all depend on the application and how often locking is performed in that program.
I tend to disagree concerning the atomics used for shared pointers, but in the more general case of mutex implementations I do agree that offering the user two or more implementations would be useful. I wonder how this would fit in with the recently-proposed std::threads work? This needs investigation.
I know that I personally would want the option of being able to turn that feature on and off at my discretion, and I'm sure that others would too.
Note that sp_counted_base.hpp has some preprocessor stuff to select an implementation, and I believe that you can influence this with suitable #defines; so if you wanted to use pthreads mutexes (i.e. to get the current behaviour) then you could get that. The only proviso is that all code compiled together would need to use the same (or compatible) implementations.
If you make it so that ARM coders are forced to always use the spin-locks,
Oh come off it. I'm not in a position to "force" anyone to do anything. The chances of the code that I posted ever being included in Boost are tiny - just look at how few replies there have been to this thread. And Boost doesn't have any ARM-based testing systems and ARM isn't even a supported platform, so there would be justified objections to ever including the code. The only people who will ever use this stuff are people who try the default Boost shared pointers, find that they are very slow, find my implementation [either the original version with pure spin-locks or the one I posted in the last message that yields] and _choose_ to use it.
they may then be forced into not using boost::shared_ptr and coming up with their own implementations.
What does the code that I posted in my last email not do that you need?
If I had a say in this matter, I would request that the underlying locking mechanism be configurable.
You do have a say, Boost is a democracy. Thanks for your interest. Regards, Phil.

Phil Endecott:
Well it's quicker for me to try this:
int x;
int main(int argc, char* argv[]) { __sync_fetch_and_add(&x,1); }
$ arm-linux-gnu-g++ --version
arm-linux-gnu-g++ (GCC) 4.1.2 20061028 (prerelease) (Debian 4.1.1-19)
$ arm-linux-gnu-g++ -W -Wall check_sync_builtin.cc
check_sync_builtin.cc:3: warning: unused parameter 'argc'
check_sync_builtin.cc:3: warning: unused parameter 'argv'
/tmp/ccwWxfsT.o: In function `main':
check_sync_builtin.cc:(.text+0x20): undefined reference to `__sync_fetch_and_add_4'
Do __sync_lock_test_and_set and __sync_lock_release work?

Hi Peter, Peter Dimov wrote:
Do __sync_lock_test_and_set and __sync_lock_release work?
__sync_lock_test_and_set doesn't work. __sync_lock_release compiles, but I think it's a no-op.

My initial thought was, "Why is Peter asking me this? I have already said that it only has swap, so it's obviously not going to support test-and-set!". But I was sensible and checked the gcc docs before posting, and they say this:

    type __sync_lock_test_and_set (type *ptr, type value, ...)

    This builtin, as described by Intel, is not a traditional test-and-set
    operation, but rather an atomic exchange operation. It writes value into
    *ptr, and returns the previous contents of *ptr.

So "test and set" is actually an atomic swap, i.e. exactly what ARM has, but misnamed. Aarggh. If it had the right name I would have noticed it months ago. (I note that gcc blames Intel for the misnaming.) Yet gcc still doesn't seem to implement it. I will go and ask on their mailing list.

Regards, Phil.
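Given that reading of the docs, the swap-based spin lock from earlier in the thread can be expressed portably with the builtin on targets that do implement it. A sketch; the wrapper names are illustrative, only the builtins are real GCC:

```cpp
#include <cassert>

// Phil's swap-based spin lock, re-expressed with __sync_lock_test_and_set,
// which is an atomic exchange despite its name.
static int swap_mutex = 0;  // 0 = unlocked, 1 = locked

inline void swap_lock()
{
    // Atomically exchange 1 into the mutex; if the previous value was
    // already 1, another thread holds the lock, so keep swapping.
    while( __sync_lock_test_and_set( &swap_mutex, 1 ) == 1 )
    {
        // spin
    }
}

inline void swap_unlock()
{
    __sync_lock_release( &swap_mutex );  // store 0 with release semantics
}
```

On x86 this compiles to a `lock xchg` loop; on ARM it would map to `swp` if gcc implemented the builtin there, which is exactly the gap being discussed.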

Phil Endecott wrote:
Peter Dimov wrote:
Do __sync_lock_test_and_set and __sync_lock_release work?
__sync_lock_test_and_set doesn't work.
So "test and set" is actually an atomic swap, i.e. exactly what ARM has
Yet gcc still doesn't seem to implement it. I will go and ask on their mailing list.
Here is what I have learnt from gcc-help:

- You can test for the presence of atomic-add and related builtins using the __GCC_HAVE_SYNC_COMPARE_AND_SWAP_n predefined macros, so this could be used to enable the sp_counted_base implementation that Peter has done based on them. Except that this macro is only defined in relatively recent versions of the compiler, i.e. there are compilers that have the builtins but don't advertise them with the macro.

- There is no ARM builtin for swap; I've filed a gcc bug suggesting it (and a corresponding macro) as an enhancement.

- The code that I presented, which uses a sentinel value to lock the count and spins if it finds it already locked, will have problems on a system where a high-priority thread must explicitly yield to let lower-priority threads run. This isn't the case on any system that I (currently) care about, but it did make me wonder whether Boost has a policy on supported threading models, and whether any of the other Boost atomicity/threading code would suffer the same issue.

Regards, Phil.
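The macro-based detection in the first point can be sketched as follows. The `__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4` macro is real GCC; the dispatch structure and function name are illustrative of how an sp_counted_base-style header could choose an implementation, with a pthread mutex as the fallback (the current Boost behaviour):

```cpp
#include <cassert>
#include <pthread.h>

#if defined( __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 )

// Builtin is available: use the lock-free path.
inline long checked_atomic_increment( long * p )
{
    return __sync_add_and_fetch( p, 1 );
}

#else

// Fallback for targets (or older compilers) without the builtins:
// protect the counter with a pthread mutex.
static pthread_mutex_t checked_inc_mutex = PTHREAD_MUTEX_INITIALIZER;

inline long checked_atomic_increment( long * p )
{
    pthread_mutex_lock( &checked_inc_mutex );
    long r = ++*p;
    pthread_mutex_unlock( &checked_inc_mutex );
    return r;
}

#endif
```

As Phil notes, the catch is that some compilers have the builtins but predate the macro, so a header doing this kind of dispatch may also need version checks.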
participants (5)
- Andreas Pokorny
- Gottlob Frege
- Peter Dimov
- Phil Endecott
- Sid Sacek