Help needed for shared_ptr issue on iPad2 (dual core ARM)

Hi everyone,
Ticket #5372:
https://svn.boost.org/trac/boost/ticket/5372
says that shared_ptr's ARM spinlock implementation (which uses the swp instruction) doesn't work properly on iPad2 (which has a dual core ARM processor). The sample program in the ticket compares it to a loop using __sync_fetch_and_add, which means that the __sync intrinsics are implemented by the compiler the submitter is using. These didn't work on gcc for ARM when we tested them, but may have been added in the meantime. (I can see some code samples that test for 4.4, but the official docs state that ARM intrinsics are only supported on Linux before 4.6, which was released yesterday.)
So, we have two questions: first, why does the swp-based spinlock fail, and second, how can we detect support for the __sync intrinsics and use them?
Anybody with ARM knowledge and iPad2 development access?
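For readers without the ticket open, the kind of comparison it describes can be sketched roughly as below. This is not the ticket's program: `tiny_spinlock`, `locked_increment` and `atomic_increment` are hypothetical names, and the lock here is built on the GCC `__sync_lock_test_and_set` builtin rather than on swp, so it only illustrates the shape of the test.

```cpp
// Sketch only: one counter bumped under a spinlock, another bumped with
// __sync_fetch_and_add. Assumes a compiler that provides the GCC __sync builtins.

struct tiny_spinlock            // hypothetical name, not Boost's spinlock
{
    volatile int v_;            // 0 = unlocked, 1 = locked

    tiny_spinlock() : v_(0) {}

    bool try_lock()
    {
        // Atomically store 1 and return the previous value (acquire barrier).
        return __sync_lock_test_and_set(&v_, 1) == 0;
    }

    void lock()
    {
        while (!try_lock()) { /* spin until the lock is free */ }
    }

    void unlock()
    {
        __sync_lock_release(&v_);   // store 0 with release semantics
    }
};

// Increment guarded by the spinlock.
inline void locked_increment(tiny_spinlock& lock, long& counter)
{
    lock.lock();
    ++counter;
    lock.unlock();
}

// Increment using the intrinsic directly; returns the previous value.
inline long atomic_increment(volatile long* counter)
{
    return __sync_fetch_and_add(counter, 1);
}
```

Calling both functions the same number of times from two threads and comparing the totals is the idea: if the lock is sound, the two counters agree.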

On 3/28/2011 7:03 AM, Peter Dimov wrote:
Anybody with ARM knowledge
Only a bare minimum.
and iPad2 development access?
Yes. Although testing anything on iOS is a PITA.
--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org (msn) - grafik/redshift-software.com
-- 102708583/icq - grafikrobot/aim,yahoo,skype,efnet,gmail

Rene Rivera wrote:
and iPad2 development access?
Yes. Although testing anything on iOS is a PITA.
Well... maybe you can tell me what's the official compiler for iOS these days and, if it's still gcc, which version and whether it defines __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4?

On 3/28/2011 10:59 AM, Peter Dimov wrote:
Rene Rivera wrote:
and iPad2 development access?
Yes. Although testing anything on iOS is a PITA.
Well... maybe you can tell me what's the official compiler for iOS these days and, if it's still gcc, which version and whether it defines __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4?
For the not-latest Xcode, i.e. Xcode3 series, there's GCC 4.2, LLVM GCC 4.2, and LLVM clang 1.7 as possible compilers. Yes, they are all supported. Like I said.. a PITA. I'll get back to you on the defines shortly.

Hi Peter,
Peter Dimov wrote:
Anybody with ARM knowledge and iPad2 development access?
First let me say that the "right way" to fix this is surely to get Boost.Atomic finished and to use that as the basis of shared_ptr. I've contributed ARM code for Boost.Atomic that knows about the different architecture versions and will use ldrex/strex on ARMv7 (though it needs some attention from someone who knows more than I do about memory barriers, and it has had very little testing). I also have a trivial sp_counted_base_atomic.hpp that uses it. These are in use in a number of iPad apps and I've not yet had any reports of problems on the iPad 2 (fingers crossed).
It seems that perhaps Helge doesn't have enough free time to finish this off - in that case, I think it's a sufficiently important library that we should perhaps consider how we can help to progress it. I could certainly contribute a modest amount of time and testing resource to it.
In the meantime, my understanding is that SWP is "deprecated" in ARMv7 - except that it is a peculiarly strong kind of deprecation where you have to turn on a bit in a control register to enable it. I have asked Apple what they do with this bit on the iPad 2 (where of course the lockdown means individual apps cannot change it) and I await an answer.
One other issue is that even when enabled, SWP might not have the required memory barrier semantics on multi-processor systems, i.e. you might need to put explicit barrier instructions either side of it. I'm uncertain about this; it doesn't help that the ARMv7 architecture documents are still only available under NDA. Anyone here have copies?
I don't yet have an iPad 2, but will eventually; I do have another dual-core ARM box with an Nvidia Tegra 2 chip, but I'm not sure if anything useful can be learnt from testing on it.
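To make the "atomic-based sp_counted_base" idea mentioned earlier in this message a little more concrete, here is a heavily simplified sketch. It is not the sp_counted_base_atomic.hpp referred to above: it assumes `std::atomic` as a stand-in for Boost.Atomic, tracks only a use count (the real class also manages a weak count), and `counted_base_sketch` is a hypothetical name.

```cpp
#include <atomic>

// Simplified reference-count holder built on an atomic integer.
// Assumption: std::atomic stands in for Boost.Atomic here.
class counted_base_sketch
{
    std::atomic<long> use_count_;

public:
    counted_base_sketch() : use_count_(1) {}

    // Taking another reference needs atomicity but no ordering.
    void add_ref()
    {
        use_count_.fetch_add(1, std::memory_order_relaxed);
    }

    // Returns true when the caller has just dropped the last reference
    // and is therefore responsible for destroying the managed object.
    bool release()
    {
        // acq_rel: publish our writes to the object before the decrement,
        // and see everyone else's writes if we turn out to be the last owner.
        return use_count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    long use_count() const
    {
        return use_count_.load(std::memory_order_acquire);
    }
};
```

The point of this design is that no spinlock is involved at all; whether the atomic maps to ldrex/strex or something else is left to the atomics library.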
how can we detect support for __sync intrinsics and use them.
I believe that there is a macro something like __GCC_HAVE_SYNC_COMPARE_AND_SWAP__. The difficulty is that it was introduced well after the actual intrinsics were added, so there are gcc versions that do have the intrinsics but not the macro. Last time I checked, this was too much of an issue to ignore. Maybe things have moved on enough that this macro could now be used. Regards, Phil.
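A minimal sketch of how such a check could be arranged, given the problem described above (some compilers have the intrinsics but not the macro). GCC's documented macro names carry an operand-size suffix, e.g. `__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4` for 4-byte operands. `EXAMPLE_HAS_SYNC_CAS_4`, `EXAMPLE_ASSUME_SYNC_INTRINSICS` and `example_cas` are hypothetical names introduced only for this illustration.

```cpp
// Decide whether a 4-byte __sync compare-and-swap may be used.
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
    // The compiler advertises the intrinsic for 4-byte operands.
#   define EXAMPLE_HAS_SYNC_CAS_4 1
#elif defined(EXAMPLE_ASSUME_SYNC_INTRINSICS)
    // Hypothetical manual override for compilers that have the intrinsic
    // but predate the macro; the user vouches for it on the command line.
#   define EXAMPLE_HAS_SYNC_CAS_4 1
#else
#   define EXAMPLE_HAS_SYNC_CAS_4 0
#endif

// Compare-and-swap on an int (assumed to be 4 bytes on the targets discussed).
inline bool example_cas(volatile int* p, int expected, int desired)
{
#if EXAMPLE_HAS_SYNC_CAS_4
    return __sync_bool_compare_and_swap(p, expected, desired);
#else
    // NOT atomic: placeholder marking where a lock-based fallback would go.
    if (*p != expected) return false;
    *p = desired;
    return true;
#endif
}
```

The manual override is one possible answer to the "intrinsics without macro" gap; a version check would be another, but it needs per-target knowledge that this sketch does not claim to have.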

Phil Endecott wrote:
First let me say that the "right way" to fix this is surely to get Boost.Atomic finished and to use that as the basis of shared_ptr.
Maybe, but until it is, we still need to fix things somehow.
I've contributed ARM code for Boost.Atomic that knows about the different architecture versions and will use ldrex/strex on ARMv7 (though it needs some attention from someone who knows more than I do about memory barriers, and it has had very little testing).
Where can I see this code? How do you detect the ARMv7 architecture to enable ldrex/strex use?
In the meantime, my understanding is that SWP is "deprecated" in ARMv7 - except that it is a peculiarly strong kind of deprecation where you have to turn on a bit in a control register to enable it.
Yes, it's deprecated, but it still 'should' work, or perhaps fail, but judging by the bug report in the ticket, it seems that it works but is non-atomic.
One other issue is that even when enabled, SWP might not have the required memory barrier semantics on the multi-processor systems, i.e. you might need to put explicit barrier instructions either side of it.
As far as I know, the official semantics of SWP on newer architectures are that it works but is inefficient because it locks the bus. Who knows what it does on the Apple CPU though.
how can we detect support for __sync intrinsics and use them.
I believe that there is a macro something like __GCC_HAVE_SYNC_COMPARE_AND_SWAP__.
There should be __GCC_HAVE_COMPARE_AND_SWAP_4, but I'm not sure if it's actually defined by whatever compiler people use for iOS development. :-)

Peter Dimov wrote:
Phil Endecott wrote:
I've contributed ARM code for Boost.Atomic that knows about the different architecture versions and will use ldrex/strex on ARMv7 (though it needs some attention from someone who knows more than I do about memory barriers, and it has had very little testing).
Where can I see this code?
It's in Helge's repository - http://git.chaoticmind.net/cgi-bin/cgit.cgi/boost.atomic/tree/boost/atomic/d...
How do you detect the ARMv7 architecture to enable ldrex/strex use?
The compiler defines a macro. Unfortunately it appears not to have a simple "v6" or "v7" but instead something more detailed, and I'm not at all convinced that this detects all of the variants:
// This list of ARM architecture versions comes from Apple's arm/arch.h header.
// I don't know how complete it is.
#elif defined(__GNUC__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) \
    || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) \
    || defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_7A__))
#include <boost/atomic/detail/gcc-armv6+.hpp>
how can we detect support for __sync intrinsics and use them.
I believe that there is a macro something like __GCC_HAVE_SYNC_COMPARE_AND_SWAP__.
There should be __GCC_HAVE_COMPARE_AND_SWAP_4, but I'm not sure if it's actually defined by whatever compiler people use for iOS development.
I'm using Apple's version of gcc 4.2, and it is NOT defined there - but the intrinsic DOES exist:
#ifdef __GCC_HAVE_COMPARE_AND_SWAP_4
#warning "__GCC_HAVE_COMPARE_AND_SWAP_4 defined"
#else
#warning "__GCC_HAVE_COMPARE_AND_SWAP_4 NOT defined"
int a, b, c;
__sync_bool_compare_and_swap(&a,b,c);
#endif
That outputs the "NOT defined" warning, but compiles and links OK.
(From your other reply:)
I'm not quite sure what exactly DSB and ISB can be used for - maybe one of those could be enough.
I think they are used when you've changed the page tables, and similar things, and aren't appropriate here. What might be appropriate are some of the options to the DMB instruction e.g. "DMB #ST"; I think they might be new though. (I'm out of my depth here - I know just enough about memory barriers to know that I don't understand them...) Phil.

Phil Endecott wrote:
It's in Helge's repository - http://git.chaoticmind.net/cgi-bin/cgit.cgi/boost.atomic/tree/boost/atomic/d...
Thanks. Your barriers look fine. I see that you also switch to ARM mode automatically if in __thumb__. I wonder whether there's some easier way to do that, for example by tagging the appropriate functions as ARM by some magic GCC __attribute__.
How do you detect the ARMv7 architecture to enable ldrex/strex use?
The compiler defines a macro. Unfortunately it appears not to have a simple "v6" or "v7" but instead something more detailed, and I'm not at all convinced that this detects all of the variants:
// This list of ARM architecture versions comes from Apple's arm/arch.h header.
// I don't know how complete it is.
#elif defined(__GNUC__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) \
    || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) \
    || defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_7A__))
Thanks for that, too. It helps to have a starting point for googling. It looks like v6 is:
__ARM_ARCH_6__
__ARM_ARCH_6J__
__ARM_ARCH_6K__
__ARM_ARCH_6Z__
__ARM_ARCH_6ZK__
__ARM_ARCH_6T2__
and v7 is:
__ARM_ARCH_7__
__ARM_ARCH_7A__
__ARM_ARCH_7R__
__ARM_ARCH_7M__
and maybe __ARM_ARCH_7EM__
but it looks like the only one seen in practice is indeed __ARM_ARCH_7A__. They don't make that easy.
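Folding the two lists above into a single number might look roughly like this. It is only a sketch: `EXAMPLE_ARM_ARCH` and `example_arm_arch` are hypothetical names, and the macro lists are exactly the ones given above, with no claim that they are complete.

```cpp
// Collapse the per-variant macros into one value: 7, 6, or 0 (unknown / not ARM).
#if defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7R__) \
    || defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
#   define EXAMPLE_ARM_ARCH 7
#elif defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) \
    || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__)
#   define EXAMPLE_ARM_ARCH 6
#else
#   define EXAMPLE_ARM_ARCH 0
#endif

// Run-time view of the compile-time decision, handy for logging or tests.
inline int example_arm_arch()
{
    return EXAMPLE_ARM_ARCH;
}
```

Code that wants ldrex/strex could then test `EXAMPLE_ARM_ARCH >= 6` in one place instead of repeating the whole list.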
There should be __GCC_HAVE_COMPARE_AND_SWAP_4, but I'm not sure if it's actually defined by whatever compiler people use for iOS development.
I'm using Apple's version of gcc 4.2, and it is NOT defined there - but the intrinsic DOES exist:
#ifdef __GCC_HAVE_COMPARE_AND_SWAP_4
#warning "__GCC_HAVE_COMPARE_AND_SWAP_4 defined"
#else
#warning "__GCC_HAVE_COMPARE_AND_SWAP_4 NOT defined"
int a, b, c;
__sync_bool_compare_and_swap(&a,b,c);
#endif
That outputs the "NOT defined" warning, but compiles and links OK.
Yeah, thought so. The intrinsics _probably_ work in ARMv6 and ARMv7 mode, but we can't be sure. I assume you tried that in ARMv6 mode?
I'm not quite sure what exactly DSB and ISB can be used for - maybe one of those could be enough.
I think they are used when you've changed the page tables, and similar things, and aren't appropriate here. What might be appropriate are some of the options to the DMB instruction e.g. "DMB #ST"; I think they might be new though.
I think that it's better to stay away from these.

Phil Endecott wrote:
One other issue is that even when enabled, SWP might not have the required memory barrier semantics on the multi-processor systems, i.e. you might need to put explicit barrier instructions either side of it.
Yes, you may be right about this. It could be the lack of a memory barrier (DMB) that causes the spinlock to not work properly. (I'm not quite sure what exactly DSB and ISB can be used for - maybe one of those could be enough.)
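If the missing barrier is indeed the cause, one possible placement of DMB around the swap is sketched below. This is an untested guess, not the Boost spinlock: all `example_*` names are hypothetical, the ARM path is only taken for `__ARM_ARCH_7A__` in ARM (non-Thumb) mode, and other targets fall back to the GCC `__sync` builtins so the sketch still compiles there. Whether this is sufficient on the iPad2 is exactly the open question.

```cpp
// Full memory barrier: "dmb" on ARMv7-A, the GCC builtin elsewhere.
inline void example_barrier()
{
#if defined(__ARM_ARCH_7A__)
    __asm__ __volatile__("dmb" : : : "memory");
#else
    __sync_synchronize();
#endif
}

// Atomically store 1 into *v and return the previous value.
inline int example_raw_swap(volatile int* v)
{
#if defined(__ARM_ARCH_7A__) && !defined(__thumb__)
    int r;
    __asm__ __volatile__(
        "swp %0, %1, [%2]"
        : "=&r"(r)              // output: old value
        : "r"(1), "r"(v)        // inputs: new value, address
        : "memory", "cc");
    return r;
#else
    return __sync_lock_test_and_set(v, 1);
#endif
}

inline bool example_try_lock(volatile int* v)
{
    int previous = example_raw_swap(v);
    example_barrier();          // keep the critical section after the swap
    return previous == 0;
}

inline void example_unlock(volatile int* v)
{
    example_barrier();          // finish the critical section before releasing
    *v = 0;
}
```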

On 28/03/2011 14:03, Peter Dimov wrote:
Anybody with ARM knowledge and iPad2 development access?
It appears SWP does not work across multiple cores because it doesn't perform a memory barrier. Wrap the code in calls to DMB or better yet, rewrite it to use LDREX/STREX. I'll test on my dual-core Cortex-A9 when I have the time.
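For illustration, an LDREX/STREX-based try-lock along the lines suggested above could look something like the following. It is an untested sketch under stated assumptions: the ARM path is enabled only when `__ARM_ARCH_7A__` is defined, other targets use the GCC `__sync` builtins instead, and `ldrex_exchange`, `ldrex_try_lock` and `ldrex_unlock` are hypothetical names.

```cpp
// Atomically replace *v with desired and return the previous value,
// followed by a barrier.
inline int ldrex_exchange(volatile int* v, int desired)
{
#if defined(__ARM_ARCH_7A__)
    int previous, failed;
    __asm__ __volatile__(
        "1: ldrex %0, [%2]     \n\t"   // previous = *v, exclusive load
        "   strex %1, %3, [%2] \n\t"   // try *v = desired; %1 is 0 on success
        "   teq   %1, #0       \n\t"
        "   bne   1b           \n\t"   // exclusivity lost: retry
        "   dmb                \n\t"   // barrier after the exchange
        : "=&r"(previous), "=&r"(failed)
        : "r"(v), "r"(desired)
        : "memory", "cc");
    return previous;
#else
    return __sync_lock_test_and_set(v, desired);
#endif
}

inline bool ldrex_try_lock(volatile int* v)
{
    return ldrex_exchange(v, 1) == 0;   // we own the lock if it was 0 before
}

inline void ldrex_unlock(volatile int* v)
{
    // Barrier first, so writes made under the lock are visible before release.
#if defined(__ARM_ARCH_7A__)
    __asm__ __volatile__("dmb" : : : "memory");
#else
    __sync_synchronize();
#endif
    *v = 0;
}
```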
participants (4): Mathias Gaunard, Peter Dimov, Phil Endecott, Rene Rivera