shared_ptr under OS X

I've had a number of programs crash after recompiling with the 1.33.0 release and eventually tracked the problems down to shared_ptr, specifically the atomic_decrement function provided in sp_counted_base_cw_ppc.hpp. So far, I've only noted problems on a dual processor G5 running OS X 10.4.2, most often when a shared_ptr created as a temporary object goes out of scope and the destructor gets called. I was able to eliminate the problem by replacing the hand-rolled asm atomic routines with equivalent system routines from CarbonCore/DriverSychronization.h (except for atomic_conditional_increment, which I don't have an equivalent for). I'm not a PPC asm guru, or even a threading/mp guru, so I haven't been able to properly evaluate the boost functions. I do know that I was getting access exceptions upon execution of the sync instruction in the affected areas of the code. This was using CodeWarrior 9.5 to compile, with optimizations disabled, but inlining activated (the decrement was being inlined). In my (limited) research into the problem, I haven't been able to find any good examples of hand-rolled atomic ops that are mp-savvy, though I found plenty of dire warnings against making the attempt and numerous recommendations to use either the kernel-level or Carbon-level routines instead, depending on context. What additional information can I provide and who can I provide it to? I haven't spent enough time on this to conclude definitively that the problem is in the boost source, as it might be CodeWarrior's munging of it that causes the problem. It's unlikely to be an issue in my code, as I've seen the issue pop up in relation to several different constructs, all of which appear to be legal C++ to me. I won't be pursuing this independently, as the workaround I've found solves my immediate problem and lets me get back to work after killing a couple of days playing around with it. Steve

On Aug 21, 2005, at 4:28 PM, Steve Ramsey wrote:
I've had a number of programs crash after recompiling with the 1.33.0 release and eventually tracked the problems down to shared_ptr, specifically the atomic_decrement function provided in sp_counted_base_cw_ppc.hpp.
Try these: inline void atomic_increment( register long * pw ) { register int a; asm { loop: lwarx a, 0, pw addi a, a, 1 stwcx. a, 0, pw bne- loop } } inline long atomic_decrement( register long * pw ) { register int a; asm { sync loop: lwarx a, 0, pw addi a, a, -1 stwcx. a, 0, pw bne- loop isync } return a; } inline long atomic_conditional_increment( register long * pw ) { register int a; asm { loop: lwarx a, 0, pw cmpwi a, 0 beq store addi a, a, 1 store: stwcx. a, 0, pw bne- loop } return a; } -Howard

In article <045D1135-E8F5-4BF3-82BE-813C835E31FC@lucena.com>, Steve Ramsey <steve@lucena.com> wrote:
In my (limited) research into the problem, I haven't been able to find any good examples of hand-rolled atomic ops that are mp-savvy, though I found plenty of dire warnings against making the attempt and numerous recommendations to use either the kernel-level or Carbon-level routines instead, depending on context.
I voiced my concern over hand-rolled assembly when this issue was brought up regarding the boost implementation, and I am going to voice it again. We do not have the resources to test this kind of a change on all the hardware we support, and the fact that a broken shared_ptr made it into a release just highlights this. I do not think that the intended performance improvement is worth sacrificing the quality of our libraries, and I don't see a way to maintain that quality if we make hardware-specific changes while not having the resources to test on all the supported hardware (including future hardware). My intent here is not to say "I told you so", but to point out that I consider this a pretty serious error, and that I think we should understand how it came to pass. Ben -- I changed my name: <http://periodic-kingdom.org/People/NameChange.php>

On 23/08/05, Ben Artin <macdev@artins.org> wrote: In article <045D1135-E8F5-4BF3-82BE-813C835E31FC@lucena.com>, Steve Ramsey <steve@lucena.com> wrote: <snip> I voiced my concern over hand-rolled assembly when this issue was brought up regarding the boost implementation, and I am going to voice it again. We do not have the resources to test this kind of a change on all the hardware we support, and the fact that a broken shared_ptr made it into a release just highlights this.
The testing failed. There should be a test that warns if a correct result occurs when it should fail. For example, my home grown atomic ops include a unit test that does this. See below for an example. It warns if an unprotected increment operation works safely.
I do not think that the intended performance improvement is worth sacrificing the quality of our libraries, and I don't see a way to maintain that quality if we make hardware-specific changes while not having the resources to test on all the supported hardware (including future hardware).
The performance improvement is definitely worth it on many platforms for many apps. Native atomic ops are often an order of magnitude faster. Though many os platforms will wrap such ops in lib call that is just as fast... is this what you're referring to rather than using a generic safe approach? Perhaps you should be able to "opt out" to a mutex implementation via a #define. matt. //-------------------------------------------------- <snip> #define X86_64_GCC #include <synch/atomic_op.hpp> void synch_test_001() { static int global_count = 0; int number_of_threads = 10; static int inner_loop = 10000000; ZOMOJO_WORKER_THREAD_BEGIN( inner_loop ) global_count = global_count + 1; // unsafe naked increment of shared int ZOMOJO_WORKER_THREAD_END( number_of_threads ) BOOST_MESSAGE( "orderly progression = " << inner_loop * number_of_threads ); BOOST_MESSAGE( "conccurrent progression from this run = " << global_count ); // Warning will be generated if concurrency is not useful // If this warning is generated then the validity of the other tests for concurrency testing // is questionable. That is, if lots of threads can increment a shared variable with no // harmful consequences then testing locking and other concurrency strategies is not going // to test concurrency usefully. The further tests should still pass, failure is not expected, // they are just not thorough with respect to threading. BOOST_WARN_PREDICATE(std::not_equal_to<int>(), (global_count)( inner_loop * number_of_threads) ); } </snip>

Matt Hurd wrote:
The testing failed. There should be a test that warns if a correct result occurs when it should fail.
I believe that the tests do detect the problem in question... they just weren't being run under the problematic configuration. We first heard of the problem after the RC has gone out, and it was too late.
Perhaps you should be able to "opt out" to a mutex implementation via a #define.
You can: #define BOOST_SP_USE_PTHREADS It's even documented: http://www.boost.org/libs/smart_ptr/shared_ptr.htm#ThreadSafety (at the end of the section.)

On Aug 23, 2005, at 4:50 AM, Ben Artin wrote:
I consider this a pretty serious error, and that I think we should understand how it came to pass.
The technical cause of the error is that there was an unforeseen conflict between the CodeWarrior compiler's register allocation scheme, and the original code, but only when the original was actually inlined (the conflict disappeared with inlining off as might have easily happened in a debug run). The bug had nothing to do with multithreading or memory barriers per say, just that these issues were what was being implemented. The fix was to avoid explicitly naming registers in the assembly, allowing the compiler to allocate registers at will with no chance for conflict. The bug was first detected (to my knowledge) by John Daub on Aug. 10. I learned about it on Aug. 12, but was not able to look at it, diagnose it, and suggest a fix until Aug. 15. <shrug> bugs happen. Bugs of this nature won't be eradicated unless you outlaw dropping into assembly (and I don't recommend that). Perhaps an extended public beta of boost releases (all compressed and easy to download) might help? -Howard

In article <2c8df206af943147929336f60d8a4bea@twcny.rr.com>, Howard Hinnant <hinnant@twcny.rr.com> wrote:
On Aug 23, 2005, at 4:50 AM, Ben Artin wrote:
I consider this a pretty serious error, and that I think we should understand how it came to pass.
[...]
<shrug> bugs happen. Bugs of this nature won't be eradicated unless you outlaw dropping into assembly (and I don't recommend that).
Sure, bugs happen. My main point here is that we should make sure we learn something from them -- or, more importantly, that we should learn something from them *before* our users learn that Mac OS X releases of boost are not thoroughly tested.
Perhaps an extended public beta of boost releases (all compressed and easy to download) might help?
Can someone explain exactly why this wasn't caught in unit tests? I am not sure I understand why we missed this, but I am pretty sure it's important that we try to avoid this happening in the future. Ben -- I changed my name: <http://periodic-kingdom.org/People/NameChange.php>

Ben Artin wrote:
Can someone explain exactly why this wasn't caught in unit tests? I am not sure I understand why we missed this, but I am pretty sure it's important that we try to avoid this happening in the future.
The Dirxion tests: http://engineering.meta-comm.com/boost-regression/CVS-HEAD/developer/smart_p... are being run with inlining turned off. The bug only manifests itself when inlining is on.
Sure, bugs happen. My main point here is that we should make sure we learn something from them -- or, more importantly, that we should learn something from them *before* our users learn that Mac OS X releases of boost are not thoroughly tested.
The lesson here is that we can't test everything and that we must make a beta or an RC available earlier. We've had similar configuration-specific bugs before, and they weren't caused by platform-specific implementations. Alpha testing simply can't obviate the need for beta tests.

In article <007501c5a829$4e871bd0$6601a8c0@pdimov2>, "Peter Dimov" <pdimov@mmltd.net> wrote:
Ben Artin wrote:
Sure, bugs happen. My main point here is that we should make sure we learn something from them -- or, more importantly, that we should learn something from them *before* our users learn that Mac OS X releases of boost are not thoroughly tested.
The lesson here is that we can't test everything and that we must make a beta or an RC available earlier. We've had similar configuration-specific bugs before, and they weren't caused by platform-specific implementations. Alpha testing simply can't obviate the need for beta tests.
I agree. I understand what a bitch it can be to work with CVS branches and merging, so I understand the desire to minimize the divergence between the release branch and the mainline, but a longer period of (more thorough) testing would probably be a good idea. In particular, I am sufficiently busy that I didn't even have the time to look at the new release between first candidate and final release. I don't think I am the only one. Ben -- I changed my name: <http://periodic-kingdom.org/People/NameChange.php>

On Wed, 24 Aug 2005 00:26:17 +0300, Peter Dimov wrote
Ben Artin wrote:
Can someone explain exactly why this wasn't caught in unit tests? I am not sure I understand why we missed this, but I am pretty sure it's important that we try to avoid this happening in the future.
The Dirxion tests:
http://engineering.meta-comm.com/boost-regression/CVS-HEAD/developer/smart_p...
are being run with inlining turned off.
The bug only manifests itself when inlining is on.
For what it's worth, there is a recognition of the issue of running regression tests under debug only (search the mail archive if you really want to read about it). And, in fact, we do now have some platforms running tests using release configurations where inlining should be enabled. I believe the 1.33 is the first Boost release to have achieved this level of testing -- so, in fact, I believe the level of Boost testing has never been higher.
Sure, bugs happen. My main point here is that we should make sure we learn something from them -- or, more importantly, that we should learn something from them *before* our users learn that Mac OS X releases of boost are not thoroughly tested.
What I'd learn is that Boost is getting ever bigger in size and more popular. So even if we are getting better at testing we might very well see 'more bugs' after a release because there's more code to have bugs and more people with more configurations to see them. This dynamic is likely to continue.
The lesson here is that we can't test everything and that we must make a beta or an RC available earlier. We've had similar configuration-specific bugs before, and they weren't caused by platform-specific implementations. Alpha testing simply can't obviate the need for beta tests.
I agree. Jeff

* Ben Artin (macdev@artins.org) [20050823 11:03]:
the quality of our libraries, and I don't see a way to maintain that quality if we make hardware-specific changes while not having the resources to test on all the supported hardware (including future hardware).
Depends on the hardware you need for testing. SUSE does have quit a bit of hardware (i386, x86-64, ppc, ppc64, IPF (aka ia64), s390 and s390x) , but all of them are running Linux. If that suffices, I'd be willing to offer help in testing. Philipp -- Philipp Thomas <pth AT suse.de> Research & Development SUSE LINUX Products GmbH, Maxfeldstr. 5, D-90409 Nuremberg, Germany
participants (7)
-
Ben Artin
-
Howard Hinnant
-
Jeff Garland
-
Matt Hurd
-
Peter Dimov
-
Philipp Thomas
-
Steve Ramsey