[shared_ptr] possible bug in sp_counted_base_cw_ppc.hpp

I think that the functions atomic_increment, atomic_decrement, and atomic_conditional_increment should be marked as inline, as they are defined in a header file. This is causing multiple definition problems; see for instance http://tinyurl.com/6eub2

Joaquín Mª López Muñoz
Telefónica, Investigación y Desarrollo
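To make the failure mode concrete, here is a minimal sketch of the one-definition-rule problem (the file and function names are hypothetical, not from Boost):

// counter.hpp -- hypothetical header
void bump( long * p ) { ++*p; }     // external linkage: every translation unit that
                                    // includes this header emits its own definition

// a.cpp
#include "counter.hpp"

// b.cpp
#include "counter.hpp"

// Linking a.o and b.o fails with "multiple definition of bump(long*)".
// Declaring the function 'inline' allows the duplicate definitions and lets
// the linker merge them, which is the fix being proposed here.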

Joaquín Mª López Muñoz wrote:
I think that the functions
atomic_increment atomic_decrement atomic_conditional_increment
should be marked as inline, as they are defined in a header file. This is causing multiple definition problems, see for instance
Yes, you are right. I'd fix it but I don't know whether "inline" needs to go before or after "asm". ;-) Rene?

Peter Dimov wrote:
Joaquín Mª López Muñoz wrote:
I think that the functions
atomic_increment atomic_decrement atomic_conditional_increment
should be marked as inline, as they are defined in a header file. This is causing multiple definition problems, see for instance
Yes, you are right. I'd fix it but I don't know whether "inline" needs to go before or after "asm". ;-) Rene?
Sorry.. it took a while, waiting for the CVS update. And the answer is "neither". Function-level assembly can't be inlined. At least that's what the compiler says :-) Which means changing to use statement-level assembly. So it would have to be something like this...

inline void atomic_increment( register long * pw )
{
    asm
    {
    loop:
        lwarx   r4, 0, r3
        addi    r4, r4, 1
        stwcx.  r4, 0, r3
        bne-    loop
    }
}

Which, when applied to the other functions, compiles without warnings. And since all the tests pass, I checked in the changes to CVS.
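For readers who haven't met the distinction: CodeWarrior has two asm forms, and only the second can carry the inline keyword. The sketch below is illustrative only; the function-level form's exact prologue/epilogue requirements (including whether the explicit blr is needed) depend on the CodeWarrior version and settings, so treat it as an assumption rather than the original file's contents.

// Function-level asm: the entire function body is assembly.
// CodeWarrior rejects 'inline' on this form (the situation before the fix).
asm void atomic_increment( register long * pw )
{
loop:
    lwarx   r4, 0, r3
    addi    r4, r4, 1
    stwcx.  r4, 0, r3
    bne-    loop
    blr                 // explicit return; assumed to be required in this form
}

// Statement-level asm: an asm block inside an ordinary function body,
// which is the form the checked-in change uses and which can be inlined.
inline void atomic_increment( register long * pw )
{
    asm
    {
    loop:
        lwarx   r4, 0, r3
        addi    r4, r4, 1
        stwcx.  r4, 0, r3
        bne-    loop
    }
}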

Rene Rivera wrote: [...]
Which when applied to the other functions, compiles without warnings. And since all the tests pass, I checked in the changes to CVS.
Also, I forgot to ask about something I noticed.. In sp_counted_base.hpp there is this sequence:

#elif defined( __GNUC__ ) && ( defined( __i386__ ) || defined( __x86_64__ ) )
# include <boost/detail/sp_counted_base_gcc_x86.hpp>

...and further along...

#elif defined( WIN32 ) || defined( _WIN32 ) || defined( __WIN32__ )
# include <boost/detail/sp_counted_base_w32.hpp>

Question is.. Shouldn't we be using the Win32 implementation in the case of GCC on Windows, i.e. under MinGW?

Rene Rivera wrote:
Rene Rivera wrote: [...]
Which when applied to the other functions, compiles without warnings. And since all the tests pass, I checked in the changes to CVS.
Also, I forgot to ask about something I noticed.. In sp_counted_base.hpp there is this sequence:
#elif defined( __GNUC__ ) && ( defined( __i386__ ) || defined( __x86_64__ ) )
# include <boost/detail/sp_counted_base_gcc_x86.hpp>
...and further along...
#elif defined( WIN32 ) || defined( _WIN32 ) || defined( __WIN32__ )
# include <boost/detail/sp_counted_base_w32.hpp>
Question is.. Shouldn't we be using the Win32 implementation in the case of GCC on Windows, i.e. under MinGW?
No, because the Win32 implementation on non-MSVC/Intel makes indirect calls to the kernel Interlocked* functions. The g++ version is better.
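A sketch of the difference being described (these are paraphrases, not the literal Boost sources): on compilers without the _Interlocked* intrinsics, the Win32 flavour ends up calling the real InterlockedIncrement exported from kernel32, an out-of-line call through the import table, while the g++/x86 flavour emits a single locked instruction inline.

// Win32-style path (paraphrase): a call into kernel32
#include <windows.h>

inline void atomic_increment( long * pw )
{
    InterlockedIncrement( pw );   // LONG InterlockedIncrement( LONG volatile * )
}

// g++/x86-style path (paraphrase): one locked instruction, no function call
inline void atomic_increment( int * pw )
{
    __asm__
    (
        "lock\n\t"
        "incl %0":
        "=m"( *pw ):   // output: *pw is written
        "m"( *pw ):    // input: *pw is read
        "cc"           // the instruction modifies the flags
    );
}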

Peter Dimov wrote:
No, because the Win32 implementation on non-MSVC/Intel makes indirect calls to the kernel Interlocked* functions. The g++ version is better.
So that brings up this question.. Would it be preferable to have CodeWarrior on Windows use an assembly implementation? That is, have a sp_counted_base_cw_x86.hpp with the same implementation as sp_counted_base_gcc_x86.hpp.

Rene Rivera wrote:
Peter Dimov wrote:
No, because the Win32 implementation on non-MSVC/Intel makes indirect calls to the kernel Interlocked* functions. The g++ version is better.
So that brings up this question..
Would it be preferable to have CodeWarrior on Windows use an assembly implementation? That is have a sp_counted_base_cw_x86.hpp, with the same implementation as sp_counted_base_gcc_x86.hpp.
(ping) Haven't gotten an answer on the question.

Rene Rivera wrote:
Peter Dimov wrote:
No, because the Win32 implementation on non-MSVC/Intel makes indirect calls to the kernel Interlocked* functions. The g++ version is better.
So that brings up this question..
Would it be preferable to have CodeWarrior on Windows use an assembly implementation? That is have a sp_counted_base_cw_x86.hpp, with the same implementation as sp_counted_base_gcc_x86.hpp.
Maybe. I was trying to avoid sp_counted_base_* variations unless necessary. Since we need g++/x86 for non-Windows platforms anyway, it seems reasonable to use it on Windows, too. I wonder if CodeWarrior supports the _Interlocked* family as intrinsic functions. If it does, we'll just need to enable the appropriate path in detail/interlocked.hpp for __MWERKS__. In any event, I think that this can wait for after 1.33 is released.

Peter Dimov wrote:
I wonder if CodeWarrior supports the _Interlocked* family as intrinsic functions. If it does, we'll just need to enable the appropriate path in detail/interlocked.hpp for __MWERKS__.
Didn't find anything in the docs or the CodeWarrior runtime sources (other than the instances in the win32 sdk) to suggest that it supports them. How would I go about changing interlocked.hpp to verify this either way? (I guess I could hardwire it, hm..) Yes, I know, I'm being a pest about this.. But I'm *really* interested in the CW/x86 case.

Rene Rivera wrote:
Peter Dimov wrote:
I wonder if CodeWarrior supports the _Interlocked* family as intrinsic functions. If it does, we'll just need to enable the appropriate path in detail/interlocked.hpp for __MWERKS__.
Didn't find anything in the docs nor the CodeWarrior runtime sources (other than the instances in the win32 sdk) to suggest that it supports them. How would I go about changing interlocked.hpp to verify this either way? (I guess I could hardwire it, hm..)
There is an #ifdef for BOOST_MSVC or BOOST_INTEL_WIN at line 30; try adding __MWERKS__ to it.
Yes, I know, I'm being a pest about this.. But I'm *really* interested in the CW/x86 case.
Well, if our review manager and our testers don't mind, you can go ahead and add a _cw_x86 version. Although it might be a good idea to verify whether it actually leads to an increase in performance first. ;-)
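To make the interlocked.hpp suggestion above concrete: the experiment is simply to let __MWERKS__ take the same branch that MSVC and Intel take and see whether the _Interlocked* intrinsics resolve. A rough sketch of what that branch looks like (paraphrased, not the verbatim file contents; the __MWERKS__ test is the added part):

#if defined( BOOST_MSVC ) || defined( BOOST_INTEL_WIN ) || defined( __MWERKS__ )

extern "C" long __cdecl _InterlockedIncrement( long volatile * );
extern "C" long __cdecl _InterlockedDecrement( long volatile * );
extern "C" long __cdecl _InterlockedCompareExchange( long volatile *, long, long );
extern "C" long __cdecl _InterlockedExchange( long volatile *, long );
extern "C" long __cdecl _InterlockedExchangeAdd( long volatile *, long );

# pragma intrinsic( _InterlockedIncrement )
# pragma intrinsic( _InterlockedDecrement )
# pragma intrinsic( _InterlockedCompareExchange )
# pragma intrinsic( _InterlockedExchange )
# pragma intrinsic( _InterlockedExchangeAdd )

# define BOOST_INTERLOCKED_INCREMENT _InterlockedIncrement
# define BOOST_INTERLOCKED_DECREMENT _InterlockedDecrement
# define BOOST_INTERLOCKED_COMPARE_EXCHANGE _InterlockedCompareExchange
# define BOOST_INTERLOCKED_EXCHANGE _InterlockedExchange
# define BOOST_INTERLOCKED_EXCHANGE_ADD _InterlockedExchangeAdd

#endif

The result reported later in the thread (undefined _Interlocked* symbols, plus warnings on the #pragma intrinsic lines) is consistent with CodeWarrior simply not providing these intrinsics.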

Peter Dimov wrote:
Rene Rivera wrote:
Yes, I know, I'm being a pest about this.. But I'm *really* interested in the CW/x86 case.
Well, if our review manager and our testers don't mind, you can go ahead and add a _cw_x86 version. Although it might be a good idea to verify whether it actually leads to an increase in performance first. ;-)
OK, I tried translating the gcc_x86. But I just don't understand the GCC machine description language well enough. What does the assembly end up being for all those memory, cc, etc.? And yes, I'll test the performance to make sure it actually helps :-)

Rene Rivera wrote:
Peter Dimov wrote:
Rene Rivera wrote:
Yes, I know, I'm being a pest about this.. But I'm *really* interested in the CW/x86 case.
Well, if our review manager and our testers don't mind, you can go ahead and add a _cw_x86 version. Although it might be a good idea to verify whether it actually leads to an increase in performance first. ;-)
OK, I tried translating the gcc_x86. But I just don't understand the GCC machine description language well enough. What does the assembly end up being for all those memory, cc, etc. ?
I don't think that CodeWarrior has any way of specifying whether an asm statement affects memory, clobbers a register, or should not be reordered. It's supposedly (and hopefully) smart enough to figure these things out on its own. ;-)

Peter Dimov wrote:
Rene Rivera wrote:
OK, I tried translating the gcc_x86. But I just don't understand the GCC machine description language well enough. What does the assembly end up being for all those memory, cc, etc. ?
I don't think that CodeWarrior has any way of specifying whether an asm statement affects memory, clobbers a register, or should not be reordered. It's supposedly (and hopefully) smart enough to figure these things out on its own. ;-)
It is smart enough, and you have to jump through some hoops to make it optimize assembly blocks.. But I also meant all the other GCC constructs that I don't know :-) For example MOVL is not x86 assembly, but some GCC construct, and I'm not totally sure yet if it's just the argument-inverted version of MOV. This is what I came up with for the atomic_conditional_increment..

inline int atomic_conditional_increment( int * pw )
{
    // int rv = *pw;
    // if( rv != 0 ) ++*pw;
    // return rv;

    int rv;
    int register tmp;

    asm
    {
        mov eax, [pw]
    L0:
        test eax, eax
        je L1
        mov tmp, eax
        inc tmp
        lock cmpxchg [pw], tmp
        jne L0
    L1:
        mov rv, eax

        //~ "movl %0, %%eax\n\t"
        //~ "0:\n\t"
        //~ "test %%eax, %%eax\n\t"
        //~ "je 1f\n\t"
        //~ "movl %%eax, %2\n\t"
        //~ "incl %2\n\t"
        //~ "lock\n\t"
        //~ "cmpxchgl %2, %0\n\t"
        //~ "jne 0b\n\t"
        //~ "1:":
        //~ "=m"( *pw ), "=&a"( rv ), "=&r"( tmp ): // outputs (%0, %1, %2)
        //~ "m"( *pw ): // input (%3)
        //~ "cc" // clobbers
    }

    return rv;
}

But, even though the logic seems correct to me, it doesn't work.

Rene Rivera wrote:
Peter Dimov wrote:
Rene Rivera wrote:
OK, I tried translating the gcc_x86. But I just don't understand the GCC machine description language well enough. What does the assembly end up being for all those memory, cc, etc. ?
I don't think that CodeWarrior has any way of specifying whether an asm statement affects memory, clobbers a register, or should not be reordered. It's supposedly (and hopefully) smart enough to figure these things out on its own. ;-)
It is smart enough, and you have to jump some hoops to make it optimize assembly blocks.. But I also meant all the other GCC constructs that I don't know :-) For example MOVL is not x86 assembly, but some GCC construct, and I'm not totally sure yet if it's just the argument inverted version of MOV.
Yes, movl is the argument-inverted mov.
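In other words (general AT&T vs. Intel syntax, not specific to these sources):

    movl %eax, %ebx      # AT&T (GCC): source first, destination second; the 'l' suffix means 32-bit
    mov  ebx, eax        ; Intel (CodeWarrior, MASM): destination first, source second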
This is what I came up with for the atomic_conditional_increment..
inline int atomic_conditional_increment( int * pw )
{
    // int rv = *pw;
    // if( rv != 0 ) ++*pw;
    // return rv;

    int rv;
    int register tmp;

    asm
    {
        mov eax, [pw]
    L0:
        test eax, eax
        je L1
        mov tmp, eax
        inc tmp
        lock cmpxchg [pw], tmp
        jne L0
    L1:
        mov rv, eax

        //~ "movl %0, %%eax\n\t"
        //~ "0:\n\t"
        //~ "test %%eax, %%eax\n\t"
        //~ "je 1f\n\t"
        //~ "movl %%eax, %2\n\t"
        //~ "incl %2\n\t"
        //~ "lock\n\t"
        //~ "cmpxchgl %2, %0\n\t"
        //~ "jne 0b\n\t"
        //~ "1:":
        //~ "=m"( *pw ), "=&a"( rv ), "=&r"( tmp ): // outputs (%0, %1, %2)
        //~ "m"( *pw ): // input (%3)
        //~ "cc" // clobbers
    }

    return rv;
}
But, even though the logic seems correct to me, it doesn't work.
Assuming CodeWarrior returns in eax, try:

inline int atomic_conditional_increment( int * pw )
{
    // int rv = *pw;
    // if( rv != 0 ) ++*pw;
    // return rv;

    asm
    {
        mov esi, [pw]
        mov eax, [esi]
    L0:
        test eax, eax
        je L1
        mov ebx, eax
        inc ebx
        lock cmpxchg [esi], ebx
        jne L0
    L1:
    }
}

// esi == pw, eax == *pw
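Since the "memory, cc, etc." part of the earlier question never got a direct answer, here is the g++ version reconstructed from the commented-out lines quoted above, with the operand constraints annotated. The semantics in the comments are general GCC extended-asm behaviour; whether the original file also marks the statement __volatile__ is not visible in the quoted fragment, so that detail is left out.

inline int atomic_conditional_increment( int * pw )
{
    // int rv = *pw;
    // if( rv != 0 ) ++*pw;
    // return rv;

    int rv, tmp;

    __asm__
    (
        "movl %0, %%eax\n\t"
        "0:\n\t"
        "test %%eax, %%eax\n\t"
        "je 1f\n\t"
        "movl %%eax, %2\n\t"
        "incl %2\n\t"
        "lock\n\t"
        "cmpxchgl %2, %0\n\t"
        "jne 0b\n\t"
        "1:":
        "=m"( *pw ),   // %0: output memory operand - the asm may store to *pw
        "=&a"( rv ),   // %1: output bound to eax ('a'); '&' = early-clobber, so GCC
                       //     won't place an input in eax
        "=&r"( tmp ):  // %2: output in any free general register, also early-clobber
        "m"( *pw ):    // %3: input memory operand - the asm also reads *pw
        "cc"           // clobber: the flags ("condition codes") are modified; a "memory"
                       // clobber, when used, additionally tells GCC that memory it can't
                       // see may be touched, so cached values are spilled and reloaded
    );

    return rv;
}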

I have good news, bad news, and worse news..

Peter Dimov wrote:
Assuming CodeWarrior returns in eax, try:
I was in the middle of looking at the disassembly when I got your reply, and realized that CW is too smart. It was removing the based addressing :-(
inline int atomic_conditional_increment( int * pw )
{
    // int rv = *pw;
    // if( rv != 0 ) ++*pw;
    // return rv;

    asm
    {
        mov esi, [pw]
        mov eax, [esi]
    L0:
        test eax, eax
        je L1
        mov ebx, eax
        inc ebx
        lock cmpxchg [esi], ebx
        jne L0
    L1:
    }
}
// esi == pw, eax == *pw
Good news.. After doing that, and applying the equivalent transformation to the other functions, it works much better.

Bad news.. The results of the timing test are, from slow to fast:

sp_counted_base_w32.hpp...    debug: 5.058 - release: 1.902
sp_counted_base_cw_x86.hpp... debug: 4.997 - release: 1.743
sp_counted_base_nt.hpp...     debug: 4.847 - release: 1.432

So some improvement, but I'm not sure if it's worth the extra complication. Unless there are people out there doing massive, >10Meg, shared_ptr manipulations.

Worse news.. Not all tests pass! The shared_ptr_mt_test.cpp test fails with a rather spectacular illegal memory access exception. And to top it off it causes the MSVC JIT debugger to crash if one attempts to debug.. Oh fun!

Rene Rivera wrote:
Worse news..
Not all tests pass! The shared_ptr_mt_test.cpp test fails with a rather spectacular illegal memory access exception. And to top it off it causes the MSVC JIT debugger to crash if one attempts to debug.. Oh fun!
Can you post the other two functions?

Rene Rivera wrote:
I have good news, bad news, and worse news..
Forgot that I wanted to post the code that I have so others may give it a try.

#ifndef BOOST_DETAIL_SP_COUNTED_BASE_CW_X86_HPP_INCLUDED
#define BOOST_DETAIL_SP_COUNTED_BASE_CW_X86_HPP_INCLUDED

// MS compatible compilers support #pragma once

#if defined(_MSC_VER) && (_MSC_VER >= 1020)
# pragma once
#endif

//
//  detail/sp_counted_base_gcc_x86.hpp - g++ on 486+ or AMD64
//
//  Copyright (c) 2001, 2002, 2003 Peter Dimov and Multi Media Ltd.
//  Copyright 2004-2005 Peter Dimov
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//
//
//  Lock-free algorithm by Alexander Terekhov
//
//  Thanks to Ben Hitchings for the #weak + (#shared != 0)
//  formulation
//

#include <typeinfo>

namespace boost
{

namespace detail
{

inline int atomic_exchange_and_add( int * pw, int dv )
{
    // int r = *pw;
    // *pw += dv;
    // return r;

    asm
    {
        mov esi, [pw]
        mov eax, dv
        lock xadd [esi], eax
    }
}

inline void atomic_increment( int * pw )
{
    //atomic_exchange_and_add( pw, 1 );

    asm
    {
        mov esi, [pw]
        lock inc [esi]
    }
}

inline int atomic_conditional_increment( int * pw )
{
    // int rv = *pw;
    // if( rv != 0 ) ++*pw;
    // return rv;

    asm
    {
        mov esi, [pw]
        mov eax, [esi]
    L0:
        test eax, eax
        je L1
        mov ebx, eax
        inc ebx
        lock cmpxchg [esi], ebx
        jne L0
    L1:
    }
}

class sp_counted_base
{
private:

    sp_counted_base( sp_counted_base const & );
    sp_counted_base & operator= ( sp_counted_base const & );

    int use_count_;        // #shared
    int weak_count_;       // #weak + (#shared != 0)

public:

    sp_counted_base(): use_count_( 1 ), weak_count_( 1 )
    {
    }

    virtual ~sp_counted_base() // nothrow
    {
    }

    // dispose() is called when use_count_ drops to zero, to release
    // the resources managed by *this.

    virtual void dispose() = 0; // nothrow

    // destroy() is called when weak_count_ drops to zero.

    virtual void destroy() // nothrow
    {
        delete this;
    }

    virtual void * get_deleter( std::type_info const & ti ) = 0;

    void add_ref_copy()
    {
        atomic_increment( &use_count_ );
    }

    bool add_ref_lock() // true on success
    {
        return atomic_conditional_increment( &use_count_ ) != 0;
    }

    void release() // nothrow
    {
        if( atomic_exchange_and_add( &use_count_, -1 ) == 1 )
        {
            dispose();
            weak_release();
        }
    }

    void weak_add_ref() // nothrow
    {
        atomic_increment( &weak_count_ );
    }

    void weak_release() // nothrow
    {
        if( atomic_exchange_and_add( &weak_count_, -1 ) == 1 )
        {
            destroy();
        }
    }

    long use_count() const // nothrow
    {
        return static_cast<int const volatile &>( use_count_ );
    }
};

} // namespace detail

} // namespace boost

#endif  // #ifndef BOOST_DETAIL_SP_COUNTED_BASE_GCC_X86_HPP_INCLUDED

#ifndef BOOST_DETAIL_SP_COUNTED_BASE_HPP_INCLUDED
#define BOOST_DETAIL_SP_COUNTED_BASE_HPP_INCLUDED

// MS compatible compilers support #pragma once

#if defined(_MSC_VER) && (_MSC_VER >= 1020)
# pragma once
#endif

//
//  detail/sp_counted_base.hpp
//
//  Copyright 2005 Peter Dimov
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//

#include <boost/config.hpp>

#if defined( BOOST_SP_DISABLE_THREADS )
# include <boost/detail/sp_counted_base_nt.hpp>

#elif defined( BOOST_SP_USE_PTHREADS )
# include <boost/detail/sp_counted_base_pt.hpp>

#elif defined( __GNUC__ ) && ( defined( __i386__ ) || defined( __x86_64__ ) )
# include <boost/detail/sp_counted_base_gcc_x86.hpp>

#elif defined( __MWERKS__ ) && ( defined( __i386__ ) || defined( __x86_64__ ) )
# include <boost/detail/sp_counted_base_cw_x86.hpp>

#elif defined( __MWERKS__ ) && defined( __POWERPC__ )
# include <boost/detail/sp_counted_base_cw_ppc.hpp>

#elif defined( __GNUC__ ) && ( defined( __powerpc__ ) || defined( __ppc__ ) )
# include <boost/detail/sp_counted_base_gcc_ppc.hpp>

#elif defined( WIN32 ) || defined( _WIN32 ) || defined( __WIN32__ )
# include <boost/detail/sp_counted_base_w32.hpp>

#elif !defined( BOOST_HAS_THREADS )
# include <boost/detail/sp_counted_base_nt.hpp>

#elif defined( BOOST_HAS_PTHREADS )
# include <boost/detail/sp_counted_base_pt.hpp>

#else

// Use #define BOOST_DISABLE_THREADS to avoid the error
# error Unrecognized threading platform

#endif

#endif  // #ifndef BOOST_DETAIL_SP_COUNTED_BASE_HPP_INCLUDED

Rene Rivera wrote:
namespace boost {
namespace detail {
inline int atomic_exchange_and_add( int * pw, int dv )
{
    // int r = *pw;
    // *pw += dv;
    // return r;

    asm
    {
        mov esi, [pw]
        mov eax, dv
        lock xadd [esi], eax
    }
}

inline void atomic_increment( int * pw )
{
    //atomic_exchange_and_add( pw, 1 );

    asm
    {
        mov esi, [pw]
        lock inc [esi]
    }
}
Everything looks correct to me, except I'd put the lock prefix before the instruction. GCC needs it on its own line but the x86 way is 'lock inc [esi]'. Unless you can post a disassembly, I'm out of ideas. :-)

Peter Dimov wrote:
Rene Rivera wrote:
asm
{
    mov esi, [pw]
    lock inc [esi]
}
Everything looks correct to me, except I'd put the lock prefix before the instruction. GCC needs it on its own line but the x86 way is 'lock inc [esi]'.
Unless you can post a disassembly, I'm out of ideas. :-)
I looked at the disassembly of some of the tests and found one problem. CW was using the incorrect access size in some cases, BYTE instead of DWORD. So if I make that explicit, everything works. I'm running the timing tests again, with some changes to see if I can get more representative numbers. But attached is the end result.

#ifndef BOOST_DETAIL_SP_COUNTED_BASE_CW_X86_HPP_INCLUDED
#define BOOST_DETAIL_SP_COUNTED_BASE_CW_X86_HPP_INCLUDED

// MS compatible compilers support #pragma once

#if defined(_MSC_VER) && (_MSC_VER >= 1020)
# pragma once
#endif

//
//  detail/sp_counted_base_gcc_x86.hpp - g++ on 486+ or AMD64
//
//  Copyright (c) 2001, 2002, 2003 Peter Dimov and Multi Media Ltd.
//  Copyright 2004-2005 Peter Dimov
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//
//
//  Lock-free algorithm by Alexander Terekhov
//
//  Thanks to Ben Hitchings for the #weak + (#shared != 0)
//  formulation
//

#include <typeinfo>

namespace boost
{

namespace detail
{

inline int atomic_exchange_and_add( int * pw, int dv )
{
    // int r = *pw;
    // *pw += dv;
    // return r;

    asm
    {
        mov esi, [pw]
        mov eax, dv
        lock xadd dword ptr [esi], eax
    }
}

inline void atomic_increment( int * pw )
{
    //atomic_exchange_and_add( pw, 1 );

    asm
    {
        mov esi, [pw]
        lock inc dword ptr [esi]
    }
}

inline int atomic_conditional_increment( int * pw )
{
    // int rv = *pw;
    // if( rv != 0 ) ++*pw;
    // return rv;

    asm
    {
        mov esi, [pw]
        mov eax, dword ptr [esi]
    L0:
        test eax, eax
        je L1
        mov ebx, eax
        inc ebx
        lock cmpxchg dword ptr [esi], ebx
        jne L0
    L1:
    }
}

class sp_counted_base
{
private:

    sp_counted_base( sp_counted_base const & );
    sp_counted_base & operator= ( sp_counted_base const & );

    int use_count_;        // #shared
    int weak_count_;       // #weak + (#shared != 0)

public:

    sp_counted_base(): use_count_( 1 ), weak_count_( 1 )
    {
    }

    virtual ~sp_counted_base() // nothrow
    {
    }

    // dispose() is called when use_count_ drops to zero, to release
    // the resources managed by *this.

    virtual void dispose() = 0; // nothrow

    // destroy() is called when weak_count_ drops to zero.

    virtual void destroy() // nothrow
    {
        delete this;
    }

    virtual void * get_deleter( std::type_info const & ti ) = 0;

    void add_ref_copy()
    {
        atomic_increment( &use_count_ );
    }

    bool add_ref_lock() // true on success
    {
        return atomic_conditional_increment( &use_count_ ) != 0;
    }

    void release() // nothrow
    {
        if( atomic_exchange_and_add( &use_count_, -1 ) == 1 )
        {
            dispose();
            weak_release();
        }
    }

    void weak_add_ref() // nothrow
    {
        atomic_increment( &weak_count_ );
    }

    void weak_release() // nothrow
    {
        if( atomic_exchange_and_add( &weak_count_, -1 ) == 1 )
        {
            destroy();
        }
    }

    long use_count() const // nothrow
    {
        return static_cast<int const volatile &>( use_count_ );
    }
};

} // namespace detail

} // namespace boost

#endif  // #ifndef BOOST_DETAIL_SP_COUNTED_BASE_GCC_X86_HPP_INCLUDED

Rene Rivera wrote:
I'm running the timing tests again, with some changes to see if I can get more representative numbers. But attached is the end result.
With a changed test that exercises all three functions, here are the timing results:

sp_counted_base_w32.hpp...    debug: 15.673 - release: 7.151
sp_counted_base_cw_x86.hpp... debug: 15.943 - release: 6.499
sp_counted_base_nt.hpp...     debug: 15.933 - release: 6.499

So the hand assembly is as fast as the non-threading version, but it is slower than the Interlocked* version in debug and faster in release. What gives? Is there a better smart_ptr timing test out there? I'm including the one I used.

#include <boost/config.hpp>

#if defined(BOOST_MSVC)
#pragma warning(disable: 4786)  // identifier truncated in debug info
#pragma warning(disable: 4710)  // function not inlined
#pragma warning(disable: 4711)  // function selected for automatic inline expansion
#pragma warning(disable: 4514)  // unreferenced inline removed
#endif

//
//  shared_ptr_timing_test.cpp - use to evaluate the impact of thread safety
//
//  Copyright (c) 2002 Peter Dimov and Multi Media Ltd.
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//

#include <boost/shared_ptr.hpp>
#include <boost/weak_ptr.hpp>

#include <iostream>
#include <vector>
#include <ctime>

int const n = 4 * 1024 * 1024;
int const r = 4;

int main()
{
    using namespace std;

    std::vector< boost::shared_ptr<int> > v(n);
    std::vector< boost::weak_ptr<int> > w(n);
    std::vector< boost::shared_ptr<int> > x(n);

    boost::shared_ptr<int> pi(new int);

    clock_t t = clock();

    {
        for (int i = 0; i < r; ++i)
        {
            v.assign(n,pi);
            w.assign(v.begin(),v.end());

            for (int j = 0; j < n; ++j)
            {
                x[j] = w[j].lock();
            }

            v.clear();
            w.clear();
            x.clear();
        }
    }

    t = clock() - t;

    std::cout << static_cast<double>(t) / CLOCKS_PER_SEC << '\n';

    return 0;
}

Rene Rivera wrote:
Rene Rivera wrote:
I'm running the timing tests again, with some changes to see if I can get more representative numbers. But attached is the end result.
With a changed test that exercises all three functions, here's the timing results:
sp_counted_base_w32.hpp... debug: 15.673 - release: 7.151
sp_counted_base_cw_x86.hpp... debug: 15.943 - release: 6.499
sp_counted_base_nt.hpp... debug: 15.933 - release: 6.499
So the hand assembly is as fast as the non-threading version, but it is slower than the Interlocked* version in debug and faster in release. What gives? Is there a better smart_ptr timing test out there? I'm including the one I used.
Darn... attached the wrong file :-( Here's the one that really works..

#include <boost/config.hpp>

#if defined(BOOST_MSVC)
#pragma warning(disable: 4786)  // identifier truncated in debug info
#pragma warning(disable: 4710)  // function not inlined
#pragma warning(disable: 4711)  // function selected for automatic inline expansion
#pragma warning(disable: 4514)  // unreferenced inline removed
#endif

//
//  shared_ptr_timing_test.cpp - use to evaluate the impact of thread safety
//
//  Copyright (c) 2002 Peter Dimov and Multi Media Ltd.
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//

#include <boost/shared_ptr.hpp>
#include <boost/weak_ptr.hpp>

#include <iostream>
#include <vector>
#include <ctime>

int const n = 4 * 1024 * 1024;
int const r = 4;

int main()
{
    using namespace std;

    std::vector< boost::shared_ptr<int> > v(n);
    std::vector< boost::weak_ptr<int> > w(n);
    std::vector< boost::shared_ptr<int> > x(n);

    boost::shared_ptr<int> pi(new int);

    clock_t t = clock();

    {
        for (int i = 0; i < r; ++i)
        {
            v.assign(n,pi);
            w.assign(v.begin(),v.end());

            for (int j = 0; j < n; ++j)
            {
                x[j] = w[j].lock();
            }

            v.assign(n,boost::shared_ptr<int>());
            w.assign(n,boost::weak_ptr<int>());
            x.assign(n,boost::shared_ptr<int>());
        }
    }

    t = clock() - t;

    std::cout << static_cast<double>(t) / CLOCKS_PER_SEC << '\n';

    return 0;
}

Rene Rivera wrote:
With a changed test that exercises all three functions, here's the timing results:
sp_counted_base_w32.hpp... debug: 15.673 - release: 7.151
sp_counted_base_cw_x86.hpp... debug: 15.943 - release: 6.499
sp_counted_base_nt.hpp... debug: 15.933 - release: 6.499
The _mt tests are also timing tests; you can try them, too. I've tried to make weak_ptr_mt_test fairly representative of real-world weak_ptr code. Don't bother testing the debug builds.

Rene Rivera wrote:
I'm running the timing tests again, with some changes to see if I can get more representative numbers.
If you can get the assembly version to outperform the generic Win32 version, you mean ;-) But out of curiosity, what changes? I've been thinking of adding more representative shared_ptr and weak_ptr timing tests. The _mt versions aren't bad, though.
But attached is the end result.
// detail/sp_counted_base_gcc_x86.hpp - g++ on 486+ or AMD64
You should probably fix this line...
// Copyright (c) 2001, 2002, 2003 Peter Dimov and Multi Media Ltd.
// Copyright 2004-2005 Peter Dimov
... and add your name here.

Peter Dimov wrote:
Rene Rivera wrote:
I'm running the timing tests again, with some changes to see if I can get more representative numbers.
If you can get the assembly version to outperform the generic Win32 version, you mean ;-)
No.. just one that tests all three of the functions. The timing test you have only tests one of them.
But out of curiosity, what changes? I've been thinking to add a more representative shared_ptr and weak_ptr timing tests. The _mt versions aren't bad, though.
Well certainly an _mt timing test would be best. As it's not that useful to only time the non-contention cases.
// detail/sp_counted_base_gcc_x86.hpp - g++ on 486+ or AMD64
You should probably fix this line...
// Copyright (c) 2001, 2002, 2003 Peter Dimov and Multi Media Ltd.
// Copyright 2004-2005 Peter Dimov
... and add your name here.
Yes :-)

#ifndef BOOST_DETAIL_SP_COUNTED_BASE_CW_X86_HPP_INCLUDED
#define BOOST_DETAIL_SP_COUNTED_BASE_CW_X86_HPP_INCLUDED

// MS compatible compilers support #pragma once

#if defined(_MSC_VER) && (_MSC_VER >= 1020)
# pragma once
#endif

//
//  detail/sp_counted_base_cw_x86.hpp - CodeWarrior on 486+
//
//  Copyright (c) 2001, 2002, 2003 Peter Dimov and Multi Media Ltd.
//  Copyright 2004-2005 Peter Dimov
//  Copyright 2005 Rene Rivera
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//
//
//  Lock-free algorithm by Alexander Terekhov
//
//  Thanks to Ben Hitchings for the #weak + (#shared != 0)
//  formulation
//

#include <typeinfo>

namespace boost
{

namespace detail
{

inline int atomic_exchange_and_add( int * pw, int dv )
{
    // int r = *pw;
    // *pw += dv;
    // return r;

    asm
    {
        mov esi, [pw]
        mov eax, dv
        lock xadd dword ptr [esi], eax
    }
}

inline void atomic_increment( int * pw )
{
    //atomic_exchange_and_add( pw, 1 );

    asm
    {
        mov esi, [pw]
        lock inc dword ptr [esi]
    }
}

inline int atomic_conditional_increment( int * pw )
{
    // int rv = *pw;
    // if( rv != 0 ) ++*pw;
    // return rv;

    asm
    {
        mov esi, [pw]
        mov eax, dword ptr [esi]
    L0:
        test eax, eax
        je L1
        mov ebx, eax
        inc ebx
        lock cmpxchg dword ptr [esi], ebx
        jne L0
    L1:
    }
}

class sp_counted_base
{
private:

    sp_counted_base( sp_counted_base const & );
    sp_counted_base & operator= ( sp_counted_base const & );

    int use_count_;        // #shared
    int weak_count_;       // #weak + (#shared != 0)

public:

    sp_counted_base(): use_count_( 1 ), weak_count_( 1 )
    {
    }

    virtual ~sp_counted_base() // nothrow
    {
    }

    // dispose() is called when use_count_ drops to zero, to release
    // the resources managed by *this.

    virtual void dispose() = 0; // nothrow

    // destroy() is called when weak_count_ drops to zero.

    virtual void destroy() // nothrow
    {
        delete this;
    }

    virtual void * get_deleter( std::type_info const & ti ) = 0;

    void add_ref_copy()
    {
        atomic_increment( &use_count_ );
    }

    bool add_ref_lock() // true on success
    {
        return atomic_conditional_increment( &use_count_ ) != 0;
    }

    void release() // nothrow
    {
        if( atomic_exchange_and_add( &use_count_, -1 ) == 1 )
        {
            dispose();
            weak_release();
        }
    }

    void weak_add_ref() // nothrow
    {
        atomic_increment( &weak_count_ );
    }

    void weak_release() // nothrow
    {
        if( atomic_exchange_and_add( &weak_count_, -1 ) == 1 )
        {
            destroy();
        }
    }

    long use_count() const // nothrow
    {
        return static_cast<int const volatile &>( use_count_ );
    }
};

} // namespace detail

} // namespace boost

#endif  // #ifndef BOOST_DETAIL_SP_COUNTED_BASE_GCC_X86_HPP_INCLUDED

Rene Rivera wrote:
No.. just one that tests all three of the functions. The timing test you have only tests one of them.
No, it tests two of them. The push_back loop causes vector reallocations that copy and destroy the old contents. But increments are more frequent. If you put a v.resize( 0 ) after the loop, it will test an equal number of increments and decrements, but it will also time 2^23 int destructions and this will skew the results. Your std::fill idea with an empty pointer isn't bad, though. lock() should be tested in a weak_ptr_timing_test, but I haven't written one yet.
Well certainly an _mt timing test would be best. As it's not that useful to only time the non-contention cases.
There are two _mt tests that are also timing tests, you even reported crashes with one of them. ;-) The intent of shared_ptr_timing_test is precisely to time the uncontended case, because that's what I want to measure, the cost of MT safety when none is needed.
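As for the weak_ptr_timing_test mentioned above, a minimal sketch of what one might look like (a hypothetical file, not an existing Boost test; the iteration count is arbitrary):

#include <boost/shared_ptr.hpp>
#include <boost/weak_ptr.hpp>

#include <iostream>
#include <ctime>

int main()
{
    boost::shared_ptr<int> p( new int( 0 ) );
    boost::weak_ptr<int> w( p );

    int const n = 16 * 1024 * 1024;

    std::clock_t t = std::clock();

    long hits = 0;

    for( int i = 0; i < n; ++i )
    {
        boost::shared_ptr<int> q = w.lock();   // exercises atomic_conditional_increment,
                                               // plus a decrement when q goes out of scope
        if( q ) ++hits;
    }

    t = std::clock() - t;

    std::cout << hits << " locks, " << static_cast<double>( t ) / CLOCKS_PER_SEC << " seconds\n";
    return 0;
}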

Peter Dimov wrote:
Rene Rivera wrote:
No.. just one that tests all three of the functions. The timing test you have only tests one of them.
No, it tests two of them.
Really? I'm not so sure, because when I looked at the disassembly of the test I only saw "xadd" from one of the functions. Didn't see an "inc" from the other one.
Well certainly an _mt timing test would be best. As it's not that useful to only time the non-contention cases.
There are two _mt tests that are also timing tests, you even reported crashes with one of them. ;-)
Didn't realize they spit out timing information :-) Here's the timing info for the weak_ptr test..

CodeWarrior 8.3...
sp_counted_base_w32.hpp, 17.555
sp_counted_base_nt.hpp, 17.355
sp_counted_base_cw_x86.hpp, 17.215

VC 7.1...
sp_counted_base_w32.hpp, 13.259
sp_counted_base_nt.hpp, 13.319
The intent of shared_ptr_timing_test is precisely to time the uncontended case, because that's what I want to measure, the cost of MT safety when none is needed.
Oh, I see. So you really need that weak_ptr_timing_test ;-)

On Thu, Apr 14, 2005 at 10:00:54AM -0500, Rene Rivera wrote:
Peter Dimov wrote:
Rene Rivera wrote:
OK, I tried translating the gcc_x86. But I just don't understand the GCC machine description language well enough. What does the assembly end up being for all those memory, cc, etc. ?
I don't think that CodeWarrior has any way of specifying whether an asm statement affects memory, clobbers a register, or should not be reordered. It's supposedly (and hopefully) smart enough to figure these things out on its own. ;-)
It is smart enough, and you have to jump some hoops to make it optimize assembly blocks.. But I also meant all the other GCC constructs that I don't know :-) For example MOVL is not x86 assembly, but some GCC construct, and I'm not totally sure yet if it's just the argument inverted version of MOV. This is what I came up with for the atomic_conditional_increment..
I think it's just 32-bit MOV in AT&T syntax ('l' is for long). Since it's AT&T syntax, the args are inverted. Google for "gcc asm syntax" and you'll get a plethora of pages on it, including converting to/from CW asm.

jon

--
"Yield to temptation, it may not pass your way again." - Robert Heinlein

Rene Rivera wrote:
Peter Dimov wrote:
I wonder if CodeWarrior supports the _Interlocked* family as intrinsic functions. If it does, we'll just need to enable the appropriate path in detail/interlocked.hpp for __MWERKS__.
Didn't find anything in the docs nor the CodeWarrior runtime sources (other than the instances in the win32 sdk) to suggest that it supports them. How would I go about changing interlocked.hpp to verify this either way? (I guess I could hardwire it, hm..)
Just hardwired it.. I'd say it definitely doesn't support the interlocked intrinsics; it errors with undefined symbols. It doesn't support "#pragma intrinsic" either, giving warnings about those.
participants (4):
- Joaquín Mª López Muñoz
- Jonathan Wakely
- Peter Dimov
- Rene Rivera