shared_ptr and BOOST_DISABLE_THREADS

Playing with a Spirit-based parser for a simple scripting language I noticed that #defining BOOST_DISABLE_THREADS resulted in a parser that was nearly 3 times faster. For reasons that aren't relevant to the discussion, this had to be an MT build, but the parser didn't need thread safety.

That made me wonder though: If I #define BOOST_DISABLE_THREADS in some translation units but not in others, what's going to happen? I suppose one way of summarizing it is this: does BOOST_DISABLE_THREADS change the type of shared_ptr that's made? If not, that seems like a rather serious problem to me - especially given the enormous difference in performance that I measured (optimized build with VC8 alpha).

What obvious thing am I overlooking? -cd

Carl Daniel wrote:
Playing with a Spirit-based parser for a simple scripting language I noticed that #defining BOOST_DISABLE_THREADS resulted in a parser that was nearly 3 times faster. For reasons that aren't relevant to the discussion, this had to be an MT build, but the parser didn't need thread safety.
That made me wonder though: If I #define BOOST_DISABLE_THREADS in some translation units but not in others, what's going to happen?
In theory? Undefined behavior because of ODR violations. In practice, on Windows the current version will mostly work "as expected" (but earlier versions deadlocked).
I suppose one way of summarizing it is this: does BOOST_DISABLE_THREADS change the type of shared_ptr that's made?
No, it does not.
If not, that seems like a rather serious problem to me - especially given the enormous difference in performance that I measured (optimized build with VC8 alpha).
What obvious thing am I overlooking?
Nothing. The implementation of shared_ptr needs a serious redesign WRT thread safety, which is planned but requires free time on my part, which is currently in short supply, although I've got some ordered.

Peter Dimov <pdimov@mmltd.net> wrote:
Nothing. The implementation of shared_ptr needs a serious redesign WRT thread safety, which is planned but requires free time on my part, which is currently in short supply, although I've got some ordered.
Is there any chance of using reference linking instead of a counter allocated on the heap, or at least some serious optimization (i.e. a memory pool) with regard to this heap allocation? B.

Bronek Kozicki schrieb:
Peter Dimov <pdimov@mmltd.net> wrote:
Nothing. The implementation of shared_ptr needs a serious redesign WRT thread safety, which is planned but requires free time on my part, which is currently in short supply, although I've got some ordered.
Is there any chance of using reference linking instead of a counter allocated on the heap, or at least some serious optimization (i.e. a memory pool) with regard to this heap allocation?
B.
Reference linking is really great stuff in (single-thread) theory, but unfortunately it seems to have lots of difficulties in the MT world. (At least this seemed to be the consensus opinion on a similar question of mine in c.l.c.moderated.) Daniel

On Fri, 23 Apr 2004 11:12:27 +0200, Daniel Krügler wrote:
Reference linking is really great stuff in (single-thread) theory, but unfortunately it seems to have lots of difficulties in the MT world. (At
Why are we trying to build thread-safety into shared_ptr? There is a good saying that drove C++ to the point where we are right now: "do not pay for what you do not use". If I'm willing to provide my own synchronization for shared pointers, or use them in a single-threaded environment, why would I need to pay for an extra heap allocation to make them play nicely in an MT world? Moreover, if I'm willing to use shared_ptr in an MT environment and use its synchronization features, it does not provide synchronization for the pointee anyway. Thus I still need to provide my own synchronization to protect the state of the pointee. B.

From: "Bronek Kozicki"
On Fri, 23 Apr 2004 11:12:27 +0200, Daniel Krügler wrote:
Reference linking is really great stuff in (single-thread) theory, but unfortunately it seems to have lots of difficulties in the MT world. (At
Why are we trying to build thread-safety into shared_ptr? There is a good saying that drove C++ to the point where we are right now: "do not pay for what you do not use". If I'm willing to provide my own synchronization for shared pointers,
You can't. Not unless you control the entire program and you can determine with certainty which shared_ptr instances share ownership at every particular moment. Well, you could do it with a global mutex just to prove the point, I suppose.
or use them in a single-threaded environment, why would I need to pay for an extra heap allocation to make them play nicely in an MT world? Moreover, if I'm willing to use shared_ptr in an MT environment and use its synchronization features, it does not provide synchronization for the pointee anyway. Thus I still need to provide my own synchronization to protect the state of the pointee.
Pointee synchronization is something that you can do yourself. And you should, because only you know the locking granularity that is required by the pointee.

Peter Dimov <pdimov@mmltd.net> wrote:
Why are we trying to build thread-safety into shared_ptr? There is a good saying that drove C++ to the point where we are right now: "do not pay for what you do not use". If I'm willing to provide my own synchronization for shared pointers,
You can't. Not unless you control the entire program and you can determine with certainty which shared_ptr instances share ownership at every particular moment.
It occurred to me after sending this message. However, I'm still unhappy with the extra heap allocation, as in my environment (MSVC) the heap manager is not very fast. There are 3rd-party products like SmartHeap, unfortunately quite expensive :( B.

On Apr 24, 2004, at 2:04 PM, Peter Dimov wrote:
Why are we trying to build thread-safety into shared_ptr? There is a good saying that drove C++ to the point where we are right now: "do not pay for what you do not use". If I'm willing to provide my own synchronization for shared pointers,
You can't. Not unless you control the entire program and you can determine with certainty which shared_ptr instances share ownership at every particular moment.
Actually I'm not convinced about this. I have the same concerns as Bronek. They are only concerns. I am not sure of myself. But here's a data point:

In the Metrowerks std::tr1::shared_ptr I give the option to the client whether or not shared_ptr contains a mutex (via a #define flag). This decision can be made independent of whether the entire C++ lib is compiled in multithread mode or not. And at the same time I use shared_ptr in the implementation of std::locale.

In the std::locale implementation I am careful to wrap each use of the locale implementation with a mutex lock. I believe I would have to do this whether or not shared_ptr protected its count. And yet std::locale is just a library, not an application.

Perhaps a more correct statement is that if you expose a shared_ptr as part of your interface, and advertise that multiple threads can operate on copies of that exposed shared_ptr, then shared_ptr must be internally protected. But if your shared_ptr is used as an implementation detail within your library, an internal mutex is not necessarily beneficial, and may even be a handicap. -Howard

On Sat, 24 Apr 2004 19:20:00 -0400 Howard Hinnant <hinnant@twcny.rr.com> wrote:
Actually I'm not convinced about this. I have the same concerns as Bronek. They are only concerns. I am not sure of myself. But here's a data point:
I do not buy that argument either.
Perhaps a more correct statement is that if you expose a shared_ptr as part of your interface, and advertise that multiple threads can operate on copies of that exposed shared_ptr, then shared_ptr must be internally protected. But if your shared_ptr is used as an implementation detail within your library, an internal mutex is not necessarily beneficial, and may even be a handicap.
I believe this is actually easier than specified, if the locking is part of the type. For instance, the ACE+TAO philosophy is to make synchronization a template parameter. While there may be some other issues with this approach, it allows classes to be used as they are needed. The programmer can use both MT and ST objects, as they are required. I have found this pattern very beneficial, and believe the MT issues of shared_ptr to be problematic at best.

From: "Howard Hinnant"
On Apr 24, 2004, at 2:04 PM, Peter Dimov wrote:
Why are we trying to build thread-safety into shared_ptr? There is a good saying that drove C++ to the point where we are right now: "do not pay for what you do not use". If I'm willing to provide my own synchronization for shared pointers,
You can't. Not unless you control the entire program and you can determine with certainty which shared_ptr instances share ownership at every particular moment.
Actually I'm not convinced about this. I have the same concerns as Bronek. They are only concerns. I am not sure of myself. But here's a data point:
In the Metrowerks std::tr1::shared_ptr I give the option to the client whether or not shared_ptr contains a mutex (via a #define flag). This decision can be made independent of whether the entire C++ lib is compiled in multithread mode or not. And at the same time I use shared_ptr in the implementation of std::locale.
In the std::locale implementation I am careful to wrap each use of the locale implementation with a mutex lock. I believe I would have to do this whether or not shared_ptr protected its count. And yet std::locale is just a library, not an application.
I am not sure whether you can skip the count synchronization here. Even though you do not expose a shared_ptr directly, the user can still copy a std::locale at will. When two threads copy the same std::locale at the same time, a non-synchronized count leads to undefined behavior. Note that there's no mutex requirement. But you need the atomic updates.
Perhaps a more correct statement is that if you expose a shared_ptr as part of your interface, and advertise that multiple threads can operate on copies of that exposed shared_ptr, then shared_ptr must be internally protected.
It doesn't matter whether the user sees the shared_ptr or not. What matters is whether it can be copied by two threads (or copied by one thread and destroyed in another) at the same time. Which I believe is the case in your std::locale example. But I may be wrong because std::locale isn't one of my areas of expertise.

On Apr 25, 2004, at 6:52 AM, Peter Dimov wrote:
In the std::locale implementation I am careful to wrap each use of the locale implementation with a mutex lock. I believe I would have to do this whether or not shared_ptr protected its count. And yet std::locale is just a library, not an application.
I am not sure whether you can skip the count synchronization here. Even though you do not expose a shared_ptr directly, the user can still copy a std::locale at will. When two threads copy the same std::locale at the same time, a non-synchronized count leads to undefined behavior.
I don't think I was clear. The access is synchronized. The synchronization happens at the locale level, making the synchronization at the shared_ptr level redundant. I need to synchronize more than just the counts in the shared_ptr, namely access to the stuff that the shared_ptr is pointing to. -Howard

From: "Howard Hinnant"
On Apr 25, 2004, at 6:52 AM, Peter Dimov wrote:
In the std::locale implementation I am careful to wrap each use of the locale implementation with a mutex lock. I believe I would have to do this whether or not shared_ptr protected its count. And yet std::locale is just a library, not an application.
I am not sure whether you can skip the count synchronization here. Even though you do not expose a shared_ptr directly, the user can still copy a std::locale at will. When two threads copy the same std::locale at the same time, a non-synchronized count leads to undefined behavior.
I don't think I was clear. The access is synchronized. The synchronization happens at the locale level, making the synchronization at the shared_ptr level redundant. I need to synchronize more than just the counts in the shared_ptr, namely access to the stuff that the shared_ptr is pointing to.
It is still not clear why you think that the count does not need separate synchronization. The count protection and the pointee protection are orthogonal. Count accesses happen on copy:

    handle a;

    // thread 1
    handle b(a);

    // thread 2
    handle c(a);

Whereas implementation accesses happen on mutable operations:

    // thread 1
    a.f(); // non-const

    // thread 2
    a.g(); // const or non-const

You can reuse the implementation mutex to synchronize copies as well, but that's inefficient; you'd be needlessly serializing copies and implementation access, as in:

    // thread 1
    handle b(a);

    // thread 2
    a.g(); // const

In fact, it isn't even clear to me why you need to synchronize access to the implementation at all, as std::locale is immutable; it has no non-const member functions. Perhaps some illustrative code can help me understand your point better.

On Apr 25, 2004, at 9:29 PM, Peter Dimov wrote:
In fact, it isn't even clear to me why you need to synchronize access to the implementation at all, as std::locale is immutable; it has no non-const member functions.
Ah, I think I see the confusion. locale is mutable. The copy ctor mutates the rhs (though not visibly) and the assignment mutates both sides. There's more refcounting running around under locale's hood than just the shared_ptr to the implementation. In addition each facet is individually refcounted (but not with shared_ptr). Even std::use_facet potentially mutates a locale (via lazy addition of a standard facet to the locale). This latter technique is one of the main (portable) ways typically used to drop the code size of HelloWorld down from astronomical. http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/PDTR18015.pdf -Howard

From: "Howard Hinnant"
On Apr 25, 2004, at 9:29 PM, Peter Dimov wrote:
In fact, it isn't even clear to me why you need to synchronize access to the implementation at all, as std::locale is immutable; it has no non-const member functions.
Ah, I think I see the confusion. locale is mutable. The copy ctor mutates the rhs (though not visibly) and the assignment mutates both sides. There's more refcounting running around under locale's hood than just the shared_ptr to the implementation. In addition each facet is individually refcounted (but not with shared_ptr). Even std::use_facet potentially mutates a locale (via lazy addition of a standard facet to the locale).
Yes, I see. It isn't very common for a class to physically (but not logically) mutate the rhs on copy. ;-) But I agree that in this case you'd gain nothing from separate count synchronization. Let's go a bit further. The obvious alternative to shared_ptr here, to me, is not a non-synchronized shared_ptr, but an embedded reference count (no need for separate count synchronization -> no need for separate count). Is there something else that I'm missing that makes intrusive counting unsuitable?

On Apr 26, 2004, at 7:15 PM, Peter Dimov wrote:
Let's go a bit further. The obvious alternative to shared_ptr here, to me, is not a non-synchronized shared_ptr, but an embedded reference count (no need for separate count synchronization -> no need for separate count). Is there something else that I'm missing that makes intrusive counting unsuitable?
Probably not. The DLL-proof characteristic of shared_ptr appealed to me so I jumped to reuse it. There's always another rewrite waiting in the wings. :-) Thanks for your comments. -Howard

Actually I'm not convinced about this. I have the same concerns as Bronek. They are only concerns. I am not sure of myself. But here's a data point:
In the Metrowerks std::tr1::shared_ptr I give the option to the client whether or not shared_ptr contains a mutex (via a #define flag). This decision can be made independent of whether the entire C++ lib is compiled in multithread mode or not. And at the same time I use shared_ptr in the implementation of std::locale.
In the std::locale implementation I am careful to wrap each use of the locale implementation with a mutex lock. I believe I would have to do this whether or not shared_ptr protected its count. And yet std::locale is just a library, not an application.
I've used shared_ptr a lot for reference-counted copy-on-write pimpls (which is basically what std::locale is, right?), and I'm not sure that you are right here - the problem is avoiding a race condition inside shared_ptr::~shared_ptr, which is pretty difficult to do correctly IMO. Here's my best attempt so far (note: not necessarily correct!); getting this right is distinctly non-trivial IMO, and I would much rather have shared_ptr thread safe to begin with:

    class pimpl
    {
        class implementation;
        shared_ptr<implementation> m_pimp;
    public:
        pimpl() : m_pimp(new implementation()) {}
        pimpl(const pimpl& p)
        {
            // compiler-generated version will compile but do the wrong thing:
            mutex::lock l(p.m_pimp->get_mutex());
            m_pimp = p.m_pimp;
        }
        pimpl& operator=(const pimpl& p)
        {
            // compiler-generated version will compile but do the wrong thing:
            mutex::lock l(m_pimp->get_mutex());
            m_pimp = p.m_pimp;
            return *this;
        }
        ~pimpl()
        {
            // this one is tricky:
            mutex::lock l(m_pimp->get_mutex());
            if (!m_pimp.unique())
            {
                shared_ptr<implementation> old;
                old.swap(m_pimp);
            }
            // when we get here m_pimp must be either empty or unique:
            assert(m_pimp.unique() || !m_pimp.get());
        }
    };

In contrast, the thread-safe version "just plain works", whereas in the idiom I've sketched above there are at least three potential traps for the unwary, which can only be detected if your code actually happens to hit a race condition (probably not until the product has shipped ;-) ).
Perhaps a more correct statement is that if you expose a shared_ptr as part of your interface, and advertise that multiple threads can operate on copies of that exposed shared_ptr, then shared_ptr must be internally protected. But if your shared_ptr is used as an implementation detail within your library, an internal mutex is not necessarily beneficial, and may even be a handicap.
Maybe, but let's not forget that shared_ptr doesn't need a mutex at all; what it needs is an atomic counter (actually less than that, because we never need the value of the counter, just whether it's less than/greater than, or equal to zero). There are plenty of platforms that support this natively, and these should expect a much smaller hit than a mutex would give. Having said all that, I think I would be in favour of something like:

    #ifdef BOOST_HAVE_THREADS
    #define BOOST_SP_DEFAULT true
    #else
    #define BOOST_SP_DEFAULT false
    #endif

    template <class T, bool threaded = BOOST_SP_DEFAULT>
    class shared_ptr;

Which I think I'm right in saying would be std conforming (because we're permitted additional defaulted template parameters, right?), and would also quash any potential ODR violations. We could even make the non-thread-safe version use a linked-list implementation if that really is so much faster, but please, don't change the default thread-safe behaviour! Thoughts? John.

On Apr 25, 2004, at 7:00 AM, John Maddock wrote:
Maybe, but let's not forget that shared_ptr doesn't need a mutex at all; what it needs is an atomic counter (actually less than that, because we never need the value of the counter, just whether it's less than/greater than, or equal to zero).
shared_ptr has two counts which sometimes must both be incremented in an atomic fashion.
template <class T, bool threaded = BOOST_SP_DEFAULT> class shared_ptr;
Which I think I'm right in saying would be std conforming (because we're permitted additional defaulted template parameters right?), and would also quash any potential ODR violations.
Actually no, it would not be standard conforming for a hypothetical std::shared_ptr, unless the standard granted specific permission for this one template. (or specified the defaulted template in the first place) http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/lwg-closed.html#94 -Howard

From: "Howard Hinnant"
On Apr 25, 2004, at 7:00 AM, John Maddock wrote:
Maybe, but let's not forget that shared_ptr doesn't need a mutex at all; what it needs is an atomic counter (actually less than that, because we never need the value of the counter, just whether it's less than/greater than, or equal to zero).
shared_ptr has two counts which sometimes must both be incremented in an atomic fashion.
Right, but thanks to Alexander Terekhov, this is already a solved problem. :-) I'll try to switch boost::shared_ptr to use atomic operations (on Windows at least) for the next release. Perhaps you missed the comp.std.c++ thread and my proof of concept implementation, available at http://www.pdimov.com/cpp/shared_count_x86_exp2.hpp

On Apr 25, 2004, at 9:34 PM, Peter Dimov wrote:
shared_ptr has two counts which sometimes must both be incremented in an atomic fashion.
Right, but thanks to Alexander Terekhov, this is already a solved problem. :-) I'll try to switch boost::shared_ptr to use atomic operations (on Windows at least) for the next release. Perhaps you missed the comp.std.c++ thread and my proof of concept implementation, available at
Somehow I think I knew this and had just completely forgotten about it (I actually already had your proof of concept implementation bookmarked). Thanks for the reminder! -Howard

Howard Hinnant wrote: [...]
Somehow I think I knew this and had just completely forgotten about it (I actually already had your proof of concept implementation bookmarked). Thanks for the reminder!
You're also doing PPC, right? WinXX's InterlockedIncrement, InterlockedDecrement, and InterlockedCompareExchange are fully fenced (a bidirectional "full stop" barrier for loads and stores, imposed on both compiler and hardware). You can do better than that. http://google.com/groups?selm=3F17EF6F.97295E7D%40web.de

decrement() shall have the effect of a release memory sync operation if the resulting value of the reference count is not zero; otherwise, this function shall have the effect of an acquire memory synchronization operation. For the compiler that still means "full stop", but not so for the hardware (or whoever does runtime reordering). increment() doesn't impose any reordering constraints.

decrement(msync::acq{, ...}) shall have the effect of an acquire memory sync operation if the resulting value of the reference count is zero; otherwise, the memory synchronization effect of this function is unspecified.

decrement(msync::rel{, ...}) shall have the effect of a release memory synchronization operation if the resulting value of the reference count is not zero; otherwise, the memory sync effect of this function is unspecified.

regards, alexander.

Bronek Kozicki schrieb:
Peter Dimov <pdimov@mmltd.net> wrote:
Nothing. The implementation of shared_ptr needs a serious redesign WRT thread safety, which is planned but requires free time on my part, which is currently in short supply, although I've got some ordered.
Are there any chance of using reference linking instead of counter allocated on heap,
Not likely. Reference linking offers little advantage in practice over the current implementation, even if we drop the fancy custom deleters and weak pointers to make it look better.
or at least some serious optimization (ie. memory pool) in regard to this heap allocation?
Serious memory allocation optimizations (thread-specific pools) require a considerable effort; there's also the problem of thread-specific cleanup on Win32, which requires either a helper DLL, or would tie shared_ptr to Boost.Threads. But depending on your seriousness, you may find the following alternatives helpful:

0. Make sure that your program is allocation-bound, performance-wise.

1. Make sure that your compiler doesn't already have a seriously optimized malloc. Many do, and the number is growing. If it does, attempts to "optimize" memory allocations by manually pooling may (and I've seen it happen) actually cause measurable slowdowns.

2. Replace malloc/global new with dlmalloc (or even a commercial alternative). The advantage is that you will automatically optimize every small-object allocation in your entire program; remember that every shared_ptr count allocation has a corresponding object allocation.

3. #define BOOST_SP_USE_QUICK_ALLOCATOR.

From: "Daniel Krügler"
Reference linking is really great stuff in (single-thread) theory, but unfortunatly it seems to have lots of difficulties in mt world.
Right, that too.

Bronek Kozicki <brok <at> rubikon.pl> writes:
Peter Dimov <pdimov <at> mmltd.net> wrote:
Nothing. The implementation of shared_ptr needs a serious redesign WRT thread safety, which is planned but requires free time on my part, which is currently in short supply, although I've got some ordered.
Are there any chance of using reference linking instead of counter allocated on heap, or at least some serious optimization (ie. memory pool) in regard to this heap allocation?
FWIW: I have a smart pointer class similar to the boost one, in which the ref_count class can contain the object, thus requiring one allocation only:

    template <class T>
    class ref_count
    {
        char data[sizeof(T)];
        long strong_count;
        long weak_count;
        // ....
    public:
        // ....
    };

There are two major disadvantages to this construction:

1. Quite elaborate code is needed to fake parameter forwarding.

2. The memory needed for T can be deallocated only when the memory for the counter is (i.e. when weak_count becomes zero).

Rogier

Rogier van Dalen wrote: [snip]
FWIW: I have a smart pointer class similar to the boost one, in which the ref_count class can contain the object, thus requiring one allocation only:

    template <class T>
    class ref_count
    {
        char data[sizeof(T)];
        long strong_count;
        long weak_count;
        // ....
    public:
        // ....
    };
The file */boost/files/managed_ptr/overhead_referent_vals.zip contains a simple multiple-inheritance equivalent parameterized on the overhead, i.e.

    template <typename Overhead, typename Referent>
    class overhead_referent_vals
        : public Overhead
        , public Referent
    {...};
There are two major disadvantages to this construction: 1. Quite elaborate code is needed to fake parameter forwarding.
This was done with Paul Mensonides' help using preprocessor magic as acknowledged in the file: managed_ptr_ctor_forwarder.hpp which is included in the zip file.
2. The memory needed for T can be deallocated only when the memory for the counter is (i.e. when weak_count becomes zero).
Hmm... that's as it should be, isn't it? After all, the refcount should stay around as long as the memory to which it refers. AFAICT, that's how shared_ptr works.
participants (10)
- Alexander Terekhov
- Bronek Kozicki
- Carl Daniel
- Daniel Krügler
- Howard Hinnant
- Jody Hagins
- John Maddock
- Larry Evans
- Peter Dimov
- Rogier van Dalen