[boost::shared_ptr] performance killer
Hello, we have an application where boost::shared_ptr is used extensively. After profiling with gprof I found that the shared_ptr destructor is very expensive. I used the following g++ compiler flags to compile: -Wall -ftemplate-depth-50 -fexceptions -fexpensive-optimizations -O3. Are there perhaps some macro definitions that can enable/disable some additional optimizations? With Kind Regards, Ovanes Markarian
On Mon, October 23, 2006 13:57, loufoque wrote:
Ovanes Markarian wrote:
we have an application where boost::shared_ptr is used extensively.
shared_ptr is rather slow, indeed. You should consider changing your design if the speed is a problem.
Well, I must admit that the problem was a difference in the loop variables: the counter for the shared_ptr case had one extra zero in its decimal value... What a shame! Sorry, guys! But I am happy now. With the default macros, shared_ptr is about 10% slower. If I use the quick allocator, the speed is only 8% slower. If I enable both macros (single-threaded and quick allocator), shared_ptr is quicker than the implementation without shared_ptr. My question now would be: do you plan, or is it already possible, to have a policy for specifying the threading model and allocator? With Kind Regards, Ovanes Markarian
Ovanes Markarian wrote:
With the default macros, shared_ptr is about 10% slower. If I use the quick allocator, the speed is only 8% slower. If I enable both macros (single-threaded and quick allocator), shared_ptr is quicker than the implementation without shared_ptr. My question now would be: do you plan, or is it already possible, to have a policy for specifying the threading model and allocator?
You can override the allocator used for the internal allocation of the control block by using the shared_ptr constructor taking three arguments (it's a relatively new feature) but judging by the numbers, there isn't much point in doing so, as the default allocator is good enough. 2% isn't a significant difference and it may well be the case that the default could be faster in real code. Currently there is no way to specify a threading policy.
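For illustration, a minimal sketch of the three-argument constructor described above, in its 1.34-style form. The Message type is made up for the example, and std::allocator merely stands in for whatever custom allocator you would actually supply.

    #include <boost/shared_ptr.hpp>
    #include <boost/checked_delete.hpp> // boost::checked_deleter
    #include <memory>                   // std::allocator, a stand-in here

    struct Message { int payload; };

    int main()
    {
        // shared_ptr(p, d, a): d destroys the pointee, a (rebound internally)
        // allocates and releases the control block holding the reference counts.
        boost::shared_ptr<Message> p(new Message,
                                     boost::checked_deleter<Message>(),
                                     std::allocator<Message>());
        return 0;
    }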
On Mon, October 23, 2006 15:25, Peter Dimov wrote:
Ovanes Markarian wrote:
With the default macros, shared_ptr is about 10% slower. If I use the quick allocator, the speed is only 8% slower. If I enable both macros (single-threaded and quick allocator), shared_ptr is quicker than the implementation without shared_ptr. My question now would be: do you plan, or is it already possible, to have a policy for specifying the threading model and allocator?
You can override the allocator used for the internal allocation of the control block by using the shared_ptr constructor taking three arguments (it's a relatively new feature) but judging by the numbers, there isn't much point in doing so, as the default allocator is good enough. 2% isn't a significant difference and it may well be the case that the default could be faster in real code.
Currently there is no way to specify a threading policy.
Well, I have messages consisting of multiple optional smaller object instances. I would like to try the approach described in A. Alexandrescu's Modern C++ Design: the Small Object Allocator. Unfortunately I cannot use it directly from the Loki library, since it does not have the std::allocator interface described in the C++ standard. I think 2% in such a small piece of software, which consists of, let's say, about 7 such small pieces, can turn into a 14% performance increase if a good allocation strategy can be found. With Kind Regards, Ovanes Markarian
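For illustration, a minimal sketch of the kind of adapter described above: a std::allocator-conforming wrapper that forwards to an arbitrary small-object pool. The pool_allocate/pool_deallocate hooks are hypothetical placeholders, not part of Loki or Boost.

    #include <cstddef>
    #include <new>      // placement new

    // Hypothetical hooks into the underlying small-object pool; the names
    // are placeholders for whatever pool implementation is actually used.
    void* pool_allocate(std::size_t bytes);
    void  pool_deallocate(void* p, std::size_t bytes);

    template<class T>
    class small_object_allocator
    {
    public:
        typedef T               value_type;
        typedef T*              pointer;
        typedef const T*        const_pointer;
        typedef T&              reference;
        typedef const T&        const_reference;
        typedef std::size_t     size_type;
        typedef std::ptrdiff_t  difference_type;

        template<class U> struct rebind { typedef small_object_allocator<U> other; };

        small_object_allocator() {}
        template<class U> small_object_allocator(const small_object_allocator<U>&) {}

        pointer address(reference x) const { return &x; }
        const_pointer address(const_reference x) const { return &x; }

        pointer allocate(size_type n, const void* /*hint*/ = 0)
        {
            return static_cast<pointer>(pool_allocate(n * sizeof(T)));
        }

        void deallocate(pointer p, size_type n)
        {
            pool_deallocate(p, n * sizeof(T));
        }

        size_type max_size() const { return size_type(-1) / sizeof(T); }

        void construct(pointer p, const T& v) { new (static_cast<void*>(p)) T(v); }
        void destroy(pointer p) { p->~T(); }
    };

    // All instances are interchangeable because they share one global pool.
    template<class T, class U>
    bool operator==(const small_object_allocator<T>&, const small_object_allocator<U>&) { return true; }
    template<class T, class U>
    bool operator!=(const small_object_allocator<T>&, const small_object_allocator<U>&) { return false; }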
Ovanes Markarian wrote:
Well, I have messages consisting of multiple optional smaller object instances. I would like to try the approach described in A. Alexandrescu's Modern C++ Design: the Small Object Allocator. Unfortunately I cannot use it directly from the Loki library, since it does not have the std::allocator interface described in the C++ standard.
I think 2% in such a small piece of software, which consists of, let's say, about 7 such small pieces, can turn into a 14% performance increase if a good allocation strategy can be found.
I wouldn't bet on that. The default allocator in g++ is reasonably good; it already has small block support built-in. Beating it by 14% will be quite a challenge in a real application (you could probably do it in a microbenchmark, though.) In addition, when multiple threads and CPU cores enter the picture, most simplistic small object allocators that use a single mutex perform rather badly. This includes quick_allocator. :-)
On Mon, October 23, 2006 15:52, Peter Dimov wrote:
Ovanes Markarian wrote:
Well, I have messages consisting of multiple optional smaller object instances. I would like to try the approach described in A. Alexandrescu's Modern C++ Design: the Small Object Allocator. Unfortunately I cannot use it directly from the Loki library, since it does not have the std::allocator interface described in the C++ standard.
I think 2% in such a small piece of software, which consists of, let's say, about 7 such small pieces, can turn into a 14% performance increase if a good allocation strategy can be found.
I wouldn't bet on that. The default allocator in g++ is reasonably good; it already has small block support built-in. Beating it by 14% will be quite a challenge in a real application (you could probably do it in a microbenchmark, though.) In addition, when multiple threads and CPU cores enter the picture, most simplistic small object allocators that use a single mutex perform rather badly. This includes quick_allocator. :-)
I know that in my case the system (without the use of shared_ptrs) gets really slow after a certain amount of time, which is not long (let's say it is measurable in minutes). The problem really is heap fragmentation. I really respect your huge know-how and rely on it to 90%, but as you know, in software development nothing works as expected at first, and I would like to do some real tests and measurements. I hope you are right, since I know what a pain it is to implement such a beast as a solid piece of code. I went through the shared_ptr (not shared_count) source of version 1.33.1 but did not find where it is possible to pass a quick allocator instance. Is this additional constructor going to appear in version 1.34? With Kind Regards, Ovanes Markarian
On Mon, October 23, 2006 15:25, Peter Dimov wrote:
Ovanes Markarian wrote:
With the default macros, shared_ptr is about 10% slower. If I use the quick allocator, the speed is only 8% slower. If I enable both macros (single-threaded and quick allocator), shared_ptr is quicker than the implementation without shared_ptr. My question now would be: do you plan, or is it already possible, to have a policy for specifying the threading model and allocator?
You can override the allocator used for the internal allocation of the control block by using the shared_ptr constructor taking three arguments (it's a relatively new feature) but judging by the numbers, there isn't much point in doing so, as the default allocator is good enough. 2% isn't a significant difference and it may well be the case that the default could be faster in real code.
It's not in 1.33.1, is it? It may not bring performance benefits; still, it is a useful feature to have. I recently worked with a real-time system that offered several heaps to allocate from, ranging from block-private to system-shared. Having the shared count allocated from a different heap than the object it guards is usually not a good idea in such systems. Best regards, Leon Mlakar
Leon Mlakar wrote:
You can override the allocator used for the internal allocation of the control block by using the shared_ptr constructor taking three arguments (it's a relatively new feature) but judging by the numbers, there isn't much point in doing so, as the default allocator is good enough. 2% isn't a significant difference and it may well be the case that the default could be faster in real code.
It's not in 1.33.1, is it? It may not bring performance benefits; still, it is a useful feature to have.
No, it isn't in 1.33.1. It will be in 1.34, though.
On Mon, October 23, 2006 16:03, Peter Dimov wrote:
Leon Mlakar wrote:
You can override the allocator used for the internal allocation of the control block by using the shared_ptr constructor taking three arguments (it's a relatively new feature) but judging by the numbers, there isn't much point in doing so, as the default allocator is good enough. 2% isn't a significant difference and it may well be the case that the default could be faster in real code.
It's not in 1.33.1, is it? It may not bring performance benefits; still, it is a useful feature to have.
No, it isn't in 1.33.1. It will be in 1.34, though.
But 1.33.1 uses it internally? I ask because after defining this macro I was able to measure some performance benefits. With Kind Regards, Ovanes Markarian
It's not in 1.33.1, is it? It may not bring performance benefits; still, it is a useful feature to have.
No, it isn't in 1.33.1. It will be in 1.34, though.
But 1.33.1 uses it internally? I ask because after defining this macro I was able to measure some performance benefits.
It is used internally, yes. You can check how it's done in detail/sp_counted_impl.hpp and detail/quick_allocator.hpp. However, underneath, the quick allocator still relies on the global operator new (for allocating pages) and uses whatever heap the global new/delete operators use. That's okay for single-heap environments; it just cannot be used when multiple heaps enter the picture. In published versions it is possible to specify a deleter for the contained object, but not a custom allocator/deleter for the shared count. Therefore, the shared count may end up in a different heap than the contained object, and that could lead to visibility problems. As a matter of fact, I will be interested to see how this was done. We tried to modify the code in this direction but ran into some problems, and since there was pressure to deliver (as there always is) we then used a backup plan and guarded the (infrequently used) shared heap by other means. Best regards, Leon Mlakar
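For illustration, a minimal sketch of the distinction being made: in 1.33.1 a custom deleter can already route the pointee through a chosen heap, but the shared count is still allocated by shared_ptr's default mechanism; only the 1.34 three-argument form covers both. heap_allocate/heap_free, heap_deleter, and Message are hypothetical placeholders.

    #include <boost/shared_ptr.hpp>
    #include <cstddef> // std::size_t
    #include <new>     // placement new

    struct Message { int payload; };

    // Hypothetical hooks into one specific heap; placeholder names only.
    void* heap_allocate(std::size_t bytes);
    void  heap_free(void* p);

    struct heap_deleter
    {
        void operator()(Message* p) const
        {
            p->~Message();  // run the destructor
            heap_free(p);   // return the memory to the chosen heap
        }
    };

    int main()
    {
        // Possible in 1.33.1: the pointee lives in the custom heap, but the
        // shared count is still allocated via operator new / quick_allocator.
        void* raw = heap_allocate(sizeof(Message));
        boost::shared_ptr<Message> p(new (raw) Message, heap_deleter());
        return 0;
    }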
Leon Mlakar wrote:
As a matter of fact, I will be interested to see how this was done. We tried to modify the code in this direction but ran into some problems, and since there was pressure to deliver (as there always is) we then used a backup plan and guarded the (infrequently used) shared heap by other means.
You can take a look at the commits http://boost.cvs.sourceforge.net/boost/boost/boost/shared_ptr.hpp?r1=1.58&r2=1.59 http://boost.cvs.sourceforge.net/boost/boost/boost/detail/shared_count.hpp?r1=1.47&r2=1.48 http://boost.cvs.sourceforge.net/boost/boost/boost/detail/sp_counted_impl.hpp?r1=1.2&r2=1.3 where this support was added, or you could download the RC of 1.34 if there were an obvious place to do that. :-) People that are more resourceful than me have found it at: http://engineering.meta-comm.com/boost/snapshot/boost-RC_1_34_0.tar.bz2
Ovanes Markarian wrote:
Hello,
we have an application where boost::shared_ptr is used extensively. After profiling with gprof I found that the shared_ptr destructor is very expensive. I used the following g++ compiler flags to compile:
-Wall -ftemplate-depth-50 -fexceptions -fexpensive-optimizations -O3
It's not a good idea to use gprof to profile C++ applications. Try callgrind.
Are there perhaps some macro definitions that can enable/disable some additional optimizations?
BTW, you did not say which platform you are on. - Volodya
On Mon, October 23, 2006 14:08, Vladimir Prus wrote:
Ovanes Markarian wrote:
Hello,
we have an application where boost::shared_ptr is used extensively. After profiling with gprof I found that the shared_ptr destructor is very expensive. I used the following g++ compiler flags to compile:
-Wall -ftemplate-depth-50 -fexceptions -fexpensive-optimizations -O3
It's not a good idea to use gprof to profile C++ applications. Try callgrind.
Are there perhaps some macro definitions that can enable/disable some additional optimizations?
BTW, you did not say which platform you are on.
- Volodya
I used callgrind as well, and it shows that the call to the shared_ptr destructor is pretty expensive. I created some 100,000 object instances, and the destructor was called 370,000 times. I assume the problem is in the virtual function calls of the destructor, which cannot be inlined. Are there perhaps some alternatives, like defining one's own shared_counter policy? I am using Red Hat with kernel 2.6.9-22 and g++ 3.4.4. Apropos: this is addressed to Vladimir personally and is not related to this post. Did you see my posting pointing out a small bug in the current (1.33.1) and CVS versions of program_options? Formatting there fails sometimes. With Kind Regards, Ovanes Markarian
Ovanes Markarian wrote:
Hello,
we have an application where boost::shared_ptr is used extensively. After profiling with gprof I found that the shared_ptr destructor is very expensive. I used the following g++ compiler flags to compile:
-Wall -ftemplate-depth-50 -fexceptions -fexpensive-optimizations -O3
Are there perhaps some macro definitions that can enable/disable some additional optimizations?
First, you need to determine where the time is being spent. Typically, ~shared_ptr for the last instance performs two atomic decrements and two 'delete p' operations; the first delete calls the destructor of your object, the second destroys the control block, and there are two calls to operator delete. What is your platform? You may try to play with #define BOOST_SP_USE_QUICK_ALLOCATOR and see if it helps. See shared_ptr_alloc_test.cpp for an example. If your application is single threaded, you can try BOOST_SP_DISABLE_THREADS (although I don't believe that the problem is with the atomics; it's more likely that 'delete' is an expensive operation for some reason.)
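For illustration, a minimal sketch of how these configuration macros are applied: they must be visible before any Boost smart-pointer header is included, and defined identically in every translation unit (for example via -D on the compiler command line).

    // Configuration switches of Boost.SmartPtr; they affect the header-only
    // shared_ptr implementation, so defining them in the application is enough,
    // as long as every translation unit sees the same definitions.
    #define BOOST_SP_DISABLE_THREADS      // drop atomic ops: single-threaded use only
    #define BOOST_SP_USE_QUICK_ALLOCATOR  // allocate the count via quick_allocator

    #include <boost/shared_ptr.hpp>

    int main()
    {
        boost::shared_ptr<int> p(new int(42)); // control block comes from quick_allocator
        return 0;
    }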
On Mon, October 23, 2006 14:13, Peter Dimov wrote:
Ovanes Markarian wrote:
Hello,
we have an application where boost::shared_ptr is used extensively. After profiling with gprof I found that the shared_ptr destructor is very expensive. I used the following g++ compiler flags to compile:
-Wall -ftemplate-depth-50 -fexceptions -fexpensive-optimizations -O3
Are there perhaps some macro definitions that can enable/disable some additional optimizations?
First, you need to determine where the time is being spent. Typically, ~shared_ptr for the last instance performs two atomic decrements and two 'delete p' operations; the first delete calls the destructor of your object, the second destroys the control block, and there are two calls to operator delete.
What is your platform? You may try to play with #define BOOST_SP_USE_QUICK_ALLOCATOR and see if it helps. See shared_ptr_alloc_test.cpp for an example. If your application is single threaded, you can try BOOST_SP_DISABLE_THREADS (although I don't believe that the problem is with the atomics; it's more likely that 'delete' is an expensive operation for some reason.)
Peter, thanks for your reply. Regarding the platform etc., I answered in my parallel post to Vladimir. Should I define these macros before building the Boost libraries, or only in my app? I assume my app would be enough, since shared_ptr does not link against any external Boost libraries. Is there somewhere I can read about the quick allocator that is used? I am going to write my own allocator to avoid heap fragmentation and would like to see how the Boost allocator works. With Kind Regards, Ovanes Markarian
participants (5)
- Leon Mlakar
- loufoque
- Ovanes Markarian
- Peter Dimov
- Vladimir Prus