
Hi, Just for the record, I have benchmarked block_ptr using:
https://svn.boost.org/svn/boost/sandbox/block_ptr/libs/smart_ptr/example/ben...
And I get the following:
auto_ptr: 9208870 ns
shared_ptr: 22958516 ns
block_ptr: 75860365 ns
Which is pretty good given the complexity it goes through. I was wondering if there is any hope of having a pool::ordered_malloc() call that is O(1) instead of O(n), and a faster pool::is_from()? The latter could simply check whether a pointer lies within the reserved memory pages of the pool. I tried to talk to the author of pool but I got no response. Thanks, -Phil

On 24 May 2011 17:40, Phil Bouchard <philippe@fornux.com> wrote:
Hi,
Just for the record, I have benchmarked block_ptr using:
https://svn.boost.org/svn/boost/sandbox/block_ptr/libs/smart_ptr/example/ben...
And I get the following:
auto_ptr: 9208870 ns
shared_ptr: 22958516 ns
block_ptr: 75860365 ns
Which is pretty good given the complexity it goes through.
Your claim is "It is as fast as the popular smart pointer boost::shared_ptr<T>". Yet, in single-threaded code with shared_ptr using new instead of make_shared, block_ptr still takes 3.3x as long as shared_ptr. That is *a lot* of overhead... -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

Your claim is "It is as fast as the popular smart pointer boost::shared_ptr<T>". Yet, in single-threaded code with shared_ptr using new instead of make_shared, block_ptr still takes 3.3x as long as shared_ptr.
That is *a lot* of overhead...
It would be interesting to see a comparison of random access pattern times via block_ptr vs shared_ptr (both with and without make_shared). If I've grokked the block_ptr implementation details correctly (low probability but possible), I'd expect to see the comparison be more favorable for block_ptr than the case Phil reported. - Rhys

On 5/24/2011 4:07 PM, Nevin Liber wrote:
Your claim is "It is as fast as the popular smart pointer boost::shared_ptr<T>". Yet, in single-threaded code with shared_ptr using new instead of make_shared, block_ptr still takes 3.3x as long as shared_ptr.
That is *a lot* of overhead...
Yes, but this is a worst-case scenario, because assigning a pointer from the data segment that is the last one pointing to an object on the heap goes through more subroutines. If the pointer were already on the heap then it would be faster. What really matters is the constant complexity. But you're right about the docs; I'll correct them. My point is to optimize ordered_malloc() and is_from(), and I was wondering if there is any hope that can happen? Thanks, -Phil

On 5/24/2011 4:07 PM, Nevin Liber wrote:
Your claim is "It is as fast as the popular smart pointer boost::shared_ptr<T>". Yet, in single-threaded code with shared_ptr using new instead of make_shared, block_ptr still takes 3.3x as long as shared_ptr.
That is *a lot* of overhead...
I just tested it using make_shared & make_block and I get:

make:
auto_ptr: 11109841 ns
shared_ptr: 21215277 ns
block_ptr: 143637475 ns

new:
auto_ptr: 4583447 ns
shared_ptr: 10675000 ns
block_ptr: 67152785 ns

FYI make_shared is slower than new because of the temporary it creates. If people want speed they should stick to operator new in all cases. -Phil

On Tue, May 24, 2011 at 4:52 PM, Phil Bouchard <philippe@fornux.com> wrote:
On 5/24/2011 4:07 PM, Nevin Liber wrote:
Your claim is "It is as fast as the popular smart pointer boost::shared_ptr<T>". Yet, in single-threaded code with shared_ptr using new instead of make_shared, block_ptr still takes 3.3x as long as shared_ptr.
That is *a lot* of overhead...
I just tested it using make_shared & make_block and I get:
make: auto_ptr: 11109841 ns shared_ptr: 21215277 ns block_ptr: 143637475 ns
new: auto_ptr: 4583447 ns shared_ptr: 10675000 ns block_ptr: 67152785 ns
FYI make_shared is slower than new because of the temporary it creates. If people want speed they should stick to operator new in all cases.
In my experience any assumptions about speed do more harm than good. It's probably best to pick the semantics you want (do you want the control block allocated with the object or not), and only worry about speed when it is known to be a problem. Emil Dotchevski Reverge Studios, Inc. http://www.revergestudios.com/reblog/index.php?n=ReCode

On 25 May 2011 00:52, Phil Bouchard <philippe@fornux.com> wrote:
On 5/24/2011 4:07 PM, Nevin Liber wrote:
Your claim is "It is as fast as the popular smart pointer boost::shared_ptr<T>". Yet, in single-threaded code with shared_ptr using new instead of make_shared, block_ptr still takes 3.3x as long as shared_ptr.
That is *a lot* of overhead...
I just tested it using make_shared & make_block and I get:
make: auto_ptr: 11109841 ns shared_ptr: 21215277 ns block_ptr: 143637475 ns
new: auto_ptr: 4583447 ns shared_ptr: 10675000 ns block_ptr: 67152785 ns
FYI make_shared is slower than new because of the temporary it creates. If people want speed they should stick to operator new in all cases.
This is surprising. The docs - under "Best Practices" - suggest using make_shared(), because: [1]

"Besides convenience and style, such a function is also exception safe and considerably faster because it can use a single allocation for both the object and its corresponding control block, eliminating a significant portion of shared_ptr's construction overhead. This eliminates one of the major efficiency complaints about shared_ptr."

I wonder if your test case prevents your compiler from performing RVO, or whether the docs might be incorrect. I wonder what the results look like if you instead test:

```
template <typename T, T (*P)()>
void worker_make()
{
    for (int i = 0; i < 100000; ++i)
        T p = P();
}
```

-- Darren

[1] http://www.boost.org/doc/libs/1_46_1/libs/smart_ptr/make_shared.html

On 5/24/2011 5:33 PM, Darren Garvey wrote:
On 25 May 2011 00:52, Phil Bouchard <philippe@fornux.com> wrote:
I just tested it using make_shared & make_block and I get:
make: auto_ptr: 11109841 ns shared_ptr: 21215277 ns block_ptr: 143637475 ns
new: auto_ptr: 4583447 ns shared_ptr: 10675000 ns block_ptr: 67152785 ns
I wonder if it's that your test case prevents your compiler from performing RVO, or that the docs might be incorrect. I wonder what the results look like if you instead test:
```
template <typename T, T (*P)()>
void worker_make()
{
    for (int i = 0; i < 100000; ++i)
        T p = P();
}
```

I get a slight improvement:

make:
auto_ptr: 12849529 ns
shared_ptr: 18574307 ns
block_ptr: 126387234 ns

-Phil

On 25/05/2011 02:33, Darren Garvey wrote:
```
template <typename T, T (*P)()>
void worker_make()
{
    for (int i = 0; i < 100000; ++i)
        T p = P();
}
```
This is a fairly bad way to write a benchmark, and is likely to yield somewhat incorrect results. You should bench based on an external condition, such as time, rather than making a fixed number of iterations.

On 5/25/2011 5:10 AM, Mathias Gaunard wrote:
This is a fairly bad way to write a benchmark, and is likely to yield somewhat incorrect results.
It is based on a fixed amount of work and returns an estimate of time.
You should bench based on an external condition, such as time, rather than making a fixed number of iterations.
You mean: it should return the number of iterations completed in a fixed interval of time. I don't think it'll be as precise. -Phil

On 25/05/2011 18:19, Phil Bouchard wrote:
On 5/25/2011 5:10 AM, Mathias Gaunard wrote:
This is a fairly bad way to write a benchmark, and is likely to yield somewhat incorrect results.
It is based on a fixed amount of work and returns an estimate of time.
You should bench based on an external condition, such as time, rather than making a fixed number of iterations.
You mean: it should return the amount of iterations based on a fixed interval of time. I don't think it'll be as precise.
No, it should return the median time (or better yet, the median number of cycles) taken by a single iteration, while running as many iterations as fit in a fixed interval of time. This is necessary to prevent the compiler from reordering things between iterations or from unrolling or optimizing out the loop, since the loop condition is a non-predictable branch.

On 24 May 2011 18:52, Phil Bouchard <philippe@fornux.com> wrote:
On 5/24/2011 4:07 PM, Nevin Liber wrote:
Your claim is "It is as fast as the popular smart pointer boost::shared_ptr<T>". Yet, in single-threaded code with shared_ptr using new instead of make_shared, block_ptr still takes 3.3x as long as shared_ptr.
That is *a lot* of overhead...
I just tested it using make_shared & make_block and I get:
make: auto_ptr: 11109841 ns shared_ptr: 21215277 ns block_ptr: 143637475 ns
new: auto_ptr: 4583447 ns shared_ptr: 10675000 ns block_ptr: 67152785 ns
FYI make_shared is slower than new because of the temporary it creates. If people want speed they should stick to operator new in all cases.
Heck, even auto_ptr is significantly faster using new instead of make_shared and make_block. Simply amazing! Wait, ... how exactly did you use make_shared and make_block in the auto_ptr case? And that isn't nearly as amazing as your assertion that an increment and decrement of a reference count (which is all that a shared_ptr "temporary" translates into) is not only slower, but *significantly* slower than a heap allocation/deallocation in the standard heap. -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On 5/24/2011 10:52 PM, Nevin Liber wrote:
Heck, even auto_ptr is significantly faster using new instead of make_shared and make_block. Simply amazing! Wait, ... how exactly did you use make_shared and make_block in the auto_ptr case?
auto_ptr uses make_auto in the make test, shared_ptr uses make_shared and block_ptr uses make_block. -Phil

On 25 May 2011 02:18, Phil Bouchard <philippe@fornux.com> wrote:
On 5/24/2011 10:52 PM, Nevin Liber wrote:
Heck, even auto_ptr is significantly faster using new instead of make_shared and make_block. Simply amazing! Wait, ... how exactly did you use make_shared and make_block in the auto_ptr case?
auto_ptr uses make_auto in the make test, shared_ptr uses make_shared and block_ptr uses make_block.
And what exactly is make_auto doing that is *much more* costly than a heap allocation? According to your numbers, it takes almost 2.5x longer than just a new. -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On 5/25/2011 8:00 AM, Nevin Liber wrote:
And what exactly is make_auto doing that is *much more* costly than a heap allocation? According to your numbers, it takes almost 2.5x longer than just a new.
The make factory functions create a temporary r-value which needs to be copied to the next l-value; this is why it is much slower. -Phil

On 25/05/2011 18:11, Phil Bouchard wrote:
On 5/25/2011 8:00 AM, Nevin Liber wrote:
And what exactly is make_auto doing that is *much more* costly than a heap allocation? According to your numbers, it takes almost 2.5x longer than just a new.
The make factory functions create a temporary r-value which needs to be copied to the next l-value; this is why it is much slower.
Are you sure you're doing your benchmarks with -O3?

On 5/25/2011 10:07 AM, Mathias Gaunard wrote:
Are you sure you're doing your benchmarks with -O3?
Yes:

gcc.compile.c++ bin/benchmark.test/gcc-4.1.2/release/benchmark.o
"g++" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -pedantic -fPIC -DNDEBUG -I"../../.." -I"/home/vnmr1/phil/boost_1_46_1" -c -o "bin/benchmark.test/gcc-4.1.2/release/benchmark.o" "benchmark.cpp"

-Phil

On 25 May 2011 11:11, Phil Bouchard <philippe@fornux.com> wrote:
On 5/25/2011 8:00 AM, Nevin Liber wrote:
And what exactly is make_auto doing that is *much more* costly than a heap allocation? According to your numbers, it takes almost 2.5x longer than just a new.
The make factory functions create a temporary r-value which needs to be copied to the next l-value; this is why it is much slower.
Is it copying anything more than a pointer? Are you really asserting that copying a pointer is far more work than a heap allocation?? Could you post some assembly on any platform backing your assertion? Just trying to understand, because to me, this seems totally absurd. You can talk all you want about r-values and l-values causing some mysterious slowdown, but at the end of the day, copying a pointer doesn't result in more than a few machine instructions, even unoptimized. -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On 5/25/2011 12:18 PM, Nevin Liber wrote:
Is it copying anything more than a pointer? Are you really asserting that copying a pointer is far more work than a heap allocation?? Could you post some assembly on any platform backing your assertion?
So we have shared_ptr being initialized with an object created by operator new; shared_ptr will then allocate a separate reference counter. Those 2 allocations are faster than the make_shared counterpart, so yes, copying a pointer is slower than 2 heap allocations. -Phil

On Wednesday, May 25, 2011, Phil Bouchard wrote:
So we have shared_ptr being initialized with an object created by operator new. shared_ptr will then allocate another reference counter. 2 allocations is faster than the make_shared counterpart so yes copying a pointer is slower than 2 heap allocations.
What version of boost are you referring to? make_shared used to be slow due to it doing a lot of unnecessary copying of the storage area for the pointee, but that should have been fixed probably a couple years back now. Copying a shared_ptr is fairly slow due to the atomic reference counting, but I would expect a compiler to be able to elide the copy of the make_shared return value.

On 5/25/2011 1:39 PM, Frank Mori Hess wrote:
What version of boost are you referring to?
boost_1_46_1
make_shared used to be slow due to it doing a lot of unnecessary copying of the storage area for the pointee, but that should have been fixed probably a couple years back now. Copying a shared_ptr is fairly slow due to the atomic reference counting, but I would expect a compiler to be able to elide the copy of the make_shared return value.
That optimization doesn't exist and it's worse with the Intel Compiler:

$ bin/benchmark.test/intel-linux/release/benchmark
make: auto_ptr: 9406481 ns shared_ptr: 24266241 ns block_ptr: 126199145 ns
new: auto_ptr: 4637891 ns shared_ptr: 11943452 ns block_ptr: 73166351 ns

$ bin/benchmark.test/gcc-4.1.2/release/benchmark
make: auto_ptr: 10980716 ns shared_ptr: 15384949 ns block_ptr: 118651908 ns
new: auto_ptr: 4608425 ns shared_ptr: 10733654 ns block_ptr: 69710242 ns

-Phil

On 25 May 2011 16:23, Phil Bouchard <philippe@fornux.com> wrote:
$ bin/benchmark.test/gcc-4.1.2/release/benchmark make: auto_ptr: 10980716 ns shared_ptr: 15384949 ns block_ptr: 118651908 ns
new: auto_ptr: 4608425 ns shared_ptr: 10733654 ns block_ptr: 69710242 ns
You have yet to explain why your make_auto is so much slower than new directly. It just isn't that expensive to copy one pointer and zero out another. I can't imagine what else your code can possibly be doing. I can't imagine how that can be more, let alone significantly more expensive than a heap allocation. Enlighten me. -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On 5/25/2011 3:27 PM, Nevin Liber wrote:
You have yet to explain why your make_auto is so much slower than new directly. It just isn't that expensive to copy one pointer and zero out another. I can't imagine what else your code can possibly be doing. I can't imagine how that can be more, let alone significantly more expensive than a heap allocation. Enlighten me.
Well, if you have a direct assignment coming from operator new, then the pointer can be transferred directly into a register. If you have temporary r-values to copy it through, then the transfer will be done from one memory location to another, and it won't be as fast as a register-to-register transfer. -Phil

On 25 May 2011 17:40, Phil Bouchard <philippe@fornux.com> wrote:
On 5/25/2011 3:27 PM, Nevin Liber wrote:
You have yet to explain why your make_auto is so much slower than new directly. It just isn't that expensive to copy one pointer and zero out another. I can't imagine what else your code can possibly be doing. I can't imagine how that can be more, let alone significantly more expensive than a heap allocation. Enlighten me.
Well, if you have a direct assignment coming from operator new, then the pointer can be transferred directly into a register. If you have temporary r-values to copy it through, then the transfer will be done from one memory location to another, and it won't be as fast as a register-to-register transfer.
And why do you believe that is *slower* (by almost 2.5x with your original numbers) than a heap allocation from the standard heap? Because that is what you keep asserting... -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

Phil Bouchard <philippe@fornux.com> wrote:
On 5/25/2011 3:27 PM, Nevin Liber wrote:
You have yet to explain why your make_auto is so much slower than new directly. It just isn't that expensive to copy one pointer and zero out another. I can't imagine what else your code can possibly be doing. I can't imagine how that can be more, let alone significantly more expensive than a heap allocation. Enlighten me.
Well, if you have a direct assignment coming from operator new, then the pointer can be transferred directly into a register. If you have temporary r-values to copy it through, then the transfer will be done from one memory location to another, and it won't be as fast as a register-to-register transfer.
That effect should be immeasurably small, or eliminated entirely by the compiler. I think you had better post your benchmark code. There must surely be something else going on here. Phil.

On 5/25/2011 4:15 PM, Phil Endecott wrote:
That effect should be immeasurably small, or eliminated entirely by the compiler.
I think you had better post your benchmark code. There must surely be something else going on here.
In debug mode I see that:

1) worker_name() run by auto_ptr calls the following on each iteration:
   - make_auto<int>()
   - operator new
   - auto_ptr::auto_ptr(int *)
   - operator auto_ptr_ref ()
   - auto_ptr::auto_ptr(auto_ptr_ref())
   - ~auto_ptr
   - auto_ptr_ref<int>() ...
   - auto_ptr<int>::operator = (auto_ptr_ref<int>) ...
   - auto_ptr<int>::~auto_ptr<int>() ...

2) worker_new() run by auto_ptr simply calls:
   - operator new
   - auto_ptr<int>::reset(int *)

I haven't debugged the release mode, but we can see there is a lot of work to do to optimize worker_name() run by auto_ptr. -Phil

On 5/25/2011 4:55 PM, Phil Bouchard wrote:
In debug mode I see that:
1) worker_name() run by auto_ptr calls the following on each iteration:
   - make_auto<int>()
   - operator new
   - auto_ptr::auto_ptr(int *)
   - operator auto_ptr_ref ()
   - auto_ptr::auto_ptr(auto_ptr_ref())
   - ~auto_ptr
   - auto_ptr_ref<int>() ...
   - auto_ptr<int>::operator = (auto_ptr_ref<int>) ...
   - auto_ptr<int>::~auto_ptr<int>() ...
2) worker_new() run by auto_ptr simply calls:
   - operator new
   - auto_ptr<int>::reset(int *)
I haven't debugged the release mode but we see that there is a lot of work to do to optimize worker_name() run by auto_ptr
I meant: worker_make, not worker_name. -Phil

On 26 May 2011 00:15, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Phil Bouchard <philippe@fornux.com> wrote:
On 5/25/2011 3:27 PM, Nevin Liber wrote:
You have yet to explain why your make_auto is so much slower than new directly. It just isn't that expensive to copy one pointer and zero out another. I can't imagine what else your code can possibly be doing. I can't imagine how that can be more, let alone significantly more expensive than a heap allocation. Enlighten me.
Well, if you have a direct assignment coming from operator new, then the pointer can be transferred directly into a register. If you have temporary r-values to copy it through, then the transfer will be done from one memory location to another, and it won't be as fast as a register-to-register transfer.
That effect should be immeasurably small, or eliminated entirely by the compiler.
I think you had better post your benchmark code. There must surely be something else going on here.
I believe he has already (modulo whether he incorporated the minor change I mentioned): https://svn.boost.org/svn/boost/sandbox/block_ptr/libs/smart_ptr/example/ben... I agree, though, that it isn't clear your code is representative of "real" code. You should try not using templates that take function pointers to do the work, and instead inline the code into the tests.* It would be interesting to see the assembly produced by your compiler (I'm not sure you've said which one you're using) in a fully optimised release build of the tests too, just to see what's going on under the hood. -- Darren

* If using the templates to do the tests impacts performance, that's still useful information, I think.

On 5/25/2011 5:42 PM, Darren Garvey wrote:
I believe he has already (modulo if he incorporated the minor change I mentioned): https://svn.boost.org/svn/boost/sandbox/block_ptr/libs/smart_ptr/example/ben...
I agree though, it isn't clear that your code is representative of "real" code. You should try not using templates - which take function pointers - to do the work and inline the code into the tests *.
It would be interesting to see the assembly produced by your compiler (I'm not sure you've said which one you're using) in a fully optimised release build of the tests too, just to see what's going on under the hood.
Sorry for the delay, but here is the assembler dump in release mode of both functions:

*** worker_make ***:

```
push   %r12
xor    %r12d,%r12d
push   %rbp
xor    %ebp,%ebp
push   %rbx
nopl   0x0(%rax)
mov    $0x4,%edi
callq  0x405288 <_Znwm@plt>    (operator new)
mov    %rax,%rbx
xor    %edi,%edi
callq  0x405058 <_ZdlPv@plt>   (operator delete)
cmp    %rbx,%r12
je     0x406464 <void worker_make<std::auto_ptr<int>, &(std::auto_ptr<int> make_auto<int>())>()+52>
mov    %r12,%rdi
mov    %rbx,%r12
callq  0x405058 <_ZdlPv@plt>   (operator delete)
xor    %edi,%edi
add    $0x1,%ebp
callq  0x405058 <_ZdlPv@plt>   (operator delete)
cmp    $0x186a0,%ebp
jne    0x406440 <void worker_make<std::auto_ptr<int>, &(std::auto_ptr<int> make_auto<int>())>()+16>
pop    %rbx
pop    %rbp
mov    %r12,%rdi
pop    %r12
jmpq   0x405058 <_ZdlPv@plt>   (operator delete)
mov    %rax,%rbx
mov    %r12,%rdi
callq  0x405058 <_ZdlPv@plt>   (operator delete)
mov    %rbx,%rdi
callq  0x405298 <_Unwind_Resume@plt>
```

*** worker_new ***:

```
push   %r12
xor    %r12d,%r12d
push   %rbp
xor    %ebp,%ebp
push   %rbx
nopl   0x0(%rax)
mov    $0x4,%edi
callq  0x405288 <_Znwm@plt>    (operator new)
cmp    %rax,%r12
mov    %rax,%rbx
je     0x405e4d <void worker_new<std::auto_ptr<int>, int>()+45>
mov    %r12,%rdi
mov    %rbx,%r12
callq  0x405058 <_ZdlPv@plt>   (operator delete)
add    $0x1,%ebp
cmp    $0x186a0,%ebp
jne    0x405e30 <void worker_new<std::auto_ptr<int>, int>()+16>
pop    %rbx
pop    %rbp
mov    %r12,%rdi
pop    %r12
jmpq   0x405058 <_ZdlPv@plt>   (operator delete)
mov    %rax,%rbx
mov    %r12,%rdi
callq  0x405058 <_ZdlPv@plt>   (operator delete)
mov    %rbx,%rdi
callq  0x405298 <_Unwind_Resume@plt>
```

-Phil

With the jump labels:

*** worker_make ***:

```
push   %r12
xor    %r12d,%r12d
push   %rbp
xor    %ebp,%ebp
push   %rbx
nopl   0x0(%rax)
<void worker_make<std::auto_ptr<int>, &(std::auto_ptr<int> make_auto<int>())>()+16>:
mov    $0x4,%edi
callq  0x405288 <_Znwm@plt>    (operator new)
mov    %rax,%rbx
xor    %edi,%edi
callq  0x405058 <_ZdlPv@plt>   (operator delete)
cmp    %rbx,%r12
je     0x406464 <void worker_make<std::auto_ptr<int>, &(std::auto_ptr<int> make_auto<int>())>()+52>
mov    %r12,%rdi
mov    %rbx,%r12
callq  0x405058 <_ZdlPv@plt>   (operator delete)
<void worker_make<std::auto_ptr<int>, &(std::auto_ptr<int> make_auto<int>())>()+52>:
xor    %edi,%edi
add    $0x1,%ebp
callq  0x405058 <_ZdlPv@plt>   (operator delete)
cmp    $0x186a0,%ebp
jne    0x406440 <void worker_make<std::auto_ptr<int>, &(std::auto_ptr<int> make_auto<int>())>()+16>
pop    %rbx
pop    %rbp
mov    %r12,%rdi
pop    %r12
jmpq   0x405058 <_ZdlPv@plt>   (operator delete)
mov    %rax,%rbx
mov    %r12,%rdi
callq  0x405058 <_ZdlPv@plt>   (operator delete)
mov    %rbx,%rdi
callq  0x405298 <_Unwind_Resume@plt>
```

*** worker_new ***:

```
push   %r12
xor    %r12d,%r12d
push   %rbp
xor    %ebp,%ebp
push   %rbx
nopl   0x0(%rax)
<void worker_new<std::auto_ptr<int>, int>()+16>:
mov    $0x4,%edi
callq  0x405288 <_Znwm@plt>    (operator new)
cmp    %rax,%r12
mov    %rax,%rbx
je     0x405e4d <void worker_new<std::auto_ptr<int>, int>()+45>
mov    %r12,%rdi
mov    %rbx,%r12
callq  0x405058 <_ZdlPv@plt>   (operator delete)
<void worker_new<std::auto_ptr<int>, int>()+45>:
add    $0x1,%ebp
cmp    $0x186a0,%ebp
jne    0x405e30 <void worker_new<std::auto_ptr<int>, int>()+16>
pop    %rbx
pop    %rbp
mov    %r12,%rdi
pop    %r12
jmpq   0x405058 <_ZdlPv@plt>   (operator delete)
mov    %rax,%rbx
mov    %r12,%rdi
callq  0x405058 <_ZdlPv@plt>   (operator delete)
mov    %rbx,%rdi
callq  0x405298 <_Unwind_Resume@plt>
```

-Phil

On 5/26/2011 3:02 PM, Phil Bouchard wrote:
*** worker_make ***:

```
push   %r12
xor    %r12d,%r12d
push   %rbp
xor    %ebp,%ebp
push   %rbx
nopl   0x0(%rax)
<void worker_make<std::auto_ptr<int>, &(std::auto_ptr<int> make_auto<int>())>()+16>:
mov    $0x4,%edi
callq  0x405288 <_Znwm@plt>    (operator new)
mov    %rax,%rbx
xor    %edi,%edi
callq  0x405058 <_ZdlPv@plt>   (operator delete)
```

...

On each iteration worker_make calls operator delete, which explains why it is twice as slow as worker_new. I'm not sure what it is deleting. -Phil

On 5/26/2011 4:19 PM, Phil Bouchard wrote:
I'm not sure what it is deleting.
If I examine worker_make used with block_ptr, it's obvious it creates temporaries all over the place and there is no such RVO:

```
push   %r15
xor    %r15d,%r15d
push   %r14
push   %r13
push   %r12
push   %rbp
push   %rbx
sub    $0x48,%rsp
lea    0x10(%rsp),%rax
mov    %rax,%rdi
mov    %rax,0x8(%rsp)
callq  0x408a30 <block_ptr>
lea    0x20(%rsp),%rax
mov    %rax,(%rsp)
mov    %rax,%rdi
nop
callq  0x4094f0 <boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>()>
callq  0x407c60 <boost::detail::bp::block_header::static_mutex()>
mov    %rax,%rdi
mov    %rax,%r14
callq  0x405258 <pthread_mutex_lock@plt>
test   %eax,%eax
jne    0x409d34 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+804>
mov    0x18(%rsp),%rdx
mov    0x8(%rdx),%rcx
mov    0x8(%rcx),%rax
cmp    %rax,%rcx
jne    0x409a74 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+100>
jmp    0x409a81 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+113>
mov    %rcx,%rax
mov    %rax,0x8(%rdx)
mov    0x8(%rax),%rcx
cmp    %rax,%rcx
jne    0x409a71 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+97>
mov    0x28(%rsp),%rsi
mov    0x8(%rsi),%rdx
mov    0x8(%rdx),%rax
cmp    %rax,%rdx
jne    0x409a98 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+136>
jmp    0x409aa5 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+149>
mov    %rdx,%rax
mov    %rax,0x8(%rsi)
mov    0x8(%rax),%rdx
cmp    %rax,%rdx
jne    0x409a95 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+133>
cmp    %rdx,%rcx
je     0x409ba9 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+409>
xor    %esi,%esi
mov    0x8(%rsp),%rdi
callq  0x408500 <boost::detail::bp::block_ptr<int>::release(bool)>
mov    0x28(%rsp),%rcx
mov    0x8(%rcx),%rdx
mov    0x8(%rdx),%rax
cmp    %rax,%rdx
jne    0x409ad3 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+195>
jmp    0x409ae0 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+208>
xchg   %ax,%ax
mov    %rdx,%rax
mov    %rax,0x8(%rcx)
mov    0x8(%rax),%rdx
cmp    %rax,%rdx
jne    0x409ad0 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+192>
mov    0x18(%rsp),%rcx
mov    0x8(%rcx),%rsi
mov    0x8(%rsi),%rax
cmp    %rax,%rsi
jne    0x409af7 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+231>
jmp    0x409b04 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+244>
mov    %rsi,%rax
mov    %rax,0x8(%rcx)
mov    0x8(%rax),%rsi
cmp    %rax,%rsi
jne    0x409af4 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+228>
cmp    %rsi,%rdx
je     0x409ba9 <void worker_make<boost::detail::bp::block_ptr<int>, &(boost::detail::bp::block_ptr<int> boost::detail::bp::make_block<int>())>()+409>
lea    0x28(%rsi),%r8
```

... -Phil

Phil Bouchard wrote:
On 5/26/2011 3:02 PM, Phil Bouchard wrote: ...
xor    %edi,%edi
callq  0x405058 <_ZdlPv@plt>   (operator delete)
...
On each iteration worker_make calls delete explaining why it is twice slower than worker_new. I'm not sure what it is deleting.
It looks like it calls delete with NULL. This is probably the destructor of the temporary auto_ptr, whose pointer is NULL after the assignment. Does your test use rvalue references by the way (-std=c++0x)? shared_ptr has a move assignment and should be able to take advantage of that to avoid some of the overhead imposed by the remaining temporary (-ies). Also, the trunk/release make_shared should be better than the one in 1.46.1 (although I don't know how relevant is this for such a small pointee as int.)

On 5/26/2011 5:15 PM, Peter Dimov wrote:
It looks like it calls delete with NULL. This is probably the destructor of the temporary auto_ptr, whose pointer is NULL after the assignment.
Does your test use rvalue references by the way (-std=c++0x)?
gcc 4.1.2 doesn't support -std=c++0x. -Phil

On 5/26/2011 5:15 PM, Peter Dimov wrote:
Does your test use rvalue references by the way (-std=c++0x)?
The Intel Compiler supports it but I get the same results:

make: auto_ptr: 9268026 ns shared_ptr: 22938761 ns block_ptr: 122800172 ns
new: auto_ptr: 4598675 ns shared_ptr: 12218498 ns block_ptr: 65766473 ns

-Phil

On Wednesday, May 25, 2011, Phil Bouchard wrote:
On 5/25/2011 1:39 PM, Frank Mori Hess wrote:
What version of boost are you referring to?
boost_1_46_1
make_shared used to be slow due to it doing a lot of unnecessary copying of the storage area for the pointee, but that should have been fixed probably a couple years back now. Copying a shared_ptr is fairly slow due to the atomic reference counting, but I would expect a compiler to be able to elide the copy of the make_shared return value.
That optimization doesn't exist and it's worse with the Intel Compiler:
Well, I was thinking of the case where you are initializing a newly declared shared_ptr as opposed to re-assigning the same shared_ptr over and over again. Running a trimmed down and tweaked version of your benchmark (attached) I get:

$ g++ -O3 -Wall pb_benchmark.cpp -lrt
fhess@tailpipe:~/test$ ./a.out
new: shared_ptr: 17500558 ns
make: shared_ptr: 12208956 ns

Note, I moved the declaration of the shared_ptrs inside the loops. Also, your benchmark on my machine shows a strong dependence on the order the tests are run. The first run is slower, which is why the "new" result is before the "make" result above, to demonstrate the effect.

On 5/26/2011 4:44 PM, Frank Mori Hess wrote:
Note, I moved the declaration of the shared_ptrs inside the loops. Also, your benchmark on my machine shows a strong dependence on the order the tests are run. The first run is slower, which is why the "new" result is before the "make" result above, to demonstrate the effect.
You're right but make remains the same:

1)
new:  shared_ptr: 21234066 ns
make: shared_ptr: 18122766 ns

2)
make: shared_ptr: 18271141 ns
new:  shared_ptr: 13395539 ns

But after analyzing the assembler dump of worker_make with block_ptr it's obvious RVO is a lie.

-Phil

On Thursday, May 26, 2011, Phil Bouchard wrote:
You're right but make remains the same: 1) new: shared_ptr: 21234066 ns make: shared_ptr: 18122766 ns
2) make: shared_ptr: 18271141 ns new: shared_ptr: 13395539 ns
That's not what I see on my machine:

$ ./a.out
new:  shared_ptr: 15096947 ns
make: shared_ptr: 10986592 ns
# recompile with different order
$ ./a.out
make: shared_ptr: 14187251 ns
new:  shared_ptr: 11498070 ns

My cpu cores turn down their clock frequency when idle to save power, and I don't know how long it takes to switch speeds. Maybe the cpu clock speed is jumping from low to high speed while the benchmark is running.

On 5/27/2011 6:25 AM, Frank Mori Hess wrote:
My cpu cores turn down their clock frequency when idle to save power, and I don't know how long it takes to switch speeds. Maybe the cpu clock speed is jumping from low to high speed while the benchmark is running.
I get different results after changing the priority:

# nice --20 .../benchmark
new:
auto_ptr:   9602673 ns
shared_ptr: 21695797 ns
block_ptr:  136451426 ns
make:
auto_ptr:   10390549 ns
shared_ptr: 21044471 ns
block_ptr:  237520341 ns

# nice -20 .../benchmark
new:
auto_ptr:   9317126 ns
shared_ptr: 17450029 ns
block_ptr:  69507905 ns
make:
auto_ptr:   5028312 ns
shared_ptr: 10538796 ns
block_ptr:  117030907 ns

Although worker_make looks faster for auto_ptr & shared_ptr, it is slower for block_ptr. The assembly code is still much simpler for worker_new and it should be faster.

-Phil

On 5/27/2011 6:25 AM, Frank Mori Hess wrote:
My cpu cores turn down their clock frequency when idle to save power, and I don't know how long it takes to switch speeds. Maybe the cpu clock speed is jumping from low to high speed while the benchmark is running.
The results are constant so I'm not sure what would trigger CPU speedups in a constant way. Maybe the calls to the memory map are getting faster. -Phil

On 5/27/2011 2:35 PM, Phil Bouchard wrote:
The results are constant so I'm not sure what would trigger CPU speedups in a constant way. Maybe the calls to the memory map are getting faster.
The memory map is indeed getting faster. So now I calculate the median and I get the same results even after reordering the calls:

1)
new:
auto_ptr:   4640123 ns
shared_ptr: 10901185 ns
block_ptr:  77581308 ns
make:
auto_ptr:   5017051 ns
shared_ptr: 10559566 ns
block_ptr:  120121569 ns

2)
make:
auto_ptr:   5127206 ns
shared_ptr: 10731503 ns
block_ptr:  128038925 ns
new:
auto_ptr:   4667825 ns
shared_ptr: 11210022 ns
block_ptr:  69109934 ns

-Phil
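Taking the median over repeated runs, as described above, can be done along these lines (a sketch; the benchmark's actual code is not shown in this thread):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// The median damps one-off outliers such as a slow first run taken
// while the CPU is still ramping its clock frequency back up.
long long median_ns(std::vector<long long> runs) {
    std::sort(runs.begin(), runs.end());
    const std::size_t n = runs.size();
    return n % 2 ? runs[n / 2]
                 : (runs[n / 2 - 1] + runs[n / 2]) / 2;
}
```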

On 2011.05.24_16.52.00, Phil Bouchard wrote:
FYI make_shared is slower than new because of the temporary it creates. If people want speed they should stick to operator new in all cases.
My box begs to differ.

Units for min, max, mean and moment are: shared_ptrs created / 8 seconds
Count is: number of trials
Sum is: shared_ptrs created / (number of trials * 8 seconds)

GCC 4.4.5:
make_shared: min = 38176474 max = 40958919 mean = 3.95662e+07 count = 8 sum = 316529282 moment<2> = 1.56663e+15
new:         min = 32075900 max = 35917824 mean = 3.41757e+07 count = 8 sum = 273405391 moment<2> = 1.16916e+15

Intel C++ 12.0:
make_shared: min = 38264052 max = 40850617 mean = 4.0022e+07 count = 8 sum = 320176020 moment<2> = 1.6026e+15
new:         min = 33708800 max = 35364869 mean = 3.47297e+07 count = 8 sum = 277837391 moment<2> = 1.20645e+15

Clang 2.8:
make_shared: min = 35269466 max = 37759937 mean = 3.69983e+07 count = 8 sum = 295986798 moment<2> = 1.3695e+15
new:         min = 32876828 max = 34609212 mean = 3.38261e+07 count = 8 sum = 270608628 moment<2> = 1.14456e+15

lll-clang 3.0:
make_shared: min = 34756169 max = 37310038 mean = 3.66762e+07 count = 8 sum = 293409533 moment<2> = 1.34575e+15
new:         min = 33098433 max = 34011625 mean = 3.37868e+07 count = 8 sum = 270294200 moment<2> = 1.14162e+15

Equipment and kernel:
HP ProLiant DL785 G6 Server
8 AMD Socket Fs, 6 Opteron 8431s per socket w/ HyperTransport
Cache sizes per socket: L1 384kB, L2 3072kB, L3 5120kB
96 gigs main memory, 100ish gigs swap
RHEL 6.0 (Linux 2.6.32), sits at runlevel 3
Boost from top of trunk

No other jobs were running at the time. Peak memory usage for one invocation of the benchmark program is appx 60 gigs. Source code attached.

--
Bryce Lelbach aka wash
boost-spirit.com px.cct.lsu.edu github.com/lll-project

On 25/05/2011 00:40, Phil Bouchard wrote:
Hi,
Just for the records I have benchmarked block_ptr using: https://svn.boost.org/svn/boost/sandbox/block_ptr/libs/smart_ptr/example/ben...
And I get the following: auto_ptr: 9208870 ns shared_ptr: 22958516 ns block_ptr: 75860365 ns
Which is pretty good given the complexity it goes thru.
If you want a really useful benchmark, I'd suggest you give the max and average cycle count of each primitive associated with those types (construction, copy constructor, assignment, dereference, destructor)
I was wondering if there is any hope in having a pool::ordered_malloc() call of O(1) instead of O(n)
I don't see how such a thing would be possible, unless you can guarantee that all your allocations/deallocations happen in a LIFO order (in which case you can use the non-ordered allocation).
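To see why the ordered variant costs O(n), consider a toy free list (an illustrative model only, not Boost.Pool's actual implementation):

```cpp
#include <cstdint>
#include <list>

// Toy free-list model. A plain (LIFO) free pushes the chunk at the
// head in O(1); an ordered free must scan for the sorted insertion
// point, which is O(n) in the number of free chunks.
struct ToyFreeList {
    std::list<std::uintptr_t> chunks;  // addresses of free chunks

    void free_lifo(std::uintptr_t addr) {    // O(1)
        chunks.push_front(addr);
    }

    void free_ordered(std::uintptr_t addr) { // O(n) scan
        auto it = chunks.begin();
        while (it != chunks.end() && *it < addr)
            ++it;
        chunks.insert(it, addr);
    }
};
```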
and a faster pool::is_from()? The latter could be just a check to see if a pointer is within reserved memory pages of the pool. I tried to talk to the author of pool but I got no response.
The pages are not contiguous, so you need to walk through each page at least.
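The check described here, walking a list of non-contiguous pages, can be modelled like this (a toy sketch, not Boost.Pool's is_from; `toy_is_from` and the `Page` pair are invented names):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// A pool's pages modelled as (base, size) pairs. Because the pages
// are not contiguous, membership is a linear walk over the page list.
using Page = std::pair<const char*, std::size_t>;

bool toy_is_from(const std::vector<Page>& pages, const void* p) {
    const char* c = static_cast<const char*>(p);
    for (const Page& pg : pages)
        if (c >= pg.first && c < pg.first + pg.second)
            return true;
    return false;
}
```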

On 5/25/2011 5:15 AM, Mathias Gaunard wrote:
On 25/05/2011 00:40, Phil Bouchard wrote:
If you want a really useful benchmark, I'd suggest you give the max and average cycle count of each primitive associated with those types (construction, copy constructor, assignment, dereference, destructor)
Ok I understand and you're right but at least I already have an estimate.
I don't see how such a thing would be possible, unless you can guarantee that all your allocations/deallocations happen in a LIFO order (in which case you can use the non-ordered allocation).
Just to make sure: O(n) where n is the number of objects already allocated?
The pages are not contiguous, so you need to walk through each page at least.
It'll still be decently fast. -Phil

On 25/05/2011 18:25, Phil Bouchard wrote:
Just to make sure: O(n) where n is the number of objects already allocated?
The number of objects currently in the pool.
The pages are not contiguous, so you need to walk through each page at least.
It'll still be decently fast.
It's already what it does.

On 5/25/2011 10:09 AM, Mathias Gaunard wrote:
The number of objects currently in the pool.
People might confuse that n with the size of the object, because the same letter is used for two different purposes in "t.ordered_malloc(n)" and "O(n)": http://www.boost.org/doc/libs/1_36_0/libs/pool/doc/interfaces/pool.html

On the other hand, if you allocate big objects at the end of the pool and small ones at the beginning, then I'm pretty sure there is a way to sort out free space and end up with O(log(n)) complexity. O(n) doesn't make sense.
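That O(log(n)) suggestion could be approximated by keeping free chunk addresses in a balanced tree instead of a linked list (a speculative sketch under that assumption, not code from Boost.Pool or block_ptr):

```cpp
#include <cstdint>
#include <set>

// std::set is a balanced tree, so each ordered insert and lookup is
// O(log n) instead of the O(n) scan an ordered linked list needs.
struct TreeFreeIndex {
    std::set<std::uintptr_t> chunks;

    void free_ordered(std::uintptr_t addr) {  // O(log n)
        chunks.insert(addr);
    }

    // Lowest-addressed free chunk, the candidate ordered_malloc
    // would hand out first.
    std::uintptr_t lowest() const { return *chunks.begin(); }
};
```

Note a real ordered_malloc must also find runs of contiguous chunks for multi-chunk requests, so a set of single addresses is only part of the story.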
It's already what it does.
Ok, false alarm; I read somewhere that is_from() was slow. -Phil
participants (10)
- Bryce Lelbach
- Darren Garvey
- Emil Dotchevski
- Frank Mori Hess
- Mathias Gaunard
- Nevin Liber
- Peter Dimov
- Phil Bouchard
- Phil Endecott
- Rhys Ulerich