"Small buffer optimization" for boost::function?

Good <time-of-day>! Why not use 'small buffer optimization' for boost::function? Most allocations are 8-16 bytes - an optimal size for SBO. This optimization will make boost::function _much_ faster - in one of our projects boost::function is responsible for 15% of the whole processor time.

PS: it would also be nice if something like http://www.codeproject.com/cpp/FastDelegate.asp were included in Boost (and possibly in TR2).

PS: Happy New Year!

-- With respect, Alex Besogonov (cyberax@elewise.com)

Alex Besogonov wrote:
Good <time-of-day>!
Why not use 'small buffer optimization' for boost::function? Most allocations are 8-16 bytes - an optimal size for SBO.
This optimization will make boost::function _much_ faster - in one of
How much faster, exactly? Recently, in the thread "[signal] performance" I posted a small benchmark that finds boost::function to take 4 times more time than a call via a function pointer -- which, IMO, is pretty fast. Can you use that benchmark to obtain the speedup for your optimization?
our project boost::function is responsible for 15% of the whole processor time.
I find it pretty strange; can you clarify how it's possible? How did you measure the time spent in boost::function?

- Volodya

Vladimir Prus wrote:
This optimization will make boost::function _much_ faster - in one of

How much faster, exactly? Recently, in the thread "[signal] performance" I posted a small benchmark that finds boost::function to take 4 times more time than a call via a function pointer -- which, IMO, is pretty fast.

I don't have any problem with boost::function's speed of invocation (though FastDelegate is two times faster).
Can you use that benchmark to obtain speedup for your optimization?
our project boost::function is responsible for 15% of the whole processor time.

I find it pretty strange; can you clarify how it's possible? How did you measure the time spent in boost::function?

In our application we need to create lots of short-lived boost::function objects, so the time for the dynamic allocations performed inside boost::function's constructor really matters.
-- With respect, Alex Besogonov (cyberax@elewise.com)

Alex Besogonov wrote:
Vladimir Prus wrote:
This optimization will make boost::function _much_ faster - in one of

How much faster, exactly? Recently, in the thread "[signal] performance" I posted a small benchmark that finds boost::function to take 4 times more time than a call via a function pointer -- which, IMO, is pretty fast.
I don't have any problem with boost::function speed of invocations (though FastDelegate is two times faster).
I see. Still, would be nice to see specific numbers. - Volodya

Vladimir Prus wrote:
I don't have any problem with boost::function's speed of invocation (though FastDelegate is two times faster).

I see. Still, would be nice to see specific numbers.

I've attached a test program (you need FastDelegate from http://www.codeproject.com/cpp/FastDelegate.asp to compile it).
Results:

=========================
C:\temp\delegates>gcc -O3 -funroll-loops -fomit-frame-pointer test.cpp -Ic:/tools/boost -lstdc++

C:\temp\delegates>a.exe
Time elapsed for FastDelegate: 1.191000 (sec)
Time elapsed for simple bind: 0.010000 (sec)
Time elapsed for bind+function: 33.118000 (sec)
Time elapsed for pure function invocation: 3.705000 (sec)
=========================
(GCC 4.1.0 was used)

You can see that boost::function + boost::bind is an order of magnitude slower than FastDelegate. Even a mere invocation of a boost::function is slower than a complete bind+invoke for FastDelegate.

-- With respect, Alex Besogonov (cyberax@elewise.com)

#include <stdio.h>
#include "FastDelegate.h"
#include <boost/timer.hpp>
#include <boost/bind.hpp>
#include <boost/function.hpp>

class Test
{
public:
    void func(int param)
    {
        int i = 0;
        i = param + 1;
    }
};

using namespace fastdelegate;

int main(void)
{
    typedef FastDelegate1<int> IntMyDelegate;
    Test test;

    boost::timer t;
    for (int f = 0; f < 100000000; f++)
    {
        IntMyDelegate newdeleg;
        newdeleg = MakeDelegate(&test, &Test::func);
        newdeleg(f);
    }
    printf("Time elapsed for FastDelegate: %f (sec)\n", t.elapsed());

    boost::timer t2;
    for (int f = 0; f < 100000000; f++)
    {
        boost::bind(&Test::func, &test, _1)(f);
    }
    printf("Time elapsed for simple bind: %f (sec)\n", t2.elapsed());

    boost::timer t3;
    for (int f = 0; f < 100000000; f++)
    {
        boost::function<void(int)> func = boost::bind(&Test::func, &test, _1);
        func(f);
    }
    printf("Time elapsed for bind+function: %f (sec)\n", t3.elapsed());

    boost::timer t4;
    boost::function<void(int)> func = boost::bind(&Test::func, &test, _1);
    for (int f = 0; f < 100000000; f++)
    {
        func(f);
    }
    printf("Time elapsed for pure function invocation: %f (sec)\n", t4.elapsed());

    return 0;
}

Alex Besogonov <cyberax@elewise.com> writes:

| You can see that boost::function + boost::bind is an order of
| magnitude slower than FastDelegate. Even a mere invocation of a
| boost::function is slower than complete bind+invoke for FastDelegate.

This is a profile with optimization and all inlining turned off:

#include <stdio.h>
#include <boost/timer.hpp>
#include <boost/function.hpp>

void test(int param)
{
    int i = 0;
    i = param + 1;
}

void test3()
{
    boost::timer t3;
    for (int f = 0; f < 100000000; f++)
    {
        boost::function<void(int)> func = &test;
        func(f);
    }
    printf("Time elapsed for bind+function: %f (sec)\n", t3.elapsed());
}

int main(void)
{
    test3();
    return 0;
}

Compiled with:

g++ -Wextra -Wall -pg -g -O0 -fno-inline-functions -fno-inline \
    -fno-inline-functions-called-once -o function function.cxx

Just the top of the profile:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
14.83      7.29     7.29 200000000    0.00     0.00   boost::detail::function::functor_manager<void (*)(int), std::allocator<void> >::manage(boost::detail::function::any_pointer, boost::detail::function::functor_manager_operation_type)
13.58     13.96     6.67 100000000    0.00     0.00   void boost::function1<void, int, std::allocator<void> >::assign_to<void (*)(int)>(void (*)(int), boost::detail::function::function_ptr_tag)
11.65     19.69     5.73 200000000    0.00     0.00   boost::function1<void, int, std::allocator<void> >::clear()
 9.10     24.16     4.47 200000000    0.00     0.00   boost::detail::function::functor_manager<void (*)(int), std::allocator<void> >::manager(boost::detail::function::any_pointer, boost::detail::function::functor_manager_operation_type, boost::detail::function::function_ptr_tag)

I have no idea if this is of any use to anyone.

-- Lgb

On Dec 28, 2005, at 4:15 AM, Alex Besogonov wrote:
Vladimir Prus wrote:
I don't have any problem with boost::function's speed of invocation (though FastDelegate is two times faster).

I see. Still, would be nice to see specific numbers.

I've attached a test program (you need FastDelegate from http://www.codeproject.com/cpp/FastDelegate.asp to compile it).
Results:

=========================
C:\temp\delegates>gcc -O3 -funroll-loops -fomit-frame-pointer test.cpp -Ic:/tools/boost -lstdc++

C:\temp\delegates>a.exe
Time elapsed for FastDelegate: 1.191000 (sec)
Time elapsed for simple bind: 0.010000 (sec)
Time elapsed for bind+function: 33.118000 (sec)
Time elapsed for pure function invocation: 3.705000 (sec)
=========================
(GCC 4.1.0 was used)
You can see that boost::function + boost::bind is an order of magnitude slower than FastDelegate. Even a mere invocation of a boost::function is slower than complete bind+invoke for FastDelegate.
The major performance problem in this example is the memory allocation required to construct boost::function objects. We could implement the SBO directly in boost::function, but we're trading off space against performance. Is it worth it? It depends on how often you copy boost::function objects vs. how many of them you store in memory.

Would a pooling allocator solve the problem? I tried switching the boost::function<> allocator to boost::pool_allocator and boost::fast_pool_allocator (from the Boost.Pool library), but performance actually got quite a bit worse with this change:

Time elapsed for simple bind: 2.050000 (sec)
Time elapsed for bind+function: 43.120000 (sec)
Time elapsed for pure function invocation: 2.020000 (sec)
Time elapsed for bind+function+pool: 130.750000 (sec)
Time elapsed for bind+function+fastpool: 108.590000 (sec)

Pooling is not feasible, so we need the SBO for performance, but not all users can take the increase in boost::function's size. On non-broken compilers, we could use the Allocator parameter to implement the SBO. At first I was hoping we could just make boost::function smart enough to handle stateful allocators, then write an SBO allocator. Unfortunately, this doesn't play well with rebinding:

template<typename Signature, typename Allocator>
class function : Allocator
{
public:
    template<typename F>
    function(const F& f)
    {
        typedef typename Allocator::template rebind<F>::other my_allocator;
        my_allocator alloc(*this);
        F* new_F = alloc.allocate(1); // where does this point to?
        // ...
    }
};

Presumably, an SBO allocator's allocate() member would return a pointer into its own buffer, but what happens when you rebind for the new type F and then allocate() using that rebound allocator? You get a pointer to the wrong buffer. So the SBO needs to be more deeply ingrained in boost::function.
The common case on most 32-bit architectures is an 8-byte member function pointer and a 4-byte object pointer, so we need 12 bytes of storage to start with for the buffer; boost::function is currently only 12 bytes (4 bytes of that is the buffer). boost::function adds to this the "manager" and "invoker" pointers, which would bring us to 20 bytes in the SBO case. But we can collapse the manager and invoker into a single vtable pointer, so we'd get back down to 16 bytes. Still larger than before, but that 4-byte overhead could drastically improve performance for many common cases. I'm okay with that. We'll probably have to give up the no-throw swap guarantee, and perhaps also the strong exception safety of copying boost::function objects, but I don't think anyone will care about those. The basic guarantee is good enough.

Doug

On Dec 29, 2005, at 12:49 PM, Douglas Gregor wrote:
boost::function adds to this the "manager" and "invoker" pointers, which would bring us to 20 bytes in the SBO case. But, we can collapse the manager and invoker into a single vtable pointer, so we'd get back down to 16 bytes.
FWIW, I've just committed changes that compress the manager and invoker pointers into a single vtable pointer. boost::function is now 8 bytes (on most 32-bit platforms); we can try the SBO from there. Doug

I've now implemented the small buffer optimization for Boost.Function. The patch is attached, but I have yet to check it in. Here's the executive summary:

Performance difference: up to 6x faster when the SBO applies
Space difference: boost::function takes an extra 4 bytes (now it's 16 bytes)
Semantics: assignment operators now give the basic guarantee (was the strong guarantee); swap() can now throw. (We're now less TR1-conforming, but we could claim that the TR is wrong to be so strict.)
Usability: the optimization won't help much in practice unless Boost.Bind objects become smaller :(

I've extended the performance test (attached) with a "smallbind" function object and its tests. The tests that follow use both "smallbind" and "bind", separately, because the former fits in the 12-byte SBO buffer whereas the latter does not.

I tested on GCC 4.2.0 (bleeding edge, straight from CVS) and GCC 3.3 (both Apple and FSF) on a dual G5 running Mac OS X Panther and on an Athlon XP system running Linux. The newer compiler gave us the performance boost we wanted, with about a 6x improvement when the SBO is used. We get more like 2x with GCC 3.3, although I have a trick or two left that may improve things.

In the data that follows, there are 3 versions of Boost.Function being tested:

1.33.1: This is Boost.Function as released in Boost 1.33.1. No SBO applied, of course.

1.34.0 w/ vtables: This is Boost.Function as it currently stands in Boost CVS. It uses vtables for a space optimization (a boost::function object requires only 8 bytes of storage), but does not implement the SBO.

1.34.0 w/ vtables and SBO: This is Boost.Function in Boost CVS with the attached patch applied. It uses vtables and contains a 12-byte (actually, the size of a member pointer + the size of a void*) buffer for the SBO.

Even with the SBO in Boost.Function, users won't immediately realize the benefits. The problem is that Boost.Bind produces function objects whose size is not minimal.
For instance, boost::bind(&Test::func, &func, _1) returns a function object that is 16 bytes. That 4 bytes of wasted space doesn't matter most of the time, but here it means the difference between using the SBO and not using the SBO :( So, Peter, any chance of getting a slightly more optimized Boost.Bind that can fit boost::bind(&Test::func, &func, _1) into 12 bytes?

Doug

On my Athlon XP box
-------------------------
OS: Gentoo Linux ("old")
Compiler: GCC 3.3.6
Flags: -O3 -funroll-loops -fomit-frame-pointer

[1.33.1]
Time elapsed for simple bind: 1.360000 (sec)
Time elapsed for smallbind+function (size=12): 10.870000 (sec)
Time elapsed for bind+function (size=16): 11.770000 (sec)
Time elapsed for pure function invocation: 1.590000 (sec)
Time elapsed for bind+function+pool: 29.690000 (sec)
Time elapsed for bind+function+fastpool: 6.560000 (sec)

[1.34.0 w/ vtables]
Time elapsed for simple bind: 1.410000 (sec)
Time elapsed for smallbind+function (size=12): 11.260000 (sec)
Time elapsed for bind+function (size=16): 12.500000 (sec)
Time elapsed for pure function invocation: 1.530000 (sec)
Time elapsed for bind+function+pool: 30.360000 (sec)
Time elapsed for bind+function+fastpool: 7.370000 (sec)

[1.34.0 w/ vtables and SBO]
Time elapsed for simple bind: 1.360000 (sec)
Time elapsed for smallbind+function (size=12): 5.190000 (sec)
Time elapsed for bind+function (size=16): 13.150000 (sec)
Time elapsed for pure function invocation: 1.660000 (sec)
Time elapsed for bind+function+pool: 29.730000 (sec)
Time elapsed for bind+function+fastpool: 7.060000 (sec)

On my Athlon XP Linux box
-------------------------
OS: Gentoo Linux ("old")
Compiler: GCC 4.2.0 (20051122, experimental)
Flags: -O3 -funroll-loops -fomit-frame-pointer

[1.33.1]
Time elapsed for simple bind: 0.850000 (sec)
Time elapsed for smallbind+function (size=12): 14.230000 (sec)
Time elapsed for bind+function (size=16): 15.100000 (sec)
Time elapsed for pure function invocation: 1.430000 (sec)
Time elapsed for bind+function+pool: 26.930000 (sec)
Time elapsed for bind+function+fastpool: 9.060000 (sec)

[1.34.0 w/ vtables]
Time elapsed for simple bind: 0.020000 (sec)
Time elapsed for smallbind+function (size=12): 13.410000 (sec)
Time elapsed for bind+function (size=16): 13.360000 (sec)
Time elapsed for pure function invocation: 1.350000 (sec)
Time elapsed for bind+function+pool: 25.570000 (sec)
Time elapsed for bind+function+fastpool: 7.590000 (sec)

[1.34.0 w/ vtables and SBO]
Time elapsed for simple bind: 0.020000 (sec)
Time elapsed for smallbind+function (size=12): 2.640000 (sec)
Time elapsed for bind+function (size=16): 12.940000 (sec)
Time elapsed for pure function invocation: 1.460000 (sec)
Time elapsed for bind+function+pool: 25.260000 (sec)
Time elapsed for bind+function+fastpool: 7.430000 (sec)

On my dual G5 PowerMac
----------------------
OS: Panther (10.3.9)
Compiler: Apple GCC 3.3
Flags: -O3

[1.33.1]
Time elapsed for simple bind: 2.330000 (sec)
Time elapsed for smallbind+function (size=12): 22.080000 (sec)
Time elapsed for bind+function (size=16): 29.370000 (sec)
Time elapsed for pure function invocation: 2.590000 (sec)
Time elapsed for bind+function+pool: 38.810000 (sec)
Time elapsed for bind+function+fastpool: 21.460000 (sec)

[1.34.0 w/ vtables]
Time elapsed for simple bind: 1.180000 (sec)
Time elapsed for smallbind+function (size=12): 24.050000 (sec)
Time elapsed for bind+function (size=16): 25.860000 (sec)
Time elapsed for pure function invocation: 2.590000 (sec)
Time elapsed for bind+function+pool: 42.000000 (sec)
Time elapsed for bind+function+fastpool: 23.590000 (sec)

[1.34.0 w/ vtables and SBO]
Time elapsed for simple bind: 1.200000 (sec)
Time elapsed for smallbind+function (size=12): 8.140000 (sec)
Time elapsed for bind+function (size=16): 24.180000 (sec)
Time elapsed for pure function invocation: 2.590000 (sec)
Time elapsed for bind+function+pool: 39.210000 (sec)
Time elapsed for bind+function+fastpool: 21.280000 (sec)

Douglas Gregor wrote:
So, Peter, any chance of getting a slightly more optimized Boost.Bind that can fit boost::bind(&Test::func, &func, _1) into 12 bytes?
I can't think of an easy way to do that at the moment (compress all placeholders to not take up space). Even if I did, bind(&X::f, &x, _1, true) would still overflow the 12 byte buffer. ;-) One hack-ish solution would be to increase the buffer to 16, possibly even 32, and try the measurements again. This of course has obvious drawbacks for people that use only function<> without bind.

On Jan 9, 2006, at 7:18 AM, Peter Dimov wrote:
Douglas Gregor wrote:
So, Peter, any chance of getting a slightly more optimized Boost.Bind that can fit boost::bind(&Test::func, &func, _1) into 12 bytes?
I can't think of an easy way to do that at the moment (compress all placeholders to not take up space).
Well, there's compressed_pair<F, compressed_pair<arg1_type, compressed_pair<arg2_type> > >...
Even if I did, bind(&X::f, &x, _1, true) would still overflow the 12 byte buffer. ;-)
Why? Does it need more than the 8-byte member pointer and 4-byte pointer?
One hack-ish solution would be to increase the buffer to 16, possibly even 32, and try the measurements again. This of course has obvious drawbacks for people that use only function<> without bind.
Much of my earlier whining about implementing the SBO in boost::function is because I've been trying to keep the overhead down. I think in its current state, with the 12-byte buffer, it's at the right balance point between execution time and space overhead (assuming we can trick bind into smashing those binders into 12 bytes <g>).

Doug

Douglas Gregor wrote:
On Jan 9, 2006, at 7:18 AM, Peter Dimov wrote:
Even if I did, bind(&X::f, &x, _1, true) would still overflow the 12 byte buffer. ;-)
Why? Does it need more than the 8-byte member pointer and 4-byte pointer?
The 'true' needs to be stored somewhere.

On Jan 9, 2006, at 10:13 AM, Peter Dimov wrote:
Douglas Gregor wrote:
On Jan 9, 2006, at 7:18 AM, Peter Dimov wrote:
Even if I did, bind(&X::f, &x, _1, true) would still overflow the 12 byte buffer. ;-)
Why? Does it need more than the 8-byte member pointer and 4-byte pointer?
The 'true' needs to be stored somewhere.
*Smacks forehead* I didn't see the true, because I was assuming that you had written bind(&X::f, &x, _1) :) Anyway, we're going to have a cutoff somewhere. The curse of the SBO is that at some point your objects don't fit into the small buffer any more, so you have a jump in your performance curve. If we go to 16 bytes, then bind(&X::f, &x, _1, true) will fit but bind(&X::f, &x, _1, true, true) won't. The member pointer + "this" pointer case seems like the one that users would most expect to work well. It goes head-to-head with delegates, closures, and other similar extensions, with the added benefit of keeping boost::function down to 16 bytes. It just feels like the right cutoff for the buffer size.

Doug

Douglas Gregor wrote:
On Jan 9, 2006, at 10:13 AM, Peter Dimov wrote:
Douglas Gregor wrote:
On Jan 9, 2006, at 7:18 AM, Peter Dimov wrote:
Even if I did, bind(&X::f, &x, _1, true) would still overflow the 12 byte buffer. ;-)
Why? Does it need more than the 8-byte member pointer and 4-byte pointer?
The 'true' needs to be stored somewhere.
*Smacks forehead* I didn't see the true, because I was assuming that you had written bind(&X::f, &x, _1) :)
Anyway, we're going to have a cutoff somewhere, The curse of the SBO is that at some point your objects don't fit into the small buffer any more, so you have a jump in your performance curve.
Yes. It would be interesting to measure the performance for larger buffer sizes, though. The break-even point should occur somewhere around 32 or 48, maybe even 64 if the allocator is bad enough.
If we go to 16 bytes, then bind(&X::f, &x, _1, true) will fit but bind(&X::f, &x, _1, true, true) won't.
It probably will unless 'true' is 4 bytes.
The member pointer + "this" pointer case seems like the one that users would most expect to work well. It goes head- to-head with delegates, closures, and other similar extensions, with the added benefit of keeping boost::function down to 16 bytes. It just feels like the right cutoff for the buffer size.
In my experience, the closure case is indeed very common in code written by people who don't take advantage of the full expressive power of boost::bind, probably because they have a Borland/delegate background. If you design your class to have

struct X
{
    void show();
    void hide();
};

closures are enough. But there is an alternative. You can use

struct X
{
    void set_visibility( bool visible );
};

and synthesize show/hide with boost::bind. My code tends towards the latter variety, so I won't be seeing much of the SBO with a &X::f+&x cutoff.

BTW, this talk about a 12-byte buffer assumes g++. A member pointer is 4-16 bytes on MSVC, 8 on g++, 12 on Borland, 4 on Digital Mars (!). There's a nice table in http://www.codeproject.com/cpp/FastDelegate.asp

Anyway, I committed a storage optimization to the CVS. <crosses fingers>

On Jan 9, 2006, at 11:55 AM, Peter Dimov wrote:
Yes. It would be interesting to measure the performance for larger buffer sizes, though. The break-even point should occur somewhere around 32 or 48, maybe even 64 if the allocator is bad enough.
Yeah, that's possible. I don't think it's just a matter of finding the break-even point for performance, though. function<> is supposed to replace function pointers, closures, etc. If it's significantly larger than those entities, it becomes harder to justify the use of function<>. We have two things to optimize here :(
If we go to 16 bytes, then bind(&X::f, &x, _1, true) will fit but bind(&X::f, &x, _1, true, true) won't.
It probably will unless 'true' is 4 bytes.
Some ABIs are actually that strange :)
In my experience, the closure case is indeed very common in code written by people who don't take advantage of the full expressive power of boost::bind, probably because they have a Borland/delegate background. [snip] and synthesize show/hide with boost::bind. My code tends towards the latter variety, so I won't be seeing much of the SBO with a &X::f+&x cutoff.
Me too :)
BTW, this talk about 12 byte buffer is assuming g++. A member pointer is 4-16 on MSVC, 8 on g++, 12 on Borland, 4 on Digital Mars (!). There's a nice table in
Yeah, I know. The actual code for function<> has a union containing a struct with an unknown member pointer and a void pointer in it. I guess we could pad that with an integer or two if we want to expand the buffer...
Anyway, I committed a storage optimization to the CVS. <crosses fingers>
Very cool. Works like a charm on GCC, at least. Once I get a chance to write up documentation for the changes to Function, I'll commit everything to CVS and we'll see who screams :) Doug

Doug Gregor wrote:
On Jan 9, 2006, at 11:55 AM, Peter Dimov wrote:
Yes. It would be interesting to measure the performance for larger buffer sizes, though. The break-even point should occur somewhere around 32 or 48, maybe even 64 if the allocator is bad enough.
Yeah, that's possible. I don't think it's just a matter of finding the break-even point for performance, though. function<> is supposed to replace function pointers, closures, etc. If it's significantly larger than those entities, it becomes harder to justify the use of function<>. We have two things to optimize here :(
Yes, it's obviously a tradeoff, but I don't think that function pointers and closures compete with function<>. You simply can't use a function pointer or a closure to duplicate function<>.

Peter Dimov wrote:
Anyway, I committed a storage optimization to the CVS. <crosses fingers>

Whoa! That's fast:

================================
C:\temp\delegates>test.exe
Time elapsed for FastDelegate: 0.560000 (sec)
Time elapsed for simple bind: 0.411000 (sec)
Time elapsed for bind+function: 1.983000 (sec)
Time elapsed for pure function invocation: 0.781000 (sec)
================================

bind+function<> is now just 4 times slower than the best possible case.
BTW, why not add FastDelegate (it's in the public domain) to bind+function<> as a special case for optimization?

-- With respect, Alex Besogonov (cyberax@elewise.com)

Alex Besogonov wrote:
Peter Dimov wrote:
Anyway, I committed a storage optimization to the CVS. <crosses fingers>

Whoa! That's fast:

================================
C:\temp\delegates>test.exe
Time elapsed for FastDelegate: 0.560000 (sec)
Time elapsed for simple bind: 0.411000 (sec)
Time elapsed for bind+function: 1.983000 (sec)
Time elapsed for pure function invocation: 0.781000 (sec)
================================

bind+function<> is now just 4 times slower than the best possible case.
BTW, why not add FastDelegate (it's in the public domain) to bind+function<> as a special case for optimization?
Most of the time is probably due to function<>'s dynamic dispatch mechanism; I'm not sure how FastDelegate can help.

You could also try a test using ct_mem_fn< void (X::*)(), &X::f > where ct_mem_fn is:

template< class Pm, Pm pm > struct ct_mem_fn
{
    typedef void result_type;

    template<class A1> result_type operator()( A1 & a1 ) const
    {
        boost::mem_fn( pm )( a1 );
    }

    template<class A1, class A2> result_type operator()( A1 & a1, A2 & a2 ) const
    {
        boost::mem_fn( pm )( a1, a2 );
    }
};

Peter Dimov wrote:
Most of the time is probably due to function<>'s dynamic dispatch mechanism; I'm not sure how FastDelegate can help.

I've tried to hack it into function<> myself, but I've failed :)
I'm thinking of something like:

======================
union function_buffer
{
    // FastDelegate
    FastDelegateData delegate_data;

    // For pointers to function objects
    void* obj_ptr;

    // For pointers to std::type_info objects
    // (get_functor_type_tag, check_functor_type_tag).
    const void* const_obj_ptr;

    // For function pointers of all kinds
    mutable void (*func_ptr)();

    // For bound member pointers
    struct bound_memfunc_ptr_t
    {
        void (X::*memfunc_ptr)(int);
        void* obj_ptr;
    } bound_memfunc_ptr;

    // To relax aliasing constraints
    mutable char data;
};
======================

'delegate_data' should be used if we're creating a delegate (i.e. a pointer to a method with a bound 'this' pointer). It's called a 'closure' in Borland-speak. It will complicate the code a lot, though, because a simple

mpl::bool_<(function_allows_small_object_optimization<functor_type>::value)>()

won't be enough.
You could also try a test using ct_mem_fn< void (X::*)(), &X::f > where ct_mem_fn is

This is fine, but we're losing the generic interface of function<> :(
-- With respect, Alex Besogonov (cyberax@elewise.com)

Alex Besogonov wrote:
Peter Dimov wrote:
Most of the time is probably due to function<>'s dynamic dispatch mechanism; I'm not sure how FastDelegate can help.

I've tried to hack it into function<> myself, but I've failed :)
I'm thinking of something like:

======================
union function_buffer
{
    // FastDelegate
    FastDelegateData delegate_data;

    // For pointers to function objects
    void* obj_ptr;

    // For pointers to std::type_info objects
    // (get_functor_type_tag, check_functor_type_tag).
    const void* const_obj_ptr;

    // For function pointers of all kinds
    mutable void (*func_ptr)();

    // For bound member pointers
    struct bound_memfunc_ptr_t
    {
        void (X::*memfunc_ptr)(int);
        void* obj_ptr;
    } bound_memfunc_ptr;

    // To relax aliasing constraints
    mutable char data;
};
======================

'delegate_data' should be used if we're creating a delegate (i.e. a pointer to a method with a bound 'this' pointer). It's called a 'closure' in Borland-speak.
I would be careful using FastDelegate as an adjunct to Boost function/bind. It is very dependent on knowledge of compiler member function pointer sizes and uses "hacks" based on that knowledge to achieve its speed. Boost should try to stay clear of low-level compiler dependent knowledge whenever it can, sacrificing a little extra speed for greater stability. Of course if it is an optional choice for end-users, then it is more understandable but even then end-users should be duly notified of the issues involved. This is not in any way a putdown of FastDelegate but Boost function/bind is richer in functionality and therefore pays the price of being slower. Of course speed is important but not if it means less stability of implementation.

Edward Diener wrote:
I would be careful using FastDelegate as an adjunct to Boost function/bind. It is very dependent on knowledge of compiler member function pointer sizes and uses "hacks" based on that knowledge to achieve its speed. Boost should try to stay clear of low-level compiler-dependent knowledge whenever it can, sacrificing a little extra speed for greater stability.

Boost already has a lot of compiler-specific hacks and kludges (see MPL or TYPEOF/FOREACH). There's nothing inherently wrong with a good hack if it is protected by #ifdefs and there's a fallback implementation for unsupported platforms.
Of course if it is an optional choice for end-users, then it is more understandable, but even then end-users should be duly notified of the issues involved. This is not in any way a putdown of FastDelegate, but Boost function/bind is richer in functionality and therefore pays the price of being slower. Of course speed is important, but not if it means less stability of implementation.

The SBO already gives a 10x speed boost to Boost.Bind :) Its performance is quite acceptable now.
-- With respect, Alex Besogonov (cyberax@elewise.com)

On Jan 15, 2006, at 11:15 AM, Edward Diener wrote:
I would be careful using FastDelegate as an adjunct to Boost function/bind. It is very dependent on knowledge of compiler member function pointer sizes and uses "hacks" based on that knowledge to achieve its speed. Boost should try to stay clear of low-level compiler dependent knowledge whenever it can, sacrificing a little extra speed for greater stability.
If we do something like this, we'll do it very carefully... only enable it with compiler versions that we know work, add some tricky tests to the test suite to be absolutely sure, etc. If provided, I'll review and consider patches for FastDelegate-like support in Boost.Function. However, I'm not at all interested in doing the development work myself. Doug

Vladimir Prus wrote:
Alex Besogonov wrote:
Vladimir Prus wrote:
This optimization will make boost::function _much_ faster - in one of

How much faster, exactly? Recently, in the thread "[signal] performance" I posted a small benchmark that finds boost::function to take 4 times more time than a call via a function pointer -- which, IMO, is pretty fast.
I don't have any problem with boost::function speed of invocations (though FastDelegate is two times faster).
I see. Still, would be nice to see specific numbers.
I like timing the copy constructor of various types. :-)

8M push_back operations in an unreserved vector for:

shared_ptr: 1.6
function<> holding a function pointer: 1.6
POD with size 64: 4.9
function<> holding a trivial function object: 17.6

So, assuming that function<> is more frequently used with function objects, SBO would be a win even for size 128. This is with a "bad" ::operator new (the one from the MSVC 7.1 MT lib), but still...
participants (7)
- Alex Besogonov
- Doug Gregor
- Douglas Gregor
- Edward Diener
- larsbj@gullik.net
- Peter Dimov
- Vladimir Prus