
Are there any docs which describe the performance of the signal/slot library? I was about to embark on a performance study because I want to use it in a very high performance critical code path, but I thought I'd ask if anyone else may have already done some work in this area. Thanks!

On Oct 13, 2004, at 8:17 AM, Jody Hagins wrote:
Are there any docs which describe the performance of the signal/slot library?
The reference docs say a little about the asymptotic behavior of signal operations, but AFAIK nobody has done a real performance study.
I was about to embark on a performance study because I want to use it in a very high performance critical code path, but I thought I'd ask if anyone else may have already done some work in this area.
I won't claim to be optimistic about the results, but I'd love to see them. Doug

I would go ahead and use boost::signals, and if later you profile your code and find they produce a significant performance penalty, replace them. They may not be that fast, but if you're doing something 100,000 times slower elsewhere in your code, you really won't notice the 0.001% overhead they add. As always, premature optimization is a waste of time, effort, and money. Only ever optimize if a) you've profiled and identified the code as a bottleneck, or b) you're concerned it will be a bottleneck AND the alternative is similarly easy to implement and use, in which case you could get away with using your alternative from the start.

-Dan

On Wed, 13 Oct 2004 09:17:04 -0400, Jody Hagins <jody-boost-011304@atdesk.com> wrote:
Are there any docs which describe the performance of the signal/slot library? I was about to embark on a performance study because I want to use it in a very high performance critical code path, but I thought I'd ask if anyone else may have already done some work in this area.
Thanks!
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Dan Eloff wrote:
I would go ahead and use boost::signals, and if later you profile your code and find they produce a significant performance penalty, replace them. They may not be that fast, but if you're doing something 100,000 times slower elsewhere in your code, you really won't notice the 0.001% overhead they add. As always, premature optimization is a waste of time, effort, and money. Only ever optimize if a) you've profiled and identified the code as a bottleneck, or b) you're concerned it will be a bottleneck AND the alternative is similarly easy to implement and use, in which case you could get away with using your alternative from the start.
On the other hand, as Alexandrescu in _Modern C++ Design_ quotes Len Lattanzi, "Belated pessimization is the leaf of no good." Alexandrescu goes on to write, "A pessimization of one order of magnitude in the runtime of a core object like a functor, a smart pointer, or a string can easily make the difference between success and failure for a whole project" (Ch. 4, p. 77). Library writers often do not have the same liberty as application writers to go back later, after profiling, and fix their slow code.

I have been working on a design for a prototype policy-based universal demultiplexor. One of the policies is a 'dispatch policy' that provides the actual mechanism for notifying resource classes of events. In the initial implementation of this policy, I am using Boost.Signal.

Preliminary analysis suggests that the performance efficiency of the demultiplexor when using this dispatcher is "bad." While I cannot yet offer meaningful numbers, it seems that there is something quite slow happening under the covers in the Signals library. I have not examined the implementation of the Signals library at all, however. The extent of this problem, and whether it is easily fixable, remains to be seen.

I'm going to get back to the list on this, probably about when I'm done with the demultiplexor prototype.

Aaron W. LaFramboise

On Sun, 17 Oct 2004 19:36:39 -0500 "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com> wrote:
I have been working on a design for a prototype policy-based universal demultiplexor. One of the policies is a 'dispatch policy' that provides the actual mechanism for notifying resource classes of events. In the initial implementation of this policy, I am using Boost.Signal.
My initial implementation is based on Boost.Signals as well. However, every operation in the system goes through this dispatcher, and I was hoping to benefit from work already done by others.
I'm going to get back to the list on this, probably about when I'm done with the demultiplexor prototype.
Thanks! I look forward to the information from your tests...

On Sun, 17 Oct 2004 19:36:39 -0500 "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com> wrote:
On the other hand, as Alexandrescu in _Modern C++ Design_ quotes Len Lattanzi, "Belated pessimization is the leaf of no good." Alexandrescu goes on to write, "A pessimization of one order of magnitude in the runtime of a core object like a functor, a smart pointer, or a string can easily make the difference between success and failure for a whole project" (Ch. 4, p. 77). Library writers often do not have the same liberty as application writers to go back later, after profiling, and fix their slow code.
Right. My application is entirely event driven, with boost::signal() dispatching every event. The application will be processing thousands of messages per second, each one causing at least one signal() invocation.
Preliminary analysis suggests that the performance efficiency of the demultiplexor when using this dispatcher is "bad." While I can not yet offer meaningful numbers, it seems that there is something quite slow happening under the covers in the Signals library. I have not examined the implementation of the Signals library at all, however. The extent of this problem, and whether it is easily fixable, remains to be seen.
I have not done much examination either, but to satisfy my curiosity, I hacked together a "lite" implementation of the signals interface that provides *minimum* functionality. Same interface as boost::signal<>, w.r.t. connect(), disconnect() and operator() (i.e., dispatching a signal). Also, allows connect/disconnect/replace inside a slot handler. It does not provide return value combining and the fancier features. However, I can do this...

#ifdef REPLACE_BOOST_SIGNALS
#define PE_SIGNAL_TEMPLATE lite::signals::Signal
#else
#define PE_SIGNAL_TEMPLATE boost::signal
#endif

and easily switch between boost::signal and my lite version. Thus, I can verify that my code compiles and runs the same (and I can verify that the tests give the same results for either implementation). However, a very quick test shows that boost::signal is at least 2 orders of magnitude slower than the "lite" version. Note that this code compiles/runs on g++, and I have not tried it on other compilers.

The attached files...

Connection.hpp and Signal.hpp implement the "lite" signal interface.
speed_test.cpp is a first glance at speed comparisons between boost::signal and the "lite" version.
pt50.txt is output of running the test on a SuSE based Opteron.
shandalle.txt is output of running the test on a RH7.3 based Xeon.

The format is...

===== 1000000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1    1000000 17.9360  0.0701
       10     100000  5.3777  0.0289
       50      20000  4.0621  0.0244
      100      10000  3.9715  0.0232
      250       4000  3.8375  0.0254
      500       2000  3.8873  0.0237
     1000       1000  3.7644  0.0229
     5000        200  4.0361  0.1169
    10000        100  4.0450  0.1678
    50000         20  3.9394  0.1953
   100000         10  3.9916  0.2268
   500000          2  3.8303  0.1128

The first line means that we connected 1 slot to the signal, and called signal() 1000000 times, resulting in a total of 1000000 slot invocations (sort of... there are a few extra).
It took 17.9360 seconds to make these calls with the boost version, and 0.0701 seconds to make these calls with the lite version. The last line, obviously, means that we connected 500000 slots to the signal, and called signal() 2 times, resulting in the same number of slot invocations, with a time of 3.8303 seconds for boost and 0.1128 seconds for lite.

Note that there seems to be some heavy overhead just in calling signal(). The output files show several different size runs, on two different architectures/compiler versions.

The test is not really that great, but is, I think, a reasonable first attempt at measuring the performance. Of course, I understand that I may be measuring the worst case features of signal, and the lite version does not even come close to the quality, depth, or breadth of the boost implementation. However, it would be nice if someone could give boost::signal a "boost" for what I think are very common use cases. It seems that using it in a simple way requires a heavy price for features that are not used. This seems to fly in the face of common library design (where we ATTEMPT to make users not have to pay for things they are not using).

Also, note that the lite version degrades as the number of slots increases, indicative of the overhead of iterating through the list of "slots." The overhead difference is measurable between using std::list and std::vector, but std::list is easier to implement supporting connect/disconnect from within a "slot" handler.

Comments?

P.S. I hope no one sees this as a slam on boost::signal or Doug, as I feel totally the opposite and am extremely grateful for the Boost.Community. In fact, I'd really like someone to point out where my test is woefully flawed, or my use of boost::signal is terribly misguided, or some other lame brain mistake of mine. Thanks!

One more thing. The test linked against the signals shared library. Linking statically speeds things up measurably, but not much, and does not change the order-of-magnitude difference.

Jody Hagins wrote:
I have not done much examination either, but to satisfy my curiosity, I hacked together a "lite" implementation of the signals interface that provides *minimum* functionality. Same interface as boost::signal<>, w.r.t. connect(), disconnect() and operator() (i.e, dispatching a signal). Also, allows connect/disconnect/replace inside a slot handler. It does not provide return value combining and the fancier features.
This is very useful information! The primary observation I make from this is how much slower by ratio Boost.Signals is. I'm going to look at your test 'lite' implementation later. Is there some additional feature Boost.Signals is offering that causes it to do all of this extra work?
Note that there seems to be some heavy overhead just in calling signal(). The output files show several different size runs, on two different architectures/compiler versions.
It's also useful to know how much of a win it is to group as many calls as possible into a single signal rather than separating them into separate signals. Aaron W. LaFramboise

On Tue, 30 Nov 2004 23:28:48 -0600 "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com> wrote:
This is very useful information!
I hope it is correct ;-) Can you verify similar performance characteristics?
The primary observation I make from this is how much slower by ratio Boost.Signals is. I'm going to look at your test 'lite' implementation later. Is there some additional feature Boost.Signals is offering that causes it to do all of this extra work?
Strictly guessing here... probably something to do with combining the return values.

On a different note, I have resisted looking at Boost.Preprocessor for a long time (since the first look made my head spin). However, after just reading a post on Boost.PP, relative to expanding parameters, I decided to wade into the shallows. Here is the result, in case you want to play with larger arity, or want to change the operator() code... I have to admit that a devilish grin started forming on my face as I was writing this little piece of code, which should be a good replacement in the Signals.hpp file. Replace the rest of the file starting with "template <typename Signature> struct Signal;"

------------------------------------

template <typename Signature>
struct Signal;

} // end signals namespace
} // end lite namespace

// Generate the code for all Signal<> specializations.
#include "boost/preprocessor/repetition.hpp"
#include "boost/preprocessor/arithmetic/add.hpp"
#include "boost/preprocessor/punctuation/comma_if.hpp"
#include "boost/preprocessor/comparison/less.hpp"
#include "boost/preprocessor/facilities/empty.hpp"

#ifndef LITE_SIGNAL_MAX_ARITY
#define LITE_SIGNAL_MAX_ARITY 10
#endif

#undef LITE_PP_DEF
#define LITE_PP_DEF(z, n, lastn) \
    template <typename R \
              BOOST_PP_COMMA_IF(n) BOOST_PP_ENUM_PARAMS(n, typename T)> \
    struct Signal < R ( BOOST_PP_ENUM_PARAMS(n, T) ) > \
        : public detail::Signal_Base<R (BOOST_PP_ENUM_PARAMS(n, T))> \
    { \
        R operator()( BOOST_PP_ENUM_BINARY_PARAMS(n, T, t) ) const \
        { \
            typename list::iterator i = list_.begin(); \
            while (i != list_.end()) \
            { \
                if (i->function_) \
                { \
                    (i++)->function_( BOOST_PP_ENUM_PARAMS(n, t) ); \
                } \
                else \
                { \
                    i = list_.erase(i); \
                } \
            } \
        } \
    } BOOST_PP_IF(BOOST_PP_LESS(n, lastn), ;, BOOST_PP_EMPTY())

namespace lite {
namespace signals {

BOOST_PP_REPEAT( \
    BOOST_PP_ADD(LITE_SIGNAL_MAX_ARITY, 1), \
    LITE_PP_DEF, \
    LITE_SIGNAL_MAX_ARITY);

} // end signals namespace
} // end lite namespace

#undef LITE_PP_DEF

#endif // lite__Signal__hpp_

On Wed, 1 Dec 2004 10:42:50 -0500, Jody Hagins <jody-boost-011304@atdesk.com> wrote:
On Tue, 30 Nov 2004 23:28:48 -0600 "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com> wrote:
This is very useful information!
I hope it is correct ;-) Can you verify similar performance characteristics?
Jody - I ran your code on Windows 2000, compiled with VC++ 7.1 and got similar results:

===== 1000000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1    1000000 26.0735  0.1077
       10     100000  8.0136  0.0359
       50      20000  7.0922  0.0296
      100      10000  5.4384  0.0334
      250       4000  6.0977  0.0295
      500       2000  5.4257  0.0291
     1000       1000  6.4010  0.0388
     5000        200  6.5758  0.1964
    10000        100  5.5138  0.2056
    50000         20  5.5390  0.2022
   100000         10  6.5492  0.2038
   500000          2  6.5538  0.2163

The executable was built with the command 'cl -EHsc -O1 speed_test.cpp' (synchronous exception handling on, minimize size) and pulled in static libraries for Boost.Signals, C/C++ runtimes etc.

HTH
Stuart Dootson

Jody Hagins wrote:
On Wed, 1 Dec 2004 18:17:37 +0000 Stuart Dootson <stuart.dootson@gmail.com> wrote:
Jody - I ran your code on Windows 2000, compiled with VC++ 7.1 and got similar results :
Thanks for the info. Did you have to make any changes to the code?
I have also managed to get the tests working, but needed to make the following changes (VC++ does not have <sys/time.h> by default):

===== [begin] =====
// workaround for missing <sys/time.h> on MS compilers...
#if defined(BOOST_MSVC)
#include <windows.h> // for timeval
#include <time.h>
#pragma comment( lib, "winmm.lib" )

int gettimeofday(struct timeval* tp, void* tzp)
{
    DWORD t = timeGetTime(); // milliseconds since system start
    tp->tv_sec = t / 1000;
    tp->tv_usec = (t % 1000) * 1000; // convert leftover milliseconds to microseconds
    return 0; // 0 indicates that the call succeeded.
}

inline void timersub(const timeval* tvp, const timeval* uvp, timeval* vvp)
{
    vvp->tv_sec = tvp->tv_sec - uvp->tv_sec;
    vvp->tv_usec = tvp->tv_usec - uvp->tv_usec;
    if (vvp->tv_usec < 0)
    {
        --vvp->tv_sec;
        vvp->tv_usec += 1000000;
    }
}
#else
#include <sys/time.h>
#endif
// ...end of missing <sys/time.h> workaround
===== [end] =====

Also included is a running comment on what the profiler is doing, so the user can tell what is going on. Have you thought about using boost::timer? I have also created a v2 jam file to make it easier to build on different compilers.

It would be interesting to hook this into Boost.Test as this would make it easier to capture the results and perform processing/analysis on the data. Also, it would possibly be beneficial to have some sort of Boost.Profiling library that sits on top of Boost.Test to make it easier to do this.

Regards,
Reece

exe speed_test
    : speed_test.cpp /boost/signals//boost_signals/<link>static
    : <optimization>speed
    ;
stage bin : speed_test ;

#include "Connection.hpp"
#include "Signal.hpp"
#include "boost/signal.hpp"
#include "boost/bind.hpp"
#include <vector>
#include <cstdio>
#include <iostream>

// workaround for missing <sys/time.h> on MS compilers, as shown above...
#if defined(BOOST_MSVC)
// ... gettimeofday/timersub definitions from the [begin]/[end] block ...
#else
#include <sys/time.h>
#endif

struct X
{
    X(int x = 0) : x_(x) { }
    int x_;
};

void foo(X & x) { x.x_ += 1234; }
void blarg(X & x) { x.x_ /= 71; }

struct bar
{
    void operator()(X & x) { x.x_ *= 7; }
};

struct foobar
{
    void doit(X & x) { x.x_ += 7; }
};

struct timing
{
    size_t nslots_;
    size_t ncalls_;
    timeval start1_;
    timeval stop1_;
    timeval start2_;
    timeval stop2_;
};

timeval operator-(timeval const & x, timeval const & y)
{
    timeval result;
    timersub(&x, &y, &result);
    return result;
}

double as_double(timeval const & x)
{
    double result = x.tv_usec;
    result /= 1000000;
    result += x.tv_sec;
    return result;
}

int main(int argc, char * argv[])
{
    try
    {
        std::vector<timing> timings;
        size_t num_slots[] = { 1, 10, 50, 100, 250, 500, 1000, 5000,
                               10000, 50000, 100000, 500000, 0 };

        typedef lite::signals::Signal<void (X &)> Signal1;
        Signal1 signal1;
        typedef boost::signal<void (X &)> Signal2;
        Signal2 signal2;

        size_t totcalls[] = { 1000, 10000, 100000, 1000000, 0 };
        for (size_t * tc = totcalls; *tc > 0; ++tc)
        {
            size_t total_calls = *tc;
            std::cout << "profiling for " << total_calls << " total calls\n";
            for (size_t * ns = num_slots; *ns > 0 && *ns <= total_calls; ++ns)
            {
                std::cout << '.';
                size_t nslots = *ns;
                size_t niters = total_calls / nslots;

                signal1.disconnect_all_slots();
                signal1.connect(bar());
                foobar foobar1;
                signal1.connect(boost::bind(&foobar::doit, &foobar1, _1));

                signal2.disconnect_all_slots();
                signal2.connect(bar());
                foobar foobar2;
                signal2.connect(boost::bind(&foobar::doit, &foobar2, _1));

                std::vector<foobar> slots(nslots);
                for (size_t i = 0; i < slots.size(); ++i)
                {
                    signal1.connect(boost::bind(&foobar::doit, &slots[i], _1));
                    signal2.connect(boost::bind(&foobar::doit, &slots[i], _1));
                }

                X my_x(5);
                timing t;
                t.nslots_ = nslots;
                t.ncalls_ = niters;

                gettimeofday(&t.start1_, 0);
                for (size_t i = 0; i < niters; ++i)
                {
                    signal1(my_x);
                }
                gettimeofday(&t.stop1_, 0);

                X bs_x(5);
                gettimeofday(&t.start2_, 0);
                for (size_t i = 0; i < niters; ++i)
                {
                    signal2(bs_x);
                }
                gettimeofday(&t.stop2_, 0);

                timings.push_back(t);
                if (my_x.x_ != bs_x.x_)
                {
                    std::cerr << "my_x(" << my_x.x_ << ") != bs_x("
                              << bs_x.x_ << ")\n";
                }
            }
            std::cout << '\n';
        }

        size_t last_size = (size_t)-1;
        for (size_t i = 0; i < timings.size(); ++i)
        {
            if (last_size != timings[i].nslots_ * timings[i].ncalls_)
            {
                last_size = timings[i].nslots_ * timings[i].ncalls_;
                fprintf(stdout, "\n===== %u Total Calls =====\n", last_size);
                fprintf(stdout, "Num Slots Calls/Slot Boost Lite\n");
                fprintf(stdout, "--------- ---------- ------- -------\n");
            }
            fprintf(stdout, "%9u%14u%11.4f%10.4f\n",
                    timings[i].nslots_, timings[i].ncalls_,
                    as_double(timings[i].stop2_ - timings[i].start2_),
                    as_double(timings[i].stop1_ - timings[i].start1_));
        }
        return 0;
    }
    catch (std::exception const & ex)
    {
        std::cerr << "exception: " << ex.what() << std::endl;
    }
    return 1;
}

On Wed, 01 Dec 2004 22:36:37 +0000 Reece Dunn <msclrhd@hotmail.com> wrote:
I have also managed to get the tests working, but needed to make the following changes (VC++ does not have <sys/time.h> by default):
Thanks!
Also included is a running comment on what the profiler is doing, so the user can tell what is going on.
Hmmm. I seem to have missed that part.
Have you thought about using boost::timer? I have also created a v2 jam file to make it easier to build on different compilers.
From the timer.hpp header file...

// It is recommended that implementations measure wall clock rather than CPU
// time since the intended use is performance measurement on systems where
// total elapsed time is more important than just process or CPU time.

However, for better portability (and higher boost usage points), I can use date_time. The test program, using date_time, is attached.
It would be interesting to hook this into Boost.Test as this would make it easier to capture the results and perform processing/analysis on the data. Also, it would possibly be beneficial to have some sort of Boost.Profiling library that sits on top of Boost.Test to make it easier to do this.
I agree, but I am still a Boost.Build/Boost.Test neophyte. The jamfile is simple enough, but what is *not* there is what makes it confusing for us old "make" hacks...

On Wed, 1 Dec 2004 15:11:03 -0500, Jody Hagins <jody-boost-011304@atdesk.com> wrote:
On Wed, 1 Dec 2004 18:17:37 +0000 Stuart Dootson <stuart.dootson@gmail.com> wrote:
Jody - I ran your code on Windows 2000, compiled with VC++ 7.1 and got similar results :
Thanks for the info. Did you have to make any changes to the code?
Just changed the timing mechanism to use QueryPerformanceCounter (Win32 call) as the VC++ runtime doesn't have gettimeofday. Stuart Dootson

Hi all:

I repeated this test case on my SuSE 9.2 box; the result is:

===== 1000000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1    1000000  1.0813  0.1065
       10     100000  0.4274  0.0378
       50      20000  0.3676  0.0325
      100      10000  0.3717  0.0340
      250       4000  0.3832  0.0402
      500       2000  0.3814  0.0442
     1000       1000  0.3630  0.0456
     5000        200  0.4770  0.0972
    10000        100  0.5525  0.2437
    50000         20  0.5187  0.2084
   100000         10  0.5478  0.2415
   500000          2  0.8851  3.6850

a bit different from your original post. And the whole situation is a bit improved when compiling with the -funit-at-a-time option, but the trend is the same.

On Tue, 30 Nov 2004 23:13:43 -0500, Jody Hagins <jody-boost-011304@atdesk.com> wrote:
On Sun, 17 Oct 2004 19:36:39 -0500 "Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com> wrote:
On the other hand, as Alexandrescu in _Modern C++ Design_ quotes Len Lattanzi, "Belated pessimization is the leaf of no good." Alexandrescu goes on to write, "A pessimization of one order of magnitude in the runtime of a core object like a functor, a smart pointer, or a string can easily make the difference between success and failure for a whole project" (Ch. 4, p. 77). Library writers often do not have the same liberty as application writers to go back later, after profiling, and fix their slow code.
Right. My application is entirely event driven, with boost::signal() dispatching every event. The application will be processing thousands of messages per second, each one causing at least one signal() invocation.
Preliminary analysis suggests that the performance efficiency of the demultiplexor when using this dispatcher is "bad." While I can not yet offer meaningful numbers, it seems that there is something quite slow happening under the covers in the Signals library. I have not examined the implementation of the Signals library at all, however. The extent of this problem, and whether it is easily fixable, remains to be seen.
I have not done much examination either, but to satisfy my curiosity, I hacked together a "lite" implementation of the signals interface that provides *minimum* functionality. Same interface as boost::signal<>, w.r.t. connect(), disconnect() and operator() (i.e, dispatching a signal). Also, allows connect/disconnect/replace inside a slot handler. It does not provide return value combining and the fancier features. However, I can do this...
#ifdef REPLACE_BOOST_SIGNALS
#define PE_SIGNAL_TEMPLATE lite::signals::Signal
#else
#define PE_SIGNAL_TEMPLATE boost::signal
#endif
and easily switch between boost::signal and my lite version. Thus, I can verify that my code compiles and runs the same (and I can verify that the tests give the same results for either implementation). However, a very quick test shows that boost::signal is at least 2 orders of magnitude slower than the "lite" version. Note that this code compiles/runs on g++, and I have not tried it on other compilers.
The attached files...
Connection.hpp and Signal.hpp implement the "lite" signal interface. speed_test.cpp is a first glance at speed comparisons between boost::signal and the "lite" version. pt50.txt is output of running the test on a SuSE based Opteron. shandalle.txt is output of running the test on a RH7.3 based Xeon.
The format is...

===== 1000000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1    1000000 17.9360  0.0701
       10     100000  5.3777  0.0289
       50      20000  4.0621  0.0244
      100      10000  3.9715  0.0232
      250       4000  3.8375  0.0254
      500       2000  3.8873  0.0237
     1000       1000  3.7644  0.0229
     5000        200  4.0361  0.1169
    10000        100  4.0450  0.1678
    50000         20  3.9394  0.1953
   100000         10  3.9916  0.2268
   500000          2  3.8303  0.1128
The first line means that we connected 1 slot to the signal, and called signal() 1000000 times, resulting in a total of 1000000 slot invocations (sort of... there are a few extra). It took 17.9360 seconds to make these calls with the boost version, and 0.0701 seconds to make these calls with the lite version.
The last line, obviously, means that we connected 500000 slots to the signal, and called signal() 2 times, resulting in the same number of slot invocations, with a time of 3.8303 seconds for boost and 0.1128 seconds for lite.
Note that there seems to be some heavy overhead just in calling signal(). The output files show several different size runs, on two different architectures/compiler versions.
The test is not really that great, but is, I think, a reasonable first attempt at measuring the performance. Of course, I understand that I may be measuring the worst case features of signal, and the lite version does not even come close to the quality, depth, or breadth of the boost implementation. However, it would be nice if someone could give boost::signal a "boost" for what I think are very common use cases. It seems that using it in a simple way requires a heavy price for features that are not used. This seems to fly in the face of common library design (where we ATTEMPT to make users not have to pay for things they are not using).
Also, note that the lite version degrades as the number of slots increases, indicative of the overhead of iterating through the list of "slots." The overhead difference is measurable between using std::list and std::vector, but std::list is easier to implement supporting connect/disconnect from within a "slot" handler.
Comments?
P.S. I hope no one sees this as a slam on boost::signal or Doug, as I feel totally the opposite and am extremely grateful for the Boost.Community. In fact, I'd really like someone to point out where my test is woefully flawed, or my use of boost::signal is terribly misguided, or some other lame brain mistake of mine.
Thanks!

On Thu, 2 Dec 2004 17:22:44 +0800 XinWei Hu <huxw1980@gmail.com> wrote:
Hi all:
I repeated this test case on my SuSE 9.2 box; the result is:
===== 1000000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1    1000000  1.0813  0.1065
       10     100000  0.4274  0.0378
       50      20000  0.3676  0.0325
      100      10000  0.3717  0.0340
      250       4000  0.3832  0.0402
      500       2000  0.3814  0.0442
     1000       1000  0.3630  0.0456
     5000        200  0.4770  0.0972
    10000        100  0.5525  0.2437
    50000         20  0.5187  0.2084
   100000         10  0.5478  0.2415
   500000          2  0.8851  3.6850
a bit different from your original post.
And the whole situation is a bit improved when compiling with the -funit-at-a-time option, but the trend is the same.
This is very interesting, especially the dramatic improvement in the Boost numbers and the last test for Lite. What compiler version and OS? Also, what CPU type and how much memory is available? What version of Boost? Thanks!!!

On Thu, 2 Dec 2004 17:22:44 +0800 XinWei Hu <huxw1980@gmail.com> wrote:
And the whole situation is a bit improved when compiling with the -funit-at-a-time option, but the trend is the same.
This should be enabled with -O3. From the g++ 3.3.3 manpage...

    -O3 Optimize yet more. -O3 turns on all optimizations specified by
        -O2 and also turns on the -finline-functions, -fweb,
        -funit-at-a-time, -ftracer, -funswitch-loops and
        -frename-registers options.

I profiled speed_test.C with Quantify to determine where the signals library was spending its time. On my SPARC/Solaris system with gcc 3.3.2 it was spending about 1/3 of its time in malloc, mostly as the result of the cache construction inside of slot_call_iterator.

At home I decided to try replacing the shared_ptr in slot_call_iterator with an instance variable of the result_type. Since slot_call_iterator uses the shared_ptr to determine if the cache is valid, I had to add a bool indicating if the cached value is valid or not. These changes made the benchmark run about two to four times faster (on my home system, a 650 MHz Duron with gcc 3.3.2 running Debian). I expect better results with my SPARC machine, because the malloc implementation seems to be slower.

There are several drawbacks to my modifications. It's harder to maintain because of the added bool. The result_type used by slot_call_iterator must now have a default constructor. If it is expensive to copy the result_type, and slot_call_iterator is copied a lot, replacing the shared_ptr with an instance variable will actually make things slower. I don't know enough about the internals to weigh how important these issues are.

I believe the correct way to do things is to create a cache interface that encapsulates the behavior the slot_call_iterator needs, and to then choose the appropriate cache implementation at runtime using the mpl.

I've attached both a diff for my changes to slot_call_iterator.hpp and the modified file. I'd be interested in knowing how it changes the performance on other platforms.
Current performance on my 650 MHz Duron:

===== 1000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1       1000  0.0073  0.0002
       10        100  0.0025  0.0001
       50         20  0.0023  0.0001
      100         10  0.0021  0.0001
      250          4  0.0021  0.0001
      500          2  0.0024  0.0001
     1000          1  0.0026  0.0002

===== 10000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1      10000  0.0680  0.0022
       10       1000  0.0258  0.0007
       50        200  0.0214  0.0006
      100        100  0.0209  0.0006
      250         40  0.0209  0.0006
      500         20  0.0218  0.0008
     1000         10  0.0246  0.0008
     5000          2  0.0254  0.0025
    10000          1  0.0257  0.0027

===== 100000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1     100000  0.7370  0.0224
       10      10000  0.2573  0.0090
       50       2000  0.2284  0.0067
      100       1000  0.2691  0.0072
      250        400  0.2111  0.0069
      500        200  0.2202  0.0094
     1000        100  0.2635  0.0259
     5000         20  0.2684  0.0318
    10000         10  0.2749  0.0330
    50000          2  0.2629  0.0266
   100000          1  0.2638  0.0290

===== 1000000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1    1000000  6.9143  0.2232
       10     100000  2.5620  0.0752
       50      20000  2.2086  0.0669
      100      10000  2.1817  0.0724
      250       4000  2.1839  0.0894
      500       2000  2.1643  0.1066
     1000       1000  2.6981  0.3238
     5000        200  2.7720  0.3870
    10000        100  2.7220  0.3980
    50000         20  2.7763  0.3479
   100000         10  2.8006  0.3774
   500000          2  2.6703  0.2991

slot_call_iterator with an instance variable instead of a shared_ptr, on my 650 MHz Duron:
===== 1000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1       1000  0.0014  0.0002
       10        100  0.0004  0.0001
       50         20  0.0003  0.0001
      100         10  0.0406  0.0001
      250          4  0.0003  0.0001
      500          2  0.0004  0.0001
     1000          1  0.0007  0.0002

===== 10000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1      10000  0.0145  0.0022
       10       1000  0.0039  0.0007
       50        200  0.0030  0.0006
      100        100  0.0029  0.0006
      250         40  0.0029  0.0006
      500         20  0.0033  0.0008
     1000         10  0.0066  0.0009
     5000          2  0.0073  0.0025
    10000          1  0.0076  0.0029

===== 100000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1     100000  0.1446  0.0219
       10      10000  0.0385  0.0075
       50       2000  0.0302  0.0066
      100       1000  0.0288  0.0065
      250        400  0.0296  0.0068
      500        200  0.0344  0.0083
     1000        100  0.0949  0.0257
     5000         20  0.0844  0.0318
    10000         10  0.0819  0.0329
    50000          2  0.0825  0.0294
   100000          1  0.0741  0.0282

===== 1000000 Total Calls =====
Num Slots Calls/Slot Boost Lite
--------- ---------- ------- -------
        1    1000000  1.5345  0.2376
       10     100000  0.6345  0.0758
       50      20000  0.6562  0.2402
      100      10000  0.2960  0.0686
      250       4000  0.3477  0.0932
      500       2000  0.5850  0.1311
     1000       1000  1.7935  0.3861
     5000        200  1.0274  0.4182
    10000        100  1.1677  0.4073
    50000         20  1.1002  0.7818
   100000         10  1.5208  0.4197
   500000          2  0.7921  0.2938

Robert Zeh
http://home.earthlink.net/~rzeh

On Dec 2, 2004, at 9:32 AM, Robert Zeh wrote:
At home I decided to try replacing the shared_ptr in slot_call_iterator with an instance variable of the result_type. Since slot_call_iterator uses the shared_ptr to determine if the cache is valid, I had to add a bool indicating if the cached value is valid or not. These changes made the benchmark run about two to four times faster (on my home system, a 650 MHz Duron with gcc 3.3.2 running Debian). I expect better results with my SPARC machine, because the malloc implementation seems to be slower.
Interesting! I never would have expected that to be such a bottleneck.
There are several drawbacks to my modifications. It's harder to maintain because of the added bool. The result_type used by slot_call_iterator must now have a default constructor. If it is expensive to copy the result_type, and slot_call_iterator is copied a lot, replacing the shared_ptr with an instance variable will actually make things slower. I don't know enough about the internals to weigh how important these issues are.
Unfortunately, some of them are pretty important. For instance, if one tried to have a result_type that is an iterator, copying a slot_call_iterator that has not been dereferenced could result in undefined behavior, because the iterator would have been default-constructed and therefore singular. That doesn't make the optimization invalid, of course, but it does limit where we can apply the optimization.
I believe the correct way to do things is to create a cache interface that encapsulates the behavior the slot_call_iterator needs, and to then choose the appropriate cache implementation at runtime using the mpl.
We could do it at compile time using some combination of type traits, probably. Unfortunately, those that come to mind--is_default_constructible and is_assignable--don't give us all the information we need, because iterators will return "true" for both but will have the wrong behavior. We would need to introduce a new trait for this.

Actually, there is one other corner case that's really ugly. The combiner could theoretically do something like this:

  template<typename InputIterator>
  result_type operator()(InputIterator first, InputIterator last)
  {
    while (first != last) {
      InputIterator annoying = first;
      *first;
      if (*annoying)
        /* ... */
    }
  }

I *think* that's valid code according to the Input Iterator requirements (but not 100% sure!), and without the pointer "*annoying" would call the slot again...

Oh, I know a solution now... we stick an "optional<R> slot_result" into the signal's operator() and keep a pointer to that variable in the slot_call_iterator. The optional<R> has nearly the same effect as the cache interface you mentioned, but it doesn't require default-constructibility; storing a pointer to it ensures that slot_iterator copies are fast and that all slot_iterators see the updates (for those truly pathological combiners).

I'm really interested in this thread but, as seems to happen, it's come at a time when I can't dedicate a whole lot of time to it. After December 10th, I'm all over it :) The timing tests, "lite" versions, performance results and performance analyses are greatly appreciated... with any luck we'll be able to improve Signals performance or come up with its replacement :)

On the technical side, I expect the orders-of-magnitude speed lag in Signals comes from the named_slot_map. Jody's "lite" version does not support named slots. Perhaps that feature is too painful to have, and we should just drop back to using a std::list as Jody did, and perhaps provide an alternative signal type that allows named slots.

Doug

Doug Gregor <dgregor@cs.indiana.edu> writes:
... [snipped]
Unfortunately, some of them are pretty important. For instance, if one tried to have a result_type that is an iterator, copying a slot_call_iterator that has not been dereferenced could result in undefined behavior, because the iterator would have been default-constructed and therefore singular. That doesn't make the optimization invalid, of course, but it does limit where we can apply the optimization.
This is just a thought, but I believe this could be fixed by defining the caching object so that it only copies the cached value if the cache is valid.
We could do it at compile time using some combination of type traits, probably. Unfortunately, those that come to mind--is_default_constructible and is_assignable--don't give us all the information we need, because iterators will return "true" for both but will have the wrong behavior. We would need to introduce a new trait for this.
Actually, there is one other corner case that's really ugly. The combiner could theoretically do something like this:
template<typename InputIterator>
result_type operator()(InputIterator first, InputIterator last)
{
  while (first != last) {
    InputIterator annoying = first;
    *first;
    if (*annoying)
      /* ... */
  }
}
I *think* that's valid code according to the Input Iterator requirements (but not 100% sure!), and without the pointer "*annoying" would call the slot again...
Oh, I know a solution now... we stick an "optional<R> slot_result" into the signal's operator() and keep a pointer to that variable in the slot_call_iterator. The optional<R> has nearly the same effect as the cache interface you mentioned, but it doesn't require default-constructibility; storing a pointer to it ensures that slot_iterator copies are fast and that all slot_iterators see the updates (for those truly pathological combiners).
Doesn't the optional<R> value have to be allocated on the heap when a slot_call_iterator is created, and doesn't it have to be a shared_ptr so that the copy semantics will be correct? I ask because I'm trying to avoid any heap allocation, which I think of as slow. Robert

On Dec 2, 2004, at 1:21 PM, Robert Zeh wrote:
Doug Gregor <dgregor@cs.indiana.edu> writes:
... [snipped]
Unfortunately, some of them are pretty important. For instance, if one tried to have a result_type that is an iterator, copying a slot_call_iterator that has not been dereferenced could result in undefined behavior, because the iterator would have been default-constructed and therefore singular. That doesn't make the optimization invalid, of course, but it does limit where we can apply the optimization.
This is just a thought, but I believe this could be fixed by defining the caching object so that it only copies the cached value if the cache is valid.
You're right, of course; we even do this in the implementation of named_slot_map! :)
Doesn't the optional<R> value have to be allocated on the heap when a slot_call_iterator is created,
Nope :) optional<R> does the equivalent of storing an R and storing a bool, both on the stack, but doesn't actually default-construct the R. It uses aligned_storage so that it has the space available within the optional<R>, but does an in-place new when you want to copy a value in.
and doesn't it have to be a shared_ptr so that the copy semantics will be correct?
So long as the optional<R> is in the signal's operator(), I think the slot_call_iterators can just have a pointer to that optional<R> and we'll be okay... incrementing a slot_call_iterator would clear out the optional<R>.
I ask because I'm trying to avoid any heap allocation, which I think of as slow.
Right, and this does avoid heap allocation. Doug

Doug Gregor <dgregor@cs.indiana.edu> writes:
[snipped]
doesn't the optional<R> value have to be allocated on the heap when a slot_call_iterator is created,
Nope :) optional<R> does the equivalent of storing an R and storing a bool, both on the stack, but doesn't actually default-construct the R. It uses aligned_storage so that it has the space available within the optional<R>, but does an in-place new when you want to copy a value in.
and doesn't it have to be a shared_ptr so that the copy semantics will be correct?
So long as the optional<R> is in the signal's operator(), I think the slot_call_iterators can just have a pointer to that optional<R> and we'll be okay... incrementing a slot_call_iterator would clear out the optional<R>.
[ snip ]

First let me mention that optional<R> is really cool. I had never really looked at it before today.

Let me see if I have the basic ideas down for the changes to slot_call_iterator and signal's operator().

1) Modify slot_call_iterator's constructor to take an additional argument: a pointer to an optional<R> that it uses to cache the value of the slot call. The slot_call_iterator will only use the optional<R> pointer; it will never delete it.

2) Have signal's operator() provide the pointer that slot_call_iterator's constructor now requires. The pointer will be to a stack-allocated optional<R>.

Wouldn't this create a slot_call_iterator that is fragile? Copies of the slot_call_iterator would share the optional<R> pointer, which could lead to really interesting results. The slot_call_iterator's lifetime would have to fall within the optional<R>'s lifetime, or the slot_call_iterator would be accessing an invalid pointer.

Robert Zeh

On Dec 2, 2004, at 6:05 PM, Robert Zeh wrote:
First let me mention that optional<R> is really cool. I had never really looked at it before today.
Let me see if I have the basic ideas down for the changes to slot_call_iterator and signal's operator(). [snip]
Yep, you've got it.
Wouldn't this create a slot_call_iterator that is fragile? Copies of the slot_call_iterator would share the optional<R> pointer, which could lead to really interesting results. The slot_call_iterator's lifetime would have to fall within the optional<R>'s lifetime, or the slot_call_iterator would be accessing an invalid pointer.
Yes, you are correct. But slot_call_iterators have always been fragile, because they have references to all of the arguments passed to the signal as well... so this won't be a change in behavior. Doug

Doug Gregor <dgregor@cs.indiana.edu> writes:
On Dec 2, 2004, at 6:05 PM, Robert Zeh wrote:
First let me mention that optional<R> is really cool. I had never really looked at it before today.
Let me see if I have the basic ideas down for the changes to slot_call_iterator and signal's operator(). [snip]
Yep, you've got it.
Here is a patch for the discussed changes. I would be very interested in seeing how the patch affects performance on platforms other than my Duron box or my SPARC box. The only divergence from the plan was passing in a reference instead of a pointer for the boost::optional<result_type>. I'll be doing some performance testing and running some of the tests with it later (hopefully today). I'm inclined to think that these changes have to run faster, but I'd like to be sure. Robert

Robert Zeh <razeh@archelon-us.com> writes:
I'll be doing some performance testing and running some of the tests with it later (hopefully today). I'm inclined to think that these changes have to run faster, but I'd like to be sure.
Robert
Quantify now tells me that most of the time is being spent in the call to new that named_slot_map_iterator makes... Robert Zeh

On Dec 7, 2004, at 9:58 AM, Robert Zeh wrote:
Doug Gregor <dgregor@cs.indiana.edu> writes: Here is a patch for the discussed changes.
Looks good! I changed the reference to a pointer (so that the slot_call_iterator copy-assigns properly and can be default-constructed) and checked it in. Thanks! Doug

After consulting with Doug Gregor I've done some more work on signals performance. The patch I've attached removes the pImpl idiom in named_slot_map and iterator_facade. The signals regression tests still pass.

The performance numbers look better. Here are the speed test numbers from the current boost CVS.

===== 1000 Total Calls =====
Num Slots  Calls/Slot    Boost     Lite
---------  ----------  -------  -------
        1        1000   0.0306   0.0003
       10         100   0.0087   0.0001
       50          20   0.0067   0.0001
      100          10   0.0065   0.0001
      250           4   0.0064   0.0001
      500           2   0.0064   0.0001
     1000           1   0.0069   0.0003

===== 10000 Total Calls =====
Num Slots  Calls/Slot    Boost     Lite
---------  ----------  -------  -------
        1       10000   0.2996   0.0027
       10        1000   0.0848   0.0010
       50         200   0.0669   0.0010
      100         100   0.0645   0.0008
      250          40   0.0666   0.0008
      500          20   0.0677   0.0010
     1000          10   0.0697   0.0010
     5000           2   0.0679   0.0031
    10000           1   0.0758   0.0034

===== 100000 Total Calls =====
Num Slots  Calls/Slot    Boost     Lite
---------  ----------  -------  -------
        1      100000   3.0874   0.0270
       10       10000   0.8717   0.0098
       50        2000   0.6830   0.0082
      100        1000   0.6758   0.0075
      250         400   0.6560   0.0075
      500         200   0.6682   0.0082
     1000         100   0.7176   0.0164
     5000          20   0.7018   0.0445
    10000          10   0.7027   0.0369
    50000           2   0.7033   0.0345
   100000           1   0.7202   0.0346

===== 1000000 Total Calls =====
Num Slots  Calls/Slot    Boost     Lite
---------  ----------  -------  -------
        1     1000000  29.9847   0.2791
       10      100000   8.6300   0.0956
       50       20000   6.8220   0.0792
      100       10000   6.5083   0.0778
      250        4000   6.3998   0.0744
      500        2000   6.6824   0.0845
     1000        1000   7.2043   0.2038
     5000         200   7.0582   0.3525
    10000         100   7.1081   0.3722
    50000          20   7.1853   0.4216
   100000          10   7.2921   0.4401
   500000           2   7.8380   0.3574

And the numbers after the pImpl removal:

===== 1000 Total Calls =====
Num Slots  Calls/Slot    Boost     Lite
---------  ----------  -------  -------
        1        1000   0.0025   0.0003
       10         100   0.0008   0.0001
       50          20   0.0006   0.0001
      100          10   0.0006   0.0001
      250           4   0.0006   0.0001
      500           2   0.0006   0.0001
     1000           1   0.0011   0.0003

===== 10000 Total Calls =====
Num Slots  Calls/Slot    Boost     Lite
---------  ----------  -------  -------
        1       10000   0.0255   0.0026
       10        1000   0.0074   0.0009
       50         200   0.0060   0.0008
      100         100   0.0056   0.0007
      250          40   0.0057   0.0007
      500          20   0.0064   0.0009
     1000          10   0.0107   0.0009
     5000           2   0.0114   0.0031
    10000           1   0.0120   0.0034

===== 100000 Total Calls =====
Num Slots  Calls/Slot    Boost     Lite
---------  ----------  -------  -------
        1      100000   0.2454   0.0260
       10       10000   0.0733   0.0090
       50        2000   0.0580   0.0074
      100        1000   0.0599   0.0071
      250         400   0.0576   0.0073
      500         200   0.0626   0.0079
     1000         100   0.1244   0.0152
     5000          20   0.1374   0.0371
    10000          10   0.1293   0.0367
    50000           2   0.1182   0.0321
   100000           1   0.1319   0.0341

===== 1000000 Total Calls =====
Num Slots  Calls/Slot    Boost     Lite
---------  ----------  -------  -------
        1     1000000   2.4575   0.2629
       10      100000   0.7389   0.0912
       50       20000   0.6037   0.0759
      100       10000   0.5682   0.0819
      250        4000   0.5722   0.0721
      500        2000   0.6760   0.0866
     1000        1000   1.2747   0.1984
     5000         200   1.3177   0.3473
    10000         100   1.3181   0.3803
    50000          20   1.3439   0.4049
   100000          10   1.3590   0.4173
   500000           2   1.2095   0.3369

All of the numbers are from my Debian 650 MHz Duron.

Robert Zeh

On Wed, 30 Mar 2005 20:28:47 -0600 Robert Zeh <razeh@earthlink.net> wrote:
After consulting with Doug Gregor I've done some more work on signals performance. The patch I've attached removes the pImpl idiom in named_slot_map and iterator_facade.
The signals regression tests still pass.
The performance numbers look better. Here are the speed test numbers from the current boost CVS.
That is a very impressive improvement!!! I have not yet looked at the patch, but I am somewhat interested in the performance impact of the pImpl idiom by itself (if my understanding is correct and that is, indeed, the only change).

On Mar 30, 2005, at 9:28 PM, Robert Zeh wrote:
After consulting with Doug Gregor I've done some more work on signals performance. The patch I've attached removes the pImpl idiom in named_slot_map and iterator_facade.
The signals regression tests still pass.
Excellent, thanks! Did you happen to note the size of the test executables before and after, e.g. signal_test and/or signal_n_test? The original tradeoff in Signals was trying to favor smaller over faster, so I'm wondering if it was truly a tradeoff or just bad design :) I'll take an in-depth look at the patch as soon as I find some time. Doug

Doug Gregor <dgregor@cs.indiana.edu> writes:
On Mar 30, 2005, at 9:28 PM, Robert Zeh wrote:
After consulting with Doug Gregor I've done some more work on signals performance. The patch I've attached removes the pImpl idiom in named_slot_map and iterator_facade.
The signals regression tests still pass.
Excellent, thanks!
Did you happen to note the size of the test executables before and after, e.g. signal_test and/or signal_n_test? The original tradeoff in Signals was trying to favor smaller over faster, so I'm wondering if it was truly a tradeoff or just bad design :)
I'll take an in-depth look at the patch as soon as I find some time.
Here are the sizes of my builds, built with -O3 and gcc 3.3.5 on my Debian Duron system.

                 static (bytes)   dynamic (bytes)
  cvs build    :        115,734            65,222
  patched build:        108,465            58,623

Robert Zeh
http://home.earthlink.net/~rzeh

On Apr 1, 2005, at 7:22 PM, razeh wrote:
Doug Gregor <dgregor@cs.indiana.edu> writes:
On Mar 30, 2005, at 9:28 PM, Robert Zeh wrote:
After consulting with Doug Gregor I've done some more work on signals performance. The patch I've attached removes the pImpl idiom in named_slot_map and iterator_facade.
The signals regression tests still pass.
Excellent, thanks!
Did you happen to note the size of the test executables before and after, e.g. signal_test and/or signal_n_test? The original tradeoff in Signals was trying to favor smaller over faster, so I'm wondering if it was truly a tradeoff or just bad design :)
I'll take an in-depth look at the patch as soon as I find some time.
Here the sizes of my builds, built with -O3 and gcc 3.3.5 on my Debian Duron system.
                 static (bytes)   dynamic (bytes)
  cvs build    :        115,734            65,222
  patched build:        108,465            58,623
Excellent! I'll do an in-depth review of your patch tomorrow. Thanks again! Doug

On Mar 30, 2005, at 9:28 PM, Robert Zeh wrote:
After consulting with Doug Gregor I've done some more work on signals performance. The patch I've attached removes the pImpl idiom in named_slot_map and iterator_facade.
The signals regression tests still pass.
The performance numbers look better. Here are the speed test numbers from the current boost CVS.
Looks great! I've committed the patch to CVS. Thank you, again. For future reference, if you'd like to verify correctness of changes to signals you can run the random_signal_system test program, which beats on the connect/disconnect/invoke mechanism pretty hard. For instance, I checked your patch by building a system with 1000 signals, 0.01 edge probability, and running 50,000 iterations. Since it passed, we can be pretty darn confident that all is well. Doug

On Thu, 2 Dec 2004 11:25:50 -0500 Doug Gregor <dgregor@cs.indiana.edu> wrote:
Unfortunately, some of them are pretty important. For instance, if one tried to have a result_type that is an iterator, copying a slot_call_iterator that has not been dereferenced could result in undefined behavior, because the iterator would have been default-constructed and therefore singular. That doesn't make the optimization invalid, of course, but it does limit where we can apply the optimization.
What about a specialization for NO combiner? (If you are worried about results being "forgotten", then maybe deduce it when the return type is void.) This would still allow everything to work properly when you want a combiner (and it can always be optimized later ;-). In addition, it would let users avoid paying for the combiner machinery when they are not using it anyway...

On Dec 2, 2004, at 1:35 PM, Jody Hagins wrote:
What about a specialization for NO combiner? (If you are worried about results being "forgotten", then maybe deduce it when the return type is void.) This would still allow everything to work properly when you want a combiner (and it can always be optimized later ;-). In addition, it would let users avoid paying for the combiner machinery when they are not using it anyway...
We could specialize based on last_value<void>, which is essentially not a combiner. It's worth trying. Doug

Perhaps this could be of some interest in this discussion. http://www.codeproject.com/cpp/FastDelegate.asp The article talks about low-level optimization of delegates on a wide range of platforms. To quote: "Initially, I used boost::function, but I found that the memory allocation for the delegates was consuming over a third of the entire program running time!" Regards // Fredrik Blomqvist Doug Gregor wrote:
On Dec 2, 2004, at 9:32 AM, Robert Zeh wrote:
At home I decided to try replacing the shared_ptr in slot_call_iterator with an instance variable of the result_type. Since slot_call_iterator uses the shared_ptr to determine if the cache is valid, I had to add a bool indicating if the cached value is valid or not. These changes made the benchmark run about two to four times faster (on my home system, a 650 MHz Duron with gcc 3.3.2 running Debian). I expect better results with my SPARC machine, because the malloc implementation seems to be slower.
Interesting! I never would have expected that to be such a bottleneck.
There are several drawbacks to my modifications. It's harder to maintain because of the added bool. The result_type used by slot_call_iterator must now have a default constructor. If it is expensive to copy the result_type, and slot_call_iterator is copied a lot, replacing the shared_ptr with an instance variable will actually make things slower. I don't know enough about the internals to weigh how important these issues are.
Unfortunately, some of them are pretty important. For instance, if one tried to have a result_type that is an iterator, copying a slot_call_iterator that has not been dereferenced could result in undefined behavior, because the iterator would have been default-constructed and therefore singular. That doesn't make the optimization invalid, of course, but it does limit where we can apply the optimization.
I believe the correct way to do things is to create a cache interface that encapsulates the behavior the slot_call_iterator needs, and to then choose the appropriate cache implementation at runtime using the mpl.
We could do it at compile time using some combination of type traits, probably. Unfortunately, those that come to mind--is_default_constructible and is_assignable--don't give us all the information we need, because iterators will return "true" for both but will have the wrong behavior. We would need to introduce a new trait for this.
Actually, there is one other corner case that's really ugly. The combiner could theoretically do something like this:
template<typename InputIterator>
result_type operator()(InputIterator first, InputIterator last)
{
  while (first != last) {
    InputIterator annoying = first;
    *first;
    if (*annoying)
      /* ... */
  }
}
I *think* that's valid code according to the Input Iterator requirements (but not 100% sure!), and without the pointer "*annoying" would call the slot again...
Oh, I know a solution now... we stick an "optional<R> slot_result" into the signal's operator() and keep a pointer to that variable in the slot_call_iterator. The optional<R> has nearly the same effect as the cache interface you mentioned, but it doesn't require default-constructibility; storing a pointer to it ensures that slot_iterator copies are fast and that all slot_iterators see the updates (for those truly pathological combiners).
I'm really interested in this thread but, as seems to happen, it's come at a time when I can't dedicate a whole lot of time to it. After December 10th, I'm all over it :) The timing tests, "lite" versions, performance results and performance analyses are greatly appreciated... with any luck we'll be able to improve Signals performance or come up with its replacement :)
On the technical side, I expect the orders-of-magnitude speed lag in Signals comes from the named_slot_map. Jody's "lite" version does not support named slots. Perhaps that feature is too painful to have, and we should just drop back to using a std::list as Jody did, and perhaps provide an alternative signal type that allows named slots.
Doug

On Dec 2, 2004, at 3:03 PM, Fredrik Blomqvist wrote:
Perhaps this could be of some interest in this discussion. http://www.codeproject.com/cpp/FastDelegate.asp The article talks about low-level optimization of delegates on a wide range of platforms.
We have to be a little careful here, because the author is using some (really neat!) non-portable tricks.
To quote: "Initially, I used boost::function, but I found that the memory allocation for the delegates was consuming over a third of the entire program running time!"
Regards // Fredrik Blomqvist
Yes, this is a good point... one common case we had previously talked about optimizing is everything related to: function<int(int)> f(boost::bind(&X::m, x_ptr, _1)); Since most slots just bind an object to the member pointer, being able to store these in a function<> instance without allocating memory could be a big win. I know several people have asked for it on more than one occasion: the problem is implementing it portably (in a practical sense; making it standard C++ is easy). Doug
participants (12)
- Aaron W. LaFramboise
- Dan Eloff
- Doug Gregor
- Douglas Gregor
- Fredrik Blomqvist
- Jody Hagins
- razeh
- Reece Dunn
- Robert Zeh
- Robert Zeh
- Stuart Dootson
- XinWei Hu