proposal - Statistically robust function timing for performance decisions.

Dear All,
Having developed what I think is an easy to use C++ library for the robust and statistically meaningful timing of functions, I'm looking to place it under the terrifying glare of peer review and to find a home for it.
My motivation for its development was simply to be able to state:
"Function 'foo' is between 'x' percent and 'y' percent faster than function 'bar' at the 95% confidence level."
since I wish to make reliable, automatic, informed decisions about function choice at run-time.
While this looks like a trivial problem - it is not! Being able to determine some confidence limits on the relative speedup, particularly when other user processes are active, turns out to be quite challenging.
I am aware of code snippets such as
http://www.boost.org/doc/libs/1_34_1/libs/multi_index/perf/test_perf.cpp
based on wrap_code.cpp and, while it's better than many approaches, I'm dissatisfied with it. It's not as reliable as one might think at first blush. I am also aware of the timing code used internally in FFTW for selecting codelets - while it may be appropriate for that type of application, I do not view it as a reliable general solution to the problem.
I strongly suspect that many people reinvent the wheel regarding this problem and make decisions in an ad-hoc manner. There certainly doesn't seem to be a standard and reliable solution.
I'm not sure whether Boost is an appropriate home for such a (compile time) library. Is this something anyone 'out there' would be interested in? If it's not the right place - does anyone have any suggestions other than sourceforge.net?
I look forward to your responses.
Regards,
-ed
P.S. A typical example of its use is presented in the output below. Here we time two simple and otherwise identical functions, one of which (a) iterates 1000 times, the other (b) 1005. Once we demand of the timer a nominal percentage precision approaching the difference in expected run-time (0.5%) they can be discriminated successfully.
It should be noted that this was carried out simply using gettimeofday() while significant user processes were present!
==================================
Calibrate overhead (0,1)? 1
Point estimate of clock overhead (t_c): 0.0975994
Potential jitter on t_c : 0.633371
Increasing precision until I can discriminate between function a and b.
The nominal confidence interval represents the region: 1-alpha=0.95
Timing at a nominal precision of: 50%    0 1.82141 <--- Uncertain.
Timing at a nominal precision of: 10%    0 2.00391 <--- Uncertain.
Timing at a nominal precision of: 2%     0.383285 0.593955 <--- Ok.
Timing at a nominal precision of: 0.4%   0.411541 0.62412 <--- Ok.
Timing at a nominal precision of: 0.08%  0.448847 0.582617 <--- Ok.
Nominal confidence bounds on percentage speedup of A relative to B:
0.448847 [0.516456] 0.582617
==================================
------------------------------------------------
"No more boom and bust." -- Dr. J. G. Brown, 1997
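To illustrate how such a confidence statement can be produced at all, here is a minimal sketch of one possible approach: a crude percentile bootstrap over paired timing samples. This is not the library's method (later messages show it uses a Wilcoxon signed-rank interval) and it ignores clock overhead and sample-size control; the function and variable names are illustrative only.

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Crude percentile-bootstrap confidence interval on the percentage speedup
// of A relative to B, given paired timing samples taken under the same
// conditions.  Illustration only -- not the ejg_timer algorithm.
std::pair<double, double>
bootstrap_speedup_ci(const std::vector<double>& t_a,
                     const std::vector<double>& t_b,
                     int n_resamples = 1000)
{
    const std::size_t n = t_a.size();          // assumes t_a.size() == t_b.size() > 0
    std::vector<double> speedups;
    speedups.reserve(n_resamples);
    for (int r = 0; r < n_resamples; ++r) {
        double sum_a = 0.0, sum_b = 0.0;
        for (std::size_t i = 0; i < n; ++i) {  // resample the pairs with replacement
            const std::size_t k = std::rand() % n;
            sum_a += t_a[k];
            sum_b += t_b[k];
        }
        // Percentage by which A is faster than B on this resample.
        speedups.push_back(100.0 * (sum_b - sum_a) / sum_b);
    }
    std::sort(speedups.begin(), speedups.end());
    // The 2.5% and 97.5% percentiles give a nominal 95% interval.
    return std::make_pair(speedups[static_cast<std::size_t>(0.025 * n_resamples)],
                          speedups[static_cast<std::size_t>(0.975 * n_resamples)]);
}

If the resulting interval excludes zero one can claim, informally, a difference at roughly the 95% level; the library instead keeps increasing the nominal precision until its interval becomes decisive, which is what the stepped output above is showing.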

I am interested. Perhaps you can upload it to the Boost Vault? Is there any documentation (even minimal)?
Cheers, Patrick

Patrick Mihelich wrote:
I am interested. Perhaps you can upload it to the Boost Vault? Is there any documentation (even minimal)?
Cheers, Patrick
I'm also interested, and would like to see code + whatever docs you have so far. John

Add me to the bandwagon ;)
How do you handle the fact that sometimes the compiler may optimize out benchmark code because it has no side-effects?
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

On 18 Jul 2009, at 07:40, joel wrote:
Add me to the bandwagon ;)
How do you handle the fact that sometimes the compiler may optimize out benchmark code because it has no side-effects?
Easy. I don't!
When I started this I was concerned that might happen and was quite surprised to see that the compiler (g++ 4.0.1) didn't optimise out the function.
More generally, I don't think a function such as:
void f();
can be considered free of side-effects unless you can guarantee that it does not access global variables (or otherwise sneakily modify state). From the prototype of a compiled function I don't see how there's any way to know this. With fully templated code it may be far more subtle since the compiler can potentially 'see' everything.
That said, at the moment I want to concentrate on making the timings reliable on the assumption that problems such as you describe can be circumvented later on by someone who knows more about compilers than me!
-ed

Dear All,
As suggested I have packaged up a rough-n-ready version and stuffed it in the vault. It includes Doxygen-generated documentation (HTML/PDF) and an example (example_timer.cpp) which itself has copious comments.
Platform specifics (as far as I know) will be related to the chronometer. You may have to wrap your own chronometer (e.g. using something like gettimeofday() or getticks() from FFTW). I do not recommend using std::clock() for the included example as it is simply not precise enough - consequently you will grow bored waiting for it to finish!
http://www.boostpro.com/vault/index.php?action=downloadfile&filename=ejg_timer.zip&directory=Tools
http://tinyurl.com/nejw9o
On my Mac, the example can be built and run by unzipping the file and doing the following:
$ cd ejg_timer/
$ g++ -DNDEBUG -O3 -I. -o example_timer ejg/example_timer.cpp
$ ./example_timer
Calibrating the clock, for a slow timer this may take a while!
Point estimate of clock overhead (t_c): 0.0986196
Potential jitter on t_c (sigma_c) : 0.633371
Increasing precision until I can discriminate between function a and b.
The nominal confidence interval (minimum, [median], maximum) represents the region: 1-alpha=0.95
Timing at a nominal precision of 50%: We cannot distinguish between A and B: (-1.62543 [-0.339245] 1.78913)%
Timing at a nominal precision of 25%: We cannot distinguish between A and B: (0 [0] 2.38656)%
Timing at a nominal precision of 12.5%: We cannot distinguish between A and B: (0 [0] 0.69492)%
Timing at a nominal precision of 6.25%: We cannot distinguish between A and B: (0 [1.84168] 3.71728)%
Timing at a nominal precision of 3.125%: We cannot distinguish between A and B: (-0.173508 [0.757778] 0.924288)%
Timing at a nominal precision of 1.5625%: A is faster than B by: (0.563103 [0.690461] 0.940433)%
As you will note, there's a fair bit of extra cruft that has come along for the ride. In particular I think it is worth being aware of:
striding_iterator -- Perhaps Boost could do with this as a separate thing - it's pretty handy.
robust_linear_fit -- A least absolute deviation robust linear fit, used to calibrate the clock overhead. This is preferable to least squares as it is insensitive to outliers.
statistics::ks_test -- This implements a Kolmogorov-Smirnov two-sample test and is critical to answering the question 'have we taken enough samples'.
wilcoxon_ci -- Carry out a Wilcoxon signed-rank test. Used to determine confidence intervals on paired samples for the relative timing.
I will take some time to carefully review my thought processes to check that what I did is what I intended to do. There are some subtle points which stretch my statistics understanding regarding the repeated application of the ks_test - I suspect this is best approached with some Bayesian prior technique - that said, what I have seems to work well! Perhaps more by luck than judgement!
Comments and questions are welcome.
-ed
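For anyone curious about the ks_test step, the two-sample Kolmogorov-Smirnov statistic is just the largest vertical distance between the two empirical CDFs, which can be computed with a single merge-style pass over the sorted samples. The sketch below is only an illustration of the statistic; the actual statistics::ks_test in the archive may be organised quite differently.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-sample Kolmogorov-Smirnov statistic D = sup_x |F_a(x) - F_b(x)|,
// evaluated at every distinct sample value.  Illustrative sketch only.
double ks_statistic(std::vector<double> a, std::vector<double> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::size_t i = 0, j = 0;
    double d = 0.0;
    while (i < a.size() && j < b.size()) {
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] == x) ++i;   // step past all values equal to x
        while (j < b.size() && b[j] == x) ++j;
        const double f_a = static_cast<double>(i) / a.size();
        const double f_b = static_cast<double>(j) / b.size();
        d = std::max(d, std::fabs(f_a - f_b));
    }
    return d;
}

A common rule of thumb is to treat two samples of sizes n and m as indistinguishable at the 5% level when D stays below roughly 1.36*sqrt((n+m)/(n*m)); applied to, say, the first and second halves of the measurements, a small D suggests the timing distribution has stabilised and enough samples have been taken.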

Add me to the bandwagon ;)
How do you handle the fact that sometimes the compiler may optimize out benchmark code because it has no side-effects?
Ok Joel, now I've run into precisely the problem you flagged up. Since I uploaded that stuff to the file vault I thought 'better check it with the latest g++' on a different machine. Sure enough, when I go to -O1 I see no difference between the function calls... I guess it's just throwing them away!
On one hand, at least I'm getting consistent results - the relative percentage times reported state that there's no way to tell the difference! On the other, it means I'll have to think a little harder about avoiding this.
Any ideas?
-ed

On 18 Jul 2009, at 07:40, joel wrote:
Add me to the bandwagon ;)
How do you handle the fact that sometimes the compiler may optimize out benchmark code because it has no side-effects?
I modified the example to force side-effects by making the variables in the loop global. Consequently it now doesn't get optimised away - compilers sure are clever these days.
I have also fixed a couple of minor errors which I spotted when compiling under a different platform. The update may be obtained from:
http://tinyurl.com/ksbukc
After unpacking the zip file, typical compilation with G++ 4.1 and execution should yield something like:
$ g++-4 -DNDEBUG -O4 -o example_timer -I. example_timer.cpp
$ ./example_timer
Calibrating the clock, for a slow timer this may take a while!
Point estimate of clock overhead (t_c): 0.0941176
Potential jitter on t_c (sigma_c) : 9.45022e-14
Increasing precision until I can discriminate between function a and b.
The nominal confidence interval (minimum, [median], maximum) represents the region: 1-alpha=0.95
Timing at a nominal precision of 50%: We cannot distinguish between A and B: (-17.2061 [1.58791] 4.79154)%
Timing at a nominal precision of 25%: We cannot distinguish between A and B: (0 [0] 0)%
Timing at a nominal precision of 12.5%: We cannot distinguish between A and B: (-2.02578 [0.0953992] 1.30765)%
Timing at a nominal precision of 6.25%: We cannot distinguish between A and B: (0 [0] 0.471908)%
Timing at a nominal precision of 3.125%: We cannot distinguish between A and B: (0 [0.235676] 1.10004)%
Timing at a nominal precision of 1.5625%: A is faster than B by: (0.090861 [0.471734] 0.738181)%
Regards,
-ed
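For anyone following along, the 'global variable' trick is roughly the following (a sketch of the idea, not the code in example_timer.cpp): the inputs and results live in globals with external linkage, so within the translation unit the compiler cannot fold the loop away or prove it dead.

#include <cstddef>

// Globals with external linkage: another translation unit could read or
// write them, so (short of whole-program optimisation) the compiler must
// keep the loads and stores below.
double g_input = 1.000001;   // set at run time, e.g. from user input
double g_sink  = 0.0;

// Function 'a': 1000 iterations (in the spirit of the example's a/b pair).
void function_a() {
    double acc = 0.0;
    for (std::size_t i = 0; i < 1000; ++i)
        acc += static_cast<double>(i) * g_input;   // not a compile-time constant
    g_sink = acc;                                  // the result escapes
}

// Function 'b': 1005 iterations, otherwise identical.
void function_b() {
    double acc = 0.0;
    for (std::size_t i = 0; i < 1005; ++i)
        acc += static_cast<double>(i) * g_input;
    g_sink = acc;
}

Note that link-time or whole-program optimisation could, in principle, still see through this, which is why volatile and inline-assembly alternatives come up in the following messages.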

Edward Grace wrote:
I modified the example to force side-effects by making the variables in the loop global. Consequently it now doesn't get optimised away - compilers sure are clever these days.
Rather, I think you should use volatile or something, which is guaranteed by the C++ standard not to be optimized out.
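A minimal sketch of what the volatile suggestion could look like in this context (one reading of it, not code from the library): accesses to a volatile-qualified object are observable behaviour, so the compiler must perform them, and reading the input through a volatile also stops the whole computation being folded at compile time.

#include <cstddef>

volatile double v_input = 1.000001;   // forced run-time read
volatile double v_sink  = 0.0;        // forced run-time write

void function_a() {
    const double k = v_input;          // must actually be loaded
    double acc = 0.0;
    for (std::size_t i = 0; i < 1000; ++i)
        acc += static_cast<double>(i) * k;
    v_sink = acc;                      // must actually be stored
}

The catch, as noted a few messages below, is that volatile applied too liberally (for example to the loop accumulator itself) also suppresses optimisations one would want to keep, which distorts the very thing being timed.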

On 19 Jul 2009, at 22:15, Mathias Gaunard wrote:
Rather, I think you should use volatile or something, which is guaranteed by the C++ standard not to be optimized out.
Nice idea. I wondered if it could be applied to functions or function pointers. I'm not convinced it's properly supported in this context. Anyhow, I think a 'global variable' trick works. A more elegant solution is welcome!
-ed

On 19 Jul 2009, at 16:00, Edward Grace wrote:
Nice idea. I wondered if it could be applied to functions or function pointers. I'm not convinced it's properly supported in this context. Anyhow, I think a 'global variable' trick works. A more elegant solution is welcome!
Volatile will have more side effects and prevent optimizations that we might want the compiler to make.

Volatile will have more side effects and prevent optimizations that we might want the compiler to make.
Thanks for that comment. I looked into the true meaning of 'volatile' and unfortunately I don't think it's quite appropriate either.
I've spotted a trick in something highlighted by Joel de Guzman regarding (I think) the following:
http://tinyurl.com/kk858o
to avoid dead code and control certain cache behaviour. I think this is what I want to ensure. It appears that using boost::bind has a similar effect (at least as far as dead code is concerned) - but I do not want to rely on that.
-ed
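Without knowing exactly which trick the link describes, a widely used GCC/Clang-specific way to keep a result 'live' without volatile's per-access cost is an empty inline-assembly statement that pretends to read the value (sometimes called an 'escape'/'clobber' pair). A hedged sketch, not code from the link or from ejg_timer:

// Pretend opaque assembly reads *p: the pointed-to value must really exist.
static void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Pretend opaque assembly may read or write any memory: a compiler barrier.
static void clobber() {
    asm volatile("" : : : "memory");
}

// Usage inside a timed body: compute the result, then make it un-deletable.
double work(double k) {
    double acc = 0.0;
    for (int i = 0; i < 1000; ++i)
        acc += i * k;
    return acc;
}

void timed_body(double k) {
    double r = work(k);
    escape(&r);   // r (and hence the loop producing it) cannot be optimised away
}

Unlike volatile, this only constrains the code at the single point where it is used, so the body being timed is otherwise optimised normally.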

Dear Mathias,
On 19 Jul 2009, at 22:15, Mathias Gaunard wrote:
Rather, I think you should use volatile or something, which is guaranteed by the C++ standard not to be optimized out.
Obtaining a reliable 'fire and forget' solution to this problem is quite important. I have played a little with boost::bind - through the way it forwards the arguments I think the side-effect problem is avoided.
With reference to the timer code, ejg-timer-0.0.4.zip, available here:
http://tinyurl.com/lro5ok
The attached code is timing a pair of trivial functions that simply multiply or divide two numbers and compares to see which is faster. Clearly this will be data dependent. On my platform (Intel Mac), as expected, multiplication is generally faster than division.
$ ./example_binding
Enter a number: 4
Enter another: 8
When evaluating 4*0.125 vs. 4/8
There is no practical difference between division and multiplication
The difference is : 0
$ ./example_binding
Enter a number: 5
Enter another: 7
When evaluating 5*0.142857 vs. 5/7
Multiplication is quicker than division by around: 112.795%
The difference is : 1.11022e-16
Any comments?
-ed
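To make the boost::bind point concrete, the comparison being described is along these lines (a sketch with illustrative names, not the contents of the actual example): the operands are read at run time and bound into nullary callables, so the arithmetic cannot be folded away, and the two callables can be handed to a generic timing routine.

#include <iostream>
#include <boost/bind.hpp>
#include <boost/function.hpp>

// The two trivial operations being compared.  Returning the result gives
// the timing harness something to consume.
double multiply(double x, double inv_y) { return x * inv_y; }
double divide(double x, double y)       { return x / y; }

int main() {
    double a, b;
    std::cout << "Enter a number: ";
    std::cin >> a;
    std::cout << "Enter another: ";
    std::cin >> b;

    // Bind the run-time operands; each result is a nullary callable that a
    // generic timer can invoke repeatedly.
    boost::function<double()> f_mul = boost::bind(&multiply, a, 1.0 / b);
    boost::function<double()> f_div = boost::bind(&divide,   a, b);

    // A real harness would do something like timer.compare(f_mul, f_div)
    // (a hypothetical call); here we only show that the callables agree.
    std::cout << "f_mul() = " << f_mul() << ", f_div() = " << f_div() << "\n";
    return 0;
}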

On 18 Jul 2009, at 01:08, Patrick Mihelich wrote:
I am interested. Perhaps you can upload it to the Boost Vault? Is there any documentation (even minimal)?
Cheers, Patrick
Certainly... It's pretty large and rambling at the moment - but half the point of getting other people to look through it is to have to justify everything and pare it down to the essentials.
On another note, I haven't tested it on anything else. Perhaps I have inadvertently done something that only works on my platform! I guess we'll soon see.
I've found out where 'Boost Vault' is and will prepare it for upload.
-ed
participants (6)
- Edward Grace
- joel
- John Phillips
- Mathias Gaunard
- Matthias Troyer
- Patrick Mihelich