
Paul, [Excuse the following - perhaps it's an attack of logorrhea.]
I am also encouraged by what Edward's timer might add, though I'm a little wary of the possible inclusion of code under FFTW's license, as it may be incompatible with Boost.
I'm pretty sure the FFTW license is Boost-compatible (judging by the comments in the header and the BSD-like adoption of FFTW in many commercial and open-source projects); however, my aim is for a robust performance framework that is chronometer agnostic - there's no need to use FFTW's timer. The getticks() macro just happens to be a handy cross-platform implementation! There are other concerns that I need to address regarding the licensing and provenance of this code.

For now I am simply keen to see whether it works for other people and in other real-world situations - hence the application to the boost::spirit / xpressive debate. After all, you can read books on nonparametric statistical inference until you are paralysed into inaction - the only real way to know whether it works is to try it, get other people involved, and apply it in real-world situations. From that I'm sure I will learn a lot and then be in a position to weed out unnecessary cruft (e.g. bootstrap resampling) and add useful stuff (potentially the cache-related code mentioned earlier). After that it's a case of reviewing everything that's left after the weeding and making sure it's appropriate.
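To make "chronometer agnostic" concrete, something along these lines is what I have in mind - a sketch only; the Clock policy below is purely illustrative, not the framework's actual interface:

#include <cstddef>
#include <ctime>
#include <vector>

// Sketch of a chronometer-agnostic harness: any Clock with a static
// now() and elapsed() will do - FFTW's getticks(), a platform-specific
// counter, or plain std::clock() as below.
template <typename Clock, typename F>
std::vector<typename Clock::duration_type>
sample(F f, std::size_t n)
{
    std::vector<typename Clock::duration_type> results;
    results.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
    {
        typename Clock::time_type start = Clock::now();
        f();                                  // function under test
        results.push_back(Clock::elapsed(start, Clock::now()));
    }
    return results;
}

// One possible Clock model, built on std::clock():
struct clock_ticks
{
    typedef std::clock_t time_type;
    typedef std::clock_t duration_type;
    static time_type now() { return std::clock(); }
    static duration_type elapsed(time_type a, time_type b)
    { return b - a; }
};

// Usage: std::vector<std::clock_t> t = sample<clock_ticks>(my_func, 1000);

Swapping in getticks() (or anything else) is then a matter of supplying a different Clock, not of changing the framework.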
Edward, I do wonder about the statistical significance metrics you provide (great idea, by the way). I'm wondering whether your code assumes a normal timing-duration distribution and, if so, whether it can 'measure' if the distribution of results conforms to this assumption.
It makes few assumptions about any underlying probability density function (not even that it's strongly stationary) since it's based on nonparametric techniques. Key to this is that, while the underlying distribution may not be normal or even known, the distribution of certain statistics estimated from a sample *is* well known. For example, the median is asymptotically normal (with certain constraints on the PDF). In other words, if you take a large sample and calculate the median - then repeat this ad nauseam - the resulting medians will be approximately normally distributed. For a concrete example, this is put to use in the calculation of the confidence bounds via the Wilcoxon signed-rank test, for which the W+ statistic has a well-known distribution. We can therefore determine whether one function is faster than another in a (hopefully) statistically significant way.
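A toy simulation (illustrative only - none of this is the framework's code) makes the point: draw from a grossly skewed distribution, the exponential, whose true median is ln 2, and look at how the sample medians behave:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>

// Skewed, decidedly non-normal source: Exp(1) via inverse transform.
double draw_exponential()
{
    double u = (std::rand() + 1.0) / (RAND_MAX + 2.0);  // u in (0,1)
    return -std::log(u);
}

double sample_median(std::size_t n)   // median of n fresh draws
{
    std::vector<double> x(n);
    for (std::size_t i = 0; i != n; ++i)
        x[i] = draw_exponential();
    std::nth_element(x.begin(), x.begin() + n / 2, x.end());
    return x[n / 2];
}

int main()
{
    std::vector<double> medians;
    for (int i = 0; i != 1000; ++i)
        medians.push_back(sample_median(501));
    // A histogram of 'medians' is approximately normal, centred near
    // log(2) ~ 0.693, despite the gross asymmetry of the parent
    // distribution.
    double sum = 0.0;
    for (std::size_t i = 0; i != medians.size(); ++i)
        sum += medians[i];
    std::cout << "mean of medians: " << sum / medians.size() << '\n';
}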
In my experience, timing benchmarks can suffer from having outliers
I don't view these as 'outliers' in the usual sense of 'measurement error'. They are measurements that are just as valid as any other; one just has to ask "what is this measuring - the function under test or the OS?". Often it's the latter, but there is (as I see it) no clear borderline for classifying any given measurement as one and not the other. Consequently I view 'throwing away' (trimming) measurements with great suspicion.
(usually OS-induced by not having an RTOS)
RTOS? Real-time OS?? Do these eliminate this effect?
that make the distribution less than 'normal'.
The distribution would not be normal even in principle: it must have a well-defined minimum but (in principle) no upper bound, and that gross lack of symmetry rules out normality in a heartbeat!
Would it be possible to establish that the given sample set of timings corresponds to a normal distribution (or perhaps to discard a certain percentage of outliers if necessary)?
It is possible to test whether the sample corresponds to a known distribution - the Kolmogorov-Smirnov test does exactly that. However, which distribution should we check against? There is no fundamental requirement for the measurements to belong to any family, even vaguely; in fact, the distribution may not even be smooth!
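For reference, the one-sample Kolmogorov-Smirnov statistic itself is simple enough to sketch (again, illustrative only; turning D into a p-value needs the Kolmogorov distribution, which I omit):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One-sample Kolmogorov-Smirnov statistic: the largest vertical gap
// between the empirical CDF of the sample and a hypothesised CDF.
double ks_statistic(std::vector<double> x, double (*cdf)(double))
{
    std::sort(x.begin(), x.end());
    const std::size_t n = x.size();
    double d = 0.0;
    for (std::size_t i = 0; i != n; ++i)
    {
        double f = cdf(x[i]);
        // The empirical CDF jumps from i/n to (i+1)/n at x[i].
        double lo = std::fabs(f - static_cast<double>(i) / n);
        double hi = std::fabs(static_cast<double>(i + 1) / n - f);
        d = std::max(d, std::max(lo, hi));
    }
    return d;
}

The catch, of course, is the cdf argument: as I say above, we have no principled candidate distribution to pass in.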
I'm no statistics person, but I have seen cases where 10,000 timing samples have been biased by 10 samples that probably relate to a program issue or a severe OS issue - e.g. normally 1 ms per iteration, with 10 samples at 1-10 s each.
Yes - that issue has a dramatic impact on the robustness (or lack thereof) of the mean. Often, of course, these things are not even vaguely IID (independent and identically distributed) - burst-like behaviour can lead to strongly non-stationary behaviour. For example, some garbage collector might go nuts and skew your timing. That's why, in my view, it's so important to compare relative timings and to obtain confidence bounds.

Pragmatically, one might argue that when you're doing these things at the console during development and making decisions, it's not really a major issue. One can look at the numbers, say "hmm, fishy - I don't trust that", and tinker around until the result is in line with your expectation. That's part of the art that is experimental science, and many of us do it every day without even realising. How to codify that judgement into something autonomous, capable of e.g. selecting the 'correct' function in a high-load multi-user environment in the real world, is far from clear, however. That is my goal.
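To show what I mean by comparing relative timings, here is a rough sketch (illustrative, not the framework's actual code) of the W+ statistic for paired timings of two candidate functions:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Wilcoxon signed-rank W+ for paired timings a[i], b[i] (same trial,
// two functions). Zero differences are dropped; tied |differences|
// receive the average of their ranks. n_out is the effective n.
double wilcoxon_w_plus(const std::vector<double>& a,
                       const std::vector<double>& b,
                       std::size_t& n_out)
{
    std::vector<std::pair<double, bool> > d;   // (|diff|, diff > 0)
    for (std::size_t i = 0; i != a.size(); ++i)
    {
        double diff = a[i] - b[i];
        if (diff != 0.0)
            d.push_back(std::make_pair(std::fabs(diff), diff > 0.0));
    }
    std::sort(d.begin(), d.end());
    double w_plus = 0.0;
    for (std::size_t i = 0; i != d.size(); )
    {
        std::size_t j = i;                     // extend over ties
        while (j != d.size() && d[j].first == d[i].first) ++j;
        double avg_rank = (i + 1 + j) / 2.0;   // ranks are 1-based
        for (std::size_t k = i; k != j; ++k)
            if (d[k].second) w_plus += avg_rank;
        i = j;
    }
    n_out = d.size();
    return w_plus;
}

Under the null hypothesis of no difference, W+ is approximately normal with mean n(n+1)/4 and variance n(n+1)(2n+1)/24 for large n, which is what lets us attach a significance level to "a is slower than b".

-ed

------------------------------------------------
"No more boom and bust." -- Dr. J. G. Brown, 1997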