
Paul, [Excuse the following - perhaps it's an attack of logorrhea.]
I am also encouraged by what Edward's timer might add, though I'm a little wary of the possible inclusion of code under FFTW's license, as it may be incompatible with Boost.
I'm pretty sure the FFTW license is Boost-compatible (judging by the comments in the header and the BSD-like adoption of FFTW in many commercial and open-source projects); however, my aim is for a robust performance framework that is chronometer agnostic - there's no need to use FFTW's timer. The getticks() macro just happens to be a handy cross-platform implementation! There are other concerns that I need to address regarding the licensing and provenance of this code.

For now I am simply keen to see whether it works for other people and in other real-world situations - hence the application to the boost::spirit / xpressive debate. After all, you can read books on nonparametric statistical inference until you are paralysed into inaction - the only real way to know whether it works is to try it, get other people involved, and apply it in real-world situations. From that I'm sure I will learn a lot and then be in a position to weed out unnecessary cruft (e.g. bootstrap resampling) and add useful stuff (potentially the cache-related code mentioned earlier). After that it's a case of reviewing everything that's left after the weeding and making sure it's appropriate.
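To make "chronometer agnostic" concrete, something along these lines is what I have in mind - a sketch only; the Clock policy below is purely illustrative, not the framework's actual interface:

#include <cstddef>
#include <ctime>
#include <vector>

// Sketch of a chronometer-agnostic harness: any Clock with a static
// now() and elapsed() will do - FFTW's getticks(), a platform-specific
// counter, or plain std::clock() as below.
template <typename Clock, typename F>
std::vector<typename Clock::duration_type>
sample(F f, std::size_t n)
{
    std::vector<typename Clock::duration_type> results;
    results.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
    {
        typename Clock::time_type start = Clock::now();
        f();                                  // function under test
        results.push_back(Clock::elapsed(start, Clock::now()));
    }
    return results;
}

// One possible Clock model, built on std::clock():
struct clock_ticks
{
    typedef std::clock_t time_type;
    typedef std::clock_t duration_type;
    static time_type now() { return std::clock(); }
    static duration_type elapsed(time_type a, time_type b)
    { return b - a; }
};

// Usage: std::vector<std::clock_t> t = sample<clock_ticks>(my_func, 1000);

Swapping in getticks() (or anything else) is then a matter of supplying a different Clock, not of changing the framework.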
Edward, I do wonder about the statistical significance metrics you provide (great idea, by the way). I'm wondering whether your code assumes a normal timing-duration distribution and, if so, whether it can 'measure' if the distribution of results conforms to this assumption.
It makes few assumptions about any underlying probability density function (not even that it's strongly stationary) since it's based on nonparametric techniques. Key to this is that, while the underlying distribution may not be normal or even known, the distribution of certain statistics estimated from a sample *is* well known. For example, the median is asymptotically normal (with certain constraints on the PDF). In other words, if you take a large sample and calculate the median - then repeat this ad nauseam - the resulting medians will be approximately normally distributed. For a concrete example, this is put to use in the calculation of the confidence bounds via the Wilcoxon signed-rank test, for which the W+ statistic has a well-known distribution. We can therefore determine whether one function is faster than another in a (hopefully) statistically significant way.
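A toy simulation (illustrative only - none of this is the framework's code) makes the point: draw from a grossly skewed distribution, the exponential, whose true median is ln 2, and look at how the sample medians behave:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>

// Skewed, decidedly non-normal source: Exp(1) via inverse transform.
double draw_exponential()
{
    double u = (std::rand() + 1.0) / (RAND_MAX + 2.0);  // u in (0,1)
    return -std::log(u);
}

double sample_median(std::size_t n)   // median of n fresh draws
{
    std::vector<double> x(n);
    for (std::size_t i = 0; i != n; ++i)
        x[i] = draw_exponential();
    std::nth_element(x.begin(), x.begin() + n / 2, x.end());
    return x[n / 2];
}

int main()
{
    std::vector<double> medians;
    for (int i = 0; i != 1000; ++i)
        medians.push_back(sample_median(501));
    // A histogram of 'medians' is approximately normal, centred near
    // log(2) ~ 0.693, despite the gross asymmetry of the parent
    // distribution.
    double sum = 0.0;
    for (std::size_t i = 0; i != medians.size(); ++i)
        sum += medians[i];
    std::cout << "mean of medians: " << sum / medians.size() << '\n';
}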
In my experience, timing benchmarks can suffer from having outliers
I don't view these as 'outliers' in the usual sense of 'measurement error'. They are measurements that are just as valid as any other; one just has to ask "what is this measuring - the function under test or the OS?". Often it's the latter, but there is (as I see it) no clear borderline for classifying any given measurement as one and not the other. Consequently I view 'throwing away' (trimming) measurements with great suspicion.
(usually OS-induced by not having an RTOS)
RTOS? Real-time OS?? Do these eliminate this effect?
that make the distribution less than 'normal'.
The distribution would not be normal even in principle: it must have a well-defined minimum but (in principle) no upper bound, and that gross lack of symmetry rules out normality in a heartbeat!
Would it be possible to establish that the given sample set of timings corresponds to a normal distribution (or perhaps to discard a certain percentage of outliers if necessary)?
It is possible to test whether the sample corresponds to a known distribution - the Kolmogorov-Smirnov test does exactly that. However, which distribution should we check against? There is no fundamental requirement for the measurements to belong to any family, even vaguely; in fact, the distribution may not even be smooth!
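For reference, the one-sample Kolmogorov-Smirnov statistic itself is simple enough to sketch (again, illustrative only; turning D into a p-value needs the Kolmogorov distribution, which I omit):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One-sample Kolmogorov-Smirnov statistic: the largest vertical gap
// between the empirical CDF of the sample and a hypothesised CDF.
double ks_statistic(std::vector<double> x, double (*cdf)(double))
{
    std::sort(x.begin(), x.end());
    const std::size_t n = x.size();
    double d = 0.0;
    for (std::size_t i = 0; i != n; ++i)
    {
        double f = cdf(x[i]);
        // The empirical CDF jumps from i/n to (i+1)/n at x[i].
        double lo = std::fabs(f - static_cast<double>(i) / n);
        double hi = std::fabs(static_cast<double>(i + 1) / n - f);
        d = std::max(d, std::max(lo, hi));
    }
    return d;
}

The catch, of course, is the cdf argument: as I say above, we have no principled candidate distribution to pass in.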
I'm no statistics person, but I have seen cases where 10,000 timing samples have been biased by 10 samples that probably relate to a program issue or a severe OS issue - e.g. normally 1 ms per iteration, with 10 samples at 1-10 s each.
Yes - that issue has a dramatic impact on the robustness (or lack thereof) of the mean. Often, of course, these things are not even vaguely IID (independent and identically distributed) - burst-like behaviour can lead to strongly non-stationary behaviour. For example, some garbage collector might go nuts and skew your timing. That's why, in my view, it's so important to compare relative timings and to obtain confidence bounds.

Pragmatically, one might argue that when you're doing these things at the console during development and making decisions, it's not really a major issue. One can look at the numbers, say "hmm, fishy - I don't trust that", and tinker around until the result is in line with your expectation. That's part of the art that is experimental science, and many of us do it every day without even realising. How to codify that judgement into something autonomous, capable of e.g. selecting the 'correct' function in a high-load multi-user environment in the real world, is far from clear, however. That is my goal.
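To show what I mean by comparing relative timings, here is a rough sketch (illustrative, not the framework's actual code) of the W+ statistic for paired timings of two candidate functions:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Wilcoxon signed-rank W+ for paired timings a[i], b[i] (same trial,
// two functions). Zero differences are dropped; tied |differences|
// receive the average of their ranks. n_out is the effective n.
double wilcoxon_w_plus(const std::vector<double>& a,
                       const std::vector<double>& b,
                       std::size_t& n_out)
{
    std::vector<std::pair<double, bool> > d;   // (|diff|, diff > 0)
    for (std::size_t i = 0; i != a.size(); ++i)
    {
        double diff = a[i] - b[i];
        if (diff != 0.0)
            d.push_back(std::make_pair(std::fabs(diff), diff > 0.0));
    }
    std::sort(d.begin(), d.end());
    double w_plus = 0.0;
    for (std::size_t i = 0; i != d.size(); )
    {
        std::size_t j = i;                     // extend over ties
        while (j != d.size() && d[j].first == d[i].first) ++j;
        double avg_rank = (i + 1 + j) / 2.0;   // ranks are 1-based
        for (std::size_t k = i; k != j; ++k)
            if (d[k].second) w_plus += avg_rank;
        i = j;
    }
    n_out = d.size();
    return w_plus;
}

Under the null hypothesis of no difference, W+ is approximately normal with mean n(n+1)/4 and variance n(n+1)(2n+1)/24 for large n, which is what lets us attach a significance level to "a is slower than b".

-ed

------------------------------------------------
"No more boom and bust." -- Dr. J. G. Brown, 1997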