
[ Warning to all - the following is a long, complex response - don't bother reading unless you are *really* interested in the pedantic nuances of timing stuff! It may cause your eyes to glaze over and a loss of consciousness. Apologies also for clogging up inboxes with pngs - be thankful for the mailing sentry bouncing this thing out until it got short enough! It just didn't read well with hyperlinked pictures. ]

On 30 Jul 2009, at 23:21, Simonson, Lucanus J wrote:
Edward Grace wrote:
Personally I perform several runs of the same benchmark and then take the minimum time as the time I report.
Be very cautious about that for a number of reasons.
a) By reporting the minimum, without qualifying your statement, you are lying. For example, if you say "My code takes 10.112 seconds to run." people are unlikely ever to observe that time - it is, by construction, atypical.
I try to be explicit about explaining what metric I'm reporting. If I'm reporting the minimum of ten trials I report it as the minimum of ten trials.
Good show! Re-reading my comments, they look a bit adversarial - they're not intended to be! I'm not accusing you of falsehood or dodgy dealings! Apologies if that was the apparent tone!
Taking the minimum of ten trials and comparing it to the minimum of twenty trials without qualifying it would be even more dishonest because it isn't fair. I compare the same number of trials, report that it is the minimum, and report the number of trials.
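To see just how unfair that comparison would be, here is a quick, purely illustrative simulation (the lognormal 'runtime' distribution and all the names are invented for the example and are not anyone's real benchmark): the minimum of twenty draws from the same distribution sits systematically below the minimum of ten draws, even though nothing about the underlying 'code' has changed.

  // min_bias_demo.cpp - illustrative only; the "runtime" distribution is made up.
  #include <algorithm>
  #include <iostream>
  #include <random>

  // Return the minimum of n draws from a right-skewed "runtime" distribution.
  double min_of_n(std::mt19937& gen, std::lognormal_distribution<double>& runtime, int n)
  {
      double m = runtime(gen);
      for (int i = 1; i < n; ++i)
          m = std::min(m, runtime(gen));
      return m;
  }

  int main()
  {
      std::mt19937 gen(42);
      // A lognormal is a crude stand-in for measured runtimes (long right tail).
      std::lognormal_distribution<double> runtime(0.0, 0.25);

      const int experiments = 100000;
      double sum10 = 0.0, sum20 = 0.0;
      for (int i = 0; i < experiments; ++i) {
          sum10 += min_of_n(gen, runtime, 10);
          sum20 += min_of_n(gen, runtime, 20);
      }
      std::cout << "mean of (min of 10 trials): " << sum10 / experiments << '\n'
                << "mean of (min of 20 trials): " << sum20 / experiments << '\n';
      // The second figure is systematically smaller - same 'code', different
      // number of trials - so the comparison is biased by construction.
  }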
b) If ever you do this (use the minimum or maximum) then you are unwittingly sampling an extreme value distribution:
http://en.wikipedia.org/wiki/Extreme_value_distribution
For example - say you 'know' that the measured times are normally distributed with mean (and median) zero and a variance of 10. If you
[...]
scenario. Notice that the histogram of the observed minimum is *far* broader than the observed median.
In this particular case of runtime the distribution is usually not normal.
Of course, hence the quotes around 'know'. My demonstration of the effect of taking the minimum is based on the normal distribution since everyone knows it and it's assumed to be 'nice'. If you repeat what I did with another distribution (e.g. uniform or Pareto), both of which have a well-defined minimum, you will still see the same effect! In fact, you are likely to observe even more extreme behaviour! I did not discuss more realistic distributions since it would cause people's eyes to glaze over even faster than this will!
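If you want to see the effect without any timing machinery at all, here is a small standalone sketch (my own illustration, not the code behind the plots discussed below): draw batches of samples from exactly that normal distribution - mean zero, variance 10 - record the minimum and the median of each batch, and compare how the two scatter from batch to batch.

  // extreme_value_demo.cpp - illustrative sketch of min vs. median dispersion.
  #include <algorithm>
  #include <cmath>
  #include <iostream>
  #include <random>
  #include <vector>

  // Sample standard deviation of a set of observations.
  double stddev(const std::vector<double>& v)
  {
      double mean = 0.0;
      for (double x : v) mean += x;
      mean /= v.size();
      double ss = 0.0;
      for (double x : v) ss += (x - mean) * (x - mean);
      return std::sqrt(ss / (v.size() - 1));
  }

  int main()
  {
      std::mt19937 gen(1234);
      // Mean (and median) zero, variance 10, as in the example above.
      std::normal_distribution<double> dist(0.0, std::sqrt(10.0));

      const int batches = 5000;     // number of 'measurements'
      const int batch_size = 100;   // samples per measurement

      std::vector<double> mins, medians;
      for (int b = 0; b < batches; ++b) {
          std::vector<double> x(batch_size);
          for (double& xi : x) xi = dist(gen);
          std::sort(x.begin(), x.end());
          mins.push_back(x.front());  // the extreme value statistic
          medians.push_back(0.5 * (x[batch_size / 2 - 1] + x[batch_size / 2]));
      }

      std::cout << "spread (std. dev.) of the observed minimum: " << stddev(mins)    << '\n'
                << "spread (std. dev.) of the observed median:  " << stddev(medians) << '\n';
      // The minimum is both biased (well below zero) and noticeably more
      // dispersed than the median - the extreme value effect in action.
  }

The same experiment can be repeated with a uniform or Pareto distribution simply by swapping the distribution object.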
There is some theoretical fastest speed the program could go if nothing like the OS or the execution of other programs on the same machine causes it to run slower. This is the number we actually want. We can't measure it, but we observe that there is a "floor" under our data, and the more times we run the clearer it becomes where that floor is.
Sure. This is partially a philosophical point regarding what one is trying to measure. In your case, if you are trying to determine the fastest speed that the code could possibly go (and you don't have a real-time OS), then the minimum observed time looks like a good idea. I don't object to the use of the minimum (or any other estimator) per se, at least not yet. What one wishes to measure should determine the choice of estimator. For example, I do not want to know the minimum possible time, as that is not indicative of the typical time. I want to compare the typical execution times of two codes *with* the OS taken into account since, in the real world, this is what a user will notice and therefore what matters to me. Paraphrasing "Hamlet", 'To median or not to median - that is not quite the question.'

My point is far more subtle than I think you have appreciated. The crux is that, whatever your estimator, you need to understand how it's distributed if you want to infer anything from it. So, if you *are* going to use the minimum (which is not robust) then you need to be aware that you may be inadvertently bitten by extreme value theory - consequently you should consider the form of the dispersion - it sure as heck won't be normal! This may seem bizarre - these things often are!

To give you a concrete example, consider the following code (this also contains PDFs of the plots):

  http://tinyurl.com/n6kfe7

Here, we crudely time a simple function using a given number of iterations in a fairly typical manner. We repeat this timing 4001 times and then report the minimum and median of these 4001 acquisitions. This constitutes 'a measurement'. Each experimental 'measurement' is carried out for a randomly chosen number of iterations of the main timing loop and is hopefully independent. This is to try and satisfy the first i in iid (independent and identically distributed); a rough sketch of the acquisition scheme in code appears further down. For those of you who are bash oriented, I do:

  ./ejg-make-input-for-empirical-ev > inputs.dat
  rm test.dat; while read line; do echo $line; echo $line | ./ejg-empirical-ev >> test.dat; done < inputs.dat

No attempt is made to 'sanitise' the computer, nor do I deliberately load it. The raw results are shown below.

The most obvious observation is that, as the number of iterations increases, the times both appear to converge in an asymptotic-like manner towards some 'true' value. This is clearly the effect of some overhead (e.g. the time taken to call the clock) having progressively less impact on the overall time. Secondly, the minimum observed time is lower than the median (as expected). What you may find a little surprising is that the *variation* of the minimum observed time is far greater than the variation in the median observed time. This is the key to my warning!

If we crudely account for the overhead of the clock (which of course will also be stochastic - but let's ignore that) and subtract it, then this phenomenon becomes even more stark. The variation of the observed times is now far clearer. Also, notice that rogue-looking red point in the bottom left of the main plot - for small numbers of iterations the dispersion of the minimum is massive compared to the dispersion of the median. You will also notice that there's now little difference between the predominant value of the median and minimum in this case.

Now consider the inset: the second obvious characteristic is that the times do not appear to fluctuate entirely randomly with iteration number - instead there is obvious structure - indeed a sudden jump in behaviour around n=60.
This is likely to be a strong interference effect between the clock and the function that's being measured, and other architecture-dependent issues (e.g. cache) that I won't pretend to fully understand. It's also indicative of how to interpret the observed variation - it's not necessarily purely random, but random + pseudorandom. Like std::rand(), if you keep the same seed (number of iterations) you may well observe the same result. That does not, however, mean that your true variance is zero - it's just hidden!

Looking at the zoom in the inset, it's clear to see that the median time has a generally far lower dispersion than the minimum time (by around an order of magnitude). There are even situations where the minimum time is clearly greater than the median time - directly counter to your expectation. This is not universally the case, but the general behaviour is perfectly clear. This can be considered to be a direct result of extreme value theory. If you use the minimum (or a similar quantile) to estimate something about a random variable, the result may have a large dispersion (variance) compared to the median. Just because you haven't seen it (changed the seed in your random number generator) doesn't mean it's not there!

The practical amongst you may well say "Hey - who cares, we're talking about tiny differences!" I agree! So, is the moral of this tale to avoid using the minimum and instead to use the median? Not at all. I'm agnostic on this issue for the reasons I mentioned above; however, when/if one uses the minimum one needs to be careful and be aware of the (hidden) assumptions! For example, as various aspects of the timing change one may observe a sudden change in the cdf; 'magic' in the cache could perhaps cause the median measure to change far more than the minimum measure - so even though the median may have better nominal precision, the minimum may well have better nominal accuracy. If you look in the code you'll see evidence that I've mucked around with different quantiles - I have not closed my mind to using the minimum!
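For those who prefer code to prose, the following is a minimal standalone sketch of the kind of acquisition scheme described above. The timed function, the clock, the quantile helper and the constants are purely illustrative choices of mine - this is not the ejg-empirical-ev source - but it shows the shape of the procedure: time n iterations, repeat the acquisition many times, then pull out whichever quantile (minimum, median, ...) you fancy.

  // timing_sketch.cpp - an illustrative sketch only; not the real ejg code.
  #include <algorithm>
  #include <chrono>
  #include <iostream>
  #include <vector>

  volatile double sink = 0.0;          // stops the optimiser removing work()

  // Some cheap function whose typical runtime we would like to estimate.
  void work()
  {
      double s = 0.0;
      for (int i = 0; i < 100; ++i) s += i * 0.5;
      sink = s;
  }

  // One 'acquisition': seconds per call, measured over n iterations so that
  // the clock overhead is amortised (but never eliminated).
  double time_per_call(int n)
  {
      const auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < n; ++i) work();
      const auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double>(t1 - t0).count() / n;
  }

  // Nearest-rank q-quantile of the samples; q = 0 is the minimum,
  // q = 0.5 the median. Takes a copy so the caller's data is untouched.
  double quantile(std::vector<double> samples, double q)
  {
      const std::size_t k = static_cast<std::size_t>(q * (samples.size() - 1));
      std::nth_element(samples.begin(), samples.begin() + k, samples.end());
      return samples[k];
  }

  int main()
  {
      const int iterations   = 64;    // varied at random in the real experiment
      const int acquisitions = 4001;  // repeated acquisitions per 'measurement'

      std::vector<double> t(acquisitions);
      for (double& ti : t) ti = time_per_call(iterations);

      std::cout << "minimum: " << quantile(t, 0.0) << " s/call\n"
                << "median:  " << quantile(t, 0.5) << " s/call\n";
  }

Running this 'measurement' repeatedly for randomly chosen iteration counts, and looking at how the reported minimum and median scatter, is essentially what produced the behaviour discussed above.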
It just won't run faster than that no matter how often we run.
Have you tried varying the number of iterations? You'll find that for some iteration lengths it may go significantly faster - doubtless this is a machine architecture issue, I don't know why - but it happens.
The idea is that using the minimum as the metric eliminates most of the "noise" introduced into the data by stuff unrelated to what you are trying to measure. Taking the median is just taking the median of the noise.
Not necessarily - as you can see from the above plots. Just because your estimator is repeatable for a given number of iterations doesn't mean it's 'ok' or eliminated the 'noise' - it's far more subtle than that.
In an ideal compute environment your runtimes would be so similar from run to run you wouldn't bother to run more than once. If there is noise in your data set your statistic should be designed to eliminate the noise, not characterize it.
If you want to try and eliminate it, you need to characterise it. This is why relative measurements are so important if you desire precision - the noise is always there, no matter what you do, so having 'well behaved' noise (e.g. asymptotically normal) means you can infer things in a meaningful manner when you take differences. If you have no idea what the form of your noise is - or think you've eliminated it - you can get repeatable results, but they may be a subtle distortion of the truth. By taking the minimum you are not necessarily avoiding the 'noise'; you may think you are - indeed it may look like you are - but you could have just done a good job of hiding it!
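To make the 'relative measurement' idea concrete, here is a rough sketch (entirely my own illustration, not code from any benchmark discussed in this thread): interleave the timings of the two codes under comparison and look at the distribution of the per-pair relative difference, so that noise which hits both codes alike largely cancels.

  // relative_timing_sketch.cpp - an illustrative take on 'relative' measurement.
  #include <algorithm>
  #include <chrono>
  #include <iostream>
  #include <vector>

  volatile double sink = 0.0;  // stops the optimiser removing the loops
  void code_a() { double s = 0.0; for (int i = 0; i < 1000; ++i) s += i * 0.50; sink = s; }
  void code_b() { double s = 0.0; for (int i = 0; i < 1100; ++i) s += i * 0.25; sink = s; }

  // Seconds per call for f, averaged over n back-to-back calls.
  template <typename F>
  double time_once(F f, int n = 100)
  {
      const auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < n; ++i) f();
      const auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double>(t1 - t0).count() / n;
  }

  int main()
  {
      const int pairs = 1001;
      std::vector<double> rel(pairs);

      // Interleave the two measurements so that slow patches of the machine
      // (OS noise, frequency scaling, ...) tend to hit both codes alike.
      for (int i = 0; i < pairs; ++i) {
          const double ta = time_once(code_a);
          const double tb = time_once(code_b);
          rel[i] = (tb - ta) / ta;   // fractional slowdown of B relative to A
      }

      std::sort(rel.begin(), rel.end());
      std::cout << "median relative difference (B vs A): "
                << 100.0 * rel[pairs / 2] << " %\n";
  }

One can of course summarise the rel[] distribution with whatever estimator one trusts - the point is that the differencing happens before the estimator is applied.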
In my recent boostcon presentation my benchmarks were all for a single trial of each algorithm for each benchmark because noise would be eliminated during the curve fitting of the data for analysis of how well the algorithms scaled (if not by the conversion to logarithmic axis....)
As an aside, you may find least absolute deviation (LAD) regression more appropriate than ordinary least squares (I'm assuming you used OLS, of course). For practical application what you've done is almost certainly fine - 'good enough for government work', as the saying goes - I'm not suggesting it isn't. After all, as I have already commented, when human experimenters are involved we can eyeball the results and see when things look fishy or don't behave as expected.

Ultimately, the take-away message is that apparently axiomatic 'truths' are not always as true as one might think. A dark art indeed! ;-)

-ed

------------------------------------------------
"No more boom and bust." -- Dr. J. G. Brown, 1997