
Again this is a long one, but I feel it's worth discussing these things in some detail - only by truly understanding the problem can we try to solve it. (Something funny is going on regarding my access to this list - I keep getting things in strange order and with the wrong subject line. Anyhow - to add my oar into this discussion.)
Well, I normally use a tool like vtune running the code on real data. Do you think such a tool is unreliable?
VTune is a sweet piece of kit, as are other profilers. I don't think they are unreliable - one just has to understand what they do and don't do, and their limitations. For example, consider the 'fast' and 'slow' functions in my code, which nominally differ in run-time by around 0.5%. Do you think the 'fire and forget' operation of a profiler would *reliably* detect this difference? [This is not a rhetorical question - I don't know and am genuinely curious.] It appears that my benchmarking does. Obviously this is a contrived example - no one is likely to care about 0.5% differences.

Similarly, when you see that function 'abracadabra' takes 14.3323% of the run-time - how reliable is that number? When you compare it to another function that takes 14.5231%, is it actually faster? Are you sure? What happens if you profile again - do you come to the same conclusion?

From my point of view profilers are not much use here, as they are for the off-line optimising of code. My original goal, as I have remarked before (somewhat obliquely), is to select the 'best' function or parameters dynamically at run-time. In other words, given a set of functions (algorithms) which nominally do the same thing, I want to be able to do something like this (I may be syntactically clumsy here):

  std::set<unary_function> sf;
  sf.insert(function_a);
  sf.insert(function_b);
  sf.insert(function_c);
  ...
  function_pointer f = get_fastest_function(sf, test_argument);
  ...
  f(argument); // <--- This is now the fastest implementation of whatever we want.

Here, the relative speeds of the different functions could easily differ by 100%, and could vary with the size of the data, such that it may not be possible to know a priori which is quickest. Likewise the code may be run in a heterogeneous environment, so that the knowledge about which function/parameter combination works best on node "A" is exactly wrong for node "B". This may sound like a niche requirement - it is!
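To make that concrete, here is a rough sketch of the sort of selection I have in mind. The name get_fastest_function and the single-shot timing are purely illustrative assumptions on my part; a real version would repeat each call and apply the kind of statistics discussed further down.

  #include <chrono>
  #include <vector>

  // Illustrative sketch only: time each candidate once on a representative
  // argument and hand back the one with the smallest measured duration.
  template <typename FuncPtr, typename Arg>
  FuncPtr get_fastest_function(const std::vector<FuncPtr>& candidates,
                               const Arg& test_argument)
  {
      FuncPtr best = nullptr;
      auto best_time = std::chrono::steady_clock::duration::max();
      for (FuncPtr f : candidates)
      {
          const auto start = std::chrono::steady_clock::now();
          f(test_argument);  // result discarded; only the elapsed time matters here
          const auto elapsed = std::chrono::steady_clock::now() - start;
          if (elapsed < best_time) { best_time = elapsed; best = f; }
      }
      return best;
  }

  // Usage, roughly as in the snippet above (types invented for illustration):
  //   std::vector<void (*)(const data_t&)> sf = { function_a, function_b, function_c };
  //   auto f = get_fastest_function(sf, test_argument);
  //   f(argument);  // hopefully the fastest implementation for this node and data size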
VTune is for tuning application performance. VTune is too big a hammer for measuring benchmark runtime. Benchmarks are about wall clock time of a piece of code as a metric for performance of something else. How you measure that time and how you report those measurements is a problem that is prone to error.
Ok, I should have read the whole email before writing the above monologue -- agreed. Though I'd caution against bare 'wall clock' times in favour of relative times (sometimes, of course, you may *have* to measure the wall-clock time).
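By 'relative' I mean something along these lines - time_once here is a made-up helper, and a single call is of course far too naive on its own:

  #include <chrono>

  // Hypothetical helper: seconds taken by one call of f.
  // (A real harness would repeat the call many times and summarise.)
  template <typename F>
  double time_once(F f)
  {
      const auto t0 = std::chrono::steady_clock::now();
      f();
      const auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double>(t1 - t0).count();
  }

  // Report "B takes 1.05x the time of A" rather than "B takes 10.112 s";
  // the ratio travels between machines and loads far better than the bare number.
  //   double ratio = time_once(run_b) / time_once(run_a);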
Personally I perform several runs of the same benchmark and then take the minimum time as the time I report.
Be very cautious about that, for a number of reasons.

a) By reporting the minimum, without qualifying your statement, you are lying. For example, if you say "My code takes 10.112 seconds to run", people are unlikely to ever observe that time - it is by construction atypical.

b) If you ever do this (use the minimum or maximum) then you are unwittingly sampling an extreme value distribution: http://en.wikipedia.org/wiki/Extreme_value_distribution

For example, say you 'know' that the measured times are normally distributed with mean (and median) zero and a variance of 10. If you take, say, 100 measurements and then take the minimum - how is that minimum distributed? In other words, if you repeat this again and again, what does the histogram of the minimum look like? What you will find is that the dispersion (variance or otherwise) on the median is small, say \sigma^2 = 0.15, whereas the dispersion on the minimum can be much larger, \sigma^2 = 1.8 - *and* the distribution is highly asymmetrical. If your two times were nominally separated by around 0.5 you would not necessarily be able to reliably determine which was fastest from the minima, since the distributions of the minima could overlap.

Attached are two examples: histograms of the observed minimum for many trials and of the observed median for many trials of the above scenario. Notice that the histogram of the observed minimum is *far* broader than that of the observed median. I suggest you experiment with this - you may be (unpleasantly) surprised; it's somewhat counter-intuitive. An appropriate MATLAB snippet (ok, ok, put away the pitchforks) is below.

  % EV
  clear all;
  v_median = zeros(50000,1); % Lots of trials to get a decent looking histogram.
  v_min = v_median;
  for m = 1:length(v_median)
      v = randn(100,1).*sqrt(10); % Normally distributed, try other distributions.
      v_median(m) = median(v);
      v_min(m) = min(v);
  end
  figure(1); hist(v_median,51);
  figure(2); hist(v_min,51);
This excludes all outliers due to OS and such.
That may well be the aim, but I'm not sure it does get you what you want. This is where understanding of the maths (statistics) needs to be blended with an understanding of the machine. Your use of the minimum could make it *less* possible to infer something from the measurements. On the other hand you might be right - I don't know for sure -- but I am sceptical.
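For what it's worth, the kind of reporting I would trust more is a median plus a measure of spread over the repeated runs, rather than the single fastest run. A minimal sketch, with deliberately crude index-based quartiles (not something I'd put in a paper):

  #include <algorithm>
  #include <vector>

  // Summarise repeated timings by median and rough interquartile range,
  // rather than by the single fastest run.
  struct timing_summary { double median, q1, q3; };

  timing_summary summarise(std::vector<double> seconds)
  {
      std::sort(seconds.begin(), seconds.end());
      const auto n = seconds.size();
      return { seconds[n / 2], seconds[n / 4], seconds[(3 * n) / 4] };
  }

  // "median 10.31 s, IQR 0.2 s" is a claim people can reproduce;
  // the minimum of many runs is, by construction, atypical.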
If a car company reported the average (mean or median) 0 to 60 mph for a given car when their test driver was trying to see how fast it was everyone would think they were crazy.
That's all down to marketing. I imagine they *do* determine it, they just don't report it that way. Most people couldn't care less what these numbers actually mean - just that it's 'better' than the other one. In the infamous words of "Spinal Tap" - "it goes up to eleven!".
Why should the fact that he pushed the gas too hard on some of the trials and spun out and not hard enough on others count against the car?
I imagine they test a number of different cars rather than just the same car a lot of times and that the reported time is some form of median. This would mean that half the cars will go a little faster - half a little slower than this 'typical' car. Almost certainly the testing criteria and their analysis is more sophisticated than that - speculation on my part -- anyone in the car industry?
I also typically don't have the luxury of benchmarking on an unloaded machine, so I have to make sure to do things fairly; for instance, instead of running many trials of A then many trials of B and taking the minimum for each, I run many trials of A then B back to back and take the minimum. That way A and B on average see the same environment over the course of the experiment.
Hopefully. As someone pointed out "It's a dark art"!
If I run A and B in the same process I run A first then B and then B first and then A as a separate run to ensure that the order doesn't impact their performance.
Just to add to the mix, I suggest randomising the order in which you run them (that's what I do); that way you are very unlikely to accidentally synchronise yourself with some other process, e.g. ABBBABAABAB etc...
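A minimal sketch of what I mean by randomising, assuming you already have a timing harness; make_schedule is just an invented name:

  #include <algorithm>
  #include <random>
  #include <vector>

  // Build a shuffled schedule of trials, e.g. "ABBBABAABA..." for two functions,
  // so the trial order cannot stay in lock-step with some other process.
  std::vector<int> make_schedule(int trials_per_function, int n_functions)
  {
      std::vector<int> schedule;
      for (int f = 0; f < n_functions; ++f)
          schedule.insert(schedule.end(), trials_per_function, f);
      std::mt19937 rng{std::random_device{}()};
      std::shuffle(schedule.begin(), schedule.end(), rng);
      return schedule;
  }

  // Then walk the schedule, timing whichever function each entry names,
  // and accumulate the samples per function before summarising.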
Usually benchmarking implies comparison, so the key concern is that the results are fair.
Easy to write, very hard to do! ;-)

Regards,
-ed

------------------------------------------------
"No more boom and bust." -- Dr. J. G. Brown, 1997