
Again this is a long one, but I feel it's worth discussing these things in some detail - only by truly understanding the problem can we try to solve it. (Something funny is going on regarding my access to this list - I keep getting things in strange order and with the wrong subject line. Anyhow - to add my oar into this discussion.)
Well, I normally use a tool like vtune running the code on real data. Do you think such a tool is unreliable?
VTune is a sweet piece of kit, as are other profilers. I don't think they are unreliable - one just has to understand what they do and don't do, and their limitations. For example, consider the 'fast' and 'slow' functions in my code, which nominally differ in run-time by around 0.5%. Do you think the 'fire and forget' operation of a profiler would *reliably* detect this difference? [This is not a rhetorical question - I don't know and am genuinely curious.] It appears that my benchmarking does. Obviously this is a contrived example - no one is likely to care about 0.5% differences.

Similarly, when you see that function 'abracadabra' takes 14.3323% of the run-time - how reliable is that number? When you compare it to another function that takes 14.5231%, is it actually faster? Are you sure? What happens if you profile again - do you come to the same conclusion?

From my point of view profilers are not much use here, as they are for the off-line optimising of code. My original goal, as I have remarked before (somewhat obliquely), is to select the 'best' function or parameters dynamically at run-time. In other words, given a set of functions (algorithms) which nominally do the same thing, I want to be able to do something like this (I may be syntactically clumsy here):

  std::set<unary_function> sf;
  sf.insert(function_a);
  sf.insert(function_b);
  sf.insert(function_c);
  ...
  function_pointer f = get_fastest_function(sf, test_argument);
  ...
  f(argument); // <--- This is now the fastest implementation of whatever we want.

Here, the relative speeds of the different functions could easily differ by 100%, and could vary with the size of the data, such that it may not be possible to know a priori which is quickest. Likewise the code may be run in a heterogeneous environment, so that the knowledge about which function/parameter combination works best on node "A" is exactly wrong for node "B". This may sound like a niche requirement - it is!
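To make that concrete, here is a rough sketch of the sort of selection I have in mind. The name get_fastest_function and the single-shot timing are purely illustrative assumptions on my part; a real version would repeat each call and apply the kind of statistics discussed further down.

  #include <chrono>
  #include <vector>

  // Illustrative sketch only: time each candidate once on a representative
  // argument and hand back the one with the smallest measured duration.
  template <typename FuncPtr, typename Arg>
  FuncPtr get_fastest_function(const std::vector<FuncPtr>& candidates,
                               const Arg& test_argument)
  {
      FuncPtr best = nullptr;
      auto best_time = std::chrono::steady_clock::duration::max();
      for (FuncPtr f : candidates)
      {
          const auto start = std::chrono::steady_clock::now();
          f(test_argument);  // result discarded; only the elapsed time matters here
          const auto elapsed = std::chrono::steady_clock::now() - start;
          if (elapsed < best_time) { best_time = elapsed; best = f; }
      }
      return best;
  }

  // Usage, roughly as in the snippet above (types invented for illustration):
  //   std::vector<void (*)(const data_t&)> sf = { function_a, function_b, function_c };
  //   auto f = get_fastest_function(sf, test_argument);
  //   f(argument);  // hopefully the fastest implementation for this node and data size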
VTune is for tuning application performance. VTune is too big a hammer for measuring benchmark runtime. Benchmarks are about wall clock time of a piece of code as a metric for performance of something else. How you measure that time and how you report those measurements is a problem that is prone to error.
Ok, I should have read the whole email before writing the above monologue -- agreed. Though I'd caution against bare 'wall clock' times in favour of relative times (sometimes, of course, you may *have* to measure the wall-clock time).
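By 'relative' I mean something along these lines - time_once here is a made-up helper, and a single call is of course far too naive on its own:

  #include <chrono>

  // Hypothetical helper: seconds taken by one call of f.
  // (A real harness would repeat the call many times and summarise.)
  template <typename F>
  double time_once(F f)
  {
      const auto t0 = std::chrono::steady_clock::now();
      f();
      const auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double>(t1 - t0).count();
  }

  // Report "B takes 1.05x the time of A" rather than "B takes 10.112 s";
  // the ratio travels between machines and loads far better than the bare number.
  //   double ratio = time_once(run_b) / time_once(run_a);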
Personally I perform several runs of the same benchmark and then take the minimum time as the time I report.
Be very cautious about that, for a number of reasons.

a) By reporting the minimum, without qualifying your statement, you are lying. For example, if you say "My code takes 10.112 seconds to run", people are unlikely to ever observe that time - it is by construction atypical.

b) If you ever do this (use the minimum or maximum) then you are unwittingly sampling an extreme value distribution: http://en.wikipedia.org/wiki/Extreme_value_distribution

For example, say you 'know' that the measured times are normally distributed with mean (and median) zero and a variance of 10. If you take, say, 100 measurements and then take the minimum - how is that minimum distributed? In other words, if you repeat this again and again, what does the histogram of the minimum look like? What you will find is that the dispersion (variance or otherwise) on the median is small, say \sigma^2 = 0.15, whereas the dispersion on the minimum can be much larger, \sigma^2 = 1.8 - *and* the distribution is highly asymmetrical. If your two times were nominally separated by around 0.5 you would not necessarily be able to reliably determine which was fastest from the minima, since the distributions of the minima could overlap.

Attached are two examples: histograms of the observed minimum for many trials and of the observed median for many trials of the above scenario. Notice that the histogram of the observed minimum is *far* broader than that of the observed median. I suggest you experiment with this - you may be (unpleasantly) surprised; it's somewhat counter-intuitive. An appropriate MATLAB snippet (ok, ok, put away the pitchforks) is below.

  % EV
  clear all;
  v_median = zeros(50000,1); % Lots of trials to get a decent looking histogram.
  v_min = v_median;
  for m = 1:length(v_median)
      v = randn(100,1).*sqrt(10); % Normally distributed, try other distributions.
      v_median(m) = median(v);
      v_min(m) = min(v);
  end
  figure(1); hist(v_median,51);
  figure(2); hist(v_min,51);
This excludes all outliers due to OS and such.
That may well be the aim, but I'm not sure it does get you what you want. This is where understanding of the maths (statistics) needs to be blended with an understanding of the machine. Your use of the minimum could make it *less* possible to infer something from the measurements. On the other hand you might be right - I don't know for sure -- but I am sceptical.
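For what it's worth, the kind of reporting I would trust more is a median plus a measure of spread over the repeated runs, rather than the single fastest run. A minimal sketch, with deliberately crude index-based quartiles (not something I'd put in a paper):

  #include <algorithm>
  #include <vector>

  // Summarise repeated timings by median and rough interquartile range,
  // rather than by the single fastest run.
  struct timing_summary { double median, q1, q3; };

  timing_summary summarise(std::vector<double> seconds)
  {
      std::sort(seconds.begin(), seconds.end());
      const auto n = seconds.size();
      return { seconds[n / 2], seconds[n / 4], seconds[(3 * n) / 4] };
  }

  // "median 10.31 s, IQR 0.2 s" is a claim people can reproduce;
  // the minimum of many runs is, by construction, atypical.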
If a car company reported the average (mean or median) 0 to 60 mph for a given car when their test driver was trying to see how fast it was everyone would think they were crazy.
That's all down to marketing. I imagine they *do* determine it, they just don't report it that way. Most people couldn't care less what these numbers actually mean - just that it's 'better' than the other one. In the infamous words of "Spinal Tap" - "it goes up to eleven!".
Why should the fact that he pushed the gas too hard on some of the trials and spun out and not hard enough on others count against the car?
I imagine they test a number of different cars rather than just the same car a lot of times and that the reported time is some form of median. This would mean that half the cars will go a little faster - half a little slower than this 'typical' car. Almost certainly the testing criteria and their analysis is more sophisticated than that - speculation on my part -- anyone in the car industry?
I also typically don't have the luxury of benchmarking on an unloaded machine, so I have to make sure to do things fairly; for instance, instead of running many trials of A then many trials of B and taking the minimum for each, I run many trials of A then B back to back and take the minimum. That way A and B on average see the same environment over the course of the experiment.
Hopefully. As someone pointed out "It's a dark art"!
If I run A and B in the same process I run A first then B and then B first and then A as a separate run to ensure that the order doesn't impact their performance.
Just to add to the mix, I suggest randomising the order in which you run them (that's what I do); that way you are very unlikely to accidentally synchronise yourself with some other process, e.g. ABBBABAABAB etc...
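A minimal sketch of what I mean by randomising, assuming you already have a timing harness; make_schedule is just an invented name:

  #include <algorithm>
  #include <random>
  #include <vector>

  // Build a shuffled schedule of trials, e.g. "ABBBABAABA..." for two functions,
  // so the trial order cannot stay in lock-step with some other process.
  std::vector<int> make_schedule(int trials_per_function, int n_functions)
  {
      std::vector<int> schedule;
      for (int f = 0; f < n_functions; ++f)
          schedule.insert(schedule.end(), trials_per_function, f);
      std::mt19937 rng{std::random_device{}()};
      std::shuffle(schedule.begin(), schedule.end(), rng);
      return schedule;
  }

  // Then walk the schedule, timing whichever function each entry names,
  // and accumulate the samples per function before summarising.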
Usually benchmarking implies comparison, so the key concern is that the results are fair.
Easy to write, very hard to do! ;-)

Regards,
-ed

------------------------------------------------
"No more boom and bust." -- Dr. J. G. Brown, 1997