
Steven Ross wrote:
> On Sun, Jul 5, 2009 at 9:17 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
>> If the problem were simply that "x86 FP is very slow", then this problem would be evident with std::sort as well as your algorithm. I've therefore done a simple test with a program like the following:
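Phil's test program is not reproduced in the quote; a minimal sketch of the kind of benchmark he describes, with compare_floats_as_ints and the data setup reconstructed rather than taken from his post, might look like this (the vector<int> and plain vector<float> timings would come from changing the element type and dropping the comparator):

    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <vector>

    // Reconstruction, not Phil's actual code: compare two floats by their
    // bit patterns taken as signed 32-bit ints.  This matches float order
    // for non-negative values; negatives would need the fixup Phil mentions.
    bool compare_floats_as_ints(float a, float b) {
        int ia, ib;
        std::memcpy(&ia, &a, sizeof ia);
        std::memcpy(&ib, &b, sizeof ib);
        return ia < ib;
    }

    int main() {
        std::vector<float> v(10000000);
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = static_cast<float>(std::rand());  // stand-in for random_fill
        std::clock_t start = std::clock();
        std::sort(v.begin(), v.end(), compare_floats_as_ints);
        std::printf("%.2f s\n", double(std::clock() - start) / CLOCKS_PER_SEC);
        return 0;
    }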
> Agreed Phil. I do see it with std::sort. Vladimir also saw a large runtime difference with 23-bit random data, though not quite as large.
>> Note that I haven't actually checked if I get the right answers from this, so please let me know if you can see any mistakes! Also note that I have not done any fixup for negative floats, so some small additional time would be needed for that.
>> I've run variations of this code with the following results:
>> vector<int>, std::sort(begin,end) : 3.71 s
>> vector<float>, std::sort(begin,end) : 9.12 s
>> vector<float>, std::sort(begin,end,compare_floats_as_ints) : 4.19 s
> My results, with 32-bit random data, which is what you appear to be using (you did not provide your random_fill function; I used randomgen 16 16 10000000 to get these results):
>
> vector<float>, sorting as integers (my approach) : 1.417s, and the data is correctly sorted
> vector<float>, std::sort(begin, end) : 2.793s, and the data is correctly sorted
> vector<float>, std::sort(begin,end,compare_floats_as_ints) : 4.291s, and the data is NOT correctly sorted (because negatives aren't flipped)
>
> I set NaNs (I get 3975 of them) to 0.0 in my testing so as to have a unique correct sorted output. I'm using a Core 2 Duo.
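The fixup Steven refers to is commonly done with a bit trick like the sketch below (which may differ from his actual float_sort code): flip all bits of a negative float and set the sign bit of a non-negative one, so that unsigned integer order matches numeric float order.

    #include <cstring>

    // Sketch of the standard transform, not necessarily Steven's code:
    // maps a float's bits to an unsigned 32-bit key whose unsigned order
    // matches the float's numeric order (NaNs still need special handling).
    unsigned int float_to_ordered_key(float f) {
        unsigned int u;
        std::memcpy(&u, &f, sizeof u);
        return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
    }

Sorting such keys with any integer sort, or comparing them in a comparator, then gives the same order as std::less<float> on non-NaN data, including negatives.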
> I think I've figured out what's going on with my 23-bit random data: float comparison on x86 appears to have some form of optimization that looks at the exponent first and the coefficient afterwards, and the comparison is extremely slow when the exponents are the same.
Same, and equal to what? If your test ends up generating a pile of denormalized numbers, then it's rather broken regardless of what processor you are using.
> With 23-bit random data, all the exponents are the same (only the 23 mantissa bits vary). With 32-bit random data, there will be 512 different values for the exponents,
Are you sure? 32-bit IEEE format has 8 bits for the exponent.
> splitting the identical-exponent problems much smaller, to the point where they are close to the size of the cache (in-cache sorting is extremely fast). With realistic data, there will be multiple exponents, but not as many as there are data elements. More likely, there will be a small range of exponents and, for each of them, many elements with the same exponent. Also, a denial-of-service attack could easily be mounted against a system that sorts floats, simply by feeding it floats with the same exponent but different coefficients.
>
> When I test with 26-bit random data, where there are 8 different exponent values (a more realistic case), here are my results:
>
> sorting floats as integers (my approach) : 1.473s
> std::sort : 21.511s
>
> (I used randomgen 10 16 10000000 to generate the input data.)
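The randomgen tool itself does not appear in this thread; judging from the invocations above, its first two arguments give the number of random bits in the upper and lower 16 bits of each 32-bit value, so "16 16" yields fully random 32-bit data and "10 16" yields 26-bit data. A rough, hypothetical stand-in might be:

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    // Hypothetical stand-in for randomgen, inferred from this thread, not
    // Steven's actual tool: writes `count` 32-bit values with `high` random
    // bits in the upper half and `low` random bits in the lower half.
    // Note std::rand() only guarantees 15 random bits per call.
    int main(int argc, char **argv) {
        if (argc != 4) return 1;
        int high = std::atoi(argv[1]);
        int low = std::atoi(argv[2]);
        long count = std::atol(argv[3]);
        std::FILE *out = std::fopen("input.txt", "wb");
        if (!out) return 1;
        for (long i = 0; i < count; ++i) {
            std::uint32_t v =
                (static_cast<std::uint32_t>(std::rand() & ((1 << high) - 1)) << 16)
                | static_cast<std::uint32_t>(std::rand() & ((1 << low) - 1));
            std::fwrite(&v, sizeof v, 1, out);
        }
        std::fclose(out);
        return 0;
    }

With 23 random bits in total (e.g. "7 16"), every value has the same zero exponent field when reinterpreted as a float, which is Steven's worst case above.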
So not 120X, but still a substantial difference.
> Also worth noting is that when the element count increases, the speed difference increases, even on 32-bit random data (randomgen 16 16 200000000).
What real-world problems require sorting such huge arrays of floats?
> I'll test with SSE and other compiler settings next. Is there an easy way to add these with bjam?
Because bjam is actually a low-level build engine, it is pretty much impossible to do anything with it at all. Boost.Build allows you to pass any flags you see fit to the compiler; my previous post gives the actual syntax I used.

- Volodya
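For reference, and not necessarily the syntax from that earlier post, Boost.Build accepts extra compiler flags either on the command line:

    bjam cxxflags="-msse2 -mfpmath=sse"

or per-target in a Jamfile:

    exe benchmark : benchmark.cpp : <cxxflags>-msse2 <cxxflags>-mfpmath=sse ;

(The -msse2 and -mfpmath=sse flags are GCC's; "benchmark" is a placeholder target name.)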