sorting floats by casting to integer

I've been investigating why float_sort is so much faster than std::sort on x86 processors, and why sorting integers is so much faster than float_sort, and came to the conclusion that floating-point comparisons are extremely slow on x86. Based upon that, I've written a different sorting approach that copies the input array of floats, casting them to integers, and sorts them as integers, dealing with the negative floats sorting in reverse order issue, completely eliminating floating-point comparisons. It takes O(n) additional memory, and is somewhat less general, as a different approach is required for key plus data sorting (as non-float data can't be safely cast this way), but it obtains integer-sorting speeds. On some problems it is as much as 120X as fast as std::sort. I doubt average results will be quite that good, but they could easily be well over 10X as fast. Does this sound like it's appropriate for Boost? If so, I'll add it to my proposed Boost.Algorithm.Sorting library, which is complete, aside from this issue. 
Example source (not templated yet):

#define DATATYPE float
#define CAST_TYPE int

inline void float_as_integer_sort(std::vector<DATATYPE> &input)
{
  if (input.size() < 2)
    return;
  std::vector<CAST_TYPE> cast_vec;
  cast_vec.resize(input.size());
  unsigned neg_index = 0;
  int pos_index = input.size() - 1;
  //Casting to integer and presorting into negative and positive
  for (unsigned u = 0; u < input.size(); ++u) {
    int cast_val = float_mem_cast<DATATYPE, CAST_TYPE>(input[u]);
    if (cast_val < 0)
      cast_vec[neg_index++] = cast_val;
    else
      cast_vec[pos_index--] = cast_val;
  }
  assert((pos_index + 1) == neg_index);
  //Sort the negatives
  integer_sort(&(cast_vec[0]), &(cast_vec[neg_index]));
  //Then the positives
  integer_sort(&(cast_vec[neg_index]), &(cast_vec[0]) + cast_vec.size());
  //Casting back negatives (reversing their order)
  for (unsigned u = 0; u < neg_index; ++u)
    std::memcpy(&(input[neg_index - u - 1]), &(cast_vec[u]), sizeof(DATATYPE));
  //Then positives
  for (unsigned u = neg_index; u < cast_vec.size(); ++u)
    std::memcpy(&(input[u]), &(cast_vec[u]), sizeof(DATATYPE));
}

Steven Ross wrote:
I've been investigating why float_sort is so much faster than std::sort on x86 processors, and why sorting integers is so much faster than float_sort, and came to the conclusion that floating-point comparisons are extremely slow on x86. Based upon that, I've written a different sorting approach that copies the input array of floats, casting them to integers, and sorts them as integers, dealing with the negative floats sorting in reverse order issue, completely eliminating floating-point comparisons. It takes O(n) additional memory, and is somewhat less general, as a different approach is required for key plus data sorting (as non-float data can't be safely cast this way), but it obtains integer-sorting speeds. On some problems it is as much as 120X as fast as std::sort. I doubt average results will be quite that good, but they could easily be well over 10X as fast. Does this sound like it's appropriate for Boost? If so, I'll add it to my proposed Boost.Algorithm.Sorting library, which is complete, aside from this issue.
Which x86 processors specifically? Some manufacturers have been known to focus on integer performance and let floating point suffer because floating point performance is usually not a dominant factor in overall performance. Other x86 processors, however, have quite good floating point performance. If you think about it, if casting to integer to compare floating point values is faster, why doesn't the processor do that itself? I have been told that some processors do. Regards, Luke

Simonson, Lucanus J wrote:
If you think about it, if casting to integer to compare floating point values is faster, why doesn't the processor do that itself? I have been told that some processors do.
Some floating point values that don't correspond to real numbers (like INF and NaN) have a very complicated behavior with respect to comparison. At least I remember that std::sort used to crash when trying to sort data containing NaNs. I think it has something to do with "a < b" being somehow related to "(a-b) < 0" or "0 < (b-a)" and a minus operation involving a NaN must yield a NaN.

I've been investigating why float_sort is so much faster than std::sort on x86 processors, and why sorting integers is so much faster than float_sort, and came to the conclusion that floating-point comparisons are extremely slow on x86.
Hey Steven,

Is it true even with SSE3 optimizations turned on and data aligned to 16-byte boundaries? Which compiler were you using?

Regards.

-- Edouard

I'm replying to everybody at once; I appreciate all your input. On Sat, Jul 4, 2009 at 1:52 AM, Edouard A. <edouard@fausse.info> wrote:
I've been investigating why float_sort is so much faster than std::sort on x86 processors, and why sorting integers is so much faster than float_sort, and came to the conclusion that floating-point comparisons are extremely slow on x86.
Hey Steven,
Is it true even with SSE3 optimizations turned on and data aligned to 16-byte boundaries?
Which compiler were you using?
I am using MSVC 2005 standard edition with bjam's default release settings for it. I have not tried other compiler settings, besides 64-bit compilation, which has comparable performance. This is on a Core 2 Duo purchased earlier this year for Boost development. I saw the same behavior and the same magnitude of speedup for float_sort vs. std::sort on floats on a 6 year old refurbished Dell Windows system using gcc in Cygwin. With two different compilers on processors manufactured at least 6 years apart, I suspect it's an issue with the architecture. There may be Intel processors that are faster for this purpose, but they certainly aren't the two mass-market Intel processors I've tried, with default optimizations.
If anybody would like to look at my testcase, it is in the boost vault, inside algorithm_sorting.zip. Inside there, it is in libs/algorithm/sorting/samples/float_as_integer.cpp. Even without using integer_sort, and just using std::sort on integers with the code I posted initially, the speedup is on the same order (80X). I tested using randomgen 7 16 1000000 (randomgen is compiled from my randomgen.cpp). 7 is the bit range among the "high" 16 bits, the second number (16) is the bit range among the "low" 16 bits, and 1000000 is the size of the vector to sort. 23 bits are used to exclude NaNs, which I assume to be uncommon in real data, and which are sorted arbitrarily, meaning that two correct sorted results containing NaNs will often not be identical. With mostly NaNs the speedup with this approach is about 2X.
From the libs/algorithm/sorting directory (once copied into the boost tree), you can test this way:

bjam --tune randomgen
randomgen 7 16 1000000
bjam --tune float_as_integer
float_as_integer (uses the approach I described)
float_as_integer -std (uses std::sort on floats)

(I see a huge difference in runtime on x86, but not on PowerPC)
If you think about it, if casting to integer to compare floating point values is faster, why doesn't the processor do that itself? I have been told that some processors do.
PowerPC processors appear to do so, as floating-point sorting for them is only slightly slower than integer sorting, and much faster than on the x86 processors I've tried. I don't know why the x86 processors I've tested don't do it that way, but my best guess is that they do, and that their cast operation is slow (switching between float and integer registers). My casting certainly was extremely slow relative to making an integer copy that I only cast once, and using that.
Some floating point values that don't correspond to real numbers (like INF and NaN) have a very complicated behavior with respect to comparison. At least I remember that std::sort used to crash when trying to sort data containing NaNs. I think it has something to do with "a < b" being somehow related to "(a-b) < 0" or "0 < (b-a)" and a minus operation involving a NaN must yield a NaN.
You are correct: a < NaN and NaN < a both return false, so a comparison sort treats NaN as equivalent to everything. Other odd values have similar behavior. As this places such invalid values in arbitrary positions, depending upon the comparison-based algorithm you are using and the input order, I cannot see how it is in any way superior to what integer sorting will do, which is to put NaNs in a specific, non-arbitrary place based upon their bit value.
Possibilities that occur to me include: - Unnecessary transfers between integer and floating point registers.
I suspect this is the cause of the delay, but these transfers may be necessary.
- Memory alignment issues.
Why? I see the same issue with doubles vs. 64-bit integers.
- Lack of optimisation.
I'm compiling in bjam with release or msvc-8.0; with gcc I compiled with -O3. All show similar results.
- Something related to NaNs, de-normals etc (do you have these in your test data?).
I've tested with NaNs. As long as there aren't many NaNs, the runtime isn't impacted much by them. What I see is that std::sort is sped up by NaNs, as they are considered identical to everything and thus less effort is required to "sort" them; I don't consider 90% NaNs a realistic performance test. My integer-based approach just cares about bits, so it puts the NaNs in a specific place and takes roughly the same time as it does with normal floating-point values.
Also, I don't see why you actually need to copy all the data to achieve what you're doing. Can't you cast it "in place" and then deal with the negative numbers (etc) as an in-place pre- or post-processing step?
With regards to casting to an integer: in further testing I found that if I cast the floats to integers for comparison, but leave them in their original vector of floats and sort them as integers that way, the speedup is roughly 2X, not 120X (slower than my current "float_sort" algorithm, which gets 7X). The casting operation must also be slow on x86 (it isn't on PowerPC), even though it's just this:

int result;
std::memcpy(&result, &(*floatiter), sizeof(float));

Note that directly reinterpreting a float's bits as an integer (e.g. via pointer casting) violates aliasing rules and can cause undefined behavior, which is why memcpy is used.

I'm thinking that the more general approach for key-plus-data sorting with casting to an integer is this: the user provides first and last iterators, along with a method to cast values to an integer. A vector of (integer, element) pairs is created. This vector is then sorted on the integer key, with the element as data. Once the sorting is complete, the elements are assigned to the corresponding locations in the range passed in. Memory overhead is 1 integer plus a full copy per element. There may be more memory-efficient algorithms, but this should be simple and fast. I've considered using an iterator as an index instead, but then putting the data into the correct order in O(n) time without substantial additional memory is tricky; the best such algorithm I could think of uses 1 integer + 3 iterators of memory overhead per element.

Steven Ross wrote:
I am using MSVC 2005 standard edition with bjam's default release settings for it. I have not tried other compiler settings, besides 64-bit compilation, which has comparable performance. This is on a Core 2 Duo purchased earlier this year for Boost development. I saw the same behavior and the same magnitude of speedup for float_sort vs. std::sort on floats on a 6 year old refurbished Dell Windows system using gcc in Cygwin. With two different compilers on processors manufactured at least 6 years apart, I suspect it's an issue with the architecture. There may be Intel processors that are faster for this purpose, but they certainly aren't the two mass-market Intel processors I've tried, with default optimizations.
If anybody would like to look at my testcase, it is in the boost vault, inside algorithm_sorting.zip. Inside there, it is in libs/algorithm/sorting/samples/float_as_integer.cpp.
This version fails to compile for me, because it includes boost/static_warning.hpp, which does not seem to be present in my SVN HEAD tree. I've changed the include to boost/serialization/static_warning.hpp, but this can't be right. Am I missing something?
Even without using integer_sort, and just using std::sort on integers with the code I posted initially, the speedup is on the same order (80X). I tested using randomgen 7 16 1000000 (randomgen is compiled from my randomgen.cpp). 7 is the bit range among the "high" 16 bits, the second number (16) is the bit range among the "low" 16 bits, and 1000000 is the size of the vector to sort. 23 bits are used to exclude NaNs, which I assume to be uncommon in real data, and which are sorted arbitrarily, meaning that two correct sorted results containing NaNs will often not be identical. With mostly NaNs the speedup with this approach is about 2X.
From the libs/algorithm/sorting directory (once copied into the boost tree), you can test this way:

bjam --tune randomgen
randomgen 7 16 1000000
bjam --tune float_as_integer
float_as_integer (uses the approach I described)
float_as_integer -std (uses std::sort on floats)

(I see a huge difference in runtime on x86, but not on PowerPC)
To get bigger times that are less prone to noise, I have changed loopCount in your test to 10. Then, the time reported without -std is 1 second, with very small difference between each run. The time with -std is 43 seconds, again with negligible variation. If I add:

cxxflags="-march=nocona -mfpmath=sse"

then -std runs in 16 seconds, while the time for your algorithm is the same 1 second. Of course, this is still a significant speedup, but smaller than the number you have reported, so probably a deeper look is reasonable.

Incidentally, does your library only allow sorting a collection of integers/floats, or does it also allow sorting an arbitrary collection where each item has an integer/float key? If so, can you point me at an example?

- Volodya

On Sun, Jul 5, 2009 at 1:01 AM, Vladimir Prus <vladimir@codesourcery.com>wrote:
Steven Ross wrote:
If anybody would like to look at my testcase, it is in the boost vault, inside algorithm_sorting.zip. Inside there, it is in libs/algorithm/sorting/samples/float_as_integer.cpp.
This version fails to compile for me, because it includes boost/static_warning.hpp, which does not seem to be present in my SVN HEAD tree. I've changed the include to boost/serialization/static_warning.hpp, but this can't be right. Am I missing something?
I downloaded boost_1_38_0, and I have boost/static_warning.hpp in my tree. If it would make it easier for others to compile, I'll shift the include to serialization.
To get bigger times that are less prone to noise, I have changed loopCount in your test to 10. Then, the time reported without -std is 1 second, with very small difference between each run. The time with -std is 43 seconds, again with negligible variation.
If I add:
cxxflags="-march=nocona -mfpmath=sse"
then -std runs in 16 seconds, while the time for your algorithm is the same 1 second. Of course, this is still a significant speedup, but smaller than the number you have reported, so probably a deeper look is reasonable.
Incidentally, does your library only allow sorting a collection of integers/floats, or does it also allow sorting an arbitrary collection where each item has an integer/float key? If so, can you point me at an example?
Yes. samples/keyplusdatasample.cpp for integer keys + string data. I see a 3X improvement for this test, much more than I get for just sorting integers on their own. samples/floatfunctorsort.cpp shows the functors you need to pass to float_sort, but the test just uses a float; you should be able to just replace DATATYPE and update the functors. I'll modify it to use key + data so there is an example of such usage. I will note that floatfunctorsort.cpp does not use my new copying approach, so it only gets around a 7X speedup, and possibly less with the optimizations you applied.

Thanks for the suggestion Vladimir, that's a substantial difference, and I will investigate. I also appreciate the verification that it's not just my system doing this, and I don't expect a 120X speedup across all tests and systems.

Steven Ross wrote:
I've been investigating why float_sort is so much faster than std::sort on x86 processors, and why sorting integers is so much faster than float_sort, and came to the conclusion that floating-point comparisons are extremely slow on x86.
Hi Steven,

If the problem were simply that "x86 FP is very slow", then this problem would be evident with std::sort as well as your algorithm. I've therefore done a simple test with a program like the following:

static inline bool compare_floats_as_ints(float f1, float f2)
{
  int i1, i2;
  memcpy(&i1, &f1, sizeof(int));
  memcpy(&i2, &f2, sizeof(int));
  return i1 < i2;
}

int main()
{
  std::srand(0);
  std::vector<float> data(10000000);
  random_fill(data.begin(), data.end());
  pbe::HPtime t_start(pbe::HPtime::now());
  std::sort(data.begin(), data.end(), compare_floats_as_ints);
  pbe::HPtime t_end(pbe::HPtime::now());
  std::cout << "t = " << t_end - t_start << "\n";
  return 0;
}

Note that I haven't actually checked if I get the right answers from this, so please let me know if you can see any mistakes! Also note that I have not done any fixup for negative floats, so some small additional time would be needed for that.

I've run variations of this code with the following results:

vector<int>, std::sort(begin,end) : 3.71 s
vector<float>, std::sort(begin,end) : 9.12 s
vector<float>, std::sort(begin,end,compare_floats_as_ints) : 4.19 s

So it looks like integer comparisons are two to three times faster than float comparisons on this hardware. This is the sort of difference that I would expect. Sorting floats as integers is a bit slower than sorting integers, but significantly faster than sorting them as floats (but note that the fixup for negatives etc. is not implemented). This is with g++-4.3 -O3 -fprofile-generate/use on Linux, on a 1.2 GHz VIA C7-M. Using profile-driven optimisation is important to get good results.

I feel that your observation that FP comparisons are 100X (or even 20X) slower than integer comparisons on Intel hardware is implausible. There must be some other reason for the slowness.

Regards, Phil.

On Sun, Jul 5, 2009 at 9:17 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
If the problem were simply that "x86 FP is very slow", then this problem would be evident with std::sort as well as your algorithm. I've therefore done a simple test with a program like the following:
Agreed, Phil. I do see it with std::sort. Vladimir also saw a large runtime difference with 23-bit random data, though not quite as large.
Note that I haven't actually checked if I get the right answers from this, so please let me know if you can see any mistakes! Also note that I have not done any fixup for negative floats, so some small additional time would be needed for that.
I've run variations of this code with the following results:
vector<int>, std::sort(begin,end) : 3.71 s
vector<float>, std::sort(begin,end) : 9.12 s
vector<float>, std::sort(begin,end,compare_floats_as_ints) : 4.19 s
My results, with 32-bit random data, which is what you appear to be using (you did not provide your random_fill function; I used randomgen 16 16 10000000 to get these results):

vector<float>, sorting as integers (my approach) : 1.417s, and data is correctly sorted
vector<float>, std::sort(begin, end) : 2.793s, and data is correctly sorted
vector<float>, std::sort(begin,end,compare_floats_as_ints) : 4.291s, and data is not correctly sorted (because negatives aren't flipped)

I set NaNs (I get 3975 of them) to 0.0 in my testing so as to have a unique correct sorted output. I'm using a Core 2 Duo.

I think I've figured out what's going on with my 23-bit random data: floats on x86 have some form of optimization to look at the exponent first, and then the coefficient afterwards. On x86 systems, the comparison is extremely slow if the exponents are the same. With 23-bit random data, all the exponents are the same. With 32-bit random data, there will be 512 different values for the exponents, splitting the identical-exponent problems much smaller, to the point where they are close to the size of the cache (in-cache sorting is extremely fast).

With realistic data, there will be multiple exponents, but not as many as there are data elements. More likely, there will be a small range of exponents and, for each of them, many elements with the same exponent. Also, a denial-of-service attack could easily be made against a system that sorts floats by just feeding it floats with the same exponent but different coefficients.

When I test with 26-bit random data, where there are 8 different exponent values (a more realistic case), here are my results:

sorting floats as integers (my approach): 1.473s
std::sort: 21.511s

(I used randomgen 10 16 10000000 to generate the input data.)

So not 120X, but still a substantial difference. Also worth noting is that when the element count increases, the speed difference increases, even on 32-bit random data.
randomgen 16 16 200000000:

sorting floats as integers: 21.967s
std::sort: 104.336s

Which concurs with the splitting-on-exponent argument. I'll test with SSE and other compiler settings next. Is there an easy way to add these with bjam? Otherwise I'll just use MSVC directly.

Steven Ross wrote:
On Sun, Jul 5, 2009 at 9:17 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
If the problem were simply that "x86 FP is very slow", then this problem would be evident with std::sort as well as your algorithm. I've therefore done a simple test with a program like the following:
Agreed, Phil. I do see it with std::sort. Vladimir also saw a large runtime difference with 23-bit random data, though not quite as large.
Note that I haven't actually checked if I get the right answers from this, so please let me know if you can see any mistakes! Also note that I have not done any fixup for negative floats, so some small additional time would be needed for that.
I've run variations of this code with the following results:
vector<int>, std::sort(begin,end) : 3.71 s
vector<float>, std::sort(begin,end) : 9.12 s
vector<float>, std::sort(begin,end,compare_floats_as_ints) : 4.19 s
My results, with 32-bit random data, which is what you appear to be using (you did not provide your random_fill function; I used randomgen 16 16 10000000 to get these results):

vector<float>, sorting as integers (my approach) : 1.417s, and data is correctly sorted
vector<float>, std::sort(begin, end) : 2.793s, and data is correctly sorted
vector<float>, std::sort(begin,end,compare_floats_as_ints) : 4.291s, and data is not correctly sorted (because negatives aren't flipped)

I set NaNs (I get 3975 of them) to 0.0 in my testing so as to have a unique correct sorted output. I'm using a Core 2 Duo.
I think I've figured out what's going on with my 23-bit random data: floats on x86 have some form of optimization to look at the exponent first, and then the coefficient afterwards. On x86 systems, the comparison is extremely slow if the exponents are the same.
Same, and equal to what? If your test ends up generating a pile of denormalized numbers, then it's rather broken regardless of what processor you are using.
With 23-bit random data, all the exponents are the same. With 32-bit random data, there will be 512 different values for the exponents,
Are you sure? 32-bit IEEE format has 8 bits for the exponent.
splitting the identical-exponent problems much smaller, to the point where they are close to the size of the cache (in-cache sorting is extremely fast). With realistic data, there will be multiple exponents, but not as many as there are data elements. More likely, there will be a small range of exponents and, for each of them, many elements with the same exponent. Also, a denial-of-service attack could easily be made against a system that sorts floats by just feeding it floats with the same exponent but different coefficients. When I test with 26-bit random data, where there are 8 different exponent values (a more realistic case), here are my results:

sorting floats as integers (my approach): 1.473s
std::sort: 21.511s

(I used randomgen 10 16 10000000 to generate the input data.)
So not 120X, but still a substantial difference.
Also worth noting is that when the element count increases, the speed difference increases, even on 32-bit random data. randomgen 16 16 200000000:
What real-world problems require sorting such huge arrays of floats?
I'll test with SSE and other compiler settings next. Is there an easy way to add these with bjam?
Because bjam is actually a low-level build engine, it is pretty much impossible to do anything with it at all. Boost.Build allows you to pass any flags to the compiler you see fit; my previous post gives the actual syntax I have used. - Volodya

On Mon, Jul 6, 2009 at 8:37 AM, Vladimir Prus <vladimir@codesourcery.com>wrote:
With 23-bit random data, all the exponents are the same. With 32-bit random data, there will be 512 different values for the exponents,
Are you sure? 32-bit IEEE format has 8 bits for the exponent.
+1 for sign.
splitting the identical-exponent problems much smaller, to the point where they are close to the size of the cache (in-cache sorting is extremely fast). With realistic data, there will be multiple exponents, but not as many as there are data elements. More likely, there will be a small range of exponents and, for each of them, many elements with the same exponent. Also, a denial-of-service attack could easily be made against a system that sorts floats by just feeding it floats with the same exponent but different coefficients. When I test with 26-bit random data, where there are 8 different exponent values (a more realistic case), here are my results:

sorting floats as integers (my approach): 1.473s
std::sort: 21.511s

(I used randomgen 10 16 10000000 to generate the input data.)
So not 120X, but still a substantial difference.
Also worth noting is that when the element count increases, the speed difference increases, even on 32-bit random data. randomgen 16 16 200000000:
What real-world problems require sorting such huge arrays of floats?
Circuit layouts, if someone wants to sort them for various purposes. They are gigabytes in size, so they take a long time to sort unless casting to integer is done.

Yes, I guess I was using denormalized numbers for the 23-bit case, but it's interesting that there is such a huge speed difference, and I doubt it's because they're denormalized; there is no reason to special-case denormalized sorting. Notably, normalized numbers also show the same problem if their exponent range is small relative to the number of elements; this can be achieved by using a small number of different exponents, or simply a large number of floats (as shown with my large test). What do you consider a reasonable range? I can easily make a test that excludes denormalized numbers.

Even if the full exponent range is in use (which isn't realistic), wouldn't it be better to use an algorithm that doesn't blow up when n gets much larger than 10 million elements, and is still faster with a smaller number of elements?

Steven Ross wrote:
On Mon, Jul 6, 2009 at 8:37 AM, Vladimir Prus <vladimir@codesourcery.com>wrote:
With 23-bit random data, all the exponents are the same. With 32-bit random data, there will be 512 different values for the exponents,
Are you sure? 32-bit IEEE format has 8 bits for the exponent.
+1 for sign.
Technically, it's not part of the exponent. That is, the first bit is not the sign of the exponent, it's the sign of the value.
So not 120X, but still a substantial difference.
Also worth noting is that when the element count increases, the speed difference increases, even on 32-bit random data. randomgen 16 16 200000000:
What real-world problems require sorting such huge arrays of floats?
Circuit layouts, if someone wants to sort them for various purposes. They are gigabytes in size, so they take a long time to sort unless casting to integer is done.
OK, I know nothing about circuit layouts :-)
Yes, I guess I was using denormalized numbers for the 23-bit case, but it's interesting that there is such a huge speed difference, and I doubt it's because they're denormalized; there is no reason to special-case denormalized sorting.
It is quite possible that comparing denormalized numbers is slow.
Notably, normalized numbers also show the same problem if their exponent range is small relative to the number of elements; this can be achieved by using a small number of different exponents, or simply a large number of floats (as shown with my large test). What do you consider a reasonable range? I can easily make a test that excludes denormalized numbers.
Even if the full exponent range is in use (which isn't realistic), wouldn't it be better to use an algorithm that doesn't blow up when n gets much larger than 10 million elements, and is still faster with a smaller number of elements?
I don't quite follow the above point. My argument is (and always was) that it's important to provide accurate measurements on the data that ordinary humans will use. Sorting 200'000'000 floats does not seem a typical use case, so performance on that is probably less important. Just like sorting denormalized numbers. - Volodya

On Mon, Jul 6, 2009 at 11:07 AM, Vladimir Prus <vladimir@codesourcery.com>wrote:
Yes, I guess I was using denormalized numbers for the 23-bit case, but it's interesting that there is such a huge speed difference, and I doubt it's because they're denormalized; there is no reason to special-case denormalized sorting.
It is quite possible that comparing denormalized numbers is slow.
Taking 10M elements with my 23 bits of randomness case, and making the top exponent bit 1, so that the data is normalized:

if (!(float_mem_cast<float, int>(array[v]) & 0x7f800000)) {
  //Make the top exponent bit high
  CAST_TYPE temp = 0x80000000 | float_mem_cast<float, int>(array[v]);
  memcpy(&(array[v]), &temp, sizeof(DATATYPE));
  //printf("new %e\n", array[v]);
}

I see std::sort take 83.527s, as compared to 1.226s when sorted as integers. Denormalized numbers are not the problem.
Notably, normalized numbers also show the same problem if their exponent
range is small relative to the number of elements; this can be achieved by using a small number of different exponents, or simply a large number of floats (as shown with my large test). What do you consider a reasonable range? I can easily make a test that excludes denormalized numbers.
Even if the full exponent range is in use (which isn't realistic), wouldn't it be better to use an algorithm that doesn't blow up when n gets much larger than 10 million elements, and is still faster with a smaller number of elements?
I don't quite follow the above point. My argument is (and always was) that it's important to provide accurate measurements on the data that ordinary humans will use. Sorting 200'000'000 floats does not seem a typical use case, so performance on that is probably less important. Just like sorting denormalized numbers.
Normal problems don't use the full space of possible exponents for floating-point numbers. Can you think of a benchmark that accurately reflects this reality?

On Mon, Jul 6, 2009 at 1:48 PM, Steven Ross <spreadsort@gmail.com> wrote:
On Mon, Jul 6, 2009 at 11:07 AM, Vladimir Prus <vladimir@codesourcery.com>wrote:
Yes, I guess I was using denormalized numbers for the 23-bit case, but it's interesting that there is such a huge speed difference, and I doubt it's because they're denormalized; there is no reason to special-case denormalized sorting.
It is quite possible that comparing denormalized numbers is slow.
Taking 10M elements with my 23 bits of randomness case, and making the top exponent bit 1, so that the data is normalized:

if (!(float_mem_cast<float, int>(array[v]) & 0x7f800000)) {
  //Make the top exponent bit high
  CAST_TYPE temp = 0x80000000 | float_mem_cast<float, int>(array[v]);
  memcpy(&(array[v]), &temp, sizeof(DATATYPE));
  //printf("new %e\n", array[v]);
}
I see std::sort take 83.527s, as compared to 1.226s when sorted as integers. Denormalized numbers are not the problem.
That was the sign bit. Sorry. With that fixed (using 0x40000000), the slowdown does appear to be caused by denormalized numbers. Instead, I'm seeing 3X speedups on modest-sized tests. I still have a problem: what's a good floating-point benchmark?
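For clarity, the corrected check (setting bit 0x40000000, the top exponent bit, instead of the sign bit 0x80000000) might look like the following sketch; force_normalized is a hypothetical helper name, and a plain memcpy stands in for float_mem_cast:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// If the float's exponent field is all zeroes (denormal or zero), set the
// highest exponent bit (bit 30), not the sign bit (bit 31), to force a
// normalized value. Assumes IEEE 754 single precision.
inline float force_normalized(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    if (!(bits & 0x7f800000u)) {
        bits |= 0x40000000u;  // top exponent bit, NOT the sign bit
        std::memcpy(&f, &bits, sizeof f);
    }
    return f;
}
```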

On 6 Jul 2009, at 21:51, Steven Ross wrote:
On Mon, Jul 6, 2009 at 1:48 PM, Steven Ross <spreadsort@gmail.com> wrote:
On Mon, Jul 6, 2009 at 11:07 AM, Vladimir Prus <vladimir@codesourcery.com> wrote:
Yes, I guess I was using denormalized numbers for the 23-bit case, but it's interesting that there is such a huge speed difference, and I doubt it's because they're denormalized; there is no reason to special-case denormalized sorting.
It is well possible that comparing denormalized numbers is slow.
Taking 10M elements with my 23 bits of randomness case, and making the top exponent bit 1, so that the data is normalized:

if(!(float_mem_cast<float, int>(array[v]) & 0x7f800000)) {
    //Make the top exponent bit high
    CAST_TYPE temp = 0x80000000 | float_mem_cast<float, int>(array[v]);
    memcpy(&(array[v]), &temp, sizeof(DATATYPE));
    //printf("new %e\n", array[v]);
}
I see std::sort take 83.527s, as compared to 1.226s when sorted as integers. Denormalized numbers are not the problem.
That was the sign bit. Sorry. With that fixed (using 0x40000000), the slowdown does appear to be caused by denormalized numbers. Instead, I'm seeing 3X speedups on modest-sized tests.
I still have a problem: what's a good floating-point benchmark?
Possibly a stupid thing to say -- rather than trying to massage bits like this, why not just generate floats directly, something like rand() * pow((float)10, rand()%20) for example?

On Mon, Jul 6, 2009 at 3:33 PM, Christopher Jefferson <chris@bubblescope.net> wrote:
I still have a problem: what's a good floating-point benchmark?
Possibly a stupid thing to say -- rather than trying to massage bits like this, why not just generate floats directly, something like rand() * pow((float)10, rand()%20) for example?
So that all possible bit combinations can be exercised, making sure there is no corner-case where sorting is incorrect. My tests do these things at the same time:
1) Verify correctness (most important)
2) Benchmark performance
3) Identify weak points, to be used for tuning, or so users can know when the algorithm is inappropriate for their system
Additionally, the tests and testcase generation need to run quickly, so they can be both accurate and finish in a reasonable amount of time. I think the best way to do this is to generate semi-random bits, and for the case of floats, to specifically eliminate denormalized numbers and NaNs. Generating them the way you suggest may be better for goal #2, but worse for #1.
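A sketch of that kind of generator (random_floats and acceptable_bits are hypothetical names; the rejection test assumes IEEE 754 single precision, and std::mt19937 stands in for whatever bit source the test harness actually uses):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

// Reject bit patterns that decode to denormals, NaNs, or infinities.
inline bool acceptable_bits(std::uint32_t bits) {
    std::uint32_t exponent = bits & 0x7f800000u;
    if (exponent == 0u && (bits & 0x007fffffu)) return false; // denormal
    if (exponent == 0x7f800000u) return false;                // NaN or infinity
    return true;
}

// Fill a vector with floats built from fully random 32-bit patterns,
// discarding the problematic encodings above.
std::vector<float> random_floats(std::size_t n, std::uint32_t seed = 42) {
    std::mt19937 gen(seed);  // 32 random bits per call
    std::vector<float> result;
    result.reserve(n);
    while (result.size() < n) {
        std::uint32_t bits = gen();
        if (!acceptable_bits(bits)) continue;
        float f;
        std::memcpy(&f, &bits, sizeof f);
        result.push_back(f);
    }
    return result;
}
```

This keeps goal #1 (all bit combinations except the excluded encodings) while avoiding the denormal skew in goal #2.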

Steven Ross wrote:
What real-world problems require sorting such huge arrays of floats?
Circuit layouts, if someone wants to sort them for various purposes. They are gigabytes in size, so they take a long time to sort unless casting to integer is done.
Which purposes? I work with circuit layouts that are gigabytes in size, I do polygon clipping on them, and I sort them as a first step before clipping. Why are your layouts represented in floating point in the first place? File formats for circuit layouts are integer. You would have had to convert from the original integer coordinates to get them into floating point.
Regards, Luke

Steven Ross wrote:
What real-world problems require sorting such huge arrays of floats?
Circuit layouts, if someone wants to sort them for various purposes. They are gigabytes in size, so they take a long time to sort unless casting to integer is done.
Which purposes? I work with circuit layouts that are gigabytes in size, I do polygon clipping on them, and I sort them as a first step before clipping. Why are your layouts represented in floating point in the first place? File formats for circuit layouts are integer. You would have had to convert from the original integer coordinates to get them into floating point.
On Mon, Jul 6, 2009 at 11:23 AM, Simonson, Lucanus J <lucanus.j.simonson@intel.com> wrote:
I don't write layout tools. I know someone who does, and he did represent some of his data as floating point.

On Mon, Jul 6, 2009 at 8:37 AM, Vladimir Prus <vladimir@codesourcery.com>wrote:
I'll test with SSE and other compiler settings next. Is there an easy way to add these with bjam?
Because bjam is actually a low-level build engine, it is pretty much impossible to do anything with it at all. Boost.Build allows you to pass any flags to compiler you see fit, my previous post gives the actual syntax I have used.
Thanks for the suggestion. I ended up passing those flags you suggested, and obtained this result:
cl : Command line warning D9002 : ignoring unknown option '-march=nocona'
cl : Command line warning D9002 : ignoring unknown option '-mfpmath=sse'
I've checked, and my cheap version of MSVC 8.0 apparently doesn't support SSE. So I can't benchmark with your recommended options without spending more money on a volunteer project. With default release optimizations, all 3 of my algorithms are roughly twice as fast as std::sort on randomized data. I'm stripping out denorms from the float tests because they skew results so much and aren't realistic data, and my floating-point performance improvement dropped from fantastic to reasonable. I've tested string_sort on a Dickens novel and had results comparable to random data, relative to std::sort (a little better than 2X faster). For now I'm using random bits for all my performance testing, which tends to generate long strings and I admit doesn't seem to be realistic string data, though it does test corner-cases.

On Sun, Jul 26, 2009 at 4:06 PM, Steven Ross<spreadsort@gmail.com> wrote:
On Mon, Jul 6, 2009 at 8:37 AM, Vladimir Prus <vladimir@codesourcery.com>wrote:
I'll test with SSE and other compiler settings next. Is there an easy way to add these with bjam?
Because bjam is actually a low-level build engine, it is pretty much impossible to do anything with it at all. Boost.Build allows you to pass any flags to compiler you see fit, my previous post gives the actual syntax I have used.
Thanks for the suggestion. I ended up passing those flags you suggested, and obtained this result:
cl : Command line warning D9002 : ignoring unknown option '-march=nocona'
cl : Command line warning D9002 : ignoring unknown option '-mfpmath=sse'
Those are for GCC. For VC++, you want to use /arch:SSE2. -- Cory Nelson http://int64.org

On Sun, Jul 26, 2009 at 5:22 PM, Cory Nelson <phrosty@gmail.com> wrote:
Thanks for the suggestion. I ended up passing those flags you suggested, and obtained this result:
cl : Command line warning D9002 : ignoring unknown option '-march=nocona'
cl : Command line warning D9002 : ignoring unknown option '-mfpmath=sse'
Those are for GCC. For VC++, you want to use /arch:SSE2.
Thanks Cory. There was no warning with that option specified, but there was also no measurable performance change. I verified that MSVC was being called with the option, and that if I misspelled the option, a warning was issued. I think that SSE is deliberately disabled in the cheap version of MSVC I bought, based upon the Microsoft documentation.

Steven Ross wrote:
On Sun, Jul 26, 2009 at 5:22 PM, Cory Nelson <phrosty@gmail.com> wrote:
Thanks for the suggestion. I ended up passing those flags you suggested, and obtained this result:
cl : Command line warning D9002 : ignoring unknown option '-march=nocona'
cl : Command line warning D9002 : ignoring unknown option '-mfpmath=sse'
Those are for GCC. For VC++, you want to use /arch:SSE2.
Thanks Cory. There was no warning with that option specified, but there was also no measurable performance change. I verified that MSVC was being called with the option, and that if I misspelled the option, a warning was issued. I think that SSE is deliberately disabled in the cheap version of MSVC I bought, based upon the Microsoft documentation.
No, they are not disabled, not even in the free Express edition. Optimizations were disabled in older free versions, producing bad benchmarks all over the place. Someone learned from this! :-) Bo Persson

Steven Ross wrote:
I think I've figured out what's going on with my 23-bit random data: floats on x86 appear to be compared with some optimization that looks at the exponent first, and then the coefficient afterwards, and the comparison is extremely slow if the exponents are the same.
I still find that implausible.
With 23-bit random data, all the exponents are the same.
You aren't setting all the exponents to zero are you? If you are then I think you're generating de-normal values. The idea that comparing de-normals is slow is plausible; it's the sort of thing that might be done in microcode or in software on some systems. (I've just noticed that Volodya has also suggested this explanation...) Phil.

Thanks Phil and Vladimir for pointing out the denormalization issue. I have not seen this issue on PowerPC, which was my main test system for sorting for years, and I still don't see why the processor treats denormalized numbers differently, but it clearly does. Their performance impact is so huge that it doubles the runtime of std::sort on a large list (200 million elements) of completely random bits, even though only 1 in 256 values is denormalized; since people normally won't use denormalized numbers, they're not a realistic case. Given that input, I'm going to scrap my sort-as-integer approach and stick with the cast-to-integer approach already in my float_sort implementation, which uses no additional memory overhead. I'll also do my testing with SSE optimizations, and make sure to exclude denormalized numbers from my float_sort tests so that performance results are more realistic. I expect the improvement to drop to roughly 2X, which is about what the other algorithms (integer_sort, string_sort) get. I apologize for being stubborn on this issue.
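For readers unfamiliar with the cast-to-integer approach mentioned above, it rests on the well-known IEEE 754 ordering transform; the sketch below illustrates the general technique (float_to_key is a hypothetical name, not the actual float_sort code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Map a float's bit pattern to an unsigned key whose unsigned ordering
// matches the float ordering: flip all bits of negatives (so larger
// magnitudes compare smaller), and only the sign bit of non-negatives.
// Assumes IEEE 754 single precision.
inline std::uint32_t float_to_key(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```

With this key, any integer-based sort (radix, spreadsort, etc.) orders the floats correctly without a single floating-point comparison.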

Steven Ross wrote:
I've been investigating why float_sort is so much faster than std::sort on x86 processors, and why sorting integers is so much faster than float_sort, and came to the conclusion that floating-point comparisons are extremely slow on x86. Based upon that, I've written a different sorting approach that copies the input array of floats, casting them to integers, and sorts them as integers, dealing with the negative floats sorting in reverse order issue, completely eliminating floating-point comparisons. It takes O(n) additional memory, and is somewhat less general, as a different approach is required for key plus data sorting (as non-float data can't be safely cast this way), but it obtains integer-sorting speeds. On some problems it is as much as 120X as fast as std::sort. I doubt average results will be quite that good, but they could easily be well over 10X as fast.
Hi Steven,
120X seems so enormous that it makes me think there must be something else going on. I would suggest looking at the assembler that is generated, but I think that sort is sufficiently complicated that the assembler will be hard to read (unless you're used to it). So, I suggest instead looking at a simpler problem involving comparisons, e.g. is_sorted(range) or minmax(range). Assuming that you see a similar difference between floating point and integer for that, you should be able to check the assembler for "smoking guns". Possibilities that occur to me include:
- Unnecessary transfers between integer and floating point registers.
- Memory alignment issues.
- Lack of optimisation.
- Something related to NaNs, de-normals etc. (do you have these in your test data?)
Also, I don't see why you actually need to copy all the data to achieve what you're doing. Can't you cast it "in place" and then deal with the negative numbers (etc.) as an in-place pre- or post-processing step?
Regards, Phil.
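Phil's closing suggestion can be sketched roughly as follows (a hypothetical illustration, not code from the proposed library; it assumes IEEE 754 floats the same size as a 32-bit int, and uses reinterpret_cast as an aliasing shortcut that production code should handle more carefully):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sort floats in place by viewing them as 32-bit signed ints. Negatives
// (sign bit set) sort to the front as negative ints, but among themselves
// in descending float order, so reverse that prefix as a post-processing step.
void inplace_float_sort(std::vector<float>& data) {
    static_assert(sizeof(float) == sizeof(std::int32_t),
                  "requires 32-bit floats");
    std::int32_t* begin = reinterpret_cast<std::int32_t*>(data.data());
    std::int32_t* end = begin + data.size();
    std::sort(begin, end);                                // integer comparisons only
    std::int32_t* mid = std::lower_bound(begin, end, 0);  // first non-negative
    std::reverse(begin, mid);                             // fix the negative run
}
```

This avoids the O(n) copy at the cost of a reverse over the negative prefix, which is linear and cheap compared to the sort itself.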
participants (9)
- Bo Persson
- Christopher Jefferson
- Cory Nelson
- Edouard A.
- Phil Endecott
- Simonson, Lucanus J
- Steven Ross
- Thomas Klimpel
- Vladimir Prus