Review Request: Algorithm.Sorting

I have uploaded algorithm_sorting.zip for a formal review to become Boost.Algorithm.Sorting, if people feel that is appropriate. It contains 3 templated hybrid radix-based/comparison-based algorithms, specialized to sort certain types of data more efficiently. All 3 have superior worst-case and average-case performance to std::sort, based upon my testing. None of these algorithms use much additional memory. All of them can sort key + data, and as they do fewer swaps, they tend to do so much faster than std::sort. As a general rule, the speedup of these algorithms relative to std::sort increases as data set size increases. These algorithms also tend to be faster relative to std::sort when comparisons are slow, as they do fewer (or no) comparisons.

string_sort sorts strings in O(n*length) time, as compared to O(n*length*log(n)) for pure comparison-based algorithms. It is consistently faster than std::sort at string sorting, by a little more than a factor of 2 based upon my testing.

float_sort sorts IEEE floating-point numbers as integers, attaining the same ordering as std::sort, with the exception of -0.0 and NaNs, neither of which can be ordered based upon their comparison function. float_sort is roughly twice as fast as std::sort on the UNIX-type OSes I've tested on, and roughly 7X as fast on Windows. I believe this speedup occurs because floating-point comparisons are slow on Windows.

integer_sort sorts integers around 60% faster than std::sort on UNIX-type OSes, and 25% faster on Windows. On Windows with MSVC, std::sort sorts already-sorted and small-range data extremely quickly, and integer_sort processes already-sorted data slower than it by as much as a factor of 3X. I believe this is due to some optimization for already-sorted data, as integer_sort takes substantially less absolute time on already-sorted data than on unsorted data. So integer_sort on Windows has a tradeoff: better worst-case and average-case performance, but worse best-case performance.

integer_sort and float_sort share some tuning parameters. I have set values that seem to work well on both UNIX-type and Windows OSes, but people are welcome to experiment to see what values are best for their system and problem type. The ideal values for integers and floats may differ on some systems (such as Windows), but for simplicity they are kept the same.

A tuning/testing script named tune.pl is included, for those who wish to see how fast these algorithms are on their system (and to verify correctness). I don't recommend using it for tuning without the -large option, and that takes days. If you do use it for tuning, I also recommend the -verbose option, to see what it's doing.
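For a sense of the interface, here is a minimal usage sketch. The header path is hypothetical, and the call signatures are my assumption from the description above, not verified against the package; each sort is used like std::sort over a random-access range:

    // Minimal usage sketch. The header path is hypothetical, and the
    // signatures are assumed from the description above.
    #include <string>
    #include <vector>
    #include "spreadsort.hpp" // hypothetical header from algorithm_sorting.zip

    int main()
    {
        std::vector<int> ints;
        ints.push_back(42); ints.push_back(-7); ints.push_back(19);
        integer_sort(ints.begin(), ints.end());   // hybrid radix/comparison

        std::vector<float> floats;
        floats.push_back(3.5f); floats.push_back(-1.25f);
        float_sort(floats.begin(), floats.end()); // sorts IEEE floats as integers

        std::vector<std::string> strs;
        strs.push_back("beta"); strs.push_back("alpha");
        string_sort(strs.begin(), strs.end());    // O(n * length)
        return 0;
    }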

on Mon Apr 27 2009, Steven Ross <spreadsort-AT-gmail.com> wrote:
All 3 have superior worst-case and average-case performance to std::sort, based upon my testing.
How can you determine worst-case performance by testing? Or have I misunderstood your statement? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

True worst-case performance has to be determined by finding the worst possible input, then feeding it to the algorithm. I believe I know what the real worst-case input to this algorithm is, and it will take O(n(bit_range/C - 1 + C)) time, where C is the smaller of MAX_SPLITS and (LOG_MEAN_BIN_SIZE + LOG_MIN_SPLIT_COUNT). Constructing it would take some effort, like the Quicksort worst case. For 32-bit integers and the default constants I provide, C is 8, so it ends up being n * ((32/8 - 1) = 3 slow iterations + 8 std::sort iterations); a quick sanity check of this arithmetic is sketched below. Based upon the value of LOG_CONST and the distribution, it can also give up and use std::sort, but only after n is cut into small pieces, so that the log(n) factor is small.

What I am referring to is that I ran tests with randomness ranging from 0 to 32 bits, and I compared the worst, average, and best runtimes among those. Graphs are included in the libs/algorithm/sorting directory, linked to from the index.html. You are welcome to test it on whatever distribution or test you like. I believe the float_sort results for randomness > 23 bits are skewed by NaNs; this change is obvious on the graph.

On Mon, Apr 27, 2009 at 9:54 AM, David Abrahams <dave@boostpro.com> wrote:
on Mon Apr 27 2009, Steven Ross <spreadsort-AT-gmail.com> wrote:
All 3 have superior worst-case and average-case performance to std::sort, based upon my testing.
How can you determine worst-case performance by testing? Or have I misunderstood your statement?
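To make the iteration-count arithmetic above concrete, here is a quick sanity check; C = 8 is the stated default for these constants, and everything else is illustrative:

    // Sanity check of the worst-case arithmetic above. C = 8 is the
    // stated value for the default constants; the rest is illustrative.
    #include <iostream>

    int main()
    {
        const int bit_range = 32;  // sorting 32-bit integers
        const int C = 8;           // min(MAX_SPLITS, LOG_MEAN_BIN_SIZE + LOG_MIN_SPLIT_COUNT)
        const int slow_iterations = bit_range / C - 1;  // 32/8 - 1 = 3 radix passes
        // Worst case: O(n * (bit_range/C - 1 + C)) = O(n * (3 + 8)) here.
        std::cout << slow_iterations << " slow iterations + "
                  << C << " std::sort iterations\n";
        return 0;
    }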

on Mon Apr 27 2009, Steven Ross <spreadsort-AT-gmail.com> wrote:
True worst-case performance has to be determined by finding the worst possible input, then feeding it to the algorithm. I believe I know what the real worst-case input to this algorithm is, and it will take O(n(bit_range/C - 1 + C)) time, where C is the smaller of MAX_SPLITS and (LOG_MEAN_BIN_SIZE + LOG_MIN_SPLIT_COUNT). Constructing it would take some effort, like the Quicksort worst case. For 32-bit integers and the default constants I provide, C is 8, so it ends up being n * ((32/8 - 1) = 3 slow iterations + 8 std::sort iterations). Based upon the value of LOG_CONST and the distribution, it can also give up and use std::sort, but only after n is cut into small pieces, so that the log(n) factor is small.
What I am referring to is that I ran tests with randomness ranging from 0 to 32 bits, and I compared the worst, average, and best runtimes among those. Graphs are included in the libs/algorithm/sorting directory, linked to from the index.html.
Thanks for clarifying, that helps. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

I have figured out an optimization that in simple testing makes integer_sort faster than std::sort on already-sorted data (and comparable on mostly-sorted data), and improves performance slightly on unsorted data. As the worse best-case performance was the main drawback to the algorithm based upon my prior results, I was wondering if it was appropriate to delay the review until I can get this change fully implemented and tested? If it is proper to delay the review, I apologize for wasting people's time. I thought I was done optimizing it, but apparently not. I think I can finish all this up by Monday.

AMDG Steven Ross wrote:
I have figured out an optimization that in simple testing makes integer_sort faster than std::sort on already-sorted data (and comparable on mostly-sorted data), and improves performance slightly on unsorted data. As the worse best-case performance was the main drawback to the algorithm based upon my prior results, I was wondering if it was appropriate to delay the review until I can get this change fully implemented and tested?
If it is proper to delay the review, I apologize for wasting people's time. I thought I was done optimizing it, but apparently not.
I think I can finish all this up by Monday.
It usually takes a fair amount of time to schedule a review. There shouldn't be any problem. In Christ, Steven Watanabe

I've uploaded a new algorithm_sorting.zip with the performance improvements. integer_sort on Windows is now 34% faster on average than std::sort (and about 58% faster on the worst-tested-case), faster than std::sort for fully-sorted data, but it still takes about 60% longer on mostly-sorted data. It's ready for review. float_sort, string_sort, and key-plus-data sorting still show much larger speed improvements (over 2X) relative to std::sort than plain integer sorting.

As an additional note, I am seeing substantial system-to-system variation in performance results; a friend of mine saw a 2X average integer_sort speed improvement on a Windows system, whereas I only see 34% (but 70% on Mac OS X). It would be helpful to obtain results from many different systems when this is reviewed.

I thought of another optimization for mostly-sorted data. Currently I preprocess data so that mostly-sorted data won't be swapped as much, in a fashion that doesn't substantially impact unsorted data performance (<1% penalty). During that preprocessing, I could make a good estimate of how well-sorted the data is, and if it is highly sorted (say >80% sorted), I could quit out and use std::sort. That way the penalty on mostly-sorted data for using integer_sort on Windows would be much smaller. The bad part is that it would eliminate the substantial speed improvement seen on Mac OS X for mostly-sorted data, replacing it with a small net penalty. I decided not to make this change for that reason, but I wanted to mention it as a possibility (a rough sketch of the sortedness estimate follows below). Comments?
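For anyone who wants to experiment with that idea, here is a minimal sketch of the sortedness estimate; the function name, the adjacent-pair metric, and the exact threshold are all illustrative assumptions, not code from the library:

    #include <cstddef>

    // Illustrative sketch: estimate how sorted a range already is by
    // counting adjacent pairs that are in order. Above the threshold, a
    // caller would hand the range to std::sort instead of integer_sort.
    template <class RandomAccessIter>
    bool mostly_sorted(RandomAccessIter first, RandomAccessIter last,
                       double threshold = 0.8)
    {
        if (last - first < 2)
            return true;
        std::size_t in_order = 0;
        for (RandomAccessIter it = first + 1; it != last; ++it)
            if (!(*it < *(it - 1)))  // pair already in order (sorting on <)
                ++in_order;
        return in_order >= threshold * (last - first - 1);
    }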

Steven Ross wrote:
I thought of another optimization for mostly-sorted data. Currently I preprocess data so that mostly-sorted data won't be swapped as much, in a fashion that doesn't substantially impact unsorted data performance (<1% penalty). During that preprocessing, I could make a good estimate of how well-sorted the data is, and if it is highly sorted (say >80% sorted), I could quit out and use std::sort. That way the penalty on mostly-sorted data for using integer_sort on Windows would be much smaller. The bad part is that it would eliminate the substantial speed improvement seen on Mac OS X for mostly-sorted data, replacing it with a small net penalty. I decided not to make this change for that reason, but I wanted to mention it as a possibility. Comments?
If the mostly-sorted penalty is Windows-specific, you could consider using macros to select which version to use based on the platform in use. --Jeffrey Bosboom

That's a good idea. Considering that I've only run this on 4 different systems, I don't have a good idea of what is platform-specific and what isn't. At this point I have no platform-specific code, so I'll see what the reviewers find, and if it's platform-specific and a serious concern, I'll probably handle it as you suggest (a sketch of that approach follows below). On Tue, May 5, 2009 at 10:05 AM, Jeffrey Bosboom <jbosboom@uci.edu> wrote:
If the mostly-sorted penalty is Windows-specific, you could consider using macros to select which version to use based on the platform in use.
--Jeffrey Bosboom
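If that route were taken, the selection could be as simple as the sketch below; the macro test, the wrapper name, and the fallback policy are illustrative, not part of the library:

    #include <algorithm>

    // Illustrative platform dispatch: prefer std::sort where it is known
    // to win (MSVC/Windows in the measurements above), and the radix
    // hybrid elsewhere. The macro and wrapper name are hypothetical.
    template <class RandomAccessIter>
    void integer_sort_dispatch(RandomAccessIter first, RandomAccessIter last)
    {
    #if defined(_MSC_VER)
        std::sort(first, last);        // fast on mostly-sorted data here
    #else
        integer_sort(first, last);     // hypothetical: the library's sort
    #endif
    }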

For a detailed description of why this has superior worst-case performance, you can read the paper I wrote on the Spreadsort algorithm in 2002: http://portal.acm.org/citation.cfm?id=691537 The original algorithm that paper was based upon did not have LOG_CONST or MAX_SPLITS, which made it less cache-friendly, but the basic concept is the same. On Mon, Apr 27, 2009 at 9:54 AM, David Abrahams <dave@boostpro.com> wrote:
on Mon Apr 27 2009, Steven Ross <spreadsort-AT-gmail.com> wrote:
All 3 have superior worst-case and average-case performance to std::sort, based upon my testing.
How can you determine worst-case performance by testing? Or have I misunderstood your statement?
-- Dave Abrahams BoostPro Computing http://www.boostpro.com

Does this ease your mind? float_sort shouldn't compile if the float type isn't correct, and this is the code I've been using for casting floats to integers:

    BOOST_STATIC_ASSERT(sizeof(Cast_type) == sizeof(Data_type));
    BOOST_STATIC_ASSERT(std::numeric_limits<Data_type>::is_iec559);
    BOOST_STATIC_ASSERT(std::numeric_limits<Cast_type>::is_integer);
    Cast_type result;
    std::memcpy(&result, &(*floatiter), sizeof(Data_type));

(A self-contained version of this cast is sketched below.) Also, the mostly-sorted performance issue is resolved; performance is now comparable on Windows between integer_sort and std::sort on mostly-sorted data, without any platform-specific edits. That should resolve all outstanding concerns I am aware of. On Wed, May 6, 2009 at 1:26 PM, Ross Levine <ross.levine@uky.edu> wrote:
On Mon, Apr 27, 2009 at 10:32 AM, Steven Ross <spreadsort@gmail.com> wrote:
float_sort, which sorts IEEE floating-point numbers as integers
This worries me a little. C++ floating-point format is implementation-defined, and is not necessarily IEEE 754.
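For reference, a self-contained version of that cast might look like the following; only the wrapper function and its name are additions:

    #include <cstring>
    #include <limits>
    #include <boost/static_assert.hpp>

    // Self-contained sketch of the cast shown above; the wrapper
    // function and its name are illustrative.
    template <class Cast_type, class Data_type>
    Cast_type float_as_integer(const Data_type& value)
    {
        BOOST_STATIC_ASSERT(sizeof(Cast_type) == sizeof(Data_type));
        BOOST_STATIC_ASSERT(std::numeric_limits<Data_type>::is_iec559);
        BOOST_STATIC_ASSERT(std::numeric_limits<Cast_type>::is_integer);
        Cast_type result;
        // memcpy sidesteps the aliasing problems of reinterpret_cast and
        // is optimized down to a plain register move by compilers.
        std::memcpy(&result, &value, sizeof(Data_type));
        return result;
    }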

Steven Ross wrote:
Also, the mostly-sorted performance issue is resolved; performance is now comparable on Windows between integer_sort and std::sort on mostly-sorted data, without any platform-specific edits.
I'm curious as to why std::sort was faster for mostly-sorted data, and what you did to your algorithm to match its speed. --Jeffrey Bosboom

std::sort is fast for mostly-sorted data on x86 processors because it handles sorted data in an optimized fashion and minimizes the number of swaps. It iterates forward from the first unprocessed element until it finds one that is out of order, then iterates back from the last unprocessed element until it finds one that is out of order, and then those two are swapped (after which the loop repeats, starting from the next elements, until done). This approach has two benefits:
1) It minimizes the number of swaps, which are slow operations.
2) If the data is mostly (or fully) sorted, the iteration to the first unordered element is extremely fast, as modern x86 processors can handle long, predictable, simple iterations very efficiently.
Hence the speed advantage std::sort has for mostly-sorted data: it does log(n) passes of very fast operations when the data is mostly sorted, while integer_sort does about 2 slow iterations. As the fast operations optimize much better, the log(n) passes can be faster.

To resolve the already-sorted-data performance issue, I took my find_extremes method, which finds the maximum and minimum of the data, enabling more efficient processing, and modified it. Since the maximum of a contiguous list of sorted elements is the last one (assuming sorting on <; for > it's just backwards and handled by the functors; look at the reverse integer sort sample) and the minimum is the first one, I can instead check whether the data is sorted while finding the minimum and maximum, and if the data is sorted, this check is faster. As soon as it hits an out-of-order element, the sorted check stops, and I fall back on the conventional approach for the rest of the data (a rough sketch of this follows below). This approach has negligible impact on performance for unsorted data, but for fully-sorted (or identical) data, I find out the data is fully sorted and quit as soon as this quick check is done, saving lots of time. For mostly-sorted data, some iteration time is saved by iterating up to the first unsorted element before switching to the slower conventional min/max finding. (Yes, I've thought of using a single loop to find the min and a single loop to find the max; it was slower when I last tested it on my old Mac, but might be faster on x86.) That was the first optimization; it improved weighted-average performance on random data by 8%, improved already-sorted performance by roughly an order of magnitude, and cut mostly-sorted runtime by about 40%.

The second optimization was to preprocess the bins to minimize swapping. I keep track of the first unsorted element in a bin, and the change was to increment that position as long as the element in that position belongs where it currently is. This substantially reduces the number of swaps on mostly-sorted data. I only apply this optimization if the bins are at least a modest minimum size, to avoid adding unnecessary overhead to processing unsorted data. After this optimization I'm finding a 15% speed penalty on mostly-sorted data relative to std::sort on average, and an 8% speed improvement on my worst-tested-case with mostly-sorted data. By contrast, I see a 34% speedup on average and a 58% speedup for the worst-tested-case on unsorted data.

An optimization I tried a while ago was to check whether the item I'm thinking of swapping with belongs where it is, and if it doesn't, to check the next element in the bin, iterating until I find an out-of-order element to swap with. As this adds a loop inside the innermost loop in the slowest part of the code, it doesn't optimize well, and I found it hurt performance.
On Wed, May 6, 2009 at 8:24 PM, Jeffrey Bosboom <jbosboom@uci.edu> wrote:
I'm curious as to why std::sort was faster for mostly-sorted data, and what you did to your algorithm to match its speed.
--Jeffrey Bosboom
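A rough sketch of the modified find_extremes described above (simplified to sorting on operator< over a non-empty range; the exact signature is an assumption):

    #include <iterator>

    // Walk the sorted prefix cheaply (its minimum is the first element,
    // its maximum the last element seen), and fall back to the
    // conventional min/max scan only once an out-of-order element
    // appears. Returns true if the whole range was already sorted.
    template <class RandomAccessIter>
    bool find_extremes(RandomAccessIter first, RandomAccessIter last,
        typename std::iterator_traits<RandomAccessIter>::value_type& min_val,
        typename std::iterator_traits<RandomAccessIter>::value_type& max_val)
    {
        min_val = max_val = *first;
        RandomAccessIter it = first + 1;
        for (; it != last && !(*it < max_val); ++it)
            max_val = *it;               // fast path: data still sorted
        if (it == last)
            return true;                 // fully sorted: quit early
        for (; it != last; ++it) {       // slow path: conventional scan
            if (*it < min_val)      min_val = *it;
            else if (max_val < *it) max_val = *it;
        }
        return false;
    }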

Maybe instead of a static assert, do some sort of conditional compilation or template metaprogramming, such that if std::numeric_limits<Data_type>::is_iec559 is false, it just defaults to std::sort? That would be more cross-platform. Also, couldn't you just use reinterpret_cast instead of memcpy? Or is there a problem with that? I must say that this sort of thing could be very useful. On Wed, May 6, 2009 at 10:43 PM, Steven Ross <spreadsort@gmail.com> wrote:
Does this ease your mind? float_sort shouldn't compile if the float type isn't correct, and this is the code I've been using for casting floats to integers:

    BOOST_STATIC_ASSERT(sizeof(Cast_type) == sizeof(Data_type));
    BOOST_STATIC_ASSERT(std::numeric_limits<Data_type>::is_iec559);
    BOOST_STATIC_ASSERT(std::numeric_limits<Cast_type>::is_integer);
    Cast_type result;
    std::memcpy(&result, &(*floatiter), sizeof(Data_type));
Also, the mostly-sorted performance issue is resolved; performance is now comparable on Windows between integer_sort and std::sort on mostly-sorted data, without any platform-specific edits. That should resolve all outstanding concerns I am aware of.
On Wed, May 6, 2009 at 1:26 PM, Ross Levine <ross.levine@uky.edu> wrote:
On Mon, Apr 27, 2009 at 10:32 AM, Steven Ross <spreadsort@gmail.com> wrote:
float_sort, which sorts IEEE floating-point numbers as integers
This worries me a little. C++ floating-point format is implementation-defined, and is not necessarily IEEE 754.

On Fri, May 8, 2009 at 13:33, Ross Levine <ross.levine@uky.edu> wrote:
Also, couldn't you just use reinterpret_cast instead of memcpy? Or is there a problem with that?
See the archives. In short, the reinterpret_cast runs afoul of the aliasing rules, and in practice compilers can optimize the memcpy down to a bitcast anyway (http://llvm.org/demo will demonstrate), so it's no slower than the non-standard way. ~ Scott
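For reference, the two variants side by side (a minimal sketch):

    #include <cstring>
    #include <boost/cstdint.hpp>

    boost::uint32_t bits_via_cast(const float& f)
    {
        // Violates the strict-aliasing rules: the compiler may assume a
        // float* and a uint32_t* never alias, so this can miscompile
        // under optimization.
        return *reinterpret_cast<const boost::uint32_t*>(&f);
    }

    boost::uint32_t bits_via_memcpy(const float& f)
    {
        // Well-defined, and optimizing compilers lower it to the same
        // single register move as the cast above.
        boost::uint32_t result;
        std::memcpy(&result, &f, sizeof(f));
        return result;
    }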

On Fri, May 8, 2009 at 10:33 AM, Ross Levine <ross.levine@uky.edu> wrote:
Maybe instead of a static assert, do some sort of conditional compilation or template metaprogramming, such that if std::numeric_limits<Data_type>::is_iec559 is false, it just defaults to std::sort? That would be more cross-platform.
The user might accidentally pass the wrong argument and think they're using float_sort, when they'd just be using std::sort. I consider that a more common use case than a non-standard float implementation, and even with a non-standard float the user has the option of calling std::sort themselves.
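For concreteness, the compile-time dispatch suggested above might look like the sketch below (names and wiring are illustrative); the objection is that the std::sort branch is silent, so a caller passing the wrong type would never know:

    #include <algorithm>
    #include <iterator>
    #include <limits>
    #include <boost/type_traits/integral_constant.hpp>

    // Hypothetical: stands in for the library's radix-based float path.
    template <class Iter> void float_sort(Iter first, Iter last);

    template <class Iter>
    void sort_dispatch(Iter first, Iter last, const boost::true_type&)
    { float_sort(first, last); }        // IEC 559 floats: radix path

    template <class Iter>
    void sort_dispatch(Iter first, Iter last, const boost::false_type&)
    { std::sort(first, last); }         // silent fallback: the objection

    template <class Iter>
    void float_sort_or_fallback(Iter first, Iter last)
    {
        typedef typename std::iterator_traits<Iter>::value_type T;
        sort_dispatch(first, last,
            boost::integral_constant<bool,
                std::numeric_limits<T>::is_iec559>());
    }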

As a note, the LOG_CONST value doesn't do much with default settings; it currently defaults to 2, but must be at least 4 to have much effect on a 32-bit integer with the other constants set as they are. It's there to constrain worst-case performance on long data types (64-bit and larger); it used to be necessary for 32-bit data types, but I've optimized performance enough that it isn't anymore. I've attempted to create a good set of default constants so that the average user doesn't have to worry about them, but these tuning constants make a substantial impact on integer_sort and float_sort performance, so I'm open to suggestions for improvement.
participants (6)
- David Abrahams
- Jeffrey Bosboom
- Ross Levine
- Scott McMurray
- Steven Ross
- Steven Watanabe