
On Mon, Jul 7, 2008 at 12:40 PM, Simonson, Lucanus J <lucanus.j.simonson@intel.com> wrote:
Phil wrote:
Having said all that, personally I have little interest in an algorithm
that has no benefit for fewer than millions of items. What do other people think about that aspect?
I frequently sort gigabytes, tens of gigabytes, or even hundreds of gigabytes of geometry data as a prerequisite to scanline over the data. However, a faster sort would only benefit me if the speedup were reliably 2X or more. A 30% speedup in sort would disappear with the effect of Amdahl's Law to something under 10%. I get better speedup just by adopting each new version of the compiler. For me, the modest size and tenuous nature (depends upon architecture-specific constants) of the speedup does not justify any effort to integrate and carry the dependency to the library that provides it. I have been following the thread only out of academic interest.
Maybe I should specify a little more about the performance: as the ratio of keysize to n decreases (and as the data size increases), the speedup of Spreadsort vs. std::sort improves. A 30% speedup is roughly what I get at around 10 million random integer keys with no attached data on an OSX system with an old G4 processor. Linux with Intel processors shows a speedup of around 40% with the same data (and Windows around 15%). The speedup becomes larger as n increases, because keysize/n decreases and more radix sorting occurs.

Spreadsort balances O(n*(keysize/constant)) radix-based sorting against O(n*log(n)) comparison-based sorting. What effectively happens as n increases is that the O(n*(keysize/constant)) term dominates, so the per-element runtime of Spreadsort stays roughly constant while that of std::sort grows with log(n). They're roughly at parity when log(n) = 13 (8K), but by the time log(n) = 26 (64M), Spreadsort is roughly 40% faster. Projecting that out, at log(n) = 34 (16G) it should be around 66% faster. Not 2X, but not trivial either.

I fail to see how Amdahl's Law applies to Spreadsort itself, as it's not parallel, but non-sorting tasks in your application obviously won't be sped up by a faster sort.

Spreadsort already calls std::sort automatically if the count is less than a minimum value, which last time I checked was 8192. The idea is to have an integer_sort that calls the better algorithm for the provided data, whichever it is, so that part of what you want is already there (a rough sketch of this dispatch is at the end of this message). To actually make it part of std::sort, I'd need float, integer, and string versions, a check for which of them applies, and a fallback to the current std::sort if none do. That will require significant work.

With regard to running this in parallel, here's how you would do it:
1) Find the maximum and minimum of the list in FindExtremes. This seems easy to run in parallel: you just divide the list up evenly and merge the results at the end (sketch at the end of this message).
2) The only fully serial part is finding the size of the bins to create in the first iteration. This could be made parallel by allocating external memory for bins, but at the cost of locking code or of merging per-processor bins, in addition to the memory allocation, which probably isn't worth it with a small number of processors. If you did insert into bins at this point, the algorithm would be fully parallel and could skip the next two steps.
3) Allocate the bins (there are normally n/512 bins in the first iteration, so this is fast even if serial).
4) In the swap loop, as soon as a bin is filled, fire it off on a separate thread. Since the data is cut into 512-or-so pieces in the first iteration, each successive iteration should then be running efficiently in parallel. How many swaps are required to fill the first bin is not predictable, so this step is not fully parallel.

As to std::sort being a straw man, I'd be happy to compare against a faster general-case serial sorting algorithm, if such an algorithm exists. Improvements in serial sorting algorithms should mostly carry over to parallel algorithms, and nothing stops you from running Spreadsort in parallel. I'd be happy to look into developing parallel versions once integer_sort, float_sort, and string_sort are in Boost, and hopefully by then I'll have enough money to buy a test machine with enough cores to be worth testing on.

In reply to David Abrahams: sorting and searching are analogs of each other; radix sorting is like a trie structure, and many comparison-based algorithms are like a binary tree, so the two are related.
You can even use the Spreadsort bin structure (if you don't delete it) to do high-speed lookups into the array; in the past I found this to be much faster than binary search (a rough illustration is at the end of this message). The Judy Array data structure is based upon a similar idea.

If you have a generic Quicksort or some other fast sort implementation that doesn't have O(n*n) worst-case behavior and that can be used on any forward iterator, I'd be happy to call that on non-random-access iterators inside integer_sort, thus allowing integer_sort to work on any forward iterator.

In reply to Phil: thank you for your detailed suggestions. I will implement those that make sense (probably most of them) and reply in detail later. Yes, a pure radix_sort may be useful, though you can always write a comparison function if you can describe a radix sort (I don't see how you can sort anything where you don't know whether a < b; I believe that's part of the definition of sorting in Knuth's Sorting and Searching). An LSB external radix_sort is both fast (for small keys and large n) and stable, so it could clearly be useful to generalize; the tricky part is how to generalize it, since such sorts normally rely on bit masks (a simple example of the usual bit-mask form is at the end of this message). Spreadsort is a hybrid that uses an MSB radix sort and can identify for itself whether the distribution favors pure radix sorting or comparison-based sorting, so I don't see the point of writing a separate MSB radix sort.
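
Below are the rough sketches I referred to above. None of this is tested or exact; it's only meant to make the ideas concrete.

First, the dispatch inside integer_sort: below the cutoff it just calls std::sort, above it the Spreadsort pass runs. The constant and the helper name (spread_sort_core) are placeholders for this sketch, not the actual internals:

#include <algorithm>

// Placeholder cutoff for the sketch; the real constant was ~8192 last I checked.
enum { min_spreadsort_count = 8192 };

// Placeholder for the actual radix/hybrid pass; the body is omitted here and
// just falls back to std::sort so the example compiles.
template <class RandomAccessIter>
void spread_sort_core(RandomAccessIter first, RandomAccessIter last)
{
  std::sort(first, last);
}

template <class RandomAccessIter>
void integer_sort(RandomAccessIter first, RandomAccessIter last)
{
  // Small ranges: introsort wins, so just use std::sort.
  if (last - first < min_spreadsort_count)
    std::sort(first, last);
  else
    spread_sort_core(first, last);
}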
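
Second, step 1 of the parallel plan, using Boost.Thread: split the range into even chunks, find each chunk's min/max on its own thread, and merge the per-chunk results serially. This only illustrates the decomposition; the worker and function names are made up:

#include <boost/thread/thread.hpp>
#include <algorithm>
#include <vector>
#include <cstddef>

// Worker: find the min and max element of one chunk.
template <class Iter>
struct extremes_worker {
  Iter first, last;
  Iter *min_out, *max_out;
  extremes_worker(Iter f, Iter l, Iter *mn, Iter *mx)
    : first(f), last(l), min_out(mn), max_out(mx) {}
  void operator()() {
    *min_out = std::min_element(first, last);
    *max_out = std::max_element(first, last);
  }
};

// Divide [first, last) evenly among num_threads workers, then merge the
// per-thread results serially; assumes last - first >= num_threads > 0.
template <class RandomAccessIter>
void parallel_find_extremes(RandomAccessIter first, RandomAccessIter last,
                            unsigned num_threads,
                            RandomAccessIter &min_result,
                            RandomAccessIter &max_result)
{
  std::ptrdiff_t chunk = (last - first) / num_threads;
  std::vector<RandomAccessIter> mins(num_threads), maxes(num_threads);
  boost::thread_group workers;
  for (unsigned i = 0; i < num_threads; ++i) {
    RandomAccessIter begin = first + i * chunk;
    RandomAccessIter end = (i + 1 == num_threads) ? last : begin + chunk;
    workers.create_thread(
      extremes_worker<RandomAccessIter>(begin, end, &mins[i], &maxes[i]));
  }
  workers.join_all();
  // Serial merge of the per-thread results.
  min_result = mins[0];
  max_result = maxes[0];
  for (unsigned i = 1; i < num_threads; ++i) {
    if (*mins[i] < *min_result) min_result = mins[i];
    if (*maxes[i] > *max_result) max_result = maxes[i];
  }
}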
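
Third, the bin-based lookup I mentioned to David: after the first pass, each bin covers a contiguous slice of the key range, so a lookup can jump straight to the right bin from the key's top bits and search only within it. The structure below is invented for the example and is not the actual Spreadsort bin layout:

#include <algorithm>
#include <vector>

// Hypothetical bin record: the slice of the sorted array holding all keys
// whose top bits map to this bin.
struct bin_range {
  const int *begin;
  const int *end;
};

// Hypothetical lookup: pick the bin from the key's top bits, then do a
// small binary search within that bin only.
inline bool bin_lookup(int key, int data_min, unsigned log_bin_size,
                       const std::vector<bin_range> &bins)
{
  std::size_t index = (unsigned)(key - data_min) >> log_bin_size;
  if (index >= bins.size())
    return false;  // key is outside the range covered by the bins
  return std::binary_search(bins[index].begin, bins[index].end, key);
}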
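
Last, the bit-mask style of LSB radix sort I was referring to, in its classic byte-at-a-time counting-sort form for unsigned ints. The hard-coded mask and shift are exactly what makes it awkward to generalize:

#include <vector>
#include <cstddef>

// Classic LSB radix sort on unsigned ints, one byte per pass, using a
// counting sort for each pass.  Stable and O(n * (keysize/8)), but the mask
// and shift hard-code the key representation.
inline void lsb_radix_sort(std::vector<unsigned> &data)
{
  std::vector<unsigned> buffer(data.size());
  for (unsigned shift = 0; shift < 32; shift += 8) {
    std::size_t count[256] = {0};
    for (std::size_t i = 0; i < data.size(); ++i)
      ++count[(data[i] >> shift) & 0xFF];
    std::size_t offset = 0;
    for (unsigned b = 0; b < 256; ++b) {  // prefix sums -> bucket offsets
      std::size_t c = count[b];
      count[b] = offset;
      offset += c;
    }
    for (std::size_t i = 0; i < data.size(); ++i)
      buffer[count[(data[i] >> shift) & 0xFF]++] = data[i];
    data.swap(buffer);
  }
}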