
On Mon, Jul 7, 2008 at 12:40 PM, Simonson, Lucanus J <lucanus.j.simonson@intel.com> wrote:
Phil wrote:
Having said all that, personally I have little interest in an algorithm
that has no benefit for fewer than millions of items. What do other people think about that aspect?
I frequently sort gigabytes, tens of gigabytes, or even hundreds of gigabytes of geometry data as a prerequisite to scanline over the data. However, a faster sort would only benefit me if the speedup were reliably 2X or more. A 30% speedup in sort would disappear with the effect of Amdahl's Law to something under 10%. I get better speedup just by adopting each new version of the compiler. For me, the modest size and tenuous nature (depends upon architecture-specific constants) of the speedup does not justify any effort to integrate and carry the dependency to the library that provides it. I have been following the thread only out of academic interest.
Maybe I should specify a little more about the performance: as the ratio of keysize to n decreases (and as the data size increases), the speedup of Spreadsort vs. std::sort improves. A 30% speedup is roughly what I get at around 10 million random integer keys with no attached data on an OSX system with an old G4 processor. Linux with Intel processors shows a speedup of around 40% with the same data (and Windows around 15%). The speedup becomes larger as n increases, because keysize/n decreases and more radix sorting occurs.

Spreadsort balances O(n*(keysize/constant)) radix-based sorting against O(n*log(n)) comparison-based sorting. What effectively happens as n increases is that the O(n*(keysize/constant)) term dominates, so the per-element runtime of Spreadsort stays roughly constant while that of std::sort grows with log(n). They're roughly at parity when log(n) = 13 (8K), but by the time log(n) = 26 (64M), Spreadsort is roughly 40% faster. Projecting that out, at log(n) = 34 (16G) it should be around 66% faster. Not 2X, but not trivial either.

I fail to see how Amdahl's Law applies to Spreadsort itself, as it's not parallel, but non-sorting tasks in your application obviously won't be sped up by a faster sort.

Spreadsort already calls std::sort automatically if the count is less than a minimum value, which last time I checked was 8192. The idea is to have an integer_sort that calls the better algorithm for the provided data, whichever it is, so that part of what you want is already there (a rough sketch of this dispatch is at the end of this message). To actually make it part of std::sort, I'd need float, integer, and string versions, a check for which of them applies, and a fallback to the current std::sort if none do. That will require significant work.

With regard to running this in parallel, here's how you would do it:
1) Find the maximum and minimum of the list in FindExtremes. This seems easy to run in parallel: you just divide the list up evenly and merge the results at the end (sketch at the end of this message).
2) The only fully serial part is finding the size of the bins to create in the first iteration. This could be made parallel by allocating external memory for bins, but at the cost of locking code or of merging per-processor bins, in addition to the memory allocation, which probably isn't worth it with a small number of processors. If you did insert into bins at this point, the algorithm would be fully parallel and could skip the next two steps.
3) Allocate the bins (there are normally n/512 bins in the first iteration, so this is fast even if serial).
4) In the swap loop, as soon as a bin is filled, fire it off on a separate thread. Since the data is cut into 512-or-so pieces in the first iteration, each successive iteration should then be running efficiently in parallel. How many swaps are required to fill the first bin is not predictable, so this step is not fully parallel.

As to std::sort being a straw man, I'd be happy to compare against a faster general-case serial sorting algorithm, if such an algorithm exists. Improvements in serial sorting algorithms should mostly carry over to parallel algorithms, and nothing stops you from running Spreadsort in parallel. I'd be happy to look into developing parallel versions once integer_sort, float_sort, and string_sort are in Boost, and hopefully by then I'll have enough money to buy a test machine with enough cores to be worth testing on.

In reply to David Abrahams: sorting and searching are analogs of each other; radix sorting is like a trie structure, and many comparison-based algorithms are like a binary tree, so the two are related.
You can even use the Spreadsort bin structure (if you don't delete it) to do high-speed lookups into the array; in the past I found this to be much faster than binary search (a rough illustration is at the end of this message). The Judy Array data structure is based upon a similar idea.

If you have a generic Quicksort or some other fast sort implementation that doesn't have O(n*n) worst-case behavior and that can be used on any forward iterator, I'd be happy to call that on non-random-access iterators inside integer_sort, thus allowing integer_sort to work on any forward iterator.

In reply to Phil: thank you for your detailed suggestions. I will implement those that make sense (probably most of them) and reply in detail later. Yes, a pure radix_sort may be useful, though you can always write a comparison function if you can describe a radix sort (I don't see how you can sort anything where you don't know whether a < b; I believe that's part of the definition of sorting in Knuth's Sorting and Searching). An LSB external radix_sort is both fast (for small keys and large n) and stable, so it could clearly be useful to generalize; the tricky part is how to generalize it, since such sorts normally rely on bit masks (a simple example of the usual bit-mask form is at the end of this message). Spreadsort is a hybrid that uses an MSB radix sort and can identify for itself whether the distribution favors pure radix sorting or comparison-based sorting, so I don't see the point of writing a separate MSB radix sort.
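
Below are the rough sketches I referred to above. None of this is tested or exact; it's only meant to make the ideas concrete.

First, the dispatch inside integer_sort: below the cutoff it just calls std::sort, above it the Spreadsort pass runs. The constant and the helper name (spread_sort_core) are placeholders for this sketch, not the actual internals:

#include <algorithm>

// Placeholder cutoff for the sketch; the real constant was ~8192 last I checked.
enum { min_spreadsort_count = 8192 };

// Placeholder for the actual radix/hybrid pass; the body is omitted here and
// just falls back to std::sort so the example compiles.
template <class RandomAccessIter>
void spread_sort_core(RandomAccessIter first, RandomAccessIter last)
{
  std::sort(first, last);
}

template <class RandomAccessIter>
void integer_sort(RandomAccessIter first, RandomAccessIter last)
{
  // Small ranges: introsort wins, so just use std::sort.
  if (last - first < min_spreadsort_count)
    std::sort(first, last);
  else
    spread_sort_core(first, last);
}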
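
Second, step 1 of the parallel plan, using Boost.Thread: split the range into even chunks, find each chunk's min/max on its own thread, and merge the per-chunk results serially. This only illustrates the decomposition; the worker and function names are made up:

#include <boost/thread/thread.hpp>
#include <algorithm>
#include <vector>
#include <cstddef>

// Worker: find the min and max element of one chunk.
template <class Iter>
struct extremes_worker {
  Iter first, last;
  Iter *min_out, *max_out;
  extremes_worker(Iter f, Iter l, Iter *mn, Iter *mx)
    : first(f), last(l), min_out(mn), max_out(mx) {}
  void operator()() {
    *min_out = std::min_element(first, last);
    *max_out = std::max_element(first, last);
  }
};

// Divide [first, last) evenly among num_threads workers, then merge the
// per-thread results serially; assumes last - first >= num_threads > 0.
template <class RandomAccessIter>
void parallel_find_extremes(RandomAccessIter first, RandomAccessIter last,
                            unsigned num_threads,
                            RandomAccessIter &min_result,
                            RandomAccessIter &max_result)
{
  std::ptrdiff_t chunk = (last - first) / num_threads;
  std::vector<RandomAccessIter> mins(num_threads), maxes(num_threads);
  boost::thread_group workers;
  for (unsigned i = 0; i < num_threads; ++i) {
    RandomAccessIter begin = first + i * chunk;
    RandomAccessIter end = (i + 1 == num_threads) ? last : begin + chunk;
    workers.create_thread(
      extremes_worker<RandomAccessIter>(begin, end, &mins[i], &maxes[i]));
  }
  workers.join_all();
  // Serial merge of the per-thread results.
  min_result = mins[0];
  max_result = maxes[0];
  for (unsigned i = 1; i < num_threads; ++i) {
    if (*mins[i] < *min_result) min_result = mins[i];
    if (*maxes[i] > *max_result) max_result = maxes[i];
  }
}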
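
Third, the bin-based lookup I mentioned to David: after the first pass, each bin covers a contiguous slice of the key range, so a lookup can jump straight to the right bin from the key's top bits and search only within it. The structure below is invented for the example and is not the actual Spreadsort bin layout:

#include <algorithm>
#include <vector>

// Hypothetical bin record: the slice of the sorted array holding all keys
// whose top bits map to this bin.
struct bin_range {
  const int *begin;
  const int *end;
};

// Hypothetical lookup: pick the bin from the key's top bits, then do a
// small binary search within that bin only.
inline bool bin_lookup(int key, int data_min, unsigned log_bin_size,
                       const std::vector<bin_range> &bins)
{
  std::size_t index = (unsigned)(key - data_min) >> log_bin_size;
  if (index >= bins.size())
    return false;  // key is outside the range covered by the bins
  return std::binary_search(bins[index].begin, bins[index].end, key);
}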
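
Last, the bit-mask style of LSB radix sort I was referring to, in its classic byte-at-a-time counting-sort form for unsigned ints. The hard-coded mask and shift are exactly what makes it awkward to generalize:

#include <vector>
#include <cstddef>

// Classic LSB radix sort on unsigned ints, one byte per pass, using a
// counting sort for each pass.  Stable and O(n * (keysize/8)), but the mask
// and shift hard-code the key representation.
inline void lsb_radix_sort(std::vector<unsigned> &data)
{
  std::vector<unsigned> buffer(data.size());
  for (unsigned shift = 0; shift < 32; shift += 8) {
    std::size_t count[256] = {0};
    for (std::size_t i = 0; i < data.size(); ++i)
      ++count[(data[i] >> shift) & 0xFF];
    std::size_t offset = 0;
    for (unsigned b = 0; b < 256; ++b) {  // prefix sums -> bucket offsets
      std::size_t c = count[b];
      count[b] = offset;
      offset += c;
    }
    for (std::size_t i = 0; i < data.size(); ++i)
      buffer[count[(data[i] >> shift) & 0xFF]++] = data[i];
    data.swap(buffer);
  }
}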