Vladimir,
It's interesting, and would probably make a good addition to the documentation. Though bzip2 appears to be phased out these days, replaced by xz, which does not use the Burrows–Wheeler transform.
Will do (in the development branch).
One quote from the article you cite: "Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases". Merge sort is fundamentally less efficient than a smart hybrid-radix approach when sorting long keys on a CPU, because it has to compare the entire length of each key, up to the point of difference, multiple times. The string_sort in this library passes through each character in the key only once (or twice, depending on how you count) during the radix-sorting portion; then, once the problem has been cut down to less than a specific constant size, the final comparison-sorting stage compares only what hasn't already been passed through.
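To make the long-key point concrete, here is a minimal sketch; the header and namespace (boost::sort::spreadsort) are assumptions based on the library's current distribution and may differ in the reviewed snapshot:

    // Keys sharing a 64-character prefix: a comparison sort re-reads that
    // prefix on every comparison, while an MSD radix pass visits each
    // character position once across the whole range.
    #include <boost/sort/spreadsort/string_sort.hpp>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> keys;
        const std::string prefix(64, 'x');
        for (int i = 0; i < 100000; ++i)
            keys.push_back(prefix + std::to_string(i));

        boost::sort::spreadsort::string_sort(keys.begin(), keys.end());
        return 0;
    }

std::sort on the same data would have to walk the 64 shared characters in nearly every one of its roughly n log n comparisons.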
The problem is that Boost does not seem to me to be a forum for discussing algorithm implementation. You propose a library that purports to do efficient sorting, based on a 2002 paper. Were there no papers with better algorithms in the subsequent 12 years?
I did an algorithm search in 2008 when preparing the Boost submission. ALR had come out, and it is fairly similar to my paper. I have not seen much attention paid in academia to hybrid radix sorting.
Do you happen to have a copy of the full article? I'm happy to apply any helpful concept to improve the library.
The "full text" link at the top of that page works for me.
I hit a paywall at home, but accessed it at work. It is a comparison of parallel LSD radix sort versus merge sort, and the conclusion is that parallel LSD radix sort is slower with long keys. This is not surprising: MSD hybrid radix sort (string_sort) does much better than merge sort or LSD radix sort with long keys. This library does not get into parallel implementation issues; it can be used as a subsort inside a parallel merge sort, or you can split up the problem by MSD to parallelize it and use this library to subsort.
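A hypothetical sketch of that MSD split (none of this is the library's API; std::sort stands in for the subsort, and one thread per bucket is deliberately crude):

    #include <algorithm>
    #include <array>
    #include <string>
    #include <thread>
    #include <vector>

    // Partition by the most significant byte, subsort each bucket in its
    // own thread, then concatenate the buckets in byte order.
    void parallel_msd_sort(std::vector<std::string>& data) {
        std::array<std::vector<std::string>, 256> buckets;
        for (auto& s : data)  // empty strings share bucket 0 with '\0'-led keys
            buckets[s.empty() ? 0 : static_cast<unsigned char>(s[0])]
                .push_back(std::move(s));

        std::vector<std::thread> workers;
        for (auto& b : buckets)
            if (!b.empty())  // the library's string_sort could replace std::sort here
                workers.emplace_back([&b] { std::sort(b.begin(), b.end()); });
        for (auto& t : workers)
            t.join();

        data.clear();
        for (auto& b : buckets)
            for (auto& s : b)
                data.push_back(std::move(s));
    }

A real driver would use a thread pool and balance bucket sizes, but the structure is the same: the MSD pass makes the buckets independent, so any subsort can run on them in parallel.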
With regard to classic algorithms: for integer_sort, this library provides an implementation similar to Adaptive Left Radix (which came after my paper but was more widely available), with the addition of better worst-case performance. For string_sort, it provides an adaptation of American Flag Sort (a classic algorithm) to C++ strings, again with better worst-case performance guarantees.
I'm worried about "similar" and "adaptation" above. If I see that a function whatever_sort implements an algorithm from paper N with 200 citations, then I can assume enough people have reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect.
Since you've read the overview, how about just looking through the code to see whether it works as described, or testing it? Can you ever be sure a concrete implementation correctly applies an abstract concept without looking at the implementation? These implementations are fast for the same reason ALR and AFS are fast. They have superior worst-case performance because they fall back to std::sort at a size that optimizes worst-case performance. The only other difference is that these algorithms scan each iteration to see whether they can shrink the range, which improves performance on non-worst-case distributions with little impact on worst-case distributions.
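Here is a minimal sketch of those two mechanisms, not the library's actual code; the cutoff constant and all names are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    void msd_sort(std::vector<std::string>& v, std::size_t pos = 0) {
        const std::size_t kCutoff = 256;  // arbitrary; the library tunes this
        if (v.size() < kCutoff) {
            // Fallback: strings here already agree on the first 'pos'
            // characters, so comparisons skip what radix passes covered.
            std::sort(v.begin(), v.end(),
                      [pos](const std::string& a, const std::string& b) {
                          return a.compare(pos, std::string::npos,
                                           b, pos, std::string::npos) < 0;
                      });
            return;
        }
        // Strings ending at 'pos' are mutually equal within this bucket
        // and sort before anything longer.
        auto mid = std::partition(v.begin(), v.end(),
            [pos](const std::string& s) { return s.size() <= pos; });
        if (mid == v.end())
            return;
        // The range-shrinking scan: find the min and max byte actually
        // present so empty edge buckets are never created.
        unsigned char lo = 255, hi = 0;
        for (auto it = mid; it != v.end(); ++it) {
            unsigned char c = static_cast<unsigned char>((*it)[pos]);
            if (c < lo) lo = c;
            if (c > hi) hi = c;
        }
        // Bucket on the current byte, then recurse one character deeper.
        std::vector<std::vector<std::string>> buckets(hi - lo + 1);
        for (auto it = mid; it != v.end(); ++it)
            buckets[static_cast<unsigned char>((*it)[pos]) - lo]
                .push_back(std::move(*it));
        auto out = mid;
        for (auto& b : buckets) {
            msd_sort(b, pos + 1);
            for (auto& s : b)
                *out++ = std::move(s);
        }
    }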