Edward Diener wrote:
> The formal review of the Sort library by Steven Ross starts today,
> November 10, and is scheduled to continue through November 19th.
I studied this submission back in 2008 when Steven first presented
it; those exchanges can be found here:
http://thread.gmane.org/gmane.comp.lib.boost.devel/176781
There is much interesting discussion there, and I suggest that the
review manager have a good look at it, especially if the volume of
actual reviews this time turns out to be low.
I'm not going to attempt a full review at this time, but some
assorted comments follow:
Should Boost accept novel algorithms?
-------------------------------------
Vladimir Prus has suggested that Boost should not accept submissions
of new algorithms such as this one because we're not qualified to
review them. I disagree: personally, I feel more comfortable
reviewing a sorting algorithm than trying to understand things like
atomics or the dark corners of C++ TMP. There have been plenty of
other libraries that have been accepted despite containing novel
algorithms (e.g. the two geometry libraries).
Having said that, I would have been happier if the submission had
simply been a collection of sort algorithms copied from Knuth, or
if Steven's paper had hundreds of citations!
Is a "faster" serial sort algorithm useful?
-------------------------------------------
In 2008, Luke Simonson said:

    anyone interested in runtime of an application that
    spends a lot of time sorting large data volumes is
    going to do it on an up-to-date machine, which means
    4, 8, 16 or 32 cores

    I think std::sort is a straw man to compare yourself
    to, not because it isn't good at what it does, but
    because the target use case in question of very large
    data volumes is better suited to a parallel algorithm.
At that time I didn't entirely agree, as I had some embedded applications
that could benefit from faster sorting of perhaps tens of thousands of
items. But in the intervening five years, even phones have gained
multiple CPUs. Today, if I needed faster sorting, my first thought would
be to parallelise it.
Code Quality
------------
Back in 2008, the code had issues such as calls to malloc() and free()
and a lack of exception safety. No doubt it has improved since then,
but keen reviewers could check for any remaining issues.
Skipping common prefixes
------------------------
One undeniable complexity advantage of this algorithm compared to
std::sort<string> is that it doesn't need to re-compare common
string prefixes once it knows that they are equal. Intuitively,
one would expect a substantial speedup from this change if the
input contains many strings with common prefixes. But intuition
is not always correct. I did some experiments using the pathnames
of all the files on my computer as the test data; these certainly
have many common prefixes, e.g. /usr/local/include/boost/. I
wrote a modified quicksort that skips common prefixes (see
http://svn.chezphil.org/libpbe/trunk/include/string_qsort.hh )
and concluded that the intuition is actually wrong: bytes can be
compared 8 at a time on a 64-bit system, so the cost of repeatedly
re-comparing prefixes like /usr/local/include/boost/ must be
small. So the intuition that Steven's algorithm would be useful
in applications where strings have long common prefixes should
not be trusted without first measuring it.
Does anyone sort a vector<int> ?
--------------------------------
I don't think I've ever needed to sort a vector<int>; there is normally
some associated data, e.g. I'm sorting a vector<struct> or vector