
Edward Diener wrote:
> The formal review of the Sort library by Steven Ross starts today,
> November 10, and is scheduled to continue through November 19th.
I studied this submission back in 2008 when Steven first presented it;
those exchanges can be found here:
http://thread.gmane.org/gmane.comp.lib.boost.devel/176781
There is much interesting discussion there and I suggest that the review
manager have a good look at it, especially if the volume of actual
reviews this time turns out to be low.

I'm not going to attempt a full review at this time, but some assorted
comments follow:

Should Boost accept novel algorithms?
-------------------------------------

Vladimir Prus has suggested that Boost should not accept submissions of
new algorithms such as this one because we're not qualified to review
them.  I disagree: personally, I feel more comfortable reviewing a
sorting algorithm than trying to understand things like atomics or the
dark corners of C++ TMP.  Plenty of other libraries have been accepted
despite containing novel algorithms (e.g. the two geometry libraries).
Having said that, I would have been happier if the submission had simply
been a collection of sort algorithms copied from Knuth, or if Steven's
paper had hundreds of citations!

Is a "faster" serial sort algorithm useful?
-------------------------------------------

In 2008, Luke Simonson said:

> anyone interested in runtime of an application that spends a lot of
> time sorting large data volumes is going to do it on an up-to-date
> machine, which means 4, 8, 16 or 32 cores.  I think std::sort is a
> straw man to compare yourself to, not because it isn't good at what
> it does, but because the target use case in question of very large
> data volumes is better suited to a parallel algorithm.

At the time I didn't entirely agree, as I had some embedded applications
that could benefit from faster sorting of perhaps tens of thousands of
items.  But in the intervening six years, even phones have gained
multiple CPUs.  Today, if I needed faster sorting my first thought would
be to parallelise it.  (A minimal sketch of what I mean appears below,
after the section on common prefixes.)

Code Quality
------------

Back in 2008, the code had issues like calls to malloc() and free() and
a lack of exception safety.  No doubt it has improved, but perhaps any
keen reviewers could look for remaining issues.

Skipping common prefixes
------------------------

One undeniable complexity advantage of this algorithm compared to
std::sort on strings is that it doesn't need to re-compare common string
prefixes once it knows that they are equal.  Intuitively, one would
expect a substantial speedup from this if the input contains many
strings with common prefixes.  But intuition is not always correct.  I
did some experiments using the pathnames of all the files on my computer
as the test data; these certainly have many common prefixes,
e.g. /usr/local/include/boost/.  I wrote a modified quicksort that skips
common prefixes (see
http://svn.chezphil.org/libpbe/trunk/include/string_qsort.hh ) and
concluded that the intuition is actually wrong: bytes can be compared
8-at-a-time on a 64-bit system, so the cost of repeatedly re-comparing
strings like /usr/local/include/boost/ must be small.  So the intuition
that Steven's algorithm would be useful in applications where strings
have long common prefixes should not be followed without first
measuring it.
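To make the prefix-skipping idea concrete, here is a minimal sketch of
my own (an illustration, not Steven's code, and much simpler than the
quicksort linked above): within a partition whose strings are known to
share a prefix of length known, a three-way comparison can start at
that offset, and memcmp provides the word-at-a-time comparison just
mentioned.

    #include <algorithm>
    #include <cstring>
    #include <string>

    // Three-way comparison of a and b that skips the first 'known'
    // bytes, i.e. the prefix already known to be shared by every
    // string in the current quicksort partition.  std::memcmp lets
    // the library compare a machine word at a time -- the
    // "8-at-a-time" effect noted above.
    int compare_skipping_prefix(const std::string& a,
                                const std::string& b,
                                std::size_t known)
    {
        // Both strings have length >= known, by the partition invariant.
        std::size_t n = std::min(a.size(), b.size());
        int r = std::memcmp(a.data() + known, b.data() + known, n - known);
        if (r != 0)
            return r;
        // Equal up to the shorter length: the shorter string sorts first.
        return a.size() < b.size() ? -1 : a.size() > b.size() ? 1 : 0;
    }

Here 'known' is whatever prefix length the enclosing quicksort has
established for the current partition; the measurements above suggest
the work it saves is a small fraction of the total.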
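Returning to the parallelism point above: as a rough illustration of
why "just parallelise it" is the obvious first step, here is a minimal
two-core sketch using only C++11 facilities.  This is my own sketch,
not anything from the submission; a production version would recurse
on the halves and fall back to a serial sort below a size threshold.

    #include <algorithm>
    #include <future>
    #include <vector>

    // Sort the two halves of v on separate cores, then merge in place.
    template <class T>
    void parallel_sort(std::vector<T>& v)
    {
        auto mid = v.begin() + v.size() / 2;
        auto left = std::async(std::launch::async,
                               [&v, mid] { std::sort(v.begin(), mid); });
        std::sort(mid, v.end());  // sort the right half on this thread
        left.wait();              // both halves are sorted from here on
        std::inplace_merge(v.begin(), mid, v.end());
    }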
Does anyone sort a vector<int>?
-------------------------------

I don't think I've ever needed to sort a vector<int>; there is normally
some associated data, e.g. I'm sorting a vector<struct> or
vector<struct*> where one field of the struct is an int key.  Or the
key has multiple fields, like an (x,y) pair, or there is some extra
feature of the comparison, such as being case-insensitive.  The
proposed API does provide a reasonable way to sort these more complex
containers, i.e. by providing additional functors that (for
string_sort) return the nth character of the key string (see the sketch
in the postscript below).  But what we don't know is whether its
performance benefits relative to introsort also apply in these cases:
as far as I've seen, the performance results are all for simple
vector<int>-like cases.  One could easily imagine that the time taken
for copying would dominate, diluting the algorithmic benefits.

Should this library be accepted?
--------------------------------

Note that I've not done a detailed review of the code at this time, but
I did spend quite a lot of time looking at it a few years ago.  My
opinion is yes, the library should be accepted, because:

- Steven is still here, six years after first proposing his library!
- There probably are some applications that would benefit from this
  algorithm, which does at least seem to work correctly.

My only condition would be that it should be called "Spreadsort",
unless it is (imminently) going to gain other algorithms.

Regards, Phil.
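P.S.  For anyone who wants to experiment with the vector<struct> case
mentioned above, this is roughly what the functor-based string_sort
call looks like.  I am going from the submission's documentation here;
the header path and namespace below are taken from the current sources
and should be checked against the version under review.

    #include <string>
    #include <vector>
    // Header path and namespace as in the current sources; may differ
    // in the version under review.
    #include <boost/sort/spreadsort/string_sort.hpp>

    struct record {
        std::string key;
        int payload;
    };

    // The functor overload of string_sort needs: the nth byte of the
    // key, the key's length, and an ordinary less-than comparison.
    struct get_char {
        unsigned char operator()(const record& r, std::size_t n) const {
            return static_cast<unsigned char>(r.key[n]);
        }
    };
    struct get_length {
        std::size_t operator()(const record& r) const { return r.key.size(); }
    };
    struct less_key {
        bool operator()(const record& a, const record& b) const {
            return a.key < b.key;
        }
    };

    void sort_records(std::vector<record>& v)
    {
        boost::sort::spreadsort::string_sort(
            v.begin(), v.end(), get_char(), get_length(), less_key());
    }

It is exactly this kind of case, where each element carries a payload
that must move with the key, that I would like to see in the
performance results.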