
On Tue, Jun 2, 2009 at 11:02 PM, Jonathan Franklin <franklin.jonathan@gmail.com> wrote:
Radix sorting is indeed appropriate, IMO. However, there are many "proven" radix sort algorithms, and (probably) many more "unproven" algorithms. Seems to me that boost is a set of "generally useful" libraries, and not a "proving ground" for new algorithms.
The point in question is then: At what point is an algorithm adequately "proven" for inclusion in a boost library?
There are multiple papers in the literature about Adaptive Left Radix, which is fast on random data (and, like Spreadsort, it wasn't published yesterday; it's been over 6 years). ALR has inferior worst-case performance because it uses insertion_sort as its fallback sort, and worst-case performance is the top reason to use Spreadsort. Is switching from insertion_sort to std::sort such a big, complicated change that it can't be accepted? The other main change is a rough estimate of worst-case performance at runtime, with a fallback to std::sort when std::sort should perform better. With default settings this doesn't do much on 32-bit integers, but on systems where std::sort is better optimized, with larger data types (64-bit), or with a smaller cache, it can improve performance. Those are the biggest changes in this library versus the other modern radix-sorting algorithms.

I've already described the changes in string_sort versus American Flag Sort, which are similar, coming down to two core changes: replacing insertion_sort with std::sort, and the identical-substring optimization. I'm not strongly attached to the substring optimization, though it seems to work well and handles a common near-worst-case performance issue better. The change to not compare the already-sorted portions of the strings is a simple and effective performance tweak.

Going back to a higher level: it's easy to test a sorting algorithm for correctness, and radix sorting isn't that hard to understand; the general approach is not novel, just some optimizations and worst-case handling. It would get expensive to publish a paper for every little tweak I've put in a sorting algorithm, so depending on publications for every improvement is a great way to slow down progress, especially if you demand that people cite said publications.

Also, Knuth has a good book on sorting, with an excellent analysis of pure comparison-based algorithms. People seem to get the impression from his book that comparison-based sorting is the only general way to sort, but he does mention hybrid algorithms, including Markku Tamminen: Two Levels are as Good as Any, J. Algorithms 6: 138-144 (1985), and the O(n) performance on continuous integrable functions described on page 177 of The Art of Computer Programming, Volume 3. The current Spreadsort uses smaller chunks to be cache-friendly, because Tamminen's approach is too slow when operating on n much larger than the L1 cache size, and also because it cuts memory usage (as described in the paper on ALR), but that just multiplies the number of iterations for continuous integrable functions by log(n)/log(estimated L1 cache size). Spreadsort's other main changes are to use std::sort as a fallback, because not all distributions are continuous and integrable, and the related fallback check to minimize worst-case performance, which takes into account that std::sort can be faster on small lists when the distribution range is large. (Tamminen's approach and ALR both have poor absolute worst-case performance.)

In closing, this has:
insertion_sort -> std::sort
a worst-case fallback check (integer_sort/float_sort), falling back to std::sort
an optimization for long substrings (string_sort)
float to integer cast (float_sort)
assorted performance tweaks

Are those changes really so dangerous?
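To be concrete about the fallback check, here is a rough sketch of the kind of runtime estimate involved; the constants and the function name are my own illustration, not the library's code. The idea is to compare the number of radix passes implied by the key range against the roughly log2(n) depth of a comparison sort, and fall back to std::sort when the latter looks cheaper:

#include <cmath>
#include <cstddef>

// Rough sketch of a runtime fallback check, with made-up constants.
// The real heuristic in the library is more detailed; this only shows its shape.
inline bool should_fall_back_to_std_sort(std::size_t n,
                                         long long min_key,
                                         long long max_key)
{
    // Width of the key range in bits (computed in double to avoid overflow).
    const double range_bits =
        std::log2(static_cast<double>(max_key) - static_cast<double>(min_key) + 1.0);
    const double radix_passes = range_bits / 8.0;                   // ~8 bits split per pass (assumed)
    const double comp_passes  = std::log2(static_cast<double>(n));  // comparison-sort depth
    return radix_passes >= comp_passes;  // radix would need too many passes
}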
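The long-substring point for string_sort amounts to not re-comparing characters that the radix passes have already sorted. A minimal sketch of that idea, with illustrative names rather than the library's API:

#include <cstddef>
#include <string>

// Comparison functor that skips the first 'offset' characters, which are known
// to be identical for every string in the current bucket (so offset is assumed
// to be <= each string's length here).  Illustrative, not the library's code.
struct offset_less
{
    std::size_t offset;
    bool operator()(const std::string& a, const std::string& b) const
    {
        return a.compare(offset, std::string::npos,
                         b, offset, std::string::npos) < 0;
    }
};

Used as, say, std::sort(bucket_begin, bucket_end, offset_less{depth}), this avoids rescanning the shared prefix on every comparison when the fallback sort is invoked deep in the recursion.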
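And the float-to-integer cast is the well-known IEEE 754 bit trick; here is a minimal sketch, assuming 32-bit IEEE floats, and not necessarily the exact code in float_sort:

#include <cstdint>
#include <cstring>

// Map a float to an unsigned key whose integer ordering matches the float ordering.
inline std::uint32_t float_to_key(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));  // reinterpret the bits without aliasing issues
    // Positive floats order correctly once the sign bit is forced high;
    // negative floats order in reverse, so flipping every bit fixes them.
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

With that mapping, floats can be radix-sorted as plain integers instead of going through a comparison-based sort on floating-point values.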