
Correction: I need the >> operator, not the << operator. It's used for division. I apologize for the error. First, Johan: I would probably need to have three different methods: integer_sort, float_sort, and string_sort. What I have right now is a C non-templated version of integer_sort in the sourceforge Spreadsort project , so I would start with a generic integer_sort. float_sort is only slightly different. 2 and 3 could readily be handled by either adding >>, -, and < operators to either the structure being sorted or a functor. Phil: You are correct, I do require a comparison operator too (<). I assumed that, but it's good to make clear. You are also correct that negative floats have to sorted differently. Also, the IEEE float casting operation works on most OSes, but not all, so my float_sort implementation would probably never make it to a standard. That said, it would still be useful on most platforms, and it's very similar to integer_sort, just requiring flipping the order of negative bins, which should have a trivial performance impact. I could sort fixed-point data, given a >> operator, which you should be able to implement. Here is what is needed for a hybrid MSB radix sort and introsort combo: bool operator <(const T &) : for comparison size_t operator -(const T & lhs, const T & rhs) : a - b > 0 -> b > a Note: - will always be used such that lhs >= rhs inside the algorithm. T operator >>(const T &) : to divide the values in a quick and efficient manner that will preserve ordering (a >> n) < (b >> n) -> a < b. Each item is allocated a bin number that corresponds to (x >> n) - (minval >> n), where the algorithm picks an ideal n. That's all that should be necessary for integer_sort based upon Spreadsort, and for a cast float_sort. string_sort is a different beast. I would probably specialize it to strings and let the user specify a comparison operator initially, adding more generality once it has proved its worse. Yes, it should work well if there are long common substrings. I'll leave that until after integer_sort. As a little note Phil about lookups, if you keep the bin structure created by Spreadsort around and don't delete any of the sub-bins, you can do high-speed lookups using the bin structure that are much faster than binary search (radix-based lookup, like a trie). Back when Spreadsort wasn't cache-friendly and I split the data into n/8 bins, I found that doing lookups using that bin structure on random data was 3-5X as fast as a binary search. The idea works like this: you have an array of bins over the range MIN to MAX bin 0 points to the first value >= MIN bin 1 points to the first value >= MIN + offset bin 2 points to the first value >= MIN + 2*offset ... bin n points to the first value >= MIN + n*offset ... the last bin points to the first value >= MAX - offset The you find the bin index for "x" by taking (x - MIN)/offset, and do a binary search between that bin and the next bin. This can be made a faster on integers by using the >> operator. This kind of lookup will basically have one cache miss (into the big list of bins), and then a binary_search inside of one bin, which doesn't normally have many elements. If you use the Spreadsort-created bin structure, you might be able to get away with zero cache misses and more evenly divided bins, but you'll have more lookups. The most practical application would probably be a flat lookup into n/C bins, where C is small enough that most bins will fit in a memory page (256?). But I digress. It sounds like there is interest in a generic integer_sort and I will start working on one. After that float_sort should be simple. string_sort will be a lot of work, but doable. Steve