
On Mon, Jun 30, 2008 at 12:26 PM, Simonson, Lucanus J < lucanus.j.simonson@intel.com> wrote:
Steven Ross wrote:
For larger n, the advantage over pure comparison-based algorithms increases. In effect, Spreadsort smoothly transitions from pure MSB Radix sort to std::sort with its data cut up by a couple of iterations of MSB Radix sort (reducing the size of the log(n) factor), depending upon the distribution and size of the data.
The analysis seems a little too optimistic and seems to be based upon the assumption that the input is evenly distributed. To analyze the complexity you need to think like an adversary who knows your algorithm and intentionally creates an input that will make it perform poorly. In the worst case your algorithm would be slower than std::sort because a data set with a cluster of values that all fall within the same bin and a handful of outliers that give rise to the unfortunate binning would result in runtime for std::sort + runtime for one or more recursive stages of profitless binning. So, the algorithm should be expected to be faster than std::sort alone for some inputs, but slower for others.
The worst case is actually binning that only marginally splits up the data each iteration, in such a fashion that std::sort must be called on an only slightly reduced list at the end. This corresponds to the square_root(bit_sizeof(T)) + a constant number of std::sort iterations situation. Because each radix-based sorting iteration is significantly slower than an introsort iteration, this situation may actually be slightly slower than introsort (depending on the system configuration), but the worst-case performance is still nicely bounded.

The situation you describe can actually perform quite well. Under that situation it takes bit_sizeof(T)/x iterations until the data is fully radix sorted, where x is the lesser of MAX_SPLITS (usually 9) and log(n). MAX_SPLITS is picked for performance reasons and is not limited in any fundamental way, but even operating with 9 as the divisor, for a 64-bit integer the chunky data with outliers you describe will take 8 iterations until the data is fully radix sorted.

It's important to note that Spreadsort will give up and call std::sort on the current bin, for iterations after the first, if log(remaining_range) * LOG_CONST > log(n_before_sort) * log(n_in_bin) (LOG_CONST is normally 3). So for chunky data as you describe: if the data is 64 bits long and we just completed the first iteration, then log(remaining_range) = 64 - 9 = 55, and with log(n_before_sort) = 16 and log(n_in_bin) = 16 (very few outliers), the comparison is 55 * 3 > 16 * 16, i.e. 165 > 256. That is false, so it would continue radix-sorting all the way to the end. If log(n) were smaller (12 or less, since 12 * 12 = 144 < 165), it would stop and call std::sort.
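To make the give-up test above concrete, here is a minimal sketch of it as a standalone predicate. The function name and parameters are my own illustrative choices, not Spreadsort's actual internals; only the inequality and LOG_CONST = 3 come from the description above.

```cpp
// LOG_CONST as described above; normally 3 in Spreadsort.
const int LOG_CONST = 3;

// Illustrative sketch (not the library's real API): returns true when
// Spreadsort should give up on this bin and hand it to std::sort,
// based on the test log(remaining_range) * LOG_CONST >
// log(n_before_sort) * log(n_in_bin).
bool should_fall_back(int log_remaining_range,
                      int log_n_before_sort,
                      int log_n_in_bin)
{
    return log_remaining_range * LOG_CONST >
           log_n_before_sort * log_n_in_bin;
}

// Worked example from the text: after the first pass on 64-bit data,
// log(remaining_range) = 55, and with log(n) = 16 both before and in
// the bin, 165 > 256 is false, so radix sorting continues.
// With log(n) = 12, 165 > 144 is true, so std::sort is called instead.
```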
If, on the other hand, it only slowly diminished in size:

starting with log(n) = 16
second iteration: 15 (16 x 15 = 240 > 165)
third iteration: 14 (15 x 14 = 210 > 46 x 3 = 138)
fourth iteration: 13 (14 x 13 = 182 > 37 x 3 = 111)
fifth iteration: 12 (13 x 12 = 156 > 28 x 3 = 84)
sixth iteration: 11 (12 x 11 = 132 > 19 x 3 = 57)
seventh iteration: 10 (11 x 10 = 110 > 10 x 3 = 30)
eighth iteration: hits the minimum sort size (9) and calls std::sort on n = 512 chunks.

That's the worst case. Note that if LOG_CONST were larger or n diminished in size faster, std::sort would be called earlier; a LOG_CONST of 5 would significantly improve worst-case performance relative to std::sort for 64-bit integers. Note also that for larger n, the worst-case runtime does not increase. Technically, the greater of bit_sizeof(T)/MAX_SPLITS and square_root(bit_sizeof(T)) are both worst-case performance situations; which one is more applicable as "worst" depends on the value of LOG_CONST and the size of n. For large n, the bit_sizeof(T)/MAX_SPLITS form is accurate, but on the other hand, MAX_SPLITS can be increased to any value up to log(n)/8 with reasonable performance, which is why I don't like to use the bit_sizeof(T)/MAX_SPLITS form. Using a smaller MAX_SPLITS avoids cache misses.
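The iteration trace above can be reproduced by a small loop. This is only a sketch of the walk under the assumptions stated in the text (MAX_SPLITS = 9, LOG_CONST = 3, minimum sort size of 9, log(n_in_bin) shrinking by one per pass); the function name and parameters are mine, not Spreadsort's.

```cpp
// Illustrative sketch: counts how many radix passes the worst-case
// walk described above performs before std::sort takes over.
// Assumptions (from the text, not the library's real internals):
// the first pass is always taken, each pass removes max_splits bits
// of range, and log(n_in_bin) shrinks by exactly one per pass.
int worst_case_radix_passes(int log_range, int log_n,
                            int max_splits, int log_const,
                            int min_log_size)
{
    int passes = 1;                    // first pass is always taken
    log_range -= max_splits;
    int log_n_before = log_n;
    int log_n_in_bin = log_n - 1;
    // Keep radix-sorting while the bin is above the minimum sort size
    // and the give-up test log(range) * LOG_CONST >
    // log(n_before) * log(n_in_bin) does not fire.
    while (log_n_in_bin > min_log_size &&
           log_n_before * log_n_in_bin > log_range * log_const) {
        ++passes;
        log_range -= max_splits;
        log_n_before = log_n_in_bin;
        --log_n_in_bin;
    }
    return passes;
}

// For the trace above (64-bit data, log(n) = 16), this performs seven
// radix passes; the eighth step hits the minimum sort size of 9 and
// falls back to std::sort on 512-element chunks.
```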