
On Mon, Jun 30, 2008 at 12:26 PM, Simonson, Lucanus J < lucanus.j.simonson@intel.com> wrote:
Steven Ross wrote:
For larger n, the advantage over pure comparison-based algorithms increases. In effect, Spreadsort smoothly transitions from pure MSB Radix sort to std::sort with its data cut up by a couple of iterations of MSB Radix sort (reducing the size of the log(n) factor), depending upon the distribution and size of the data.
The analysis seems a little too optimistic and seems to be based upon the assumption that the input is evenly distributed. To analyze the complexity you need to think like an adversary who knows your algorithm and intentionally creates an input that will make it perform poorly. In the worst case your algorithm would be slower than std::sort because a data set with a cluster of values that all fall within the same bin and a handful of outliers that give rise to the unfortunate binning would result in runtime for std::sort + runtime for one or more recursive stages of profitless binning. So, the algorithm should be expected to be faster than std::sort alone for some inputs, but slower for others.
The worst case is actually binning that only marginally splits up the data each iteration, in such a fashion that std::sort must be called on an only slightly reduced list at the end. This corresponds to the square_root(bit_sizeof(T)) + a constant number of std::sort iterations situation. Because each radix-based sorting iteration is significantly slower than an introsort iteration, this situation may actually be slightly slower than introsort (depending on the system configuration), but the worst-case performance is still nicely bounded.

The situation you describe can actually perform quite well. Under that situation it takes bit_sizeof(T)/x iterations until the data is fully radix sorted, where x is the lesser of MAX_SPLITS (usually 9) and log(n). MAX_SPLITS is picked for performance reasons and is not limited in any fundamental way, but even operating with 9 as the divisor, for a 64-bit integer the chunky data with outliers you describe will take 8 iterations until the data is fully radix sorted.

It's important to note that Spreadsort will give up and call std::sort on the current bin, for iterations after the first, if log(remaining_range) * LOG_CONST > log(n_before_sort) * log(n_in_bin) (LOG_CONST is normally 3). So for chunky data as you describe: if the data is 64 bits long and we just completed the first iteration, then log(remaining_range) = 64 - 9 = 55, and with log(n_before_sort) = 16 and log(n_in_bin) = 16 (very few outliers), the comparison is 55 * 3 > 16 * 16, i.e. 165 > 256. That is false, so it would continue radix-sorting all the way to the end. If log(n) were smaller (12 or less, since 12 * 12 = 144 < 165), it would stop and call std::sort.
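To make the give-up test above concrete, here is a minimal sketch of it as a standalone predicate. The function name and parameters are my own illustrative choices, not Spreadsort's actual internals; only the inequality and LOG_CONST = 3 come from the description above.

```cpp
// LOG_CONST as described above; normally 3 in Spreadsort.
const int LOG_CONST = 3;

// Illustrative sketch (not the library's real API): returns true when
// Spreadsort should give up on this bin and hand it to std::sort,
// based on the test log(remaining_range) * LOG_CONST >
// log(n_before_sort) * log(n_in_bin).
bool should_fall_back(int log_remaining_range,
                      int log_n_before_sort,
                      int log_n_in_bin)
{
    return log_remaining_range * LOG_CONST >
           log_n_before_sort * log_n_in_bin;
}

// Worked example from the text: after the first pass on 64-bit data,
// log(remaining_range) = 55, and with log(n) = 16 both before and in
// the bin, 165 > 256 is false, so radix sorting continues.
// With log(n) = 12, 165 > 144 is true, so std::sort is called instead.
```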
If, on the other hand, it only slowly diminished in size:

starting with log(n) = 16
second iteration: 15 (16 x 15 = 240 > 165)
third iteration: 14 (15 x 14 = 210 > 46 x 3 = 138)
fourth iteration: 13 (14 x 13 = 182 > 37 x 3 = 111)
fifth iteration: 12 (13 x 12 = 156 > 28 x 3 = 84)
sixth iteration: 11 (12 x 11 = 132 > 19 x 3 = 57)
seventh iteration: 10 (11 x 10 = 110 > 10 x 3 = 30)
eighth iteration: hits the minimum sort size (9) and calls std::sort on n = 512 chunks.

That's the worst case. Note that if LOG_CONST were larger or n diminished in size faster, std::sort would be called earlier; a LOG_CONST of 5 would significantly improve worst-case performance relative to std::sort for 64-bit integers. Note also that for larger n, the worst-case runtime does not increase. Technically, the greater of bit_sizeof(T)/MAX_SPLITS and square_root(bit_sizeof(T)) are both worst-case performance situations; which one is more applicable as "worst" depends on the value of LOG_CONST and the size of n. For large n, the bit_sizeof(T)/MAX_SPLITS form is accurate, but on the other hand, MAX_SPLITS can be increased to any value up to log(n)/8 with reasonable performance, which is why I don't like to use the bit_sizeof(T)/MAX_SPLITS form. Using a smaller MAX_SPLITS avoids cache misses.
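The iteration trace above can be reproduced by a small loop. This is only a sketch of the walk under the assumptions stated in the text (MAX_SPLITS = 9, LOG_CONST = 3, minimum sort size of 9, log(n_in_bin) shrinking by one per pass); the function name and parameters are mine, not Spreadsort's.

```cpp
// Illustrative sketch: counts how many radix passes the worst-case
// walk described above performs before std::sort takes over.
// Assumptions (from the text, not the library's real internals):
// the first pass is always taken, each pass removes max_splits bits
// of range, and log(n_in_bin) shrinks by exactly one per pass.
int worst_case_radix_passes(int log_range, int log_n,
                            int max_splits, int log_const,
                            int min_log_size)
{
    int passes = 1;                    // first pass is always taken
    log_range -= max_splits;
    int log_n_before = log_n;
    int log_n_in_bin = log_n - 1;
    // Keep radix-sorting while the bin is above the minimum sort size
    // and the give-up test log(range) * LOG_CONST >
    // log(n_before) * log(n_in_bin) does not fire.
    while (log_n_in_bin > min_log_size &&
           log_n_before * log_n_in_bin > log_range * log_const) {
        ++passes;
        log_range -= max_splits;
        log_n_before = log_n_in_bin;
        --log_n_in_bin;
    }
    return passes;
}

// For the trace above (64-bit data, log(n) = 16), this performs seven
// radix passes; the eighth step hits the minimum sort size of 9 and
// falls back to std::sort on 512-element chunks.
```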