OTOH the single-threaded ones are interesting, especially the stable_sort
and intro_sort for cases where spreadsort cannot be used.
I challenge you to find a sorting problem where it isn't possible to use
spreadsort. string_sort can sort anything that std::sort can;
you just need to convert the sort-key comparison sequence into a sequence
of bytes
and feed string_sort the correct functors. That said, string_sort may not
be efficient for some problems,
and for many it will be too much work to be worth the effort.
This is what I meant. Sorry for the lack of precision.
I'll add it to my performance tests; it might be an interesting variant,
as the algorithm is really fast.
A short summary:
100000000 uint64_t elements already sorted:
Boost parallel sort : 0.064965 secs
Asynchronous parallel sort : 0.0167134 secs (same with other
algorithms)
Asynchronous provides a special optimization for this case.
Already-sorted input is a common case and your solution seems noticeably
faster, so if this optimization is practical to incorporate into Boost
parallel sort, I would like Francisco to consider it. That said, .06s
isn't bad.
Sure.
I added this one:
100000000 uint64_t elements reverse sorted
Boost parallel sort : 0.581381 secs
Asynchronous parallel sort : 0.0478701 secs
Asynchronous provides a special optimization for this case.
I think the library should do it too. This case is pretty common and a
popular DoS attack.
This optimization would be nice to have, if it's cheap.
A parallel reverse is O(n), so it's cheap compared to a sort.
100000000 uint64_t elements randomly filled
Asynchronous parallel quickspreadsort: 0.587432 secs
Asynchronous quick_intro_sort : 0.728393 secs
This mix of a quicksort degrading into spreadsort works best here.
I find this algorithm interesting;
Frankly, there is nothing special. It's simply a parallel partition, which
degrades into a "normal" parallel merge sort (in this case spreadsort)
after a while.
have you considered using MSD radix-based sorting to break up the data
for the threads in parallel?
For example, take T threads, have each thread take 1/T of the data
and bucket it into N MSD-based buckets (just like Spreadsort),
where T < N <= ~2^11, then merge the equivalent buckets,
and as you merge them hand them off to another thread to sort.
This will probably divide better if you find the max and min before
bucketing
using a parallel min/max search.
I admit I didn't. But it sounds really interesting and I think I'll have a
go at it.
I'm not a sort expert though. I approached this from the parallel
perspective.
10000000 strings randomly filled
OMP parallel sort : 1.05803 secs
Boost parallel sort : 0.933055 secs
OMP parallel stable sort : 1.12269 secs
Boost sample sort : 0.889216 secs
Boost parallel stable sort : 1.56564 secs
Asynchronous parallel quickspreadsort: 0.788856 secs
Asynchronous quick_intro_sort : 0.893652 secs
Asynchronous parallel stable sort : 1.23495 secs
Asynchronous boost::stable sort : 1.21817 secs
Similar results.
Let's move on to big objects:
1562500 elements of size 512 randomly filled
H E A V Y C O M P A R I S O N
OMP parallel sort : 0.308923 secs
Boost parallel sort : 0.342992 secs
Asynchronous parallel_quicksort : 0.246709 secs
Asynchronous quick_intro_sort : 0.269666 secs
Both quicksorts are best, with a slight advantage to intro_sort.
The light comparison is similar.
Do we have your permission to copy and/or adapt your quick_intro_sort
algorithm into the Boost.Sort library?
Please feel free to copy whatever might be useful. The library has a Boost
license, as it's intended for inclusion in Boost. I also took the liberty
of adding the single-threaded versions of Francisco's algorithms as the
basis for my parallel versions.
The algorithm is quite simple. It's the same as quick_spreadsort: a
quicksort which degrades into Francisco's intro_sort after some iterations.
I just found it interesting to add a wrapper to Francisco's algorithm for
the mini-review.
It was just a quick test (no time for cleverer algorithms) but I'm pretty
satisfied with the result and I'll keep these versions.
How is the worst-case performance relative to Francisco's library?
The merge-sort-based versions are O(n log n), like any merge sort. The
quicksorts range from O(n log n) to O(n^2), as expected from a quicksort.
Cheers,
Christophe