On Sun, Nov 20, 2016 at 3:24 PM Christophe Henry <christophe.j.henry@gmail.com> wrote:
I have not yet managed to write a full review. One week is a bit short, so a few more days will be necessary. I am providing some performance tests first.
Thanks Christophe. We will be interested in incorporating your feedback to improve the library.
OTOH the single-threaded ones are interesting, especially the stable_sort and intro_sort for cases where usage of spreadsort is not possible.
I challenge you to find a sorting problem where it isn't possible to use spreadsort. string_sort can sort anything that std::sort can; you just need to convert the sort key comparison sequence into a sequence of bytes and feed string_sort the correct functors. That said, it may not be efficient to sort some problems with string_sort, and it'll probably be too much work to be worth the effort for many problems.
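To make the byte-sequence conversion concrete, here is a minimal sketch (not from either library): a hypothetical `key_to_bytes` helper that encodes a signed 32-bit key so that unsigned byte-wise lexicographic order matches numeric order, which is the form a string_sort-style get_char/get_length functor pair would consume. Flipping the sign bit moves negatives below positives; big-endian layout makes the most significant byte compare first.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: encode a signed 32-bit key so that comparing
// the resulting bytes lexicographically (as unsigned chars) gives the
// same order as comparing the keys numerically.
std::array<unsigned char, 4> key_to_bytes(std::int32_t v) {
    // XOR the sign bit: INT32_MIN maps to 0x00000000, INT32_MAX to
    // 0xFFFFFFFF, so unsigned order equals signed order.
    std::uint32_t u = static_cast<std::uint32_t>(v) ^ 0x80000000u;
    // Big-endian: most significant byte first, so earlier bytes
    // dominate the lexicographic comparison.
    return { static_cast<unsigned char>(u >> 24),
             static_cast<unsigned char>(u >> 16),
             static_cast<unsigned char>(u >> 8),
             static_cast<unsigned char>(u) };
}
```

The same idea extends to floats (flip all bits of negatives, only the sign bit of positives) and to composite keys (concatenate the encoded fields in priority order).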
A short summary:
100000000 uint64_t elements already sorted:
Boost parallel sort : 0.064965 secs
Asynchronous parallel sort : 0.0167134 secs (same with other algorithms)
Asynchronous provides a special optimization for this case.
This is a common case and your solution seems noticeably faster, so if this optimization is practical to incorporate into Boost parallel sort, I would like Francisco to consider it. That said, 0.06s isn't bad.
I added this one: 100000000 uint64_t elements reverse sorted
Boost parallel sort : 0.581381 secs
Asynchronous parallel sort : 0.0478701 secs
Asynchronous provides a special optimization for this case. I think the library should do it too; this case is pretty common and a popular DoS attack vector.
This optimization would be nice to have, if it's cheap.
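The two pre-scan guards under discussion (already sorted, reverse sorted) can be sketched as follows. `guarded_sort` is a hypothetical wrapper, not code from either library, and `std::sort` stands in for the parallel algorithm; each guard costs a single O(n) pass before any sorting work begins.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical pre-sort guard: return immediately on already-sorted
// input, reverse in O(n) on reverse-sorted input, otherwise fall back
// to a full sort.
template <class It, class Compare = std::less<>>
void guarded_sort(It first, It last, Compare comp = Compare{}) {
    if (std::is_sorted(first, last, comp))
        return;                        // already sorted: nothing to do
    // Reverse-sorted means sorted under the inverted comparison.
    if (std::is_sorted(first, last,
            [&](const auto& a, const auto& b) { return comp(b, a); })) {
        std::reverse(first, last);     // O(n) fix-up
        return;
    }
    std::sort(first, last, comp);      // stand-in for the real algorithm
}
```

Both scans can themselves be parallelized (split the range, check each chunk plus the chunk boundaries), so the guard stays cheap even at these input sizes.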
100000000 uint64_t elements randomly filled
Asynchronous parallel quickspreadsort : 0.587432 secs
Asynchronous quick_intro_sort : 0.728393 secs
This mix of a quicksort degrading into a spreadsort works best here.
I find this algorithm interesting; have you considered using MSD radix-based sorting to break up the data for the threads in parallel? For example, take T threads, have each thread take 1/T of the data and bucket it into N MSD-based buckets (just like Spreadsort), where T < N <= ~2^11, then merge the equivalent buckets, and as you merge them hand them off to another thread to sort. This will probably divide better if you find the max and min before bucketing using a parallel min/max search.
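A sequential sketch of the MSD-bucket split step described above, with hypothetical names. In the parallel scheme each thread would bucket its 1/T slice like this, equivalent buckets would then be merged, and each merged bucket handed to a thread; `std::sort` stands in for the recursive sort (or spreadsort) on each bucket. Because the bucket index is the most significant part of the key, concatenating the sorted buckets yields the sorted whole.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: split by the most significant bits of (value - min), so the
// buckets partition the key range in order, then sort each bucket.
std::vector<std::uint64_t> msd_bucket_sort(std::vector<std::uint64_t> data,
                                           unsigned log_buckets = 11) {
    if (data.size() < 2) return data;
    // Finding min/max first (a parallel reduction in the real scheme)
    // tightens the range so the top bits actually discriminate.
    auto [mn, mx] = std::minmax_element(data.begin(), data.end());
    const std::uint64_t lo = *mn, range = *mx - lo;
    if (range == 0) return data;       // all keys equal
    // Choose a shift so that (v - lo) >> shift lands in [0, 2^log_buckets).
    unsigned shift = 0;
    while ((range >> shift) >= (1ull << log_buckets)) ++shift;
    std::vector<std::vector<std::uint64_t>> buckets(1u << log_buckets);
    for (auto v : data)
        buckets[(v - lo) >> shift].push_back(v);
    data.clear();
    for (auto& b : buckets) {
        std::sort(b.begin(), b.end());  // recurse / hand to a thread here
        data.insert(data.end(), b.begin(), b.end());
    }
    return data;
}
```

With N up to ~2^11 buckets and T threads, the merged buckets give far more than T independent work items, which should keep the threads busy even when the key distribution is skewed.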
10000000 strings randomly filled
OMP parallel sort : 1.05803 secs
Boost parallel sort : 0.933055 secs
OMP parallel stable sort : 1.12269 secs
Boost sample sort : 0.889216 secs
Boost parallel stable sort : 1.56564 secs
Asynchronous parallel quickspreadsort : 0.788856 secs
Asynchronous quick_intro_sort : 0.893652 secs
Asynchronous parallel stable sort : 1.23495 secs
Asynchronous boost::stable sort : 1.21817 secs
Similar results.
Let's move on to big objects:
1562500 elements of size 512, randomly filled (HEAVY COMPARISON):
OMP parallel sort : 0.308923 secs
Boost parallel sort : 0.342992 secs
Asynchronous parallel_quicksort : 0.246709 secs
Asynchronous quick_intro_sort : 0.269666 secs
Both quicksorts are best, with a slight advantage to intro_sort.
Results with a light comparison are similar.
Do we have your permission to copy and/or adapt your quick_intro_sort algorithm into the Boost.Sort library? How is the worst-case performance relative to Francisco's library?