
Edouard A wrote:
For example, for 4 threads you have :
parallel_sort<4> -> parallel_sort<2> -> parallel_sort<1> -> parallel_sort<1> -> parallel_sort<2> -> parallel_sort<1> -> parallel_sort<1>
This code is generated at compile time.
The one-thread version uses the fallback sorter, which is by default std::sort.
The default > 1 thread version merges the results of the underlying calls (there is a version with quicksort).
So basically it's something like this:

    thread t( sort(begin,mid) );
    sort(mid,end);
    t.join();
    merge(begin,mid,end);  // [*]

The trouble with this is that one of the threads will terminate before the other, e.g. because its data was easier to sort or was in more-local memory, and it will then do nothing until the other thread (and all its sub-threads) has also finished. It would be better to keep all threads occupied for as much of the time as possible, e.g. by dividing the problem up into chunks and giving the chunks to threads as and when they are ready to handle them.

You mentioned the Intel implementation before. Have you benchmarked it in comparison with your algorithm?

I guess that you may have benchmarked with random input data. This will tend to favour your method, as the input is uniformly difficult to sort. You should also benchmark with input that is less homogeneous, e.g. data that's mostly sorted but with local permutations. Is anyone aware of synthetic or real benchmark data sets for sorting?

[*] Of course, a fundamental issue for this algorithm is that even if the two sub-sorts are equally time-consuming, the final merge is still a sequential operation. But I just wondered whether it would be possible to have one thread start doing

    in_place_merge(begin,mid,end);

while the other thread starts at the other end, going backwards:

    in_place_merge(rbegin,reverse_iterator(mid),rend);

Of course they will crash into each other in the middle. But maybe it's possible to do this in some sort of safe lock-free manner?

Phil.
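P.S. For concreteness, here is a minimal sketch of the thread/sort/join/merge scheme discussed above, using C++11 threads. The name parallel_sort2 is made up for illustration (it is not the library's interface), and std::inplace_merge stands in for the merge step:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Minimal two-thread version of the split-and-merge scheme:
// sort each half concurrently, then merge sequentially.
// parallel_sort2 is a hypothetical name for illustration only.
template <typename Iter>
void parallel_sort2(Iter begin, Iter end) {
    Iter mid = begin + (end - begin) / 2;
    std::thread t([=] { std::sort(begin, mid); });  // left half on a new thread
    std::sort(mid, end);                            // right half on this thread
    t.join();  // [*] this thread idles here if it finished its half first
    std::inplace_merge(begin, mid, end);            // sequential merge step
}
```

The join() line is exactly where the load-imbalance problem shows up: whichever thread finishes first waits.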
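The chunking suggestion could look something like the sketch below: threads pull fixed-size chunks from a shared atomic counter as they become free, so a fast thread keeps working instead of idling. chunked_sort is a hypothetical name, and only the sorting phase is load-balanced here; the chunks are still merged sequentially afterwards:

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Threads grab the next unsorted chunk from an atomic counter until
// none remain, then the sorted chunks are merged pairwise. Assumes
// chunk > 0. Illustrative only; a real version would parallelise the
// merge phase too.
void chunked_sort(std::vector<int>& v, std::size_t chunk, unsigned nthreads) {
    std::atomic<std::size_t> next(0);
    auto worker = [&] {
        for (;;) {
            std::size_t i = next.fetch_add(1) * chunk;
            if (i >= v.size()) return;  // no chunks left for this thread
            std::sort(v.begin() + i,
                      v.begin() + std::min(i + chunk, v.size()));
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    // Fold the sorted chunks together (still a sequential bottleneck).
    for (std::size_t i = chunk; i < v.size(); i += chunk)
        std::inplace_merge(v.begin(), v.begin() + i,
                           v.begin() + std::min(i + chunk, v.size()));
}
```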
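For the "mostly sorted but with local permutations" benchmark input, one simple generator is to start from sorted data and shuffle small windows; the window size and seed below are arbitrary choices for illustration:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Builds 0..n-1 in order, then shuffles each fixed-size window, so the
// result is globally sorted but locally permuted.
std::vector<int> locally_permuted(std::size_t n, std::size_t window,
                                  unsigned seed) {
    std::vector<int> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<int>(i);
    std::mt19937 rng(seed);
    for (std::size_t i = 0; i + window <= n; i += window)
        std::shuffle(v.begin() + i, v.begin() + i + window, rng);
    return v;
}
```

An adaptive sequential sort can exploit such input, so it stresses a parallel sorter quite differently from uniform random data.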
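On the two-ended merge: one lock-free way to stop the threads crashing into each other is to give each side a fixed share of the output, so the forward pass emits exactly the first n/2 results and the backward pass the rest. With complementary tie-breaking (forward prefers the left run, backward prefers the right), the two passes consume disjoint prefixes/suffixes of the inputs. The sketch below merges into a separate buffer rather than in place, so it is weaker than the in_place_merge version asked about, but it shows the idea:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Forward thread writes out[0 .. n/2), backward pass writes
// out[n/2 .. n); each stops after its quota, so no locking is needed.
// Forward takes from `a` on ties, backward takes from `b` on ties:
// the backward pass is then the exact reversal of the stable forward
// merge, so the consumed ranges never overlap.
void two_ended_merge(const std::vector<int>& a, const std::vector<int>& b,
                     std::vector<int>& out) {
    const std::size_t n = a.size() + b.size();
    out.resize(n);
    const std::size_t front = n / 2;  // forward thread's share

    std::thread fwd([&] {             // emit the `front` smallest elements
        std::size_t i = 0, j = 0;
        for (std::size_t k = 0; k < front; ++k)
            out[k] = (j == b.size() || (i < a.size() && a[i] <= b[j]))
                         ? a[i++] : b[j++];
    });
    // Backward pass on this thread: emit the n - front largest elements.
    std::size_t i = a.size(), j = b.size();
    for (std::size_t k = n; k > front; --k)
        out[k - 1] = (i == 0 || (j > 0 && b[j - 1] >= a[i - 1]))
                         ? b[--j] : a[--i];
    fwd.join();
}
```

Whether the same trick can be made safe for a genuinely in-place merge, where both sides mutate the same range, is exactly the open question above.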