
On Mon, 02 Feb 2009 17:41:58 +0000, "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
So basically it's something like this:
thread t( sort(begin,mid) ); sort(mid,end); t.join(); merge(begin,mid,end); // [*]
Exactly, with some differences as I do a memory copy before the thread and I run two threads. Working on different buffers allows to prevent hidden system locks when you reach the middle. so it's more like buffer = new []; copy(...) thread t1(sort(buffer,...); buffer2 = new []; copy(...) thread t2(sort(buffer2, ...); t2.join(); t1.join(); merge(buffer, buffer2, result);
The trouble with this is that one of the threads will terminate before the other, e.g. because its data was easier to sort or was in more-local memory, and it will then do nothing until the other thread (and all its sub-threads) has also finished. It would be better to keep all threads occupied for as much of the time as possible, e.g. by dividing the problem up into chunks and giving the chunks to threads as and when they are ready to handle them.
You're absolutely right and this is my primary concern. That's what the TBB does, I wanted to try something more simple. Doing what you describe requires writing a scheduler and a dispatcher for you threads. This is costly. If you create it for each parallel_sort call then you get a huge performance hit. You could also decide to create it only for the first parallel_sort call which means that the following calls won't get the hit (or create it at runtime).
You mentioned the Intel implementation before. Have you benchmarked that in comparison with your algorithm?
I'm in the process of doing so. If the TBB does really better then I will see how I could use their approach, or maybe an hybrid approach.
I guess that you may have benchmarked with random input data. This will tend to favour your method as the input is uniformly difficult to sort. You should also benchmark with input that is less homogeneous, e.g. data that's mostly sorted but with local permutations, etc. Is anyone aware of synthetic or real benchmark data sets for sorting?
I think there is. I test on random and sorted data. I will verify again tonight and give you the results.
[*] Of course a fundamental issue for this algorithm is that even if the two sub-sorts are equally time-consuming, the final merge is still a sequential operation. But I just wondered whether it would be possible to have one thread start doing
in_place_merge(begin,mid,end);
while the other thread starts at the other end, going backwards:
in_place_merge(rbegin,reverse_iterator(mid),rend);
Of course they will crash into each other in the middle. But maybe it's possible to do this in some sort of safe lock-free manner?
It's maybe more efficient to search in parallel in the two arrays, build some sort of index and then merge. But you are right I was lazy to use std::merge instead of writing my own parallel_merge. -- EA