Re: [boost] [parallel_sort] Proposal

2 Feb 2009

      On Mon, 02 Feb 2009 17:41:58 +0000, "Phil Endecott"
<spam_from_boost_dev@chezphil.org> wrote:
...
So basically it's something like this:
thread t( sort(begin,mid) );
   sort(mid,end);
   t.join();
   merge(begin,mid,end);  // [*]
Exactly, with some differences as I do a memory copy before the thread and
I run two threads.

 Working on different buffers allows to prevent hidden system locks when
you reach the middle.

so it's more like 

buffer = new [];
copy(...)
thread t1(sort(buffer,...);

buffer2 = new [];
copy(...)
thread t2(sort(buffer2, ...);

t2.join();
t1.join();

merge(buffer, buffer2, result);
...
The trouble with this is that one of the threads will terminate before
the other, e.g. because its data was easier to sort or was in
more-local memory, and it will then do nothing until the other thread
(and all its sub-threads) has also finished.  It would be better to
keep all threads occupied for as much of the time as possible, e.g. by
dividing the problem up into chunks and giving the chunks to threads as
and when they are ready to handle them.
You're absolutely right and this is my primary concern.

That's what the TBB does, I wanted to try something more simple. 

Doing what you describe requires writing a scheduler and a dispatcher for
you threads. This is costly. If you create it for each parallel_sort call
then you get a huge performance hit. You could also decide to create it
only for the first parallel_sort call which means that the following calls
won't get the hit (or create it at runtime).
...
You mentioned the Intel implementation before.  Have you benchmarked
that in comparison with your algorithm?
I'm in the process of doing so. If the TBB does really better then I will
see how I could use their approach, or maybe an hybrid approach.
...
I guess that you may have benchmarked with random input data.  This
will tend to favour your method as the input is uniformly difficult to
sort.  You should also benchmark with input that is less homogeneous,
e.g. data that's mostly sorted but with local permutations, etc.  Is
anyone aware of synthetic or real benchmark data sets for sorting?
I think there is. 

I test on random and sorted data. 
I will verify again tonight and give you the results.
...
[*] Of course a fundamental issue for this algorithm is that even if
the two sub-sorts are equally time-consuming, the final merge is still
a sequential operation.  But I just wondered whether it would be
possible to have one thread start doing
in_place_merge(begin,mid,end);
while the other thread starts at the other end, going backwards:
in_place_merge(rbegin,reverse_iterator(mid),rend);
Of course they will crash into each other in the middle.  But maybe
it's possible to do this in some sort of safe lock-free manner?
It's maybe more efficient to search in parallel in the two arrays, build
some sort of index
and then merge.

But you are right I was lazy to use std::merge instead of writing my own
parallel_merge.

-- 

EA

Re: [boost] [parallel_sort] Proposal

Edouard A.