[parallel_sort] Proposal

Howdy,

I just finished the first working version of parallel_sort. There seems to be a lot of interest in this function in the C++ community, and in the Boost community specifically, which is the motivation for this post. ;)

My implementation is quite different in nature from the one present in Intel's TBB. It's designed to take advantage of a multi-core user machine; it's not designed to run on a grid for massive sorting operations. It only uses the STL and Boost. It doesn't have any fancy prerequisites, except having several cores to unleash its power!

Current benchmarks on my Q6600 vs2008-64-bit default allocator:

2 threads : 160 % faster
4 threads : 260 % faster

Keep in mind the Q6600 is not a real equivalent to a quad-CPU machine and that the default memory allocator is not multithreading friendly.

It can be heavily customized. I already offer the possibility to choose between quick sorting and merge sorting, and for quick sorting I offer two pivot algorithms : fast or secure. Then you can of course specify a predicate and a fallback sorting algorithm for when you run out of threads.

If you don't care to customize, no problem! Just write:

    parallel_sort<2>(v.begin(), v.end()); // sort with 2 threads

There is still some work to be done, especially in the documentation area. Nevertheless, I think it's time to open up the code and start getting feedback. :)

As an added bonus I also offer to add my implementation of the smoothsort algorithm (with some hints from Steven). This algorithm is faster on sorted input but slower on random input.

Cheers

-- 
EA
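[For illustration, a sketch of how such calls might look in context. Only the two-thread call above is verbatim from the post; the predicate overload is an assumed part of the customization interface described:]

    #include <functional>
    #include <vector>

    void example(std::vector<int>& v)
    {
        parallel_sort<4>(v.begin(), v.end());   // four threads, all defaults
        // Assumed overload, per "you can of course specify a predicate":
        parallel_sort<4>(v.begin(), v.end(), std::greater<int>());
    }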

On 1 Feb 2009, at 17:49, Edouard A. wrote:
Howdy,
If you don't care to customize, no problem! Just write:
parallel_sort<2>(v.begin(), v.end()); // sort with 2 threads
Does templating on the number of threads produce a large gain, as opposed to making it a run-time option? Having to compile separate executables for different numbers of cores seems a bit limiting.

Chris

Howdy,
If you don't care to customize, no problem! Just write:
parallel_sort<2>(v.begin(), v.end()); // sort with 2 threads
Does templating on the number of threads produce a large gain, as opposed to making it a run-time option? Having to compile separate executables for different numbers of cores seems a bit limiting.
You can write a runtime wrapper if needed:

    template <typename Iterator>
    void parallel_runtime(Iterator first, Iterator last, int threads)
    {
        switch (threads)
        {
        case 2: parallel_sort<2>(first, last); break;
        // etc.
        }
    }

I haven't benchmarked against a runtime option; I used a template parameter because it felt more natural to me to make this a compile-time parameter rather than a runtime one.

-- 
EA

On Mon, Feb 2, 2009 at 14:02, Edouard A. <edouard@fausse.info> wrote:
I haven't benchmarked against a runtime option, I did a template parameter as it felt more natural to me to make this a compile time parameter rather than a runtime parameter.
It seems fundamentally runtime to me, since different machines or just different runs will want different levels of concurrency. The overhead ought to be just a few compares and arithmetic operations, which would be swamped by the effort involved in the sorting. I think normally I'd want to just use

    parallel_sort(b, e, thread::hardware_concurrency())

It seems fundamentally runtime to me, since different machines or just different runs will want different levels of concurrency. The overhead ought to just be a few compares and arithmetic operations, which would be swamped by the effort involved in the sorting.
I think normally I'd want to just use parallel_sort(b, e, thread::hardware_concurrency())
Maybe you're right. Making it runtime is not a major design change; I can easily give it a try. We spend most of our time sorting, so an additional check is no big deal.

As for TBB, I did a quick benchmark. As expected, TBB is faster, but not by a lot. TBB is roughly 3 times faster than std::sort; again, I didn't expect to get a x4 gain, because of the design of the Q6600 CPU. My implementation is 2.6 times faster than std::sort. Although I don't slice the data, my synchronization mechanisms are extremely simple, which explains why the difference isn't very big (a 15 % advantage for TBB).

I have however noticed that the behavior on sorted input is not very satisfactory on my side, probably for the reasons described by Phil. tbb::parallel_sort's behavior is constant from what I can see.

I'm going to:

* write a parallel_merge
* see what I can do to optimize memory allocations
* think about the slicing issue
* probably play left4dead instead of all the above :(

-- 
EA

----- Original Message ----- From: "Edouard A." <edouard@fausse.info> To: <boost@lists.boost.org> Sent: Monday, February 02, 2009 8:41 PM Subject: Re: [boost] [parallel_sort] Proposal
It seems fundamentally runtime to me, since different machines or just different runs will want different levels of concurrency. The overhead ought to just be a few compares and arithmetic operations, which would be swamped by the effort involved in the sorting.
I think normally I'd want to just use parallel_sort(b, e, thread::hardware_concurrency())
Maybe you're right. Making it runtime is not a major design change; I can easily give it a try. We spend most of our time sorting, so an additional check is no big deal.
As for TBB, I did a quick benchmark.
As expected, TBB is faster, but not by a lot.
TBB is roughly 3 times faster than std::sort; again, I didn't expect to get a x4 gain, because of the design of the Q6600 CPU.
My implementation is 2.6 times faster than std::sort.
Although I don't slice the data, my synchronization mechanisms are extremely simple, which explains why the difference isn't very big (a 15 % advantage for TBB).
I have however noticed that the behavior on sorted input is not very satisfactory on my side, probably for the reasons described by Phil. tbb::parallel_sort's behavior is constant from what I can see.
I'm going to:
* write a parallel_merge
* see what I can do to optimize memory allocations
* think about the slicing issue
* probably play left4dead instead of all the above :(
Hi,

There is also Boost.ThreadPool in the review queue. Perhaps it can help you?

Vicente

Hi,
There is also Boost.ThreadPool in the review queue. Perhaps it can help you?
Yes, most likely. boost::thread_group didn't do the job, but a thread pool would be great and would take away a lot of the work required to achieve certain optimizations. When is it planned?

I thought about the parallel_merge. It's just a question of dividing one source into blocks and then finding the matching blocks in the other source. Once you have matching pairs, you do a sequential merge on each pair. You need to divide into blocks small enough that you won't have starving threads, but not so small that the synchronization overhead becomes significant.

-- 
EA
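[A minimal sketch of this block-matching idea, for two blocks only. Hypothetical names, not the posted code; assumes random-access iterators and output, and a non-empty first source:]

    #include <algorithm>
    #include <boost/bind.hpp>
    #include <boost/thread.hpp>

    template <typename It, typename Out>
    void merge_block(It f1, It l1, It f2, It l2, Out out)
    {
        std::merge(f1, l1, f2, l2, out);
    }

    template <typename It, typename Out>
    void parallel_merge_2(It first1, It last1, It first2, It last2, Out out)
    {
        // Split the first source at its midpoint, then find the matching
        // split point in the second source with a binary search.
        It mid1 = first1 + (last1 - first1) / 2;
        It mid2 = std::lower_bound(first2, last2, *mid1);
        Out out2 = out + (mid1 - first1) + (mid2 - first2);

        // Every element of the first pair is <= every element of the
        // second pair, so the two sequential merges are independent.
        boost::thread t(boost::bind(&merge_block<It, Out>,
                                    first1, mid1, first2, mid2, out));
        merge_block(mid1, last1, mid2, last2, out2);
        t.join();
    }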

Edouard A. wrote
I just finished the first working version of parallel_sort.
Current benchmarks on my Q6600 vs2008-64-bit default allocator:
2 threads : 160 % faster
4 threads : 260 % faster
So you're getting a super-linear speedup going from 1 to 2 processors? That seems odd. What size input does the benchmark relate to? I'm curious to know whether you still get a worthwhile speedup for relatively small inputs.
It can be heavily customized. I already offer the possibility to choose between quick sorting and merge sorting, and for quick sorting I offer two pivot algorithms : fast or secure. Then you can of course specify a predicate and a fallback sorting algorithm for when you run out of threads.
How about run-time selection between quicksort and mergesort, in the style of introsort? This seems hard to beat in uniprocessor applications.
Nevertheless, I think it's time to open up the code and start getting feedback.
Yes please. You haven't said anything about how it actually works... Phil.

On Mon, 02 Feb 2009 09:57:40 +0000, "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
Current benchmarks on my Q6600 vs2008-64-bit default allocator:
2 threads : 160 % faster
4 threads : 260 % faster
So you're getting a super-linear speedup going from 1 to 2 processors? That seems odd.
What size input does the benchmark relate to? I'm curious to know whether you still get a worthwhile speedup for relatively small inputs.
No, sorry, that was badly worded: I get factors of 1.6 and 2.6. I meant 160 % of the speed, not 160 % faster.

For small inputs it's very simple: anything under 1000 elements is not parallelized, because of the thread overhead cost. Finding the right threshold value would require some precise benchmarking.
It can be heavily customized. I already offer the possibility to choose between quick sorting and merge sorting, and for quick sorting I offer two pivot algorithms : fast or secure. Then you can of course specify a predicate and a fallback sorting algorithm for when you run out of threads.
How about run-time selection between quicksort and mergesort, in the style of introsort? This seems hard to beat in uniprocessor applications.
Not really needed for merge_sort: it has no problematic behaviour like quicksort's. The reason you would want to use quicksort is if the extra memory allocation is an issue.
Nevertheless, I think it's time to open up the code and start getting feedback.
Yes please. You haven't said anything about how it actually works...
Threads are specified as a template parameter, so their number is fixed at compile time. If you wish to have a multithreaded version that uses 2 threads you write

    parallel_sort<2>(v.begin(), v.end());

4 threads?

    parallel_sort<4>(v.begin(), v.end());

What happens then is that a recursive template call generates the right amount of threading through the recursion. That means that apart from thread creation and termination you have little overhead. For example, for 4 threads you have:

    parallel_sort<4> -> parallel_sort<2> -> parallel_sort<1>
                                         -> parallel_sort<1>
                     -> parallel_sort<2> -> parallel_sort<1>
                                         -> parallel_sort<1>

This code is generated at compile time. The one-thread version uses the fallback sorter, which is by default std::sort. The default version for more than one thread merges the results of the underlying calls (there is also a quicksort-based version).

Cheers.

-- 
EA
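[For the curious, a minimal sketch of what this compile-time recursion could look like. Hypothetical code, not the posted implementation; assumes a power-of-two thread count and random-access iterators, and folds in the under-1000-elements cutoff mentioned above:]

    #include <algorithm>
    #include <boost/bind.hpp>
    #include <boost/thread.hpp>

    template <int NumThreads, typename Iterator>
    struct parallel_sorter
    {
        static void sort(Iterator first, Iterator last)
        {
            if (last - first < 1000)   // small input: not worth a thread
            {
                std::sort(first, last);
                return;
            }
            Iterator mid = first + (last - first) / 2;
            // Half the threads sort the left part on a new thread,
            // the other half sort the right part on the current thread.
            boost::thread t(boost::bind(
                &parallel_sorter<NumThreads / 2, Iterator>::sort, first, mid));
            parallel_sorter<NumThreads / 2, Iterator>::sort(mid, last);
            t.join();
            std::inplace_merge(first, mid, last);   // sequential merge step
        }
    };

    // Base case: a single thread uses the fallback sorter.
    template <typename Iterator>
    struct parallel_sorter<1, Iterator>
    {
        static void sort(Iterator first, Iterator last)
        {
            std::sort(first, last);
        }
    };

    template <int NumThreads, typename Iterator>
    void parallel_sort(Iterator first, Iterator last)
    {
        parallel_sorter<NumThreads, Iterator>::sort(first, last);
    }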

Edouard A wrote:
For example, for 4 threads you have:

    parallel_sort<4> -> parallel_sort<2> -> parallel_sort<1>
                                         -> parallel_sort<1>
                     -> parallel_sort<2> -> parallel_sort<1>
                                         -> parallel_sort<1>

This code is generated at compile time.
The one-thread version uses the fallback sorter, which is by default std::sort.
The default version for more than one thread merges the results of the underlying calls (there is also a quicksort-based version).
So basically it's something like this:

    thread t( sort(begin,mid) );
    sort(mid,end);
    t.join();
    merge(begin,mid,end);   // [*]

The trouble with this is that one of the threads will terminate before the other, e.g. because its data was easier to sort or was in more-local memory, and it will then do nothing until the other thread (and all its sub-threads) has also finished. It would be better to keep all threads occupied for as much of the time as possible, e.g. by dividing the problem up into chunks and giving the chunks to threads as and when they are ready to handle them.

You mentioned the Intel implementation before. Have you benchmarked that in comparison with your algorithm?

I guess that you may have benchmarked with random input data. This will tend to favour your method, as the input is uniformly difficult to sort. You should also benchmark with input that is less homogeneous, e.g. data that's mostly sorted but with local permutations, etc. Is anyone aware of synthetic or real benchmark data sets for sorting?

[*] Of course a fundamental issue for this algorithm is that even if the two sub-sorts are equally time-consuming, the final merge is still a sequential operation. But I just wondered whether it would be possible to have one thread start doing

    in_place_merge(begin,mid,end);

while the other thread starts at the other end, going backwards:

    in_place_merge(rbegin,reverse_iterator(mid),rend);

Of course they will crash into each other in the middle. But maybe it's possible to do this in some sort of safe lock-free manner?

Phil.

On Mon, 02 Feb 2009 17:41:58 +0000, "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
So basically it's something like this:
    thread t( sort(begin,mid) );
    sort(mid,end);
    t.join();
    merge(begin,mid,end);   // [*]
Exactly, with some differences: I do a memory copy before starting the thread, and I run two threads. Working on different buffers helps prevent hidden system locks when you reach the middle. So it's more like:

    buffer = new [];  copy(...);  thread t1(sort(buffer, ...));
    buffer2 = new []; copy(...);  thread t2(sort(buffer2, ...));
    t2.join(); t1.join();
    merge(buffer, buffer2, result);
The trouble with this is that one of the threads will terminate before the other, e.g. because its data was easier to sort or was in more-local memory, and it will then do nothing until the other thread (and all its sub-threads) has also finished. It would be better to keep all threads occupied for as much of the time as possible, e.g. by dividing the problem up into chunks and giving the chunks to threads as and when they are ready to handle them.
You're absolutely right, and this is my primary concern. That's what TBB does; I wanted to try something simpler. Doing what you describe requires writing a scheduler and a dispatcher for your threads, and that is costly. If you create them for each parallel_sort call you take a huge performance hit. You could also decide to create them only on the first parallel_sort call, so that the following calls don't take the hit (or create them at runtime).
You mentioned the Intel implementation before. Have you benchmarked that in comparison with your algorithm?
I'm in the process of doing so. If TBB really does do better, then I will see how I could use their approach, or maybe a hybrid approach.
I guess that you may have benchmarked with random input data. This will tend to favour your method as the input is uniformly difficult to sort. You should also benchmark with input that is less homogeneous, e.g. data that's mostly sorted but with local permutations, etc. Is anyone aware of synthetic or real benchmark data sets for sorting?
I think there is. I test on random and sorted data. I will verify again tonight and give you the results.
[*] Of course a fundamental issue for this algorithm is that even if the two sub-sorts are equally time-consuming, the final merge is still a sequential operation. But I just wondered whether it would be possible to have one thread start doing
in_place_merge(begin,mid,end);
while the other thread starts at the other end, going backwards:
in_place_merge(rbegin,reverse_iterator(mid),rend);
Of course they will crash into each other in the middle. But maybe it's possible to do this in some sort of safe lock-free manner?
It may be more efficient to search the two arrays in parallel, build some sort of index, and then merge. But you are right: I was lazy and used std::merge instead of writing my own parallel_merge.

-- 
EA

Edouard A wrote:
On Mon, 02 Feb 2009 17:41:58 +0000, "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
So basically it's something like this:
    thread t( sort(begin,mid) );
    sort(mid,end);
    t.join();
    merge(begin,mid,end);   // [*]
Exactly, with some differences: I do a memory copy before starting the thread, and I run two threads.
Working on different buffers helps prevent hidden system locks when you reach the middle.
So it's more like:

    buffer = new [];  copy(...);  thread t1(sort(buffer, ...));
    buffer2 = new []; copy(...);  thread t2(sort(buffer2, ...));
    t2.join(); t1.join();
    merge(buffer, buffer2, result);
That's a lot of extra copying. What do you mean by "hidden system locks"? Phil.

That's a lot of extra copying.

[Edouard A.] I don't do an in-place merge. I don't know if that would be faster.
What do you mean by "hidden system locks"?
[Edouard A.] If you have two or more CPUs trying to access the same cache line, locking will occur. For example, let's say you are operating near the middle with two different threads t1 & t2:

    ......t1..] [..t2.......
    ------------------------
    A | A | A | B | B | B
    ------------------------

You have a potential (well, I would say certain) false sharing, because the middle is going to be on the same cache line.

-- 
EA
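[A toy illustration of the effect; hypothetical code, not from the library. Both counters below land on the same cache line, so the two threads keep invalidating each other's caches even though they never touch the same variable; padding each counter to its own 64-byte line typically speeds this up severalfold:]

    #include <boost/bind.hpp>
    #include <boost/thread.hpp>

    struct counters { long a; long b; };   // adjacent: almost certainly one cache line

    void bump_a(counters* c) { for (int i = 0; i < 100000000; ++i) ++c->a; }
    void bump_b(counters* c) { for (int i = 0; i < 100000000; ++i) ++c->b; }

    int main()
    {
        counters c = { 0, 0 };
        boost::thread t1(boost::bind(&bump_a, &c));
        boost::thread t2(boost::bind(&bump_b, &c));
        t1.join();
        t2.join();
    }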

Edouard A." wrote:
That's a lot of extra copying.
[Edouard A.] I don't do an in-place merge. I don't know if that would be faster.
My point is that you're doing extra copying compared even to a standard out-of-place mergesort. You're copying all the data on the way down, and then copying it all again during the merge on the way back up. Using in-place merge can avoid some of the copying on the way back up. I think you should get rid of the copying on the way down.
What do you mean by "hidden system locks"?
[Edouard A.] If you have two or more CPUs trying to access the same cache line, locking will occur.
Right. But do you think this effect is more significant than copying all of your data O(log n) times? Surely not. But even if it were a problem you could easily fix it by aligning your split points to cache lines.

I encourage you to investigate parallel sorting algorithms in the literature - doesn't Knuth have a whole book of them? - rather than inventing your own. And base your improvements on benchmark results.

Regards, Phil.
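[For what it's worth, rounding a split point to a cache-line boundary is nearly a one-liner. A sketch; assumes 64-byte lines, sizeof(T) dividing 64 evenly, and a buffer that itself starts on a line boundary:]

    #include <cstddef>

    // Round the midpoint of an n-element array of T down to a
    // cache-line boundary, so two threads never share a line.
    template <typename T>
    std::size_t aligned_split(std::size_t n)
    {
        const std::size_t per_line = 64 / sizeof(T);
        std::size_t mid = n / 2;
        return mid - mid % per_line;
    }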

Edouard A. wrote:
That's a lot of extra copying.
[Edouard A.] I don't do an in-place merge. I don't know if that would be faster.
May I suggest storing iterators in the allocated storage? After sorting the iterators you could merge them, and then apply the sorted order by simply swapping the elements of the original sequence. If the elements are fat and support fast swapping, this might give some speed gain.
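[A rough sketch of that approach; hypothetical helper, not part of the proposal. It sorts an index array by comparing the referenced elements, then applies the resulting permutation using only swaps of the original elements:]

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Comparator that orders indices by the elements they refer to.
    template <typename T>
    struct index_less
    {
        const std::vector<T>* v;
        explicit index_less(const std::vector<T>& vec) : v(&vec) {}
        bool operator()(std::size_t a, std::size_t b) const
        {
            return (*v)[a] < (*v)[b];
        }
    };

    template <typename T>
    void indirect_sort(std::vector<T>& v)
    {
        const std::size_t n = v.size();
        std::vector<std::size_t> idx(n);
        for (std::size_t i = 0; i < n; ++i) idx[i] = i;

        // Cheap phase: only indices move here, never the fat elements.
        std::sort(idx.begin(), idx.end(), index_less<T>(v));

        // pos[j] = final position of the element currently at index j.
        std::vector<std::size_t> pos(n);
        for (std::size_t i = 0; i < n; ++i) pos[idx[i]] = i;

        // Apply the permutation cycle by cycle: each swap puts v[i]
        // into its final slot, so each element moves only a few times.
        for (std::size_t i = 0; i < n; ++i)
            while (pos[i] != i)
            {
                std::swap(v[i], v[pos[i]]);
                std::swap(pos[i], pos[pos[i]]);
            }
    }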

On Tue, 03 Feb 2009 20:21:47 +0300, Andrey Semashev <andrey.semashev@gmail.com> wrote:
May I suggest storing iterators in the allocated storage? After sorting the iterators you could merge them, and then apply the sorted order by simply swapping the elements of the original sequence. If the elements are fat and support fast swapping, this might give some speed gain.
Thanks for the idea, I already got a lot of feedback and I have a lot of work ahead of me. :) -- EA

On 3 Feb 2009, at 17:34, Edouard A. wrote:
On Tue, 03 Feb 2009 20:21:47 +0300, Andrey Semashev <andrey.semashev@gmail.com> wrote:
May I suggest storing iterators in the allocated storage? After sorting the iterators you could merge them, and then apply the sorted order by simply swapping the elements of the original sequence. If the elements are fat and support fast swapping, this might give some speed gain.
Thanks for the idea, I already got a lot of feedback and I have a lot of work ahead of me. :)
Do go look at some of the pre-existing sorts; most STLs now implement things similarly. You should be doing most of the moving of elements during a sort using swap (you can do all of them, but that gets a little expensive).

Also, you probably want to think about making your sort suitable for move-only types, both for efficiency (never copy anything if you don't have to) and in preparation for such types.

Chris
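[As a sketch of the move-friendly style Chris describes; assumes a C++0x compiler for std::move, and is not the proposal's code. A merge that never copies, so it also handles move-only element types:]

    #include <algorithm>
    #include <utility>

    // Merge two sorted ranges by moving elements rather than copying.
    // Takes from the first range on ties, so the merge stays stable.
    template <typename It, typename Out>
    Out merge_move(It f1, It l1, It f2, It l2, Out out)
    {
        while (f1 != l1 && f2 != l2)
            *out++ = (*f2 < *f1) ? std::move(*f2++) : std::move(*f1++);
        out = std::move(f1, l1, out);   // drain whichever range remains
        return std::move(f2, l2, out);
    }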

Hi,

----- Original Message ----- From: "Edouard A." <edouard@fausse.info> To: <boost@lists.boost.org> Sent: Monday, February 02, 2009 7:21 PM Subject: Re: [boost] [parallel_sort] Proposal
On Mon, 02 Feb 2009 17:41:58 +0000, "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
So basically it's something like this:
thread t( sort(begin,mid) ); sort(mid,end); t.join(); merge(begin,mid,end); // [*]
Exactly, with some differences: I do a memory copy before starting the thread, and I run two threads.
Working on different buffers helps prevent hidden system locks when you reach the middle.
Could you please be more precise: which kind of hidden system locks?
So it's more like:

    buffer = new [];  copy(...);  thread t1(sort(buffer, ...));
    buffer2 = new []; copy(...);  thread t2(sort(buffer2, ...));
    t2.join(); t1.join();
    merge(buffer, buffer2, result);
The second thread is not really useful, as noted by Phil.

I would like to be able to write something more generic, something like:

    template <typename DirectSolver, typename Composer,
              typename AsynchronousExecutor, typename Input>
    void inplace_solve(AsynchronousExecutor& ae, Input& input)
    {
        // if the problem is small...
        if (size(input) < concurrency_threshold) {
            // ...solve it directly
            DirectSolver()(input);
        } else {
            // split the problem into independent parts
            BOOST_AUTO(partition, partition_view(input));
            // evaluate inplace_solve asynchronously on each element of the
            // partition, using the asynchronous executor as scheduler
            wait_for_all(ae, inplace_solve, partition);
            // compose the result in place from the subresults
            Composer()(partition);
        }
    }

So parallel sort could be:

    template <typename Range>
    void parallel_sort(Range& range)
    {
        boost::tp::pool<> ae;
        parallel::inplace_solve<sort, merge>(ae, range);
    }

I'm working on an Asynchronous Execution framework that you can get from http://www.boostpro.com/vault/index.php?action=downloadfile&filename=interthreads.zip&directory=Concurrent%20Programming&. It provides a wait_for_all function which forks each function except the last one, which is executed in the current thread. So if you make 4 partitions you need to use it as:

    wait_for_all(ae,
        bind(inplace_solve, at_c<0>(partition)),
        bind(inplace_solve, at_c<1>(partition)),
        bind(inplace_solve, at_c<2>(partition)),
        bind(inplace_solve, at_c<3>(partition)));

I'll try to implement this overload:

    template <typename AE, typename F, typename Sequence>
    typename result_of::wait_for_all<AE, F, Sequence>::type
    wait_for_all(AE& ae, F f, Sequence seq);

HTH,
Vicente

Could you please be more precise: which kind of hidden system locks?
False sharing, see my other answer. Classic problem.
I'm working on an Asynchronous Execution framework that you can get from http://www.boostpro.com/vault/index.php?action=downloadfile&filename=interthreads.zip&directory=Concurrent%20Programming&. It provides a wait_for_all function which forks each function except the last one, which is executed in the current thread. So if you make 4 partitions you need to use it as:
    wait_for_all(ae,
        bind(inplace_solve, at_c<0>(partition)),
        bind(inplace_solve, at_c<1>(partition)),
        bind(inplace_solve, at_c<2>(partition)),
        bind(inplace_solve, at_c<3>(partition)));
I'll try to implement this overload:
    template <typename AE, typename F, typename Sequence>
    typename result_of::wait_for_all<AE, F, Sequence>::type
    wait_for_all(AE& ae, F f, Sequence seq);
This opens up many possibilities... Thanks, I will have a look, but I need to implement parallel_merge first. ;) With your library, it seems that slicing the input and scheduling the threads will be pretty straightforward.

-- 
EA

Earlier I wrote:
Is anyone aware of synthetic or real benchmark data sets for sorting?
In the absence of anything else, this data is still available: http://thread.gmane.org/gmane.comp.lib.boost.devel/177032 (scroll down) Pre-sorting on one column and then benchmarking sorting on another column is the sort of thing I had in mind. Phil.
participants (6)
- Andrey Semashev
- Christopher Jefferson
- Edouard A.
- Phil Endecott
- Scott McMurray
- vicente.botet