
On Mon, 02 Feb 2009 09:57:40 +0000, "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
Current benchmarks on my Q6600 vs2008-64-bit default allocator:
2 threads : 160 % faster
4 threads : 260 % faster
So you're getting a super-linear speedup going from 1 to 2 processors? That seems odd.
What size input does the benchmark relate to? I'm curious to know whether you still get a worthwhile speedup for relatively small inputs.
No sorry, this is wrongly said, I get factor 1.6 and 2.6. I meant 160 % speed, not 160 % faster. For small input it's very simple, anything under 1000 is not parallelized because of the thread overhead cost. This would require some precise benchmarking to find the right value.
It can be heavily customized. I already offer the possibility to choose between quick sorting and merge sorting, and for quick sorting I offer two pivot algorithms : fast or secure. Then you can of course specify a predicate and a fallback sorting algorithm for when you run out of threads.
How about run-time selection between quicksort and mergesort, in the style of introsort? This seems hard to beat in uniprocessor applications.
Not really needed for merge_sort, there is no problematic behaviour like quicksort. The reason why you would want to use quicksort is that if the extra memory allocation is an issue.
Nevertheless, I think it's time to open up the code and start getting feedback.
Yes please. You haven't said anything about how it actually works...
Threads are specified as a template parameter. There are finite. If you wish to have a mutlithreaded version that uses 2 threads you do parallel_sort<2>(v.begin(), v.end()); 4 threads ? parallel_sort<4>(v.begin(), v.end()); What happens then is that you have a template recursive call that generates the right amount of threading through recursion. That means apart from thread creation and termination you have little overhead. For example, for 4 threads you have : parallel_sort<4> -> parallel_sort<2> -> parallel_sort<1> -> parallel_sort<1> -> parallel_sort<2> -> parallel_sort<1> -> parallel_sort<1> This code is generated at compile time. The one thread version use the fallback sorter which is by default std::sort. The default > 1 thread version merges the results of the underlying calls (there is a version with quicksort). Cheers. -- EA