[threadpool] parallel_sort example

Hi,

I have implemented a parallel sort with the threadpool library (version 21, http://tiny.cc/HVOOa), plus a patch to make tp::task default-constructible, and the version of Boost.Range currently under review (the test file is attached). I have implemented a generic algorithm that should also work for other partition-and-compose problems:

template < typename DirectSolver, typename Composer, typename AE, typename Range >
void inplace_solve( AE & ae, boost::sub_range<Range> range, unsigned cutoff );

template < typename DirectSolver, typename Composer, typename AE, typename Range >
void inplace_solve( AE & ae, boost::sub_range<Range> range, unsigned cutoff ) {
    unsigned size = boost::size(range);
    if ( size <= cutoff ) {
        // small enough: solve directly
        DirectSolver()(range);
    } else {
        // split the range, solve all but the last part asynchronously,
        // solve the last part in this thread, then compose the results
        partition<Range> parts(range, BOOST_PARTS);
        task_type tasks[BOOST_PARTS];
        for (unsigned i = 0; i < BOOST_PARTS-1; ++i) {
            task_type tmp(ae.submit(boost::bind(
                &inplace_solve<DirectSolver,Composer,AE,Range>,
                boost::ref(ae), parts[i], cutoff)));
            tasks[i] = tmp;
        }
        inplace_solve<DirectSolver,Composer,AE,Range>(ae, parts[BOOST_PARTS-1], cutoff);
        for (unsigned i = 0; i < BOOST_PARTS-1; ++i) {
            tasks[i].wait();
        }
        Composer()(range);
    }
}

struct sort_fct {
    template<class RandomAccessRange>
    RandomAccessRange& operator()(RandomAccessRange rng) {
        return boost::sort(rng);
    }
};

struct inplace_merge_fct {
    template<class BidirectionalRange>
    BidirectionalRange& operator()(BidirectionalRange rng) {
        return boost::inplace_merge(rng,
            boost::begin(rng) + (boost::size(rng)/2));
    }
};

// This implementation masks the ThreadPool.
template <typename Range>
void parallel_sort(Range& range, unsigned cutoff = 10000) {
    pool_type pool( boost::tp::poolsize( 2 ) );
    boost::sub_range<Range> rng(boost::begin(range), boost::end(range));
    inplace_solve<sort_fct,inplace_merge_fct,pool_type,Range>(pool, rng, cutoff);
}

The timing differences on my computer are not too good. The sort was done on 400000 elements, once on a reversed container and once on an already sorted container. In "parallel_sort xxx", xxx is the cutoff parameter, with BOOST_PARTS=2:

std::sort: reverse 0..400000 37 milli seconds
std::sort: 0..400000 35 milli seconds
boost::sort: reverse 0..400000 38 milli seconds
boost::sort: 0..400000 34 milli seconds
parallel_sort 200000: reverse 0..400000 24 milli seconds
parallel_sort 200000: 0..400000 24 milli seconds
parallel_sort 100000: reverse 0..400000 28 milli seconds
parallel_sort 100000: 0..400000 26 milli seconds
parallel_sort 50000: reverse 0..400000 31 milli seconds
parallel_sort 50000: 0..400000 29 milli seconds
parallel_sort 25000: reverse 0..400000 32 milli seconds
parallel_sort 25000: 0..400000 34 milli seconds
parallel_sort 12500: reverse 0..400000 37 milli seconds
parallel_sort 12500: 0..400000 34 milli seconds

My computer has only two cores. I expected that a cutoff of 50000 would give a better time than a cutoff of 200000. Could someone with access to a computer with more cores run the test, so we can compare the results?

I have also implemented parallel_sort without masking the pool:

template <typename AE, typename Range>
void parallel_sort(AE& ae, Range& range, unsigned cutoff = 10000) {
    boost::iterator_range<typename boost::range_iterator<Range>::type> rng(range);
    inplace_solve<sort_fct,inplace_merge_fct,pool_type,Range>(ae, rng, cutoff);
}

In this case we win ~3 milliseconds by not creating the thread pool:

std::sort: reverse 0..400000 37 milli seconds
std::sort: 0..400000 35 milli seconds
boost::sort: reverse 0..400000 36 milli seconds
boost::sort: 0..400000 34 milli seconds
parallel_sort 200000: reverse 0..400000 27 milli seconds
parallel_sort 200000: 0..400000 23 milli seconds
parallel_sort 100000: reverse 0..400000 29 milli seconds
parallel_sort 100000: 0..400000 27 milli seconds
parallel_sort 50000: reverse 0..400000 29 milli seconds
parallel_sort 50000: 0..400000 30 milli seconds
parallel_sort 25000: reverse 0..400000 31 milli seconds
parallel_sort 25000: 0..400000 29 milli seconds
parallel_sort 12500: reverse 0..400000 32 milli seconds
parallel_sort 12500: 0..400000 27 milli seconds

This leads me to think that the application should not create ThreadPools for specific algorithms, but use a default one. V21 of Boost.ThreadPool provides a boost::this_task::get_thread_pool<pool_type>(), but this function works only if called from a task. It would be great if Boost.ThreadPool provided a default pool which could be recovered with this function on threads that are not worker threads of a ThreadPool. Otherwise the application is forced either to add a parameter to all the functions or to define its own default thread pool statically.

Oliver, you are free to take this as an example for the ThreadPool documentation, if you consider it can improve the understanding of what can be done with your library.

Best,
Vicente
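[For readers without the attached test file, the strategy for BOOST_PARTS == 2 can be sketched self-contained, with plain boost::thread standing in for the pool; two_way_parallel_sort is a hypothetical name, not Vicente's code:]

#include <algorithm>
#include <cstddef>
#include <boost/thread.hpp>
#include <boost/bind.hpp>

// Self-contained illustration of the divide/sort/merge strategy above:
// sort one half in a new thread, the other half in the current thread,
// then compose the two sorted halves with std::inplace_merge.
template <typename Iter>
void two_way_parallel_sort(Iter first, Iter last, std::size_t cutoff) {
    std::size_t size = static_cast<std::size_t>(last - first);
    if (size <= cutoff) {
        std::sort(first, last);  // small enough: solve directly
        return;
    }
    Iter mid = first + size / 2;
    boost::thread left(boost::bind(&two_way_parallel_sort<Iter>, first, mid, cutoff));
    two_way_parallel_sort(mid, last, cutoff);
    left.join();
    std::inplace_merge(first, mid, last);
}

Note that this sketch spawns a new thread per split rather than reusing pooled workers, which is exactly the overhead the threadpool is meant to avoid.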

This leads me to think that the application should not create ThreadPools for specific algorithms, but use a default one. V21 of Boost.ThreadPool provides a boost::this_task::get_thread_pool<pool_type>(), but this function works only if called from a task. It would be great if Boost.ThreadPool provided a default pool which could be recovered with this function on threads that are not worker threads of a ThreadPool. Otherwise the application is forced either to add a parameter to all the functions or to define its own default thread pool statically.
[Edouard A.] First of all, don't be disappointed by the performance gain. While it is true that you can hope for almost 100% performance gain on a dual-core CPU, there are a lot of parameters to consider (including the CPU itself; what is the specific CPU you are using?). I also think that having a default pool is a must. What you can do is have a boost::tp::get_default_pool() that uses the Meyers singleton idiom (i.e. the function returns a reference to a static local variable). You should also compare your results with Intel TBB's parallel_sort, and use containers designed for multithreaded operation (this is especially true for vectors). If TBB offers a similar performance gain, then maybe the problem is your CPU. Also test on random data: std::generate(values1.begin(), values1.end(), &rand); For your information, I worked a little on a parallel implementation that didn't use any task scheduling; performance was 20% below TBB. Using quicksort or merge sort offered similar performance. With your strategy you should land between those 20% and TBB (perhaps you can even outperform TBB!). The overhead of scheduling is compensated by the fact that the load is equally distributed amongst the cores. This is the conclusion of my tests on a Q6600 under Vista-64. Regards. -- Edouard
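[A minimal sketch of that suggestion, reusing the pool_type typedef from Vicente's example; get_default_pool() is the proposed addition, not part of the V21 library:]

// Sketch of the proposed Meyers-singleton default pool. Nothing here is
// existing V21 API; pool_type is the typedef from Vicente's test file.
pool_type& get_default_pool() {
    // Constructed on first call, destroyed at program exit. Before C++11
    // this initialization is not guaranteed thread-safe, so the first call
    // should happen before concurrent use.
    static pool_type instance( boost::tp::poolsize( 2 ) );  // size is a placeholder; see the sizing discussion below
    return instance;
}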

On Monday 02 March 2009 00:06:25, Edouard A. wrote:
I also think that having a default pool is a must. What you can do is have a boost::tp::get_default_pool() that uses the Meyers singleton idiom (i.e. the function returns a reference to a static local variable).
How many worker-threads should be created inside the default threadpool? Which scheduling strategy should be applied? The pool could use FIFO ordering of the queued tasks, but the right number of worker-threads depends on how the threadpool is used. The pool could create as many worker-threads as CPUs are online and bind each thread to one specific CPU/core; CPU-intensive apps could benefit from this scenario. Applications with a lot of IO usually require more worker-threads than CPUs. Hmmm. Oliver

How many worker-threads should be created inside the default threadpool?

[Edouard A.] As many threads as there are physical cores available sounds like a nice default to me.

Which scheduling strategy should be applied? The pool could use FIFO ordering of the queued tasks, but the right number of worker-threads depends on how the threadpool is used. The pool could create as many worker-threads as CPUs are online and bind each thread to one specific CPU/core; CPU-intensive apps could benefit from this scenario.

[Edouard A.] Unless you know what you're doing, it's best to let the OS decide on which CPU a thread should be running. For a default pool I submit it is better to leave affinity alone. As noted above, I agree with one thread per core. Don't forget you're going to have threads outside your pool also requesting the preciouuuus CPU resource.

Applications with a lot of IO usually require more worker-threads than CPUs.

[Edouard A.] You are correct, but this doesn't sound like a use case for a "default" pool. -- Edouard
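[The one-thread-per-core default agreed on above can be queried portably with Boost.Thread; a small sketch, where the fallback value 2 for an unknown core count is an assumption:]

#include <boost/thread/thread.hpp>

// One worker per online core for a CPU-bound default pool.
// boost::thread::hardware_concurrency() returns 0 when the count is
// unknown, so fall back to a small fixed value (2 is an assumption).
unsigned default_pool_size() {
    unsigned n = boost::thread::hardware_concurrency();
    return n != 0 ? n : 2;
}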

----- Original Message ----- From: <k-oli@gmx.de> To: <boost@lists.boost.org> Sent: Monday, March 02, 2009 7:20 PM Subject: Re: [boost] [threadpool] parallel_sort example
On Monday 02 March 2009 00:06:25, Edouard A. wrote:
I also think that having a default pool is a must. What you can do is have a boost::tp::get_default_pool() that uses the Meyers singleton idiom (i.e. the function returns a reference to a static local variable).
How many worker-threads should be created inside the default threadpool? Which scheduling strategy should be applied?
The pool could use FIFO ordering of the queued tasks, but the right number of worker-threads depends on how the threadpool is used. The pool could create as many worker-threads as CPUs are online and bind each thread to one specific CPU/core; CPU-intensive apps could benefit from this scenario. Applications with a lot of IO usually require more worker-threads than CPUs.
Hmmm
Hi Oliver, You are right, the number of worker-threads depends on the usage. But the fact that the default pool should be configurable does not imply that we cannot have one. The configuration can be done by compilation flags, or the user can change the features of the default pool at runtime; there are a lot of possibilities. What is clear to me is that the user needs this default. Best, Vicente
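[One of the possibilities Vicente mentions, sketched: let the application configure the default pool's size, provided it does so before first use. Nothing here is library API; get_default_pool and default_pool_size_setting are hypothetical names:]

// Hypothetical runtime configuration of the default pool. The application
// may overwrite the setting once, before anything calls get_default_pool().
unsigned& default_pool_size_setting() {
    static unsigned size = 2;  // used if the application never configures it
    return size;
}

pool_type& get_default_pool() {
    static pool_type instance( boost::tp::poolsize( default_pool_size_setting() ) );
    return instance;
}

// Usage, early in main():
//   default_pool_size_setting() = 8;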

vicente.botet wrote:
I have implemented a parallel sort with the threadpool library
Hi Vicente, I have not tried to fully understand your code, but can you explain:
partition<Range> parts(range, BOOST_PARTS);
return boost::inplace_merge(rng, boost::begin(rng)+(boost::size(rng)/2));
Is this right when BOOST_PARTS != 2? Phil.

----- Original Message ----- From: "Phil Endecott" <spam_from_boost_dev@chezphil.org> To: <boost@lists.boost.org> Sent: Monday, March 02, 2009 3:35 PM Subject: Re: [boost] [threadpool] parallel_sort example
vicente.botet wrote:
I have implemented a parallel sort with the threadpool library
Hi Vicente,
I have not tried to fully understand your code, but can you explain:
partition<Range> parts(range, BOOST_PARTS);
return boost::inplace_merge(rng, boost::begin(rng)+(boost::size(rng)/2));
Is this right when BOOST_PARTS != 2?
Good point, Phil. No, this does not work. If BOOST_PARTS is 4, we will have 4 sorted partitions that need to be merged, and boost::inplace_merge works only for two parts. I don't know if a parallel merge could improve the performance on processors with more cores. A merge of N sorted parts should be easy to parallelize, but I have not done this yet. Maybe this will be the next step. Thanks for the remark, Vicente
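[For the record, the simplest fix for BOOST_PARTS > 2 is to compose the sorted partitions pairwise; a sketch under the assumption that the parts are contiguous and equal-sized, with hypothetical names:]

#include <algorithm>
#include <cstddef>

// Hypothetical sketch: compose N contiguous sorted partitions by merging
// them pairwise, doubling the merged width each round. For N parts this
// performs ceil(log2(N)) rounds of std::inplace_merge.
template <typename Iter>
void merge_parts(Iter first, Iter last, std::size_t parts) {
    std::size_t size = static_cast<std::size_t>(last - first);
    std::size_t width = size / parts;  // assumes equal-sized contiguous parts
    if (width == 0) width = 1;         // guard against parts > size
    for (; width < size; width *= 2) {
        for (std::size_t lo = 0; lo + width < size; lo += 2 * width) {
            std::size_t mid = lo + width;
            std::size_t hi = std::min(lo + 2 * width, size);
            std::inplace_merge(first + lo, first + mid, first + hi);
        }
    }
}

Within each round the merges are independent of one another, so they could themselves be submitted to the pool; that would be the parallel merge mentioned above.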

On Sunday 01 March 2009 22:53:14, vicente.botet wrote:
Oliver, you are free to take this as an example for the ThreadPool documentation, if you consider it can improve the understanding of what can be done with your library.
Hi Vicente, I've executed your parallel_sort2.cpp with NN = 10000000 on a Core 2 Duo E6600 (with BOOST_BIND_WORKER_TO_PROCESSORS, BOOST_DISABLE_ASSERTS and -O2):

std::sort: reverse 0..10000000 158 milli seconds
std::sort: 0..10000000 148 milli seconds
boost::sort: reverse 0..10000000 158 milli seconds
boost::sort: 0..10000000 148 milli seconds
parallel_sort 5000000: reverse 0..10000000 145 milli seconds
parallel_sort 5000000: 0..10000000 121 milli seconds
parallel_sort 2500000: reverse 0..10000000 169 milli seconds
parallel_sort 2500000: 0..10000000 148 milli seconds
parallel_sort 1250000: reverse 0..10000000 190 milli seconds
parallel_sort 1250000: 0..10000000 166 milli seconds
parallel_sort 625000: reverse 0..10000000 201 milli seconds
parallel_sort 625000: 0..10000000 184 milli seconds
parallel_sort 312500: reverse 0..10000000 216 milli seconds
parallel_sort 312500: 0..10000000 203 milli seconds

I've added your example - thx. Oliver

----- Original Message ----- From: <k-oli@gmx.de> To: <boost@lists.boost.org> Sent: Monday, March 02, 2009 10:05 PM Subject: Re: [boost] [threadpool] parallel_sort example
On Sunday 01 March 2009 22:53:14, vicente.botet wrote:
Oliver, you are free to take this as an example for the ThreadPool documentation, if you consider it can improve the understanding of what can be done with your library.
Hi Vicente, I've executed your parallel_sort2.cpp with NN = 10000000 on a Core 2 Duo E6600 (with BOOST_BIND_WORKER_TO_PROCESSORS, BOOST_DISABLE_ASSERTS and -O2):
[snip]
I've added your example - thx.
Just to comment on some variations. With 4 threads I get the following results:

std::sort: reverse 0..400000 37 milli seconds
std::sort: 0..400000 35 milli seconds
boost::sort: reverse 0..400000 36 milli seconds
boost::sort: 0..400000 34 milli seconds
parallel_sort 200000: reverse 0..400000 39 milli seconds
parallel_sort 200000: 0..400000 22 milli seconds
* parallel_sort 100000: reverse 0..400000 26 milli seconds
* parallel_sort 100000: 0..400000 26 milli seconds
parallel_sort 50000: reverse 0..400000 30 milli seconds
parallel_sort 50000: 0..400000 28 milli seconds
parallel_sort 25000: reverse 0..400000 30 milli seconds
parallel_sort 25000: 0..400000 32 milli seconds

With 6 threads:

std::sort: reverse 0..400000 40 milli seconds
std::sort: 0..400000 36 milli seconds
boost::sort: reverse 0..400000 36 milli seconds
boost::sort: 0..400000 39 milli seconds
parallel_sort 200000: reverse 0..400000 43 milli seconds
parallel_sort 200000: 0..400000 21 milli seconds
* parallel_sort 100000: reverse 0..400000 25 milli seconds
* parallel_sort 100000: 0..400000 31 milli seconds
parallel_sort 50000: reverse 0..400000 31 milli seconds
parallel_sort 50000: 0..400000 30 milli seconds
parallel_sort 25000: reverse 0..400000 30 milli seconds
parallel_sort 25000: 0..400000 30 milli seconds

These results seem strange to me for an algorithm that has no IO and no synchronization between tasks. Vicente
participants (4):
- Edouard A.
- k-oli@gmx.de
- Phil Endecott
- vicente.botet