[threadpool] parallel_sort example

Hi,

I have implemented a parallel sort with the threadpool library (version 21, http://tiny.cc/HVOOa), plus a patch to make tp::task default-constructible, and the version of Boost.Range currently under review (the test file is attached). I have implemented a generic algorithm that should also work for other partition-and-compose problems:

template < typename DirectSolver, typename Composer, typename AE, typename Range >
void inplace_solve( AE & ae, boost::sub_range<Range> range, unsigned cutoff );

template < typename DirectSolver, typename Composer, typename AE, typename Range >
void inplace_solve( AE & ae, boost::sub_range<Range> range, unsigned cutoff ) {
    unsigned size = boost::size(range);
    if ( size <= cutoff ) {
        // small enough: solve directly
        DirectSolver()(range);
    } else {
        // split the range, solve all but the last part asynchronously,
        // solve the last part in this thread, then compose the results
        partition<Range> parts(range, BOOST_PARTS);
        task_type tasks[BOOST_PARTS];
        for (unsigned i = 0; i < BOOST_PARTS-1; ++i) {
            task_type tmp(ae.submit(boost::bind(
                &inplace_solve<DirectSolver,Composer,AE,Range>,
                boost::ref(ae), parts[i], cutoff)));
            tasks[i] = tmp;
        }
        inplace_solve<DirectSolver,Composer,AE,Range>(ae, parts[BOOST_PARTS-1], cutoff);
        for (unsigned i = 0; i < BOOST_PARTS-1; ++i) {
            tasks[i].wait();
        }
        Composer()(range);
    }
}

struct sort_fct {
    template<class RandomAccessRange>
    RandomAccessRange& operator()(RandomAccessRange rng) {
        return boost::sort(rng);
    }
};

struct inplace_merge_fct {
    template<class BidirectionalRange>
    BidirectionalRange& operator()(BidirectionalRange rng) {
        return boost::inplace_merge(rng,
            boost::begin(rng) + (boost::size(rng)/2));
    }
};

// This implementation masks the ThreadPool.
template <typename Range>
void parallel_sort(Range& range, unsigned cutoff = 10000) {
    pool_type pool( boost::tp::poolsize( 2 ) );
    boost::sub_range<Range> rng(boost::begin(range), boost::end(range));
    inplace_solve<sort_fct,inplace_merge_fct,pool_type,Range>(pool, rng, cutoff);
}

The timing differences on my computer are not too good. The sort was done on 400000 elements, once on a reversed container and once on an already sorted container. In "parallel_sort xxx", xxx is the cutoff parameter, with BOOST_PARTS=2:

std::sort: reverse 0..400000 37 milli seconds
std::sort: 0..400000 35 milli seconds
boost::sort: reverse 0..400000 38 milli seconds
boost::sort: 0..400000 34 milli seconds
parallel_sort 200000: reverse 0..400000 24 milli seconds
parallel_sort 200000: 0..400000 24 milli seconds
parallel_sort 100000: reverse 0..400000 28 milli seconds
parallel_sort 100000: 0..400000 26 milli seconds
parallel_sort 50000: reverse 0..400000 31 milli seconds
parallel_sort 50000: 0..400000 29 milli seconds
parallel_sort 25000: reverse 0..400000 32 milli seconds
parallel_sort 25000: 0..400000 34 milli seconds
parallel_sort 12500: reverse 0..400000 37 milli seconds
parallel_sort 12500: 0..400000 34 milli seconds

My computer has only two cores. I expected that a cutoff of 50000 would give a better time than a cutoff of 200000. Could someone with access to a computer with more cores run the test, so we can compare the results?

I have also implemented parallel_sort without masking the pool:

template <typename AE, typename Range>
void parallel_sort(AE& ae, Range& range, unsigned cutoff = 10000) {
    boost::iterator_range<typename boost::range_iterator<Range>::type> rng(range);
    inplace_solve<sort_fct,inplace_merge_fct,pool_type,Range>(ae, rng, cutoff);
}

In this case we win ~3 milliseconds by not creating the thread pool:

std::sort: reverse 0..400000 37 milli seconds
std::sort: 0..400000 35 milli seconds
boost::sort: reverse 0..400000 36 milli seconds
boost::sort: 0..400000 34 milli seconds
parallel_sort 200000: reverse 0..400000 27 milli seconds
parallel_sort 200000: 0..400000 23 milli seconds
parallel_sort 100000: reverse 0..400000 29 milli seconds
parallel_sort 100000: 0..400000 27 milli seconds
parallel_sort 50000: reverse 0..400000 29 milli seconds
parallel_sort 50000: 0..400000 30 milli seconds
parallel_sort 25000: reverse 0..400000 31 milli seconds
parallel_sort 25000: 0..400000 29 milli seconds
parallel_sort 12500: reverse 0..400000 32 milli seconds
parallel_sort 12500: 0..400000 27 milli seconds

This leads me to think that the application should not create ThreadPools for specific algorithms, but use a default one. V21 of Boost.ThreadPool provides a boost::this_task::get_thread_pool<pool_type>(), but this function works only if called from a task. It would be great if Boost.ThreadPool provided a default pool which could be recovered with this function on threads that are not worker threads of a ThreadPool. Otherwise the application is forced either to add a parameter to all the functions or to define its own default thread pool statically.

Oliver, you are free to take this as an example for the ThreadPool documentation, if you consider it can improve the understanding of what can be done with your library.

Best,
Vicente
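[For readers without the attached test file, the strategy for BOOST_PARTS == 2 can be sketched self-contained, with plain boost::thread standing in for the pool; two_way_parallel_sort is a hypothetical name, not Vicente's code:]

#include <algorithm>
#include <cstddef>
#include <boost/thread.hpp>
#include <boost/bind.hpp>

// Self-contained illustration of the divide/sort/merge strategy above:
// sort one half in a new thread, the other half in the current thread,
// then compose the two sorted halves with std::inplace_merge.
template <typename Iter>
void two_way_parallel_sort(Iter first, Iter last, std::size_t cutoff) {
    std::size_t size = static_cast<std::size_t>(last - first);
    if (size <= cutoff) {
        std::sort(first, last);  // small enough: solve directly
        return;
    }
    Iter mid = first + size / 2;
    boost::thread left(boost::bind(&two_way_parallel_sort<Iter>, first, mid, cutoff));
    two_way_parallel_sort(mid, last, cutoff);
    left.join();
    std::inplace_merge(first, mid, last);
}

Note that this sketch spawns a new thread per split rather than reusing pooled workers, which is exactly the overhead the threadpool is meant to avoid.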

This leads me to think that the application should not create ThreadPools for specific algorithms, but use a default one. V21 of Boost.ThreadPool provides a boost::this_task::get_thread_pool<pool_type>(), but this function works only if called from a task. It would be great if Boost.ThreadPool provided a default pool which could be recovered with this function on threads that are not worker threads of a ThreadPool. Otherwise the application is forced either to add a parameter to all the functions or to define its own default thread pool statically.
[Edouard A.] First of all, don't be disappointed by the performance gain. While it is true that you can hope for almost 100% performance gain on a dual-core CPU, there are a lot of parameters to consider (including the CPU itself; what is the specific CPU you are using?). I also think that having a default pool is a must. What you can do is have a boost::tp::get_default_pool() that uses the Meyers singleton idiom (i.e. the function returns a reference to a static local variable). You should also compare your results with Intel TBB's parallel_sort, and use containers designed for multithreaded operation (this is especially true for vectors). If TBB offers a similar performance gain, then maybe the problem is your CPU. Also test on random data: std::generate(values1.begin(), values1.end(), &rand); For your information, I worked a little on a parallel implementation that didn't use any task scheduling; performance was 20% below TBB. Using quicksort or merge sort offered similar performance. With your strategy you should land between those 20% and TBB (perhaps you can even outperform TBB!). The overhead of scheduling is compensated by the fact that the load is equally distributed amongst the cores. This is the conclusion of my tests on a Q6600 under Vista-64. Regards. -- Edouard
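[A minimal sketch of that suggestion, reusing the pool_type typedef from Vicente's example; get_default_pool() is the proposed addition, not part of the V21 library:]

// Sketch of the proposed Meyers-singleton default pool. Nothing here is
// existing V21 API; pool_type is the typedef from Vicente's test file.
pool_type& get_default_pool() {
    // Constructed on first call, destroyed at program exit. Before C++11
    // this initialization is not guaranteed thread-safe, so the first call
    // should happen before concurrent use.
    static pool_type instance( boost::tp::poolsize( 2 ) );  // size is a placeholder; see the sizing discussion below
    return instance;
}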

On Monday 02 March 2009 00:06:25, Edouard A. wrote:
I also think that having a default pool is a must. What you can do is have a boost::tp::get_default_pool() that uses the Meyers singleton idiom (i.e. the function returns a reference to a static local variable).
How many worker-threads should be created inside the default threadpool? Which scheduling strategy should be applied? The pool could use FIFO ordering of the queued tasks, but the right number of worker-threads depends on how the threadpool is used. The pool could create as many worker-threads as CPUs are online and bind each thread to one specific CPU/core; CPU-intensive apps could benefit from this scenario. Applications with a lot of IO usually require more worker-threads than CPUs. Hmmm. Oliver

How many worker-threads should be created inside the default threadpool?

[Edouard A.] As many threads as there are physical cores available sounds like a nice default to me.

Which scheduling strategy should be applied? The pool could use FIFO ordering of the queued tasks, but the right number of worker-threads depends on how the threadpool is used. The pool could create as many worker-threads as CPUs are online and bind each thread to one specific CPU/core; CPU-intensive apps could benefit from this scenario.

[Edouard A.] Unless you know what you're doing, it's best to let the OS decide on which CPU a thread should be running. For a default pool I submit it is better to leave affinity alone. As noted above, I agree with one thread per core. Don't forget you're going to have threads outside your pool also requesting the preciouuuus CPU resource.

Applications with a lot of IO usually require more worker-threads than CPUs.

[Edouard A.] You are correct, but this doesn't sound like a use case for a "default" pool. -- Edouard
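[The one-thread-per-core default agreed on above can be queried portably with Boost.Thread; a small sketch, where the fallback value 2 for an unknown core count is an assumption:]

#include <boost/thread/thread.hpp>

// One worker per online core for a CPU-bound default pool.
// boost::thread::hardware_concurrency() returns 0 when the count is
// unknown, so fall back to a small fixed value (2 is an assumption).
unsigned default_pool_size() {
    unsigned n = boost::thread::hardware_concurrency();
    return n != 0 ? n : 2;
}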

----- Original Message ----- From: <k-oli@gmx.de> To: <boost@lists.boost.org> Sent: Monday, March 02, 2009 7:20 PM Subject: Re: [boost] [threadpool] parallel_sort example
On Monday 02 March 2009 00:06:25, Edouard A. wrote:
I also think that having a default pool is a must. What you can do is have a boost::tp::get_default_pool() that uses the Meyers singleton idiom (i.e. the function returns a reference to a static local variable).
How many worker-threads should be created inside the default threadpool? Which scheduling strategy should be applied?
The pool could use FIFO ordering of the queued tasks, but the right number of worker-threads depends on how the threadpool is used. The pool could create as many worker-threads as CPUs are online and bind each thread to one specific CPU/core; CPU-intensive apps could benefit from this scenario. Applications with a lot of IO usually require more worker-threads than CPUs.
Hmmm
Hi Oliver, You are right, the number of worker-threads depends on the usage. But the fact that the default pool should be configurable does not imply that we cannot have one. The configuration can be done by compilation flags, or the user can change the features of the default pool at runtime; there are a lot of possibilities. What is clear to me is that the user needs this default. Best, Vicente
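[One of the possibilities Vicente mentions, sketched: let the application configure the default pool's size, provided it does so before first use. Nothing here is library API; get_default_pool and default_pool_size_setting are hypothetical names:]

// Hypothetical runtime configuration of the default pool. The application
// may overwrite the setting once, before anything calls get_default_pool().
unsigned& default_pool_size_setting() {
    static unsigned size = 2;  // used if the application never configures it
    return size;
}

pool_type& get_default_pool() {
    static pool_type instance( boost::tp::poolsize( default_pool_size_setting() ) );
    return instance;
}

// Usage, early in main():
//   default_pool_size_setting() = 8;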

vicente.botet wrote:
I have implemented a parallel sort with the threadpool library
Hi Vicente, I have not tried to fully understand your code, but can you explain:
partition<Range> parts(range, BOOST_PARTS);
return boost::inplace_merge(rng, boost::begin(rng)+(boost::size(rng)/2));
Is this right when BOOST_PARTS != 2? Phil.

----- Original Message ----- From: "Phil Endecott" <spam_from_boost_dev@chezphil.org> To: <boost@lists.boost.org> Sent: Monday, March 02, 2009 3:35 PM Subject: Re: [boost] [threadpool] parallel_sort example
vicente.botet wrote:
I have implemented a parallel sort with the threadpool library
Hi Vicente,
I have not tried to fully understand your code, but can you explain:
partition<Range> parts(range, BOOST_PARTS);
return boost::inplace_merge(rng, boost::begin(rng)+(boost::size(rng)/2));
Is this right when BOOST_PARTS != 2?
Good point, Phil. No, this does not work. If BOOST_PARTS is 4, we will have 4 sorted partitions that need to be merged, and boost::inplace_merge works only for two parts. I don't know if a parallel merge could improve the performance on processors with more cores. A merge of N sorted parts should be easy to parallelize, but I have not done this yet. Maybe this will be the next step. Thanks for the remark, Vicente
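[For the record, the simplest fix for BOOST_PARTS > 2 is to compose the sorted partitions pairwise; a sketch under the assumption that the parts are contiguous and equal-sized, with hypothetical names:]

#include <algorithm>
#include <cstddef>

// Hypothetical sketch: compose N contiguous sorted partitions by merging
// them pairwise, doubling the merged width each round. For N parts this
// performs ceil(log2(N)) rounds of std::inplace_merge.
template <typename Iter>
void merge_parts(Iter first, Iter last, std::size_t parts) {
    std::size_t size = static_cast<std::size_t>(last - first);
    std::size_t width = size / parts;  // assumes equal-sized contiguous parts
    if (width == 0) width = 1;         // guard against parts > size
    for (; width < size; width *= 2) {
        for (std::size_t lo = 0; lo + width < size; lo += 2 * width) {
            std::size_t mid = lo + width;
            std::size_t hi = std::min(lo + 2 * width, size);
            std::inplace_merge(first + lo, first + mid, first + hi);
        }
    }
}

Within each round the merges are independent of one another, so they could themselves be submitted to the pool; that would be the parallel merge mentioned above.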

On Sunday 01 March 2009 22:53:14, vicente.botet wrote:
Oliver, you are free to take this as an example for the ThreadPool documentation, if you consider it can improve the understanding of what can be done with your library.
Hi Vicente, I've executed your parallel_sort2.cpp with NN = 10000000 on a Core 2 Duo E6600 (with BOOST_BIND_WORKER_TO_PROCESSORS, BOOST_DISABLE_ASSERTS and -O2):

std::sort: reverse 0..10000000 158 milli seconds
std::sort: 0..10000000 148 milli seconds
boost::sort: reverse 0..10000000 158 milli seconds
boost::sort: 0..10000000 148 milli seconds
parallel_sort 5000000: reverse 0..10000000 145 milli seconds
parallel_sort 5000000: 0..10000000 121 milli seconds
parallel_sort 2500000: reverse 0..10000000 169 milli seconds
parallel_sort 2500000: 0..10000000 148 milli seconds
parallel_sort 1250000: reverse 0..10000000 190 milli seconds
parallel_sort 1250000: 0..10000000 166 milli seconds
parallel_sort 625000: reverse 0..10000000 201 milli seconds
parallel_sort 625000: 0..10000000 184 milli seconds
parallel_sort 312500: reverse 0..10000000 216 milli seconds
parallel_sort 312500: 0..10000000 203 milli seconds

I've added your example - thx. Oliver

----- Original Message ----- From: <k-oli@gmx.de> To: <boost@lists.boost.org> Sent: Monday, March 02, 2009 10:05 PM Subject: Re: [boost] [threadpool] parallel_sort example
On Sunday 01 March 2009 22:53:14, vicente.botet wrote:
Oliver, you are free to take this as an example for the ThreadPool documentation, if you consider it can improve the understanding of what can be done with your library.
Hi Vicente, I've executed your parallel_sort2.cpp with NN = 10000000 on a Core 2 Duo E6600 (with BOOST_BIND_WORKER_TO_PROCESSORS, BOOST_DISABLE_ASSERTS and -O2):
[snip]
I've added your example - thx.
Just to comment on some variations. With 4 threads I get the following results:

std::sort: reverse 0..400000 37 milli seconds
std::sort: 0..400000 35 milli seconds
boost::sort: reverse 0..400000 36 milli seconds
boost::sort: 0..400000 34 milli seconds
parallel_sort 200000: reverse 0..400000 39 milli seconds
parallel_sort 200000: 0..400000 22 milli seconds
* parallel_sort 100000: reverse 0..400000 26 milli seconds
* parallel_sort 100000: 0..400000 26 milli seconds
parallel_sort 50000: reverse 0..400000 30 milli seconds
parallel_sort 50000: 0..400000 28 milli seconds
parallel_sort 25000: reverse 0..400000 30 milli seconds
parallel_sort 25000: 0..400000 32 milli seconds

With 6 threads:

std::sort: reverse 0..400000 40 milli seconds
std::sort: 0..400000 36 milli seconds
boost::sort: reverse 0..400000 36 milli seconds
boost::sort: 0..400000 39 milli seconds
parallel_sort 200000: reverse 0..400000 43 milli seconds
parallel_sort 200000: 0..400000 21 milli seconds
* parallel_sort 100000: reverse 0..400000 25 milli seconds
* parallel_sort 100000: 0..400000 31 milli seconds
parallel_sort 50000: reverse 0..400000 31 milli seconds
parallel_sort 50000: 0..400000 30 milli seconds
parallel_sort 25000: reverse 0..400000 30 milli seconds
parallel_sort 25000: 0..400000 30 milli seconds

These results seem strange to me for an algorithm that has no IO and no synchronization between tasks. Vicente
participants (4):
- Edouard A.
- k-oli@gmx.de
- Phil Endecott
- vicente.botet