
I have implemented a parallel_sort that is a bit slower than tbb with threadpool (see below for the code). Benchmark results: std::fill 0..1000000 ok : 1 elapsed: 0.01 tbb::parallel_for fill 0..1000000 ok : 1 elapsed: 0.01 std::sort reverse 0..1000000 ok : 1 elapsed: 0.11 tbb::parallel_sort reverse 0..1000000 ok : 1 elapsed: 0.037 boost::tp::sort reverse 0..1000000 ok : 1 elapsed: 0.048 Machine: Q6600 - 4 GiB Ram - Running Vista 64 - x64 release build with full optimizations. Some remarks: * shutdown() renders the thread pool unusable (marks it terminated). This is not what I was looking for. Ideally, a wait() method that just waits for the pool to have no tasks left to process would be better. I would find it cumbersome to manually wait for the result of each task. My 2c. * It's difficult to properly benchmark the two on my development machine that is probably using one core to do things background work * Instrumentation seems to show that the _entry method of the pool is extremely slow. I have not investigated further. * boost::tp::sort benchmark includes thread destruction... Probably eats up some micro secs * My code is probably sub-optimal The code: template <typename RandomIterator> RandomIterator pivot_partition(RandomIterator first, RandomIterator last) { typedef RandomIterator iterator_type; typedef typename iterator_type::difference_type difference_type; typedef typename iterator_type::value_type value_type; iterator_type pivot = first + std::distance(first, last) / 2; // we partition value_type pivot_value = *pivot; // we "save" the pivot in putting it to the first place std::iter_swap(first, pivot); // we skip the pivot otherwise the partitioning won't work as expected iterator_type par_start = first + 1; std::less<value_type> pred; pivot = std::partition(par_start, last, std::bind2nd(pred, pivot_value)); // we restore the pivot std::iter_swap(--pivot, first); return pivot; } template <typename RandomIterator, typename Pool> void rec_pivot_partition(RandomIterator first, RandomIterator last, Pool & p) { if (std::distance(first, last) > 1000) { RandomIterator pivot = pivot_partition(first, last); rec_pivot_partition(first, pivot, p); rec_pivot_partition(pivot, last, p); } else { p.submit(boost::bind(&std::sort<iterator_type>, first, last)); } }; template <typename RandomIterator> void parallel_tp(RandomIterator first, RandomIterator last) { typedef RandomIterator iterator_type; typedef typename iterator_type::difference_type difference_type; typedef typename iterator_type::value_type value_type; difference_type n = std::distance(first, last); RandomIterator l = first; const difference_type block_size = 1000; // we partition until we have "atoms", ie block of size < block_size boost::tp::default_pool & p = boost::tp::get_default_pool(); rec_pivot_partition(first, last, p); p.shutdown(); }