[threadpool] version 22 with default pool

I've uploaded a new version of boost-threadpool. Changes:

* free function get_default_pool(), returning a reference to a static pool< unbounded_channel< fifo > >
* task has a new function result(), returning the internal shared_future
* future-like functions removed from task
* boost-threadpool is no longer a header-only library

Oliver

Hi Oliver,

----- Original Message -----
From: <k-oli@gmx.de>
To: <boost@lists.boost.org>
Sent: Tuesday, March 03, 2009 10:28 PM
Subject: [boost] [threadpool] version 22 with default pool
I've uploaded a new version of boost-threadpool.
changes:

* free function get_default_pool(), returning a reference to a static pool< unbounded_channel< fifo > >
* task has a new function result(), returning the internal shared_future
* future-like functions removed from task
* boost-threadpool is no longer a header-only library
First of all, thanks for incorporating the get_default_pool() and result() functions and the default constructor for the task class. And also for changing jss to boost in the future file (BTW, the reschedule_until example still has some jss::). I would have preserved the future-like functions, but I can live without them.

With the current reschedule_until prototype we cannot write any generic algorithm:

    template< typename Pool, typename R >
    void reschedule_until( boost::shared_future< R > const&);

First, because we need to give the pool type, and generic code is unable to know the pool type statically. Second, because the future parameter is too restrictive: for the future to become ready, another thread (or task) must set the promise. The prototype I'm looking for is

    template< typename Predicate >
    void reschedule_until( Predicate cnd);

In order to achieve this, you need the thread-specific pointer tss_worker_ for the worker threads to be unique across all the pools, so instead of being a member of the pool context it could be a static variable of the worker class:

    namespace this_task {
        template< typename Predicate >
        void reschedule_until( Predicate cnd)
        {
            worker* w( worker::tss_worker_.get() );
            BOOST_ASSERT( w);
            w->reschedule_until( cnd);
        }
    }

The worker class defines a reschedule_until member function as follows:

    template< typename Predicate >
    void worker::reschedule_until( Predicate cnd)
    {
        (*reschedule_until_ptr)( pool_ptr_, cnd);
    }

The worker class will store a reschedule_until function pointer and a pointer to a type-erased pool:

    void * pool_ptr_;
    void (*reschedule_until_ptr)( void*, function< bool() >);

The pool class will have a private static function that casts and forwards to the private member function:

    static void reschedule_until( void* p, function< bool() > cnd)
    {
        static_cast< pool* >( p)->reschedule_until_( cnd);
    }

The pool will pass this function to the worker at construction, so the worker and the pool are more decoupled. As the worker is no longer a template class, its implementation can live in a .cpp file.

As I have already said in another post, I don't know how to implement this_task::sleep with the current reschedule_until interface. I repost here an adapted version in case the message was lost:

"Oliver, I think that one of the use cases is to be able to implement, with the public interface,

    namespace this_task {
        void sleep_until( system_time& abs_time);
    }

for example as

    namespace this_task {
        namespace detail {
            struct time_reached {
                system_time& abs_time_;
                time_reached( system_time& abs_time) : abs_time_( abs_time) {}
                bool operator()()
                { return boost::get_system_time() >= abs_time_; }
            };
        }
        void sleep_until( system_time& abs_time)
        {
            detail::time_reached t( abs_time);
            this_task::reschedule_until( t);
        }
    }

This sounds good, but as we don't want the user to write different code depending on whether the function is called from a task or from a plain thread, we just need a sleep that, depending on the kind of thread, either reschedules or calls the thread version:

    void sleep_until( system_time& abs_time)
    {
        if ( this_task::present() )
        {
            detail::time_reached t( abs_time);
            this_task::reschedule_until( t);
        }
        else
        {
            this_thread::sleep( abs_time);
        }
    }

So the function present() - or any better name - should be added."

To summarize: in order to implement this_task::sleep without blocking the worker thread, we need

    namespace this_task {
        template< typename Predicate >
        void reschedule_until( Predicate cnd);
        bool present();
    }

Oliver, let me know if you have found a way to implement this_task::sleep_until without blocking the current thread.
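For illustration, a minimal sketch of the kind of generic helper the predicate form would enable (assuming shared_future exposes is_ready(), as in Anthony Williams' futures library):

    // works inside any pool - no Pool template parameter required
    template< typename R >
    void wait_in_task( boost::shared_future< R > & f)
    {
        this_task::reschedule_until(
            boost::bind( &boost::shared_future< R >::is_ready, &f) );
    }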
Thanks for all the good work already accomplished, Vicente

Hello Oliver,
I've uploaded a new version of boost-threadpool.
changes:

* free function get_default_pool(), returning a reference to a static pool< unbounded_channel< fifo > >
* task has a new function result(), returning the internal shared_future
* future-like functions removed from task
* boost-threadpool is no longer a header-only library
Unfortunately, I am unable to compile the tp library. TT

default_pool.cpp doesn't compile... I edited it enough to get it to compile, but afterwards it could not link with boost thread. I worked around this by including the .cpp directly into my project, but I thought I should let you know.

    struct static_pool
    { static default_pool instance; };
    }

    inline default_pool & get_default_pool()
    { return detail::static_pool::instance; }
    }

By the way, why didn't you write it as:

    inline default_pool & get_default_pool()
    {
        static default_pool instance( poolsize( thread::hardware_concurrency() ) );
        return instance;
    }

This eliminates the need for a .cpp for default_pool.

Another remark is that I am not sure I understand the raison d'être of the poolsize type. What is the added value over, say, int or std::size_t? This prevents us from writing

    default_pool instance( thread::hardware_concurrency() );

(or any other value, for that matter).

When scheduling several tasks with submit, I find it odd that I don't have a

    pool.wait()

that would allow me to wait until all tasks are finished (a condition_variable-like pattern, for example).

I'm currently writing a simple parallel_sort for the threadpool library to deny or confirm Vicente's discoveries. A simple parallel_sort where I would split the input into blocks of size 1000, sort these blocks in parallel and then merge them (at first in a serial manner, then in a parallel manner, depending on how much time I can allocate to this task).

Some tests done with parallel "filling" on a concurrent_vector show that tbb does better but threadpool is not humiliated:

    std::fill reverse 0..1000000 ok : 1000000 elapsed: 0.403
    boost::tp::fill reverse 0..1000000 ok : 1000000 elapsed: 0.274
    tbb::parallel_for fill 0..1000000 ok : 1000000 elapsed: 0.261

Run made on a Q6600; we're not going much faster for hardware reasons, I think. I don't remember exactly the Q6600 architecture, but I believe the four cores don't all get direct memory access (2x2, I think?), notwithstanding the memory bandwidth limitation.

This is how I implemented tp::fill:

    template <typename RandomIterator, typename Value>
    void parallel_fill(RandomIterator first, RandomIterator last, Value v)
    {
        typedef typename RandomIterator::difference_type difference_type;
        typedef boost::tp::task<void> task_type;

        difference_type n = std::distance(first, last);
        boost::tp::default_pool p(boost::tp::poolsize(4));
        RandomIterator l = first;
        RandomIterator i = l;
        const difference_type block_size = 1000;
        for (; i != last; l = i)
        {
            std::advance(i, std::min BOOST_PREVENT_MACRO_SUBSTITUTION
                (std::distance(i, last), block_size));
            BOOST_ASSERT(l < i);
            BOOST_ASSERT(l != last);
            BOOST_ASSERT(i != first);
            p.submit(boost::bind(&std::fill<RandomIterator, Value>, l, i, v));
        }
    }

Kind regards. -- EA

Hello,

On Sunday 08 March 2009 15:29:05, Edouard A. wrote:
Unfortunately, I am unable to compile the tp library. TT
Hmm - did you try version 23? I have no problems compiling default_pool (gcc-4.3).
By the way, why didn't you write it as:
    inline default_pool & get_default_pool()
    {
        static default_pool instance( poolsize( thread::hardware_concurrency() ) );
        return instance;
    }
static locals are not thread-safe
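A portable C++03 workaround is lazy initialization guarded by boost::call_once - a sketch with illustrative names, not the library's actual code:

    #include <boost/thread/once.hpp>
    #include <boost/thread/thread.hpp>

    namespace detail {
        boost::once_flag default_pool_once = BOOST_ONCE_INIT;
        default_pool * default_pool_instance = 0;

        void init_default_pool()
        {
            // leaked deliberately, so destruction order at exit is a non-issue
            default_pool_instance =
                new default_pool( poolsize( boost::thread::hardware_concurrency() ) );
        }
    }

    inline default_pool & get_default_pool()
    {
        boost::call_once( detail::default_pool_once, detail::init_default_pool );
        return *detail::default_pool_instance;
    }

This keeps the header-only, .cpp-free property while staying thread-safe.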
This eliminates the need of a .cpp for default_pool.
code bloat!
Another remark is that I am not sure I understand the raison d'être of the poolsize type. What is the added value to say, "int" or std::size_t. This prevents us from writing
pool accepts multiple integer arguments (poolsize, high_watermark, low_watermark, ...) - with specific types for each argument, the meaning of each arg becomes clear (see Scott Meyers' advice).
When scheduling several tasks with submit, I find it odd that I don't have a pool.wait()
pool::shutdown() or pool::shutdown_now() -> diff. explained in the docs
that would allow me to wait until all tasks are finished (condition_variable-like pattern for example).
Some tests done with parallel "filling" on a concurrent_vector show that tbb does better but threadpool is not humiliated:
A task is finished when its result (shared_future) becomes ready:

    shared_future< R > task< R >::result()
    void shared_future< R >::wait()

As for the benchmark: this is expected, because threadpool uses a single-lock global queue. I'm working on a lock-free global queue - I hope threadpool becomes faster.
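Concretely, waiting for one task would look like this (a sketch; compute_answer stands in for any callable):

    // submit work, then block on the task's shared_future until it is ready
    boost::tp::task< int > t =
        boost::tp::get_default_pool().submit( boost::bind( &compute_answer) );
    boost::shared_future< int > f = t.result();
    f.wait();       // returns once a worker has produced the value
    int answer = f.get();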
This is how I implemented tp::fill:
template <typename RandomIterator, typename Value> void parallel_fill(RandomIterator first, RandomIterator last, Value v) { typedef typename RandomIterator::difference_type difference_type;
difference_type n = std::distance(first, last);
boost::tp::default_pool p(boost::tp::poolsize(4));
The default pool should already be available - I don't understand why you initialize it with a poolsize. Oliver
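In other words (a sketch, reusing the names from the parallel_fill above): drop the locally constructed pool and submit to the shared one:

    // reuse the process-wide pool instead of constructing one per call
    boost::tp::default_pool & p = boost::tp::get_default_pool();
    p.submit( boost::bind( &std::fill< RandomIterator, Value >, l, i, v) );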

On Sunday 08 March 2009 15:29:05, Edouard A. wrote:
Unfortunately, I am unable to compile the tp library. TT
Hmm - did you try version 23? I have no problems compiling default_pool (gcc-4.3). [Edouard A.]
Yes, this is version 23, I am using Visual Studio 2008.
pool accepts multiple integer arguments (poolsize, high_watermark, low_watermark, ...) - with specific types for each argument, the meaning of each arg becomes clear (see Scott Meyers' advice). [Edouard A.]
When I initialize a std::vector, I'm happy to be able to write std::vector<int> v(10) and not std::vector<int> v(vector_size(10)). Sometimes a parameter's purpose is obvious. Your call.
When scheduling several tasks with submit, I find it odd that I don't have a pool.wait()
pool::shutdown() or pool::shutdown_now() -> diff. explained in the docs
[Edouard A.] I'm afraid the method name might be a bit misleading; I would expect a shutdown() method to kill my pool and make it unusable for later use without calling a startup() function.
A task is finished when its result (shared_future) becomes ready: shared_future< R > task< R >::result(), void shared_future< R >::wait(). [Edouard A.]
I guess your internal scheduler knows when all tasks are processed, so why not offer a condition variable... Otherwise it means I have to scan all the tasks to know whether they are all finished. But you pointed out shutdown() and I think it will do the trick.
Some tests done with parallel "filling" on a concurrent_vector show that tbb does better but threadpool is not humiliated: this is expected because threadpool uses a single-lock global queue. I'm working on a lock-free global queue - I hope threadpool becomes faster. [Edouard A.]
For long queues, most likely.
the default pool should already be avaliable - I don't understand why you initialize it with poolsize.
Before your explanation the purpose of the shutdown() method was unclear to me, so it was easier to create my own pool and let the default destructor wait for all the threads. -- EA

On Sunday 08 March 2009 18:38:05, Edouard A. wrote:
On Sunday 08 March 2009 15:29:05, Edouard A. wrote:
Unfortunately, I am unable to compile the tp library. TT
Hmm - did you try version 23? I have no problems compiling default_pool (gcc-4.3).
[Edouard A.]
Yes, this is version 23, I am using Visual Studio 2008.
I'll test it with ms-vc soon
pool accepts multiple integer arguments (poolsize, high_watermark, low_watermark, ...) - with specific types for each argument, the meaning of each arg becomes clear (see Scott Meyers' advice).
[Edouard A.]
When I initialize a std::vector, I'm happy to be able to write std::vector<int> v(10) and not std::vector<int> v(vector_size(10)). Sometimes a parameter's purpose is obvious. Your call.
In your example vector has only one argument - tp::pool has more: tp::pool(poolsize, high_watermark, low_watermark, ...), and this is more intuitive than tp::pool(int, int, int, ...).
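For example (a sketch; the exact constructor argument list is per the threadpool docs):

    // each argument announces its meaning at the call site
    boost::tp::pool< boost::tp::unbounded_channel< boost::tp::fifo > > p(
        boost::tp::poolsize( 8) );
    // versus a hypothetical pool( 8, 64, 16) - which int is the size,
    // which the high watermark, which the low?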
When scheduling several tasks with submit, I find it odd that I don't have a pool.wait()
pool::shutdown() or pool::shutdown_now() -> diff. explained in the docs
[Edouard A.]
I'm afraid the method name might be a bit misleading; I would expect a shutdown() method to kill my pool and make it unusable for later use without calling a startup() function.
I don't think it is misleading, because it does what its name says - it shuts down the pool. pool::shutdown() waits for all tasks and pool::shutdown_now() interrupts all running tasks. It makes no sense to introduce a wait() member function, because you have to wait on the future associated with the task (member function shared_future< R > task< R >::result()) - please look into the future documentation (the link is inside the threadpool docs).
I guess your internal scheduler knows when all tasks are processed, so why not offer a condition variable... Otherwise it means I have to scan all the tasks to know whether they are all finished. But you pointed out shutdown() and I think it will do the trick.

No, this is completely wrong - you can wait for multiple tasks without the need to shut down the pool! Call wait_for_all() or wait_for_any() from the futures library (from Anthony Williams) for the tasks you are waiting for (you have to pass the shared_future from each task).
Oliver

I have implemented a parallel_sort that is a bit slower than tbb with threadpool (see below for the code).

Benchmark results:

    std::fill 0..1000000 ok : 1 elapsed: 0.01
    tbb::parallel_for fill 0..1000000 ok : 1 elapsed: 0.01
    std::sort reverse 0..1000000 ok : 1 elapsed: 0.11
    tbb::parallel_sort reverse 0..1000000 ok : 1 elapsed: 0.037
    boost::tp::sort reverse 0..1000000 ok : 1 elapsed: 0.048

Machine: Q6600 - 4 GiB RAM - running Vista 64 - x64 release build with full optimizations.

Some remarks:

* shutdown() renders the thread pool unusable (marks it terminated). This is not what I was looking for. Ideally, a wait() method that just waits for the pool to have no tasks left to process would be better. I would find it cumbersome to manually wait for the result of each task. My 2c.
* It's difficult to properly benchmark the two on my development machine, which is probably using one core for background work.
* Instrumentation seems to show that the _entry method of the pool is extremely slow. I have not investigated further.
* The boost::tp::sort benchmark includes thread destruction... probably eats up some microseconds.
* My code is probably sub-optimal.

The code:

    template <typename RandomIterator>
    RandomIterator pivot_partition(RandomIterator first, RandomIterator last)
    {
        typedef RandomIterator iterator_type;
        typedef typename iterator_type::difference_type difference_type;
        typedef typename iterator_type::value_type value_type;

        iterator_type pivot = first + std::distance(first, last) / 2;

        // we partition
        value_type pivot_value = *pivot;

        // we "save" the pivot by putting it in the first place
        std::iter_swap(first, pivot);

        // we skip the pivot, otherwise the partitioning won't work as expected
        iterator_type par_start = first + 1;

        std::less<value_type> pred;
        pivot = std::partition(par_start, last, std::bind2nd(pred, pivot_value));

        // we restore the pivot
        std::iter_swap(--pivot, first);
        return pivot;
    }

    template <typename RandomIterator, typename Pool>
    void rec_pivot_partition(RandomIterator first, RandomIterator last, Pool & p)
    {
        if (std::distance(first, last) > 1000)
        {
            RandomIterator pivot = pivot_partition(first, last);
            rec_pivot_partition(first, pivot, p);
            rec_pivot_partition(pivot, last, p);
        }
        else
        {
            p.submit(boost::bind(&std::sort<RandomIterator>, first, last));
        }
    }

    template <typename RandomIterator>
    void parallel_tp(RandomIterator first, RandomIterator last)
    {
        typedef RandomIterator iterator_type;
        typedef typename iterator_type::difference_type difference_type;
        typedef typename iterator_type::value_type value_type;

        difference_type n = std::distance(first, last);
        RandomIterator l = first;
        const difference_type block_size = 1000;

        // we partition until we have "atoms", i.e. blocks of size < block_size
        boost::tp::default_pool & p = boost::tp::get_default_pool();
        rec_pivot_partition(first, last, p);
        p.shutdown();
    }

I have implemented a parallel_sort that is a bit slower than tbb with threadpool (see below for the code).
* shutdown() renders the thread pool unusable (marks it terminated). This is not what I was looking for. Ideally, a wait() method that just waits for the pool to have no tasks left to process would be better. I would find it cumbersome to manually wait for the result of each task. My 2c.
On Sunday 08 March 2009 19:48:11, Edouard A. wrote:

threadpool uses a single-lock global queue - if a thread enters the queue to enqueue/dequeue a task, it acquires the lock, so no other thread can enter the queue. With a lock-free implementation it should work faster.

pool::shutdown() is not correct - use boost::wait_for_all() or boost::wait_for_any() from the futures library from Anthony Williams (I've added the lib to the threadpool archive):

    task< int > tsk1 = pool.submit(...);
    task< string > tsk2 = pool.submit(...);
    task< int > tsk3 = pool.submit(...);
    wait_for_all( tsk1.result(), tsk2.result(), tsk3.result());
    // here all tasks are finished
* Instrumentation seems to show that the _entry method of the pool is extremely slow. I have not investigated further.
The pool::entry_ method is entered by a boost::thread only on startup and is only left when the worker thread terminates. Oliver

I have implemented a parallel_sort that is a bit slower than tbb with threadpool (see below for the code).
On Sunday 08 March 2009 19:48:11, Edouard A. wrote:

threadpool uses a single-lock global queue - if a thread enters the queue to enqueue/dequeue a task, it acquires the lock, so no other thread can enter the queue. With a lock-free implementation it should work faster.
This is quite possible. I'm looking forward to testing it with a lock-free queue. I've also seen a lot of .lock()/.unlock() in the code... Sometimes holding the lock is faster than releasing it and acquiring it again.
pool::shutdown() is not correct - use boost::wait_for_all() or boost::wait_for_any() from the futures library from Anthony Williams (I've added the lib to the threadpool archive):
    task< int > tsk1 = pool.submit(...);
    task< string > tsk2 = pool.submit(...);
    task< int > tsk3 = pool.submit(...);
    wait_for_all( tsk1.result(), tsk2.result(), tsk3.result());
    // here all tasks are finished
I'm not sure it's a good idea to force the user to collect the tasks and wait for them. I thought threadpool could abstract that away. I mean, it's nice to have the alternative of waiting for specific tasks, but generally you just throw work into the pool and want that work to be finished... Perhaps a task_group object could solve this problem? You could also "link" tasks, i.e. you need dependent tasks to be finished when you wait on one.

Anyway, with wait_for_all it's clearly slower than with shutdown(). Not really surprising, as you go through more abstraction layers and you have to store the tasks' results in a structure. IMHO, the fastest way would probably be to have a condition_variable bound to the number of running tasks in the pool itself.

With wait_for_all:

    std::fill 0..1000000 ok : 1 elapsed: 0.01
    tbb::parallel_for fill 0..1000000 ok : 1 elapsed: 0.01
    std::sort reverse 0..1000000 ok : 1 elapsed: 0.111
    tbb::parallel_sort reverse 0..1000000 ok : 1 elapsed: 0.038
    boost::tp::sort reverse 0..1000000 ok : 1 elapsed: 0.055

Regards. -- EA
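A sketch of what such a task_group might look like (illustrative only, not part of boost-threadpool): a counter plus a condition_variable that is signalled when the last task in the group finishes:

    #include <cstddef>
    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition_variable.hpp>

    class task_group
    {
        boost::mutex mtx_;
        boost::condition_variable cond_;
        std::size_t outstanding_;

    public:
        task_group() : outstanding_( 0) {}

        void enter()    // call once per task, before submitting it
        {
            boost::mutex::scoped_lock lk( mtx_);
            ++outstanding_;
        }

        void leave()    // call at the end of each task body
        {
            boost::mutex::scoped_lock lk( mtx_);
            if ( --outstanding_ == 0)
                cond_.notify_all();
        }

        void wait()     // blocks until every entered task has left
        {
            boost::mutex::scoped_lock lk( mtx_);
            while ( outstanding_ != 0)
                cond_.wait( lk);
        }
    };

The pool itself stays unaware of groups; only the client code that opts in pays for the bookkeeping.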

On Sunday 08 March 2009 20:38:22, Edouard A. wrote:
I've also seen a lot of .lock()/.unlock() in the code... Sometimes holding the lock is faster than releasing it and acquiring it again.
You also have to prevent deadlocks - so sometimes it would be faster to hold a lock, but a deadlock could be raised.
I'm not sure it's a good idea to force the user to collect the tasks and wait for them. I thought threadpool could abstract that away. I mean, it's nice to have the alternative of waiting for specific tasks, but generally you just throw work into the pool and want that work to be finished...
That's what threadpool does - you submit work and you get a handle (task) back for each item. The pool schedules and executes the work on the worker threads. The pool itself is not interested in the result of the work items, nor should the pool have knowledge about the submitted work. This can only be done outside the pool, where you have enough context. So it makes no sense for the pool to wait for a subset of the submitted work.
Anyway, with wait_for_all, it's clearly slower than with shutdown().

Why? Waiting for the futures only signals that the result of the submitted work has been calculated (done via condition variables inside).
Not really surprising, as you go through more abstraction layers and you have to store the tasks' results in a structure. IMHO, the fastest way would probably be to have a condition_variable bound to the number of running tasks in the pool itself.

You could take a look into the future library - because a future is used to transfer the result between threads (using condition variables inside).
Oliver

You also have to prevent deadlocks - so sometimes it would be faster to hold a lock, but a deadlock could be raised.
I figured you did that for a reason. Deadlocks are tricky, and one way is indeed to hold one lock at a time. Another possibility is to make sure you take all the locks in the same order every time... You can also reduce the number of locks. I know this is not straightforward and it takes a lot of time to find the right settings. How many different locks do you have?
That's what threadpool does - you submit work and you get a handle (task) back for each item. The pool schedules and executes the work on the worker threads. The pool itself is not interested in the result of the work items, nor should the pool have knowledge about the submitted work. This can only be done outside the pool, where you have enough context. So it makes no sense for the pool to wait for a subset of the submitted work.
I understand. This sounds logical. But... I don't want to sound narrow-minded, I really think there are use cases where you simply want to know that your pool has done all the work you gave it (without knowing what the work actually was). That would be waiting for pending() and running() to both be 0. Of course, when your threadpool is handling a lot of tasks coming from different clients, that doesn't make sense anymore. In which case it would be nice to have some sort of "root" task on which other tasks depend. You would only need to wait for the root task to finish, making code simpler to write (and maybe the waiting more efficient?). Alternatively you can embed in your task some sort of synchronization mechanism... But I think it's best to have the client write as little synchronization code as possible.
You could take a look into the future library - because a future is used to transfer the result between threads (using condition variables inside).
The problem is that you can have lots of tasks when sorting a container, and that means a lot of overhead with this approach. If I'm correct, when you have many tasks, wait_for_all starts to be slow. Maybe it's just a problem on my platform. I would need to investigate this further. Regards. -- EA

----- Original Message -----
From: "Edouard A." <edouard@fausse.info>
To: <boost@lists.boost.org>
Sent: Sunday, March 08, 2009 11:11 PM
Subject: Re: [boost] [threadpool] version 22 with default pool
That's what threadpool does - you submit work and you get a handle (task) back for each item. The pool schedules and executes the work on the worker threads. The pool itself is not interested in the result of the work items, nor should the pool have knowledge about the submitted work. This can only be done outside the pool, where you have enough context. So it makes no sense for the pool to wait for a subset of the submitted work.
I understand. This sounds logical. But... I don't want to sound narrow-minded, I really think there are use cases where you simply want to know that your pool has done all the work you gave it (without knowing what the work actually was). That would be waiting for pending() and running() to both be 0.
Of course when your threadpool is handling a lot of tasks coming from different clients, that doesn't make sense anymore.
IMO it is up to the user to know which tasks he is waiting for.
In which case it would be nice to have some sort of "root" task on which other tasks depend. You would only need to wait for the root task to finish, making code simpler to write (and maybe the waiting more efficient?).
This is exactly what the user does calling wait_for_all:

    wait_for_all( f1, ..., fn);

Oliver, I think the problem is that the worker thread will block on this call without stealing other tasks. We need to modify the call to wait_for_all in some way, so that instead of

    template< typename F1, typename F2, typename F3 >
    void wait_for_all( F1& f1, F2& f2, F3& f3)
    {
        f1.wait();
        f2.wait();
        f3.wait();
    }

we need

    template< typename T1, typename T2, typename T3 >
    void wait_for_all( task< T1 >& t1, task< T2 >& t2, task< T3 >& t3)
    {
        boost::tp::this_task::reschedule_until( t1.result());
        boost::tp::this_task::reschedule_until( t2.result());
        boost::tp::this_task::reschedule_until( t3.result());
    }
Alternatively you can embed in your task some sort of synchronization mechanism... But I think it's best to have the client write as little synchronization code as possible.
Could you elaborate on this?
You could take a look into the future library - because a future is used to transfer the result between threads (using condition variables inside).
The problem is that you can have lots of tasks when sorting a container, and that means a lot of overhead with this approach. If I'm correct, when you have many tasks, wait_for_all starts to be slow. Maybe it's just a problem on my platform. I would need to investigate this further.
Why is there overhead? Why is wait_for_all slow? Best, Vicente

On Sun, 8 Mar 2009 23:35:00 +0100, "vicente.botet" <vicente.botet@wanadoo.fr> wrote:
Alternatively you can embed in your task some sort of synchronization mechanism... But I think it's best to have the client write as little synchronization code as possible.
Could you elaborate on this?
You could add a barrier in your task. The problem is to avoid having one thread waiting for n threads; if you have threads waiting for each other, I would say it's better. If you are using a kernel object to wait, you pay the cost of a transition even if you don't have to wait.
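For example, with boost::barrier (a sketch; do_chunk_of_work is a placeholder, and the number of participating tasks must be known up front):

    #include <boost/thread/barrier.hpp>

    boost::barrier rendezvous( 4);   // 4 cooperating tasks, fixed in advance

    void worker_task()
    {
        do_chunk_of_work();          // placeholder for the real work
        rendezvous.wait();           // the 4th arrival releases all of them
    }

Note this is precisely the pattern where each waiting thread would block its worker unless reschedule_until-style rescheduling is available.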
The problem is that you can have lots of tasks when sorting a container, and that means a lot of overhead with this approach. If I'm correct, when you have many tasks, wait_for_all starts to be slow. Maybe it's just a problem on my platform. I would need to investigate this further.
Why there is overhead? Why wait_for_all is slow?
I don't know. Maybe for the reason stated above. -- EA

----- Original Message -----
From: "Edouard A." <edouard@fausse.info>
To: <boost@lists.boost.org>
Sent: Monday, March 09, 2009 9:42 AM
Subject: Re: [boost] [threadpool] version 22 with default pool
On Sun, 8 Mar 2009 23:35:00 +0100, "vicente.botet" <vicente.botet@wanadoo.fr> wrote:
Alternatively you can embed in your task some sort of synchronization mechanism... But I think it's best to have the client write as little synchronization code as possible.
Could you elaborate on this?
You could add a barrier in your task. The problem is to avoid having one thread waiting for n threads; if you have threads waiting for each other, I would say it's better.
If you are using a kernel object to wait, you pay the cost of a transition even if you don't have to wait.
Sorry, I'm a little bit lost.
The problem is that you can have lots of tasks when sorting a container, and that means a lot of overhead with this approach. If I'm correct, when you have many tasks, wait_for_all starts to be slow. Maybe it's just a problem on my platform. I would need to investigate this further.
Why is there overhead? Why is wait_for_all slow?
I don't know. Maybe for the reason stated above.
Do you mean that wait_for_all will block the current thread? What about my proposition to specialize wait_for_all when we are in a worker thread?

    template< typename T1, typename T2, typename T3 >
    void wait_for_all( task< T1 >& t1, task< T2 >& t2, task< T3 >& t3)
    {
        boost::this_task::reschedule_until( t1.result());
        boost::this_task::reschedule_until( t2.result());
        boost::this_task::reschedule_until( t3.result());
    }

Or, even better, add an indirection: instead of act.wait(), use wait(act):

    template< typename F1, typename F2, typename F3 >
    void wait_for_all( F1& f1, F2& f2, F3& f3)
    {
        wait( f1);
        wait( f2);
        wait( f3);
    }

and define wait as

    template <typename ACT>
    void wait( ACT& act)
    {
        if ( boost::this_task::is_worker() )
            boost::this_task::reschedule_until( predicate::is_ready( act));
        else
            act.wait();
    }

    template <typename R>
    void wait( tp::task<R>& t)
    {
        t.result().wait();
    }

Boost.Interthreads already has this kind of free function (wait()). Now that Boost.ThreadPool allows checking whether the thread is a worker of a pool, we can add a layer doing exactly that. What do you think? Vicente

If you are using a kernel object to wait, you pay the cost of a transition even if you don't have to wait.
Sorry, I'm a little bit lost. [Edouard A.]
My phrasing was confusing. I meant: even if you don't need to wait, you pay the cost of a context switch when accessing a kernel synchronization object.
Boost.Interthreads already has this kind of free function (wait()). Now that Boost.ThreadPool allows checking whether the thread is a worker of a pool, we can add a layer doing exactly that.
What do you think?
The principle is good; however, I'm afraid that you will add to the cost of the waiting a nasty thing: poisoning. The threads that wait will not be able to schedule other useful tasks. Unless you wish to dedicate a pool for this purpose.

In other news, I've improved the sorting function as you suggested:

    template <typename RandomIterator, typename Pool>
    void rec_pivot_partition(RandomIterator first, RandomIterator last, Pool & p)
    {
        if (std::distance(first, last) > 1000)
        {
            RandomIterator pivot = pivot_partition(first, last);
            boost::tp::task<void> task = p.submit(
                boost::bind(&rec_pivot_partition<RandomIterator, Pool>,
                            boost::ref(first), boost::ref(pivot), boost::ref(p)));
            rec_pivot_partition(pivot, last, p);
            task.result().wait();
        }
        else
        {
            std::sort(first, last);
        }
    }

The results are much better, this time on a larger vector to account for Phil's remark:

    std::fill reverse 0..10000000 elapsed: 0.039 ok : 1
    tbb::parallel_for fill 0..10000000 elapsed: 0.016 ok : 1
    std::sort reverse 0..10000000 elapsed: 1.172 ok : 1
    tbb::parallel_sort reverse 0..10000000 elapsed: 0.269 ok : 1
    boost::tp::sort reverse 0..10000000 elapsed: 0.331 ok : 1

This should not be seen as a real benchmark of a parallel_sort implementation - there are many more cases to evaluate when it comes to sorting - but I think it's safe to say that TBB is significantly more efficient. That should not discourage anyone, of course: TBB is a mature library with a lot of skilled people working on it.

I also increased the size of the block to 10,000 and performance improved a little bit (to ~0.316). If I increase it to 100,000, performance goes back to ~0.331. I initially set the slice to 1,000 because that's the size of a slice in TBB; I wanted, first of all, to compare the engines. I've run a test with sizes between 100 and 150,000 and it seems to have little impact on the outcome. This is strange. It's a bit early to say if it's bad news. See csv & graph attached. I've used GetTickCount to measure.

We should perhaps do a test where we inject a large amount of tasks of precise duration and see how the scheduler behaves. It would also be interesting to measure the delay of execution of one of the tasks (if you know a task should last 1 s, measure how long it actually took).

Kind regards. -- EA

----- Original Message -----
From: "Edouard A." <edouard@fausse.info>
To: <boost@lists.boost.org>
Sent: Monday, March 09, 2009 9:19 PM
Subject: Re: [boost] [threadpool] version 22 with default pool
If you are using a kernel object to wait, you pay the cost of a transition even if you don't have to wait.
Sorry, I'm a little bit lost. [Edouard A.]
My phrasing was confusing. I meant:
Even if you don't need to wait, you pay the cost of context switch when accessing a kernel synchronization object.
The single function accessed will be is_ready, which is not a core function (see below).
Boost.Interthreads already has this kind of free function (wait()). Now that Boost.ThreadPool allows checking whether the thread is a worker of a pool, we can add a layer doing exactly that.
What do you think?
The principle is good; however, I'm afraid that you will add to the cost of the waiting a nasty thing: poisoning. The threads that wait will not be able to schedule other useful tasks. Unless you wish to dedicate a pool for this purpose.
As I have already explained in other threads, the worker thread will not block while waiting, because the blocking wait is replaced by rescheduling until the condition is satisfied (see in particular my last post).
In other news, I've improved the sorting function as you suggested:
    template <typename RandomIterator, typename Pool>
    void rec_pivot_partition(RandomIterator first, RandomIterator last, Pool & p)
    {
        if (std::distance(first, last) > 1000)
        {
            RandomIterator pivot = pivot_partition(first, last);
            boost::tp::task<void> task = p.submit(
                boost::bind(&rec_pivot_partition<RandomIterator, Pool>,
                            boost::ref(first), boost::ref(pivot), boost::ref(p)));
            rec_pivot_partition(pivot, last, p);
            task.result().wait();
        }
        else
        {
            std::sort(first, last);
        }
    }
The results are much better, this time on a larger vector to account Phil's remark:
    std::fill reverse 0..10000000 elapsed: 0.039 ok : 1
    tbb::parallel_for fill 0..10000000 elapsed: 0.016 ok : 1
    std::sort reverse 0..10000000 elapsed: 1.172 ok : 1
    tbb::parallel_sort reverse 0..10000000 elapsed: 0.269 ok : 1
    boost::tp::sort reverse 0..10000000 elapsed: 0.331 ok : 1
This should not be seen as a real benchmark for a parallel_sort implementation - there are many more cases to evaluate when it comes to sorting - but I think it's safe to say that TBB is significantly more efficient. That should not discourage anyone of course: TBB is a mature library with a lot of skilled people working on it.
I also increased the size of the block to 10,000 and performance improved a little bit (to ~0.316). If I increase it to 100,000, performance goes back to ~0.331.
Good news, isn't it?
I initially set up the slice to 1,000 because that's the size of a slice in TBB. I wanted - first of all - to compare the engines.
I've run a test with sizes between 100 and 150,000 and it seems to have little impact on the outcome. This is strange. It's a bit early to say if it's bad news. See csv & graph attached. I've used GetTickCount to measure.
I agree this is suspect.
We should perhaps do a test where we inject a large amount of tasks of precise duration and see how the scheduler behaves. It would also be interesting to measure the delay of execution of one of the tasks (if you know a task should last 1 s, measure how long it actually took).
Profiling will be welcome. Best, Vicente

I also increased the size of the block to 10,000 and performance improved a little bit (to ~0.316). If I increase it to 100,000, performance goes back to ~0.331.
Good new, isn't it?
Yes, but there is room for improvement... As long as we don't go four times faster than std::sort, we can do better. ;)
I've run a test with sizes between 100 and 150,000 and it seems to have little impact on the outcome. This is strange. It's a bit early to say if it's bad news. See csv & graph attached. I've used GetTickCount to measure.
I agree this is suspect.
We should perhaps do a test where we inject a large amount of tasks of precise duration and see how the scheduler behaves. It would also be interesting to measure the delay of execution of one of the tasks (if you know a task should last 1 s, measure how long it actually took).
Profiling will be welcome.
The latency + bandwidth test should explain why the slice's size doesn't seem to affect the performance. For the test task I see something like:

    for (int count = ::GetTickCount(); count != target; count = ::GetTickCount())
        ;

This is a trivial spin to make sure the tasks eat up some CPU for the desired amount of ms. -- EA
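Fleshed out a little (a sketch; Win32 only, and where queue_delay gets recorded is left open). Comparing elapsed time with < rather than != also avoids spinning forever when the tick counter skips a value:

    #include <windows.h>   // DWORD, ::GetTickCount()

    struct timed_task
    {
        DWORD submitted_at;   // ::GetTickCount() taken at submission
        DWORD duration_ms;    // how long the task should burn CPU

        void operator()() const
        {
            DWORD started_at = ::GetTickCount();
            DWORD queue_delay = started_at - submitted_at;   // scheduler latency
            while ( ::GetTickCount() - started_at < duration_ms)
                ;             // spin for ~duration_ms
            (void) queue_delay;   // record somewhere thread-safe for analysis
        }
    };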

----- Original Message ----- From: "Edouard A." <edouard@fausse.info> To: <boost@lists.boost.org> Sent: Monday, March 09, 2009 10:13 PM Subject: Re: [boost] [threadpool] version 22 with default pool
I also increased the size of the block to 10,000 and performance improved a little bit (to ~0.316). If I increase it to 100,000, performance goes back to ~0.331.
Good news, isn't it?
Yes, but there is room for improvement... As long as we don't go four times faster than std::sort, we can do better. ;)
We are already ~3.7 times better. I suspect tbb::parallel_sort doesn't use std::sort as its base algorithm.
The latency + bandwidth test should explain why the slice's size doesn't seem to affect the performance. For the test task I see something like:
    for (int count = ::GetTickCount(); count != target; count = ::GetTickCount())
        ;
This is a trivial spin to make sure the tasks eat up some CPU for the desired amount of ms.
Maybe you can do the test; I work on Cygwin. Vicente

We are already ~3.7 times better. I suspect tbb::parallel_sort doesn't use std::sort as its base algorithm.
They actually use std::sort. Why should they use something else? :) I wonder if we could do better with median-of-3 pivot selection.
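A median-of-3 chooser could slot into the pivot_partition shown earlier - a sketch:

    #include <algorithm>   // std::iter_swap
    #include <iterator>    // std::distance

    // order three samples (first, middle, last-1) in place and return the
    // middle one; guards against worst-case pivots on sorted/reversed input
    template <typename RandomIterator>
    RandomIterator median_of_3(RandomIterator first, RandomIterator last)
    {
        RandomIterator mid  = first + std::distance(first, last) / 2;
        RandomIterator back = last - 1;
        if (*mid  < *first) std::iter_swap(mid,  first);
        if (*back < *first) std::iter_swap(back, first);
        if (*back < *mid)   std::iter_swap(back, mid);
        return mid;   // *mid now holds the median of the three samples
    }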
    for (int count = ::GetTickCount(); count != target; count = ::GetTickCount())
        ;
This is a trivial spin to make sure the tasks eat up some CPU for the desired amount of ms.
Maybe you can do the test; I work on Cygwin.
I will do it, yes, but not today. ;) -- EA

Edouard A. wrote:
In which case it would be nice to have some sort of "root" task on which other tasks depend. You would only need to wait for the root task to finish, making code simpler to write (and maybe the waiting more efficient to write?).
If you want a lightweight mechanism and can afford to be 'a bit wrong', then you could just assign task ids from a 128-bit number and return the id from submission; then you could perhaps wait for 'lowest task > n'. In most systems you get a more-or-less FIFO approach and it won't be far out - and it's cheap. That or perhaps have a 'task group' facility where you can optionally associate tasks with a group and check the waiting+executing count per group, which might use an atomic counter and an event that fires iff the atomic decrement went to 0. I think you have to allow pools to have extremely bursty usage patterns and high throughput - anything that locks, sorts or otherwise stalls the queue is Bad News. James

In most systems you get a more-or-less FIFO approach and it won't be far out - and it's cheap. That or perhaps have a 'task group' facility where you can optionally associate tasks with a group and check the waiting+executing count per group, which might use an atomic counter and an event that fires iff the atomic decrement went to 0.
That's a good idea. That would exhibit different things I guess.
I think you have to allow pools to have extremely bursty usage patterns and high throughput - anything that locks, sorts or otherwise stalls the queue is Bad News.
I haven't reviewed the threadpool code. The only glance I had at it indicated that, yes, there seem to be way too many locks around. I will wait for the tests to point out some strange behavior before digging further, because I'm incredibly lazy. :) -- EA

----- Original Message -----
From: "Edouard A." <edouard@fausse.info>
To: <boost@lists.boost.org>
Sent: Sunday, March 08, 2009 7:48 PM
Subject: Re: [boost] [threadpool] version 22 with default pool
I have implemented a parallel_sort that is a bit slower than tbb with threadpool (see below for the code).
Benchmark results:
    std::fill 0..1000000 ok : 1 elapsed: 0.01
    tbb::parallel_for fill 0..1000000 ok : 1 elapsed: 0.01
    std::sort reverse 0..1000000 ok : 1 elapsed: 0.11
    tbb::parallel_sort reverse 0..1000000 ok : 1 elapsed: 0.037
    boost::tp::sort reverse 0..1000000 ok : 1 elapsed: 0.048
Machine: Q6600 - 4 GiB Ram - Running Vista 64 - x64 release build with full optimizations.
Some remarks:
* shutdown() renders the thread pool unusable (marks it terminated). This is not what I was looking for. Ideally, a wait() method that just waits for the pool to have no tasks left to process would be better. I would find it cumbersome to manually wait for the result of each task. My 2c.
* It's difficult to properly benchmark the two on my development machine, which is probably using one core for background work.
* Instrumentation seems to show that the _entry method of the pool is extremely slow. I have not investigated further.
* The boost::tp::sort benchmark includes thread destruction... probably eats up some microseconds.
* My code is probably sub-optimal.
The code:
    template <typename RandomIterator>
    RandomIterator pivot_partition(RandomIterator first, RandomIterator last)
    {
        typedef RandomIterator iterator_type;
        typedef typename iterator_type::difference_type difference_type;
        typedef typename iterator_type::value_type value_type;

        iterator_type pivot = first + std::distance(first, last) / 2;

        // we partition
        value_type pivot_value = *pivot;

        // we "save" the pivot by putting it in the first place
        std::iter_swap(first, pivot);

        // we skip the pivot, otherwise the partitioning won't work as expected
        iterator_type par_start = first + 1;

        std::less<value_type> pred;
        pivot = std::partition(par_start, last, std::bind2nd(pred, pivot_value));

        // we restore the pivot
        std::iter_swap(--pivot, first);
        return pivot;
    }

    template <typename RandomIterator, typename Pool>
    void rec_pivot_partition(RandomIterator first, RandomIterator last, Pool & p)
    {
        if (std::distance(first, last) > 1000)
        {
            RandomIterator pivot = pivot_partition(first, last);
            rec_pivot_partition(first, pivot, p);
            rec_pivot_partition(pivot, last, p);
        }
        else
        {
            p.submit(boost::bind(&std::sort<RandomIterator>, first, last));
        }
    }
Hi, I don't see when the merge is done. Have you tested that the result is ordered? Could you attach the complete file? Thanks, Vicente

Hi,
I don't see when the merge is done. Have you tested that the result is ordered? Could you attach the complete file?
There is no merge; this is a quicksort. I changed my mind. :D I have tested that the result is sorted, but maybe some bug eluded me.

I could try something with a merge sort as well: you would divide your collection into blocks, sort them, then merge them in parallel. You would have to be careful with false sharing, but concurrent_vector should protect us from most nasty cases.

Anyway, as for parallel_quick_sort: first you traverse the collection, partitioning, until you reach a block size of 1000, at which point you create a task to sort this block in a separate thread. As you can see, this is pretty straightforward with std::partition.

Limits:

* I have no protection against quadratic behavior
* Block size may be suboptimal; I don't merge smaller blocks (i.e. if we have two blocks of 500, we will sort two blocks of 500 instead of one of 1000)

You will need to remove the calls to TBB and probably some useless includes referring to some work in progress on my side. Regards. -- EA
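The merge phase described above could be done serially with std::inplace_merge - a sketch; parallelizing the inner loop would be the next step:

    #include <algorithm>   // std::inplace_merge
    #include <iterator>    // std::distance

    // bottom-up: repeatedly merge adjacent sorted blocks of 'width' elements
    template <typename RandomIterator>
    void merge_blocks(RandomIterator first, RandomIterator last,
                      typename RandomIterator::difference_type block_size)
    {
        typedef typename RandomIterator::difference_type diff_t;
        const diff_t n = std::distance(first, last);
        for (diff_t width = block_size; width < n; width *= 2)
        {
            RandomIterator l = first;
            while (std::distance(l, last) > width)
            {
                RandomIterator m = l + width;
                RandomIterator r =
                    (std::distance(m, last) > width) ? m + width : last;
                std::inplace_merge(l, m, r);   // merge [l,m) and [m,r)
                l = r;
            }
        }
    }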

Hi Edouard,

----- Original Message -----
From: "Edouard A." <edouard@fausse.info>
To: <boost@lists.boost.org>
Sent: Sunday, March 08, 2009 10:58 PM
Subject: Re: [boost] [threadpool] version 22 with default pool
Hi,
I don't see when the merge is done. Have you tested that the result is ordered? Could you attach the complete file?
There is no merge; this is a quicksort. I changed my mind. :D I have tested that the result is sorted, but maybe some bug eluded me.
Ok, I see now how it works.
I could try something with a merge sort as well: you would divide your collection into blocks, sort them, then merge them in parallel. You would have to be careful with false sharing, but concurrent_vector should protect us from most nasty cases.
Anyway, as for parallel_quick_sort: first you traverse the collection, partitioning, until you reach a block size of 1000, at which point you create a task to sort this block in a separate thread. As you can see, this is pretty straightforward with std::partition.
Limits :
* I have no protection against quadratic behavior
What do you mean?
* Block size may be suboptimal; I don't merge smaller blocks (i.e. if we have two blocks of 500, we will sort two blocks of 500 instead of one of 1000)
You will need to remove the calls to TBB and probably some useless includes referring to some work in progress on my side.
I have not installed TBB, and there are a lot of changes needed to make it compile without TBB. Instead of

    template <typename RandomIterator, typename Pool, typename Tasks>
    void rec_pivot_partition(RandomIterator first, RandomIterator last, Pool & p, Tasks & t)
    {
        if (std::distance(first, last) > 1000)
        {
            RandomIterator pivot = pivot_partition(first, last);
            rec_pivot_partition(first, pivot, p, t);
            rec_pivot_partition(pivot, last, p, t);
        }
        else
        {
            t.push_back(p.submit(boost::bind(&std::sort<RandomIterator>, first, last)).result());
        }
    }

I would do

    template <typename RandomIterator, typename Pool>
    void rec_pivot_partition(RandomIterator first, RandomIterator last, Pool & p)
    {
        if (std::distance(first, last) > 1000)
        {
            RandomIterator pivot = pivot_partition(first, last);
            task<void> t = p.submit(
                boost::bind(&rec_pivot_partition<RandomIterator, Pool>,
                            first, pivot, boost::ref(p)));
            rec_pivot_partition(pivot, last, p);
            t.result().wait();
        }
        else
        {
            std::sort(first, last);
        }
    }

so there is no need to store the futures in a list. Do you see any problem with this approach? Best, Vicente

On Mon, 9 Mar 2009 08:16:08 +0100, "vicente.botet" <vicente.botet@wanadoo.fr> wrote:

so there is no need to store the futures in a list. Do you see any problem with this approach?
No, that's exactly what I had in mind. It should run faster, as the partitioning will be parallelized as well. -- EA

Edouard A. wrote:
    template <typename RandomIterator, typename Pool>
    void rec_pivot_partition(RandomIterator first, RandomIterator last, Pool & p)
    {
        if (std::distance(first, last) > 1000)
        {
            RandomIterator pivot = pivot_partition(first, last);
            rec_pivot_partition(first, pivot, p);
            rec_pivot_partition(pivot, last, p);
        }
So you are doing all of the partitioning in the main thread, before starting any other threads? Have you profiled that?
    const difference_type block_size = 1000;
Have you benchmarked the effect of adjusting that? If that is sufficiently large, the performance of the threadpool queue (etc) will not matter as much. I suggest that you also tweak your benchmarking so that it runs for a *lot* longer than 30 milliseconds. Phil.

On Sun, 08 Mar 2009 22:36:16 +0000, "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
So you are doing all of the partitioning in the main thread, before starting any other threads?
Have you profiled that?
Yes I know this is suboptimal, see below.
If that is sufficiently large, the performance of the threadpool queue (etc) will not matter as much.
I suggest that you also tweak your benchmarking so that it runs for a *lot* longer than 30 milliseconds.
Yes I know, I have done benchmarks on larger data. To simplify:

- std::sort: x 1
- boost::tp::sort: x 2.5
- tbb::sort: x 4

I had some ideas this morning to improve my existing implementation. As you said, the partitioning phase is not parallelized at all, which is a shame really and would account for the huge difference. I guess you can simply start a rec_pivot_partition in a thread without waiting for it, in which case there is no need for the sort in a different thread. As for the block size, I guess the ideal size will change from platform to platform. -- EA
participants (5)
- Edouard A.
- James Mansion
- k-oli@gmx.de
- Phil Endecott
- vicente.botet