Is there interest in parallel versions of <algorithm> functions?

Hey all,

This is actually a two-part interest inquiry:

1. I think there could be some usefulness in parallel versions of the <algorithm> functions, like for_each, find, and unique_copy.

2. In a couple of days of R&D I have come up with a class named parallel_task_master, which runs a thread function many times over and manages the threads so that you keep your thread count low and working as hard as possible. If you have a quad-core CPU, for example, it can be configured so you never go over a four-thread limit. It is also set up so that you can keep adding tasks recursively while the thread count stays fixed and never deadlocks.

Because the tasks in parallel_task_master run roughly in order, I think some sort of queue could be built on top of it to parallelize functions that need to return output in serial form. It could be used to decode video stream packets, for example.

parallel_task_master may have some value on its own, but it was written to assist in creating parallel versions of the functions in <algorithm>. Some of them are not very parallelizable; some are embarrassingly so.

-Clint Levijoki
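The parallel_task_master class itself was never posted in this thread, so the following is only a sketch of the idea as described: a parallel for_each that caps the number of worker threads at the hardware concurrency. It is written in modern standard C++ for brevity (the original would have used Boost.Threads), and names like parallel_for_each and max_threads are invented for the example, not taken from Clint's code.

```cpp
// A minimal sketch of the idea described above: split a range across a
// fixed number of threads, capped at the hardware concurrency (echoing the
// quad-core example), and apply a function to each chunk.
#include <algorithm>
#include <iostream>
#include <thread>
#include <vector>

template <class Iterator, class Function>
void parallel_for_each(Iterator first, Iterator last, Function f,
                       unsigned max_threads = std::thread::hardware_concurrency())
{
    const auto length = std::distance(first, last);
    if (length == 0) return;

    // Never spawn more threads than requested, or than there are elements.
    const unsigned num_threads =
        std::max(1u, std::min<unsigned>(max_threads, static_cast<unsigned>(length)));
    const auto chunk = length / num_threads;

    std::vector<std::thread> workers;
    Iterator chunk_begin = first;
    for (unsigned i = 0; i + 1 < num_threads; ++i) {
        Iterator chunk_end = std::next(chunk_begin, chunk);
        workers.emplace_back([=] { std::for_each(chunk_begin, chunk_end, f); });
        chunk_begin = chunk_end;
    }
    std::for_each(chunk_begin, last, f);   // run the final chunk on the calling thread

    for (auto& t : workers) t.join();
}

int main() {
    std::vector<int> v(16, 1);
    parallel_for_each(v.begin(), v.end(), [](int& x) { x *= 2; });
    std::cout << v.front() << ' ' << v.back() << '\n';   // prints "2 2"
}
```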

I think this could be interesting. There is a research group at my alma mater that deals with parallel algorithms and has a project along these lines. I know some of the people involved, but I don't know whether they'd be willing to share the project sources. I suspect that they would. Either way, the papers are published and you may find them useful. Note that Bjarne is one of the faculty members involved in this project.

Go here to find out about STAPL (Standard Template Adaptive Parallel Library): http://parasol.tamu.edu/groups/rwergergroup/research/stapl/

bc
--
Benjamin Collins <ben.collins@acm.org>

Thanks for the link; they have some brilliant ideas. I will have to do some thinking on this: after more testing, my initial task_master idea is only beneficial for a very narrow class of calculations. The idea of parallel containers sounds like it could solve a lot of the issues I am having.
--
- Clint Levijoki
"I'd rather have a bottle in front of me than a frontal lobotomy" - Tom Waits

OK, I have done a lot more work, and unfortunately I don't think the parallel containers/algorithms idea is very useful outside of very specific applications; and in those applications it is easier and clearer just to make your program multithreaded manually. I can still post what I came up with if anybody is interested, but I don't see much future in it.

The problem is that timing issues between threads are just too great to overcome. It takes longer for the computer to resume from a boost::condition wait than it does to search through a million integers.

Also, I would like to vote for an easier way of sleeping in boost::thread; right now it is too awkward to use.
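For readers unfamiliar with the complaint, this is roughly what sleeping looked like in the Boost.Threads of that era: you had to build an absolute boost::xtime by hand before calling boost::thread::sleep. The helper name sleep_for_ms is made up for the example; the duration-based interface being asked for is essentially what later arrived as std::this_thread::sleep_for in C++11.

```cpp
// A sketch of the "awkward" sleep, roughly as Boost.Threads looked circa 2006:
// sleeping requires constructing an absolute wake-up time by hand.
#include <boost/thread/thread.hpp>
#include <boost/thread/xtime.hpp>

void sleep_for_ms(int milliseconds)
{
    boost::xtime xt;
    boost::xtime_get(&xt, boost::TIME_UTC);        // current absolute time
    xt.sec  += milliseconds / 1000;
    xt.nsec += (milliseconds % 1000) * 1000000;
    if (xt.nsec >= 1000000000) {                   // carry nanoseconds into seconds
        xt.sec  += 1;
        xt.nsec -= 1000000000;
    }
    boost::thread::sleep(xt);                      // sleep until that absolute time
}
```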
--
- Clint Levijoki
"I'd rather have a bottle in front of me than a frontal lobotomy" - Tom Waits

I can't say I agree with your conclusion that parallel algorithms aren't very useful for general applications, nor that manually multithreading applications is the best option. On the contrary, I think it's going to be very important in the not-too-distant future that the average programmer has high-level generic tools for working with parallel algorithms, even for seemingly trivial applications.

Coincidentally, my Summer of Code project for Boost, Boost.Act, is all about parallel algorithms, along with active objects and simple asynchronous function calls. You can see the full proposal at http://2006.planet-soc.com/node/201 . Be sure to check out the included preliminary documentation for code samples and rationale. Right now, on the algorithms side of the project, my library just lets you easily work with a few common algorithms such as for_each, copy, generate, and fill, with the execution model toggled via policies, though I plan to have all or nearly all standard algorithms implemented by the end of the summer. At this point I do not have any plans for parallel containers in the immediate future, but once everything else is completed I would definitely like to pursue them as a next step.

Currently, my library is implemented using a combination of Boost.Threads and OpenMP. I've read up a bit on STAPL thanks to this thread, and I'm looking into the possibility of using it internally in future versions of the project, pending their willingness to release the source.

--
-Matt Calabrese
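Boost.Act's actual interface is not shown in this thread, so the following is only a hypothetical illustration of what "execution model toggled via policies" can look like. The names serial_policy and parallel_policy are invented for the example and do not come from Boost.Act.

```cpp
// A hypothetical sketch of policy-selected execution for an algorithm:
// the same call site chooses serial or parallel execution by passing a tag.
#include <algorithm>
#include <thread>
#include <vector>

struct serial_policy {};
struct parallel_policy { unsigned max_threads = std::thread::hardware_concurrency(); };

template <class Iterator, class Function>
void for_each(serial_policy, Iterator first, Iterator last, Function f)
{
    std::for_each(first, last, f);               // plain serial fallback
}

template <class Iterator, class Function>
void for_each(parallel_policy, Iterator first, Iterator last, Function f)
{
    // Split the range in two and run one half on a helper thread. A real
    // library would chunk the work across max_threads workers instead.
    Iterator mid = std::next(first, std::distance(first, last) / 2);
    std::thread helper([=] { std::for_each(first, mid, f); });
    std::for_each(mid, last, f);
    helper.join();
}

int main()
{
    std::vector<int> v(1000, 1);
    for_each(serial_policy{},   v.begin(), v.end(), [](int& x) { ++x; });
    for_each(parallel_policy{}, v.begin(), v.end(), [](int& x) { ++x; });
}
```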

While we are on the subject: is there any interest in a scalable lock-free data structures library? I have some code that works on x86 and x64 with VC++ 7.1 and GCC 4.1, but I'm not familiar enough with other compilers/architectures, so I'd need some help implementing them there (it should be very easy). This would include thread-safe versions (which do not use a mutex) of a stack, a queue, some simple freelist memory pools, and maybe a linked list.

--
Cory Nelson
http://www.int64.org
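Cory's code is not included in the thread; the sketch below only illustrates the general kind of structure on offer, a Treiber-style lock-free stack driven by compare-and-swap. It uses C++11 std::atomic for portability, whereas the 2006 code would have relied on compiler or CPU intrinsics, and it deliberately skips safe memory reclamation, which (along with the ABA problem) is one of the genuinely hard parts such a library must solve.

```cpp
// A minimal Treiber-style lock-free stack. Deliberate simplification:
// popped nodes are never freed, which sidesteps the ABA and safe-reclamation
// problems that a production implementation has to address (hazard pointers,
// freelists, and so on).
#include <atomic>
#include <iostream>

template <class T>
class lock_free_stack {
    struct node {
        T value;
        node* next;
    };
    std::atomic<node*> head_{nullptr};

public:
    void push(const T& value) {
        node* n = new node{value, head_.load(std::memory_order_relaxed)};
        // Retry until head_ is swung from n->next to n.
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    bool pop(T& out) {
        node* n = head_.load(std::memory_order_acquire);
        // Retry until head_ still equals n and is swung to n->next.
        while (n && !head_.compare_exchange_weak(n, n->next,
                                                 std::memory_order_acquire,
                                                 std::memory_order_acquire)) {
        }
        if (!n) return false;
        out = n->value;
        // Intentionally not deleting n (see the reclamation caveat above).
        return true;
    }
};

int main() {
    lock_free_stack<int> s;
    s.push(1);
    s.push(2);
    int v;
    while (s.pop(v)) std::cout << v << ' ';   // prints "2 1"
    std::cout << '\n';
}
```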

Yes, I would personally be very interested in such a library. In particular, a portable lock-free queue would be extremely useful for my current project.

--
-Matt Calabrese

I agree that high-level parallel algorithms will be important in the future. But currently the technologies to support them (thread APIs and mutexes) are not good enough to make that happen.

I have uploaded what I came up with to parallel_stl_test.zip in the concurrent programming section of the Boost Vault. Perhaps I am doing something horribly wrong. I tried several different approaches to implementing things, and this one ended up being the fastest. All of the tests are written to be kind to the parallel execution mode, and they still end up falling behind. If you were to 'optimize' the serial searching tests even further by using a hash table or qsort, there would be no way the parallel versions could keep up.

I approach the problem the same way STAPL does (I assume), using parallel containers and algorithms. It is only implemented just far enough to work at all. The one other thing STAPL does that I never got around to is optimizing algorithms at runtime, so that you run the serial version instead of the parallel version whenever that ends up being faster. But I'm fairly certain the serial version will be faster 95% of the time, so often that it would negate any rare benefits of the parallel mode several times over.

The way I see things, if the algorithms don't make inherently serial tasks faster, then they have no reason to exist, because inherently parallel things can already be made parallel easily.

By the way, this is how to interpret the output of the test application:

find total = the total time it takes to execute
overhead:
  setup = how much time was spent in function overhead, copying values around, etc.
  pool = how much time was spent in the thread pool, which includes all the waiting for the primary thread
  wait0 = how much time was spent waiting in thread 0
  wait1 = how much time was spent waiting in thread 1
execution = how much time was actually spent executing the task at hand on each CPU; this is how fast things could be if it weren't for all the timing and waiting going on

find total should equal setup + pool + cpu0

- Clint Levijoki
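As a rough illustration of the comparison being made (condition-variable wake-up latency versus a serial scan of a million integers), a minimal measurement might look like the sketch below. It uses modern standard C++ facilities rather than the 2006-era Boost.Threads API, and the exact numbers will of course vary by platform.

```cpp
// Compare the latency of waking a thread blocked on a condition variable
// against the time needed to scan a million integers serially.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    using clock = std::chrono::steady_clock;

    // Serial baseline: linear pass over one million integers.
    std::vector<int> data(1000000, 1);
    auto t0 = clock::now();
    long long sum = 0;
    for (int x : data) sum += x;          // stands in for find/for_each
    auto scan_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - t0);

    // Wake-up latency: signal a worker blocked on a condition variable and
    // measure how long it takes before the worker actually runs again.
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    clock::time_point woke;
    std::thread worker([&] {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return ready; });
        woke = clock::now();
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // let the worker block
    auto t1 = clock::now();
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;
    }
    cv.notify_one();
    worker.join();
    auto wake_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(woke - t1);

    std::cout << "scan of 1M ints: " << scan_ns.count() << " ns\n"
              << "cv wake-up:      " << wake_ns.count() << " ns\n"
              << "(sum=" << sum << ")\n";
}
```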

Before tackling parallel <algorithm> functions, I think we need to step back and handle the cross-thread-domain container issues. It is very common in hardware design to have a FIFO bridge two clock domains. I'm picturing the same thing for software: a FIFO where the push happens in one thread and the pop in another. The popping thread would block on pop until there was data, and the pushing thread would block on push if the FIFO were full. Sound valuable? Consider a dual-processor/core system where you want I/O running in one thread and an N^2 algorithm in the other that only outputs 10% of the N^2 data. I need to handle the I/O bottleneck before I worry about parallel algorithm functionality.
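What is being described is essentially a bounded blocking queue. A minimal sketch, using standard C++ mutexes and condition variables rather than any particular Boost API, might look like this:

```cpp
// A bounded cross-thread FIFO: pop() blocks until data is available,
// push() blocks while the queue is full.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>

template <class T>
class blocking_fifo {
    std::deque<T> queue_;
    std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::condition_variable not_full_;

public:
    explicit blocking_fifo(std::size_t capacity) : capacity_(capacity) {}

    // Producer side: blocks while the FIFO is full.
    void push(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return queue_.size() < capacity_; });
        queue_.push_back(std::move(value));
        not_empty_.notify_one();
    }

    // Consumer side: blocks until there is something to pop.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [&] { return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop_front();
        not_full_.notify_one();
        return value;
    }
};

int main() {
    blocking_fifo<int> fifo(4);
    std::thread producer([&] { for (int i = 0; i < 10; ++i) fifo.push(i); });
    std::thread consumer([&] { for (int i = 0; i < 10; ++i) std::cout << fifo.pop() << ' '; });
    producer.join();
    consumer.join();
    std::cout << '\n';
}
```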
participants (5)
- Benjamin A. Collins
- Brannon King
- Clint Levijoki
- Cory Nelson
- Matt Calabrese