[compute] GPGPU Library - Request For Feedback

Hi everyone,

A while back I posted a message asking for interest in a GPGPU computing library and the response seemed positive. I've been slowly working on it for the last few months and it has finally reached a usable state. I've made an initial release on GitHub (details below) and would like to get feedback from the community.

The Boost Compute library provides a partial implementation of the C++ standard library for GPUs and multi-core CPUs. It includes common containers (vector<T>, flat_set<T>) and standard algorithms (transform, sort, accumulate). It also features a number of extensions, including parallel-computing-focused algorithms (exclusive_scan, scatter, reduce) along with a number of fancy iterators (transform_iterator, permutation_iterator). The library is built around the OpenCL framework, which allows it to be portable across many types of devices (GPUs, CPUs, and accelerator cards) from many different vendors (NVIDIA, Intel, AMD).

The source code and documentation are available from the links below.

Code: https://github.com/kylelutz/compute
Documentation: http://kylelutz.github.com/compute
Bug Tracker: https://github.com/kylelutz/compute/issues

I've tested the library with GCC 4.7 and Clang 3.3 on both NVIDIA GPUs and Intel CPUs. However, I would not yet consider the library production-ready. Most of my time has been devoted to reaching a solid and well-tested API rather than to performance. Over time this will improve.

Feel free to send any questions, comments or feedback.

Thanks,
Kyle
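For a sense of what code against the library looks like, here is a minimal sketch of a host/device round trip using the containers and algorithms named above. The header paths and the default-queue behaviour are assumed from the linked documentation:

#include <algorithm>
#include <cstdlib>
#include <vector>

#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/algorithm/sort.hpp>
#include <boost/compute/container/vector.hpp>

namespace compute = boost::compute;

int main()
{
    // generate random data on the host
    std::vector<float> host_vector(10000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // transfer the data to the device (default queue assumed)
    compute::vector<float> device_vector(host_vector.size());
    compute::copy(host_vector.begin(), host_vector.end(),
                  device_vector.begin());

    // sort the data on the device
    compute::sort(device_vector.begin(), device_vector.end());

    // copy the sorted data back to the host
    compute::copy(device_vector.begin(), device_vector.end(),
                  host_vector.begin());

    return 0;
}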

Looks interesting. One question: what support does the library provide to orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:

int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(
        host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}

?

All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU. Is yours different?

Regards
Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

Looks interesting. One question: what support does the library provide to orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:

int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(
        host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}

?

All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU. Is yours different?
Yes. The library allows for asynchronous computation and the API is almost exactly the same as your proposed example. See this example: http://kylelutz.github.com/compute/boost_compute/advanced_topics.html#boost_compute.advanced_topics.asynchronous_operations

Cheers,
Kyle

Kyle Lutz wrote:
Looks interesting. One question: what support does the library provide to orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:

int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(
        host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}

?

All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU. Is yours different?
Yes. The library allows for asynchronous computation and the API is almost exactly the same as your proposed example.
See this example: http://kylelutz.github.com/compute/boost_compute/advanced_topics.html#boost_compute.advanced_topics.asynchronous_operations
Very nice interface, very nice docs. I'll put some more time into looking at it.

Looks interesting. One question: what support does the library provide to orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:

int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(
        host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}

?

All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU. Is yours different?
Yes. The library allows for asynchronous computation and the API is almost exactly the same as your proposed example.
See this example: http://kylelutz.github.com/compute/boost_compute/advanced_topics.html#boost_compute.advanced_topics.asynchronous_operations
That's excellent (sorry, I had not seen this in the docs before). However, I think it's not a good idea to create your own futures. This does not scale well, nor does it compose with std::future or boost::future. Is there a way to use whatever futures (or, more specifically, threading implementation) the user decides to use?

Regards
Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

On Sun, Mar 3, 2013 at 6:37 PM, Hartmut Kaiser <hartmut.kaiser@gmail.com> wrote:
That's excellent (sorry, I had not seen this in the docs before). However, I think it's not a good idea to create your own futures. This does not scale well, nor does it compose with std::future or boost::future. Is there a way to use whatever futures (or, more specifically, threading implementation) the user decides to use?
The boost::compute::future class wraps a cl_event object which is used to monitor the progress of a compute kernel. I'm not sure how to achieve this with std::future or boost.thread's future (or even which to choose for the API). Any pointers (or example code) would be greatly appreciated.

-kyle
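One plausible bridge, sketched here under the assumption of an OpenCL 1.1 runtime (which added clSetEventCallback), is to fulfil a std::promise from the event's completion callback. The make_future() helper below is purely illustrative, not part of the library:

#include <future>
#include <CL/cl.h>

// Sketch: adapt a cl_event to a std::future<void> by fulfilling a
// std::promise from the event's completion callback (OpenCL 1.1+).
static void CL_CALLBACK event_complete(cl_event, cl_int, void *user_data)
{
    std::promise<void> *promise =
        static_cast<std::promise<void>*>(user_data);
    promise->set_value();
    delete promise;
}

std::future<void> make_future(cl_event event)
{
    std::promise<void> *promise = new std::promise<void>();
    std::future<void> future = promise->get_future();

    // invoke event_complete() once the event reaches CL_COMPLETE
    clSetEventCallback(event, CL_COMPLETE, event_complete, promise);

    return future;
}

A production version would also need to handle abnormal event termination (a negative status value), which this sketch ignores.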

Looks interesting. One question: what support does the library provide to orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:

int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(
        host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}

?

All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU. Is yours different?
Yes. The library allows for asynchronous computation and the API is almost exactly the same as your proposed example.
See this example: http://kylelutz.github.com/compute/boost_compute/advanced_topics.html#boost_compute.advanced_topics.asynchronous_operations
That's excellent (sorry, I had not seen this in the docs before). However, I think it's not a good idea to create your own futures. This does not scale well, nor does it compose with std::future or boost::future. Is there a way to use whatever futures (or, more specifically, threading implementation) the user decides to use?
Also, do you have asynchronous versions for all algorithms?

Regards
Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

On Sun, Mar 3, 2013 at 6:39 PM, Hartmut Kaiser <hartmut.kaiser@gmail.com> wrote:
Also, do you have asynchronous versions for all algorithms?
As of now only copy_async() is implemented (as it is the most important for achieving good performance by overlapping memory transfers with computation). However, I am not sure this is the best interface to extend to all the other algorithms.

I've been thinking about moving towards an asynchronous execution function like std::async(), which would be passed the algorithm function along with its arguments. However, I've encountered problems with overloading and having to explicitly specify the argument template types. This is one point where I'd like to get some feedback/ideas from the community.

I would also like to have an <algorithm>_async_after() version which would take the algorithm's parameters along with a list of futures/events to wait for before beginning execution. My dilemma is that adding these different versions would increase the size of the API by a factor of three. Any feedback/ideas/comments would be very helpful.

-kyle
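To make the design space concrete, the following declaration-only sketch shows the three interface shapes under discussion. Every name in it is hypothetical; none of these signatures exist in the library:

#include <future>

// Hypothetical sketch only -- none of these names are part of Boost.Compute.
namespace sketch {

struct command_queue;   // stand-in for boost::compute::command_queue
struct wait_list;       // stand-in for a list of events/futures to wait on

// (1) Per-algorithm asynchronous variant: one extra overload per
// algorithm, so the API grows linearly with the algorithm count.
template<class Iterator>
std::future<void> sort_async(Iterator first, Iterator last,
                             command_queue &queue);

// (2) std::async-style launcher: a single entry point, but overloaded
// algorithm templates force callers to spell out template arguments
// explicitly (the overloading problem mentioned above).
template<class Function, class... Args>
std::future<void> async(Function f, Args&&... args);

// (3) Dependency-aware variant: begins execution only after the
// listed events/futures have completed.
template<class Iterator>
std::future<void> sort_async_after(Iterator first, Iterator last,
                                   command_queue &queue,
                                   const wait_list &dependencies);

} // namespace sketch

Multiplying shapes (1) and (3) across every algorithm is exactly the threefold API growth described above.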

On 03/02/2013 11:25 PM, Kyle Lutz wrote:
Hi everyone,
A while back I posted a message asking for interest in a GPGPU computing library and the response seemed positive. I've been slowly working on it for the last few months and it has finally reached a usable state. I've made an initial release on GitHub (details below) and would like to get feedback from the community.
The Boost Compute library provides a partial implementation of the C++ standard library for GPUs and multi-core CPUs. It includes common containers (vector<T>, flat_set<T>) and standard algorithms (transform, sort, accumulate). It also features a number of extensions including parallel-computing focused algorithms (exclusive_scan, scatter, reduce) along with a number of fancy iterators (transform_iterator, permutation_iterator). The library is built around the OpenCL framework which allows it to be portable across many types of devices (GPUs, CPUs, and accelerator cards) from many different vendors (NVIDIA, Intel, AMD).
The source code and documentation are available from the links below.
Code: https://github.com/kylelutz/compute
Documentation: http://kylelutz.github.com/compute
Bug Tracker: https://github.com/kylelutz/compute/issues
I've tested the library with GCC 4.7 and Clang 3.3 on both NVIDIA GPUs and Intel CPUs. However, I would not yet consider the library production-ready. Most of my time has been devoted to reaching a solid and well-tested API rather than on performance. Over time this will improve.
Feel free to send any questions, comments or feedback.
It looks really interesting, similar to Thrust. How can one iterate over two or more containers at once? In Thrust there are zip_iterators to do this:

for_each( zip( first1 , first2 ) , zip( last1 , last2 ) , f , queue );

Is something similar possible with your library?

Hi Kyle,

This looks like a very interesting library! When I have a bit of time I may see if some of my simulation codes are good candidates to make use of your API. For the moment, just a few points from reading the docs:

- Under "Lambda Expressions", I am pretty sure that "+ 4" is not the right way to "subtract four" :-)

- Like Karsten, I am interested in zip iterators - my simulation code uses such structures heavily.

- What is the command_queue argument that appears in all of the algorithms? For instance here: http://kylelutz.github.com/compute/boost/compute/copy.html the signature for 'copy' takes four arguments - three iterators and a command_queue - but the example only passes the iterators, and there is no other mention of command_queues.

- Under "Vector Data Types", why floats but not doubles? Is this an OpenCL restriction? (Sorry for the ignorance - my GPGPU knowledge is mostly on the CUDA side.) Also, what header are they declared in?

More later, and thanks,

-Gabe

On Sun, Mar 3, 2013 at 6:15 AM, Gabriel Redner <gredner@gmail.com> wrote:
Hi Kyle,
This looks like a very interesting library! When I have a bit of time I may see if some of my simulation codes are good candidates to make use of your API.
Thanks! Please let me know if you encounter any problems.
For the moment, just a few points from reading the docs:
- Under "Lambda Expressions", I am pretty sure that "+ 4" is not the right way to "subtract four" :-)
Fixed. Thanks for pointing this out!
- Like Karsten I am interested in zip iterators - my simulation code uses such structures heavily.
Zip iterators are a feature I have planned but haven't had time to implement yet. Stay tuned!
- What is the command_queue argument that appears in all of the algorithms? For instance here: http://kylelutz.github.com/compute/boost/compute/copy.html the signature for 'copy' takes four arguments - three iterators and a command_queue - but the example only passes the iterators, and there is no other mention of command_queues.
Command queues specify the context and device for the algorithm's execution. For all of the standard algorithms the command_queue parameter is optional. If not provided, a default command_queue will be created for the default GPU device and the algorithm will be executed there.
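A sketch of the explicit form, with names taken from the documented API (the exact constructor signatures are assumed), would look like this:

#include <boost/compute/system.hpp>
#include <boost/compute/context.hpp>
#include <boost/compute/command_queue.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/container/vector.hpp>

namespace compute = boost::compute;

int main()
{
    // pick a device explicitly instead of relying on the default
    compute::device gpu = compute::system::default_device();

    // a context and a queue tied to that device
    compute::context context(gpu);
    compute::command_queue queue(context, gpu);

    int host_data[] = { 1, 2, 3, 4, 5 };
    compute::vector<int> device_vector(5, context);

    // the queue determines where (and in what order) the copy runs
    compute::copy(host_data, host_data + 5, device_vector.begin(), queue);

    return 0;
}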
- Under "Vector Data Types", why floats but not doubles? Is this an OpenCL restriction? (sorry for the ignorance - my GPGPU knowledge is mostly on the CUDA side). Also, what header are they declared in?
Doubles are supported by both my library and by OpenCL. However, doubles are not supported by all compute devices. Trying to use doubles on a device that doesn't support them will cause a runtime_exception to be thrown. The list under "Vector Data Types" just gives a few examples. All of the scalar and vector data types are declared in the "boost/compute/types.hpp" header.

Thanks for the feedback!

Cheers,
Kyle
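One way to guard against that runtime_exception, sketched here against the device-query API (the supports_extension() call is assumed from the documentation; cl_khr_fp64 is the standard OpenCL extension for double precision):

#include <iostream>
#include <boost/compute/system.hpp>
#include <boost/compute/device.hpp>

namespace compute = boost::compute;

int main()
{
    compute::device device = compute::system::default_device();

    // double support is exposed via the cl_khr_fp64 OpenCL extension
    if(device.supports_extension("cl_khr_fp64")){
        std::cout << device.name() << " supports double precision\n";
    }
    else {
        std::cout << device.name() << " is single precision only\n";
    }

    return 0;
}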

On 3/2/2013 4:25 PM, Kyle Lutz wrote:
Hi everyone,
A while back I posted a message asking for interest in a GPGPU computing library and the response seemed positive. I've been slowly working on it for the last few months and it has finally reached a usable state. I've made an initial release on GitHub (details below) and would like to get feedback from the community.
The Boost Compute library provides a partial implementation of the C++ standard library for GPUs and multi-core CPUs. It includes common containers (vector<T>, flat_set<T>) and standard algorithms (transform, sort, accumulate). It also features a number of extensions including parallel-computing focused algorithms (exclusive_scan, scatter, reduce) along with a number of fancy iterators (transform_iterator, permutation_iterator). The library is built around the OpenCL framework which allows it to be portable across many types of devices (GPUs, CPUs, and accelerator cards) from many different vendors (NVIDIA, Intel, AMD).
The source code and documentation are available from the links below.
Code: https://github.com/kylelutz/compute
Documentation: http://kylelutz.github.com/compute
Bug Tracker: https://github.com/kylelutz/compute/issues
I've tested the library with GCC 4.7 and Clang 3.3 on both NVIDIA GPUs and Intel CPUs. However, I would not yet consider the library production-ready. Most of my time has been devoted to reaching a solid and well-tested API rather than on performance. Over time this will improve.
Feel free to send any questions, comments or feedback.
Thanks, Kyle
How does it compare to VexCL (https://github.com/ddemidov/vexcl), Bolt (http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-paral...) and Thrust (https://developer.nvidia.com/thrust)? A comparison would be nice.

Moreover, why not piggy-back on the libraries that are already available (they probably have better optimizations in place) and simply write a nice wrapper around them (and maybe, crazy idea, allow a single codebase to use both AMD and nVidia GPUs at the same time)?

On Sun, Mar 3, 2013 at 9:15 PM, Ioannis Papadopoulos <ipapadop@cse.tamu.edu> wrote:
How does it compare to VexCL (https://github.com/ddemidov/vexcl), Bolt (http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-paral...) and Thrust (https://developer.nvidia.com/thrust)?
VexCL is an expression-template-based linear-algebra library for OpenCL. Its aims and scope are a bit different from those of the Boost Compute library. VexCL is closer in nature to the Eigen library, while Boost.Compute is closer to the C++ standard library. I don't feel that Boost.Compute really fills the same role as VexCL, and in fact VexCL could be built on top of Boost.Compute.

Bolt is an AMD-specific C++ wrapper around the OpenCL API which extends the C99-based OpenCL language to support C++ features (most notably templates). It is similar to NVIDIA's Thrust library and shares the same shortcoming: lack of portability.

Thrust implements a C++ STL-like API for GPUs and CPUs. It is built with multiple backends: NVIDIA GPUs use the CUDA backend, and multi-core CPUs can use the Intel TBB or OpenMP backends. However, Thrust will not work with AMD graphics cards or other lesser-known accelerators. I feel Boost.Compute is superior in that it uses the vendor-neutral OpenCL library to achieve portability across all types of compute devices.
A comparison would be nice. Moreover, why not piggy-back on the libraries that are already available (and they probably have better optimizations in place) and simply write a nice wrapper around them (and maybe, crazy idea, allow a single codebase to use both AMD and nVidia GPUs at the same time)?
Boost.Compute does allow you to use both AMD and nVidia GPUs at the same time with the same codebase. In fact, you can also throw in your multi-core CPU, Xeon Phi accelerator card, and even a PlayStation 3. Not such a crazy idea after all ;-).

-kyle
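A sketch of what that multi-device setup could look like, assuming a system-wide device enumeration along the lines of the documented system class:

#include <iostream>
#include <boost/compute/system.hpp>
#include <boost/compute/context.hpp>
#include <boost/compute/command_queue.hpp>

namespace compute = boost::compute;

int main()
{
    // enumerate every OpenCL device visible to the host -- this is how
    // one codebase can drive AMD and NVIDIA GPUs (and CPUs) at once
    for(const compute::device &device : compute::system::devices()){
        compute::context context(device);
        compute::command_queue queue(context, device);

        std::cout << "queue ready on: " << device.name() << std::endl;

        // work submitted to 'queue' now runs on this device
    }

    return 0;
}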

Hello,

On 05/03/13 03:47, Kyle Lutz wrote:
On Sun, Mar 3, 2013 at 9:15 PM, Ioannis Papadopoulos <ipapadop@cse.tamu.edu> wrote:
How does it compare to VexCL (https://github.com/ddemidov/vexcl), Bolt (http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-paral...) and Thrust (https://developer.nvidia.com/thrust)?
... Thrust implements a C++ STL-like API for GPUs and CPUs. It is built with multiple backends: NVIDIA GPUs use the CUDA backend, and multi-core CPUs can use the Intel TBB or OpenMP backends. However, Thrust will not work with AMD graphics cards or other lesser-known accelerators. I feel Boost.Compute is superior in that it uses the vendor-neutral OpenCL library to achieve portability across all types of compute devices. ...
There's an ongoing discussion on possibilities for other backends for Thrust, including mentions of AMD's Bolt: https://groups.google.com/forum/?fromgroups=#!topic/thrust-users/Xe2JkFy_hUk

Regards,
Sylwester
--
http://www.igf.fuw.edu.pl/~slayoo/
participants (7)

- Gabriel Redner
- Hartmut Kaiser
- Ioannis Papadopoulos
- Karsten Ahnert
- Kyle Lutz
- Michael Marcin
- Sylwester Arabas