[compute] GPGPU Library - Request For Feedback


A while back I posted a message asking for interest in a GPGPU computing library and the response seemed positive. I've been slowly working on it for the last few months and it has finally reached a usable state. I've made an initial release on GitHub (details below) and would like to get feedback from the community.
The Boost Compute library provides a partial implementation of the C++ standard library for GPUs and multi-core CPUs. It includes common containers (vector<T>, flat_set<T>) and standard algorithms (transform, sort, accumulate). It also features a number of extensions including parallel-computing focused algorithms (exclusive_scan, scatter, reduce) along with a number of fancy iterators (transform_iterator, permutation_iterator). The library is built around the OpenCL framework which allows it to be portable across many types of devices (GPUs, CPUs, and accelerator cards) from many different vendors (NVIDIA, Intel, AMD).
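To give a feel for the API described above, a minimal usage sketch might look like the following. This is only an illustration based on the description in this post and later in the thread; the umbrella header name and the implicit default command queue are assumptions, not confirmed API.

#include <vector>
#include <boost/compute.hpp> // assumed umbrella header name

namespace compute = boost::compute;

int main()
{
    // data on the host
    std::vector<float> host_vector = { 4.0f, 1.0f, 3.0f, 2.0f };

    // container on the compute device
    compute::vector<float> device_vector(host_vector.size());

    // copy host -> device, sort on the device, copy back
    compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin());
    compute::sort(device_vector.begin(), device_vector.end());
    compute::copy(device_vector.begin(), device_vector.end(), host_vector.begin());

    return 0;
}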
The source code and documentation are available from the links below.
Code: https://github.com/kylelutz/compute Documentation: http://kylelutz.github.com/compute Bug Tracker: https://github.com/kylelutz/compute/issues
I've tested the library with GCC 4.7 and Clang 3.3 on both NVIDIA GPUs and Intel CPUs. However, I would not yet consider the library production-ready. Most of my time has been devoted to reaching a solid and well-tested API rather than on performance. Over time this will improve.
Feel free to send any questions, comments or feedback.
Looks interesting. One question: what support does the library provide to orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:

int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}

?
All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU, is yours different?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

Looks interesting. One question: what support does the library provide to
orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:
int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}
?
All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU, is yours different?
Yes. The library allows for asynchronous computation and the API is almost exactly the same as your proposed example.
See this example: http://kylelutz.github.com/compute/boost_compute/advanced_topics.html#boost_...
Cheers, Kyle

Kyle Lutz wrote:
Looks interesting. One question: what support does the library provide to
orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:
int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}
?
All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU, is yours different?
Yes. The library allows for asynchronous computation and the API is almost exactly the same as your proposed example.
See this example: http://kylelutz.github.com/compute/boost_compute/advanced_topics.html#boost_...
Very nice interface, very nice docs. I'll put some more time into looking at it.

Looks interesting. One question: what support does the library provide to
orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:
int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}
?
All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU, is yours different?
Yes. The library allows for asynchronous computation and the API is almost exactly the same as your proposed example.
See this example: http://kylelutz.github.com/compute/boost_compute/advanced_topics.html#boost_compute.advanced_topics.asynchronous_operations
That's excellent (sorry I have not seen this in the docs before). However I think it's not a good idea to create your own futures. This does not scale well, nor does it compose with std::future or boost::future. Is there a way to use whatever futures (or more specifically, threading implementation) the user decides to use?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

On Sun, Mar 3, 2013 at 6:37 PM, Hartmut Kaiser <hartmut.kaiser@gmail.com> wrote:
That's excellent (sorry I have not seen this in the docs before). However I think it's not a good idea to create your own futures. This does not scale well nor does it compose with std::future or boost::future. Is there a way to use whatever futures (or more specifically, threading implementation) the user decides to use?
The boost::compute::future class wraps a cl_event object which is used to monitor the progress of a compute kernel. I'm not sure how to achieve this with std::future or boost.thread's future (or even which to choose for the API). Any pointers (or example code) would be greatly appreciated.
-kyle
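To make the cl_event relationship concrete, a minimal future-like wrapper over an OpenCL event could look roughly like the sketch below. This is purely illustrative and not Boost.Compute's actual implementation; only standard OpenCL C API calls (clWaitForEvents, clReleaseEvent) are used.

#include <CL/cl.h>

// Minimal sketch: a future-like handle that blocks on an OpenCL event.
// Copy/move semantics are omitted for brevity.
class event_future
{
public:
    explicit event_future(cl_event event) : m_event(event) {}

    ~event_future()
    {
        if (m_event) {
            clReleaseEvent(m_event);
        }
    }

    // block until the command associated with the event has completed
    void wait() const
    {
        clWaitForEvents(1, &m_event);
    }

private:
    cl_event m_event;
};

Making a type like this interoperate with std::future or boost::future is exactly the open question discussed below.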

That's excellent (sorry I have not seen this in the docs before). However I think it's not a good idea to create your own futures. This does not scale well nor does it compose with std::future or boost::future. Is there a way to use whatever futures (or more specifically, threading implementation) the user decides to use?
The boost::compute::future class wraps a cl_event object which is used to monitor the progress of a compute kernel. I'm not sure how to achieve this with std::future or boost.thread's future (or even which to choose for the API). Any pointers (or example code) would be greatly appreciated.
Yes, that's exactly the problem. It is nothing specific to your particular library but a general issue to solve. In the library we develop (HPX, https://github.com/STEllAR-GROUP/hpx/) we face the same issue of having to rely on our own synchronization primitives, which makes it impossible to reuse std::future (or boost::future). Any suggestion on how to create specialization/customization points that would make boost::future universally applicable would be most appreciated.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

Looks interesting. One question: what support does the library provide to
orchestrate parallelism, i.e. doing useful work while the GPGPU is executing a kernel? Do you have something like:
int main()
{
    // create data array on host
    int host_data[] = { 1, 3, 5, 7, 9 };

    // create vector on device
    boost::compute::vector<int> device_vector(5);

    // copy from host to device
    future<void> f = boost::compute::copy_async(host_data, host_data + 5, device_vector.begin());

    // do other stuff

    f.get(); // wait for transfer to be done

    return 0;
}
?
All libraries I have seen so far assume that the CPU has to idle while waiting for the GPU, is yours different?
Yes. The library allows for asynchronous computation and the API is almost exactly the same as your proposed example.
See this example: http://kylelutz.github.com/compute/boost_compute/advanced_topics.html#boost_compute.advanced_topics.asynchronous_operations
That's excellent (sorry I have not seen this in the docs before). However I think it's not a good idea to create your own futures. This does not scale well nor does it compose with std::future or boost::future. Is there a way to use whatever futures (or more specifically, threading implementation) the user decides to use?
Also, do you have asynchronous versions for all algorithms?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

On Sun, Mar 3, 2013 at 6:39 PM, Hartmut Kaiser <hartmut.kaiser@gmail.com> wrote:
Also, do you have asynchronous versions for all algorithms?
As of now only copy_async() is implemented (as it is the most important for achieving good performance by overlapping memory transfers with computation). However, I am not sure if this is the best interface to be extended to all the other algorithms.

I've been thinking about moving towards an asynchronous execution function like std::async() which would be passed the algorithm function along with its arguments. However, I've encountered problems with overloading and having to explicitly specify the argument template types. This is one point where I'd like to get some feedback/ideas from the community. I would also like to have an <algorithm>_async_after() version which would take the algorithm's parameters along with a list of futures/events to wait for before beginning execution. My dilemma is that adding these different versions would increase the size of the API by a factor of three.

Any feedback/ideas/comments would be very helpful.
-kyle
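For what it's worth, a sketch of the std::async()-style calling convention under discussion might look like the following. The async() name, the compute_sketch namespace and the use of std::future are hypothetical illustrations, not part of the library; a real implementation would tie the returned future to a cl_event rather than delegating to std::async.

#include <future>
#include <utility>

namespace compute_sketch {

// Hypothetical dispatcher: forwards an algorithm call and returns a future.
// A real version would enqueue work on a command queue and wrap the
// resulting cl_event; std::async is used here only to show the interface.
template <class Algorithm, class... Args>
auto async(Algorithm&& algorithm, Args&&... args)
    -> std::future<decltype(algorithm(std::forward<Args>(args)...))>
{
    return std::async(std::launch::async,
                      std::forward<Algorithm>(algorithm),
                      std::forward<Args>(args)...);
}

} // namespace compute_sketch

// Usage: wrapping the call in a lambda sidesteps the overload-deduction
// problem mentioned above, at the cost of some verbosity:
//   auto f = compute_sketch::async([&]{ boost::compute::sort(v.begin(), v.end()); });
//   f.get();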

On Sun, Mar 3, 2013 at 6:39 PM, Hartmut Kaiser <hartmut.kaiser@gmail.com> wrote:
Also, do you have asynchronous versions for all algorithms?
As of now only copy_async() is implemented (as it is the most important for achieving good performance by overlapping memory transfers with computation). However, I am not sure if this is the best interface to be extended to all the other algorithms.
I've been thinking about moving towards an asynchronous execution function like std::async() which would be passed the algorithm function along with its arguments. However, I've encountered problems with overloading and having to explicitly specify the argument template types. This is one point where I'd like to get some feedback/ideas from the community. I would also like to have an <algorithm>_async_after() version which would take the algorithm's parameters along with a list of futures/events to wait for before beginning execution. My dilemma is that adding these different versions would increase the size of the API by a factor of three.
Any feedback/ideas/comments would be very helpful.
I'd strongly suggest not to add <foo>_async_after(). All we need is:
a) a function/algorithm producing a future
b) composable futures (see N3428, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3428.pdf)
But for this to be universally applicable we should try to find a way to converge onto ONE future type.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

On Wed, Mar 6, 2013 at 10:46 AM, Hartmut Kaiser <hartmut.kaiser@gmail.com> wrote:
I'd strongly suggest not to add <foo>_async_after(). All we need is
a) a function/algorithm producing a future
b) composable futures (see N3428, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3428.pdf)
But for this to be universally applicable we should try to find a way to converge onto ONE future type.
Very interesting, thanks for posting this. I'll take some time to think on how best to integrate these concepts with Boost.Compute. And yes, I would be very supportive of a generic and general solution.
-kyle

On Mon, Mar 4, 2013 at 6:25 PM, Kyle Lutz <kyle.r.lutz@gmail.com> wrote:
On Sun, Mar 3, 2013 at 6:39 PM, Hartmut Kaiser <hartmut.kaiser@gmail.com> wrote:
Also, do you have asynchronous versions for all algorithms?
As of now only copy_async() is implemented (as it is the most important for achieving good performance by overlapping memory transfers with computation). However, I am not sure if this is the best interface to be extended to all the other algorithms.
I've been thinking about moving towards an asynchronous execution function like std::async() which would be passed the algorithm function along with its arguments. However, I've encountered problems with overloading and having to explicitly specify the argument template types. This is one point where I'd like to get some feedback/ideas from the community. I would also like to have an <algorithm>_async_after() version which would take the algorithm's parameters along with a list of futures/events to wait for before beginning execution. My dilemma is that adding these different versions would increase the size of the API by a factor of three.
Any feedback/ideas/comments would be very helpful.
This sounds vaguely similar to Google's Flume framework: http://dl.acm.org/citation.cfm?id=1806638 - Jeff

On 03/02/2013 11:25 PM, Kyle Lutz wrote:
Hi everyone,
A while back I posted a message asking for interest in a GPGPU computing library and the response seemed positive. I've been slowly working on it for the last few months and it has finally reached a usable state. I've made an initial release on GitHub (details below) and would like to get feedback from the community.
The Boost Compute library provides a partial implementation of the C++ standard library for GPUs and multi-core CPUs. It includes common containers (vector<T>, flat_set<T>) and standard algorithms (transform, sort, accumulate). It also features a number of extensions including parallel-computing focused algorithms (exclusive_scan, scatter, reduce) along with a number of fancy iterators (transform_iterator, permutation_iterator). The library is built around the OpenCL framework which allows it to be portable across many types of devices (GPUs, CPUs, and accelerator cards) from many different vendors (NVIDIA, Intel, AMD).
The source code and documentation are available from the links below.
Code: https://github.com/kylelutz/compute Documentation: http://kylelutz.github.com/compute Bug Tracker: https://github.com/kylelutz/compute/issues
I've tested the library with GCC 4.7 and Clang 3.3 on both NVIDIA GPUs and Intel CPUs. However, I would not yet consider the library production-ready. Most of my time has been devoted to reaching a solid and well-tested API rather than on performance. Over time this will improve.
Feel free to send any questions, comments or feedback.
It looks really interesting, similar to Thrust. How can one iterate over two or more containers at once? In Thrust they have zip_iterators to do this:

for_each( zip( first1, first2 ), zip( last1, last2 ), f, queue );

Is something similar possible with your library?
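For readers who haven't used them: a zip iterator bundles several iterators so that a single algorithm pass sees a tuple of elements. A host-side sketch using Boost.Iterator's existing zip_iterator, which is presumably the model any device-side version would follow, looks like this (illustrative only, unrelated to Boost.Compute's API):

#include <algorithm>
#include <iostream>
#include <vector>
#include <boost/iterator/zip_iterator.hpp>
#include <boost/tuple/tuple.hpp>

// prints the element-wise sum of the two zipped sequences
struct print_sum
{
    template <class Tuple>
    void operator()(const Tuple& t) const
    {
        std::cout << boost::get<0>(t) + boost::get<1>(t) << "\n";
    }
};

int main()
{
    std::vector<int> a = { 1, 2, 3 };
    std::vector<int> b = { 4, 5, 6 };

    std::for_each(
        boost::make_zip_iterator(boost::make_tuple(a.begin(), b.begin())),
        boost::make_zip_iterator(boost::make_tuple(a.end(), b.end())),
        print_sum());

    return 0;
}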

Hi Kyle,
This looks like a very interesting library! When I have a bit of time I may see if some of my simulation codes are good candidates to make use of your API. For the moment, just a few points from reading the docs:
- Under "Lambda Expressions", I am pretty sure that "+ 4" is not the right way to "subtract four" :-)
- Like Karsten I am interested in zip iterators - my simulation code uses such structures heavily.
- What is the command_queue argument that appears in all of the algorithms? For instance here: http://kylelutz.github.com/compute/boost/compute/copy.html the signature for 'copy' takes four arguments - three iterators and a command_queue - but the example only passes the iterators, and there is no other mention of command_queues.
- Under "Vector Data Types", why floats but not doubles? Is this an OpenCL restriction? (sorry for the ignorance - my GPGPU knowledge is mostly on the CUDA side). Also, what header are they declared in?
More later, and thanks,
-Gabe

On Sun, Mar 3, 2013 at 6:15 AM, Gabriel Redner <gredner@gmail.com> wrote:
Hi Kyle,
This looks like a very interesting library! When I have a bit of time I may see if some of my simulation codes are good candidates to make use of your API.
Thanks! Please let me know if you encounter any problems.
For the moment, just a few points from reading the docs:
- Under "Lambda Expressions", I am pretty sure that "+ 4" is not the right way to "subtract four" :-)
Fixed. Thanks for pointing this out!
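For concreteness, the corrected example would presumably read along these lines, using the "_1 * 3 - 4" style of lambda expression mentioned in the announcement. The header names and the implicit default queue are assumptions on the editor's part, not confirmed API.

#include <boost/compute.hpp>        // assumed umbrella header
#include <boost/compute/lambda.hpp> // assumed lambda header

namespace compute = boost::compute;

int main()
{
    using compute::lambda::_1;

    compute::vector<int> v(4);
    compute::fill(v.begin(), v.end(), 10);

    // "subtract four" from every element on the device
    compute::transform(v.begin(), v.end(), v.begin(), _1 - 4);

    return 0;
}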
- Like Karsten I am interested in zip iterators - my simulation code uses such structures heavily.
Zip iterators are a feature I have planned but haven't had time to implement yet. Stay tuned!
- What is the command_queue argument that appears in all of the algorithms? For instance here: http://kylelutz.github.com/compute/boost/compute/copy.html the signature for 'copy' takes four arguments - three iterators and a command_queue - but the example only passes the iterators, and there is no other mention of command_queues.
Command queues specify the context and device for the algorithm's execution. For all of the standard algorithms the command_queue parameter is optional. If not provided, a default command_queue will be created for the default GPU device and the algorithm will be executed there.
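Spelling the queue out explicitly would presumably look something like the sketch below; the device/context/command_queue setup follows the usual OpenCL object model, and the exact Boost.Compute class and function names used here are assumptions.

#include <vector>
#include <boost/compute.hpp> // assumed umbrella header

namespace compute = boost::compute;

int main()
{
    // pick a device, then build a context and a queue for it explicitly
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    std::vector<int> host(4, 1);
    compute::vector<int> device_vec(host.size(), context);

    // same copy as before, but with the queue passed explicitly
    compute::copy(host.begin(), host.end(), device_vec.begin(), queue);

    return 0;
}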
- Under "Vector Data Types", why floats but not doubles? Is this an OpenCL restriction? (sorry for the ignorance - my GPGPU knowledge is mostly on the CUDA side). Also, what header are they declared in?
Doubles are supported by both my library and by OpenCL. However, doubles are not supported by all compute devices. Trying to use doubles on a device that doesn't support them will cause a runtime_exception to be thrown. The list under "Vector Data Types" just gives a few examples. All of the scalar and vector data types are declared in the "boost/compute/types.hpp" header. Thanks for the feedback!
Cheers, Kyle
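For reference, at the raw OpenCL level double support is advertised through the cl_khr_fp64 extension, and a device can be queried roughly as follows. This is plain OpenCL C API, not Boost.Compute; presumably the library performs an equivalent check before throwing.

#include <CL/cl.h>
#include <string>

// returns true if the device advertises the cl_khr_fp64 extension
bool device_supports_double(cl_device_id device)
{
    size_t size = 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, 0, &size);

    std::string extensions(size, '\0');
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, &extensions[0], 0);

    return extensions.find("cl_khr_fp64") != std::string::npos;
}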

On 3/2/2013 4:25 PM, Kyle Lutz wrote:
Hi everyone,
A while back I posted a message asking for interest in a GPGPU computing library and the response seemed positive. I've been slowly working on it for the last few months and it has finally reached a usable state. I've made an initial release on GitHub (details below) and would like to get feedback from the community.
The Boost Compute library provides a partial implementation of the C++ standard library for GPUs and multi-core CPUs. It includes common containers (vector<T>, flat_set<T>) and standard algorithms (transform, sort, accumulate). It also features a number of extensions including parallel-computing focused algorithms (exclusive_scan, scatter, reduce) along with a number of fancy iterators (transform_iterator, permutation_iterator). The library is built around the OpenCL framework which allows it to be portable across many types of devices (GPUs, CPUs, and accelerator cards) from many different vendors (NVIDIA, Intel, AMD).
The source code and documentation are available from the links below.
Code: https://github.com/kylelutz/compute Documentation: http://kylelutz.github.com/compute Bug Tracker: https://github.com/kylelutz/compute/issues
I've tested the library with GCC 4.7 and Clang 3.3 on both NVIDIA GPUs and Intel CPUs. However, I would not yet consider the library production-ready. Most of my time has been devoted to reaching a solid and well-tested API rather than on performance. Over time this will improve.
Feel free to send any questions, comments or feedback.
Thanks, Kyle
How does it compare to VexCL (https://github.com/ddemidov/vexcl), Bolt (http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-paral...) and Thrust (https://developer.nvidia.com/thrust)? A comparison would be nice. Moreover, why not piggy-back on the libraries that are already available (and they probably have better optimizations in place) and simply write a nice wrapper around them (and maybe, crazy idea, allow a single codebase to use both AMD and nVidia GPUs at the same time)?

On Sun, Mar 3, 2013 at 9:15 PM, Ioannis Papadopoulos <ipapadop@cse.tamu.edu> wrote:
How does it compare to VexCL (https://github.com/ddemidov/vexcl), Bolt (http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-paral...) and Thrust (https://developer.nvidia.com/thrust)?
VexCL is an expression-template based linear-algebra library for OpenCL. Its aims and scope are a bit different from the Boost Compute library. VexCL is closer in nature to the Eigen library while Boost.Compute is closer to the C++ standard library. I don't feel that Boost.Compute really fills the same role as VexCL; in fact, VexCL could be built on top of Boost.Compute.

Bolt is an AMD-specific C++ wrapper around the OpenCL API which extends the C99-based OpenCL language to support C++ features (most notably templates). It is similar to NVIDIA's Thrust library and shares the same shortcoming: a lack of portability.

Thrust implements a C++ STL-like API for GPUs and CPUs. It is built with multiple backends: NVIDIA GPUs use the CUDA backend and multi-core CPUs can use the Intel TBB or OpenMP backends. However, Thrust will not work with AMD graphics cards or other lesser-known accelerators. I feel Boost.Compute is superior in that it uses the vendor-neutral OpenCL library to achieve portability across all types of compute devices.
A comparison would be nice. Moreover, why not piggy-back on the libraries that are already available (and they probably have better optimizations in place) and simply write a nice wrapper around them (and maybe, crazy idea, allow a single codebase to use both AMD and nVidia GPUs at the same time)?
Boost.Compute does allow you to use both AMD and nVidia GPUs at the same time with the same codebase. In fact you can also throw in your multi-core CPU, Xeon Phi accelerator card and even a Playstation 3. Not such a crazy idea after all ;-). -kyle

Hello, On 05/03/13 03:47, Kyle Lutz wrote:
On Sun, Mar 3, 2013 at 9:15 PM, Ioannis Papadopoulos <ipapadop@cse.tamu.edu> wrote:
How does it compare to VexCL (https://github.com/ddemidov/vexcl), Bolt (http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-paral...) and Thrust (https://developer.nvidia.com/thrust)?
... Thrust implements a C++ STL-like API for GPUs and CPUs. It is built with multiple backends. NVIDIA GPUs use the CUDA backend and multi-core CPUs can use the Intel TBB or OpenMP backends. However, thrust will not work with AMD graphics cards or other lesser-known accelerators. I feel Boost.Compute is superior in that it uses the vendor-neutral OpenCL library to achieve portability across all types of compute devices. ...
There's an ongoing discussion on possibilities for other backends for Thrust, including mentions of AMD's Bolt: https://groups.google.com/forum/?fromgroups=#!topic/thrust-users/Xe2JkFy_hUk
Regards, Sylwester
--
http://www.igf.fuw.edu.pl/~slayoo/

On 3/4/2013 8:47 PM, Kyle Lutz wrote:
On Sun, Mar 3, 2013 at 9:15 PM, Ioannis Papadopoulos <ipapadop@cse.tamu.edu> wrote:
A comparison would be nice. Moreover, why not piggy-back on the libraries that are already available (and they probably have better optimizations in place) and simply write a nice wrapper around them (and maybe, crazy idea, allow a single codebase to use both AMD and nVidia GPUs at the same time)?
Boost.Compute does allow you to use both AMD and nVidia GPUs at the same time with the same codebase. In fact you can also throw in your multi-core CPU, Xeon Phi accelerator card and even a Playstation 3. Not such a crazy idea after all ;-).
-kyle
Thanks for the comparison. The only issue I see with Boost.Compute is that it will have problems supporting the well-known architectures well. Basically, it takes all the optimizations that have been researched and developed for maximum efficiency and throws them out of the window. For example, there is a wealth of CUDA algorithms highly optimized for nVidia GPUs. These will have to be reimplemented in OpenCL. And tuned (ouch); possibly for each device (ouch x 2). I see it as a massive task for a single person or a small group of people doing that in their spare time.

However, if Boost.Compute implements something similar to Boost.Multiprecision's multi-backend approach, then you can use Thrust, Bolt, or whatever else there is underneath and only fall back to the OpenCL code if there is nothing else (or the user explicitly requires it). The way I'd see something with the title Boost.Compute is as an algorithm selection library - you have multiple backends, which you may choose from based on automatic configuration at compile time and, at run time, based on the type and size of your input data. Starting from multiple backends is a good start.

03.03.2013 2:25, Kyle Lutz:
The library is built around the OpenCL framework which allows it to be portable across many types of devices (GPUs, CPUs, and accelerator cards) from many different vendors (NVIDIA, Intel, AMD).
It looks like this library is some kind of abstraction around OpenCL. But as I can see, it does not offer ways to abstract from OpenCL kernel syntax in a general way: https://github.com/kylelutz/compute/blob/master/example/monte_carlo.cpp . I.e. not just simple things like boost::compute::sqrt, or simple lambda expressions like "_1 * 3 - 4".

Kernel description can be abstracted in several ways:
1) Approach similar to Boost.Phoenix - generate OpenCL kernel code based on a gathered expression tree.
2) Approach similar to the TaskGraph library: http://www3.imperial.ac.uk/pls/portallive/docs/1/45421696.PDF - describe the kernel in terms of special function calls, macros, expression templates and so on. The actual kernel is generated when that description code is "executed". Here is a small demo - http://ideone.com/qQ4Pvo (check output at bottom).

-- Evgeny Panasyuk

On Thu, Mar 7, 2013 at 2:18 PM, Evgeny Panasyuk <evgeny.panasyuk@gmail.com> wrote:
It looks like this library is some kind of abstraction around OpenCL. But as I can see, it does not offer ways to abstract from OpenCL kernels syntax in general way: https://github.com/kylelutz/compute/blob/master/example/monte_carlo.cpp . I.e. not just simple things like boost::compute::sqrt, or simple lambda expressions like "_1 * 3 - 4".
Kernel description can be abstracted in several ways:
1) Approach similar to Boost.Phoenix - generate OpenCL kernel code based on gathered expression tree.
2) Approach similar to TaskGraph library: http://www3.imperial.ac.uk/pls/portallive/docs/1/45421696.PDF - describe kernel in terms of special function calls, macros, expression templates and so on. Actual kernel is generated when that description code is "executed". Here is small demo - http://ideone.com/qQ4Pvo (check output at bottom).
This is the direction I would like to go with the library. However, coming up with a nice general solution for specifying kernel code in C++ is quite tricky. If you have example code for a potential API I'd love to take a look.

For now, the only two exposed ways of using custom functions are directly specifying the OpenCL code (like in the monte_carlo example) or using the lambda expression framework. In the short-term future I am looking at also allowing bind()-like functions to compose multiple built-in functions along with literal values. Internally there is also the meta_kernel class, which is a hybrid of C++ code and raw OpenCL C code strings and is used by the algorithms to implement generic kernels. This may one day be cleaned up and promoted to the public API but for now it is an implementation detail.

I have been planning on making more use of Boost.Phoenix as you suggested. Thanks for posting the TaskGraph paper. It looks very interesting and I will take a closer look when I get some spare time.

Cheers, Kyle
participants (9)
- Evgeny Panasyuk
- Gabriel Redner
- Hartmut Kaiser
- Ioannis Papadopoulos
- Jeffrey Lee Hellrung, Jr.
- Karsten Ahnert
- Kyle Lutz
- Michael Marcin
- Sylwester Arabas