Interest in a GPU computing library

This is a call for interest in a GPU computing library for Boost. I have been working on the library in my spare time for the past month or so and it’s reached a point where I am ready for feedback. Below gives a brief overview, some notes on the design, and a small example of the library which I’ve named Boost.Compute.

--- Overview ---

* C++ library for general-purpose computing on GPUs/Accelerators
* Based on OpenCL (Open Computing Language)
* Header-only implementation
* API inspired by the STL, Boost and Thrust
* Boost dependencies: Config, Utility, Iterator, Exception, Preprocessor, TypeTraits, StaticAssert, MPL, Proto

--- Design ---

OpenCL is a framework for writing programs that run on parallel computing devices such as GPUs and multi-core CPUs. The OpenCL language is based on C99 with a few extensions to simplify writing parallel and vector-based code. More background: http://en.wikipedia.org/wiki/OpenCL.

The core of the Boost Compute library is a thin C++ wrapper over the OpenCL C API. It provides classes for creating and managing various OpenCL entities such as contexts, buffers, devices and kernels. These classes are written in a style consistent with Boost and the C++ standard library.

Written on top of the core library is a partial implementation of the C++ STL which includes common containers (e.g. vector<T>, array<T, N>) and algorithms (e.g. copy, find_if, sort) along with a few extensions (e.g. scatter, exclusive_scan, flat_set<T>).

The aim of Boost.Compute’s STL API is to provide a familiar interface to developers wanting to easily write new code or port existing code to run on GPU devices. It also features a few “fancy” iterators inspired by the Boost.Iterator library such as transform_iterator<>, counting_iterator<>, and permutation_iterator<>.

Furthermore, a lambda expression library was written using Boost.Proto which allows for mathematical expressions to be defined at the call site of an algorithm and then be executed on the GPU. For example, to multiply each element in a vector by the square root of itself and then add four:

transform(v.begin(), v.end(), v.begin(), _1 * sqrt(_1) + 4);

--- Example ---

Below is a small example of using the Boost.Compute API to sort a vector of int values:

// create vector of random values on the host
std::vector<int> host_vector(10000);
std::generate(host_vector.begin(), host_vector.end(), rand);

// create a compute context for the default gpu device
boost::compute::context gpu_context = boost::compute::default_gpu_context();

// create a vector on the gpu
boost::compute::vector<int> device_vector(gpu_context);

// transfer the values to the device
device_vector = host_vector;

// sort the values on the device
boost::compute::sort(device_vector.begin(), device_vector.end());

// transfer the sorted values back to the host
boost::compute::copy(device_vector.begin(), device_vector.end(), host_vector.begin());

--- Conclusion ---

The Boost Compute library provides a useful, intuitive, and familiar interface for running high-performance parallel code on GPU devices. Incorporating Boost.Compute into the Boost libraries would make GPU computing readily accessible to a large number of C++ developers.

All comments and feedback are welcome and greatly appreciated.

Thanks,
Kyle
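For readers who want to see how the pieces of the Design section fit together, a complete program built around the lambda example above might look roughly like the following. The code is not yet published, so the umbrella header and the namespace of the placeholder _1 are guesses; only the names that appear in the post (context, default_gpu_context, vector, copy, transform) are taken from it.

// illustrative sketch only -- header name and placeholder namespace are assumptions
#include <vector>
#include <algorithm>
#include <cstdlib>

#include <boost/compute.hpp> // assumed umbrella header

int main()
{
    namespace compute = boost::compute;

    // create a vector of random values on the host
    std::vector<float> host_vector(10000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // create a compute context for the default gpu device
    compute::context gpu_context = compute::default_gpu_context();

    // create a vector on the gpu and transfer the values to the device
    compute::vector<float> device_vector(gpu_context);
    device_vector = host_vector;

    // v[i] = v[i] * sqrt(v[i]) + 4, evaluated on the device via the lambda DSL
    using compute::lambda::_1; // placeholder namespace is a guess
    compute::transform(device_vector.begin(), device_vector.end(),
                       device_vector.begin(), _1 * sqrt(_1) + 4);

    // transfer the results back to the host
    compute::copy(device_vector.begin(), device_vector.end(),
                  host_vector.begin());
    return 0;
}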

Hi Kyle, On Tue, Sep 18, 2012 at 3:53 AM, Kyle Lutz <kyle.r.lutz@gmail.com> wrote:
This is a call for interest in a GPU computing library for Boost. I have been working on the library in my spare time for the past month or so and it’s reached a point where I am ready for feedback. Below gives a brief overview, some notes on the design, and a small example of the library which I’ve named Boost.Compute.
I certainly *am* interested in this library. Do you have online docs or reference for it yet? [snip] Best, Matus

Hi, On Mon, 17 Sep 2012 21:53:20 -0400 Kyle Lutz <kyle.r.lutz@gmail.com> wrote:
This is a call for interest in a GPU computing library for Boost. I have been working on the library in my spare time for the past month or so and it’s reached a point where I am ready for feedback. Below gives a brief overview, some notes on the design, and a small example of the library which I’ve named Boost.Compute.
--- Overview ---
* C++ library for general-purpose computing on GPUs/Accelerators * Based on OpenCL (Open Computing Language) * Header-only implementation * API inspired by the STL, Boost and Thrust * Boost dependencies: Config, Utility, Iterator, Exception, Preprocessor, TypeTraits, StaticAssert, MPL, Proto
Sounds interesting. It seems to provide a nice abstraction layer. Is it possible to mix your library with a low-level OpenCL programming style (which might be necessary to call functions from other OpenCL libraries)?

Heiko

-- "Noise is a suitable means of drowning out the voice of conscience." (Pearl S. Buck, American writer, 1892-1973)
-- Number Crunch Blog @ http://numbercrunch.de
-- Heiko Bauke @ http://www.mpi-hd.mpg.de/personalhomes/bauke

On 09/18/2012 03:53 AM, Kyle Lutz wrote:
This is a call for interest in a GPU computing library for Boost. I have been working on the library in my spare time for the past month or so and it’s reached a point where I am ready for feedback. Below gives a brief overview, some notes on the design, and a small example of the library which I’ve named Boost.Compute.
--- Overview ---
* C++ library for general-purpose computing on GPUs/Accelerators * Based on OpenCL (Open Computing Language) * Header-only implementation * API inspired by the STL, Boost and Thrust * Boost dependencies: Config, Utility, Iterator, Exception, Preprocessor, TypeTraits, StaticAssert, MPL, Proto
--- Design ---
OpenCL is a framework for writing programs that run on parallel computing devices such as GPUs and multi-core CPUs. The OpenCL language is based on C99 with a few extensions to simplify writing parallel and vector-based code. More background: http://en.wikipedia.org/wiki/OpenCL.
The core of the Boost Compute library is a thin C++ wrapper over the OpenCL C API. It provides classes for creating and managing various OpenCL entities such as contexts, buffers, devices and kernels. These classes are written in a style consistent with Boost and the C++ standard library.
Written on top of the core library is a partial implementation of the C++ STL which includes common containers (e.g. vector<T>, array<T, N>) and algorithms (e.g. copy, find_if, sort) along with a few extensions (e.g. scatter, exclusive_scan, flat_set<T>).
The aim of Boost.Compute’s STL API is to provide a familiar interface to developers wanting to easily write new code or port existing code to run on GPU devices. It also features a few “fancy” iterators inspired by the Boost.Iterator library such as transform_iterator<>, counting_iterator<>, and permutation_iterator<>.
Furthermore, a lambda expression library was written using Boost.Proto which allows for mathematical expressions to be defined at the call site of an algorithm and then be executed on the GPU. For example, to multiply each element in a vector by the square root of itself and then add four:
transform(v.begin(), v.end(), v.begin(), _1 * sqrt(_1) + 4);

Nice work! Where can we find the code? The only thing I miss from your description is support for asynchronous operations and how to extend the lambda facilities. Are those based on Phoenix? How do I integrate user-defined kernels? I am working on a very similar library (https://github.com/sithhell/oclm); so far our efforts went into making OpenCL programming easier, with a strong focus on supporting asynchronous operations. It would be cool to see support for that in your library. As far as I can tell, your work is far more advanced than what I and my collaborators have done so far. It would be great if I could give up the development of my library and merge my findings with your work!
--- Example ---
Below is a small example of using the Boost.Compute API to sort a vector of int values:
// create vector of random values on the host
std::vector<int> host_vector(10000);
std::generate(host_vector.begin(), host_vector.end(), rand);

// create a compute context for the default gpu device
boost::compute::context gpu_context = boost::compute::default_gpu_context();

// create a vector on the gpu
boost::compute::vector<int> device_vector(gpu_context);

// transfer the values to the device
device_vector = host_vector;

// sort the values on the device
boost::compute::sort(device_vector.begin(), device_vector.end());

// transfer the sorted values back to the host
boost::compute::copy(device_vector.begin(), device_vector.end(), host_vector.begin());
--- Conclusion ---
The Boost Compute library provides a useful, intuitive, and familiar interface for running high-performance parallel code on GPU devices. Incorporating Boost.Compute into the Boost libraries would make GPU computing readily accessible to a large number of C++ developers.
All comments and feedback are welcome and greatly appreciated.
Thanks, Kyle

On 09/18/2012 03:53 AM, Kyle Lutz wrote:
This is a call for interest in a GPU computing library for Boost. I have been working on the library in my spare time for the past month or so and it’s reached a point where I am ready for feedback. Below gives a brief overview, some notes on the design, and a small example of the library which I’ve named Boost.Compute.

Can one have a look at the library?

Hi Kyle,

This seems interesting, but what would be the advantages over Thrust for example, which offers a very STL-like interface (not entirely sure if it is fully STL-compliant)? Your example looks a lot like Thrust, so I'm asking if there is anything new or anything that is Boost-specific.

Best,
-- Beren Minor
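For comparison, the rough Thrust equivalent of the sort example from the original post is shown below (CUDA/nvcc required); it illustrates why the two interfaces look so similar.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>

int main()
{
    // create vector of random values on the host
    thrust::host_vector<int> host_vector(10000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // transfer the values to the device
    thrust::device_vector<int> device_vector = host_vector;

    // sort the values on the device
    thrust::sort(device_vector.begin(), device_vector.end());

    // transfer the sorted values back to the host
    thrust::copy(device_vector.begin(), device_vector.end(), host_vector.begin());
    return 0;
}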

On 09/18/2012 10:38 AM, Beren Minor wrote:
Hi Kyle,
This seems interesting, but what would be the advantages over Thrust for example, which offers a very STL-like interface (not entirely sure if it is fully STL-compliant)? Your example looks a lot like Thrust, so I'm asking if there is anything new or anything that is Boost-specific.

Thrust is for CUDA only. Thus only for NVIDIA devices (so far). OpenCL is an open standard which (in theory) is supposed to work on any device implementing it. Devices which support OpenCL range from CPUs to various GPUs, the upcoming Intel MIC accelerator and Cell processors.
Best,

Thrust is for CUDA only. Thus only for NVIDIA devices (so far). OpenCL is an open standard which (in theory) is supposed to work on any device implementing it. Devices which support OpenCL range from CPUs to various GPUs, the upcoming Intel MIC accelerator and Cell processors.
Alright, I thought it was not restricted to CUDA. It's great if the same kind of library can be written for a more standard base then. -- Beren Minor

Hi,
--- Overview ---
* C++ library for general-purpose computing on GPUs/Accelerators * Based on OpenCL (Open Computing Language) * Header-only implementation * API inspired by the STL, Boost and Thrust * Boost dependencies: Config, Utility, Iterator, Exception, Preprocessor, TypeTraits, StaticAssert, MPL, Proto
It seems very interesting to me. It will be very helpful in GPU acceleration and will provide a standard for GPU programming. I am surely in.

Regards,
Gaurav

On Tue, Sep 18, 2012 at 2:24 PM, Beren Minor <beren.minor+boost@gmail.com> wrote:
Thrust is for CUDA only. Thus only for NVIDIA devices (so far). OpenCL is an open standard which (in theory) is supposed to work on any device implementing it. Devices which support OpenCL range from CPUs to various GPUs, the upcoming Intel MIC accelerator and Cell processors.
Alright, I thought it was not restricted to CUDA. It's great if the same kind of library can be written for a more standard base then.
-- Beren Minor

On 09/18/2012 10:53 AM, Thomas Heller wrote:
On 09/18/2012 10:38 AM, Beren Minor wrote:
Hi Kyle,
This seems interesting, but what would be the advantages over Thrust for example, which offers a very STL-like interface (not entirely sure if it is fully STL-compliant)? Your example looks a lot like Thrust, so I'm asking if there is anything new or anything that is Boost-specific.

Thrust is for CUDA only. Thus only for NVIDIA devices (so far).

Oops, guess I am wrong here. Taking a more thorough look at the Thrust documentation, it looks like different backends are indeed supported.

OpenCL is an open standard which (in theory) is supposed to work on any device implementing it. Devices which support OpenCL range from CPUs to various GPUs, the upcoming Intel MIC accelerator and Cell processors.
Best,

On Tue, Sep 18, 2012 at 10:58 AM, Thomas Heller <thom.heller@gmail.com> wrote:
Oops, guess I am wrong here. Taking a more thorough look at the thrust documentation it looks like different backends are indeed supported.
Multiple backends indeed, but apparently not OpenCL. I guess then it's a gap that has to be filled, but extending Thrust to support OpenCL (if possible) might be more beneficial than rewriting another library that only supports OpenCL. -- Beren Minor

On Tuesday, September 18, 2012 11:43:25 AM Beren Minor wrote:
On Tue, Sep 18, 2012 at 10:58 AM, Thomas Heller <thom.heller@gmail.com> wrote:
Oops, guess I am wrong here. Taking a more thorough look at the thrust documentation it looks like different backends are indeed supported.
Multiple backend indeed, but apparently not OpenCL. I guess then it's a gap that has to be filled, but extending Thrust to support OpenCL (if possible) might be more beneficial than rewriting another library that only supports OpenCL.
A few years ago, when I started using Thrust, I think the FAQ stated that OpenCL will never be supported by Thrust. Now the statement is a little weaker: "When will Thrust support OpenCL? The primary barrier to OpenCL support is the lack of an OpenCL compiler and runtime with support for C++ templates (e.g., something similar to nvcc and the CUDA Runtime). These features are necessary to achieve close coupling of host and device codes."

I also think that, given the library design, it won't be easier to adapt Thrust to OpenCL than to implement a new OpenCL library from scratch, as the two technologies (CUDA and OpenCL) are fundamentally different.

However, I would be very interested in a Boost OpenCL library. In odeint, we currently support CUDA by means of Thrust, as well as OpenCL in terms of vexCL [1,2]. vexCL is very nice, fast and easy to use, so it sets quite some standard for a Boost OpenCL implementation, I think.

[1] https://github.com/ddemidov/vexcl
[2] http://www.codeproject.com/Articles/415058/VexCL-Vector-expression-template-library-for-OpenC

Mario

Beren Minor wrote:
This seems interesting, but what would be the advantages over Thrust for example, which offers a very STL-like interface (not entirely sure if it is fully STL-compliant)? Your example looks a lot like Thrust, so I'm asking if there is anything new or anything that is Boost-specific.
This might be a big advantage of the new library. Thrust probably isn't using any of the Boost features (I didn't dive into the docs) and won't in the future. Regards, Adam

On 09/18/2012 03:53 AM, Kyle Lutz wrote:
The core of the Boost Compute library is a thin C++ wrapper over the OpenCL C API. It provides classes for creating and managing various OpenCL entities such as contexts, buffers, devices and kernels. These classes are written in a style consistent with Boost and the C++ standard library.
I've seen many similar libraries out there, some of which also support multi-GPU or streaming operations.

What's important is to check that there is no performance loss between the Boost.Compute version and the handwritten OpenCL version over a variety of applications, and that Boost.Compute allows a significant subset of what OpenCL allows. Without a study that demonstrates both of these points, I do not think I would be able to vote in favor of such a library.

From a glance, the interface seems already quite limited, since you have to copy the whole data from the host to the device before doing computation on it, then copy the memory back. So not only can you not overlap transfer time with computation time, you're also limited by the device memory. Those limitations could be fine if they are deliberate, but it should be pointed out that a lot of applications are ill-suited for the library, or that some code needs to be added on top of the library to do things better.

Your example uses 'sort'. What sorting algorithm is it? Parallel sorting is a tricky thing.
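For what it's worth, the kind of overlap described above is usually obtained with chunking and double buffering. A rough sketch in terms of the raw OpenCL C API (not Boost.Compute) follows; the "process" kernel, assumed to take a float buffer and an element count, is hypothetical, and setup of the platform, context, program and kernel as well as all error handling is omitted.

#include <CL/cl.h>
#include <cstddef>

void process_in_chunks(cl_context context, cl_device_id device,
                       cl_kernel process, float *host_data,
                       size_t total, size_t chunk)
{
    // two in-order queues and two device buffers: while one chunk is being
    // computed on one queue, the next chunk's transfer can proceed on the
    // other (on hardware with a separate copy engine)
    cl_command_queue queue[2];
    cl_mem buf[2];
    for (int i = 0; i < 2; ++i) {
        queue[i] = clCreateCommandQueue(context, device, 0, 0);
        buf[i] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                chunk * sizeof(float), 0, 0);
    }

    for (size_t offset = 0; offset < total; offset += chunk) {
        int i = (int)((offset / chunk) % 2);
        size_t n = (total - offset < chunk) ? (total - offset) : chunk;
        cl_uint count = (cl_uint)n;

        // host -> device transfer, kernel launch and device -> host transfer
        // are all non-blocking; they are ordered within queue[i], while work
        // submitted to the other queue can run concurrently
        clEnqueueWriteBuffer(queue[i], buf[i], CL_FALSE, 0, n * sizeof(float),
                             host_data + offset, 0, 0, 0);
        clSetKernelArg(process, 0, sizeof(cl_mem), &buf[i]);
        clSetKernelArg(process, 1, sizeof(cl_uint), &count);
        clEnqueueNDRangeKernel(queue[i], process, 1, 0, &n, 0, 0, 0, 0);
        clEnqueueReadBuffer(queue[i], buf[i], CL_FALSE, 0, n * sizeof(float),
                            host_data + offset, 0, 0, 0);
    }

    for (int i = 0; i < 2; ++i) {
        clFinish(queue[i]);
        clReleaseMemObject(buf[i]);
        clReleaseCommandQueue(queue[i]);
    }
}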

Hi,
This is a call for interest in a GPU computing library for Boost.
For me this is very interesting.
* C++ library for general-purpose computing on GPUs/Accelerators * Based on OpenCL (Open Computing Language)
Is it possible to have various backends, so that you could use CUDA with NVIDIA GPUs, or e.g. something completely different on high-performance clusters that distributes the code and uses e.g. MPI for communication, or whatever might come in handy? As far as I have understood, Thrust is designed that way and there are a few backends available.
Furthermore, a lambda expression library was written using Boost.Proto which allows for mathematical expressions to be defined at the call site of an algorithm and then be executed on the GPU. For example, to multiply each element in a vector by the square root of itself and then add four:
transform(v.begin(), v.end(), v.begin(), _1 * sqrt(_1) + 4);
Is there a way to use C++11 lambdas here? I think that would make the library feel native in the future as well. On the other hand, I have no idea how that could be possible at all.
// transfer the values to the device device_vector = host_vector;
Is there a way to do some kind of streaming here as well? Instead of moving all the data to the GPU and getting the result back, I am thinking of some kind of stream that I could use to transfer data to the GPU, have it do some calculations on it, and get the results back from another stream while the GPU is working on the next chunk of data?

Christof

--
okunah gmbh - custom software ("Software nach Maß")
Zugspitzstraße 211, 86165 Augsburg
www.okunah.de - cd@okunah.de
Registry court (Registergericht): Augsburg, HRB 21896
Managing director (Geschäftsführer): Christof Donat
VAT ID (UStID): DE 248 815 055

Hi,

It would be nice to see the code and some documentation. Could you point out the main differences between Boost.Compute and Microsoft's AMP? From what I understand, AMP is implemented only with DirectCompute so far, but it's basically the same idea? Also, it seems to me that AMP has a different design, so I would like to know the differences. I don't know much about AMP or OpenCL, so I'm just asking for clarity.
Joel Lamotte

Thanks for all the comments and feedback so far! I’ve written up answers to your questions below (which should serve as a good start for a FAQ for the library). Please let me know if anything is not clear or if I forgot to answer your question.

*** Where can I find the code and/or documentation? ***

I have not yet made the code publicly available. I still want to clean up a few things and improve the documentation a fair bit before releasing it. This e-mail was just to gauge the interest of the Boost community in this type of library (and it seems to be positive :-)). As long as I find some free time it should only take a week or so to get the code online. I will notify the list when I do so.

*** Why not write as a back-end for Thrust? ***

It would not be possible to provide the same API that Thrust expects for OpenCL. The fundamental reason is that functions/functors passed to Thrust algorithms are actual compiled C++ functions, whereas for Boost.Compute these form expression objects which are then translated into C99 code which is then compiled for OpenCL.

*** Why not target CUDA and/or support multiple back-ends? ***

CUDA and OpenCL are two very different technologies. OpenCL works by compiling C99 code at run-time to generate kernel objects which can then be executed on the GPU. CUDA, on the other hand, works by compiling its kernels using a special compiler (nvcc) which then produces binaries which can be executed on the GPU.

OpenCL already has multiple implementations which allow it to be used on a variety of platforms (e.g. NVIDIA GPUs, Intel CPUs, etc.). I feel that adding another abstraction level within Boost.Compute would only complicate and bloat the library.

*** Is it possible to use ordinary C++ functions/functors or C++11 lambdas with Boost.Compute? ***

Unfortunately no. OpenCL relies on having C99 source code available at run-time in order to execute code on the GPU. Thus compiled C++ functions or C++11 lambdas cannot simply be passed to the OpenCL environment to be executed on the GPU.

This is the reason why I wrote the Boost.Compute lambda library. Basically it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. "input[i] * sqrt(input[i]) + 4") which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code closer to C++ that still can be executed through OpenCL.

*** Does the API support data-streaming operations? ***

Yes it does. Though, as a few people pointed out, the example I provided does not show this. Each line of code in the example will be executed in serial and thus will not take advantage of the GPU’s ability to transfer data and perform computations simultaneously. The Boost.Compute STL API does support this but it requires a bit more setup from the user. All of the algorithms take an optional command_queue parameter that serves as a place for them to issue their instructions. The default case (when no command_queue is specified) is for the algorithm to create a command_queue for itself, issue its instructions, and then wait for completion (i.e. a synchronous operation).

The example can be made more efficient (though slightly more complex) as follows:

// create command queue
command_queue queue(context, device);

// copy to device, sort, and copy back to host
copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);
sort(device_vector.begin(), device_vector.end(), queue);
copy(device_vector.begin(), device_vector.end(), host_vector.begin(), queue);

// wait for all above operations to complete
queue.finish();

*** Does the Boost.Compute API inter-operate with the OpenCL C API? ***

Yes. I have designed the C++ wrapper API to be as unobtrusive as possible. All the functionality available in the OpenCL C API will also be available via the Boost.Compute C++ API. In fact, the C++ wrapped classes all have conversion operators to their underlying OpenCL types so that they can be passed directly to OpenCL functions:

// create context object
boost::compute::context ctx = boost::compute::default_context();

// query number of devices using the OpenCL C API
cl_uint num_devices;
clGetContextInfo(ctx, CL_CONTEXT_NUM_DEVICES, sizeof(cl_uint), &num_devices, 0);
std::cout << "num_devices: " << num_devices << std::endl;

*** How is the performance? ***

As of now many of the Boost.Compute algorithms are not ready for production code (at least performance-wise). I have focused the majority of my time on getting the API stable and functional as well as implementing a comprehensive test-suite. In fact, a few of the algorithms are still implemented serially. Over time these will be improved and the library will become competitive with other GPGPU libraries. On that note, if anyone has OpenCL/CUDA code that implements any of the STL algorithms and can be released under the Boost Software License, I'd love to hear from you.

Thanks,
Kyle
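To make the conversion-operator point above concrete, a minimal sketch of such a wrapper might look like the code below; this is illustrative only and the real boost::compute::context may differ in detail.

#include <CL/cl.h>

// a reference-counted wrapper around cl_context; the implicit conversion
// operator is what allows an object of this class to be passed straight to
// clGetContextInfo() and other OpenCL C API functions, as in the example above
class context
{
public:
    explicit context(cl_context ctx = 0)
        : m_context(ctx)
    {
        if (m_context) clRetainContext(m_context);
    }

    context(const context &other)
        : m_context(other.m_context)
    {
        if (m_context) clRetainContext(m_context);
    }

    context &operator=(const context &other)
    {
        if (this != &other) {
            if (other.m_context) clRetainContext(other.m_context);
            if (m_context) clReleaseContext(m_context);
            m_context = other.m_context;
        }
        return *this;
    }

    ~context()
    {
        if (m_context) clReleaseContext(m_context);
    }

    // implicit conversion back to the underlying OpenCL handle
    operator cl_context() const { return m_context; }

private:
    cl_context m_context;
};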

On Tue, Sep 18, 2012 at 1:28 PM, Kyle Lutz <kyle.r.lutz@gmail.com> wrote:
Thanks for all the comments and feedback so far! I’ve written up answers to your questions below (which should serve as a good start for a FAQ for the library). Please let me know if anything is not clear or if I forgot to answer your question.
[snip]
*** Is it possible to use ordinary C++ functions/functors or C++11 lambdas with Boost.Compute? ***
Unfortunately no. OpenCL relies on having C99 source code available at run-time in order to execute code on the GPU. Thus compiled C++ functions or C++11 lambdas cannot simply be passed to the OpenCL environment to be executed on the GPU.
This is the reason why I wrote the Boost.Compute lambda library. Basically it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. “input[i] * sqrt(input[i]) + 4)”) which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code closer to C++ that still can be executed through OpenCL.
Could it use boost.phoenix v3? IUC, Phoenix v3 creates a boost.proto expression. The library could make transformations to these expressions. [snip]
Thanks, Kyle
Regards, -- Felipe Magno de Almeida

On Tuesday 18 September 2012 13:58:24 Felipe Magno de Almeida wrote:
On Tue, Sep 18, 2012 at 1:28 PM, Kyle Lutz <kyle.r.lutz@gmail.com> wrote:
Thanks for all the comments and feedback so far! I’ve written up answers to your questions below (which should serve as a good start for a FAQ for the library). Please let me know if anything is not clear or if I forgot to answer your question.
[snip]
*** Is it possible to use ordinary C++ functions/functors or C++11 lambdas with Boost.Compute? ***
Unfortunately no. OpenCL relies on having C99 source code available at run-time in order to execute code on the GPU. Thus compiled C++ functions or C++11 lambdas cannot simply be passed to the OpenCL environment to be executed on the GPU.
This is the reason why I wrote the Boost.Compute lambda library. Basically it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. “input[i] * sqrt(input[i]) + 4)”) which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code closer to C++ that still can be executed through OpenCL.
Could it use boost.phoenix v3? IUC, Phoenix v3 creates a boost.proto expression. The library could make transformations to these expressions.
Phoenix actors do contain Proto expressions. But as I understand, actors and the additional components Phoenix implements to evaluate expressions in a functional manner are useless in context of Boost.Compute. IIUC, the constructed expressions are not invoked but translated into strings (the C99 source code). Using Proto directly seems more appropriate.
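To illustrate what "using Proto directly" could look like: a minimal evaluation context that serializes an expression to a C99 fragment instead of computing a value is sketched below. The placeholder type and the generated strings are made up for the example; Boost.Compute's real lambda machinery has not been published.

#include <boost/proto/proto.hpp>
#include <iostream>
#include <sstream>
#include <string>

namespace proto = boost::proto;

// a made-up placeholder terminal
struct placeholder {};
proto::terminal<placeholder>::type const _1 = {{}};

// an evaluation context that turns an expression tree into C99 source text
struct c99_context
    : proto::callable_context<c99_context const>
{
    typedef std::string result_type;

    std::string operator()(proto::tag::terminal, placeholder) const
    {
        return "input[i]";
    }

    std::string operator()(proto::tag::terminal, int i) const
    {
        std::ostringstream s;
        s << i;
        return s.str();
    }

    template <typename L, typename R>
    std::string operator()(proto::tag::plus, L const &l, R const &r) const
    {
        return "(" + proto::eval(l, *this) + " + " + proto::eval(r, *this) + ")";
    }

    template <typename L, typename R>
    std::string operator()(proto::tag::multiplies, L const &l, R const &r) const
    {
        return "(" + proto::eval(l, *this) + " * " + proto::eval(r, *this) + ")";
    }
};

int main()
{
    c99_context ctx;
    // prints: ((input[i] * 2) + 4)
    std::cout << proto::eval(_1 * 2 + 4, ctx) << std::endl;
}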

On 9/18/2012 10:45 AM, Andrey Semashev wrote:
On Tuesday 18 September 2012 13:58:24 Felipe Magno de Almeida wrote:
Could it use boost.phoenix v3? IUC, Phoenix v3 creates a boost.proto expression. The library could make transformations to these expressions.
Yes, phoenix was designed to make this sort of application possible.
Phoenix actors do contain Proto expressions. But as I understand, actors and the additional components Phoenix implements to evaluate expressions in a functional manner are useless in context of Boost.Compute.
True, you would need a different back-end.
IIUC, the constructed expressions are not invoked but translated into strings (the C99 source code). Using Proto directly seems more appropriate.
Just because you need a different back-end doesn't mean the phoenix front-end is worthless. Phoenix has defined a schema that maps C++ statements onto C++ expressions and provides the components to build these expressions for you. To ignore that and rebuild it from scratch seems like a waste to me. Besides, having a way to take a single phoenix expression and evaluate it in different ways -- on a CPU with the built-in Phoenix evaluators, or on a GPU with alternate ones -- sounds bloody cool, and very useful. -- Eric Niebler BoostPro Computing http://www.boostpro.com

On 09/18/2012 08:06 PM, Eric Niebler wrote:
On Tuesday 18 September 2012 13:58:24 Felipe Magno de Almeida wrote:
Could it use boost.phoenix v3? IUC, Phoenix v3 creates a boost.proto expression. The library could make transformations to these expressions.

Yes, phoenix was designed to make this sort of application possible.

Indeed.

On 9/18/2012 10:45 AM, Andrey Semashev wrote:
Phoenix actors do contain Proto expressions. But as I understand, actors and the additional components Phoenix implements to evaluate expressions in a functional manner are useless in context of Boost.Compute.

True, you would need a different back-end.

Indeed.

IIUC, the constructed expressions are not invoked but translated into strings (the C99 source code). Using Proto directly seems more appropriate.

Just because you need a different back-end doesn't mean the phoenix front-end is worthless. Phoenix has defined a schema that maps C++ statements onto C++ expressions and provides the components to build these expressions for you. To ignore that and rebuild it from scratch seems like a waste to me.
Besides, having a way to take a single phoenix expression and evaluate it in different ways -- on a CPU with the built-in Phoenix evaluators, or on a GPU with alternate ones -- sounds bloody cool, and very useful.
I have to fully agree here. The external transforms we put tremendous effort into were designed for exactly that purpose. I would not like to see this go to waste and have yet another lambda library implemented. I even put up an example for this: http://www.boost.org/doc/libs/1_51_0/libs/phoenix/example/parallel_for.cpp (OK, this is for OpenMP, but I think it shows the basic direction one could go). Instead of executing the expression in the evaluation, you create a string and are done with it. Reuse a subset of the already existing phoenix expressions, generate an error on the unsupported ones, and use the phoenix extension mechanism to create Boost.Compute "overloads" for regular functions. Bloody cool indeed, and on my list from day 1 since we had the external transforms in place. So the question to Kyle is: what did you miss from Phoenix 3?

On Wed, Sep 19, 2012 at 2:23 AM, Thomas Heller <thom.heller@gmail.com> wrote:
On 09/18/2012 08:06 PM, Eric Niebler wrote:
On 9/18/2012 10:45 AM, Andrey Semashev wrote:
On Tuesday 18 September 2012 13:58:24 Felipe Magno de Almeida wrote:
Could it use boost.phoenix v3? IUC, Phoenix v3 creates a boost.proto expression. The library could make transformations to these expressions.
Yes, phoenix was designed to make this sort of application possible.
Indeed.
Phoenix actors do contain Proto expressions. But as I understand, actors and the additional components Phoenix implements to evaluate expressions in a functional manner are useless in context of Boost.Compute.
True, you would need a different back-end.
Indeed.
IIUC, the constructed expressions are not invoked but translated into strings (the C99 source code). Using Proto directly seems more appropriate.
Just because you need a different back-end doesn't mean the phoenix front-end is worthless. Phoenix has defined a schema that maps C++ statements onto C++ expressions and provides the components to build these expressions for you. To ignore that and rebuild it from scratch seems like a waste to me.
Besides, having a way to take a single phoenix expression and evaluate it in different ways -- on a CPU with the built-in Phoenix evaluators, or on a GPU with alternate ones -- sounds bloody cool, and very useful.
I have to fully agree here. The external transforms we put tremendous effort into were designed for exactly that purpose. I would not like to see this go to waste and have yet another lambda library implemented. I even put up an example for this: http://www.boost.org/doc/libs/1_51_0/libs/phoenix/example/parallel_for.cpp (OK, this is for OpenMP, but I think it shows the basic direction one could go). Instead of executing the expression in the evaluation, you create a string and are done with it. Reuse a subset of the already existing phoenix expressions, generate an error on the unsupported ones, and use the phoenix extension mechanism to create Boost.Compute "overloads" for regular functions. Bloody cool indeed, and on my list from day 1 since we had the external transforms in place. So the question to Kyle is: what did you miss from Phoenix 3?
Truthfully, I've never used (or even really taken a close look at) Boost.Phoenix. The Proto library seemed to have the functionality I needed and thus I used that. I will take a look at Boost.Phoenix 3 and see if it will work for Boost.Compute. Thanks for providing the example link. Cheers, Kyle

On 09/18/2012 06:28 PM, Kyle Lutz wrote:
*** Why not target CUDA and/or support multiple back-ends? ***
CUDA and OpenCL are two very different technologies. OpenCL works by compiling C99 code at run-time to generate kernel objects which can then be executed on the GPU. CUDA, on the other hand, works by compiling its kernels using a special compiler (nvcc) which then produces binaries which can executed on the GPU.
The company I work at has technology to generate both CUDA (at compile-time) and OpenCL (at runtime) kernels from expression templates. At the moment we have support for element-wise, global and partial reduction across all dimensions as well as partial scanning across all dimensions. Element-wise function combinations can be merged into a single reduction and scanning kernel. Everything is automatically streamed and retrieved as needed and data is cached on the device when possible, with a runtime deciding the right amount of memory and computing resources to allocate for each computation depending on the device capabilities.

Therefore, I do not think supporting both CUDA and OpenCL is an impossible problem. People want CUDA for a simple reason: CUDA is still faster than equivalent OpenCL on NVIDIA hardware.

I think however that automatic kernel generation is a whole problem of its own, and should be clearly separated from the distribution and memory handling logic.
This is the reason why I wrote the Boost.Compute lambda library. Basically it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. “input[i] * sqrt(input[i]) + 4)”) which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code closer to C++ that still can be executed through OpenCL.
From your description, it looks like you've reinvented the wheel there, causing needless limitations and interoperability problems for users. It could have just been done by serializing arbitrary Proto transforms to C99, with extension points for custom tags. With CUDA, you'd actually have hit the problem that the Proto functions are not marked __device__, but with OpenCL it doesn't matter.

On 9/18/2012 11:00 AM, Mathias Gaunard wrote:
This is the reason why I wrote the Boost.Compute lambda library. Basically it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. “input[i] * sqrt(input[i]) + 4)”) which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code closer to C++ that still can be executed through OpenCL.
From your description, it looks like you've reinvented the wheel there, causing needless limitations and interoperability problems for users.
It could have just been done by serializing arbitrary Proto transforms to C99, with extension points for custom tags.
With CUDA, you'd actually have hit the problem that the Proto functions are not marked __device__, but with OpenCL it doesn't matter.
Mathias, Could you say more about what is needed to make Proto CUDA-friendly? I'm not familiar with CUDA. -- Eric Niebler BoostPro Computing http://www.boostpro.com

With CUDA, you'd actually have hit the problem that the Proto functions are not marked __device__, but with OpenCL it doesn't matter.
Mathias, Could you say more about what is needed to make Proto CUDA-friendly? I'm not familiar with CUDA.
The thing needed to make Proto more CUDA-friendly is the same thing that is needed to make it AMP-friendly, in case you are familiar with C++ AMP [1]. Basically, you have to intrusively annotate every Proto function with the "__host__ __device__" annotation (restrict(x86, amp) in the case of C++ AMP). [1] http://msdn.microsoft.com/en-us/library/hh265137.aspx

On 9/18/2012 11:42 AM, Manjunath Kudlur wrote:
With CUDA, you'd actually have hit the problem that the Proto functions are not marked __device__, but with OpenCL it doesn't matter.
Mathias, Could you say more about what is needed to make Proto CUDA-friendly? I'm not familiar with CUDA.
The thing needed to make Proto more CUDA-friendly is the same thing that is needed to make it AMP-friendly, in case you are familiar with C++ AMP [1]. Basically, you have to intrusively annotate every Proto function with the "__host__ __device__" annotation (restrict(x86, amp) in the case of C++ AMP).
*Every* function in Proto? Or just the ones that build Proto expressions? Or evaluate them? Or some other subset?

There's precedent for this. Someone submitted a patch adding BOOST_FORCEINLINE to the important functions that need to be inlined for optimal evaluation performance. Perhaps this would be a good place to start for adding something like a BOOST_PROTO_GPU_ENABLED macro. I would be perfectly willing to accept a patch. Would anybody care to submit one? -- Eric Niebler BoostPro Computing http://www.boostpro.com

On 09/18/2012 10:05 PM, Eric Niebler wrote:
On 9/18/2012 11:42 AM, Manjunath Kudlur wrote:
With CUDA, you'd actually have hit the problem that the Proto functions are not marked __device__, but with OpenCL it doesn't matter.
Mathias, Could you say more about what is needed to make Proto CUDA-friendly? I'm not familiar with CUDA.
The thing needed to make Proto more CUDA-friendly is the same thing that is needed to make it AMP-friendly, in case you are familiar with C++ AMP [1]. Basically, you have to intrusively annotate every Proto function with the "__host__ __device__" annotation (restrict(x86, amp) in the case of C++ AMP).
*Every* function in Proto? Or just the ones that build Proto expressions? Or evaluate them? Or some other subset?
The ones to evaluate them. value, proto_base and child_c could be a usable subset. Older CUDA versions were a bit broken, and easily had internal compiler errors. At MetaScale we actually translate Proto expressions to our custom expression types for CUDA to avoid this, but maybe straight Proto could work well enough now. As a side note, it also allows us to have shorter symbol names, which was a bit of a problem with Proto.

On 9/18/2012 1:32 PM, Mathias Gaunard wrote:
On 09/18/2012 10:05 PM, Eric Niebler wrote:
On 9/18/2012 11:42 AM, Manjunath Kudlur wrote:
The thing needed to make Proto more CUDA-friendly is the same thing that is needed to make it AMP-friendly, in case you are familiar with C++ AMP [1]. Basically, you have to intrusively annotate every Proto function with the "__host__ __device__" annotation (restrict(x86, amp) in the case of C++ AMP).
*Every* function in Proto? Or just the ones that build Proto expressions? Or evaluate them? Or some other subset?
The ones to evaluate them. value, proto_base and child_c could be a usable subset.
That's simple enough. But since I'm not a CUDA/AMP guy, I can't do this on my own. I would need someone knowledgeable to submit a patch. <nudge>
As a side note, it also allows us to have shorter symbol names, which was a bit of a problem with Proto.
Any expression template library is going to generate huge symbol names. I don't think there's any way around that, do you? -- Eric Niebler BoostPro Computing http://www.boostpro.com

On 19/09/2012 02:39, Eric Niebler wrote:
On 9/18/2012 1:32 PM, Mathias Gaunard wrote:
On 09/18/2012 10:05 PM, Eric Niebler wrote:
On 9/18/2012 11:42 AM, Manjunath Kudlur wrote:
The thing needed to make Proto more CUDA-friendly is the same thing that is needed to make it AMP-friendly, in case you are familiar with C++ AMP [1]. Basically, you have to intrusively annotate every Proto function with the "__host__ __device__" annotation (restrict(x86, amp) in the case of C++ AMP).
*Every* function in Proto? Or just the ones that build Proto expressions? Or evaluate them? Or some other subset?
The ones to evaluate them. value, proto_base and child_c could be a usable subset.
That's simple enough. But since I'm not a CUDA/AMP guy, I can't do this on my own. I would need someone knowledgeable to submit a patch. <nudge>
I was pondering proposing a talk about our work on this front for C++Now 2013. If you happen to be around, we can sort of have a look at what's needed.

On 09/19/2012 02:39 AM, Eric Niebler wrote:
That's simple enough. But since I'm not a CUDA/AMP guy, I can't do this on my own. I would need someone knowledgeable to submit a patch. <nudge>
I believe that when we were initially working on it, you were against changes to make CUDA happy because it's not a real C++ compiler.
As a side note, it also allows us to have shorter symbol names, which was a bit of a problem with Proto.
Any expression template library is going to generate huge symbol names. I don't think there's any way around that, do you?
cuda::node<
    cuda::tag::plus,
    cuda::node<
        cuda::tag::plus,
        cuda::node<cuda::tag::terminal, int>,
        cuda::node<cuda::tag::terminal, int>
    >,
    cuda::node<cuda::tag::terminal, int>
>

is shorter (and more readable) than

boost::proto::exprns_::basic_expr<
    boost::proto::tagns_::plus,
    boost::proto::argns_::list2<
        boost::proto::exprns_::basic_expr<
            boost::proto::tagns_::plus,
            boost::proto::argns_::list2<
                boost::proto::exprns_::basic_expr<
                    boost::proto::tagns_::terminal,
                    boost::proto::argns_::term<int>
                >,
                boost::proto::exprns_::basic_expr<
                    boost::proto::tagns_::terminal,
                    boost::proto::argns_::term<int>
                >
            >
        >,
        boost::proto::exprns_::basic_expr<
            boost::proto::tagns_::terminal,
            boost::proto::argns_::term<int>
        >
    >
>

This is just a + b + c and proto's type is already almost 3 times larger (arguably in great part because of long namespace names). I think proto-11 is going in the right direction by making things variadic, but I don't think it's avoiding instantiating boost.proto-specific expression types yet.

*Every* function in Proto? Or just the ones that build Proto expressions? Or evaluate them? Or some other subset?
The ones to evaluate them. value, proto_base and child_c could be a usable subset.
I have attempted this once before. I remember parts of Boost.Fusion needed the annotation too. Manjunath
Older CUDA versions were a bit broken, and easily had internal compiler errors. At MetaScale we actually translate Proto expressions to our custom expression types for CUDA to avoid this, but maybe straight Proto could work well enough now. As a side note, it also allows us to have shorter symbol names, which was a bit of a problem with Proto.

On 09/18/2012 08:20 PM, Eric Niebler wrote:
On 9/18/2012 11:00 AM, Mathias Gaunard wrote:
This is the reason why I wrote the Boost.Compute lambda library. Basically it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. “input[i] * sqrt(input[i]) + 4)”) which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code closer to C++ that still can be executed through OpenCL.
From your description, it looks like you've reinvented the wheel there, causing needless limitations and interoperability problems for users.
It could have just been done by serializing arbitrary Proto transforms to C99, with extension points for custom tags.
With CUDA, you'd actually have hit the problem that the Proto functions are not marked __device__, but with OpenCL it doesn't matter.
Mathias, Could you say more about what is needed to make Proto CUDA-friendly? I'm not familiar with CUDA.
In CUDA, the same function can exist on the host (the CPU), the device (the GPU), or both. By default, unfortunately, it only exists on the host. So to be able to call certain functions from a kernel (which is on the device), those functions need to be marked as __device__ or __host__ __device__. There is the macro BOOST_GPU_ENABLED for this. I haven't tried this in a long time though; I wonder how this interacts with __attribute__((always_inline)).
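A small sketch of that annotation pattern, assuming a Boost version that ships BOOST_GPU_ENABLED (it expands to __host__ __device__ under nvcc and to nothing otherwise); the function and kernel names below are made up, and the code must be compiled with nvcc.

#include <boost/config.hpp>

// callable both from host code and from CUDA device code
BOOST_GPU_ENABLED inline float axpy(float a, float x, float y)
{
    return a * x + y;
}

// a trivial CUDA kernel that calls the annotated function on the device
__global__ void axpy_kernel(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = axpy(a, x[i], y[i]);
}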

On Tue, Sep 18, 2012 at 9:28 AM, Kyle Lutz <kyle.r.lutz@gmail.com> wrote:
Thanks for all the comments and feedback so far! I’ve written up answers to your questions below (which should serve as a good start for a FAQ for the library). Please let me know if anything is not clear or if I forgot to answer your question.
*** Where can I find the code and/or documentation? ***
I have not yet made the code publicly available. I still want to clean up a few things and improve the documentation a fair bit before releasing it. This e-mail was just to gauge the interest of the Boost community in this type of library (and it seems to be positive :-)).
As long as I find some free time it should only take a week or so to get the code online. I will notify the list when I do so.
*** Why not write as a back-end for Thrust? ***
It would not be possible to provide the same API that Thrust expects for OpenCL. The fundamental reason is that functions/functors passed to Thrust algorithms are actual compiled C++ functions whereas for Boost.Compute these form expression objects which are then translated into C99 code which is then compiled for OpenCL.
*** Why not target CUDA and/or support multiple back-ends? ***
CUDA and OpenCL are two very different technologies. OpenCL works by compiling C99 code at run-time to generate kernel objects which can then be executed on the GPU. CUDA, on the other hand, works by compiling its kernels using a special compiler (nvcc) which then produces binaries which can be executed on the GPU.
OpenCL already has multiple implementations which allow it to be used on a variety of platforms (e.g. NVIDIA GPUs, Intel CPUs, etc.). I feel that adding another abstraction level within Boost.Compute would only complicate and bloat the library.
*** Is it possible to use ordinary C++ functions/functors or C++11 lambdas with Boost.Compute? ***
Unfortunately no. OpenCL relies on having C99 source code available at run-time in order to execute code on the GPU. Thus compiled C++ functions or C++11 lambdas cannot simply be passed to the OpenCL environment to be executed on the GPU.
Using a DSL to specify a function object, and transforming that to a C99 string to pass down to the OpenCL driver, is a nice idea. But it gets really ugly when the function object is any more complex than _1+_2. I tried this once before; look here: https://github.com/keveman/carbon/blob/master/examples/thrust/discrete_voron... I wouldn't want to write my function objects like that.
This is the reason why I wrote the Boost.Compute lambda library. Basically it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. “input[i] * sqrt(input[i]) + 4)”) which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code closer to C++ that still can be executed through OpenCL.
*** Does the API support data-streaming operations? ***
Yes it does. Though, as a few people pointed out, the example I provided does not show this. Each line of code in the example will be executed in serial and thus will not take advantage of the GPU’s ability to transfer data and perform computations simultaneously. The Boost.Compute STL API does support this but it requires a bit more setup from the user. All of the algorithms take an optional command_queue parameter that serves as a place for them to issue their instructions. The default case (when no command_queue is specified) is for the algorithm to create a command_queue for itself, issue its instructions, and then wait for completion (i.e. a synchronous operation).
The example can be made more efficient (though slightly more complex) as follows:
// create command queue
command_queue queue(context, device);

// copy to device, sort, and copy back to host
copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);
sort(device_vector.begin(), device_vector.end(), queue);
copy(device_vector.begin(), device_vector.end(), host_vector.begin(), queue);

// wait for all above operations to complete
queue.finish();
*** Does the Boost.Compute API inter-operate with the OpenCL C API? ***
Yes. I have designed the C++ wrapper API to be as unobtrusive as possible. All the functionality available in the OpenCL C API will also be available via the Boost.Compute C++ API. In fact, the C++ wrapped classes all have conversion operators to their underlying OpenCL types so that they can be passed directly to OpenCL functions:
// create context object
boost::compute::context ctx = boost::compute::default_context();

// query number of devices using the OpenCL C API
cl_uint num_devices;
clGetContextInfo(ctx, CL_CONTEXT_NUM_DEVICES, sizeof(cl_uint), &num_devices, 0);
std::cout << "num_devices: " << num_devices << std::endl;
*** How is the performance? ***
As of now many of the Boost.Compute algorithms are not ready for production code (at least performance-wise). I have focused the majority of my time on getting the API stable and functional as well as implementing a comprehensive test-suite. In fact, a few of the algorithms are still implemented serially. Over time these will be improved and the library will become competitive with other GPGPU libraries. On that note, if anyone has OpenCL/CUDA code that implements any of the STL algorithms and can be released under the Boost Software License, I'd love to hear from you.
Thanks, Kyle

On 09/18/2012 08:15 PM, Manjunath Kudlur wrote:
Using a DSL to specify a function object, and transforming that to a C99 string to pass down to the OpenCL driver, is a nice idea. But it gets really ugly when the function object is any more complex than _1+_2. I tried this once before; look here: https://github.com/keveman/carbon/blob/master/examples/thrust/discrete_voron... I wouldn't want to write my function objects like that.
IMO DSELs are only a good idea if your language is high-level. If you're using straight C++ rewritten with a convoluted syntax, there is little advantage to it.
participants (17)

- Adam Wulkiewicz
- Andrey Semashev
- Beren Minor
- Christof Donat
- Eric Niebler
- Felipe Magno de Almeida
- GAURAV GUPTA
- Heiko Bauke
- Joel Falcou
- Karsten Ahnert
- Klaim - Joël Lamotte
- Kyle Lutz
- Manjunath Kudlur
- Mario Mulansky
- Mathias Gaunard
- Matus Chochlik
- Thomas Heller