
Thanks for all the comments and feedback so far! I've written up answers to your questions below (which should serve as a good start for a FAQ for the library). Please let me know if anything is not clear or if I forgot to answer your question.

*** Where can I find the code and/or documentation? ***

I have not yet made the code publicly available. I still want to clean up a few things and improve the documentation a fair bit before releasing it. This e-mail was just to gauge the interest of the Boost community in this type of library (and it seems to be positive :-)). As long as I find some free time, it should only take a week or so to get the code online. I will notify the list when I do.

*** Why not write it as a back-end for Thrust? ***

It would not be possible to provide the API that Thrust expects on top of OpenCL. The fundamental reason is that the functions/functors passed to Thrust algorithms are actual compiled C++ functions, whereas in Boost.Compute they are expression objects which are translated into C99 source code and then compiled at run-time for OpenCL.

*** Why not target CUDA and/or support multiple back-ends? ***

CUDA and OpenCL are two very different technologies. OpenCL works by compiling C99 source code at run-time to generate kernel objects which can then be executed on the GPU. CUDA, on the other hand, works by compiling its kernels ahead of time with a special compiler (nvcc) which produces binaries that can be executed on the GPU. Furthermore, OpenCL already has multiple implementations which allow it to be used on a variety of platforms (e.g. NVIDIA GPUs, Intel CPUs). I feel that adding another abstraction level within Boost.Compute would only complicate and bloat the library.

*** Is it possible to use ordinary C++ functions/functors or C++11 lambdas with Boost.Compute? ***

Unfortunately, no. OpenCL relies on having C99 source code available at run-time in order to execute code on the GPU, so compiled C++ functions or C++11 lambdas cannot simply be passed to the OpenCL environment for execution. This is the reason I wrote the Boost.Compute lambda library. Essentially, it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. "input[i] * sqrt(input[i]) + 4") which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code that is much closer to C++ and can still be executed through OpenCL.
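For example, applying the expression above to a vector on the device looks roughly like this (a minimal sketch; the exact namespace of the placeholders and the algorithm signatures may still change before I release the code):

// bring in the lambda placeholder (namespace is subject to change)
using boost::compute::lambda::_1;

// compute v * sqrt(v) + 4 for each value v in device_vector
// (device_vector is a device-side container, as in the example below);
// sqrt() here is the lambda library's wrapper, not std::sqrt
transform(device_vector.begin(),
          device_vector.end(),
          device_vector.begin(),
          _1 * sqrt(_1) + 4);

Behind the scenes the expression is translated into the C99 fragment shown above and compiled through OpenCL.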
*** Does the API support data-streaming operations? ***

Yes, it does, though, as a few people pointed out, the example I provided does not show this. Each line of code in the example executes serially and thus does not take advantage of the GPU's ability to transfer data and perform computations simultaneously. The Boost.Compute STL-style API does support this, but it requires a bit more setup from the user. All of the algorithms take an optional command_queue parameter that serves as the place where they issue their instructions. In the default case (when no command_queue is specified), the algorithm creates a command_queue for itself, issues its instructions, and then waits for completion (i.e. a synchronous operation).

The example can be made more efficient (though slightly more complex) as follows:

// create command queue
command_queue queue(context, device);

// copy to device, sort, and copy back to host
copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);
sort(device_vector.begin(), device_vector.end(), queue);
copy(device_vector.begin(), device_vector.end(), host_vector.begin(), queue);

// wait for all of the above operations to complete
queue.finish();

*** Does the Boost.Compute API inter-operate with the OpenCL C API? ***

Yes. I have designed the C++ wrapper API to be as unobtrusive as possible. All of the functionality available in the OpenCL C API will also be available via the Boost.Compute C++ API. In fact, the C++ wrapper classes all have conversion operators to their underlying OpenCL types so that they can be passed directly to OpenCL functions:

// create context object
boost::compute::context ctx = boost::compute::default_context();

// query the number of devices using the OpenCL C API
cl_uint num_devices;
clGetContextInfo(ctx, CL_CONTEXT_NUM_DEVICES, sizeof(cl_uint), &num_devices, 0);
std::cout << "num_devices: " << num_devices << std::endl;

*** How is the performance? ***

As of now, many of the Boost.Compute algorithms are not ready for production use (at least performance-wise). I have focused the majority of my time on getting the API stable and functional, as well as on implementing a comprehensive test suite. In fact, a few of the algorithms are still implemented serially. Over time these will be improved and the library will become competitive with other GPGPU libraries. On that note, if anyone has OpenCL/CUDA code that implements any of the STL algorithms and can be released under the Boost Software License, I'd love to hear from you.

Thanks,
Kyle