
Thanks for all the comments and feedback so far! I've written up answers to your questions below (which should serve as a good start for a FAQ for the library). Please let me know if anything is not clear or if I forgot to answer your question.

*** Where can I find the code and/or documentation? ***

I have not yet made the code publicly available. I still want to clean up a few things and improve the documentation a fair bit before releasing it. This e-mail was just to gauge the interest of the Boost community in this type of library (and it seems to be positive :-)). As long as I find some free time, it should only take a week or so to get the code online. I will notify the list when I do.

*** Why not write it as a back-end for Thrust? ***

It would not be possible to provide the API that Thrust expects on top of OpenCL. The fundamental reason is that the functions/functors passed to Thrust algorithms are actual compiled C++ functions, whereas in Boost.Compute they are expression objects which are translated into C99 source code and then compiled at run-time for OpenCL.

*** Why not target CUDA and/or support multiple back-ends? ***

CUDA and OpenCL are two very different technologies. OpenCL works by compiling C99 source code at run-time to generate kernel objects which can then be executed on the GPU. CUDA, on the other hand, works by compiling its kernels ahead of time with a special compiler (nvcc) which produces binaries that can be executed on the GPU. Furthermore, OpenCL already has multiple implementations which allow it to be used on a variety of platforms (e.g. NVIDIA GPUs, Intel CPUs). I feel that adding another abstraction level within Boost.Compute would only complicate and bloat the library.

*** Is it possible to use ordinary C++ functions/functors or C++11 lambdas with Boost.Compute? ***

Unfortunately, no. OpenCL relies on having C99 source code available at run-time in order to execute code on the GPU, so compiled C++ functions or C++11 lambdas cannot simply be passed to the OpenCL environment for execution. This is the reason I wrote the Boost.Compute lambda library. Essentially, it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. "input[i] * sqrt(input[i]) + 4") which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code that is much closer to C++ and can still be executed through OpenCL.
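For example, applying the expression above to a vector on the device looks roughly like this (a minimal sketch; the exact namespace of the placeholders and the algorithm signatures may still change before I release the code):

// bring in the lambda placeholder (namespace is subject to change)
using boost::compute::lambda::_1;

// compute v * sqrt(v) + 4 for each value v in device_vector
// (device_vector is a device-side container, as in the example below);
// sqrt() here is the lambda library's wrapper, not std::sqrt
transform(device_vector.begin(),
          device_vector.end(),
          device_vector.begin(),
          _1 * sqrt(_1) + 4);

Behind the scenes the expression is translated into the C99 fragment shown above and compiled through OpenCL.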
*** Does the API support data-streaming operations? ***

Yes, it does, though, as a few people pointed out, the example I provided does not show this. Each line of code in the example executes serially and thus does not take advantage of the GPU's ability to transfer data and perform computations simultaneously. The Boost.Compute STL-style API does support this, but it requires a bit more setup from the user. All of the algorithms take an optional command_queue parameter that serves as the place where they issue their instructions. In the default case (when no command_queue is specified), the algorithm creates a command_queue for itself, issues its instructions, and then waits for completion (i.e. a synchronous operation).

The example can be made more efficient (though slightly more complex) as follows:

// create command queue
command_queue queue(context, device);

// copy to device, sort, and copy back to host
copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);
sort(device_vector.begin(), device_vector.end(), queue);
copy(device_vector.begin(), device_vector.end(), host_vector.begin(), queue);

// wait for all of the above operations to complete
queue.finish();

*** Does the Boost.Compute API inter-operate with the OpenCL C API? ***

Yes. I have designed the C++ wrapper API to be as unobtrusive as possible. All of the functionality available in the OpenCL C API will also be available via the Boost.Compute C++ API. In fact, the C++ wrapper classes all have conversion operators to their underlying OpenCL types so that they can be passed directly to OpenCL functions:

// create context object
boost::compute::context ctx = boost::compute::default_context();

// query the number of devices using the OpenCL C API
cl_uint num_devices;
clGetContextInfo(ctx, CL_CONTEXT_NUM_DEVICES, sizeof(cl_uint), &num_devices, 0);
std::cout << "num_devices: " << num_devices << std::endl;

*** How is the performance? ***

As of now, many of the Boost.Compute algorithms are not ready for production use (at least performance-wise). I have focused the majority of my time on getting the API stable and functional, as well as on implementing a comprehensive test suite. In fact, a few of the algorithms are still implemented serially. Over time these will be improved and the library will become competitive with other GPGPU libraries. On that note, if anyone has OpenCL/CUDA code that implements any of the STL algorithms and can be released under the Boost Software License, I'd love to hear from you.

Thanks,
Kyle