Hi,

I have no experience with OpenCL or GPU computing in general, so bear with me if my questions sound silly. I have a few questions regarding Boost.Compute:

1. When you define a kernel (e.g. with the BOOST_COMPUTE_FUNCTION macro), is this kernel supposed to be in C? Can it reference global (namespace scope) objects and other functions? Other kernels?

2. When is the kernel compiled and uploaded to the device? Is it possible to cache and reuse the compiled kernel?

3. Why is the library not thread-safe by default? I'd say, we're long past single-threaded systems now, and having to always define the config macro is a nuisance.

4. Is it possible to upload the data to process to the device's local memory from a user-provided buffer, without copying it to boost::compute::vector? Same for downloading. What I'd like to do is move some of the data processing to the GPU while the rest is performed on the CPU (possibly with other libraries), and avoid excessive copying.

5. Is it possible to pass buffers in the device-local memory between different processes (on the CPU) without downloading/uploading data to/from the CPU memory?

6. Is it possible to discover device capabilities? E.g. the amount of local memory (total/used/free), execution units, vendor and device name?

Thanks.
On 23.12.2014 10:20, Andrey Semashev wrote:
6. Is it possible to discover device capabilities? E.g. the amount of local memory (total/used/free), execution units, vendor and device name?
Yes, see http://kylelutz.github.io/compute/boost/compute/device.html

Iterate over boost::compute::system::devices(), which returns a vector of all the devices across all available OpenCL platforms, and examine the properties you want.
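For example, a minimal (untested) sketch using the documented accessors (the range-for assumes a C++11 compiler; Boost.Compute itself only needs C++03):

#include <iostream>
#include <boost/compute/core.hpp>

namespace compute = boost::compute;

int main()
{
    // enumerate every device on every available OpenCL platform
    for(const compute::device &device : compute::system::devices()){
        std::cout << device.name() << " (" << device.vendor() << "): "
                  << device.compute_units() << " compute units, "
                  << device.local_memory_size() << " bytes of local memory"
                  << std::endl;
    }
    return 0;
}

Cheers
- Asbjørn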
On Tue, Dec 23, 2014 at 1:20 AM, Andrey Semashev wrote:
Hi,
I have no experience with OpenCL or GPU computing in general, so bear with me if my questions sound silly. I have a few questions regarding Boost.Compute:
1. When you define a kernel (e.g. with the BOOST_COMPUTE_FUNCTION macro), is this kernel supposed to be in C? Can it reference global (namespace scope) objects and other functions? Other kernels?
Yes, the source code for OpenCL kernels and functions is specified in OpenCL C, which is a dialect of C99 with extensions for vectorized operations.

There are a few ways to specify kernel functions which reference global C++ values. One is the BOOST_COMPUTE_CLOSURE() macro [1], which works similarly to BOOST_COMPUTE_FUNCTION() but also allows a lambda-like capture list of C++ values. Another option is to specify your function with extra arguments for the global objects and then bind them to the function with boost::compute::bind() [2].
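For example (a rough, untested sketch; the names add_four and add_y are just for illustration):

#include <boost/compute/core.hpp>
#include <boost/compute/algorithm/fill.hpp>
#include <boost/compute/algorithm/transform.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/closure.hpp>
#include <boost/compute/function.hpp>

namespace compute = boost::compute;

// the body is OpenCL C; it is stringized, not compiled by the host compiler
BOOST_COMPUTE_FUNCTION(float, add_four, (float x),
{
    return x + 4;
});

int main()
{
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    float y = 2.0f;

    // a closure capturing the C++ variable 'y'
    BOOST_COMPUTE_CLOSURE(float, add_y, (float x), (y),
    {
        return x + y;
    });

    compute::vector<float> vec(16, context);
    compute::fill(vec.begin(), vec.end(), 1.0f, queue);
    compute::transform(vec.begin(), vec.end(), vec.begin(), add_four, queue);
    compute::transform(vec.begin(), vec.end(), vec.begin(), add_y, queue);

    return 0;
}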
2. When is the kernel compiled and uploaded to the device? Is it possible to cache and reuse the compiled kernel?
If writing a custom kernel, the kernel is built when the "program::build()" method is called. Internally, the higher-level algorithms compile programs when they're needed and store them in a global program cache.

And yes, compiled program and kernel objects can be stored and re-used (this is strongly recommended). Boost.Compute provides the program_cache class [3] which stores frequently used programs as compiled objects.
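A custom-kernel sketch of that flow (untested; the "square" kernel is just for illustration):

#include <boost/compute/core.hpp>
#include <boost/compute/utility/source.hpp>

namespace compute = boost::compute;

const char source[] = BOOST_COMPUTE_STRINGIZE_SOURCE(
    __kernel void square(__global float *data)
    {
        const uint i = get_global_id(0);
        data[i] = data[i] * data[i];
    }
);

int main()
{
    compute::device device = compute::system::default_device();
    compute::context context(device);

    // compilation for the device happens here
    compute::program program =
        compute::program::create_with_source(source, context);
    program.build();

    // the kernel object can be kept around and re-used for many launches
    compute::kernel kernel = program.create_kernel("square");

    return 0;
}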
3. Why is the library not thread-safe by default? I'd say, we're long past single-threaded systems now, and having to always define the config macro is a nuisance.
I would very much like to have it thread-safe by default. This is a problem, however, with keeping the library header-only and usable with C++03 compilers. The BOOST_COMPUTE_THREAD_SAFE macro basically just instructs Boost.Compute to use the C++11 "thread_local" specifier for global objects instead of "static". With C++03 compilers, this will use boost::thread_specific_ptr<>, which then requires users to also link to Boost.Thread.

That said, I still don't think it's ideal and I am very open to ideas/patches which improve this.
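In the meantime, enabling the current thread-safe mode is just a define before including the library:

// must be defined before any Boost.Compute header is included;
// with C++03 compilers this also requires linking to Boost.Thread
#define BOOST_COMPUTE_THREAD_SAFE
#include <boost/compute.hpp>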
4. Is it possible to upload the data to process to the device's local memory from a user-provided buffer, without copying it to boost::compute::vector? Same for downloading. What I'd like to do is move some of the data processing to the GPU while the rest is performed on the CPU (possibly with other libraries), and avoid excessive copying.
Yes, that is what the mapped_view class [4] is for. It maps a region of host-memory to device-memory and provides a std::vector-like interface on top of it so it may be used with Boost.Compute algorithms or custom kernels.
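For example (an untested sketch; host-side synchronization after the algorithm is omitted for brevity):

#include <vector>
#include <boost/compute/core.hpp>
#include <boost/compute/algorithm/sort.hpp>
#include <boost/compute/container/mapped_view.hpp>

namespace compute = boost::compute;

int main()
{
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // user-provided host buffer, no boost::compute::vector involved
    std::vector<float> host_data(1024, 1.0f);

    // map the host memory for use by the device
    compute::mapped_view<float> view(
        host_data.data(), host_data.size(), context
    );

    // algorithms can operate on the mapped memory directly
    compute::sort(view.begin(), view.end(), queue);

    return 0;
}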
5. Is it possible to pass buffers in the device-local memory between different processes (on the CPU) without downloading/uploading data to/from the CPU memory?
This is not supported by OpenCL (at least not in any standard or portable way). Memory buffers belong to OpenCL contexts, and contexts are created per-process without any mechanism to share them with other processes. If anyone has any experience/ideas with sharing OpenCL contexts between processes, I'd be very interested in trying to get this working.
6. Is it possible to discover device capabilities? E.g. the amount of local memory (total/used/free), execution units, vendor and device name?
Yes, the device class [5] provides a number of methods for returning information about the device, including the generic get_info() function. Specifically for the cases you listed, you could use:

* Local memory: device.local_memory_size()
* Execution units: device.compute_units()
* Vendor name: device.vendor()
* Device name: device.name()

Thanks for the questions. Let me know if I can explain anything better.

-kyle

[1] http://kylelutz.github.io/compute/BOOST_COMPUTE_CLOSURE.html
[2] http://kylelutz.github.io/compute/boost/compute/bind.html
[3] http://kylelutz.github.io/compute/boost/compute/program_cache.html
[4] http://kylelutz.github.io/compute/boost/compute/mapped_view.html
[5] http://kylelutz.github.io/compute/boost/compute/device.html
On Tue, Dec 23, 2014 at 7:29 PM, Kyle Lutz wrote:
On Tue, Dec 23, 2014 at 1:20 AM, Andrey Semashev wrote:
1. When you define a kernel (e.g. with the BOOST_COMPUTE_FUNCTION macro), is this kernel supposed to be in C? Can it reference global (namespace scope) objects and other functions? Other kernels?
Yes, the source code for OpenCL kernels and functions is specified in OpenCL C, which is a dialect of C99 with extensions for vectorized operations.
Does this mean that the compiler has to support OpenCL in order to be able to use Boost.Compute? Or its specific features? If yes, can this be mentioned in the docs (with the list of the affected features, if possible)?

Also, I don't quite understand how the kernel source code which I supply to BOOST_COMPUTE_FUNCTION is then compiled into a kernel. Is this source code just stringized and not actually compiled when the application is built?
There are a few ways to specify kernel functions which reference global C++ values. One is the BOOST_COMPUTE_CLOSURE() macro [1], which works similarly to BOOST_COMPUTE_FUNCTION() but also allows a lambda-like capture list of C++ values.
2. When is the kernel compiled and uploaded to the device? Is it possible to cache and reuse the compiled kernel?
If writing a custom kernel, the kernel is built when the "program::build()" method is called. Internally, the higher-level algorithms compile programs when they're needed and store them in a global program cache.
And yes, compiled program and kernel objects can be stored and re-used (this is strongly recommended). Boost.Compute provides the program_cache class [3] which stores frequently used programs as compiled objects.
So, e.g. a kernel defined with BOOST_COMPUTE_FUNCTION will be compiled when first used, and then saved in some global program_cache, is that correct? Also, captured arguments of BOOST_COMPUTE_CLOSURE will be evaluated only once, when the kernel is built?
3. Why is the library not thread-safe by default? I'd say, we're long past single-threaded systems now, and having to always define the config macro is a nuisance.
I would very much like to have it thread-safe by default. This is a problem, however, with keeping the library header-only and usable with C++03 compilers. The BOOST_COMPUTE_THREAD_SAFE macro basically just instructs Boost.Compute to use the C++11 "thread_local" specifier for global objects instead of "static". With C++03 compilers, this will use boost::thread_specific_ptr<>, which then requires users to also link to Boost.Thread.
That said, I still don't think it's ideal and I am very open to ideas/patches which improve this.
Personally, I see no big problem with a dependency on Boost.Thread in C++03. However, it is quite possible to use system APIs to implement TLS in a header-only library.

On POSIX systems it is quite trivial with pthread_once and the pthread_key* API. On Windows you can use the Interlocked* functions or Boost.Atomic to implement something similar to pthread_once, and the Tls* functions for the TLS itself. The tricky part is the TLS cleanup, which can be done with the help of the Windows thread pool. You can use RegisterWaitForSingleObject to schedule a wait operation on the handle of the thread that sets the thread-local value. When the thread exits, the pool will invoke the callback you passed to RegisterWaitForSingleObject, where you can clean up the TLS value. The important difference from thread_local and Boost.Thread is that the callback is called in a thread different from the one that initialized the TLS value, but for various cleanup routines this should not matter.

You can see how it's done in Boost.Sync: https://github.com/boostorg/sync/blob/develop/include/boost/sync/detail/wait...
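To illustrate, here is a minimal (untested) sketch of the POSIX side; the name tls_slot is hypothetical, and the Windows side with RegisterWaitForSingleObject is considerably more involved:

#include <pthread.h>

// header-only thread-local slot built from pthread_once + pthread_key;
// the per-thread value is destroyed by the key's destructor callback
// when the owning thread exits
template<typename T>
struct tls_slot
{
    static T* get()
    {
        pthread_once(&once_, &tls_slot::make_key);
        T* value = static_cast<T*>(pthread_getspecific(key_));
        if (!value)
        {
            value = new T();
            pthread_setspecific(key_, value);
        }
        return value;
    }

private:
    static void make_key()
    {
        pthread_key_create(&key_, &tls_slot::destroy);
    }

    static void destroy(void* p)
    {
        delete static_cast<T*>(p);
    }

    static pthread_once_t once_;
    static pthread_key_t key_;
};

template<typename T> pthread_once_t tls_slot<T>::once_ = PTHREAD_ONCE_INIT;
template<typename T> pthread_key_t tls_slot<T>::key_;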
On Tue, Dec 23, 2014 at 12:55 PM, Andrey Semashev wrote:
On Tue, Dec 23, 2014 at 7:29 PM, Kyle Lutz wrote:
On Tue, Dec 23, 2014 at 1:20 AM, Andrey Semashev wrote:
1. When you define a kernel (e.g. with the BOOST_COMPUTE_FUNCTION macro), is this kernel supposed to be in C? Can it reference global (namespace scope) objects and other functions? Other kernels?
Yes, the source code for OpenCL kernels and functions is specified in OpenCL C, which is a dialect of C99 with extensions for vectorized operations.
Does this mean that the compiler has to support OpenCL in order to be able to use Boost.Compute? Or its specific features? If yes, can this be mentioned in the docs (with the list of the affected features, if possible)?
No, Boost.Compute does not require any special compiler or compiler extensions. It will work with all standards-conforming C++03 and later compilers.
Also, I don't quite understand how the kernel source code which I supply to BOOST_COMPUTE_FUNCTION is then compiled into a kernel. Is this source code just stringized and not actually compiled when the application is built?
Yes, the source argument for BOOST_COMPUTE_FUNCTION() is stringized and then inserted into an OpenCL program when "invoked" by an algorithm. And you're right, the function source is not compiled by the host compiler, though the function signature itself is, which gives us some degree of type-safety.
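Conceptually, it ends up close to what make_function_from_source() does explicitly:

#include <boost/compute/function.hpp>

namespace compute = boost::compute;

// roughly what BOOST_COMPUTE_FUNCTION(float, add_four, (float x),
// { return x + 4; }) boils down to: the body is carried around as a
// string and compiled only by the OpenCL runtime, while the C++
// signature provides host-side type-safety
compute::function<float(float)> add_four =
    compute::make_function_from_source<float(float)>(
        "add_four",
        "float add_four(float x) { return x + 4; }"
    );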
There are a few ways to specify kernel functions which reference global C++ values. One is the BOOST_COMPUTE_CLOSURE() macro [1], which works similarly to BOOST_COMPUTE_FUNCTION() but also allows a lambda-like capture list of C++ values.
2. When is the kernel compiled and uploaded to the device? Is it possible to cache and reuse the compiled kernel?
If writing a custom kernel, the kernel is built when the "program::build()" method is called. Internally, the higher-level algorithms compile programs when they're needed and store them in a global program cache.
And yes, compiled program and kernel objects can be stored and re-used (this is strongly recommended). Boost.Compute provides the program_cache class [3] which stores frequently used programs as compiled objects.
So, e.g. a kernel defined with BOOST_COMPUTE_FUNCTION will be compiled when first used, and then saved in some global program_cache, is that correct? Also, captured arguments of BOOST_COMPUTE_CLOSURE will be evaluated only once, when the kernel is built?
Yeah, the algorithms in Boost.Compute will create a program with the function's source and then store it in the global program cache for later use.

And captured values with BOOST_COMPUTE_CLOSURE() are stored by reference and are updated if the corresponding C++ values change. Currently, changing captured values will cause a kernel re-compilation. I'm working on improving this to avoid the re-compilation and simply pass the new values to the kernel.
3. Why is the library not thread-safe by default? I'd say, we're long past single-threaded systems now, and having to always define the config macro is a nuisance.
I would very much like to have it thread-safe by default. This is a problem, however, with keeping the library header-only and usable with C++03 compilers. The BOOST_COMPUTE_THREAD_SAFE macro basically just instructs Boost.Compute to use the C++11 "thread_local" specifier for global objects instead of "static". With C++03 compilers, this will use boost::thread_specific_ptr<>, which then requires users to also link to Boost.Thread.
That said, I still don't think it's ideal and I am very open to ideas/patches which improve this.
Personally, I see no big problem with a dependency on Boost.Thread in C++03. However, it is quite possible to use system APIs to implement TLS in a header-only library.
On POSIX systems it is quite trivial with pthread_once and the pthread_key* API. On Windows you can use the Interlocked* functions or Boost.Atomic to implement something similar to pthread_once, and the Tls* functions for the TLS itself. The tricky part is the TLS cleanup, which can be done with the help of the Windows thread pool. You can use RegisterWaitForSingleObject to schedule a wait operation on the handle of the thread that sets the thread-local value. When the thread exits, the pool will invoke the callback you passed to RegisterWaitForSingleObject, where you can clean up the TLS value. The important difference from thread_local and Boost.Thread is that the callback is called in a thread different from the one that initialized the TLS value, but for various cleanup routines this should not matter.
You can see how it's done in Boost.Sync:
https://github.com/boostorg/sync/blob/develop/include/boost/sync/detail/wait...
I personally don't see an issue with depending on Boost.Thread either, but this does prevent the library from being header-only. I'll take a look at your example and see if that can be worked into Boost.Compute. Thanks!

-kyle