[compute] kernels as strings impair readability and maintainability
Hi,

While reading through the code of Boost.Compute to see what it does and how it does it, I often found that the approach used by the library of putting all OpenCL kernels inside of strings was an annoying limitation and made them quite difficult to reason about, much less debug or maintain. This has a negative effect on the effort needed to contribute to the library.

Has separate compilation been considered? Put the OpenCL code into .cl files, and let the build system do whatever is needed to transform them into a form that can be executed. This would also make it easier to eventually provide a CUDA backend.
On Tue, Dec 23, 2014 at 9:02 AM, Mathias Gaunard wrote:
Hi,
While reading through the code of Boost.Compute to see what it does and how it does it, I often found that the approach used by the library of putting all OpenCL kernels inside of strings was an annoying limitation and made them quite difficult to reason about, much less debug or maintain. This has a negative effect on the effort needed to contribute to the library.
While yes, it does make developing Boost.Compute itself a bit more complex, it also gives us much greater flexibility. For instance, we can dynamically build programs at run-time by combining algorithmic skeletons (such as reduce or scan) with custom user-defined reduction functions and produce optimized kernels for the actual platform that executes the code (which in fact can be dramatically different hardware than where Boost.Compute itself was compiled). It also allows us to automatically tune algorithm parameters for the actual hardware present at run-time (and to execute current algorithms as efficiently as possible on future hardware platforms by re-tuning and scaling up parameters, all without any recompilation). It also allows us to generate fully specialized kernels at run-time based on dynamic input or user configuration (imagine user-created filter pipelines in Photoshop or custom database queries in PGSQL). I think this flexibility is well worth the added complexity, and it fits naturally with OpenCL's JIT-like programming model.
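To illustrate what this run-time composition looks like from the user's side, here is a minimal, untested sketch using the public Boost.Compute API (the BOOST_COMPUTE_FUNCTION macro and the reduce() overload taking a custom binary function, as documented):

    #include <vector>
    #include <boost/compute.hpp>

    namespace compute = boost::compute;

    int main()
    {
        compute::device device = compute::system::default_device();
        compute::context context(device);
        compute::command_queue queue(context, device);

        // a user-defined binary function, supplied as OpenCL source at run-time
        BOOST_COMPUTE_FUNCTION(float, plus_abs, (float x, float y),
        {
            return fabs(x) + fabs(y);
        });

        std::vector<float> host = { 1.0f, -2.0f, 3.0f, -4.0f };
        compute::vector<float> device_vec(host.begin(), host.end(), queue);

        // the reduce skeleton is combined with plus_abs, and a specialized
        // kernel is generated and compiled for the device present at run-time
        float result = 0;
        compute::reduce(device_vec.begin(), device_vec.end(), &result,
                        plus_abs, queue);

        return 0;
    }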
Has separate compilation been considered? Put the OpenCL code into .cl files, and let the build system do whatever is needed to transform them into a form that can be executed.
Compiling programs to binaries and then later loading them from disk is supported by Boost.Compute (and is in fact used to implement the offline kernel caching infrastructure). However, for the reasons I mentioned before, this mode is not used exclusively in Boost.Compute and the algorithms are mainly implemented in terms of the run-time program creation and compilation model.

Another concern is that Boost.Compute is a header-only library and doesn't control the build system or how the library will be loaded. This limits our ability to pre-compile certain programs and "install" them for later use by the library.

That said, I am very interested in exploring methods for integrating OpenCL source files built by the build tool-chain and making loading and executing them seamless with the rest of Boost.Compute. One approach I have for this is an "extern_function<>" class which works like "boost::compute::function<>", but instead of being specified with a string at run-time, its object code is loaded from a pre-compiled OpenCL binary on disk. I've also been exploring a clang-plugin-based approach to simplify embedding OpenCL code in C++ and using it together with the Boost.Compute algorithms.

There is certainly room for improvement, and I'd be very happy to collaborate with anyone interested in this sort of work.

-kyle
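As an untested sketch of how those two paths look through the boost::compute::program wrapper (build_with_source() and create_with_binary(), as documented):

    #include <string>
    #include <vector>
    #include <boost/compute/context.hpp>
    #include <boost/compute/program.hpp>

    namespace compute = boost::compute;

    // build from source at run-time, or load a previously compiled binary
    compute::program load_or_build(const std::string &source,
                                   const std::vector<unsigned char> &binary,
                                   const compute::context &context)
    {
        if(!binary.empty()){
            // e.g. a blob read back from the offline kernel cache
            compute::program program = compute::program::create_with_binary(
                binary.data(), binary.size(), context);
            program.build();
            return program;
        }
        // the default path: the OpenCL driver compiles the source string
        return compute::program::build_with_source(source, context);
    }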
On 23.12.2014 20:21, Kyle Lutz wrote:
Another concern is that Boost.Compute is a header-only library and doesn't control the build system or how the library will be loaded. This limits our ability to pre-compile certain programs and "install" them for later use by the library.
Given that Boost.Compute relies on Boost.Chrono, and Boost.Chrono is not a header-only library, I don't see a strong reason why Boost.Compute should stay header-only if that blocks some interesting and/or important features. Just my 2 cents as a library user :)

Cheers
- Asbjørn
On Tue, Dec 23, 2014 at 1:54 PM, Asbjørn wrote:
On 23.12.2014 20:21, Kyle Lutz wrote:
Another concern is that Boost.Compute is a header-only library and doesn't control the build system or how the library will be loaded. This limits our ability to pre-compile certain programs and "install" them for later use by the library.
Given that Boost.Compute relies on Boost.Chrono, and Boost.Chrono is not a header-only library, I don't see a strong reason why Boost.Compute should stay header-only if that blocks some interesting and/or important features.
Just my 2 cents as a library user :)
Boost.Compute only relies on the header-only portion of Boost.Chrono (specifically …) so it remains header-only.
On 23.12.2014 23:26, Kyle Lutz wrote:
Boost.Compute only relies on the header-only portion of Boost.Chrono (specifically …) so it remains header-only.
Hm, then I must have done something wrong: it failed to link here, complaining it couldn't find libboost-chrono-something-something.lib. Once I compiled Chrono (only), it worked fine. I'm using a "private" copy of Boost 1.57 for the Boost dependencies. Using MSVC 2013, if that makes a difference.

- Asbjørn
On Tue, Dec 23, 2014 at 2:56 PM, Asbjørn wrote:
On 23.12.2014 23:26, Kyle Lutz wrote:
Boost.Compute only relies on the header-only portion of Boost.Chrono (specifically …) so it remains header-only.

Hm, then I must have done something wrong: it failed to link here, complaining it couldn't find libboost-chrono-something-something.lib. Once I compiled Chrono (only), it worked fine. I'm using a "private" copy of Boost 1.57 for the Boost dependencies. Using MSVC 2013, if that makes a difference.
Interesting, you may have to define "BOOST_CHRONO_HEADER_ONLY".

-kyle
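That is, something along these lines before any Boost header is included (an untested snippet; the macro can equally be passed as a -D compiler flag):

    // keeps Boost.Chrono header-only, so no libboost_chrono needs to be linked
    #define BOOST_CHRONO_HEADER_ONLY
    #include <boost/compute.hpp>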
On 24.12.2014 00:00, Kyle Lutz wrote:
Interesting, you may have to define "BOOST_CHRONO_HEADER_ONLY".
DOH! Yes, that did the trick. Sorry for the noise; it's been a while since I mucked around with building/configuring Boost. I added BOOST_ALL_NO_LIB and deleted the stage/lib directory just to be sure, and it works purely header-based now. That said, I personally don't have any issues with compiling Boost libraries; it's one of the easiest dependencies I've had to work with.

Cheers
- Asbjørn
On 23 Dec 2014 at 11:21, Kyle Lutz wrote:
For instance, we can dynamically build programs at run-time by combining algorithmic skeletons (such as reduce or scan) with custom user-defined reduction functions and produce optimized kernels for the actual platform that executes the code (which in fact can be dramatically different hardware than where Boost.Compute itself was compiled). It also allows us to automatically tune algorithm parameters for the actual hardware present at run-time (and to execute current algorithms as efficiently as possible on future hardware platforms by re-tuning and scaling up parameters, all without any recompilation). It also allows us to generate fully specialized kernels at run-time based on dynamic input or user configuration (imagine user-created filter pipelines in Photoshop or custom database queries in PGSQL).
Back when I was planning something very like Compute some years ago, I was going to make a C++ metaprogramming-based clang AST manipulator. The idea was that you'd use libclang to hold the OpenCL kernels as an in-memory AST, and you'd write C++ which, when executed, transformed the ASTs, rather like Boost.Python. clang, if I remember, has a full-fat OpenCL-to-LLVM compiler, and better still it works as expected in gdb. I figured it should be possible to integrate a debugger frontend for that so you could breakpoint and debug your C++-as-OpenCL nicely.

It's a much bigger project than yours of course. And one rendered a bit obsolete by the rise of C++ AMP. Still, food for thought.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
On 23/12/2014 20:21, Kyle Lutz wrote:
While yes, it does make developing Boost.Compute itself a bit more complex, it also gives us much greater flexibility.
For instance, we can dynamically build programs at run-time by combining algorithmic skeletons (such as reduce or scan) with custom user-defined reduction functions and produce optimized kernels for the actual platform that executes the code (which in fact can be dramatically different hardware than where Boost.Compute itself was compiled). It also allows us to automatically tune algorithm parameters for the actual hardware present at run-time (and to execute current algorithms as efficiently as possible on future hardware platforms by re-tuning and scaling up parameters, all without any recompilation). It also allows us to generate fully specialized kernels at run-time based on dynamic input or user configuration (imagine user-created filter pipelines in Photoshop or custom database queries in PGSQL).

I think this flexibility is well worth the added complexity, and it fits naturally with OpenCL's JIT-like programming model.
I could see that from the code, yes. But nothing should prevent doing that while still writing the original OpenCL source code (or skeletons/templates) in separate files rather than C strings.
Has separate compilation been considered? Put the OpenCL code into .cl files, and let the build system do whatever is needed to transform them into a form that can be executed.
Compiling programs to binaries and then later loading them from disk is supported by Boost.Compute (and is in fact used to implement the offline kernel caching infrastructure). However, for the reasons I mentioned before, this mode is not used exclusively in Boost.Compute and the algorithms are mainly implemented in terms of the run-time program creation and compilation model.
I didn't necessarily mean compiling OpenCL to SPIR (if that's indeed what you mean by binary). You could just make the build system automatically generate the C string from a .cl file, for example.
Another concern is that Boost.Compute is a header-only library and doesn't control the build system or how the library will be loaded. This limits our ability to pre-compile certain programs and "install" them for later use by the library.
As it is, you're probably getting some bloat for the sole reason that you're getting a copy of all your strings in every TU, in particular the radix sort kernel. It makes more sense for it to be a library IMHO.

There is a tendency for people to prefer header-only designs because they facilitate deployment (no need to build a library with compatible settings separately), but I do not think someone should go for header-only just for that reason.
That said, I am very interested in exploring methods for integrating OpenCL source files built by the build tool-chain and making loading and executing them seamless with the rest of Boost.Compute. One approach I have for this is an "extern_function<>" class which works like "boost::compute::function<>", but instead of being specified with a string at run-time, its object code is loaded from a pre-compiled OpenCL binary on disk. I've also been exploring a clang-plugin-based approach to simplify embedding OpenCL code in C++ and using it together with the Boost.Compute algorithms.
I do not know what you have in mind with your clang development, but I assumed your library was sticking to oldish standard OpenCL for compatibility with a wide variety of devices and older toolchains. There are already some compiler projects that can generate hybrid CPU and GPU code from a single source, turning functions into GPU kernels as needed: C++AMP does it, CUDA does it too somewhat, and now there is SYCL, a recent addition to the OpenCL standards that was presented at SC14, which should become the best solution for this.
On Tue, Dec 23, 2014 at 5:46 PM, Mathias Gaunard wrote:
On 23/12/2014 20:21, Kyle Lutz wrote:
While yes, it does make developing Boost.Compute itself a bit more complex, it also gives us much greater flexibility.
For instance, we can dynamically build programs at run-time by combining algorithmic skeletons (such as reduce or scan) with custom user-defined reduction functions and produce optimized kernels for the actual platform that executes the code (which in fact can be dramatically different hardware than where Boost.Compute itself was compiled). It also allows us to automatically tune algorithm parameters for the actual hardware present at run-time (and to execute current algorithms as efficiently as possible on future hardware platforms by re-tuning and scaling up parameters, all without any recompilation). It also allows us to generate fully specialized kernels at run-time based on dynamic input or user configuration (imagine user-created filter pipelines in Photoshop or custom database queries in PGSQL).

I think this flexibility is well worth the added complexity, and it fits naturally with OpenCL's JIT-like programming model.
I could see that from the code, yes. But nothing should prevent doing that while still writing the original OpenCL source code (or skeletons/templates) in separate files rather than C strings.
Has separate compilation been considered? Put the OpenCL code into .cl files, and let the build system do whatever is needed to transform them into a form that can be executed.
Compiling programs to binaries and then later loading them from disk is supported by Boost.Compute (and is in fact used to implement the offline kernel caching infrastructure). However, for the reasons I mentioned before, this mode is not used exclusively in Boost.Compute and the algorithms are mainly implemented in terms of the run-time program creation and compilation model.
I didn't necessarily mean compiling OpenCL to SPIR (if that's indeed what you mean by binary).
Well, to be clear, OpenCL provides two mechanisms for creating programs, one from source strings with clCreateProgramWithSource() and one from binary blobs with clCreateProgramWithBinary(). Binaries are either in a vendor-specific format, or in the SPIR form for platforms that support it (which essentially attempts to be a "vendor-neutral" binary representation).
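For reference, an untested sketch of those two mechanisms in plain OpenCL host code (standard API calls; error handling omitted):

    #include <CL/cl.h>

    /* 1. from source: the driver JIT-compiles the string at run-time */
    cl_program create_from_source(cl_context ctx, cl_device_id dev,
                                  const char *src)
    {
        cl_int err;
        cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(program, 1, &dev, "", NULL, NULL);
        return program;
    }

    /* 2. from a binary blob (vendor-specific format or SPIR) */
    cl_program create_from_binary(cl_context ctx, cl_device_id dev,
                                  const unsigned char *bin, size_t bin_size)
    {
        cl_int err, status;
        cl_program program = clCreateProgramWithBinary(ctx, 1, &dev, &bin_size,
                                                       &bin, &status, &err);
        clBuildProgram(program, 1, &dev, "", NULL, NULL);
        return program;
    }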
You could just make the build system automatically generate the C string from a .cl file, for example.
Boost.Compute has no "build system"; it is merely a set of header files. If, in the future, we move away from a header-only implementation, we could certainly do something like this.
Another concern is that Boost.Compute is a header-only library and doesn't control the build system or how the library will be loaded. This limits our ability to pre-compile certain programs and "install" them for later use by the library.
As it is, you're probably getting some bloat for the sole reason that you're getting a copy of all your strings in every TU, in particular the radix sort kernel. It makes more sense for it to be a library IMHO.
There is a tendency for people to prefer header-only designs because they facilitate deployment (no need to build a library with compatible settings separately), but I do not think someone should go for header-only just for that reason.
These are good points. In the future we may move away from a header-only implementation if it proves to be too big of a hindrance.
That said, I am very interested in exploring methods for integrating OpenCL source files built by the build tool-chain and making loading and executing them seamless with the rest of Boost.Compute. One approach I have for this is an "extern_function<>" class which works like "boost::compute::function<>", but instead of being specified with a string at run-time, its object code is loaded from a pre-compiled OpenCL binary on disk. I've also been exploring a clang-plugin-based approach to simplify embedding OpenCL code in C++ and using it together with the Boost.Compute algorithms.
I do not know what you have in mind with your clang development, but I assumed your library was sticking to oldish standard OpenCL for compatibility with a wide variety of devices and older toolchains.
There are already some compiler projects that can generate hybrid CPU and GPU code from a single source, turning functions into GPU kernels as needed: C++AMP does it, CUDA does it too somewhat, and now there is SYCL, a recent addition to the OpenCL standards that was presented at SC14, which should become the best solution for this.
Yes, I am well aware of these projects. However, one of my goals for Boost.Compute was to provide a GPGPU library which required no special compiler or compiler extensions (as CUDA, C++AMP, SYCL, OpenACC, etc. all do). My aim is to provide a portable parallel programming library in C++ which supports the widest range of available platforms and compilers, and I feel OpenCL fills this role very well (also see the "Why OpenCL?" section [1] in the documentation for more on my rationale for this choice).

-kyle

[1] http://kylelutz.github.io/compute/boost_compute/design.html#boost_compute.de...
I am a bit late to the party, but we faced the same problem with our library, ArrayFire. The solution we came up with is the following:

- The kernels are written as .cl files and are part of the repository.
- During the build process, the kernels in .cl files are converted to strings in *new* .hpp files.
- The auto-generated kernel headers are the files that are included when trying to compile the said kernel.

This allowed us to iterate quickly when writing OpenCL code. This could work with Boost.Compute and also keep it a header-only library. The only downside is that users will not be able to point to the source repo directly. They will have to do a "make install" which converts the kernels in .cl files to strings in .hpp files.
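For concreteness, the transformation amounts to something like this (file and variable names are made up for the example):

    /// kernels/copy.cl -- what the developer edits
    __kernel void copy_kernel(__global const float *in, __global float *out)
    {
        const size_t i = get_global_id(0);
        out[i] = in[i];
    }

    /// kernels/copy.hpp -- auto-generated from copy.cl at build time
    const char copy_kernel_source[] =
        "__kernel void copy_kernel(__global const float *in, __global float *out)\n"
        "{\n"
        "    const size_t i = get_global_id(0);\n"
        "    out[i] = in[i];\n"
        "}\n";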
On Tue, Dec 23, 2014 at 9:57 PM, Kyle Lutz wrote:

On Tue, Dec 23, 2014 at 5:46 PM, Mathias Gaunard wrote:

On 23/12/2014 20:21, Kyle Lutz wrote:

While yes, it does make developing Boost.Compute itself a bit more complex, it also gives us much greater flexibility.

For instance, we can dynamically build programs at run-time by combining algorithmic skeletons (such as reduce or scan) with custom user-defined reduction functions and produce optimized kernels for the actual platform that executes the code (which in fact can be dramatically different hardware than where Boost.Compute itself was compiled). It also allows us to automatically tune algorithm parameters for the actual hardware present at run-time (and to execute current algorithms as efficiently as possible on future hardware platforms by re-tuning and scaling up parameters, all without any recompilation). It also allows us to generate fully specialized kernels at run-time based on dynamic input or user configuration (imagine user-created filter pipelines in Photoshop or custom database queries in PGSQL).

I think this flexibility is well worth the added complexity, and it fits naturally with OpenCL's JIT-like programming model.

I could see that from the code, yes. But nothing should prevent doing that while still writing the original OpenCL source code (or skeletons/templates) in separate files rather than C strings.
Has separate compilation been considered? Put the OpenCL code into .cl files, and let the build system do whatever is needed to transform them into a form that can be executed.
Compiling programs to binaries and then later loading them from disk is supported by Boost.Compute (and is in fact used to implement the offline kernel caching infrastructure). However, for the reasons I mentioned before, this mode is not used exclusively in Boost.Compute and the algorithms are mainly implemented in terms of the run-time program creation and compilation model.
I didn't necessarily mean compiling OpenCL to SPIR (if that's indeed what you mean by binary).
Well, to be clear, OpenCL provides two mechanisms for creating programs, one from source strings with clCreateProgramWithSource() and one from binary blobs with clCreateProgramWithBinary(). Binaries are either in a vendor-specific format, or in the SPIR form for platforms that support it (which essentially attempts to be a "vendor-neutral" binary representation).
You could just make the build system automatically generate the C string from a .cl file, for example.
Boost.Compute has no "build system"; it is merely a set of header files. If, in the future, we move away from a header-only implementation, we could certainly do something like this.
Another concern is that Boost.Compute is a header-only library and doesn't control the build system or how the library will be loaded. This limits our ability to pre-compile certain programs and "install" them for later use by the library.
As it is, you're probably getting some bloat for the sole reason that you're getting a copy of all your strings in every TU, in particular the radix sort kernel. It makes more sense for it to be a library IMHO.
There is a tendency for people to prefer header-only designs because they facilitate deployment (no need to build a library with compatible settings separately), but I do not think someone should go for header-only just for that reason.
These are good points. In the future we may move away from a header-only implementation if it proves to be too big of a hindrance.
That said, I am very interested in exploring methods for integrating OpenCL source files built by the build tool-chain and making loading and executing them seamless with the rest of Boost.Compute. One approach I have for this is an "extern_function<>" class which works like "boost::compute::function<>", but instead of being specified with a string at run-time, its object code is loaded from a pre-compiled OpenCL binary on disk. I've also been exploring a clang-plugin-based approach to simplify embedding OpenCL code in C++ and using it together with the Boost.Compute algorithms.
I do not know what you have in mind with your clang development, but I assumed your library was sticking to oldish standard OpenCL for compatibility with a wide variety of devices and older toolchains.
There are already some compiler projects that can generate hybrid CPU and GPU code from a single source, turning functions into GPU kernels as needed: C++AMP does it, CUDA does it too somewhat, and now there is SYCL, a recent addition to the OpenCL standards that was presented at SC14, which should become the best solution for this.
Yes, I am well aware of these projects. However, one of my goals for Boost.Compute was to provide a GPGPU library which required no special compiler or compiler extensions (as CUDA, C++AMP, SYCL, OpenACC, etc... all do). My aim is to provide a portable parallel programming library in C++ which supports the widest range of available platforms and compilers and I feel OpenCL fills this role very well (also see the "Why OpenCL?" section [1] in the documentation for more on my rationale for this choice).
-kyle
[1] http://kylelutz.github.io/compute/boost_compute/design.html#boost_compute.de...
2014-12-24 23:15 GMT+04:00 Pavan Yalamanchili wrote:
I am a bit late to the party, but we faced the same problem with our library, ArrayFire.
The solution we came up with is the following.
- The kernels are written as .cl files and are part of the repository.
- During the build process, the kernels in .cl files are converted to strings in *new* .hpp files.
- The auto-generated kernel headers are the files that are included when trying to compile the said kernel.
There is a possible hack for that case! You'll need two helper header files `import.pp` and `end_import.pp`. Something like the following could work (not tested).

    /// import.pp
    #define TO_STRING(...) #__VA_ARGS__

    /// end_import.pp
    #undef TO_STRING
    #undef IMPORT_AS

Now you'll need to write kernels like this:

    /// kernel.cl
    #ifdef IMPORT_AS
    char IMPORT_AS[] = TO_STRING(
    #endif

    // Code goes here

    #ifdef IMPORT_AS
    ); // not sure that this will work
    #endif

That's it. Now if you need that kernel as a string, you just write the following:

    #define IMPORT_AS variable_name
    #include "import.pp"
    #include "kernel.cl"
    #include "end_import.pp"

--
Best regards,
Antony Polukhin
On Thu, Dec 25, 2014 at 1:58 PM, Antony Polukhin wrote:
There is a possible hack for that case! You'll need two helper header files `import.pp` and `end_import.pp`. Something like the following could work (not tested).
[snip]
That's it. Now if you need that kernel as a string, you just write the following:
    #define IMPORT_AS variable_name
    #include "import.pp"
    #include "kernel.cl"
    #include "end_import.pp"
I think raw string literals could make it cleaner. The delimiters have to appear literally in the .cl file, since the preprocessor cannot paste together partial string literals:

    /// kernel.cl
    R"opencl(
    <code here>
    )opencl"

    /// C++ side
    const char cl_program[] =
    #include "kernel.cl"
    ;

I wonder how debugging goes in OpenCL though. If the kernel does not compile or is not working, you probably get pointers to the kernel string, not the C++ source. Is there a convenient way to translate between the two?
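For the compile-error half of that question: the driver's diagnostics can at least be retrieved with the standard clGetProgramBuildInfo call, although its line numbers refer to the kernel string, not the enclosing C++ file. An untested sketch:

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* print the OpenCL compiler output for a program that failed to build */
    void print_build_log(cl_program program, cl_device_id device)
    {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              log_size, log, NULL);
        fprintf(stderr, "%s\n", log);
        free(log);
    }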
participants (7):
- Andrey Semashev
- Antony Polukhin
- Asbjørn
- Kyle Lutz
- Mathias Gaunard
- Niall Douglas
- Pavan Yalamanchili