
On 09/18/2012 06:28 PM, Kyle Lutz wrote:
*** Why not target CUDA and/or support multiple back-ends? ***
CUDA and OpenCL are two very different technologies. OpenCL works by compiling C99 kernel source at run-time into kernel objects which can then be executed on the GPU. CUDA, on the other hand, compiles its kernels ahead of time with a special compiler (nvcc), which produces binaries that can be executed on the GPU.
The company I work at has technology to generate both CUDA (at compile time) and OpenCL (at run time) kernels from expression templates. At the moment we support element-wise operations, global and partial reductions across all dimensions, as well as partial scans across all dimensions. Element-wise function combinations can be fused into a single reduction or scan kernel. Everything is automatically streamed and retrieved as needed, and data is cached on the device when possible, with a runtime deciding how much memory and how many compute resources to allocate for each computation depending on the device capabilities. Therefore, I do not think supporting both CUDA and OpenCL is an intractable problem. People want CUDA for a simple reason: CUDA is still faster than equivalent OpenCL on NVIDIA hardware. I think, however, that automatic kernel generation is a problem of its own, and should be clearly separated from the distribution and memory-handling logic.
This is the reason why I wrote the Boost.Compute lambda library. Basically, it takes C++ lambda expressions (e.g. _1 * sqrt(_1) + 4) and transforms them into C99 source code fragments (e.g. "input[i] * sqrt(input[i]) + 4") which are then passed to the Boost.Compute STL-style algorithms for execution. While not perfect, it allows the user to write code closer to C++ that can still be executed through OpenCL.
From your description, it looks like you've reinvented the wheel there, causing needless limitations and interoperability problems for users. The same could have been done by serializing arbitrary Proto transforms to C99, with extension points for custom tags. With CUDA you would have hit the problem that the Proto functions are not marked __device__, but with OpenCL it doesn't matter.