On 20/12/2014 09:32, Kyle Lutz wrote:
Can you give examples of where the OpenCL wrapper types in Boost.Compute are lacking?
I missed any interface to cl::... (Khronos C++ wrapping API)
One particular issue that makes me hesitant is the lack of OpenCL 2.0 support in the "official" C++ bindings. The OpenCL 2.0 specification was released over a year ago (November 2013). The first public OpenCL 2.0 implementation was released by Intel three months ago (September 2014, followed closely by an AMD implementation).
It is correct that Khronos presently only offers a C++ API for OpenCL 1.2, but in practice the ways more serious, limiting constraint is that Nvidia only offers a 1.1 _implementation_ anyway. Any portable OpenCL code must restrict itself to 1.1. We could discuss issues of the Khronos C++ bindings and your wrapper to no end, but my main point remains: The Khronos version is official and established; an alternative version intended for wider use must be clearly superior and give people strong reasons to migrate (I am not only referring to programming itself, but also people writing books on OpenCL, blogs etc.). In my eyes yours (that is core + utilities parts of your library) surely fails to meet that.
Could you let me know what parts of the core wrapper API fail to meet your bar for quality?
Just as one analysis example: For the plain C API clEnqueueReadBuffer (surely a commonly used API function) has this signature (we use it as reference): cl_int clEnqueueReadBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t size, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event); The Khronos C++ bindings version is: cl_int cl::CommandQueue::enqueueReadBuffer(const Buffer& buffer, cl_bool blocking_read, ::size_t offset, ::size_t size, const void * ptr, const VECTOR_CLASS<Event> * events = NULL, Event * event = NULL); And your "main" version is: void enqueue_read_buffer(const buffer & buffer, size_t offset, size_t size, void * host_ptr, const wait_list & events = wait_list()); The Khronos C++ version matches in arguments logic and order fully the C version, and also returns a cl_int as error code - they correspond to each other, easy to use one or the other. Your version a) misses the cl_bool for blocking/non-blocking, b) misses the last event argument, and c) throws (I suppose) instead of returning an error code. Let's go through these: a) blocking: If I remember my code inspection correctly your version is automatically always blocking, and a non-blocking version is (I guess?) given by enqueue_read_buffer_async, which has a different signature (return value) besides it's different name. So instead of being able to set a single cl_bool and pass it as standard argument I need to take care which function (name) to call, what it returns etc. Studying your library docs I find very little information, what makes them different etc.; specifically nowhere does it say that enqueue_read_buffer _is_ a blocking operation, it only says that it _enqueues_ a read command. Both functions then simply refer to clEnqueueReadBuffer() which does not help matters at all given the different signature. Now take a look at the Khronos C++ documentation and one encounters a ways, ways more detailed description. It's signature is already quite clear, and with the elaborate description it's really clear. b) why your version lacks the last event argument entirely, I have no idea. Anything that can be done with it in the C / C++ API (e.g. OpenCL runtime-level profiling; controlling an out-of-order command-queue / another commmand-queue) seems to be undoable. c) error handling: I'd much prefer some policy setting which specifies if an exception is thrown on error (the usual custom in C++) or an error code is returned by the function (the usual OpenCL behaviour). reviewer note: My original review said that the documentation is well done, but this did not refer to the parts core + utilities of which I had always been critical. Another, huge issue in practice, is reliability + stability. Not only does the OpenCL execution model by itself make committing errors relatively easy, but I guess many of us can tell stories of bugged OpenCL implementations having hit us deeply. When something goes wrong figuring out where/why exactly can be a really slow, frustrating experience. So with respect to this the bar for any additional layer, developed outside Khronos (who has the OpenCL implementers on board !) must also be set very very high. Let me reiterate that my main concern is not fixing the example above; my main point is that any alternative version must be rock-solid from ground up on the one hand, plus offer considerable benefits to warrant migration considerations. Incidentally, if there is no Khronos OpenCL 2.0 C++ wrapping out there yet, have you ever given it a thought of getting in touch with the Khronos group if they are interested in doing work together, merging efforts? Note for clarity: I am personally neither involved with the Khronos Groups, nor an OpenCL implementer, nor otherwise with distributing OpenCL [no book author etc.]; just a programmer using OpenCL.
On the other hand to those rather new to OpenCL a simplified, less error-prone design would be beneficial ...
I have put much consideration into this and ultimately I don't feel it is right for the data-copying to be made implicit and hidden away from the user. ...> In any case, there is a simplified, high-level API available in Boost.Compute which allows kernels to directly operate on host memory. See the mapped_view class [2].
mapped_view uses a direct host-memory mapped buffer scheme, through the CL_MEM_USE_HOST_PTR flag (again I had to go into the implementation code to assert that initial suspicion - the docs don't say it, and you know other implementation schemes would have also been possible, e.g. it could have used a CL_MEM_ALLOC_HOST_PTR). CL_MEM_USE_HOST_PTR is only very exceptionally useful in practice. However I fully agree with you that users should not be forced to use exclusively the one or the other. My proposal aimed at offering a variety of high-level abstractions from which users can choose: -) if an input or output to a Boost.compute algorithm call is a plain host side object (e.g. a std:: container), automatically copy the data to the OpenCL runtime (if input), run the kernel, and automatically copy the data from the OpenCL runtime to the host (if output). -) if an input or output to an algorithm call is a plain OpenCL object (e.g. cl_mem or some C++ class wrapping it) the user will need to take care of any copying to host memory. -) if an input or output to an algorithm call is something which links a plain host object (e.g. std:: container) to an OpenCL object (e.g. cl_mem) - I have sketched such a class in my original post - ensure that upon any access to either the data become automatically synchronized. One can imagine additional / mixed schemes as well, temporarily disable default behaviour etc.
These are definitely ideas I've thought about and these kinds of tools could all be built upon the current API...
...
See my response above, I'm not a huge fan of these automatic data transfer abstractions. However, I could be in favor of offering this functionality in a higher-level API.
My recommendation and reviewer comment is: Your library shall offer high(er)-level abstractions of things NOT already offered by Khronos, by building on top of the Khronos APIs (incl. their C++ wrapper). Your STL-ish algorithms and their supporting functionality are one aspect here; the stuff just discussed another example: those who want low-level control already have it and can use it (through the Khronos APIs), while those who desire an "auto-synchronization" between host and OpenCL runtime presently do NOT have anything. Here the real benefit would come in. 2 other examples that come to mind (real-world applications from my work): -) smoothly support utilizing multiple devices (e.g. auto-splitting kernel workload across devices and data synchronization among devices; details will depend quite whether devices do or do not belong to the same context). -) specifying "operations-sequences", like: "Copy input data A to OpenCL, copy input data B, run kernel 1, 2, an intermediate result decides if next run kernel 3 or 4, and then copy data C to host". Such pre-specified sequences implicitly also help to improve performance (e.g. only the last command must be non-blocking).
3) for float/double input containers compute::accumulate falls back to a plain serial reduction.
Because floating-point addition is not associative. Doing this would lead to accumulate() producing different results on the device versus the host.
This argument is not clear to me. Floating-point results must, generally speaking, always be expected to deviate slightly between a plain C++ and OpenCL execution (subject to the specific device hardware + driver version + compilation settings used; for pretty much any kernel). I'd say that when someone defers any floating-point calculation to OpenCL then this is implicitly acknowledged, at the benefit of a parallelized execution.
True, the run-time compilation model provided by OpenCL does have some associated overhead. There a few techniques in Boost.Compute which help mitigate this ... The other is support for offline-caching of program binaries. This is enabled by defining "BOOST_COMPUTE_USE_OFFLINE_CACHE" and causes Boost.Compute to cache binaries for programs so that they are only compiled once the very first time they are run on the system.
Has this been well-tested, including scenarios such as OpenCL driver version changes between runs and how it interacts with Nvidia's own auto-caching mechanism (which can drive one nuts when it fails to detect changes and happily uses an outdated binary).
I would find it very useful if smart algorithms dispatch the algorithm to a plain C++ algorithm if it's really predictable that a GPU execution will just waste time.
I disagree, I think the call on whether to execute the algorithm on the host or on an accelerator (with its associated overhead) should be left up to the user (or potentially a higher-level API built on top of Boost.Compute)...While I agree that this would a useful feature, I just don't think Boost.Compute is the right place for that logic to live.
It is meant as another example of a higher-level abstraction that can be done; in my opinion Boost.Compute would be the perfect place to put it in. There is no need to implement a number of the suggested higher-level abstractions right away (or for some: ever), it shall help finding the right niche for your library (now and in the long run). As such the recommendations / critics raised here are by no means intended to be destructive, I just bring in my view how I see your library best serving the community. cheers, Thomas