Re: [boost] [compute] Review period starts today December 15, 2014, ends on December 24, 2014

21 Dec 2014

      On 20/12/2014 09:32, Kyle Lutz wrote:
...
Can you give examples of where the OpenCL wrapper types in
Boost.Compute are lacking?
I missed any interface to cl::... (Khronos C++ wrapping API)
...
One particular issue that makes me hesitant is the lack of OpenCL 2.0
support in the "official" C++ bindings. The OpenCL 2.0 specification
was released over a year ago (November 2013). The first public OpenCL
2.0 implementation was released by Intel three months ago (September
2014, followed closely by an AMD implementation).
It is correct that Khronos presently only offers a C++ API for OpenCL 
1.2, but in practice the ways more serious, limiting constraint is that 
Nvidia only offers a 1.1 _implementation_ anyway. Any portable OpenCL 
code must restrict itself to 1.1.

We could discuss issues of the Khronos C++ bindings and your wrapper to 
no end, but my main point remains: The Khronos version is official and 
established; an alternative version intended for wider use must be 
clearly superior and give people strong reasons to migrate (I am not 
only referring to programming itself, but also people writing books on 
OpenCL, blogs etc.). In my eyes yours (that is core + utilities parts of 
your library) surely fails to meet that.
...
Could you let me know what parts of the core wrapper API fail to meet
your bar for quality?
Just as one analysis example:

For the plain C API clEnqueueReadBuffer (surely a commonly used API 
function) has this signature (we use it as reference):

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_read,
                            size_t offset,
                            size_t size,
                            void *ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event *event_wait_list,
                            cl_event *event);

The Khronos C++ bindings version is:

cl_int cl::CommandQueue::enqueueReadBuffer(const Buffer& buffer,
                                            cl_bool blocking_read,
                                            ::size_t offset,
                                            ::size_t size,
                                            const void * ptr,
                               const VECTOR_CLASS<Event> * events = NULL,
                               Event * event = NULL);

And your "main" version is:

void enqueue_read_buffer(const buffer & buffer,
                          size_t offset,
                          size_t size,
                          void * host_ptr,
                          const wait_list & events = wait_list());

The Khronos C++ version matches in arguments logic and order fully the C 
version, and also returns a cl_int as error code - they correspond to 
each other, easy to use one or the other. Your version a) misses the 
cl_bool for blocking/non-blocking, b) misses the last event argument, 
and c) throws (I suppose) instead of returning an error code. Let's go 
through these:
a) blocking: If I remember my code inspection correctly your version is 
automatically always blocking, and a non-blocking version is (I guess?) 
given by enqueue_read_buffer_async, which has a different signature 
(return value) besides it's different name. So instead of being able to 
set a single cl_bool and pass it as standard argument I need to take 
care which function (name) to call, what it returns etc. Studying your 
library docs I find very little information, what makes them different 
etc.; specifically nowhere does it say that enqueue_read_buffer _is_ a 
blocking operation, it only says that it _enqueues_ a read command. Both 
functions then simply refer to clEnqueueReadBuffer() which does not help 
matters at all given the different signature.
Now take a look at the Khronos C++ documentation and one encounters a 
ways, ways more detailed description. It's signature is already quite 
clear, and with the elaborate description it's really clear.
b) why your version lacks the last event argument entirely, I have no 
idea. Anything that can be done with it in the C / C++ API (e.g. OpenCL 
runtime-level profiling; controlling an out-of-order command-queue / 
another commmand-queue) seems to be undoable.
c) error handling: I'd much prefer some policy setting which specifies 
if an exception is thrown on error (the usual custom in C++) or an error 
code is returned by the function (the usual OpenCL behaviour).

reviewer note: My original review said that the documentation is well 
done, but this did not refer to the parts core + utilities of which I 
had always been critical.

Another, huge issue in practice, is reliability + stability. Not only 
does the OpenCL execution model by itself make committing errors 
relatively easy, but I guess many of us can tell stories of bugged 
OpenCL implementations having hit us deeply. When something goes wrong 
figuring out where/why exactly can be a really slow, frustrating 
experience. So with respect to this the bar for any additional layer, 
developed outside Khronos (who has the OpenCL implementers on board !) 
must also be set very very high.

Let me reiterate that my main concern is not fixing the example above; 
my main point is that any alternative version must be rock-solid from 
ground up on the one hand, plus offer considerable benefits to warrant 
migration considerations.
Incidentally, if there is no Khronos OpenCL 2.0 C++ wrapping out there 
yet, have you ever given it a thought of getting in touch with the 
Khronos group if they are interested in doing work together, merging 
efforts?

Note for clarity: I am personally neither involved with the Khronos 
Groups, nor an OpenCL implementer, nor otherwise with distributing 
OpenCL [no book author etc.]; just a programmer using OpenCL.
...
...
On the other hand to those rather new to OpenCL a simplified, less
error-prone design would be beneficial ...
I have put much consideration into this and ultimately I don't feel it
is right for the data-copying to be made implicit and hidden away from
the user. ...>
In any case, there is a simplified, high-level API available in
Boost.Compute which allows kernels to directly operate on host memory.
See the mapped_view class [2].
mapped_view uses a direct host-memory mapped buffer scheme, through the 
CL_MEM_USE_HOST_PTR flag (again I had to go into the implementation code 
to assert that initial suspicion - the docs don't say it, and you know 
other implementation schemes would have also been possible, e.g. it 
could have used a CL_MEM_ALLOC_HOST_PTR). CL_MEM_USE_HOST_PTR is only 
very exceptionally useful in practice.

However I fully agree with you that users should not be forced to use 
exclusively the one or the other. My proposal aimed at offering a 
variety of high-level abstractions from which users can choose:

-) if an input or output to a Boost.compute algorithm call is a plain 
host side object (e.g. a std:: container), automatically copy the data 
to the OpenCL runtime (if input), run the kernel, and automatically copy 
the data from the OpenCL runtime to the host (if output).

-) if an input or output to an algorithm call is a plain OpenCL object 
(e.g. cl_mem or some C++ class wrapping it) the user will need to take 
care of any copying to host memory.

-) if an input or output to an algorithm call is something which links a 
plain host object (e.g. std:: container) to an OpenCL object (e.g. 
cl_mem) - I have sketched such a class in my original post - ensure that 
upon any access to either the data become automatically synchronized.

One can imagine additional / mixed schemes as well, temporarily disable 
default behaviour etc.
...
These are definitely ideas I've thought about and these kinds of tools
could all be built upon the current API...
...
...
See my response above, I'm not a huge fan of these automatic data
transfer abstractions. However, I could be in favor of offering this
functionality in a higher-level API.
My recommendation and reviewer comment is: Your library shall offer 
high(er)-level abstractions of things NOT already offered by Khronos, by 
building on top of the Khronos APIs (incl. their C++ wrapper). Your 
STL-ish algorithms and their supporting functionality are one aspect 
here; the stuff just discussed another example: those who want low-level 
control already have it and can use it (through the Khronos APIs), while 
those who desire an "auto-synchronization" between host and OpenCL 
runtime presently do NOT have anything. Here the real benefit would come in.

2 other examples that come to mind (real-world applications from my work):

-) smoothly support utilizing multiple devices (e.g. auto-splitting 
kernel workload across devices and data synchronization among devices; 
details will depend quite whether devices do or do not belong to the 
same context).

-) specifying "operations-sequences", like: "Copy input data A to 
OpenCL, copy input data B, run kernel 1, 2, an intermediate result 
decides if next run kernel 3 or 4, and then copy data C to host". Such 
pre-specified sequences implicitly also help to improve performance 
(e.g. only the last command must be non-blocking).
...
...
3) for float/double input containers compute::accumulate falls back to a
plain serial reduction.
Because floating-point addition is not associative. Doing this would
lead to accumulate() producing different results on the device versus
the host.
This argument is not clear to me. Floating-point results must, generally 
speaking, always be expected to deviate slightly between a plain C++ and 
OpenCL execution (subject to the specific device hardware + driver 
version + compilation settings used; for pretty much any kernel). I'd 
say that when someone defers any floating-point calculation to OpenCL 
then this is implicitly acknowledged, at the benefit of a parallelized 
execution.
...
True, the run-time compilation model provided by OpenCL does have some
associated overhead. There a few techniques in Boost.Compute which
help mitigate this ... The other is support for offline-caching of program
binaries. This is enabled by defining
"BOOST_COMPUTE_USE_OFFLINE_CACHE" and causes Boost.Compute to cache
binaries for programs so that they are only compiled once the very
first time they are run on the system.
Has this been well-tested, including scenarios such as OpenCL driver 
version changes between runs and how it interacts with Nvidia's own 
auto-caching mechanism (which can drive one nuts when it fails to detect 
changes and happily uses an outdated binary).
...
...
I would find it very useful if smart algorithms dispatch the algorithm to a
plain C++ algorithm if it's really predictable that a GPU execution will
just waste time.
I disagree, I think the call on whether to execute the algorithm on
the host or on an accelerator (with its associated overhead) should be
left up to the user (or potentially a higher-level API built on top of
Boost.Compute)...While I agree that this would a useful feature, I just
don't think Boost.Compute is the right place for that logic to live.
It is meant as another example of a higher-level abstraction that can be 
done; in my opinion Boost.Compute would be the perfect place to put it in.

There is no need to implement a number of the suggested higher-level 
abstractions right away (or for some: ever), it shall help finding the 
right niche for your library (now and in the long run). As such the 
recommendations / critics raised here are by no means intended to be 
destructive, I just bring in my view how I see your library best serving 
the community.

cheers,
Thomas

Re: [boost] [compute] Review period starts today December 15, 2014, ends on December 24, 2014

Thomas M