On Sun, Dec 21, 2014 at 2:24 PM, Asbjørn wrote:
On 21.12.2014 20:39, Kyle Lutz wrote:
On Sun, Dec 21, 2014 at 3:44 AM, Thomas M wrote:
Studying your library docs I find very little information on what makes them different; specifically, nowhere does it say that enqueue_read_buffer _is_ a blocking operation; it only says that it _enqueues_ a read command. Both functions then simply refer to clEnqueueReadBuffer(), which does not help matters at all given the different signatures.
Yes, I've split the blocking and non-blocking memory copy operations into separate functions. Personally, I've never been fond of APIs which drastically change behavior based on a single boolean flag. Also, this is more in line with the API provided by other libraries like Boost.ASIO (e.g. boost::asio::read() vs. boost::asio::async_read()).
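To make the split concrete, here is roughly what the two forms look like from the caller's side; this is a minimal sketch, and the exact signatures may differ slightly from the current headers:

    #include <vector>
    #include <boost/compute/core.hpp>

    namespace compute = boost::compute;

    void read_example(compute::command_queue &queue,
                      const compute::buffer &buf)
    {
        std::vector<float> host(buf.size() / sizeof(float));

        // Blocking form: returns only once the data has been copied to 'host'.
        queue.enqueue_read_buffer(buf, 0, buf.size(), host.data());

        // Non-blocking form: returns immediately with an event to wait on.
        compute::event e =
            queue.enqueue_read_buffer_async(buf, 0, buf.size(), host.data());
        e.wait();
    }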
As a library user I agree with the more explicit split between sync and async routines, a la ASIO, and I think Boost.Compute should follow this convention. However, I think in this case you should deviate from the OpenCL API names to make it clearer that things are different. Specifically, drop the "enqueue" word: simply have "read_buffer" and "read_buffer_async". For me the "enqueue" word just makes things more confusing.
Well, one concern with removing the word "enqueue" from the name is that the function actually *does* enqueue a command in the queue. For instance, if you queue up a couple of kernel launches (which are asynchronous) followed by a synchronous read (with enqueue_read_buffer()), the asynchronous operations in the queue will still be executed before the read operation (i.e. the normal FIFO queue behavior).
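A tiny sketch of what I mean, building on the previous snippet's includes and assuming an in-order queue with hypothetical, already-built kernels and buffers:

    // Both launches return immediately, but the queue executes commands in
    // order, so they complete before the read command runs; the "blocking"
    // read therefore still observes their results.
    void ordering_example(compute::command_queue &queue,
                          const compute::kernel &a,
                          const compute::kernel &b,
                          const compute::buffer &result,
                          size_t n, std::vector<float> &host)
    {
        queue.enqueue_1d_range_kernel(a, 0, n, 0);
        queue.enqueue_1d_range_kernel(b, 0, n, 0);
        queue.enqueue_read_buffer(result, 0, result.size(), host.data());
    }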
c) error handling: I'd much prefer some policy setting which specifies whether an exception is thrown on error (the usual custom in C++) or an error code is returned by the function (the usual OpenCL behaviour).
FWIW, again as a library user, I quite like ASIO's approach where each operation is overloaded to either fill an error_code or throw.
Having two separate APIs with different error handling semantics is definitely possible. While I'm fairly happy with the current exception-based error handling, implementing an approach like ASIO's wouldn't be that difficult.
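Something in ASIO's spirit could look like the following; these are purely hypothetical declarations for illustration, not the current Boost.Compute API:

    // Throwing overload: raises an exception (e.g. boost::compute::opencl_error)
    // if the underlying OpenCL call fails.
    void enqueue_read_buffer(const buffer &buf, size_t offset, size_t size,
                             void *host_ptr);

    // Non-throwing overload: reports the raw OpenCL status code instead.
    void enqueue_read_buffer(const buffer &buf, size_t offset, size_t size,
                             void *host_ptr, cl_int &error);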
This is also something I have played around with. Basically, I'd like to have an API which allows users to define "pipelines" or "task-graphs" which hook up several different kernels/algorithms/memory-copies and produce an efficient set of operations to stream data through and extract the results.
Any ideas you have on a potential API you'd like to see for this would be great. There is potentially some prior art in the C++ pipelines proposal [3] which may be interesting.
In my "just for fun" Delphi.Compute library (written in Delphi, inspired by Boost.Compute) I made Copy() and Transform() return futures of the output buffers, as well as accept futures as parameters. Note, Delphi doesn't have iterators like C++ so my routines operate directly on buffers.
So when Transform(), say, received a Future<Buffer> instead of a Buffer as a parameter, it would add the future's associated event to the wait list passed to clEnqueueNDRangeKernel (technically, the Buffer type has an implicit conversion operator to an "immediate" Future<Buffer>).
This made it pretty seamless to queue up everything and then just wait for the final read (the default "copy device buffer to host array and return it" call is blocking). The code looks sequential but would only block on that last read.
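In raw OpenCL terms the pattern is roughly the following; this is just a sketch with made-up kernel/buffer names, and only the final read blocks the host:

    #include <CL/cl.h>
    #include <vector>

    std::vector<float> run_pipeline(cl_command_queue queue,
                                    cl_kernel transform_kernel,
                                    cl_mem input, cl_mem output, size_t n)
    {
        clSetKernelArg(transform_kernel, 0, sizeof(cl_mem), &input);
        clSetKernelArg(transform_kernel, 1, sizeof(cl_mem), &output);

        // Enqueue the transform; its event acts as the "future" of 'output'.
        cl_event transform_done;
        clEnqueueNDRangeKernel(queue, transform_kernel, 1, nullptr, &n,
                               nullptr, 0, nullptr, &transform_done);

        // The final read puts that event in its wait list and blocks the host.
        std::vector<float> result(n);
        clEnqueueReadBuffer(queue, output, CL_TRUE, 0, n * sizeof(float),
                            result.data(), 1, &transform_done, nullptr);

        clReleaseEvent(transform_done);
        return result;
    }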
I'm sure there are better ways, just thought I'd share.
That sounds very cool. I'd definitely be interested in exploring an API like that. I've also been keeping my eye on the papers from the C++ concurrency working group (for instance, N3857). It would be good to align work in this direction for Boost.Compute with the proposed standards (where feasible).
Strongly disagree; the floating-point operations on the device are well defined and their output should be identical to the host results (barring optimizations like "-cl-fast-relaxed-math").
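For reference, such relaxed-math optimizations have to be requested explicitly at program build time; a minimal Boost.Compute sketch, assuming 'source' and 'context' already exist:

    // Without the flag the implementation should follow the standard OpenCL
    // floating-point requirements; with it, precision may be traded for speed.
    boost::compute::program prog =
        boost::compute::program::create_with_source(source, context);
    prog.build("-cl-fast-relaxed-math"); // omit the flag for strict behaviour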
While I agree, I've found Intel's OpenCL CPU device to return results which make me think it uses some relaxed math regardless. With NVIDIA and AMD I can get (essentially) the same results as reference CPU calculations, but with Intel I sometimes get quite large discrepancies. Of course, it's possible I'm just doing it wrong...
Interesting, could you provide a test case to reproduce this? In my testing (on Intel and others) I haven't found any problems with the code I supplied (though I've heard there may be precision issues with complex operations involving transcendental functions/sqrt()/etc. on some implementations). Thanks for your feedback! -kyle