Le 30/12/14 20:08, Kyle Lutz a écrit :
On Tue, Dec 30, 2014 at 7:15 AM, Vicente J. Botet Escriba
wrote: Hi,
I have no time to do a complete review. Anyway here it is my minimal review
*1. What is your evaluation of the design?** *** I would prefer that the algorithms take as first parameter the queue (which could be abstracted to an asynchronous executor). Any particular reason to prefer the first argument rather than the last? Currently, by using the last argument to specify the command queue, we can use a default argument of "system::default_queue()". For the common case of only wanting to use the default compute device on the system, all algorithms will work correctly and the user never has to bother with finding a device, setting up a context, creating a command queue and passing it to the algorithm function. One of the reasons is to allow for algorithms that take a variadic number of arguments. I don't know if you have one on your library, but I find this a uniform pattern. Second, I consider that the user must be explicit about which queue (devices) is used as AFAIK this is an optimization library.
The performances let think that there is no interest in using the library for containers having less than 10^4/10^5 elements. In general yes, GPUs/Accelerators are ill-suited for small problem sizes (unless the computation performed on each element is quite expensive). Sorting 1000 integers will nearly always be faster on the host rather than the device.
Even if this is considered a low level library, the documentation should show how a library writer could make use of the provided interface to provide a higher one. I agree, there should be more documentation for this. Could you let me know what sort of higher-level interface you'd like to see demonstrated? The one of the standard Parallel TS :), or a pipeline framework.
I would prefer that the library make use of std::future or boost::future class instead of adding an additional one. Do the author tried to make use of the existing futures? if yes, which problems were found? The implementation of the boost::compute::future type is fairly different than that of the std::/boost::future types. Essentially, there is no operation akin to "promise.set_value()" for boost::compute::future. Instead, it is a simple wrapper around an OpenCL event object which can be used to wait for an operation on the device to complete (via calling the blocking clWaitForEvents() function from the OpenCL API). Please let me know if there are ways to better integrate with boost::future.
I understand this seems to don't match with the std/future/promise design. Note that the futures returned by std::async, std::make_ready_future, future<>::then, when_all, when_any, don't show neither an associated promise (as it is hidden). I don't know yet how this could be done, but I find it an important use case to be solved. I'm not (by principle) against a multiplication of future types, but this means that we would need to duplicate the whole future interface (then continuations, when_all, when _any, ...).
I'm missing a range interface to most of the algorithms. The algorithms API is meant to mimic the STL algorithms API. Adding a range-based interface is definitely possible, but is a non-trivial amount of work. Would you be interested in contributing this?
Agreed this is a lot of work. No I have not time to contribute to this. I was just wondering if an range interface, couldn't make it possible to compose the programs executed by the whole pipeline and make the cost of copies less significant.
I suspect that as this is a C++03 library it doesn't take advantages of some interesting C++11+/C++14 features. It does take advantage of some C++11 features (e.g. variadic functions) where possible. I would be very open to integrating more C++11/14 features where appropriate.
It is not clear which operations are synchronous, which ones could block, ... Could you be more specific about what operations you are referring to?
All the functions of the core classes must be explicit about thread-safety (synchronization). I suspect that almost all the algorithm block, and no parallelism (pipeline) is possible. Could you confirm?
I know there is already some documentation regarding this for the command_queue class as well as for the copy()/copy_async() functions. I'll work on adding more. Great.
In order to get the native OpenCl handle I suggest to name the function get_native_handle instead of get.
*2. What is your evaluation of the implementation?* I've not see at the implementation.
*3. What is your evaluation of the documentation?*
The documentation lacks of some core tutorials. I was unable to have an idea of what a context is, what can be done with a command_queue, what is a Kernel, a pipe, ...? Does it mean that the user must know OpenCL? The low-level API is a wrapper around OpenCL and so having some background knowledge there helps. I'll work on adding more documentation for these classes and how they fit together. Thanks.
The reference documentation should follow he C++ standard way, including pre-conditions, effects, synchronization, throws clauses. Fully agree, I'll continue to work on this. Great.
I would appreciate a description of the algorithms used, how the parallelism is used? is this completely managed by OpenCL or do the library something else? I'm not sure if I fully understand what you're asking here but I'll try to answer it.
OpenCL provides an API for executing code on compute devices (e.g. GPUs), Boost.Compute provides algorithms which, when called, use the OpenCL API to create programs, compile them for the device, and execute them.
Let me know if I can better explain this. I have take a look at the OpenCL documentation and I understand it better. Could it be possible to have a roughly pseudo code of the program that each one of the algorithms executes on the device?
*4. What is your evaluation of the potential usefulness of the library?** *** This is essentially an optimization library. It should be quite useful, but the performances results don't probe it. Could you clarify what you mean by "the performances results don't probe it"? Boost.Compute provides a number of benchmark programs which can be used to measure and evaluate the performance of the library on a particular system. Results from this are also published on the "Performance" page [1] in the documentation.
What a meant is the performances are often worst except when the size of the container is high enough. The scope of the library could be to be better only for these big containers, but I'm missing an interface that makes a choice between the Compute or the STL one.
*8. Do you think the library should be accepted as a Boost library? ** * As I have not taken too much time I'm unable to say that this library should not be included in Boost. So I vote to include it conditionally, subject to: * I would need some performances that probe the library is better than using standard STL containers in most of the cases.
Boost.Compute will never be universally better than the STL, and it was never intended to be. CPUs and GPUs make fundamentally different trade-offs which lead to very different performance characteristics. GPUs support performing computation on large amounts of data with very high throughput (in some cases 10x-100x faster than a CPU). The down-side to this is a larger overhead to get the GPU running as well as the cost of transferring data to device memory. Boost.Compute and the STL are different tools which each suit their own different use-cases. I don't think Boost.Compute should be required to supplant the STL in order to be an acceptable library.
I suggest you to make more evident the goal and scope of the library in the introduction of the library. Please, replace my requirement by this one.
* The documentation should be completed with more tutorials and a better reference documentation.
Fully agree, I'll definitely continue to work on improving the documentation.
Thanks for the review. Let me know if I can explain anything more clearly.
I will comeback if I find some time to read the documentation carefully. Good luck, Vicente