On Sun, Dec 28, 2014 at 1:54 AM, Gruenke, Matt wrote:
1. What is your evaluation of the design?
I agree with other comments made about synchronization. The design should be more explicit about what's asynchronous, and ideally utilize only one mechanism for synchronization. One approach that would have made this clearer is if high-level operations were actually objects that got posted to the command_queue.
As I mentioned before, there is only one mechanism for asynchrony in Boost.Compute: the command queue abstraction provided by OpenCL. Operations are enqueued for execution on the compute device, and this occurs asynchronously with respect to code executing on the host. The exceptions to this rule are functions which interact directly with host memory; these are blocking by default and offer explicit "_async()" versions (in order to prevent potential race conditions on host memory). I am hesitant to make the command_queue interface more high-level, but this sort of functionality could certainly be built on top of the current interfaces in Boost.Compute. See some of the previous reviews for discussion of this.
Regarding synchronization, I'm also a bit concerned about the performance impact of synchronizing on all copies to host memory. Overuse of synchronization can easily result in performance degradation. On this point, I think it might be worth limiting the host memory usable with algorithms to containers that synchronize implicitly upon actual use (or destruction) of the results. Give users the choice between that and performing explicit copies to raw types.
To be clear, not all copies to host memory are synchronized. Boost.Compute allows both synchronous and asynchronous memory transfers between the host and device. An implicitly synchronized container class could be built on top of the Boost.Compute APIs. I haven't implemented one yet, as I personally feel that blocking operations should be performed explicitly by the user rather than hidden away behind an implicit API. However, I'd be happy to review and integrate functionality like this into Boost.Compute if someone were interested in developing it.
Also, I agree with Thomas M that it'd be useful for operations to return events. That said, if you told me it doubled the overhead involved in dispatching operations, I'd suggest making it optional, either through overloads or via a mechanism in the command_queue itself.
All asynchronous operations in the command queue class do return events. One of his comments was to also return events from the synchronous methods for consistency and I am working on adding this.
2. What is your evaluation of the implementation?
It seems a bit heavy to be header-only.
I'd agree with Andrey Semashev regarding thread safety, except I think it shouldn't be an option at all. That said, not all operations on all objects need be thread safe. I think the Boost.ASIO documentation provides a good example of how to express different thread safety limitations. Now, I haven't examined the use of thread-local storage, but do you feel there's a strong case to be made for the thread safety it's providing (especially given that you seem to feel it's non-essential)?
I absolutely do not feel it is non-essential. The reason for it currently being a compile-time configurable option is that enabling it on non-C++11 compilers would make the library non-header-only (it would require the users to link to Boost.Thread). This has come up a few times in the review and I have been considering making the library non-header-only (or at least optionally non-header-only) which would obviate some of these issues. However, this is a much larger and potentially non-backwards-compatible change and I'd need to run it by the community and provide a migration plan for current users.
I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.
I'm not sure if there are any Boost conventions for/against this (someone please speak up if there are). I chose the trailing underscore for these types as I needed a consistent spelling for representing the fundamental OpenCL types (e.g. "float" or "uint" or "int4") and I couldn't just use the names without the trailing underscore as they'd conflict with the C++ reserved keywords (e.g. "float", "int"). Using a leading underscore (e.g. "_float4") looked worse to me, so I used a trailing underscore. But I'd definitely be open to hearing other ideas.
I didn't notice any unit tests specifically intended to verify exception-safety. I'd want to know that had been thoroughly vetted, before I'd consider using Boost.Compute in production code.
There are already some tests for the error/exception paths in Boost.Compute (the tests for building invalid programs in "test_program.cpp" are the first that come to mind). I'll work on adding more. Please let me know if there are any specific areas you'd like to see more heavily tested.
Finally, based on the docs, it seems there's a bug in boost::compute::command_queue::enqueue_copy_image_to_buffer() and boost::compute::command_queue::enqueue_copy_buffer_to_image(). The region and origin parameters should be arrays, in the 2D and 3D cases. If this discrepancy is intentional, it should be noted. I haven't exhaustively reviewed the reference docs, so there might be other issues of this sort.
Can you be more specific as to what you consider to be a "bug" here? These APIs in the command_queue class are very "thin" wrappers on top of the clEnqueue* functions from the OpenCL C API. If you could provide a small test-case which demonstrates the issue and submit it to the bug-tracker [1] that would be great.
3. What is your evaluation of the documentation?
It needs more and better-documented tutorials (and I'm including the "Advanced Topics", here).
Fully agree, I'll work on this.
Some of the reference pages could use more details, as well, especially concerning synchronization and exception usage.
Fully agree. I've been working on improving this. And please let me know if there are any specific classes/functions for which you'd like to see more detailed documentation.
Regarding exceptions, there should be discussion of specifically what is invalidated by which kinds of errors. If any errors are non-recoverable, this should also be called out and perhaps derived from a different or additional baseclass (not a bad use case for multiple inheritance, actually).
As far as I know, there are no non-recoverable errors or errors which invalidate host objects thrown by Boost.Compute. This is due to the fact that Boost.Compute mainly deals with executing operations on a separate device (e.g. GPU) and exceptions there don't affect the host's ability to continue execution.
I agree with Mathias Gaunard that it would be helpful to discuss computational complexity, device memory requirements, and the various methods used to implement some of the algorithms. You could continue to omit these details on the simple algorithms, though.
I will definitely continue to improve the documentation and specifically add more of these sorts of details for the algorithms. Please let me know if there are additional areas where you would also like to see more specific details.
4. What is your evaluation of the potential usefulness of the library?
This is my biggest misgiving, by far. In the very near future, I expect developers will opt for either SYCL (https://www.khronos.org/opencl/sycl) or Bolt (https://github.com/HSA-Libraries/Bolt). SYCL provides a modern, standard C++11 wrapper around OpenCL, with better concurrency control and support for integrating kernels inline. Bolt provides many of the same higher-level abstractions found in Boost.Compute, but with forthcoming support for HSA.
To have the kind of lasting relevance and broad applicability to which all Boost libraries should aspire, I think Boost.Compute should be architected to support multiple backends. Though OpenCL support is currently ascendant, it's far from universal and is already flagging on some platforms (not least Nvidia). And HSA provides a foundation on which alternatives are actively being built. Most importantly, there exist multitudes of multi-core and multiprocessor systems which lack OpenCL support. It would be eminently useful to support these with such backends as thread pool, OpenMP, etc. And backends could be added to support new technologies as they mature.
5. Did you try to use the library? With what compiler? Did you have any problems?
I downloaded and inspected the sources and some examples. I didn't have any questions or concerns that would be addressed by experimentation.
6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Review of the documentation, some sources, some examples, and a refresher on SYCL & Bolt. I also read all of the reviews and replies sent thus far.
7. Are you knowledgeable about the problem domain?
I've been developing production code for parallel and heterogeneous computing platforms for the majority of the last 16 years. I've followed OpenCL since version 1.0 was finalized. I've also submitted bug reports and minor patches for Khronos' official C++ OpenCL interface.
8. Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
Not in its current state. The primary issue is that it's too OpenCL-centric. It should support more backends, even if these are just compile-time options. This is crucial for both current and lasting relevance.
Boost.Compute is explicitly *not* designed to be an abstraction layer over every parallel-programming interface currently available. It is designed to target modern programmable GPUs, multi-core CPUs, and some of the more exotic accelerators (Xeon Phi, Parallella Epiphany, FPGAs, etc.). In my opinion, OpenCL is the best and most widely-supported interface currently available for this purpose and fills the role well. Other libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC, etc.) all depend on either a special compiler or special compiler extensions. OpenCL, on the other hand, is a library-level solution, which allows Boost.Compute to be portable to any platform with a standard C++ compiler. And as Boost.Compute and SYCL are both based on OpenCL, interoperating between them should be fairly trivial. If a non-commercial SYCL implementation is ever released, I will work on adding support for its single-source compilation model to Boost.Compute.

I feel that developing an all-encompassing parallel programming library is outside the scope of Boost.Compute (and would require considerably more resources to design, implement, and maintain). That said, I feel that Boost.Compute is well-suited to provide a back-end for a more generic/high-level parallel programming library (either something like a potential Boost.Parallel library or the upcoming standard C++ Parallelism TS).

More concretely, and as Sebastian mentioned, OpenCL itself already provides a parallel-programming abstraction with many different back-ends (both vendor-supplied and open-source implementations). I personally don't see the benefit of adding yet another abstraction layer between the user and the compute device. I feel it would greatly increase the complexity of the library and prevent us from spending more time on improving performance.
Beyond that, better concurrency control is crucial for optimum performance. If done well, it could even help further reduce usage errors.
Boost.Compute provides low-level abstractions for concurrency between the host and device. More high-level APIs and frameworks could certainly be built on top of them. However, designing an optimal, high-level API for concurrency is notoriously difficult and, in my opinion, still an open problem.

Thanks for the review. Please let me know if I can explain anything further.

-kyle

[1] https://github.com/kylelutz/compute/issues