[compute] Review period starts today December 15, 2014, ends on December 24, 2014

Antony Polukhin

15 Dec 2014

6:57 a.m.

Dear All,

Review of the Compute library starts today, Monday the 15th of December 2014, and will last for ten days.

The Compute library provides a C++ interface to multi-core GPGPU and CPU computing platforms based on OpenCL.

The project is hosted on GitHub at https://github.com/kylelutz/compute. Docs are available at http://kylelutz.github.io/compute/ Sources can be downloaded as a ZIP archive via https://github.com/kylelutz/compute/archive/master.zip

Some pre-official reviews can be found in the Boost Library Incubator: http://rrsd.com/blincubator.com/reviews/?library_id=669

Please answer the following questions in your review:

1. What is your evaluation of the design?
2. What is your evaluation of the implementation?
3. What is your evaluation of the documentation?
4. What is your evaluation of the potential usefulness of the library?
5. Did you try to use the library? With what compiler? Did you have any problems?
6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
7. Are you knowledgeable about the problem domain?

And finally, every review should answer this question:

8. Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.

--
Best regards,
Antony Polukhin

Denis Demidov

16 Dec

7:25 a.m.

Hi,

I am the one who submitted the review to the Boost incubator, but I think it should be updated. Also, I think it's easier to read the review as a whole here than on the incubator site.

### Design ###

Boost.Compute provides a thin C++ wrapper around the OpenCL host API at its core and builds a set of STL-like algorithms on top of that core. The user interface strongly resembles the STL and hence should be familiar to any C++ programmer.

The only minor problem I have with the design is the decision to provide another C++ wrapper for the OpenCL host API instead of using the standard C++ bindings header [1] provided by the Khronos Group (the body behind the OpenCL standard). This decision makes interaction with existing OpenCL libraries (that use the Khronos C++ bindings) somewhat complicated at times. I don't believe it's possible to change the design at this point though, so I am prepared to live with it.

### Implementation ###

Having proposed several patches to the library, I can say that I am familiar with its implementation. The library is well structured; the code is well designed, well formatted, and easy to read and understand. The library provides a large number of examples and an extensive set of unit tests.

When the first public announcement of the library was made here on the Boost mailing list, there were several performance problems. In particular, the compute kernels were not cached after their first invocation, and some algorithms only provided a serial implementation (some still do [2]). I know that a lot of effort has been put into the implementation since then, and the situation has much improved. The performance page [3] of the documentation shows that Boost.Compute is able to outperform Nvidia's Thrust for some algorithms, but there is still some work to be done.

### Documentation ###

The documentation does a good job of providing both an overview of the library and an extensive API reference. Boost.Compute uses BoostBook as the documentation generator, and has a look and feel compatible with the majority of Boost libraries.

### Potential usefulness of the library ###

I'd say that a library that makes it easy to harness the performance of modern graphics processors and accelerators is extremely useful. For me, as an end user, the convenience could even outweigh the loss of a fraction of performance.

Since the library interface is so close to the STL, it is extremely easy to try and use. I have successfully compiled the library with recent versions of GCC and Clang. The unit tests run fine on NVIDIA, AMD, and Intel OpenCL platforms.

Among the current alternatives that provide an STL-like set of containers and algorithms, Boost.Compute is the most portable, being built on top of standard OpenCL (see [4] for my take on the differences between Boost.Compute and the alternatives at stackoverflow.com).

One thing that could potentially make Boost.Compute obsolete is the inclusion of N3960 [5] into the standard. Kyle, what do you think about this?

### Familiarity with problem domain ###

I am the author of the VexCL [6] library, which has similar functionality to Boost.Compute but provides a higher-level interface. I would say that I am quite familiar with GPGPU programming, in both CUDA and OpenCL. I have provided an implementation of a Boost.Compute backend (algebra and operations) for the Boost.Odeint library [7], and made a couple of Boost.Compute algorithms available through the VexCL interface.

### Conclusion ###

I think the inclusion of a GPGPU library in Boost is long overdue. In my opinion, Boost.Compute deserves to be accepted. The interface of the library is well designed, but it needs some work on the performance of the provided algorithms. Due to the high reputation of the Boost collection of libraries, a newly included library almost automatically becomes a de-facto standard in its field. This is why I want a GPGPU library accepted into Boost to show state-of-the-art performance. I still believe the work on performance may be continued _after_ the library is accepted into Boost.

### References ###

1. [cl.hpp](http://www.khronos.org/registry/cl/api/1.2/cl.hpp) -- OpenCL 1.2 C++ Bindings Header File, implementing the [C++ Bindings Specification](http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.2.pdf).
2. [boost/compute/algorithm/sort_by_key.hpp](http://goo.gl/iHfMdN)
3. Boost.Compute [performance](http://kylelutz.github.io/compute/boost_compute/performance.html).
4. Answer to [Differences between VexCL, Thrust, and Boost.Compute](http://goo.gl/MRT12G).
5. [TS for C++ Extensions for Parallelism](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf).
6. [VexCL](https://github.com/ddemidov/vexcl) -- a C++ vector expression template library for OpenCL/CUDA.
7. [Boost.Compute backend for Boost.Odeint](http://goo.gl/xZSd10).

Best regards,
Denis Demidov,
Senior researcher at the Supercomputer Center of the Russian Academy of Sciences.

Kyle Lutz

17 Dec

5:11 a.m.

On Mon, Dec 15, 2014 at 11:25 PM, Denis Demidov <dennis.demidov@gmail.com> wrote:

...

Hi,

I am the one who submitted the review to the Boost incubator, but I think it should be updated. Also, I think it's easier to read the review as a whole here than on the incubator site.

Thanks for the review (and for being the first to review on the Boost library incubator)! I've addressed your comments in-line below. Let me know if I missed anything or can explain anything better.

...

### Design ###

Boost.Compute provides a thin C++ wrapper around the OpenCL host API at its core and builds a set of STL-like algorithms on top of that core. The user interface strongly resembles the STL and hence should be familiar to any C++ programmer.

The only minor problem I have with the design is the decision to provide another C++ wrapper for the OpenCL host API instead of using the standard C++ bindings header [1] provided by the Khronos Group (the body behind the OpenCL standard). This decision makes interaction with existing OpenCL libraries (that use the Khronos C++ bindings) somewhat complicated at times. I don't believe it's possible to change the design at this point though, so I am prepared to live with it.

Yeah, way back when I began work on Boost.Compute (years ago), I encountered some issues with the C++ OpenCL API which led me to use the C API directly. Also, implementing the wrappers in Boost.Compute gave me a bit more control and allowed for usage of Boost-specific tools in the implementation (e.g. using "BOOST_THROW_EXCEPTION()" for errors rather than plain "throw"). Furthermore, and more stylistically, I've implemented the Boost.Compute OpenCL wrapper types with a more STL/Boost-like API (e.g. "command_queue::enqueue_copy_buffer()" instead of "CommandQueue::enqueueCopyBuffer()"). I think this gives the library a more consistent look and feel (as it is heavily inspired by the STL API). And I've also been working on adding first-class support for using types from the Khronos C++ wrapper library (like cl::Buffer) directly with Boost.Compute types (like boost::compute::buffer). I think this should ease any pain when working with both libraries. Hopefully I'll get this finished soon.
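A minimal sketch of how two wrapper styles can share one reference-counted handle, which is the kind of interoperation described above. Everything in it is hypothetical and uses standard C++ only: `mock_mem` plays the role that `cl_mem` plays in OpenCL, and `KhronosStyleBuffer` / `compute_style_buffer` are illustrative names, not the real `cl::Buffer` or `boost::compute::buffer` classes.

```cpp
#include <cassert>
#include <utility>

namespace sketch {

// Mock of a reference-counted C handle (the role cl_mem plays in OpenCL).
struct mock_mem { int refs; };
inline mock_mem* mock_create() { return new mock_mem{1}; }
inline void mock_retain(mock_mem* m) { ++m->refs; }
inline void mock_release(mock_mem* m) { if (--m->refs == 0) delete m; }

// Generic RAII owner; Tag only serves to make the two wrapper types distinct.
template <class Tag>
class handle_wrapper {
public:
    // Adopts an existing reference to the handle.
    explicit handle_wrapper(mock_mem* m) : mem_(m) { assert(m != nullptr); }

    // Converting constructor: shares the handle held by the *other* wrapper
    // type by taking one more reference, so both wrappers stay valid.
    template <class Other>
    handle_wrapper(const handle_wrapper<Other>& o) : mem_(o.get()) {
        mock_retain(mem_);
    }

    handle_wrapper(const handle_wrapper& o) : mem_(o.mem_) { mock_retain(mem_); }
    handle_wrapper& operator=(handle_wrapper o) {
        std::swap(mem_, o.mem_);  // copy-and-swap; o releases the old handle
        return *this;
    }
    ~handle_wrapper() { mock_release(mem_); }

    mock_mem* get() const { return mem_; }

private:
    mock_mem* mem_;
};

struct khronos_style_tag {};
struct compute_style_tag {};
using KhronosStyleBuffer   = handle_wrapper<khronos_style_tag>;
using compute_style_buffer = handle_wrapper<compute_style_tag>;

}  // namespace sketch
```

Constructing a `compute_style_buffer` from a `KhronosStyleBuffer` leaves both objects pointing at the same handle with a reference count of two; destroying either one drops the count back to one. Whether the real libraries interoperate this way is not stated in the thread.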

...

### Implementation ###

Having proposed several patches to the library, I can say that I am familiar with its implementation. The library is well structured; the code is well designed, well formatted, and easy to read and understand. The library provides a large number of examples and an extensive set of unit tests.

When the first public announcement of the library was made here on the Boost mailing list, there were several performance problems. In particular, the compute kernels were not cached after their first invocation, and some algorithms only provided a serial implementation (some still do [2]). I know that a lot of effort has been put into the implementation since then, and the situation has much improved. The performance page [3] of the documentation shows that Boost.Compute is able to outperform Nvidia's Thrust for some algorithms, but there is still some work to be done.

Very true, there is still work to be done on this front. Now that the API is mostly settled, most of the development will be towards performance.

...

### Documentation ###

The documentation does a good job of providing both an overview of the library and an extensive API reference. Boost.Compute uses BoostBook as the documentation generator, and has a look and feel compatible with the majority of Boost libraries.

Thanks!

...

### Potential usefulness of the library ###

I'd say that a library that makes it easy to harness the performance of modern graphics processors and accelerators is extremely useful. For me, as an end user, the convenience could even outweigh the loss of a fraction of performance.

Since the library interface is so close to the STL, it is extremely easy to try and use. I have successfully compiled the library with recent versions of GCC and Clang. The unit tests run fine on NVIDIA, AMD, and Intel OpenCL platforms.

Among the current alternatives that provide an STL-like set of containers and algorithms, Boost.Compute is the most portable, being built on top of standard OpenCL (see [4] for my take on the differences between Boost.Compute and the alternatives at stackoverflow.com).

One thing that could potentially make Boost.Compute obsolete is the inclusion of N3960 [5] into the standard. Kyle, what do you think about this?

I've been following the Parallelism TS closely and I'm very happy to see this being worked on in the standard. But I don't think this API in the standard library would make Boost.Compute obsolete. In fact, I can see Boost.Compute being one possible back-end (or "Executor") for the parallel algorithm API. I also think Boost.Compute is a little more flexible when it comes to programming accelerators. For instance, it allows users to directly execute custom kernels/functions rather than being restricted to just the algorithms provided by the parallel API. Furthermore, Boost.Compute allows access to other GPU-specific resources such as the image/texture-caches and support for direct OpenGL/D3D interoperation. Being a separate, non-standardized library also allows it to both evolve more rapidly and support features not currently made available in the standard.

...

### Familiarity with problem domain ###

I am the author of the VexCL [6] library, which has similar functionality to Boost.Compute but provides a higher-level interface. I would say that I am quite familiar with GPGPU programming, in both CUDA and OpenCL. I have provided an implementation of a Boost.Compute backend (algebra and operations) for the Boost.Odeint library [7], and made a couple of Boost.Compute algorithms available through the VexCL interface.

### Conclusion ###

I think the inclusion of a GPGPU library in Boost is long overdue. In my opinion, Boost.Compute deserves to be accepted. The interface of the library is well designed, but it needs some work on the performance of the provided algorithms. Due to the high reputation of the Boost collection of libraries, a newly included library almost automatically becomes a de-facto standard in its field. This is why I want a GPGPU library accepted into Boost to show state-of-the-art performance. I still believe the work on performance may be continued _after_ the library is accepted into Boost.

Thanks! -kyle

Hartmut Kaiser

1:13 p.m.

All,

...

Review of the Compute library starts today, Monday the 15th of December 2014, and will last for ten days.

The Compute library provides a C++ interface to multi-core GPGPU and CPU computing platforms based on OpenCL.

Caveat: I have spent only very little time to look into this library, so I might be off by a large margin.

Mainly, I have two comments wrt the API of this library:

a) I would have expected the STL-like algorithms to be 100% aligned with N4105 (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4105.pdf). I strongly believe we will see a Boost implementation of N4105, so Boost.Compute could integrate nicely (or even lay the foundation) by defining its own execution policies.

b) As already mentioned elsewhere, I find it confusing for the library to expose two different mechanisms for dealing with asynchrony. There is the queue type representing OCL events, and the partially used future<> return type for the user to synchronize. I believe this can be unified.

For a Boost library, finding the proper API is crucial. Getting this straight before acceptance is important.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

Kyle Lutz

18 Dec

4:38 a.m.

On Wed, Dec 17, 2014 at 5:13 AM, Hartmut Kaiser <hartmut.kaiser@gmail.com> wrote:

...

All,

...
Review of the Compute library starts today, Monday the 15th of December 2014, and will last for ten days.

The Compute library provides a C++ interface to multi-core GPGPU and CPU computing platforms based on OpenCL.

Caveat: I have spent only very little time to look into this library, so I might be off by a large margin.

Thanks for taking a look! I've addressed your comments in-line below.

...

Mainly, I have two comments wrt the API of this library:

a) I would have expected the STL-like algorithms to be 100% aligned with N4105 (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4105.pdf). I strongly believe we will see a Boost implementation of N4105, so Boost.Compute could integrate nicely (or even lay the foundation) by defining its own execution policies.

Well, the API can't be 100% aligned with the proposal, as Boost.Compute supports C++03 compilers. Places where the APIs differ (e.g. slightly different signatures/semantics for some of the new algorithms) are due to what I consider shortcomings in the proposal (at least as far as it applies to GPU/accelerator programming). But anyway, I think Boost.Compute is perfectly suited to provide an N4105-style ExecutionPolicy which could be used to execute algorithms on the GPU using the proposed standard API. Also, I would be very much in support of an implementation of the proposal for Boost and would be interested in collaborating with anyone working on that.
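One way to picture the integration discussed above is tag dispatch on a policy object. The following is only a sketch in standard C++11: `sequential_policy`, `device_policy`, and `sketch::transform` are hypothetical names and are not taken from N4105 or from Boost.Compute, and the "device" overload merely stages the data and runs a host loop where a real back-end would copy to device memory and enqueue a kernel.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

namespace sketch {

// Hypothetical policy tags; the proposal defines its own policy types.
struct sequential_policy {};

struct device_policy {
    std::size_t launches = 0;  // counts how often the "device" path was taken
};

// Host path: forwards straight to the standard algorithm.
template <class InIt, class OutIt, class F>
OutIt transform(sequential_policy&, InIt first, InIt last, OutIt out, F f) {
    return std::transform(first, last, out, f);
}

// Stand-in for a device path. A real back-end would copy the input to
// device memory, launch a kernel, and copy the result back; this mock only
// stages the input in a temporary buffer and transforms it on the host.
template <class InIt, class OutIt, class F>
OutIt transform(device_policy& policy, InIt first, InIt last, OutIt out, F f) {
    ++policy.launches;
    typedef typename std::iterator_traits<InIt>::value_type value_type;
    std::vector<value_type> staging(first, last);
    return std::transform(staging.begin(), staging.end(), out, f);
}

}  // namespace sketch
```

The caller writes the same algorithm call in both cases and selects the execution path purely through the type of the first argument, which is the property that would let a GPU back-end plug into a standard-style parallel algorithm interface.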

...

b) As already mentioned elsewhere, I find it confusing for the library to expose two different mechanisms for dealing with asynchrony. There is the queue type representing OCL events, and the partially used future<> return type for the user to synchronize. I believe this can be unified.

To be clear, there is only one mechanism for asynchrony, the command queue abstraction provided by OpenCL. Calls that enqueue operations on the command queue (e.g. copying between memory buffers or launching a kernel) are non-blocking and return an event object which can be used to track the progress of the operation or wait for its completion. The future<> class in Boost.Compute simply wraps the returned OpenCL event object and provides a standard C++ future API for it. This is merely provided as a convenience. Hope this makes things more clear. -kyle
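The relationship described above, a future that is nothing more than a result paired with the event guarding it, can be sketched without OpenCL. This is a mock under stated assumptions: `sketch::event`, `sketch::future`, and `fill_async` are hypothetical, and the mock event "completes" by running the pending work inside `wait()`, whereas a real OpenCL event would be signalled by the device.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical stand-in for an OpenCL event: tracks one enqueued operation.
class event {
public:
    explicit event(std::function<void()> work)
        : state_(std::make_shared<state>(std::move(work))) {}

    bool complete() const { return state_->done; }

    // Blocks until the operation has finished. The mock finishes the
    // operation by running it here.
    void wait() const {
        if (!state_->done) {
            state_->work();
            state_->done = true;
        }
    }

private:
    struct state {
        explicit state(std::function<void()> w) : work(std::move(w)), done(false) {}
        std::function<void()> work;
        bool done;
    };
    std::shared_ptr<state> state_;
};

// Future-style convenience wrapper: a result plus the event that guards it.
template <class T>
class future {
public:
    future(T result, event e) : result_(std::move(result)), event_(std::move(e)) {}
    void wait() const { event_.wait(); }
    T get() const { wait(); return result_; }
    const event& get_event() const { return event_; }

private:
    T result_;
    event event_;
};

// Mock non-blocking enqueue: returns immediately; the vector is only
// filled once the associated event has completed.
inline future<std::vector<int>*> fill_async(std::vector<int>& v, int value) {
    std::vector<int>* target = &v;
    return future<std::vector<int>*>(target, event([target, value] {
        for (int& x : *target) x = value;
    }));
}

}  // namespace sketch
```

In this arrangement there is a single synchronization primitive, the event; `future<T>::get()` just waits on it before handing back the result, so the two interfaces cannot disagree about when an operation is done.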

Thomas M

20 Dec

7:09 a.m.

Dear all, [2nd posting trial]

...

Review of the Compute library starts today

Having spent the last 3 years on a larger-scale C++ project utilizing OpenCL as its main computing engine, I have no doubt that a C++ GPGPU library is worthwhile. My evaluation was based on studying the full docs and tutorial examples, creating my own (simple) applications, and inspecting selected implementation details.

I liked the portable STL-like algorithms most; combined with fairly straightforward on-the-fly specification of kernel functions (including lambda expression support), GPGPU utilization becomes much more accessible for everyday C++ programs. The design of this part of the library is rather clean and aligns well with C++ standard customs. I have checked a handful of function interfaces and they correspond to the C++ STL variants.

However, I have also encountered a number of issues, of which I consider the most severe to be the library's overall design/aim:

The Khronos Group has itself provided a C++ bindings API for years (https://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.2.pdf). Frankly, the Khronos API is not an example of clean, modern C++, but it provides very fine-grained control of operations (which can be crucial for effective GPGPU performance), is developed (although with a bit of lag) alongside OpenCL itself, and is covered in every elaborate OpenCL textbook. It is thus IMHO the current de-facto standard C++ OpenCL wrapper. The proposed Boost library's core & utility parts (heavy in total!) seem to do just the same interface-wise, yet they lack some important features, e.g. detailed control flow (blocking / event setting) and image classes, or deviate in subtle signature details, making it really difficult to grasp which does exactly what / behaves differently. A Boost library should not start from scratch but integrate with and extend the Khronos API, at both the C and C++ bindings level (e.g. by making the STL-like algorithms available to them in a straightforward manner). Programmers can thus rely on established practices (personally I wouldn't switch away from the Khronos C++ API as the main underlying workhorse!) yet benefit from the extended functionality provided by the Boost library.

On the other hand, for those rather new to OpenCL a simplified, less error-prone design would be beneficial; equally, this can raise everyone's productivity. The current design/implementation follows the typical OpenCL execution model incl. some of its caveats (see the tutorial "Transforming Data"): explicitly copying input data to the device, executing kernel(s), and then copying the output back to the host. Frequently, however, the whole emphasis is on the kernel invocation (run an algorithm on the data), rendering the copying an implementation detail that's "done because it must be done" but otherwise makes the code longer and is a source of errors if forgotten.

I hence wonder if the following overall design would be more appropriate (this is by no means a request to do it this way, I am just trying to bring in alternative perspectives):

1) build on top of the Khronos C / C++ bindings API, i.e. use that as the base instead of the library's own core + utilities parts
2) offer a high-level interface for algorithm execution that exposes users to as few OpenCL internals as possible while giving the algorithms lots of flexibility
3) offer a high-level interface that auto-connects STL containers with OpenCL memory objects, implemented based on standard C++ / Khronos API classes
4) offer a high-level interface that applies the algorithms directly to objects of the Khronos C / C++ API

The first point is rather obvious. To me the proposed library parts core + utility appear as just another C++ wrapper, and unless this is done extremely (!!!) well (i.e. offering every functionality the Khronos API does, yet making it clean from scratch, aligning it with Standard C++, extending it with essential features etc., ensuring rock-solid quality control for reliability etc.; I'd set the bar really high here) I see no reason to do it. If people are forced to use the proposed library core wrapper just to gain access to the other functionality (and there is good other functionality in there!) then I think there is a serious risk that a considerable number of people will simply turn away altogether.

With respect to the second point, I suppose something like this is doable:

    // BEGIN
    compute::gpgpu_engine gpgpuEngine; // default initialization links it to a default device etc.

    // create vector, fill with data - ordinary C++
    std::vector<float> vec(10000);
    std::generate(vec.begin(), vec.end(), rand);

    compute::transform(vec.begin(), vec.end(), vec.begin(), compute::sqrt<float>(), gpgpuEngine);
    std::cout << vec[0]; // results are already on the host
    // END

So an instance of gpgpu_engine (or whatever name) gets set up once and can, if needed, be customized in its behaviour (devices used, execution policies etc.). This engine can then internally (hidden from the user, like a std::vector manages its memory) manage buffers, and when transform gets invoked it:

-) copies the data to one of its buffers (creating one if none is available)
-) runs the kernel
-) copies the data to the host-side container (keeping the buffer for later reuse)

This would bring several advantages:

-) the code becomes very similar to ordinary C++ code -> shorter, less error-prone
-) buffers can be recycled among multiple algorithm calls (e.g. the engine can cache a small number of buffers immediately available for future calls)
-) more efficient OpenCL runtime utilization (performance) because the whole operation sequence has been abstracted: e.g. the input copy operation can be enqueued in a non-blocking fashion, so the data transfer to the device and the boost.compute kernel preparation can occur concurrently; equally, while the kernel runs the copy-back-to-host command can already be enqueued
-) gpgpu_engine can encapsulate a number of policies that control its behaviour (providing sensible default configurations but allowing fine-grained control if desired), e.g.: error handling (e.g. throwing exceptions vs. setting plain old OpenCL error codes); the device to execute on (if multiple are available); copying back to the host in a non-blocking manner (like copy_async); selection of a 'smart' execution path (for example, if the input data do not warrant the overhead of a GPU call [e.g. if they are too small or do not fit GPU computation well], defer the call to a plain STL algorithm call or use OpenCL's built-in native C++ calling/threading functionality); etc. It would be beneficial if those options could be temporarily overridden (something like the boost::ios_..._saver classes comes to mind).

With respect to the third point, I am thinking of something along the lines of:

    template <class T, ... policies etc for both std::vector and cl::Buffer>
    class vector_buffer
    {
    private:
        std::vector<T,...> vec;
        cl::Buffer buf;
    };

The class ensures that the std::vector and the buffer are automatically synchronized whenever changes must become visible (i.e. on access). Obviously this requires some thought if get functions grant access to the plain std::vector / cl_mem / cl::Buffer; however, for the present implementation I also don't see what would stop me from hijacking a cl_mem from a compute::vector and modifying the buffer arbitrarily outside the compute::vector class.

For 4) I am thinking of overloads for the algorithms that more or less directly accept Khronos C / C++ objects. Some really light-weight adapters adding e.g. the required type + size data could do the trick.

In general I'd find it useful if any of a host object, a device object, and something linking a host object with a device object could be an input / output of an algorithm, with the implementation taking care of automatic data transfer. So if the input refers to a host object the implementation automatically copies the data to the device, if the output is a host object it also automatically copies the result to it, etc.

A final but probably very important design consideration: I wonder if Boost needs an OpenCL computing library, or a general parallelization library. Presently the GPGPU world is already split too much between CUDA and OpenCL as the main players (hardware vendors doing their parts ...), and technology is moving really rapidly (APUs etc.). As Hartmut has already pointed out, one approach could be to use the current proposal as the foundation for a parallelization implementation: cut it down to the essentials of that and hide as many OpenCL implementation details as possible. A completely different approach could be to try to come up with a unifying parallelization framework that supports multiple backends (OpenCL, CUDA, others). Obviously this would be a tremendous amount of work (and getting that API right is probably extremely difficult - the last thing we need is just another restricted API causing more splitting), but in the long run it could be the more rewarding solution.

Implementation details:

I have checked the implementation only briefly, mostly only when questions arose for a few functions. Overall it looks OK and organized, yet I have encountered some issues.

1) type safety [major issue - must be fixed before acceptance]: Type safety is not a strength of OpenCL, and this is reflected in parts of the implementation where it fails to add a proper conversion/protection layer. Using compute::reduce it was embarrassingly easy to produce rubbish results through the following code (modifying the provided tutorial code):

    // BEGIN
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // generate random data on the host - type is float
    std::vector<float> host_vector(10000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // create a vector on the device
    compute::vector<float> device_vector(host_vector.size(), context);

    // transfer data from the host to the device
    compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);

    double reduction_result = 0.0; // result is of type double
    compute::reduce(device_vector.begin(), device_vector.end(), &reduction_result, queue);
    std::cout << "result: " << reduction_result << std::endl;
    // END

The input data is of type float while the result shall be stored in a double. This fails miserably under the current implementation because, after the reduction has completed, the final value stored in device memory gets copied merely byte-wise to the target variable (using a plain type-ignorant clEnqueueReadBuffer), simply reading the 4 bytes of a float into an 8-byte double (4/8 on my PC). I suppose reversing the types (double as input, float as output) will be even more spectacular because 4 superfluous bytes simply overwrite the stack.

The same affects a plain compute::copy, for example if the device_vector above is of type double:

    // BEGIN
    // generate random data on the host - type is float
    std::vector<float> host_vector(10000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // create a vector on the device - type is double
    compute::vector<double> device_vector(host_vector.size(), context);

    // transfer data from the host to the device
    compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);
    // END

It fails in the same way because the data are just copied byte-wise. The library must provide strict type safety for all algorithms / data structures, where (in the order of preference that comes to mind):

a) convert properly to the target type if possible (surely applicable above)
b) issue a compile-time error if the conversion is not possible
c) last fallback: throw a proper exception at runtime

2) when inspecting the code flow in the above copy operation I missed a debug-mode check that the output range of a copy operation can hold that many elements; something like a safe iterator returned by device_vector.begin() -> a good implementation should throw an organized error instead of just overwriting memory.

3) for float/double input containers compute::accumulate falls back to a plain serial reduction, making element-wise additions (which is really slow on a GPU). This is because can_accumulate_with_reduce returns false, as it is only defined for integral types. Is there a technical reason why it cannot work for floating-point types? How many algorithms are affected by a possible fallback to a plain serial execution order?

4) for types not supported under both OpenCL and C++ (e.g. long double, bool, half) more specific error messages would be useful.

Note: the above lists only issues which I have encountered during my few trials; there's no claim whatsoever of complete coverage.

Performance:

I have not really tested performance, so I cannot say much about it. At times I spotted what appears to be unnecessary OpenCL runtime overhead (e.g. blocking commands, resetting kernel arguments upon each invocation), but I am not familiar enough with the implementation to judge if this is really just redundant.

Invoking the OpenCL compiler always takes considerable time for any OpenCL program. The library compiles kernels on demand when they are first encountered for execution; while this is technically reasonable, I am not sure how clear it is to everyone (foremost end users of programs) that e.g. a simple accumulate of 100 ints may take several seconds to execute in total on first invocation simply because the kernel compilation takes that long. I guess this considerable penalty also somewhat discourages using the library to create a number of kernels on the fly.

I would find it very useful if smart algorithms dispatched to a plain C++ algorithm when it's really predictable that a GPU execution will just waste time (I have elaborated on this above). It's fairly trivial to have data/algorithm combinations that are better not executed on the GPU; being able to rely on some auto-mechanism would relieve programmers.

Answers to reviewer questions:

1. What is your evaluation of the design?

See comments above.

2. What is your evaluation of the implementation?

See comments above.

3. Documentation:

Overall I find the documentation well written and structured. As minor issues, at times it could be more explicit / elaborate (e.g. for compute::vector a short description is provided, but for several other containers there is none; what are the accepted types for predicates in the algorithms; etc.). The installation page does not mention that (on Windows) the OpenCL headers must be explicitly included (it leaves the impression that the library would do it on its own). The performance page should include more details with respect to overhead, specifically "unintuitive" ones such as kernel compilation time. Recommendations could be given on when a problem is / is not suitable for GPU execution. It is unclear if the provided measurements refer to float or double; measurements for both should be provided.

4. Overall usefulness of the library

I find the portable STL-ish algorithms (+ their supplements) useful. With respect to the core + utilities parts I think unnecessary competition with existing wrappers is introduced.

5. Did you try to use the library?

I used MSVC 12 and had no problems installing it and running a few little programs.

6. How much effort did you put into your evaluation?

I read the documentation, ran tutorial code and created a few example programs on my own (I did not run any of the pre-packaged examples). When questions arose I took close looks at the implementation (thus I focused more on in-depth analysis of selected components than on testing in breadth).

7. Are you knowledgeable about the problem domain?

I consider myself knowledgeable about the problem domain.

8. The core question: Do you think the library should be accepted as a Boost library?

Generally speaking yes, but I recommend a major revision before acceptance is reconsidered. My greatest concern revolves around the overall design aim. I don't like the idea of competing with the Khronos API for general wrapping; I'd prefer a more light-weight library with the STL-ish algorithms (or other things) at the core, adding to what is already out there. The library needs to find its own niche, minimizing overlap and elaborating on its novelty/strengths. The implementation must become more robust; presently it is way too easy to break things.

best, Thomas
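The type-safety failure reported in this review can be reproduced without OpenCL. The sketch below is standard C++ only and makes no claim about how Boost.Compute is implemented: `read_bytewise` is a hypothetical helper that mimics a type-ignorant read of a 4-byte float into an 8-byte double, and `copy_converting` is a hypothetical helper showing the element-wise conversion requested as option a).

```cpp
#include <cstring>
#include <vector>

namespace sketch {

// Mimics a type-ignorant buffer read: only sizeof(float) raw bytes are
// copied into a double, so the double's bit pattern is meaningless.
inline double read_bytewise(float device_value) {
    double host_value = 0.0;
    std::memcpy(&host_value, &device_value, sizeof(float));
    return host_value;
}

// Converts every element to the destination type before storing it,
// which is what a type-safe copy between float and double has to do.
template <class To, class From>
std::vector<To> copy_converting(const std::vector<From>& src) {
    return std::vector<To>(src.begin(), src.end());
}

}  // namespace sketch
```

`read_bytewise(1.5f)` does not return 1.5 on either little-endian or big-endian IEEE-754 machines, while `copy_converting<double>` applied to a `std::vector<float>` holding 1.5f and 2.25f yields exactly 1.5 and 2.25.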

Paul A. Bristow

28 Dec 28 Dec

12:48 p.m.

New subject: [compute] Review period starts today December 15, 2014, ends on December 24, 2014

...

-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Antony Polukhin Sent: 15 December 2014 06:58 To: boost@lists.boost.org List; boost-announce@lists.boost.org; boost-users@lists.boost.org; Kyle Lutz Subject: [boost] [compute] Review period starts today December 15, 2014, ends on December 24, 2014

Dear All,

Review of the Compute library starts today on Mon 15th of December 2014 and will last for ten days.

The Compute library provides a C++ interface to multi-core GPGPU and CPU computing platforms based on OpenCL.

...

Please, answer the following questions in your review:

1. What is your evaluation of the design? Unqualified to judge.

2. What is your evaluation of the implementation? Unqualified to judge.

(I note some worrying about floating-point. Is this not over-expectation? "Die ganzen Zahlen hat der liebe Gott gemacht, alles andere ist Menschenwerk." ("The dear Lord made the integers; everything else is the work of man.") -- Leopold Kronecker. We should never expect to get exactly the same results from hardware C++ as from software GPU FP.)

...

3. What is your evaluation of the documentation? Very good.

4. What is your evaluation of the potential usefulness of the library? Invaluable.

5. Did you try to use the library? No.

6. How much effort did you put into your evaluation? A glance?

...

7. Are you knowledgeable about the problem domain? Barely.

And finally, every review should answer this question:

8. Do you think the library should be accepted as a Boost library?

Yes, because I think Kyle has shown the feasibility and utility of this library and has some real-life users. I trust him to maintain and develop the library in future. The C++ and GPU field is fast moving, so I don't think we should be too fixated on finding the ideal solution. There will be more attempts to add GPU support to Standard C++ itself. Obviously, the more we can allow platforms other than OpenCL to be added later, the better. As with Boost.Sort, I think we should use the library as a container for various tools to use GPUs. I get the impression that the existing setup does not close too many doors too firmly. So I feel it is time to let it hit the streets.

Paul

---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830

Belcourt, Kenneth

31 Dec 31 Dec

12:15 a.m.

New subject: [EXTERNAL] [compute] Review period starts today December 15, 2014, ends on December 24, 2014

On Dec 14, 2014, at 11:57 PM, Antony Polukhin <antoshkka@gmail.com> wrote:

...

1. What is your evaluation of the design?

Very nice, love the STL algorithm support, should make it fairly easy for our codes to migrate to Compute given our existing heavy reliance on STL.

...

2. What is your evaluation of the implementation?

Didn’t really look at it.

...

3. What is your evaluation of the documentation?

A pleasure to peruse, nicely laid out and well-phrased.

...

4. What is your evaluation of the potential usefulness of the library?

Very high. Easy to use with fairly comprehensive STL support.

...

5. Did you try to use the library?

Yes, ran many Compute examples and test programs, they worked as expected with a Tesla K20X on Cray. Also added some Compute calls in one of our applications, they worked as expected.

...

With what compiler?

Cray compiler.

...

Did you have any problems?

I ran into a compilation problem on another Cray system:

"boost/include/boost/compute/type_traits/type_name.hpp", line 79: error: class "boost::compute::detail::type_name_trait<boost::compute::char_>" has already been defined
BOOST_COMPUTE_DEFINE_BUILTIN_TYPE_NAME_FUNCTION(char)
^

Commenting out line 79 got me past the error.

...

6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

A few hours, mostly building and testing out the Compute tests and its interaction with our own codes. Studied STL support to assess usability and the ease of finding what I need in the documentation. I didn’t closely examine the implementation details, though I expect to as we begin to use it.

...

7. Are you knowledgeable about the problem domain?

Yes, I’m knowledgeable.

...

8. Do you think the library should be accepted as a Boost library?

Yes. It’s already highly usable in its current form and I’m confident it will evolve and be improved once accepted into Boost. Very nice work Kyle.

-- Noel Belcourt

Kyle Lutz

1:08 a.m.

New subject: [EXTERNAL] [compute] Review period starts today December 15, 2014, ends on December 24, 2014

On Tue, Dec 30, 2014 at 4:15 PM, Belcourt, Kenneth <kbelco@sandia.gov> wrote:

...

I ran into a compilation problem on another Cray system:

"boost/include/boost/compute/type_traits/type_name.hpp", line 79: error: class "boost::compute::detail::type_name_trait<boost::compute::char_>" has already been defined
BOOST_COMPUTE_DEFINE_BUILTIN_TYPE_NAME_FUNCTION(char)
^

Commenting out line 79 got me past the error.

Interesting. My best guess is that this occurs because "cl_char" and "char" are actually the same type on that platform, which leads us to specialize the same template twice. Would you mind submitting a bug to the issue tracker [1] for this?
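
The clash Kyle guesses at can be reproduced in isolation, together with one possible way of guarding against it. The sketch below is an illustration only, not the Boost.Compute code and not necessarily how the issue was resolved: `sketch_cl_char` is a hypothetical stand-in for `cl_char`, and this `type_name_trait` is a simplified lookalike of the trait named in the error message. In C++, `char`, `signed char` and `unsigned char` are three distinct types, so two full specializations only collide when a platform header makes the alias plain `char`.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-in for OpenCL's cl_char. It is assumed here to be
// `signed char`; changing it to plain `char` mimics the reported platform.
typedef signed char sketch_cl_char;

// `char` and `signed char` are always distinct types in C++.
static_assert(!std::is_same<char, signed char>::value,
              "char and signed char are distinct types");

// Primary template, left undefined so unsupported types fail to compile.
template <class T, class Enable = void>
struct type_name_trait;

// Full specialization for plain `char`.
template <>
struct type_name_trait<char, void> {
    static std::string value() { return "char"; }
};

// Partial specialization for the alias, enabled only when the alias is a
// type distinct from `char`. If both name the same type, this candidate is
// discarded by SFINAE instead of redefining the specialization above.
template <class T>
struct type_name_trait<T, typename std::enable_if<
        std::is_same<T, sketch_cl_char>::value &&
        !std::is_same<sketch_cl_char, char>::value>::type> {
    static std::string value() { return "char"; }
};
```

Writing the second case as a second full specialization, `template <> struct type_name_trait<sketch_cl_char>`, would produce an "already defined" style error as soon as the typedef is switched to `char`, which matches the shape of the message in the report.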

...

...
8. Do you think the library should be accepted as a Boost library?

Yes. It’s already highly usable in its current form and I’m confident it will evolve and be improved once accepted into Boost. Very nice work Kyle.

Thanks for the review!

-kyle

[1] https://github.com/kylelutz/compute/issues
