Dear all, [2nd posting trial]
Review of the Compute library starts today
Having spent the last 3 years on a larger-scale C++ project utilizing
OpenCL as main computing engine, without doubt a C++ GPGPU library is
worthwhile. My evaluation was based on studying the full docs, tutorial
examples, creating own (simple) applications, and inspecting selected
implementation details.
I liked most the portable STL-like algorithms; combined with fairly
straightforward on-the-fly specifications of kernel functions (including
lambda expression support) GPGPU utilization becomes much more
accessible for every-day C++ programs. The design of this library part
is rather clean and aligns well with the C++ standard customs. I have
checked a handful of function interfaces and they correspond to the C++
STL variants.
However I have also encountered a number of issues, of which I consider
most severe the overall library's design/aim:
Khronos Group already provides (since years) a C++ bindings API itself
(https://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.2.pdf).
Frankly the Khronos API is not an example of a clean, modern C++, but it
provides very fine-grained operations control (which can be crucial for
effective GPGPU performance), is developed (althouth a bit lagged) with
OpenCL itself, and covered in every elaborate OpenCL textbook. It is
thus IMHO the current de-facto C++ OpenCL wrapper standard. The proposed
boost library's parts core & utility (heavy in total!) seem to do just
the same interface-wise, yet lack some important features e.g. detailed
control flow (blocking / event setting), image classes, or deviate in
subtle signature details making it really difficult to grasp which does
exactly what / behaves differently. A boost library should not start
from scratch but integrate with and extend the Khronos API, both at C
and C++ bindings level (e.g. providing the STL-like algorithms to them
in straightforward to use manners). Programmers can thus rely on
established practices (personally I wouldn't switch away from the
Khronos C++ API as main underlying workhorse!) yet benefit from the
extended functionality provided by the boost library.
On the other hand to those rather new to OpenCL a simplified, less
error-prone design would be beneficial; equally this can raise the
productivity of everyone. The current design/implementation follows the
typical OpenCL execution model incl. some of its caveats (see tutorial
"Transforming Data"): explicitly copying input data to the device,
executing kernel(s), and then copying the output back to the host.
Frequently however the whole emphasis is on the kernel invocation (run
an algorithm on the data), rendering the copying an implementation
detail that's "done because it must be done" but otherwise makes code
longer and comprises an error source if forgotten.
I hence wonder if the following overall design would be more appropriate
(this is by no means a request to doing it this way, I just try to bring
in alternative perspectives):
1) build on top the Khronos C / C++ bindings API, i.e. use that as base
instead of the own core + utilities parts
2) offering a high-level interface for algorithm execution that exposes
users to as little OpenCL internals as possible while giving the
algorithms lots of flexibility
3) offering a high-level interface that auto-connects STL-containers
with OpenCL memory objects, implemented based on standard C++ / Khronos
API classes.
4) offering a high-level interface that applies the algorithms directly
to objects of the Khronos C / C++ API.
The first point is rather obvious. To me the proposed library parts core
+ utility appear as just another C++ wrapper, and unless this is done in
extremely (!!!) well manner (i.e. offering every functionality the
Khronos API does, yet making it clean from scratch, aligning it with
Standard C++, extending it by essential features etc., ensuring a
rock-solid quality control for reliability etc.; I'd set the bar really
high here) I see no reason to do it. If people are forced to use the
proposed library core wrapper just to gain access to the other
functionality (and there is good other functionality in there !) then I
think there is a serious risk that a considerable number of people
simply turn away altogether.
With respect to the second point I suppose something like that is doable:
// BEGIN
compute::gpgpu_engine gpgpuEngine; // default initialization links it to
a default device etc.
// create vector, fill with data - ordinary C++
std::vector<float> vec(10000);
std::generate(vec.begin(), vec.end(), rand);
compute::transform(vec.begin(),
vec.end(),
vec.begin(),
compute::sqrt<float>(),
gpgpuEngine);
std::cout << vec[0]; // results are already on the host
// END
So an instance of gpgpu_engine (or whatever name) gets setup once and
can, if needed, become customized for its behaviour (devices used,
execution policies etc.). This engine can then internally (hidden to the
user, like a std::vector manages it's memory) manage buffers, and when
transform now gets invoked it:
-) copies the data to one of its buffers (create one if none available)
-) run the kernel
-) copies the data to the host side container (keep buffer for reuse later)
This would bring several advantages:
-) the code becomes very similar to ordinary C++ code -> short-handed,
less error-prone
-) buffers can be recycled among multiple algorithm calls (e.g. the
engine can cache a small number of buffers immediately available for
future calls)
-) more efficient OpenCL runtime utilization (performance) because the
whole operation sequence has been abstracted: e.g. the input copy
operation can be enqueued in a non-blocking fashion, so the data
transfer to the device and the boost.compute kernel preparation can
occur concurrently; equally while the kernel runs the copy-back-to-host
command can already become enqueued.
-) gpgpu_engine can encapsulate a number of policies that control its
behaviour (providing sensible default-configurations but allowing
fine-grained control if desired), e.g.: error handling (e.g. throwing
exception vs. setting some plain-odd OpenCL error codes); device to
execute on (if multiple available); copying back to the host in
non-blocking manner (like copy_async); to allow the selection of a
'smart' execution path (for example if the input data do not warrant the
overhead of a GPU call [e.g. if they are too small or do not well fit
GPU computation problems] defer the call a plain STL-algorithm call or
use OpenCL's built-in native C++ calling threading functionality); etc.
It would be beneficial if those options can become temporarily
overwritten (something like the boost::ios_..._saver classes come to mind).
With respect to the third point I am thinking of something along the
lines of:
template
class vector_buffer
{
private:
std::vector vec;
cl::Buffer buf;
};
the class ensures that the std::vector and the buffer are automatically
synchronized whenever changes must become transparent (i.e. access).
Obviously this requires some thought if get functions grant access to
the plain std::vector / cl_mem/cl::Buffer; however for the present
implementation I also don't see what would stop me from hijacking a
cl_mem from a compute::vector and modify the buffer arbitrarily outside
the compute::vector class.
For 4) I am thinking of overloads for the algorithm to +- directly
accept Khronos C / C++ objects. Some really light-weight adapters adding
e.g. required type + size data could do the trick.
In general I'd find it useful that all of a host object, a device object
and something linking a host with a device object can be an input /
output of an algorithm, and the implementation takes care of automatic
data transfer. So if the input refers to a host object the
implementation automatically copies the data to the device, if the
output is a host object it also automatically copies the result to it etc.
A final but probably very important design consideration:
I wonder if boost needs a OpenCL-computing library, or a general
parallelization library. Presently the GPGPU world is already split too
much between CUDA and OpenCL as main players (hardware vendors doing
their parts ...), and technology is really rapidly moving (APUs etc.).
As Hartmut has already pointed out one approach could be to use the
current proposal as foundation for a parallelization implementation: cut
it down to the essentials of that and hide as much OpenCL implementation
details as possible.
A completely different approach could be to try coming up with a
unifying parallelization framework that supports multiple backends
(OpenCL, CUDA, others). Obviously this would be a tremendous amount of
work (and getting that API right is probably extremely difficult - the
last thing we'd need is just another restricted API causing more
splitting) but in the long run could be the more rewarding solution.
Implementation details:
I have checked the implementation only briefly, mostly only when
questions arose for a few functions. Overall it looks ok and organized,
yet I have encountered some issues.
1) type safety [major issue - must be fixed before acceptance]:
Type safety is not a strength of OpenCL, and this is reflected at parts
in the implementation when it fails to add a proper
conversion/protection layer. Using compute::reduce it was embarrassingly
easy to produce rubbish results through the following code (modifying
the provided tutorial code) :
// BEGIN
compute::device device = compute::system::default_device();
compute::context context(device);
compute::command_queue queue(context, device);
// generate random data on the host - type is float
std::vector<float> host_vector(10000);
std::generate(host_vector.begin(), host_vector.end(), rand);
// create a vector on the device
compute::vector<float> device_vector(host_vector.size(), context);
// transfer data from the host to the device
compute::copy(host_vector.begin(), host_vector.end(),
device_vector.begin(), queue);
double reduction_result = 0.0; // result is of type double
compute::reduce(device_vector.begin(), device_vector.end(),
&reduction_result, queue);
std::cout << "result: " << reduction_result<< std::endl;
// END
The input data is of type float while the result shall be stored in a
double. This fails miserably under the current implementation because
after the reduction has completed the final value stored in device
memory gets copied merely byte-wise to the target variable (using a
plain type-ignorant clEnqueueReadBuffer), reading simply the 4 bytes
from a float into an 8 byte double (4/8 on my PC machine). I suppose
reversing types (double as input, float as output) will be even more
spectacular because 4 superfluous bytes simply overwrite the stack.
The same affects a plain compute::copy, for example if above the
device_vector is of type double:
// BEGIN
// generate random data on the host - type is float
std::vector<float> host_vector(10000);
std::generate(host_vector.begin(), host_vector.end(), rand);
// create a vector on the device - type is double
compute::vector<double> device_vector(host_vector.size(), context);
// transfer data from the host to the device
compute::copy(host_vector.begin(), host_vector.end(),
device_vector.begin(), queue);
// END
it equally makes pang because the data are just copied byte-wise.
The library must provide a strict type-safety for all algorithms / data
structures, where (in order of preference that comes to mind):
a) convert properly to target type if possible (above surely applicable)
b) issue compile-time error if conversions not possible
c) last fallback: throw a proper exception at runtime
2) when inspecting the code flow in above copy operation I missed a
debug mode check that for a copy operation the output range can hold
that many elements; something like a safe iterator returned by
device_vector.begin() -> a good implementation should throw an organized
error instead of just overwriting memory.
3) for float/double input containers compute::accumulate falls back to a
plain serial reduction, making element-wise additions (which is really
slow on a GPU). This is because can_accumulate_with_reduce returns false
as it is not defined for integral types. Is there a technical reason why
it cannot work for floating types? How many algorithms are affected by a
possible fallback to a plain serial execution order?
4) for types not supported under both OpenCL and C++ (e.g. long double,
bool, half) more specific error messages would be useful.
Note: above are listed only issues which I have encountered during my
few trials; there's no claim whatsoever for complete coverage
Performance:
I have not really tested performance so I cannot say much on it.
At times I spotted what appears as unnecessary OpenCL runtime overhead
(e.g. blocking commands, resetting kernel arguments upon each
invocation) but I am not familiar enough with the implementation to
judge if this really just redundant.
Invoking the OpenCL compiler always takes considerable time for any
OpenCL program. The library compiles kernels on demand when encountered
to execute; while this is technically reasonable I am not sure in how
far it is clear to everyone (foremost end-users of programs) that e.g. a
simply accumulate of 100 ints may take several seconds to execute in
total on first invocation simply because the kernel compilation takes
that time. I guess this considerable penalty also somewhat discourages
from using the library to create a number of kernels on the fly.
I would find it very useful if smart algorithms dispatch the algorithm
to a plain C++ algorithm if it's really predictable that a GPU execution
will just waste time (I have elaborated on this above). It's fairly
trivial to have data/algorithm combinations that are better not executed
on the GPU, being able to rely on some auto-mechanism would relief
programmers.
Answers to reviewer questions:
1. What is your evaluation of the design?
See comments above
2. What is your evaluation of the implementation?
See comments above
3. Documentation:
Overall I find the documentation well written and structured. As minor
issues at times it could be more explicit / elaborate (e.g. for
compute::vector a short description is provided, but for several other
containers there is none; what are the accepted types for predicates in
the algorithms etc.).
The installation page misses that (on Windows) the OpenCL headers must
be explicitly included (it leaves the impression that the library would
do it on it's own).
The performance page should include more details with respect to
overhead and specifically "unintuitive" ones such as kernel compilation
time. Recommendations can be given when a problem is / is not suitable
for GPU execution. It is unclear if the provided measurements refer to
float or double; measurements for both should be provided.
4. Overall usefulness of the library
I find the portable STL-ish algorithms (+ their supplements) useful.
With respect to the core + utilities parts I think unnecessary
competition with existent wrappers is introduced.
5. Did you try to use the library?
I used MSVC 12 and had no problems installing it and running a few
little programs.
6. How much effort did you put into your evaluation?
I read the documentation, ran tutorial code and created a few example
programs on my own (I did not run any of the pre-packaged examples).
When questions arose I took close looks at the implementation (thus I
focused more on in-depth analyses of selected components instead of
testing overall into breadth).
7. Are you knowledgeable about the problem domain?
I consider myself knowledgeable of the problem domain.
8. The core question: Do you think the library should be accepted as a
Boost library?
Generally speaking yes but I recommend a major revision before
acceptance is reconsidered. My greatest concern revolves around the
overall design aim. I don't like the idea of competing with the Khronos
API for general wrapping; I'd prefer a more light-weight library with
the STL-ish algorithms (or other things) at the core, adding to what is
already out there. The library needs to find its own niche, minimizing
overlap and elaborating on it's novelty/strengths. The implementation
must become more robust, presently it is ways too trivial to break things.
best, Thomas