Re: [boost] [compute] Review

1 Jan 2015

As someone who has looked at the sources of both Bolt and
Boost.Compute, I can attest that Kyle has written some good code, with
one of the cleanest, most concise and most straightforward approaches
to several common issues that generally get much more complicated when
you combine C++ templates with AOT/JIT-compiled, metaprogrammed OpenCL.

Re Bolt: AMD does indeed use the static kernel extension in Bolt, and
the code suffers from a lot of copypasta. From what I can tell, much of
the copy-pasting and template usage is unnecessary (don't get me wrong,
I like type traits and lots of templates for subroutines and
specializations!). Nonetheless, that extension was used, so Bolt is not
compatible with other implementations (NVIDIA, Intel), and we have two
different libraries that are effectively STL-like and solve the same
algorithms but only work on their respective platforms.

On the flip side, NVIDIA seems not to be interested in good OpenCL
performance (or a public 1.2+ version), for whatever reasons. A library
that runs well on any OpenCL-compliant device would therefore be good
for those needing a portable, GPU-backed, STL-like set of algorithms.
But that puts the library's choice of OpenCL a bit at odds with
performance, since most performance-minded parties would probably use
CUDA out of the belief that it will perform significantly better than
NVIDIA's OpenCL implementation or OpenCL in general. I'd be very
interested in seeing a valid and direct comparison of a few common
algorithms implemented in the same fashion in OpenCL and CUDA, in
addition to the library-to-library performance charts Boost.Compute
already has [here:
https://kylelutz.github.io/compute/boost_compute/performance.html ].
This might take away some of the research community's fear of OpenCL
vs CUDA, especially given the strength of the Boost name.

A few tidbits relating to the work other than that though:
-One of the things I don't like is that the scan operation recursively
calls itself and reallocates memory a bunch of times. Bolt seems to
have taken a better approach here, which also reduces global memory
usage. I don't like additional memory allocations or anything that
could incur unnecessary/unexpected latencies if the problem changes
size. I assume this may occur in a few other spots too?
-If I were to use the library, it would probably be for sorting. I
think you should take a peek at the techniques used in Bolt to make
the radix sort and the scan faster. Potentially, prefer OpenCL 2.0's
work-group functions too, if available and requested.
-I personally haven't found a place to use libraries like
Thrust/Bolt/Boost.Compute. Their whole idea is that you have a single
huge workload rather than a batch of small-to-medium workloads, which I
find much more common in my own work. I have only rolled my own
specialized kernels so far. Who is the typical user base of these
libraries? As such, I do not foresee myself as a user at present.
-I would like some way to make the runtime dump the kernels the library has generated.
-Based on my current understanding of OpenCL on Altera (and likely on
Xilinx as well), this library's AOT/JIT technique will not work on
FPGAs. For those you would need to take the kernels used through a
suite of tools, which can take hours to process; only later are you
able to load a program and call the kernels. The chief problem is that
there is no actual JIT allowed in that domain. I don't hold this
against Boost.Compute: of course the FPGA design flow is going to be
different, and no one has a library for them anyway.
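
To illustrate the scan point above: a rough host-side sketch of a
block-wise scan that sizes all of its scratch space up front and
allocates it once, instead of allocating at each recursive step. This
is not Boost.Compute's or Bolt's code; the function names are made up
for illustration, and on a device the loops would be kernels and the
scratch vector a single device buffer.

```cpp
#include <cstddef>
#include <vector>

// Sizes of each scan level: sizes[0] is the input length, sizes[i + 1] is the
// number of block totals produced by level i. Requires block >= 2.
std::vector<std::size_t> plan_levels(std::size_t n, std::size_t block)
{
    std::vector<std::size_t> sizes{n};
    while (sizes.back() > block)
        sizes.push_back((sizes.back() + block - 1) / block);
    return sizes;
}

// In-place inclusive scan of each block; records block totals if requested.
void scan_blocks(int* data, std::size_t n, std::size_t block, int* totals)
{
    for (std::size_t start = 0; start < n; start += block) {
        const std::size_t end = start + block < n ? start + block : n;
        for (std::size_t i = start + 1; i < end; ++i)
            data[i] += data[i - 1];
        if (totals)
            totals[start / block] = data[end - 1];
    }
}

// Adds the scanned total of all preceding blocks to every block after the first.
void add_offsets(int* data, std::size_t n, std::size_t block,
                 const int* scanned_totals)
{
    for (std::size_t i = block; i < n; ++i)
        data[i] += scanned_totals[i / block - 1];
}

// Inclusive scan that performs exactly one scratch allocation.
void inclusive_scan_preallocated(std::vector<int>& data, std::size_t block)
{
    const std::vector<std::size_t> sizes = plan_levels(data.size(), block);

    std::size_t scratch_size = 0;
    for (std::size_t l = 1; l < sizes.size(); ++l)
        scratch_size += sizes[l];
    std::vector<int> scratch(scratch_size);  // the only allocation

    // Level 0 is the input itself; higher levels are slices of the scratch.
    std::vector<int*> level(sizes.size());
    level[0] = data.data();
    std::size_t offset = 0;
    for (std::size_t l = 1; l < sizes.size(); ++l) {
        level[l] = scratch.data() + offset;
        offset += sizes[l];
    }

    // Up-sweep: scan blocks at each level, feeding totals to the next level.
    for (std::size_t l = 0; l < sizes.size(); ++l)
        scan_blocks(level[l], sizes[l], block,
                    l + 1 < sizes.size() ? level[l + 1] : nullptr);

    // Down-sweep: propagate scanned totals back toward the input.
    for (std::size_t l = sizes.size() - 1; l-- > 0;)
        add_offsets(level[l], sizes[l], block, level[l + 1]);
}
```

Because the level sizes depend only on the input length and block size,
the scratch buffer could also be cached and reused between calls when
the problem size does not change.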

To try and be formal I'll put in my own review if anyone's still reading:

1. What is your evaluation of the design?

High quality.

2. What is your evaluation of the implementation?

High quality. It could use some refinement and maybe optimization in
some functions, but these changes are easy to make and will not break
the API. The library has many functions with very good performance and
is better and more skillfully maintained than the others I have
compared it to.

3. What is your evaluation of the documentation?

Good enough.

4. What is your evaluation of the potential usefulness of the library?

Fills a portability gap among STL-like libraries such as Thrust and
Bolt, and looks to have more STL algorithms implemented. I don't know
who uses these libraries, but the fact that there are several
alternatives (if you give up code portability) indicates that user
bases exist.

5.  Did you try to use the library? With what compiler? Did you have
any problems?

No

6. How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?

I evaluated these libraries myself several times over the last year
for specific components, and had to get deep into the guts of those
parts.

7. Are you knowledgeable about the problem domain?

GPGPU, yes. Not in who uses these GPU-backed STL-like libraries, though.

On Wed, Dec 31, 2014 at 11:09 AM, Kyle Lutz <kyle.r.lutz@gmail.com> wrote:
...
On Wed, Dec 31, 2014 at 9:57 AM, Ioannis Papadopoulos
<ipapadop@cse.tamu.edu> wrote:
...
On 12/30/2014 11:57 PM, Kyle Lutz wrote:
...
On Tue, Dec 30, 2014 at 8:14 PM, Yiannis Papadopoulos
<ipapadop@cse.tamu.edu> wrote:
...
Hi,
This is my review of Boost.Compute:
2. What is your evaluation of the implementation?
There is some code duplication (e.g. type traits) and various other bits and
pieces that can be moved to existing Boost components. I think there should
be some effort spent towards that.
Could you let me know which type-traits you think are duplicated or
should be moved elsewhere?
For example, the is_fundamental<T> is already implemented in
Boost.TypeTraits. Or type_traits/type_name.hpp may be able to leverage
Boost.TypeIndex?
True, there is a boost::is_fundamental<T> (and a
std::is_fundamental<T> in C++11), but these have different semantics
than boost::compute::is_fundamental<T>. For Boost.Compute, the
is_fundamental<T> trait returns true if the type T is fundamental on
the device (i.e. an OpenCL built-in type). For example, the float4_
type is an aggregate type on the host (i.e.
std::is_fundamental<float4_>::value == false) but is a built-in type
in OpenCL (i.e. boost::compute::is_fundamental<float4_>::value == true).
As for type_name<T>(), it returns a string with the OpenCL type name
for the C++ type and can actually be very different from the C++ type
name (e.g. type_name<Eigen::Vector2f>() == "float2").
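
The host/device distinction described above can be sketched with
stand-in types. Everything named here (float4_stub,
device_is_fundamental, device_type_name) is hypothetical and only
mimics the idea; it is not Boost.Compute's actual implementation.

```cpp
#include <string>
#include <type_traits>

// Stand-in for a host-side 4-float vector type (hypothetical, not float4_).
struct float4_stub { float x, y, z, w; };

// Hypothetical trait: true when T corresponds to a built-in device type.
template<class T> struct device_is_fundamental : std::false_type {};
template<> struct device_is_fundamental<float> : std::true_type {};
template<> struct device_is_fundamental<int> : std::true_type {};
template<> struct device_is_fundamental<float4_stub> : std::true_type {};

// Hypothetical mapping from a C++ type to the name used in kernel source.
template<class T> struct device_type_name;
template<> struct device_type_name<float> {
    static std::string get() { return "float"; }
};
template<> struct device_type_name<int> {
    static std::string get() { return "int"; }
};
template<> struct device_type_name<float4_stub> {
    static std::string get() { return "float4"; }
};

// The same type is an aggregate to the host compiler...
static_assert(!std::is_fundamental<float4_stub>::value,
              "aggregate on the host");
// ...but is treated as a built-in when generating device code.
static_assert(device_is_fundamental<float4_stub>::value,
              "built-in on the device");
```

The device-side name is what would be pasted into generated kernel
source, which is why it can differ from the C++ type name.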
...
...
...
8. Do you think the library should be accepted as a Boost library?
This will be a maybe. It is a well-written library with a few minor issues
that can be resolved.
However, why would someone use Boost.Compute against what is out there?
Average users can resort to Bolt or Thrust. Power users will probably always
try to hand-tune their OpenCL or CUDA algorithm. How can we test it and
prove its performance?
Yes, Thrust and Bolt are alternatives. The problem is that each is
incompatible with the other. Thrust works on NVIDIA GPUs while Bolt
only works on AMD GPUs. Choosing one will preclude your code from
working on devices from the other.
On the other hand, code written with Boost.Compute will work on any
device with an OpenCL implementation. This includes NVIDIA GPUs, AMD
GPUs/CPUs, Intel GPUs/CPUs as well as other more exotic architectures
(Xeon Phi, FPGAs, Parallella Epiphany, etc.). Furthermore, unlike
CUDA/Thrust, Boost.Compute requires no special compiler or
compiler extensions in order to execute code on GPUs; it is a pure
library-level solution which is compatible with any standard C++
compiler.
Also, Boost.Compute does allow for users to access the low-level APIs
and execute their own hand-rolled kernels (and even interleave their
custom operations with the high-level algorithms available in
Boost.Compute). I think using Boost.Compute in this way allows for
both rapid development and the ability to fully-optimize kernels for
specific operations where necessary.
Thanks for the review. Let me know if I can explain anything more clearly.
-kyle
[1] https://github.com/kylelutz/compute/tree/master/perf
I realize that, but what is the advantage of Boost.Compute over doing
something like:
template<class InputIterator, class EqualityComparable>
auto count(InputIterator first, InputIterator last,
           const EqualityComparable& value)
{
// Exactly one of THRUST, BOLT or STL is expected to be defined.
#if defined(THRUST)
  return thrust::count(first, last, value);
#elif defined(BOLT)
  return bolt::cl::count(first, last, value);
#elif defined(STL)
  return std::count(first, last, value);
#endif
}
where first and last are iterators on some vector<> that is ifdefed
similarly (or just use some template magic to invoke the right algorithm
based on the container type). I have this concern, and IMO users might
ask themselves the same question while shopping for GPU libraries.
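
A minimal sketch of the "template magic" variant mentioned above,
selecting the implementation from the container type via tag dispatch.
The container device_vector_stub and the names backend_of and
portable_count are hypothetical stand-ins; a real wrapper would forward
to thrust::count or bolt::cl::count in the device overload.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct host_tag {};
struct device_tag {};

// Stand-in for a device container (hypothetical); it just mirrors host data.
template<class T>
struct device_vector_stub { std::vector<T> mirror; };

// Maps a container type to the backend that should process it.
template<class Container> struct backend_of;
template<class T> struct backend_of<std::vector<T>> {
    using type = host_tag;
};
template<class T> struct backend_of<device_vector_stub<T>> {
    using type = device_tag;
};

// Host implementation: plain std::count.
template<class T, class Value>
std::ptrdiff_t count_impl(const std::vector<T>& c, const Value& value,
                          host_tag)
{
    return std::count(c.begin(), c.end(), value);
}

// Device implementation: a real backend call would go here instead.
template<class T, class Value>
std::ptrdiff_t count_impl(const device_vector_stub<T>& c, const Value& value,
                          device_tag)
{
    return std::count(c.mirror.begin(), c.mirror.end(), value);
}

// Single entry point; the overload is chosen at compile time.
template<class Container, class Value>
std::ptrdiff_t portable_count(const Container& c, const Value& value)
{
    return count_impl(c, value, typename backend_of<Container>::type{});
}
```

Even in this form, each backend needs its own overload per algorithm,
and the set of supported backends is fixed when the code is compiled.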
Well if we took this approach, the library would have to be compiled
separately for each different compute device rather than being
portable to any system with an OpenCL implementation. And while this
is a trivial example for count(), implementing this for more
complicated algorithms which take user-defined operators or work with
higher-level iterators (e.g. transform_iterator or zip_iterator) would
be much more difficult. I think an approach like this would ultimately
be more complex, harder to maintain, and less flexible in the
interfaces/functionality we could offer.
...
Just to be clear, I am not dissing your work: I really like it and your
positive attitude for addressing issues.
Not at all, I appreciate your feedback. Thanks!
-kyle

Jason Newton