On Sun, Dec 28, 2014 at 7:26 PM, John Bytheway wrote:
Here's my review.
1. What is your evaluation of the design?
The design of building on OpenCL seems sound, for the reasons Kyle has explained. The fact that it has real users is very encouraging.
I don't really like the name (I was unable to guess what the library was about from the name), but it's not worth changing at this point.
I am worried about the compile-time configuration with macros. I would hope that BOOST_COMPUTE_HAVE_THREAD_LOCAL could be read from Boost.Config. For the two features which require linking against other Boost libraries, my instinct would be to have them on by default, rather than off by default (that should be easier if and when Boost.Compute is part of the Boost distribution).
I agree and I would very much prefer to have these thread-safe features on by default and will work towards improving this. Also, I looked for a configuration macro to enable support for C++11 "thread_local" storage in Boost.Config, but could not find one. If/when one is added I'll update Boost.Compute to use it. And like you said, many of these changes become easier if Boost.Compute becomes a part of the standard Boost distribution.
It would be very nice to have range versions of all the algorithms. I find myself using the boost::range algorithms much more than the std:: ones these days.
I fully agree and also I'm very much looking forward to having range algorithms standardized in C++17.
Similarly, it would be good to provide the safer 4-iterator versions of equal, is_permutation, mismatch (and any I've forgotten) as were added to C++14.
Fully agree. This shouldn't be too much work.
Some minor things:
- Consider renaming boost::compute::lambda to be consistent with one of the other placeholder-containing namespaces from std or Boost. (You can keep the old name as an alias for backwards-compatibility).
Yeah, I'll work on unifying the placeholders.
- Why the default of .25 as the bernoulli_distribution parameter?
Looks like an oversight, I'll get it updated to use 0.5 just as std::bernoulli_distribution does.
- Did you consider using Boost.Predef for your version macro?
Currently Boost.Compute remains compatible with Boost version 1.48, which didn't include Boost.Predef (added in 1.55, I believe). If Boost.Compute is accepted into the standard Boost distribution, updating it to use Boost.Predef (as well as many other Boost features added since 1.48) should be no problem.
2. What is your evaluation of the implementation?
Only looked at a couple of things while trying to understand the docs.
3. What is your evaluation of the documentation?
Overall, not too bad.
More tutorial would be good. In particular, something using some of the standard algorithms.
I'll definitely work on this more. Any ideas of specific tutorials/examples you'd like to see would be very helpful.
Some concerns:
- All the header links off http://kylelutz.github.io/compute/boost_compute/reference.html are broken.
- All or most of the things in "See also" sections look like they ought to be links but aren't.
- For all your function parameter documentation, the list of parameters seems to be sorted alphabetically by parameter name rather than in the order they are passed to the function. This is really weird...
These are all, to the best of my knowledge, bugs in the Boost documentation toolchain. I reported these issues a while back on the boost-doc mailing list here [1] but never found a resolution. I would really like to get these to work, but am not knowledgeable enough about the Boost documentation pipeline (specifically the doxygen->quickbook infrastructure) to fix it. If anyone has any ideas or could help with this I would be very appreciative.
- As others have mentioned, some kind of complexity estimate or promise for the algorithms is very important. In particular it would be good to understand how much parallelism they take advantage of and how they ought to scale with device compute capability.
Will do.
- The warning about accumulate being slow for floats should be much more prominent.
Will do.
- http://kylelutz.github.io/compute/boost/compute/copy.html mentions copying from e.g. a std::list to a compute::vector. It would be nice if it gave some kind of indication as to how efficient we can expect such an operation to be. In particular, I guess it ought to do some kind of buffering on the host?
Currently it will just buffer the entire range into a std::vector<T> and then copy that to the device. If this is a bottleneck it could also be improved to copy the data in smaller batches. Let me know if you encounter issues with this.
- The way I normally copy into vectors is call v.reserve(...) and then copy to std::back_inserter(v). I tried that with a compute::vector and unsurprisingly it was very inefficient. I see you already warned about this in the push_back docs, but perhaps it should also be highlighted in the documentation of compute::copy.
Yeah, this idiom is not well suited to copying into device memory. You should prefer to do "bulky" operations using a single call to boost::compute::copy() (which internally will call the OpenCL function for copying large chunks of memory).
- Is there no version of inclusive/exclusive_scan or iota using an arbitrary associative operator?
These are not currently supported. I will work on adding an implementation for these and documenting them.
- http://kylelutz.github.io/compute/boost/compute/includes.html it's unclear whether the subrange has to be contiguous.
It should have the same semantics as std::includes, which, IIRC, does not require the subrange to be contiguous.
- min_element and max_element should cross-reference minmax_element.
Will do.
- What source of randomness does random_shuffle use?
Currently it simply calls std::random_shuffle to generate the shuffled index sequence then uses boost::compute::scatter() to actually re-arrange the values on the device. However I am planning on deprecating it and providing an implementation of the C++11 shuffle() algorithm instead which will use the random number generator provided by the user.
- Why does reduce return its result through an OutputIterator?
In order to allow the algorithm to store the result in device memory and avoid a host-device synchronization point (which would be necessary if it simply returned the reduced result).
- http://kylelutz.github.io/compute/boost/compute/remove.html Does this really remove things or use std::remove semantics? You should clarify and document the return value. (Same with remove_if, unique)
Works just the same as std::remove. I'll add more documentation for these.
- I can see how sort_by_key might be useful, and I'm slightly surprised that there's no stable_sort_by_key (which seems like it would be more useful than stable_sort). Any particular reason?
Just not enough time to implement it yet. There is a parallel merge-sort in the works which would be very suitable for implementing a stable_sort_by_key(). Hopefully this should be ready soon.
- The boost/compute/types headers docs seem to be broken or missing. In particular I couldn't find reference documentation for float4_ et al.
Yeah, I wasn't sure the best way to document all of these (IIRC, 10 different built-in scalar types times four different vector versions of each). I'll add some information with an overview of the provided fundamental types.
- I was interested to see allocators, but the docs seem uninformative: http://kylelutz.github.io/compute/boost/compute/pinned_allocator.html
I'll definitely update this.
- Given the confusion with std::future, you might want to clarify whether compute::future's destructor blocks.
It does not, I'll update the documentation for this.
- Also, http://kylelutz.github.io/compute/BOOST_COMPUTE_CLOSURE.html doesn't explain what format the lists (arguments and capture) should be in. Preprocessor functions often take lists in other formats, such as a Boost.PP Sequence http://www.boost.org/doc/libs/1_57_0/libs/preprocessor/doc/data/sequences.ht..., and your example only has one-element lists, so it's not enough to clarify this point. (To be clear: I'm not suggesting you should use Boost.PP sequences, merely that you should make it clear in the docs whether you are).
Will do.
- http://kylelutz.github.io/compute/BOOST_COMPUTE_CL_CALLBACK.html is empty, as is http://kylelutz.github.io/compute/BOOST_COMPUTE_MAX_ARITY.html
Will fix.
- http://kylelutz.github.io/compute/boost/compute/valarray.html - presumably compute::valarray has overloaded arithmetic operators? Are they mentioned anywhere?
Actually they aren't currently implemented. I'll open an issue for this and also document them once they are added.
- http://kylelutz.github.io/compute/boost/compute/vector.html the unusual behaviour of vector's size_t constructor (not initializing the elements) should be warned more prominently.
Will do.
- http://kylelutz.github.io/compute/boost/compute/device.html the documentation of the enum type in the synopsis seems to be messed up. Similarly memory_object.
Thanks, I'll look into this more.
- http://kylelutz.github.io/compute/boost/compute/function.html I'm puzzled by this interface. What's the name for? Would one ever use this constructor rather than e.g. make_function_from_source?
There are a few legitimate use-cases. One is for specifying the names of pre-existing or built-in functions (e.g. function<float(float)>("sqrt") would call the built-in sqrt() function).
- http://kylelutz.github.io/compute/boost/compute/field.html My reading of this is that when you get the type wrong you are guaranteed some kind of meaningful error report, and not just undefined behaviour; is that right? Perhaps you should clarify.
- I can't guess what atomic_min and atomic_max do.
They just forward to the corresponding OpenCL functions [2]. I'll update the documentation to be more clear about these.
- The right arrow ('next') link from http://kylelutz.github.io/compute/boost/compute/atomic_xor.html is broken (looks like the docs for the placeholders don't work).
Hmm, I'll look into this more.
- The iterators should state what iterator category they have.
Currently Boost.Compute doesn't really have different iterator categories; all iterators are random-access. I'll add this information to the documentation.
- I can't guess what svm_ptr is for. Later I discovered svm_alloc and svm_free. Is there an RAII way to manage these things?
These are for managing shared virtual memory objects, which were added in OpenCL 2.0. Currently there are no RAII wrappers around them; they behave like malloc()/free(). I have plans to add an svm_allocator which will provide an allocator for SVM objects that can be used with the Boost.Compute containers (and possibly standard STL containers as well).
- What does default_device do if there are no devices? Do the various environment variables mentioned have any kind of precedence? (Probably linking to the corresponding OpenCL docs is sufficient for such questions). Do they have to be exact matches? Are they case-sensitive?
There is a precedence, but it currently is not explained in the documentation. For most cases only one of the default device variables is meant to be used. They currently match substrings and are case-sensitive. I'll work on improving the documentation for these and give some example usages.
- What does find_device do if the requested device does not exist?
Currently it will return a null (default-constructed) device object. I have considered updating this (and the default_device() function) to throw an exception when no valid device is found so that returned device objects are always valid.
- Can you give examples for type_definition? My guess was that it did the thing which type_name later turned out to do...
Yeah, type_definition<T> will return a string with the full definition for the type while type_name<T> returns just the name. So for something like:

struct Foo { int id; float mass; };
BOOST_COMPUTE_ADAPT_STRUCT(Foo, Foo, (id, mass));

the following strings will be returned:

type_name<Foo>() == "Foo"
type_definition<Foo>() == "typedef struct { int id; float mass; } Foo;"

I'll update the documentation for type_definition<T> with a better example.
- BOOST_COMPUTE_ADAPT_STRUCT should state that it must be used at global scope (i.e. not within a function or other namespace).
Will do.
- Congratulations for getting compile-time detection of internal padding into BOOST_COMPUTE_ADAPT_STRUCT :). You might want to mention in the docs that this will be detected, so people don't worry too much.
Will do. I also hope in the future to transparently support padded structs as well and to be able to correctly copy them to/from the device without issue.
- It looks like the performance tests don't include the time for copying data to (and, where applicable, from) the device. You should mention this on the Performance page.
I'll add this, though it is very device/system specific. Also, there is a performance benchmark available in Boost.Compute ("perf_copy_to_device") which could be used to measure this on your own system.
- You should probably add C++AMP to your things compared with in the FAQ.
Will do.
4. What is your evaluation of the potential usefulness of the library?
High. I've wanted something like this in the past and not found this.
5. Did you try to use the library? With what compiler? Did you have any problems?
I used it with g++ 4.8.2 in C++1y mode. Some quick tests seemed to work OK, and indeed I was able to sort some large vectors of scalars faster using boost::compute::sort than std::sort, which is nice.
I was pleased to see that 64-bit integers were handled fine.
6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Couple of hours reading and trying simple examples.
7. Are you knowledgeable about the problem domain?
I have done a couple of small projects with CUDA, and am generally knowledgeable, but with no previous OpenCL experience.
8. Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
Yes. OpenCL seems to have fairly broad acceptance, but is rather difficult to use safely and with C++-isms. This library fills a useful niche. In an ideal world it will eventually be rendered obsolete by things like parallelism entering the standard library, but I doubt that will happen soon if at all. This provides a good solution now, and we shouldn't dismiss it simply because something sparkly is approaching, which may or may not turn out to be awesome and widely adopted.
This library has clearly been a lot of work. Congratulations on making it this far, I hope the review is not too stressful :).
Thanks for the review! Your feedback has been very helpful!

-kyle

[1] http://lists.boost.org/boost-docs/2013/07/5077.php
[2] https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/atomic_min.html
On 2014-12-29 02:01, Kyle Lutz wrote:
On Sun, Dec 28, 2014 at 7:26 PM, John Bytheway wrote:
It would be very nice to have range versions of all the algorithms. I find myself using the boost::range algorithms much more than the std:: ones these days.
I fully agree and also I'm very much looking forward to having range algorithms standardized in C++17.
Do you mean that you are planning to add them to Boost.Compute?
3. What is your evaluation of the documentation?
Overall, not too bad.
More tutorial would be good. In particular, something using some of the standard algorithms.
I'll definitely work on this more. Any ideas of specific tutorials/examples you'd like to see would be very helpful.
Something using an algorithm which moves elements around might be nice. Maybe sort or partition...
Some concerns:
- All the header links off http://kylelutz.github.io/compute/boost_compute/reference.html are broken.
- All or most of the things in "See also" sections look like they ought to be links but aren't.
- For all your function parameter documentation, the list of parameters seems to be sorted alphabetically by parameter name rather than in the order they are passed to the function. This is really weird...
These are all, to the best of my knowledge, bugs in the Boost documentation toolchain. I reported these issues a while back on the boost-doc mailing list here [1] but never found a resolution. I would really like to get these to work, but am not knowledgeable enough about the Boost documentation pipeline (specifically the doxygen->quickbook infrastructure) to fix it. If anyone has any ideas or could help with this I would be very appreciative.
Shame :(. Hope you get some responses here.
- http://kylelutz.github.io/compute/boost/compute/copy.html mentions copying from e.g. a std::list to a compute::vector. It would be nice if it gave some kind of indication as to how efficient we can expect such an operation to be. In particular, I guess it ought to do some kind of buffering on the host?
Currently it will just buffer the entire range into a std::vector<T> and then copy that to the device. If this is a bottleneck it could also be improved to copy the data in smaller batches. Let me know if you encounter issues with this.
Does this also work with copy_async? I presume not, because there would be troubles with the lifetime of the temporary vector...
- The way I normally copy into vectors is call v.reserve(...) and then copy to std::back_inserter(v). I tried that with a compute::vector and unsurprisingly it was very inefficient. I see you already warned about this in the push_back docs, but perhaps it should also be highlighted in the documentation of compute::copy.
Yeah, this idiom is not well suited to copying into device memory. You should prefer to do "bulky" operations using a single call to boost::compute::copy() (which internally will call the OpenCL function for copying large chunks of memory).
- http://kylelutz.github.io/compute/boost/compute/includes.html it's unclear whether the subrange has to be contiguous.
It should have the same semantics as std::includes, which, IIRC, does not require the subrange to be contiguous.
Then perhaps rephrasing would make it clearer. The wording on http://en.cppreference.com/w/cpp/algorithm/includes seems clear.
- What source of randomness does random_shuffle use?
Currently it simply calls std::random_shuffle to generate the shuffled index sequence then uses boost::compute::scatter() to actually re-arrange the values on the device. However I am planning on deprecating it and providing an implementation of the C++11 shuffle() algorithm instead which will use the random number generator provided by the user.
Sounds good.
- Why does reduce return its result through an OutputIterator?
In order to allow the algorithm to store the result in device memory and avoid a host-device synchronization point (which would be necessary if it simply returned the reduced result).
That makes sense, but seems inconsistent with accumulate, and to a lesser extent the algorithms that return bool (binary_search, equal, etc.) Are they all introducing a synchronization point? What about the ones returning iterators?
- The boost/compute/types headers docs seem to be broken or missing. In particular I couldn't find reference documentation for float4_ et al.
Yeah, I wasn't sure the best way to document all of these (IIRC, 10 different built-in scalar types times four different vector versions of each). I'll add some information with an overview of the provided fundamental types.
At least something documenting the supported means of initialization and element access would be good :).
- http://kylelutz.github.io/compute/boost/compute/function.html I'm puzzled by this interface. What's the name for? Would one ever use this constructor rather than e.g. make_function_from_source?
There are a few legitimate use-cases. One is for specifying the names of pre-existing or built-in functions (e.g. function<float(float)>("sqrt") would call the built-in sqrt() function).
So constructing a function by name points it at some existing named function. That makes sense, but might be unexpected, so should probably be mentioned in the constructor documentation.
- http://kylelutz.github.io/compute/boost/compute/field.html My reading of this is that when you get the type wrong you are guaranteed some kind of meaningful error report, and not just undefined behaviour; is that right? Perhaps you should clarify.
Any comment here?
- The iterators should state what iterator category they have.
Currently Boost.Compute doesn't really have different iterator categories; all iterators are random-access. I'll add this information to the documentation.
Now that you mention this I realize I was confused by what function_input_iterator does. I had assumed it was an InputIterator which supported stateful functions, like http://www.boost.org/doc/libs/1_57_0/libs/iterator/doc/function_input_iterat.... But I guess in fact it has no guarantees about order of evaluation. You might want to clarify this for folks accustomed to Boost.Iterator.
- What does default_device do if there are no devices? Do the various environment variables mentioned have any kind of precedence? (Probably linking to the corresponding OpenCL docs is sufficient for such questions). Do they have to be exact matches? Are they case-sensitive?
There is a precedence, but it currently is not explained in the documentation. For most cases only one of the default device variables is meant to be used. They currently match substrings and are case-sensitive. I'll work on improving the documentation for these and give some example usages.
- What does find_device do if the requested device does not exist?
Currently it will return a null (default-constructed) device object. I have considered updating this (and the default_device() function) to throw an exception when no valid device is found so that returned device objects are always valid.
I think that would be better. In particular, I experienced what happens if the default device does not initialize properly (in my case because of a missing kernel module) and that was an exception. Missing device should probably be handled similarly.
- Congratulations for getting compile-time detection of internal padding into BOOST_COMPUTE_ADAPT_STRUCT :). You might want to mention in the docs that this will be detected, so people don't worry too much.
Will do. I also hope in the future to transparently support padded structs as well and to be able to correctly copy them to/from the device without issue.
That would be nice.
- It looks like the performance tests don't include the time for copying data to (and, where applicable, from) the device. You should mention this on the Performance page.
I'll add this, though it is very device/system specific. Also, there is a performance benchmark available in Boost.Compute ("perf_copy_to_device") which could be used to measure this on your own system.
I think I was not clear. I mean that, for example, the sort benchmark doesn't include the time taken to copy data before and after the sort. The text on the performance page should explain this, so people can interpret the times properly.

John
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of John Bytheway Sent: Tuesday, December 30, 2014 21:23 To: boost@lists.boost.org Subject: Re: [boost] [compute] Review
On 2014-12-29 02:01, Kyle Lutz wrote:
Currently it will just buffer the entire range into a std::vector<T> and then copy that to the device. If this is a bottleneck it could also be improved to copy the data in smaller batches. Let me know if you encounter issues with this.
Does this also work with copy_async? I presume not, because there would be troubles with the lifetime of the temporary vector...
Async host <-> device copies only exist at the low level (command_queue). They sidestep the issue by using raw pointers and size. For a summary of synchronization-related exception safety issues, see my last message on Synchronization. Matt
On 2014-12-30 21:46, Gruenke,Matt wrote:
Async host <-> device copies only exist at the low level (command_queue). They sidestep the issue by using raw pointers and size.
I dispute that. There's a copy_async to which you need not supply a command_queue. It seems high-level to me. http://kylelutz.github.io/compute/boost/compute/copy_async.html

Indeed, the following compiles:

typedef uint64_t T;
std::list<T> list = {0, 1, 2};
compute::vector<T> device_vector(list.size());
compute::copy_async(list.begin(), list.end(), device_vector.begin());

but then fails at runtime because copy_async doesn't support non-contiguous iterators. This is bad. The condition inside copy_async is compile-time, so this assert could be a static_assert. I made an issue. https://github.com/kylelutz/compute/issues/401

John
On Tue, Dec 30, 2014 at 8:13 PM, John Bytheway
Thanks for reporting this, I'll get it fixed. -kyle
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of John Bytheway Sent: Tuesday, December 30, 2014 23:14 To: boost@lists.boost.org Subject: Re: [boost] [compute] Review
On 2014-12-30 21:46, Gruenke,Matt wrote:
Async host <-> device copies only exist at the low level (command_queue). They sidestep the issue by using raw pointers and size.
I dispute that. There's a copy_async to which you need not supply a command_queue. It seems high-level to me.
Good catch. I looked for it, but somehow missed it. I fully agree that the constraint should be specified & enforced at compile-time. Matt
On Tue, Dec 30, 2014 at 9:35 PM, Gruenke,Matt
I've added a static_assert() for this, see pull-request #404 [1]. -kyle [1] https://github.com/kylelutz/compute/pull/404
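[Editorial aside: a minimal, hypothetical sketch of what such a compile-time check can look like. The trait and function names here are illustrative only, not Boost.Compute's actual implementation from PR #404.]

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// Illustrative trait: raw pointers count as contiguous; for this sketch, one
// known contiguous iterator type is registered via full specialization.
template <class Iterator>
struct is_contiguous_iterator : std::is_pointer<Iterator> {};

template <>
struct is_contiguous_iterator<std::vector<std::uint64_t>::iterator>
    : std::true_type {};

// Stand-in for copy_async: rejects non-contiguous sources when the template
// is instantiated, instead of asserting at runtime.
template <class InputIterator, class OutputIterator>
OutputIterator copy_async_sketch(InputIterator first, InputIterator last,
                                 OutputIterator result) {
    static_assert(is_contiguous_iterator<InputIterator>::value,
                  "copy_async requires a contiguous source range");
    // A real implementation would enqueue an asynchronous transfer here;
    // this sketch just copies synchronously.
    for (; first != last; ++first, ++result) {
        *result = *first;
    }
    return result;
}
```

With a check like this, the std::list example above would fail to compile with a readable diagnostic rather than aborting at runtime.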
On Tue, Dec 30, 2014 at 6:23 PM, John Bytheway
On 2014-12-29 02:01, Kyle Lutz wrote:
On Sun, Dec 28, 2014 at 7:26 PM, John Bytheway wrote:
It would be very nice to have range versions of all the algorithms. I find myself using the boost::range algorithms much more than the std:: ones these days.
I fully agree and also I'm very much looking forward to having range algorithms standardized in C++17.
Do you mean that you are planning to add them to Boost.Compute?
I would be interested in having range algorithms in Boost.Compute but this would be a large amount of work and is not currently very high in my list of priorities. If anyone is interested in working on these please let me know.
3. What is your evaluation of the documentation?
Overall, not too bad.
More tutorial would be good. In particular, something using some of the standard algorithms.
I'll definitely work on this more. Any ideas of specific tutorials/examples you'd like to see would be very helpful.
Something using an algorithm which moves elements around might be nice. Maybe sort or partition...
Will do. Also, if you're not aware, there are quite a few example applications in the "example" directory [1]. I'll also work on making these more prominent in the documentation.
Some concerns:
- All the header links off http://kylelutz.github.io/compute/boost_compute/reference.html are broken.
- All or most of the things in "See also" sections look like they ought to be links but aren't.
- For all your function parameter documentation, the list of parameters seems to be sorted alphabetically by parameter name rather than in the order they are passed to the function. This is really weird...
These are all, to the best of my knowledge, bugs in the Boost documentation toolchain. I reported these issues a while back on the boost-doc mailing list here [1] but never found a resolution. I would really like to get these working, but am not knowledgeable enough about the Boost documentation pipeline (specifically the doxygen->quickbook infrastructure) to fix it. If anyone has any ideas or could help with this I would be very appreciative.
Shame :(. Hope you get some responses here.
Yeah, me too.
- http://kylelutz.github.io/compute/boost/compute/copy.html mentions copying from e.g. a std::list to a compute::vector. It would be nice if it gave some kind of indication as to how efficient we can expect such an operation to be. In particular, I guess it ought to do some kind of buffering on the host?
Currently it will just buffer the entire range into a std::vector<T> and then copy that to the device. If this is a bottleneck it could also be improved to copy the data in smaller batches. Let me know if you encounter issues with this.
Does this also work with copy_async? I presume not, because there would be troubles with the lifetime of the temporary vector...
Yeah, as mentioned in one of the replies, copy_async() currently only works with contiguous ranges. I have ideas on how to make this work more generally but I haven't had time to implement it.
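[Editorial aside: a host-only sketch of the buffering Kyle describes. stage_for_device_copy is a hypothetical name, not a Boost.Compute API; the idea is to materialize the non-contiguous range into one contiguous std::vector whose data()/size() could then feed a single bulk transfer.]

```cpp
#include <iterator>
#include <list>
#include <vector>

// Materialize any input range into a contiguous host buffer. The resulting
// vector's data() pointer and size() are what a bulk device copy consumes.
template <class InputIterator>
std::vector<typename std::iterator_traits<InputIterator>::value_type>
stage_for_device_copy(InputIterator first, InputIterator last) {
    using T = typename std::iterator_traits<InputIterator>::value_type;
    return std::vector<T>(first, last);  // one pass over the source range
}
```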
- The way I normally copy into vectors is call v.reserve(...) and then copy to std::back_inserter(v). I tried that with a compute::vector and unsurprisingly it was very inefficient. I see you already warned about this in the push_back docs, but perhaps it should also be highlighted in the documentation of compute::copy.
Yeah, this idiom is not well suited to copying into device memory. You should prefer to do "bulky" operations using a single call to boost::compute::copy() (which internally will call the OpenCL function for copying large chunks of memory).
- http://kylelutz.github.io/compute/boost/compute/includes.html it's unclear whether the subrange has to be contiguous.
It should have the same semantics as std::includes, which, IIRC, does not require the subrange to be contiguous.
Then perhaps rephrasing would make it clearer. The wording on http://en.cppreference.com/w/cpp/algorithm/includes seems clear.
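[Editorial aside: for illustration, the std::includes semantics in question — matched elements of the second sorted range may occur scattered, non-adjacently, through the first.]

```cpp
#include <algorithm>
#include <vector>

// Both ranges must be sorted; the second range's elements need only occur
// somewhere in the first, not as a contiguous block.
bool includes_demo() {
    std::vector<int> haystack{1, 2, 3, 4, 5, 6};
    std::vector<int> scattered{2, 4, 6};  // non-adjacent in haystack
    std::vector<int> missing{2, 4, 7};    // 7 never occurs

    bool a = std::includes(haystack.begin(), haystack.end(),
                           scattered.begin(), scattered.end());  // true
    bool b = std::includes(haystack.begin(), haystack.end(),
                           missing.begin(), missing.end());      // false
    return a && !b;
}
```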
- What source of randomness does random_shuffle use?
Currently it simply calls std::random_shuffle to generate the shuffled index sequence then uses boost::compute::scatter() to actually re-arrange the values on the device. However I am planning on deprecating it and providing an implementation of the C++11 shuffle() algorithm instead which will use the random number generator provided by the user.
Sounds good.
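[Editorial aside: a host-only sketch of the scheme Kyle describes. The final loop stands in for boost::compute::scatter, which would run on the device.]

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// random_shuffle's approach as described: shuffle an index sequence on the
// host, then scatter values so element i lands at position idx[i].
std::vector<int> shuffle_via_scatter(const std::vector<int>& values) {
    std::vector<std::size_t> idx(values.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::shuffle(idx.begin(), idx.end(), std::mt19937{1234});

    std::vector<int> out(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        out[idx[i]] = values[i];  // the scatter step
    }
    return out;
}
```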
- Why does reduce return its result through an OutputIterator?
In order to allow the algorithm to store the result in device memory and avoid a host-device synchronization point (which would be necessary if it simply returned the reduced result).
That makes sense, but seems inconsistent with accumulate, and to a lesser extent the algorithms that return bool (binary_search, equal, etc.) Are they all introducing a synchronization point? What about the ones returning iterators?
Yeah, algorithms like accumulate() which return the resulting value on the host are synchronization points. Others like transform() which return an iterator do not because the returned iterator is not actually the "result" of the operation and can be calculated and returned before the operation is actually complete.
- The boost/compute/types headers docs seem to be broken or missing. In particular I couldn't find reference documentation for float4_ et al.
Yeah, I wasn't sure of the best way to document all of these (IIRC, 10 different built-in scalar types times four different vector versions of each). I'll add some information with an overview of the provided fundamental types.
At least something documenting the supported means of initialization and element access would be good :).
Will do.
- http://kylelutz.github.io/compute/boost/compute/function.html I'm puzzled by this interface. What's the name for? Would one ever use this constructor rather than e.g. make_function_from_source?
There are a few legitimate use-cases. One is for specifying the names of pre-existing or built-in functions (e.g. a function constructed with the name "sqrt" would call the built-in sqrt() function).

So constructing a function by name points it at some existing named function. That makes sense, but might be unexpected, so should probably be mentioned in the constructor documentation.
Will do.
- http://kylelutz.github.io/compute/boost/compute/field.html My reading of this is that when you get the type wrong you are guaranteed some kind of meaningful error report, and not just undefined behaviour; is that right? Perhaps you should clarify.
Any comment here?
Yeah, I'll work on improving the documentation for this. I've added an issue for it (#384).
- The iterators should state what iterator category they have.
Currently Boost.Compute doesn't really have different iterator categories; all iterators are random-access. I'll add this information to the documentation.
Now that you mention this I realize I was confused by what function_input_iterator does. I had assumed it was an InputIterator which supported stateful functions, like http://www.boost.org/doc/libs/1_57_0/libs/iterator/doc/function_input_iterat.... But I guess in fact it has no guarantees about order of evaluation. You might want to clarify this for folks accustomed to Boost.Iterator.
Will do.
- What does default_device do if there are no devices? Do the various environment variables mentioned have any kind of precedence? (Probably linking to the corresponding OpenCL docs is sufficient for such questions). Do they have to be exact matches? Are they case-sensitive?
There is a precedence, but it is currently not explained in the documentation. For most cases only one of the default device variables is meant to be used. They currently match substrings and are case-sensitive. I'll work on improving the documentation for these and give some example usages.
- What does find_device do if the requested device does not exist?
Currently it will return a null (default-constructed) device object. I have considered updating this (and the default_device() function) to throw an exception when no valid device is found so that returned device objects are always valid.
I think that would be better. In particular, I experienced what happens if the default device does not initialize properly (in my case because of a missing kernel module) and that was an exception. Missing device should probably be handled similarly.
Yeah, I'm becoming more convinced of this. I'll probably make that change soon. I've added an issue for this (#403).
- Congratulations for getting compile-time detection of internal padding into BOOST_COMPUTE_ADAPT_STRUCT :). You might want to mention in the docs that this will be detected, so people don't worry too much.
Will do. I also hope in the future to transparently support padded structs as well and to be able to correctly copy them to/from the device without issue.
That would be nice.
- It looks like the performance tests don't include the time for copying data to (and, where applicable, from) the device. You should mention this on the Performance page.
I'll add this, though it is very device/system specific. Also, there is a performance benchmark available in Boost.Compute ("perf_copy_to_device") which could be used to measure this on your own system.
I think I was not clear. I mean that, for example, the sort benchmark doesn't include the time taken to copy data before and after the sort. The text on the performance page should explain this, so people can interpret the times properly.
Ahh I see now. I'll add this information. -kyle [1] https://github.com/kylelutz/compute/tree/master/example
On 2014-12-30 23:41, Kyle Lutz wrote:
wrote: - Why does reduce return its result through an OutputIterator?
In order to allow the algorithm to store the result in device memory and avoid a host-device synchronization point (which would be necessary if it simply returned the reduced result).
That makes sense, but seems inconsistent with accumulate, and to a lesser extent the algorithms that return bool (binary_search, equal, etc.) Are they all introducing a synchronization point? What about the ones returning iterators?
Yeah, algorithms like accumulate() which return the resulting value on the host are synchronization points. Others like transform() which return an iterator do not because the returned iterator is not actually the "result" of the operation and can be calculated and returned before the operation is actually complete.
That's true of transform, but not for most algorithms returning an iterator, surely. How would you do this with find, or set_intersection? John
More on the subject of performance: I ran some of the performance tests on my machine, and I observed something surprising.

set_intersection and set_difference are *unbelievably* fast (two orders of magnitude faster than the std:: versions), and yet set_union and set_symmetric_difference are quite slow (maybe twice as slow as std::). Looking at the code, the four algorithms are very similar. Have you any idea what causes this huge discrepancy in their speed?

John
On Tue, Dec 30, 2014 at 8:33 PM, John Bytheway
Very interesting, I'll look into this more. Also, I wasn't the author of those particular algorithms. Could you open an issue on the issue tracker [1] for this? Please also include the performance numbers you got and the device/platform you used. -kyle [1] https://github.com/kylelutz/compute/issues
On 2014-12-30 23:42, Kyle Lutz wrote:
Very interesting, I'll look into this more. Also, I wasn't the author of those particular algorithms. Could you open an issue on the issue tracker [1] for this? Please also include the performance numbers you got and the device/platform you used.
Done. John
participants (3)
- Gruenke,Matt
- John Bytheway
- Kyle Lutz