1. What is your evaluation of the design?
I agree with other comments made about synchronization. The design should be more explicit about what's asynchronous, and ideally utilize only one mechanism for synchronization. One approach that would have made this clearer is if high-level operations were actually objects that got posted to the command_queue. Regarding synchronization, I'm also a bit concerned about the performance impact of synchronizing on all copies to host memory. Overuse of synchronization can easily result in performance deterioration. On this point, I think it might be worth limiting the host memory usable with algorithms to containers that perform implicit synchronization upon actual use (or destruction) of results. Give users the choice between that and performing explicit copies to raw types. Also, I agree with Thomas M that it'd be useful for operations to return events. That said, if you told me it doubled the overhead involved in dispatching operations, I'd suggest making it optional through overloads or via a mechanism in the command_queue itself.
2. What is your evaluation of the implementation?
It seems a bit heavy to be header-only. I'd agree with Andrey Semashev, regarding thread safety, except I think it shouldn't be an option at all. That said, not all operations on all objects need be thread safe. I think the Boost.ASIO documentation provides a good example of how to express different thread safety limitations. Now, I haven't examined the use of thread-local storage, but do you feel there's a strong case to be made for the thread safety it's providing (especially, given that you seem to feel it's non-essential)? I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables. I didn't notice any unit tests specifically intended to verify exception-safety. I'd want to know that had been thoroughly vetted, before I'd consider using Boost.Compute in production code. Finally, based on the docs, it seems there's a bug in boost::compute::command_queue::enqueue_copy_image_to_buffer() and boost::compute::command_queue::enqueue_copy_buffer_to_image(). The region and origin parameters should be arrays, in the 2D and 3D cases. If this discrepancy is intentional, it should be noted. I haven't exhaustively reviewed the reference docs, so there might be other issues of this sort.
3. What is your evaluation of the documentation?
It needs more and better-documented tutorials (and I'm including the "Advanced Topics" here). Some of the reference pages could use more details as well, especially concerning synchronization and exception usage. Regarding exceptions, there should be discussion of specifically what is invalidated by which kinds of errors. If any errors are non-recoverable, this should also be called out and perhaps derived from a different or additional base class (not a bad use case for multiple inheritance, actually). I agree with Mathias Gaunard that it would be helpful to discuss computational complexity, device memory requirements, and the various methods used to implement some of the algorithms. You could continue to omit these details on the simple algorithms, though.
4. What is your evaluation of the potential usefulness of the library?
This is my biggest misgiving, by far. In the very near future, I expect developers will opt for either SYCL (https://www.khronos.org/opencl/sycl) or Bolt (https://github.com/HSA-Libraries/Bolt). SYCL provides a modern, standard C++11 wrapper around OpenCL, with better concurrency control and support for integrating kernels inline. Bolt provides many of the same higher-level abstractions found in Boost.Compute, but with forthcoming support for HSA. To have the kind of lasting relevance and broad applicability to which all Boost libraries should aspire, I think Boost.Compute should be architected to support multiple backends. Though OpenCL support is currently ascendant, it's far from universal and is already flagging on some platforms (Nvidia, not the least). And HSA provides a foundation on which alternatives are actively being built. Most importantly, there exist multitudes of multi-core and multiprocessor systems which lack OpenCL support. It would be eminently useful to support these with such backends as thread pool, OpenMP, etc. And backends could be added to support new technologies, as they mature.
5. Did you try to use the library? With what compiler? Did you have any problems?
I downloaded and inspected the sources and some examples. I didn't have any questions or concerns that would be addressed by experimentation.
6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Review of the documentation, some sources, some examples, and a refresher on SYCL & Bolt. I also read all of the reviews and replies sent thus far.
7. Are you knowledgeable about the problem domain?
I've been developing production code for parallel and heterogeneous computing platforms for the majority of the last 16 years. I've followed OpenCL since version 1.0 was finalized. I've also submitted bug reports and minor patches for Khronos' official C++ OpenCL interface.
8. Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
Not in its current state. The primary issue is that it's too OpenCL-centric. It should support more backends, even if these are just compile-time options. This is crucial for both current and lasting relevance. Beyond that, better concurrency control is crucial for optimum performance. If done well, it could even help further reduce usage errors.

Matt
Gruenke,Matt
This is my biggest misgiving, by far. In the very near future, I expect developers will opt for either SYCL (https://www.khronos.org/opencl/sycl) or Bolt (https://github.com/HSA-Libraries/Bolt). SYCL provides a modern, standard C++11 wrapper around OpenCL, with better concurrency control and support for integrating kernels inline. Bolt provides many of the same higher-level abstractions found in Boost.Compute, but with forthcoming support for HSA.
Bolt relies on an extension to OpenCL called "OpenCL Static C++ Kernel Language Extension". Only AMD has bothered to implement it, to my knowledge. C++ AMP is in my opinion a more promising proposal compared to SYCL. Developers opt for C++ AMP today. But both SYCL and C++ AMP are higher level tools and have the disadvantages of any higher level library compared to a lower level library. In addition, they need a custom compiler or compiler extensions. This increases the fragmentation of the accelerator/co-processor field further. I think Boost.Compute does the right thing here. Identify the lowest common denominator: OpenCL. Build a library on top of it that anyone can use on any platform, provided a standard C++ compiler is available and the OpenCL library is implemented. Build whatever fancy thing you want on top of that.
To have the kind of lasting relevance and broad applicability to which all Boost libraries should aspire, I think Boost.Compute should be architected to support multiple backends. Though OpenCL support is currently ascendant, it's far from universal and is already flagging on some platforms (Nvidia, not the least). And HSA provides a foundation on which alternatives are actively being built. Most importantly, there exist multitudes of multi-core and multiprocessor systems which lack OpenCL support. It would be eminently useful to support these with such backends as thread pool, OpenMP, etc. And backends could be added to support new technologies, as they mature.
OpenCL is supposed to be the abstraction layer that does all that, remember? That is, support multi-core, multi-processor and many-core vector co-processors. Asking Boost.Compute to support threading and OpenMP is asking it to do the job of OpenCL library implementers. To play the heretic for the sake of argument: why stop at single nodes then? Why not add, on top of the OpenMP/threading layer you ask Boost.Compute to support an MPI layer? I urge you to not open this can of worms.
On 28/12/2014 13:26, Sebastian Schaetz wrote:
C++ AMP is in my opinion a more promising proposal compared to SYCL. Developers opt for C++ AMP today. But both SYCL and C++ AMP are higher level tools and have the disadvantages of any higher level library compared to a lower level library. In addition, they need a custom compiler or compiler extensions. This increases the fragmentation of the accelerator co-processor field further.
C++AMP is Microsoft proprietary technology for Visual Studio, with all the associated problems one might expect. SYCL is an open standard designed to integrate with C++11 as well as possible, and has an implementation within the LLVM framework. It's pretty much DirectX vs OpenGL again.
Mathias Gaunard
C++AMP is Microsoft proprietary technology for Visual Studio, with all the associated problems one might expect.
SYCL is an open standard designed to integrate with C++11 as well as possible, and has an implementation within the LLVM framework.
It's pretty much DirectX vs OpenGL again.
It is not. C++ AMP is an open standard [0] implemented not only by Microsoft [1]. [0] http://blogs.msdn.com/b/nativeconcurrency/archive/2013/12/12/c-amp-open-spec... [1] https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
On 28/12/2014 16:21, Sebastian Schaetz wrote:
Mathias Gaunard
writes: C++AMP is Microsoft proprietary technology for Visual Studio, with all the associated problems one might expect.
SYCL is an open standard designed to integrate with C++11 as well as possible, and has an implementation within the LLVM framework.
It's pretty much DirectX vs OpenGL again.
It is not. C++ AMP is an open standard [0]
It's not recognized as a standard by any formal standards body, though, and its drafting didn't involve anyone but Microsoft. It's pretty much just Microsoft publishing their specifications (which their implementation probably doesn't even follow to the letter) so that others can try to make interoperable technology.
implemented not only by Microsoft [1].
This implementation is highly experimental for the moment, however.
2014-12-28 15:26 GMT+03:00 Sebastian Schaetz
Gruenke,Matt
writes: To have the kind of lasting relevance and broad applicability to which all Boost libraries should aspire, I think Boost.Compute should be architected to support multiple backends. Though OpenCL support is currently ascendant, it's far from universal and is already flagging on some platforms (Nvidia, not the least). And HSA provides a foundation on which alternatives are actively being built. Most importantly, there exist multitudes of multi-core and multiprocessor systems which lack OpenCL support. It would be eminently useful to support these with such backends as thread pool, OpenMP, etc. And backends could be added to support new technologies, as they mature.
OpenCL is supposed to be the abstraction layer that does all that, remember? That is, support multi-core, multi-processor and many-core vector co-processors. Asking Boost.Compute to support threading and OpenMP is asking it to do the job of OpenCL library implementers.
Totally agree. Here's my 0.05 rubles (it's much less than 5 cents):

* OpenCL is the non-proprietary, open, widely used technology that is meant to be the abstraction atop any computing device. That's a 100% right choice.
* CUDA is a proprietary technology tied to NVIDIA graphics hardware. That's a very platform-dependent solution for Boost. Some CUDA code could possibly be added to Compute, but only if it gives a significant performance boost, and an OpenCL implementation must still be provided for non-NVIDIA platforms.
* C++ AMP is not licensed under MIT, GPL, Boost or BSD. It's covered by the "Microsoft Community Promise", which makes me nervous and does not look like a perfect license for an open standard: http://en.wikipedia.org/wiki/Microsoft_Open_Specification_Promise#Scope_limi... . C++ AMP is not as popular as OpenCL or CUDA. The only non-Windows-based implementation is a C++ AMP compiler that outputs to OpenCL. I see no reason to use this technology.

--
Best regards,
Antony Polukhin
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Sebastian Schaetz Sent: Sunday, December 28, 2014 7:26 To: boost@lists.boost.org Subject: Re: [boost] [compute] review
Gruenke,Matt writes:
C++ AMP is in my opinion a more promising proposal compared to SYCL. Developers opt for C++ AMP today. But both SYCL and C++ AMP are higher level tools and have the disadvantages of any higher level library compared to a lower level library.
How is SYCL a higher-level tool? Have a look at the provisional spec: https://www.khronos.org/registry/sycl It has equivalents of everything you find in Boost.Compute, *except* for the higher-level functions. Moreover, they introduce the notion of command groups, and possibly other low level features.
In addition, they need a custom compiler or compiler extensions.
[Addressed in previous message. If you have any evidence to support this, please reply to that thread.]
OpenCL is supposed to be the abstraction layer that does all that, remember?
In fact, that has been the primary factor fueling my interest, over the years. But there are many systems that still don't support it. And it's only one solution to this problem. There will doubtlessly be others, possibly spurred on by the advent of HSA and other technologies yet to arrive on the scene.
I urge you to not open this can of worms.
I didn't mean to imply that it *needed* to have a backend for XYZ. I am merely *suggesting* backends such as a threadpool or possibly OpenMP. My point was about the design - that it should facilitate the addition of backends, in order to address both existing and future systems where OpenCL support is absent or inefficient. Again, the key point is that the design should accommodate different backends. Whether a given backend is developed depends on whether there's enough interest for someone to write and maintain it. And perhaps some backends will exist only as proprietary patches maintained in private repositories of users. The main contribution of Boost.Compute would then be the framework, interface, and high-level algorithms.

Matt
C++ AMP is in my opinion a more promising proposal compared to SYCL. Developers opt for C++ AMP today. But both SYCL and C++ AMP are higher level tools and have the disadvantages of any higher level library compared to a lower level library.
How is SYCL a higher-level tool? Have a look at the provisional spec:
https://www.khronos.org/registry/sycl
It has equivalents of everything you find in Boost.Compute, *except* for the higher-level functions.
Gruenke,Matt
Moreover, they introduce the notion of command groups, and possibly other low level features.
I have, and I judge it to be higher level compared to OpenCL (or Boost.Compute). SYCL abstracts memory access through 'accessors' and I'm not sure that you can issue explicit memory copies in SYCL. Both OpenCL and Boost.Compute have explicit copy functions with synchronous or asynchronous semantics. This is also one of my major points of critique of C++ AMP - it is unclear when data is transferred between host and device.
In fact, that has been the primary factor fueling my interest, over the years. But there are many systems that still don't support it. And it's only one solution to this problem.
Agreed.
I urge you to not open this can of worms.
I didn't mean to imply that it *needed* to have a backend for XYZ. I am merely *suggesting* backends such as a threadpool or possibly OpenMP. My point was about the design - that it should facilitate the addition of backends, in order to address both existing and future systems where OpenCL support is absent or inefficient.
Again, the key point is that the design should accommodate different backends. Whether a given backend is developed depends on whether there's enough interest for someone to write and maintain it. And perhaps some backends will exist only as proprietary patches maintained in private repositories of users. The main contribution of Boost.Compute would then be the framework, interface, and high-level algorithms.
Such multi-backend support would be at the STL level and could not go lower than that. Devices, command queues and contexts don't make any sense when thinking of an OpenMP backend, let alone a compute::vector with an associated compute::copy. The library you're asking for is very different from the library proposed here.
On Sun, Dec 28, 2014 at 3:55 PM, Gruenke,Matt
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Sebastian Schaetz Sent: Sunday, December 28, 2014 7:26 To: boost@lists.boost.org Subject: Re: [boost] [compute] review
I urge you to not open this can of worms.
I didn't mean to imply that it *needed* to have a backend for XYZ. I am merely *suggesting* backends such as a threadpool or possibly OpenMP. My point was about the design - that it should facilitate the addition of backends, in order to address both existing and future systems where OpenCL support is absent or inefficient.
Again, the key point is that the design should accommodate different backends. Whether a given backend is developed depends on whether there's enough interest for someone to write and maintain it. And perhaps some backends will exist only as proprietary patches maintained in private repositories of users. The main contribution of Boost.Compute would then be the framework, interface, and high-level algorithms.
While I agree that this would be useful, an API like this isn't the goal of Boost.Compute. I think there is room for a higher-level library to provide a more abstract parallel interface which could potentially use Boost.Compute in addition to other available parallel APIs (e.g. OpenMP, TBB, CUDA, MPI, etc.). In fact, this is very much what the C++ parallelism TS is attempting to provide. -kyle
On Sun, Dec 28, 2014 at 8:58 PM, Kyle Lutz
On Sun, Dec 28, 2014 at 3:55 PM, Gruenke,Matt
wrote: I didn't mean to imply that it *needed* to have a backend for XYZ. I am merely *suggesting* backends such as a threadpool or possibly OpenMP. My point was about the design - that it should facilitate the addition of backends, in order to address both existing and future systems where OpenCL support is absent or inefficient.
Again, the key point is that the design should accommodate different backends. Whether a given backend is developed depends on whether there's enough interest for someone to write and maintain it. And perhaps some backends will exist only as proprietary patches maintained in private repositories of users. The main contribution of Boost.Compute would then be the framework, interface, and high-level algorithms.
While I agree that this would be useful, an API like this isn't the goal of Boost.Compute. I think there is room for a higher-level library to provide a more abstract parallel interface which could potentially use Boost.Compute in addition to other available parallel APIs (e.g. OpenMP, TBB, CUDA, MPI, etc.). In fact this is very much what the C++ parallelism TS is attempting to provide.
I generally agree with Matt. If we intend Boost.Compute to be widely available and useful on many platforms (Windows, Mac, and Linux are far from being the entire world today; just think of iOS, where there is no OpenCL), sticking absolutely to OpenCL does not seem to me like a good solution. Thus, if adding the possibility of other backends is not part of Boost.Compute's goal, then why not name it Boost.OpenCL?

--
François Duranleau
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt
1. What is your evaluation of the design?
I agree with other comments made about synchronization. The design should be more explicit about what's asynchronous, and ideally utilize only one mechanism for synchronization. One approach that would have made this clearer is if high-level operations were actually objects that got posted to the command_queue.
Like I mentioned before, there is only one method for asynchrony in Boost.Compute: the command queue abstraction provided by OpenCL. Operations are enqueued to be executed on the compute device and this occurs asynchronously with respect to code executing on the host. The exceptions to this rule are functions which interact directly with host memory; these are blocking by default and offer explicit "_async()" versions (in order to prevent potential race conditions on host memory). And I am hesitant to make the command_queue interface more high-level. But this sort of functionality could certainly be built on top of the current interfaces in Boost.Compute. See some of the previous reviews for discussion about this.
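To illustrate the distinction, here is a minimal sketch of the two forms (illustrative only; the context, queue, and size n are assumed to exist, and copy_async is used as the spelling of the "_async()" variant):

    namespace compute = boost::compute;

    std::vector<float> host(n);
    compute::vector<float> device(n, context);

    // blocking form: returns only once the device-to-host transfer has finished
    compute::copy(device.begin(), device.end(), host.begin(), queue);

    // explicit asynchronous form: returns immediately with a future;
    // 'host' must stay alive and untouched until wait()/get() is called
    auto f = compute::copy_async(device.begin(), device.end(), host.begin(), queue);
    // ... enqueue more device work here ...
    f.wait();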
Regarding synchronization, I'm also a bit concerned about the performance impact of synchronizing on all copies to host memory. Overuse of synchronization can easily result in performance deterioration. On this point, I think it might be worth limiting host memory usable with algorithms, to containers that perform implicit synchronization to actual use (or destruction) of results. Give users the choice between that or performing explicit copies to raw types.
To be clear, not all copies are synchronized with host memory. Boost.Compute allows both synchronous and asynchronous memory transfers between the host and device. An implicitly synchronized container class could be built on top of the Boost.Compute APIs. I haven't implemented it yet as I personally feel that blocking operations should be done explicitly by the user and not hidden away in an implicit API. However, I'd be happy to review and integrate functionality like this into Boost.Compute if someone was interested in developing it.
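As a rough illustration of what such an implicitly synchronizing container might look like (hypothetical; not part of Boost.Compute, and all names are invented), a thin wrapper around a host vector could remember the event of the last asynchronous transfer into it and wait on that event before the data is used or destroyed:

    #include <cstddef>
    #include <vector>
    #include <boost/compute/event.hpp>

    // Hypothetical host-side container: blocks on the pending transfer
    // before its contents are read or the object is destroyed.
    template<class T>
    class synced_host_vector
    {
    public:
        explicit synced_host_vector(std::size_t n) : m_data(n), m_has_pending(false) {}

        ~synced_host_vector() { sync(); }                // implicit sync on destruction
                                                         // (a real version would guard against throwing here)

        T* raw() { return m_data.data(); }               // target for an async read-back
        std::size_t size() const { return m_data.size(); }

        // record the event returned by e.g. enqueue_read_buffer_async()
        void set_pending(const boost::compute::event &e) { m_pending = e; m_has_pending = true; }

        const std::vector<T>& get()                      // implicit sync on actual use
        {
            sync();
            return m_data;
        }

    private:
        void sync() { if(m_has_pending){ m_pending.wait(); m_has_pending = false; } }

        std::vector<T> m_data;
        boost::compute::event m_pending;
        bool m_has_pending;
    };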
Also, I agree with Thomas M that it'd be useful for operations to return events. That said, if you told me it doubled the overhead involved in dispatching operations, I'd suggest to make it optional through overloads or via a mechanism in the command_queue itself.
All asynchronous operations in the command queue class do return events. One of his comments was to also return events from the synchronous methods for consistency and I am working on adding this.
2. What is your evaluation of the implementation?
It seems a bit heavy to be header-only.
I'd agree with Andrey Semashev, regarding thread safety, except I think it shouldn't be an option at all. That said, not all operations on all objects need be thread safe. I think the Boost.ASIO documentation provides a good example of how to express different thread safety limitations. Now, I haven't examined the use of thread-local storage, but do you feel there's a strong case to be made for the thread safety it's providing (especially, given that you seem to feel it's non-essential)?
I absolutely do not feel it is non-essential. The reason for it currently being a compile-time configurable option is that enabling it on non-C++11 compilers would make the library non-header-only (it would require the users to link to Boost.Thread). This has come up a few times in the review and I have been considering making the library non-header-only (or at least optionally non-header-only) which would obviate some of these issues. However, this is a much larger and potentially non-backwards-compatible change and I'd need to run it by the community and provide a migration plan for current users.
I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.
I'm not sure if there are any Boost conventions for/against this (someone please speak up if there are). I chose the trailing underscore for these types as I needed a consistent spelling for representing the fundamental OpenCL types (e.g. "float" or "uint" or "int4") and I couldn't just use the names without the trailing underscore as they'd conflict with the C++ reserved keywords (e.g. "float", "int"). Using a leading underscore (e.g. "_float4") looked worse to me, so I used a trailing underscore. But I'd definitely be open to hearing other ideas.
I didn't notice any unit tests specifically intended to verify exception-safety. I'd want to know that had been thoroughly vetted, before I'd consider using Boost.Compute in production code.
There are already some tests which cover the error/exception paths in Boost.Compute (testing building invalid programs in "test_program.cpp" is the first that comes to mind). I'll work on adding more. Please let me know if there are any specific areas you'd like to see more heavily tested.
Finally, based on the docs, it seems there's a bug in boost::compute::command_queue::enqueue_copy_image_to_buffer() and boost::compute::command_queue::enqueue_copy_buffer_to_image(). The region and origin parameters should be arrays, in the 2D and 3D cases. If this discrepancy is intentional, it should be noted. I haven't exhaustively reviewed the reference docs, so there might be other issues of this sort.
Can you be more specific as to what you consider to be a "bug" here? These APIs in the command_queue class are very "thin" wrappers on top of the clEnqueue* functions from the OpenCL C API. If you could provide a small test-case which demonstrates the issue and submit it to the bug-tracker [1] that would be great.
3. What is your evaluation of the documentation?
It needs more and better-documented tutorials (and I'm including the "Advanced Topics", here).
Fully agree, I'll work on this.
Some of the reference pages could use more details, as well, especially concerning synchronization and exception usage.
Fully agree. I've been working on improving this. And please let me know if there are any specific classes/functions for which you'd like to see more detailed documentation.
Regarding exceptions, there should be discussion of specifically what is invalidated by which kinds of errors. If any errors are non-recoverable, this should also be called out and perhaps derived from a different or additional base class (not a bad use case for multiple inheritance, actually).
As far as I know, there are no non-recoverable errors or errors which invalidate host objects thrown by Boost.Compute. This is due to the fact that Boost.Compute mainly deals with executing operations on a separate device (e.g. GPU) and exceptions there don't affect the host's ability to continue execution.
I agree with Mathias Gaunard that it would be helpful to discuss computational complexity, device memory requirements, and the various methods used to implement some of the algorithms. You could continue to omit these details on the simple algorithms, though.
I will definitely continue to improve the documentation and specifically add more of these sorts of details for the algorithms. Please let me know if there are additional areas where you would also like to see more specific details.
4. What is your evaluation of the potential usefulness of the library?
This is my biggest misgiving, by far. In the very near future, I expect developers will opt for either SYCL (https://www.khronos.org/opencl/sycl) or Bolt (https://github.com/HSA-Libraries/Bolt). SYCL provides a modern, standard C++11 wrapper around OpenCL, with better concurrency control and support for integrating kernels inline. Bolt provides many of the same higher-level abstractions found in Boost.Compute, but with forthcoming support for HSA.
To have the kind of lasting relevance and broad applicability to which all Boost libraries should aspire, I think Boost.Compute should be architected to support multiple backends. Though OpenCL support is currently ascendant, it's far from universal and is already flagging on some platforms (Nvidia, not the least). And HSA provides a foundation on which alternatives are actively being built. Most importantly, there exist multitudes of multi-core and multiprocessor systems which lack OpenCL support. It would be eminently useful to support these with such backends as thread pool, OpenMP, etc. And backends could be added to support new technologies, as they mature.
5. Did you try to use the library? With what compiler? Did you have any problems?
I downloaded and inspected the sources and some examples. I didn't have any questions or concerns that would be addressed by experimentation.
6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Review of the documentation, some sources, some examples, and a refresher on SYCL & Bolt. I also read all of the reviews and replies sent thus far.
7. Are you knowledgeable about the problem domain?
I've been developing production code for parallel and heterogeneous computing platforms for the majority of the last 16 years. I've followed OpenCL since version 1.0 was finalized. I've also submitted bug reports and minor patches for Khronos' official C++ OpenCL interface.
8. Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
Not in its current state. The primary issue is that it's too OpenCL-centric. It should support more backends, even if these are just compile-time options. This is crucial for both current and lasting relevance.
Boost.Compute is explicitly *not* designed to be an abstraction layer over every parallel-programming interface currently available. It is designed to target modern programmable GPUs, multi-core CPUs and some of the more exotic accelerators (Xeon Phi, Parallella Epiphany, FPGAs, etc.). In my opinion, OpenCL is the current best and most widely-supported interface for this purpose and fills this role well.

Other libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC, etc.) are all dependent on either a special compiler or special compiler extensions. On the other hand, OpenCL is a library-level solution which allows Boost.Compute to be portable to any platform with a standard C++ compiler. And as Boost.Compute and SYCL are both based on OpenCL, interoperating between them should be fairly trivial. If a non-commercial SYCL implementation is ever released, I will work on adding support for its single-source compilation model to Boost.Compute.

I feel developing an all-encompassing parallel programming library is outside the scope of Boost.Compute (and would require considerably more resources to design, implement, and maintain). That said, I feel that Boost.Compute is well-suited to provide a back-end to a more generic/high-level parallel programming library (either something like a potential Boost.Parallel library or the upcoming standard C++ Parallelism TS).

More concretely, and as Sebastian mentioned, OpenCL itself already provides a parallel-programming abstraction with many different back-ends (both vendor-supplied implementations and open-source implementations). I personally don't see the benefit to adding yet another abstraction layer between the user and the compute device. I feel it would greatly increase the complexity of the library and prevent us from spending more time on improving performance.
Beyond that, better concurrency control is crucial for optimum performance. If done well, it could even help further reduce usage errors.
Boost.Compute provides low-level abstractions for concurrency between the host and device. More high-level APIs and frameworks could definitely be built on top of them. However, designing an optimal, high-level API for concurrency is notoriously difficult and, in my opinion, still an open problem.

Thanks for the review. Please let me know if I can explain anything further.

-kyle

[1] https://github.com/kylelutz/compute/issues
On 2014-12-28 14:41, Kyle Lutz wrote:
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt
wrote: I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.
I'm not sure if there are any Boost conventions for/against this (someone please speak up if there are). I chose the trailing underscore for these types as I needed a consistent spelling for representing the fundamental OpenCL types (e.g. "float" or "uint" or "int4") and I couldn't just use the names without the trailing underscore as they'd conflict with the C++ reserved keywords (e.g. "float", "int"). Using a leading underscore (e.g. "_float4") looked worse to me, so I used a trailing underscore. But I'd definitely be open to hearing other ideas.
There is precedent for the trailing underscore convention. See for example http://www.boost.org/doc/libs/1_57_0/libs/mpl/doc/refmanual/numeric.html. John Bytheway
On Mon, Dec 29, 2014 at 1:31 AM, John Bytheway
On 2014-12-28 14:41, Kyle Lutz wrote:
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt
wrote: I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.
I'm not sure if there are any Boost conventions for/against this (someone please speak up if there are). I chose the trailing underscore for these types as I needed a consistent spelling for representing the fundamental OpenCL types (e.g. "float" or "uint" or "int4") and I couldn't just use the names without the trailing underscore as they'd conflict with the C++ reserved keywords (e.g. "float", "int"). Using a leading underscore (e.g. "_float4") looked worse to me, so I used a trailing underscore. But I'd definitely be open to hearing other ideas.
There is precedent for the trailing underscore convention. See for example http://www.boost.org/doc/libs/1_57_0/libs/mpl/doc/refmanual/numeric.html.
The trailing underscore syntax is typically used when a library construct has semantics very similar to a language component which uses a keyword. I.e. mpl::if_ is similar to the language if, and mpl::int_ to the language int, except that mpl tools operate at compile time. In this case, I think, the semantics is sufficiently different (vector types instead of scalar), which should warrant distinct and descriptive names.
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Andrey Semashev Sent: Sunday, December 28, 2014 18:44 To: boost@lists.boost.org Subject: Re: [boost] [compute] review
On Mon, Dec 29, 2014 at 1:31 AM, John Bytheway
wrote:
The trailing underscore syntax is typically used when a library construct has semantics very similar to a language component which uses a keyword. I.e. mpl::if_ is similar to the language if, and mpl::int_ to the language int, except that mpl tools operate at compile time. In this case, I think, the semantics is sufficiently different (vector types instead of scalar), which should warrant distinct and descriptive names.
I appreciate what I assume Kyle was trying to do. These correspond 1:1 to the types used in OpenCL kernels, so I think he wanted the Boost.Compute typedefs to be as similar as possible. I think this approach is justified, as their meaning will be obvious to anyone familiar with OpenCL. I only meant to raise the question of how they were mangled.

Matt
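For what it's worth, a small illustration of that correspondence (illustrative only; the context object is assumed, and the scalar-component constructor is assumed from the reference docs):

    #include <boost/compute.hpp>

    namespace compute = boost::compute;

    // host-side value of the OpenCL kernel type "float4"
    compute::float4_ p(1.0f, 2.0f, 3.0f, 4.0f);

    // device buffer whose elements are kernel-side float4 values
    compute::vector<compute::float4_> points(1024, context);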
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost@lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
Others libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC, etc.) are all dependent on either a special compiler or special compiler extensions.
According to Khronos, that's incorrect. The provisional specification includes the following features:

* Specifications for creating C++ template libraries and compilers using the C++11 standard
* Easy to use, production grade API - built on-top of OpenCL and SPIR(tm)
* Compatible with standard CPU C++ compilers across multiple platforms, as well as enabling new SYCL device compilers to target OpenCL devices
  ^^^^^^^^^^ ^^^^ ^^^^^^^^ ^^^ ^^^ ^^^^^^^^^
* Asynchronous, low-level access to OpenCL features for high performance and low-latency - while retaining ease of use

Source: https://www.khronos.org/opencl/sycl

If it didn't depend on C++11, or if Boost dropped the C++03 compatibility requirement for new libraries, I'd even recommend using their C++ abstractions instead of Boost.Compute's custom core.

Matt
Gruenke,Matt
-----Original Message----- From: Boost [mailto:boost-bounces <at> lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost <at> lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
Others libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC, etc.) are all dependent on either a special compiler or special compiler
extensions.
According to Khronos, that's incorrect.
The provisional specification includes the following features:
* Specifications for creating C++ template libraries and compilers using the C++11 standard
* Easy to use, production grade API - built on-top of OpenCL and SPIR(tm)
* Compatible with standard CPU C++ compilers across multiple platforms, ^^^^^^^^^^ ^^^^ ^^^^^^^^ ^^^ ^^^ ^^^^^^^^^ as well as enabling new SYCL device compilers to target OpenCL devices
* Asynchronous, low-level access to OpenCL features for high performance and low-latency - while retaining ease of use
Slides 14 and 15 of this slide deck [0] show SYCL device compilers. SYCL code does compile using a standard C++11 compiler, but to compile SYCL kernel code for a device you need "either a special compiler or special compiler extensions". Codeplay calls this the "Shared Source Programming Model"; you can learn more about it from slide 22 onwards in this slide deck [1]. It is the same thing CUDA has been doing since its initial release. But that is not unexpected at all. I don't think there is a technique to generate kernel code from

    command_group(myQueue, [&]() {
        // Data accessors
        // [...]
        // Kernel
        parallel_for(count, kernel_functor([=](id<> item) {
            int i = item.get_global(0);
            r[i] = a[i] + b[i] + c[i];
        }));
    });

within standard C++11. You need a second compiler or an extension. Boost.Compute can do most things SYCL can do - except you don't need anything but a regular C++ compiler and the OpenCL library.

[0] http://tinyurl.com/q7f5nm4
[1] http://tinyurl.com/nffyyon
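For comparison, a rough Boost.Compute counterpart of the two-input case (sketch only; the count, context, and queue are assumed, and the three-way add is reduced to two inputs since transform takes at most two input ranges):

    namespace compute = boost::compute;

    compute::vector<int> a(count, context), b(count, context), r(count, context);

    // r[i] = a[i] + b[i]; the kernel is generated and compiled at run time
    // through the OpenCL library -- no second compiler pass is involved.
    compute::transform(a.begin(), a.end(), b.begin(), r.begin(),
                       compute::plus<int>(), queue);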
On Sun, Dec 28, 2014 at 3:33 PM, Gruenke,Matt
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost@lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
Others libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC, etc.) are all dependent on either a special compiler or special compiler extensions.
According to Khronos, that's incorrect.
The provisional specification includes the following features:
* Specifications for creating C++ template libraries and compilers using the C++11 standard
* Easy to use, production grade API - built on-top of OpenCL and SPIR(tm)
* Compatible with standard CPU C++ compilers across multiple platforms, ^^^^^^^^^^ ^^^^ ^^^^^^^^ ^^^ ^^^ ^^^^^^^^^ as well as enabling new SYCL device compilers to target OpenCL devices
* Asynchronous, low-level access to OpenCL features for high performance and low-latency - while retaining ease of use
Well let me back up. Currently there is no publicly available implementation of SYCL which can offload computation to a GPU. Any possible implementation of SYCL with the ability to perform work on a GPU involving custom kernel functions specified in C++ would require non-standard (w.r.t the C++ standard) support from the compiler. In fact, I would be very happy to see any proof to the contrary as it would be very useful for developing Boost.Compute itself. So while you are correct that the Khronos specification doesn't explicitly require special compiler support, any practical implementation of SYCL would. But to more generally address these comments which have come up in review, the SYCL specification has yet to see a real-world implementation and, as far as Boost.Compute is concerned, SYCL is essentially vaporware. While I do see how the SYCL interface could be used together with Boost.Compute, I cannot currently use it as the basis for Boost.Compute as it does not exist in any practically useable form. In order for me to seriously consider using SYCL for Boost.Compute, I would like to see a publicly available implementation which supported NVIDIA GPUs, AMD CPUs/GPUs and Intel CPUs/GPUs. -kyle
On Sun, Dec 28, 2014 at 10:41 PM, Kyle Lutz
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt
wrote: I'd agree with Andrey Semashev, regarding thread safety, except I think it shouldn't be an option at all. That said, not all operations on all objects need be thread safe. I think the Boost.ASIO documentation provides a good example of how to express different thread safety limitations. Now, I haven't examined the use of thread-local storage, but do you feel there's a strong case to be made for the thread safety it's providing (especially, given that you seem to feel it's non-essential)?
I absolutely do not feel it is non-essential. The reason for it currently being a compile-time configurable option is that enabling it on non-C++11 compilers would make the library non-header-only (it would require the users to link to Boost.Thread).
I'd say, in C++03 it is only nominally header only. I consider the multi-threaded mode as the default (i.e. it is what I want in 99% of use cases and I think that's common), and in this mode users will have to link with Boost.Thread anyway. If you truly want to pursue the header-only property, I think you have to make it header-only in all typical configs.
I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.
I'm not sure if there are any Boost conventions for/against this (someone please speak up if there are). I chose the trailing underscore for these types as I needed a consistent spelling for representing the fundamental OpenCL types (e.g. "float" or "uint" or "int4") and I couldn't just use the names without the trailing underscore as they'd conflict with the C++ reserved keywords (e.g. "float", "int"). Using a leading underscore (e.g. "_float4") looked worse to me, so I used a trailing underscore. But I'd definitely be open to hearing other ideas.
There is precedent in the standard int types (e.g. uint32_t), but I assume in your case int4_ means a vector of 4 ints, right? I'd say a better name would reflect the fact that this is a vector type and also specify the element type. I don't know if OpenCL defines sizes of its types - if it does, so should Boost.Compute. A few examples of what I have in mind: v4int_t, v4uint32_t, uint32x4_t, floatx4_t. You may also want to see how Boost.SIMD deals with vector types (packs, in its terminology).
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost@lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
I agree with other comments made about synchronization. The design should be more explicit about what's asynchronous,
Like I mentioned before, there is only one method for asynchrony in Boost.Compute, the command queue abstraction provided by OpenCL. Operations are enqueued to be executed on the compute device and this occurs asynchronously with respect to code executing on the host. The exception to this rule are functions which interact directly with host-memory which by default are blocking and offer explicit "_async()" versions (in order to prevent potential race-conditions on host-memory).
Regarding synchronization, I'm also a bit concerned about the performance impact of synchronizing on all copies to host memory. Overuse of synchronization can easily result in performance deterioration. On this point, I think it might be worth limiting host memory usable with algorithms, to containers that perform implicit synchronization to actual use (or destruction) of results. Give users the choice between that or performing explicit copies to raw types.
To be clear, all copies are not synchronized with host memory. Boost.Compute allows both synchronous and asynchronous memory transfers between the host and device.
My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?
Also, I agree with Thomas M that it'd be useful for operations to return events.
All asynchronous operations in the command queue class do return events. One of his comments was to also return events from the synchronous methods for consistency and I am working on adding this.
Well, what I had in mind was events for higher-order operations, like boost::compute::transform().

Matt
On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost@lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
I agree with other comments made about synchronization. The design should be more explicit about what's asynchronous,
Like I mentioned before, there is only one method for asynchrony in Boost.Compute, the command queue abstraction provided by OpenCL. Operations are enqueued to be executed on the compute device and this occurs asynchronously with respect to code executing on the host. The exception to this rule are functions which interact directly with host-memory which by default are blocking and offer explicit "_async()" versions (in order to prevent potential race-conditions on host-memory).
Regarding synchronization, I'm also a bit concerned about the performance impact of synchronizing on all copies to host memory. Overuse of synchronization can easily result in performance deterioration. On this point, I think it might be worth limiting host memory usable with algorithms, to containers that perform implicit synchronization to actual use (or destruction) of results. Give users the choice between that or performing explicit copies to raw types.
To be clear, all copies are not synchronized with host memory. Boost.Compute allows both synchronous and asynchronous memory transfers between the host and device.
My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?
Yes, this is correct. In general, algorithms like transform() are asynchronous when the input/output ranges are both on the device and synchronous when one of the ranges is on the host. I'll work on better ways to allow asynchrony in the latter case. One of my current ideas is to add asynchronous memory-mapping support to the mapped_view class [1], which could then be used with any of the algorithms in an asynchronous fashion.
Also, I agree with Thomas M that it'd be useful for operations to return events.
All asynchronous operations in the command queue class do return events. One of his comments was to also return events from the synchronous methods for consistency and I am working on adding this.
Well, what I had in mind was events for higher-order operations, like boost::compute::transform().
Yes, I would also like to have higher-level support for chaining together algorithms asynchronously. However, designing a generic and generally useful API for this is a complex task and may take some time (I've shied away from just adding an extra "_async()" function for all of the algorithm APIs, as I think it could be done better and more extensibly). Any ideas/proposals for this would be great to hear.

-kyle

[1] http://kylelutz.github.io/compute/boost/compute/mapped_view.html
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 20:36 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt wrote:
My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?
Yes this is correct. In general, algorithms like transform() are asynchronous when the input/output ranges are both on the device and synchronous when one of the ranges is on the host. I'll work on better ways to allow asynchrony in the latter case. One of my current ideas is add asynchronous memory-mapping support to the mapped_view class [1] which can then be used with any of the algorithms in an asynchronous fashion.
Why block when only the source is on the host? Are you worried it might go out of scope? If so, that's actually not a bad point. I was just pondering how to write exception-safe code using local host data structures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after they're out of scope. Not the only way that comes to mind (nor the most efficient), but it does the job.

Matt
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 20:36 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt wrote:
My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?
Yes this is correct. In general, algorithms like transform() are asynchronous when the input/output ranges are both on the device and synchronous when one of the ranges is on the host. I'll work on better ways to allow asynchrony in the latter case. One of my current ideas is add asynchronous memory-mapping support to the mapped_view class [1] which can then be used with any of the algorithms in an asynchronous fashion.
Why block when only the source is on the host? Are you worried it might go out of scope?
If so, that's actually not a bad point. I was just pondering how to write exception-safe code using local host datastructures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after it's out of scope. Not the only way that comes to mind (nor the most efficient), but it does the job.
Yes, that is one of the major motivations. Avoiding potential race-conditions with host code accessing the memory at the same time is another. I'd be very open to other solutions. -kyle
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 21:24 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
Why block when only the source is on the host? Are you worried it might go out of scope?
If so, that's actually not a bad point. I was just pondering how to write exception-safe code using local host datastructures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after it's out of scope. Not the only way that comes to mind (nor the most efficient), but it does the job.
Yes, that is one of the major motivations. Avoiding potential race-conditions with host code accessing the memory at the same time is another. I'd be very open to other solutions.
I have some rough ideas, but they'd probably have a deeper impact on your API than you want, at this stage. Instead, I'm thinking mostly about how to make exception-safe use of the async copy commands to/from host memory. I think async copies will quickly gain popularity with advanced users, and will probably be one of the top optimization tips.

I guess it'd be nice to have a scope guard that blocks on boost::compute::event.

    std::vector<int> hv(16000000);
    std::generate(hv.begin(), hv.end(), rand);
    boost::compute::vector<int> dv(hv.size(), context);

    boost::compute::guard wg(
        cq.enqueue_write_buffer_async(dv.get_buffer(), 0,
                                      hv.size() * sizeof(int), &hv.front()));

    // do some device -> device operations

    boost::compute::guard rg(
        cq.enqueue_read_buffer_async(dv.get_buffer(), 0,
                                     hv.size() * sizeof(int), &hv.front()));

[Short names used for the sake of email.]

boost::compute::guard would do nothing more than hold a reference to the event and call boost::compute::event::wait() upon destruction (if it contains a valid event at that point). So, if an exception is thrown after wg is constructed, then the stack unwind blocks on its completion (or failure). Similarly, if the exception is thrown after rg, then the unwind blocks on both wg and rg.

Obviously, this example is somewhat artificial. If the read were really the last operation in scope, then you could just use a synchronous copy. But I think one strong use case for async copies is to free up device memory for subsequent operations. So, I might transfer something off, even if I'm not going to do anything with it at that point.

Matt
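To make the proposal concrete, here is a minimal sketch of what such a guard might look like (hypothetical; boost::compute::guard does not exist in the library, and only boost::compute::event::wait() is relied upon):

    #include <boost/compute/event.hpp>

    // Hypothetical scope guard: waits on the associated event when it
    // goes out of scope, whether by normal exit or stack unwinding.
    class guard
    {
    public:
        explicit guard(const boost::compute::event &e) : m_event(e), m_active(true) {}

        ~guard()
        {
            if(m_active){
                try { m_event.wait(); } catch(...) {}   // never throw during unwinding
            }
        }

        void release() { m_active = false; }            // detach, e.g. after an explicit wait

    private:
        guard(const guard&);                            // noncopyable
        guard& operator=(const guard&);

        boost::compute::event m_event;
        bool m_active;
    };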
On 29/12/2014 04:40, Gruenke,Matt wrote:
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 21:24 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
Why block when only the source is on the host? Are you worried it might go out of scope?
If so, that's actually not a bad point. I was just pondering how to write exception-safe code using local host datastructures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after it's out of scope. Not the only way that comes to mind (nor the most efficient), but it does the job.
Yes, that is one of the major motivations. Avoiding potential race-conditions with host code accessing the memory at the same time is another. I'd be very open to other solutions.
I find it truly confusing that an algorithm can run either synchronously or asynchronously without its signature clearly and loudly indicating so. In template code (or in general) it can easily be more or less unknown (or non-trivial to find out) whether the input/output ranges refer to the host or the device, and thus whether the algorithm will execute in synchronous or asynchronous mode -> and what that implies for the rest of the code around the algorithm.

If I understood the behavior of transform correctly (and assuming that for device_vector the input/output ranges count as device side [?]), am I correct that the following can easily fail?:

compute::command_queue queue1(context, device);
compute::command_queue queue2(context, device);

compute::vector<float> device_vector(n, context);
// copy some data to device_vector

// use queue1
boost::compute::transform(device_vector.begin(), device_vector.end(), device_vector.begin(), compute::sqrt<float>(), queue1);

// use queue2
compute::copy(device_vector.begin(), device_vector.end(), some_host_vector.begin(), queue2);

And currently the way to make this behave properly would be to force queue1 to wait for completion of any enqueued job (note: it may be an out-of-order queue!) after transform has been called?

One way could be to make algorithms always treated as asynchronous at the API level (even if internally they may run synchronously) and always associated with an event. Another is providing a synchronous and an asynchronous overload. I'd certainly prefer to know whether it runs synchronously or asynchronously just by looking at the transform invocation itself.

With respect to exception safety, is there any proper behavior defined by your library if transform has been enqueued to run in asynchronous mode, but before it has completed device_vector goes out of scope (e.g. due to an exception thrown in the host code following the transform)? Or is it the user's responsibility to ensure that, whatever happens, device_vector lives until the transform has completed?
I have some rough ideas, but they'd probably have a deeper impact on your API than you want, at this stage.
Instead, I'm thinking mostly about how to make exception-safe use of the async copy commands to/from host memory. I think async copies will quickly gain popularity with advanced users, and will probably be one of the top optimization tips. I guess it'd be nice to have a scope guard that blocks on boost::compute::event.
Here's another sketch, also considering the points above - though I obviously don't know if it's doable given the implementation + other design considerations I might be missing, so apologies if it's nonsense.

If input/output ranges generally refer to iterators from the boost::compute library, then:
-) an iterator can store the container (or other data structure it belongs to, if any)
-) an algorithm can, via the iterators, "communicate" with the container(s)

For an input operation the data must be available, unaltered, from the time the input operation is enqueued until its completion. So when transform (as an example) is launched it can inform the input data container that any subsequent modification of it (including destruction / setting new values through iterators) must wait until that input operation has completed - i.e. the first modifying operation blocks until it has finished. Similarly for the output range, except that there any read operation must also block until the output data from the transform has been written to it. So:
-) no matter what causes the destruction of containers (e.g. the end of the block being reached normally, an exception, etc.), the lifetime of the container/iterators extends until the asynchronous operation on it has finished; thus exceptions thrown are implicitly handled.
-) to the user the code appears synchronous with respect to visible behavior, but can run asynchronously in the background.

Obviously a full-fledged version is neither trivial nor cheap with respect to performance (e.g. checking on any read/write to a container whether it must block), let alone the threading aspects. But maybe just parts of it are useful, e.g. deferring container destruction until no OpenCL operation is enqueued to work on the container (-> handling exceptions). I think there's a wide range of balances between what the implementation does automagically and what constraints are placed on the user to not do "stupid" things.

Thomas
On Mon, Dec 29, 2014 at 1:19 PM, Thomas M
On 29/12/2014 04:40, Gruenke,Matt wrote:
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 21:24 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
Why block when only the source is on the host? Are you worried it might go out of scope?
If so, that's actually not a bad point. I was just pondering how to write exception-safe code using local host datastructures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after it's out of scope. Not the only way that comes to mind (nor the most efficient), but it does the job.
Yes, that is one of the major motivations. Avoiding potential race-conditions with host code accessing the memory at the same time is another. I'd be very open to other solutions.
I find it truly confusing that an algorithm can run either synchronously or asynchronously without its signature clearly and loudly indicating so. In template code (or in general) it can easily be more or less unknown (or non-trivial to find out) whether the input/output ranges refer to the host or the device, and thus whether the algorithm will execute in synchronous or asynchronous mode -> and what that implies for the rest of the code around the algorithm.
If I understood the behavior of transform correctly (and assuming that for device_vector the input/output ranges count as device side [?]), am I correct that the following can easily fail?:
compute::command_queue queue1(context, device); compute::command_queue queue2(context, device);
compute::vector<float> device_vector(n, context); // copy some data to device_vector
// use queue1 boost::compute::transform(device_vector.begin(), device_vector.end(), device_vector.begin(), compute::sqrt<float>(), queue1);
// use queue2 compute::copy(device_vector.begin(), device_vector.end(), some_host_vector.begin(), queue2);
And currently the way to make this behave properly would be to force queue1 to wait for completion of any enqueued job (note: it may be an out-of-order queue!) after transform has been called?
Well this is essentially equivalent to having two separate host threads both reading and writing the same region of memory at the same time; of course you need to synchronize them. For this specific case you could just enqueue a barrier to ensure queue2 doesn't begin its operation before queue1 completes:

// before calling copy() on queue2:
queue2.enqueue_barrier(queue1.enqueue_marker());
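Spelled out, that suggestion looks roughly like this (a sketch reusing the queues and vectors from the example above; some_host_vector is assumed to exist):

    compute::command_queue queue1(context, device);
    compute::command_queue queue2(context, device);

    compute::vector<float> device_vector(n, context);
    // ... copy some data to device_vector ...

    // enqueue the transform on queue1 (asynchronous)
    compute::transform(device_vector.begin(), device_vector.end(),
                       device_vector.begin(), compute::sqrt<float>(), queue1);

    // queue2 waits for everything enqueued on queue1 so far
    queue2.enqueue_barrier(queue1.enqueue_marker());

    // now the copy cannot start before the transform has completed
    compute::copy(device_vector.begin(), device_vector.end(),
                  some_host_vector.begin(), queue2);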
One way could be to make algorithms simply always treated as asynchronous at API level (even if internally they may run synchronous) and get always associated with an event. Another is providing a synchronous and asynchronous overload. I'd certainly prefer to know if it runs synchronous or asynchronous just by looking at the transform invocation itself.
Well let me make this more clear: transform() always runs asynchronously. The only algorithm you really have to worry about is copy() as it is responsible for moving data between the host and device and will do this synchronously. If you want an asynchronous copy then use copy_async() which will return a future that can be used to wait for the copy operation to complete.
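For example, a minimal sketch of the asynchronous variant ('context' and 'queue' are assumed to be set up already; copy_async() returns a future):

    std::vector<float> host_vector(16000000);
    compute::vector<float> device_vector(host_vector.size(), context);

    // asynchronous host -> device copy; returns immediately with a future
    auto f = compute::copy_async(host_vector.begin(), host_vector.end(),
                                 device_vector.begin(), queue);

    // ... enqueue further device work, do unrelated host work, etc. ...

    // block until the copy has actually finished
    f.wait();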
With respect to exception safety, is there any proper behavior defined by your library if transform has been enqueued to run in asynchronous mode, but before it has completed device_vector goes out of scope (e.g. due to an exception thrown in the host code following the transform)? Or is it the user's responsibility to ensure that, whatever happens, device_vector must live until the transform has completed?
The user must ensure that the memory being written to remains valid until the operation completes. Simply imagine you are calling std::transform() on a std::vector<> from a separate std::thread: you must wait for that thread to complete its work before destroying the memory it is writing to. Operations on the compute device can be reasoned about in a similar manner.
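The host-side analogue, as a minimal sketch (needs <thread>, <vector>, <algorithm>, <cmath>):

    std::vector<float> data(1024, 2.0f);

    // a separate thread writes into 'data' asynchronously
    std::thread worker([&data] {
        std::transform(data.begin(), data.end(), data.begin(),
                       [](float x) { return std::sqrt(x); });
    });

    // ... do other work ...

    worker.join(); // must complete before 'data' may be destroyed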
I have some rough ideas, but they'd probably have a deeper impact on your API than you want, at this stage.
Instead, I'm thinking mostly about how to make exception-safe use of the async copy commands to/from host memory. I think async copies will quickly gain popularity with advanced users, and will probably be one of the top optimization tips. I guess it'd be nice to have a scope guard that blocks on boost::compute::event.
Here's another sketch, also considering the points above - though I obviously don't know if it's doable given the implementation + other design considerations I might be missing, so apologies if it's nonsense.
If input/output ranges generally refer to iterators from the boost::compute library, then: -) an iterator can store the container (or other data structure it belongs to, if any) -) an algorithm can, via the iterators, "communicate" with the container(s)
For an input operation the data must be available, unaltered, from the time the input operation is enqueued until its completion. So when transform (as an example) is launched it can inform the input data container that any subsequent modification of it (including destruction / setting new values through iterators) must wait until that input operation has completed - i.e. the first modifying operation blocks until it has finished. Similarly for the output range, except that there any read operation must also block until the output data from the transform has been written to it. So:
-) no matter what causes the destruction of containers (e.g. the end of the block being reached normally, an exception, etc.), the lifetime of the container/iterators extends until the asynchronous operation on it has finished; thus exceptions thrown are implicitly handled.
-) to the user the code appears synchronous with respect to visible behavior, but can run asynchronously in the background.
Obviously a full-fledged version is neither trivial nor cheap with respect to performance (e.g. checking on any read/write to a container whether it must block), let alone the threading aspects. But maybe just parts of it are useful, e.g. deferring container destruction until no OpenCL operation is enqueued to work on the container (-> handling exceptions). I think there's a wide range of balances between what the implementation does automagically and what constraints are placed on the user to not do "stupid" things.
While these are interesting ideas, I feel like this sort of behavior is more high-level/advanced than what the Boost.Compute algorithms are meant to do. I have tried to mimic as closely as possible the "iterators and algorithms" paradigm from the STL, as I consider the design quite elegant. However, these sorts of techniques could definitely be implemented on top of Boost.Compute. I would be very interested to see a proof-of-concept demonstrating these ideas; would you be interested in working on this? -kyle
On 29/12/2014 22:51, Kyle Lutz wrote:
On Mon, Dec 29, 2014 at 1:19 PM, Thomas M
wrote: On 29/12/2014 04:40, Gruenke,Matt wrote:
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 21:24 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
If I understood the behavior of transform correctly (and assuming that for device_vector the input/output ranges count as device side [?]), am I correct that the following can easily fail?:
compute::command_queue queue1(context, device); compute::command_queue queue2(context, device);
compute::vector<float> device_vector(n, context); // copy some data to device_vector
// use queue1 boost::compute::transform(device_vector.begin(), device_vector.end(), device_vector.begin(), compute::sqrt<float>(), queue1);
// use queue2 compute::copy(device_vector.begin(), device_vector.end(), some_host_vector.begin(), queue2);
And currently the way to make this behave properly would be to force queue1 to wait for completion of any enqueued job (note: it may be an out-of-order queue!) after transform has been called?
Well this is essentially equivalent to having two separate host-threads both reading and writing from the same region of memory at the same time, of course you need to synchronize them.
For this specific case you could just enqueue a barrier to ensure queue2 doesn't begin its operation before queue1 completes:
// before calling copy() on queue2: queue2.enqueue_barrier(queue1.enqueue_marker());
Sorry, I haven't expressed myself well. Yes, surely I must synchronize it; it's just that with the OpenCL API itself I can normally provide a (pointer to a) cl_event when calling the clEnqueue... functions, which can subsequently be used quite flexibly to coordinate other operations (i.e. without going to the command-queue level).
One way could be to make algorithms simply always treated as asynchronous at API level (even if internally they may run synchronous) and get always associated with an event. Another is providing a synchronous and asynchronous overload. I'd certainly prefer to know if it runs synchronous or asynchronous just by looking at the transform invocation itself.
Well let me make this more clear: transform() always runs asynchronously. The only algorithm you really have to worry about is copy() as it is responsible for moving data between the host and device and will do this synchronously. If you want an asynchronous copy then use copy_async() which will return a future that can be used to wait for the copy operation to complete.
Now I am really confused :) In this thread I have read, quoting:

[Gruenke, Matt]: "My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?"

[Kyle Lutz] "Yes this is correct. In general, algorithms like transform() are asynchronous when the input/output ranges are both on the device and synchronous when one of the ranges is on the host."

This made me believe that some iterators in your library turn boost::compute::transform into a synchronous operation, and some into an asynchronous one. So now I suppose that this is not the case? In comparison to the OpenCL runtime execution model, can I consider all your algorithms, except copy, as basically acting like clEnqueueNDRangeKernel calls, that is, always asynchronous?
Here's another sketch, also considering the points above.
While these are interesting ideas, I feel like this sort of behavior is more high-level/advanced than what the Boost.Compute algorithms are meant to do. I have tried to mimic as closely as possible the "iterators and algorithms" paradigm from the STL, as I consider the design quite elegant.
However, these sorts of techniques could definitely be implemented on top of Boost.Compute. I would be very interested to see a proof-of-concept demonstrating these ideas, would you be interested in working on this?
Interested yes, though time is currently a problem; I'd need to familiarize myself much more deeply with your implementation. At this stage my main concern is exception safety - how one could relieve users, in a simple manner, from the need to manually take care that objects do not go out of scope (due to an exception thrown) while an OpenCL asynchronous operation still needs them. Note that because your API can throw, I consider exception effects to be of much greater concern than with the (implicitly non-throwing) Khronos API; and just enqueuing a couple of commands can easily make proper, manual cleanup non-trivial.

cheers,
Thomas
On Mon, Dec 29, 2014 at 2:58 PM, Thomas M
On 29/12/2014 22:51, Kyle Lutz wrote:
On Mon, Dec 29, 2014 at 1:19 PM, Thomas M
wrote: On 29/12/2014 04:40, Gruenke,Matt wrote:
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 21:24 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
If I understood the behavior of transform correctly (and assuming that for device_vector the input/output ranges count as device side [?]), am I correct that the following can easily fail?:
compute::command_queue queue1(context, device); compute::command_queue queue2(context, device);
compute::vector<float> device_vector(n, context); // copy some data to device_vector
// use queue1 boost::compute::transform(device_vector.begin(), device_vector.end(), device_vector.begin(), compute::sqrt<float>(), queue1);
// use queue2 compute::copy(device_vector.begin(), device_vector.end(), some_host_vector.begin(), queue2);
And currently the way to make this behave properly would be to force queue1 to wait for completion of any enqueued job (note: it may be an out-of-order queue!) after transform has been called?
Well this is essentially equivalent to having two separate host-threads both reading and writing from the same region of memory at the same time, of course you need to synchronize them.
For this specific case you could just enqueue a barrier to ensure queue2 doesn't begin its operation before queue1 completes:
// before calling copy() on queue2: queue2.enqueue_barrier(queue1.enqueue_marker());
Sorry, I haven't expressed myself well. Yes, surely I must synchronize it; it's just that with the OpenCL API itself I can normally provide a (pointer to a) cl_event when calling the clEnqueue... functions, which can subsequently be used quite flexibly to coordinate other operations (i.e. without going to the command-queue level).
Yes, currently this functionality is not built into the algorithms API. If you want to synchronize between multiple command queues you must currently do so explicitly. With the enqueue_marker() method it is quite easy to get an event object at any point in time, which can be used to synchronize with the host or with other command queues.
One way could be to make algorithms simply always treated as asynchronous at API level (even if internally they may run synchronous) and get always associated with an event. Another is providing a synchronous and asynchronous overload. I'd certainly prefer to know if it runs synchronous or asynchronous just by looking at the transform invocation itself.
Well let me make this more clear: transform() always runs asynchronously. The only algorithm you really have to worry about is copy() as it is responsible for moving data between the host and device and will do this synchronously. If you want an asynchronous copy then use copy_async() which will return a future that can be used to wait for the copy operation to complete.
Now I am really confused :) In this thread I have read, quoting:
[Gruenke, Matt]: "My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?"
[Kyle Lutz] "Yes this is correct. In general, algorithms like transform() are asynchronous when the input/output ranges are both on the device and synchronous when one of the ranges is on the host."
This made me believe that some iterators in your library turn boost::compute::transform into a synchronous operation, and some into an asynchronous one. So now I suppose that this is not the case? In comparison to the OpenCL runtime execution model, can I consider all your algorithms, except copy, as basically acting like clEnqueueNDRangeKernel calls, that is, always asynchronous?
Sorry, I should have prefaced that statement with "For algorithms which accept both host and device iterators, ...". Currently there are very few of those (copy(), sort(), and reduce(), which can write its output through a host iterator). In general, to maximize efficiency, the user should deal almost entirely with device iterators and only synchronize with the host at the beginning to transfer the input and at the end to transfer back the output. And you're correct, nearly all calls to Boost.Compute algorithms result in creating a kernel, setting its arguments, and then calling clEnqueueNDRangeKernel() to execute it asynchronously.
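In other words, the intended usage pattern is roughly (a sketch; 'input' and 'output' are assumed to be std::vector<float>, 'context' and 'queue' already created):

    // transfer the input to the device (synchronous)
    compute::vector<float> d(input.size(), context);
    compute::copy(input.begin(), input.end(), d.begin(), queue);

    // chain device-only algorithms; each call just enqueues a kernel
    compute::transform(d.begin(), d.end(), d.begin(), compute::sqrt<float>(), queue);
    compute::sort(d.begin(), d.end(), queue);

    // transfer the result back (synchronous, so it also drains the queue)
    compute::copy(d.begin(), d.end(), output.begin(), queue);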
Here's another sketch, also considering the points above.
While these are interesting ideas, I feel like this sort of behavior is more high-level/advanced than what the Boost.Compute algorithms are meant to do. I have tried to mimic as closely as possible the "iterators and algorithms" paradigm from the STL, as I consider the design quite elegant.
However, these sorts of techniques could definitely be implemented on top of Boost.Compute. I would be very interested to see a proof-of-concept demonstrating these ideas, would you be interested in working on this?
Interested yes, though time is currently a problem; I'd need to familiarize myself much more deeply with your implementation. At this stage my main concern is exception safety - how one could relieve users, in a simple manner, from the need to manually take care that objects do not go out of scope (due to an exception thrown) while an OpenCL asynchronous operation still needs them. Note that because your API can throw, I consider exception effects to be of much greater concern than with the (implicitly non-throwing) Khronos API; and just enqueuing a couple of commands can easily make proper, manual cleanup non-trivial.
I have tried as much as possible to ensure that the library is exception-safe (I surely hope you aren't assuming that I intentionally made the library non-exception-safe). If you do find any bugs related to exception handling please do submit them with a reproducible test-case to the issue tracker [1] and I will get them fixed as soon as possible. Also, in case you were unaware, you can always disable all exceptions in Boost.Compute by defining BOOST_NO_EXCEPTIONS. -kyle [1] https://github.com/kylelutz/compute/issues
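(For readers unfamiliar with that mechanism: when BOOST_NO_EXCEPTIONS is defined, Boost reports errors through boost::throw_exception(), which the application must then define itself. Roughly, as a sketch - the handler body here is only an example:)

    // BOOST_NO_EXCEPTIONS is typically defined project-wide, e.g. on the compiler command line
    #include <boost/throw_exception.hpp>
    #include <cstdlib>
    #include <iostream>

    void boost::throw_exception(std::exception const &e)
    {
        // called instead of 'throw'; must not return normally
        std::cerr << "boost error: " << e.what() << std::endl;
        std::abort();
    }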
On 30/12/2014 00:49, Kyle Lutz wrote:
On Mon, Dec 29, 2014 at 2:58 PM, Thomas M
wrote: On 29/12/2014 22:51, Kyle Lutz wrote:
On Mon, Dec 29, 2014 at 1:19 PM, Thomas M
wrote:
Interested yes, though time is currently a problem; I'd need to familiarize myself much more deeply with your implementation. At this stage my main concern is exception safety - how one could relieve users, in a simple manner, from the need to manually take care that objects do not go out of scope (due to an exception thrown) while an OpenCL asynchronous operation still needs them. Note that because your API can throw, I consider exception effects to be of much greater concern than with the (implicitly non-throwing) Khronos API; and just enqueuing a couple of commands can easily make proper, manual cleanup non-trivial.
I have tried as much as possible to ensure that the library is exception-safe (I surely hope you aren't assuming that I intentionally made the library non-exception-safe). If you do find any bugs related to exception handling please do submit them with a reproducible test-case to the issue tracker [1] and I will get them fixed as soon as possible.
No, I never assumed you made it intentionally unsafe. However you, as the implementer, probably know far more than users about when something may throw (and how things then behave), and thus which exception guarantees your library implicitly provides. Is there a docs section which explicitly says under which circumstances something may throw and how behavior, e.g. of asynchronous operations already launched, is subsequently defined?

I take your transform tutorial code as an example:

// create a vector on the device
compute::vector<float> device_vector(host_vector.size(), context);

// transfer data from the host to the device
compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);

// calculate the square-root of each element in-place
compute::transform(device_vector.begin(), device_vector.end(), device_vector.begin(), compute::sqrt<float>(), queue);

// copy values back to the host
compute::copy(device_vector.begin(), device_vector.end(), host_vector.begin(), queue);

For neither the copy nor the transform algorithm does the doc explicitly say that it doesn't throw anything (apologies if it says so in another doc section), so as a user writing exception-safe code I must assume the worst, i.e. any of them can throw (for the moment assume that I don't choose to disable throwing by #define). So I must ensure that device_vector is kept alive until all commands which were successfully launched have finished; that appears to boil down to something like a try around the copy-transform-copy, and a queue.finish() (or similar) in the catch.

If there is more than one container / command-queue etc. involved, the complexity of sorting things out for proper cleanup can rise considerably; likewise when, within the program hierarchy, the creation/lifetime management of a container on the one hand and the command-queue ownership / invocation of operations on the container on the other hand are separated (there may simply be no central place which can control both the container lifetime and also have access to the command queue(s)).

Note the aim is not desperately trying to get the results of the transform somehow, just to ensure that the OpenCL runtime is not doing anything disastrous in the asynchronous background, i.e. here accessing memory that has already gone out of scope due to stack unwinding. If your implementation already provides some exception-safety guarantees in the example then these must become easily visible to users.

Thomas
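To spell out the try/finish cleanup described above, a rough sketch:

    compute::vector<float> device_vector(host_vector.size(), context);

    try {
        compute::copy(host_vector.begin(), host_vector.end(),
                      device_vector.begin(), queue);
        compute::transform(device_vector.begin(), device_vector.end(),
                           device_vector.begin(), compute::sqrt<float>(), queue);
        compute::copy(device_vector.begin(), device_vector.end(),
                      host_vector.begin(), queue);
    }
    catch (...) {
        // drain the queue so no enqueued command still references
        // device_vector (or host_vector) once the unwind destroys them
        queue.finish();
        throw;
    }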
On December 29, 2014 6:49:35 PM EST, Kyle Lutz
On Mon, Dec 29, 2014 at 2:58 PM, Thomas M
wrote: Note that because your API can throw I consider exception effects to be of much greater concern than with the (implicitly non-throwing) Khronos API; and just enqueuing a couple of commands can make a proper, manual cleanup really easily non-trivial.
I have tried as much as possible to ensure that the library is exception-safe (I surely hope you aren't assuming that I intentionally made the library non-exception-safe). If you do find any bugs related to exception handling please do submit them with a reproducible test-case to the issue tracker [1] and I will get them fixed as soon as possible. Also, in case you were unaware, you can always disable all exceptions in Boost.Compute by defining BOOST_NO_EXCEPTIONS.
BOOST_NO_EXCEPTIONS invokes a global handler to note a failure. That means after every library call, the user must check a global value to determine if there was an error. Overloading with a Boost.System error_code localizes error checking. (Asio does this, for example.) I consider that a superior way to avoid exception overhead. The question is whether exception overhead is a bottleneck in coding with this library's API. It is necessary to weigh the likelihood of errors against the costs of reporting and detecting them. Conditional logic to check for errors is not pipelining friendly, but throwing exceptions is terribly expensive. -- Rob (Sent from my portable computation engine)
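Purely as an illustration of the Asio-style alternative mentioned above, a hypothetical overload (this does not exist in Boost.Compute today) might look like:

    boost::system::error_code ec;
    compute::copy(device_vector.begin(), device_vector.end(),
                  host_vector.begin(), queue, ec);   // hypothetical overload
    if (ec) {
        // handle the failure locally; nothing was thrown
    }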
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Rob Stewart Sent: Tuesday, December 30, 2014 5:01 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
On December 29, 2014 6:49:35 PM EST, Kyle Lutz wrote:
Also, in case you were unaware, you can always disable all exceptions in Boost.Compute by defining BOOST_NO_EXCEPTIONS.
BOOST_NO_EXCEPTIONS invokes a global handler to note a failure. That means after every library call, the user must check a global value to determine if there was an error. Overloading with a Boost.System error_code localizes error checking. (Asio does this, for example.) I consider that a superior way to avoid exception overhead.
Either way, I don't see how you avoid checking whether an error occurred after each operation. And when you do detect an error, then you can certainly skip some work, but you're still stuck waiting for the non-failed operations to complete before you can destroy any of the datastructures they're using. And that would be much simpler with a good contingent of RAII-friendly constructs, like those we're discussing. Therefore, I don't see the issue of exceptions vs. error codes (nor the mechanism of error code access) being relevant to the discussion of synchronization. The synchronization support needs to be there, no matter what. And it should be relatively unobtrusive and easy to use.
The question is whether exception overhead is a bottleneck in coding with this library's API. It is necessary to weigh the likelihood of errors against the costs of reporting and detecting them. Conditional logic to check for errors is not pipelining friendly, but throwing exceptions is terribly expensive.
By comparison with network programming (since you cited Boost.ASIO), GPU/OpenCL errors should be far less common. Therefore, error handling needn't be optimized for performance. I think our only concern is successful recovery. Indeed, as OpenCL provides no cancellation mechanism, you cannot avoid waiting around for any partial results to be computed that haven't failed.

BTW, the types of operations generating these errors typically involve communication with a device or at least a couple of syscalls. So, the kind of ALU pipeline effects of conditional statements that you mention aren't relevant, here.

Matt
On December 30, 2014 5:58:10 AM EST, "Gruenke,Matt"
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Rob Stewart
BOOST_NO_EXCEPTIONS invokes a global handler to note a failure. That means after every library call, the user must check a global value to determine if there was an error. Overloading with a Boost.System error_code localizes error checking. (Asio does this, for example.) I consider that a superior way to avoid exception overhead.
Either way, I don't see how you avoid checking whether an error occurred after each operation.
You don't, but one is local to the last function call and the other is a global value.
And when you do detect an error, then you can certainly skip some work, but you're still stuck waiting for the non-failed operations to complete before you can destroy any of the datastructures they're using. And that would be much simpler with a good contingent of RAII-friendly constructs, like those we're discussing.
Okay
Therefore, I don't see the issue of exceptions vs. error codes (nor the mechanism of error code access) being relevant to the discussion of synchronization.
People were looking for ways to avoid exceptions. Whether that was to support embedded environments without exception support or to avoid exception overhead, I don't know.
The synchronization support needs to be there, no matter what. And it should be relatively unobtrusive and easy to use.
Sure
The question is whether exception overhead is a bottleneck in coding with this library's API. It is necessary to weigh the likelihood of errors against the costs of reporting and detecting them. Conditional logic to check for errors is not pipelining friendly, but throwing exceptions is terribly expensive.
By comparison with network programming (since you cited Boost.ASIO), GPU/OpenCL errors should be far less common. Therefore, error handling needn't be optimized for performance.
What about embedded programming environments without exception support?
BTW, the types of operations generating these errors typically involve communication with a device or at least a couple syscalls. So, the kind of ALU pipeline effects of conditional statements that you mention aren't relevant, here.
They still apply if the error checking is done on the host side. ___ Rob (Sent from my portable computation engine)
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: December 29, 2014 16:20 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
If input/output ranges generally refer to iterators from the boost::compute library, then: -) an iterator can store the container (or other data structure it belongs to, if any) -) an algorithm can, via the iterators, "communicate" with the container(s)
For an input operation the data must be available throughout & in unaltered manner from the time of enqueuing the input operation until its completion.
Essentially, what I think you want is to embed the equivalent of a shared_ptr to the underlying storage. Event callbacks could be used to decrement it, when the operation has completed. That way, even if the container object is deleted, the device memory sticks around so long as any pending operations reference it.

For host data structures, it'd be nice if they could be restricted to iterators of a special container - let's call it host_mem<>, for the sake of discussion. Unlike mapped_view<>, you could only copy into it, so that it owns the memory. Reads & destruction could block on any pending operations that have been enqueued with it as a destination. If you don't want to pay the penalty of copying into a host_mem<> object, then just do an explicit command_queue::enqueue_write_buffer() or command_queue::enqueue_write_buffer_async(). Another benefit of requiring host_mem<> is that the host -> host copy is explicit.

BTW, for command_queue-level operations, Kyle seems to agree with my proposal to have a sort of guard object that calls event::wait(), upon destruction. I actually wonder if it makes sense to call this a 'guarantee', since it doesn't quite behave like traditional guard objects. He's even accepted my offer to code this up.

Matt
On 30/12/2014 00:56, Gruenke,Matt wrote:
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: December 29, 2014 16:20 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
If input/output ranges generally refer to iterators from the boost::compute library, then: -) an iterator can store the container (or other data structure it belongs to, if any) -) an algorithm can, via the iterators, "communicate" with the container(s)
For an input operation the data must be available throughout & in unaltered manner from the time of enqueuing the input operation until its completion.
Essentially, what I think you want is to embed the equivalent of a shared_ptr to the underlying storage. Event callbacks could be used to decrement it, when the operation has completed. That way, even if the container object is deleted, the device memory sticks around so long as any pending operations reference it.
yes, I was thinking of a shared_ptr for the implementation part.
BTW, for command_queue-level operations, Kyle seems to agree with my proposal to have a sort of guard object that calls event::wait(), upon destruction. I actually wonder if it makes sense to call this a 'guarantee', since it doesn't quite behave like traditional guard objects. He's even accepted my offer to code this up.
If you want to code that functionality up that's more than fine with me. I am not happy, though, with the need to explicitly, i.e. manually, associate guards with command-queue operations just for the sake of exception-safety (I am referring only to that here). And off the top of my head I cannot come up with a single example where a memory container is fine to die while OpenCL operations are still running on it or enqueued to run on it, so that's why I prefer a mechanism which always does it implicitly (unless maybe the user wants to take it over explicitly).

Your proposal looks like:

boost::compute::guard wg(cq.enqueue_write_buffer_async(dv.get_buffer(), 0, hv.size() * sizeof(float), &hv.front()));

which appears to me quite lengthy and more error-prone (forgetting the guard) as a way to get exception safety.

Now assume that an iterator can store its underlying container: isn't it possible that the enqueue_write_buffer_async itself places some guard-like structure into the container, and when the container's destructor is invoked it calls wait() on all of them?

Thomas
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: Tuesday, December 30, 2014 2:49 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
BTW, for command_queue-level operations, Kyle seems to agree with my proposal to have a sort of guard object that calls event::wait(), upon destruction. I actually wonder if it makes sense to call this a 'guarantee', since it doesn't quite behave like traditional guard objects. He's even accepted my offer to code this up.
If you want to code that functionality up that's more than fine with me. I am not happy though with the need to explicitly, i.e. manually, associate guards with command-queue operations just for the sake of exception-safety (I am referring only to that here). And out of the box I cannot come up with a single example when a memory container is fine to die while OpenCL operations are still running on it or enqueued to run on it, so that's why I prefer a mechanisms which does it always, implicitly (unless maybe the user wants to take over it explicitly).
Your proposal looks like:
boost::compute::guard wg(cq.enqueue_write_buffer_async(dv.get_buffer(), 0, hv.size() * sizeof(float), &hv.front()));
which appears to me quite lengthy and more error-prone (forgetting the guard) to get exception safety.
Yes, except I'm now proposing to call it a 'guarantee', since it guarantees that the event will have completed within its lifetime.
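A minimal sketch of such a 'guarantee' (hypothetical code, not part of Boost.Compute; a real version would also need to swallow errors thrown by wait() in the destructor):

    class guarantee {
    public:
        explicit guarantee(const boost::compute::event &e) : event_(e) {}

        guarantee(const guarantee &) = delete;
        guarantee &operator=(const guarantee &) = delete;

        ~guarantee()
        {
            // block until the command has completed (or failed)
            if (event_.get() != 0) {
                event_.wait();
            }
        }

    private:
        boost::compute::event event_;
    };

    // usage:
    // guarantee g(cq.enqueue_write_buffer_async(buf, 0, bytes, ptr));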
Now assume that an iterator can store its underlying container, isn't it possible that the enqueue_write_buffer_async itself places some guard-like structure into the container, and when the container's destructor is invoked it calls wait() on all of them?
Keep in mind that the guarantee is proposed to be used with command_queue operations, which accept only pointers and buffer objects - not iterators. I think it's not unreasonable if the low-level interface requires a bit more care to use, if that makes it more flexible and efficient. For instance, what if the datastructures on which you're operating are protected from exceptions by some other means (e.g. tied to a scope outside the try block)? Are you still going to force the caller to pay the price of copying the data into a special container? To avoid such overheads, do they then have to drop down to the C interface, and forego all of the other benefits of Boost.Compute?

I agree with Kyle's approach of having a fairly direct and simple interface at the low level, while trying to make the high-level interface easy & safe. I also agree with you that the high-level interface could be made both safer *and* more efficient, by restricting it to operate only on Boost.Compute containers with some additional facilities (i.e. refcounting in device containers and a guarantee in the host wrapper). If he'd just buy into that, I think it'd give Boost.Compute all the safety, efficiency, flexibility, and ease of use that anyone needs.

There are always going to be tradeoffs. I'd prefer that, rather than the library deciding how to strike the best balance, it give the users a set of good options and let them decide which combination best fits the bill.

Matt
On 30/12/2014 09:55, Gruenke,Matt wrote:
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: Tuesday, December 30, 2014 2:49 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
BTW, for command_queue-level operations, Kyle seems to agree with my proposal to have a sort of guard object that calls event::wait(), upon destruction. I actually wonder if it makes sense to call this a 'guarantee', since it doesn't quite behave like traditional guard objects. He's even accepted my offer to code this up.
If you want to code that functionality up that's more than fine with me. I am not happy though with the need to explicitly, i.e. manually, associate guards with command-queue operations just for the sake of exception-safety (I am referring only to that here). And out of the box I cannot come up with a single example when a memory container is fine to die while OpenCL operations are still running on it or enqueued to run on it, so that's why I prefer a mechanisms which does it always, implicitly (unless maybe the user wants to take over it explicitly).
Your proposal looks like:
boost::compute::guard wg(cq.enqueue_write_buffer_async(dv.get_buffer(), 0, hv.size() * sizeof(float), &hv.front()));
which appears to me quite lengthy and more error-prone (forgetting the guard) to get exception safety.
Yes, except I'm now proposing to call it a 'guarantee', since it guarantees that the event will have completed within its lifetime.
Now assume that an iterator can store its underlying container, isn't it possible that the enqueue_write_buffer_async itself places some guard-like structure into the container, and when the container's destructor is invoked it calls wait() on all of them?
Keep in mind that the guarantee is proposed to be used with command_queue operations, which accept only pointers and buffer objects - not iterators. I think it's not unreasonable if the low level interface requires a bit more care to use, if that makes it more flexible and efficient.
Ok, I missed the non-iterator cases and see your point.

If you are going to implement such RAII guards here's a short wish-list of features / guard classes:

a) make guards "transferable" across functions
b) a container of guards and/or a guard for a wait_list as a whole
c) a guard for a command-queue as a whole [possibly guards for other classes as well]

a) + b) because something like this is really useful:

void foo()
{
    // setup all memory objects etc.

    some_guard guard;
    make_async_copy(..., guard);

    some_guard_container cont; // or cont is a wait_list on which a guard can be placed
    more_async_copies(..., cont);
    yet_more_async_copies(..., cont);
    even_more_async_copies(..., cont);
    cont.wait();

    some_guard guard2;
    make_async_copy(..., guard2);
}

The ...async_copy functions can thus launch asynchronous operations but do not themselves wait before returning; ownership of the initially obtained guards gets passed on to foo. And a single guard can refer to multiple events/asynchronous operations as a whole group.

With c) I have something like this in mind:

{
    some_cmd_queue_guard qu_guard(queue);

    queue.enqueue_write_buffer_async(...);
    transform(..., queue); // implicitly async
    queue.enqueue_read_buffer_async(...);

    // here automatic synchronization occurs
}

where the destructor of some_cmd_queue_guard calls queue.finish(), or similar. So there is no need to place a separate guard on every single operation; for regularly executing code, automatic synchronization with the host occurs at the end of the block, and in the case of an exception automatic protection is given.

Thomas
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: Tuesday, December 30, 2014 7:37 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
If you are going to implement such RAII guards here's a short wish-list of features / guard classes:
a) make guards "transferable" across functions
I agree they should be movable, but it makes no sense for them to be copyable.
b) a container of guards and/or a guard for a wait_list as whole
Hmmm... I can see the benefits (convenience). I'd make it a different type, though. I assume it should hold a reference to the list? Since the guarantee is designed to block when the wait_list goes out of scope, I think it's reasonable to assume its scope is a superset of the guarantee's.
c) a guard for a command-queue as whole [possibly guards for other classes as well]
Why? Convenience? Unless you're using it as a shorthand for waiting on individual events or wait_lists, there's no need. The command_queue is internally refcounted. When the refcount goes to zero, the destructor will block on all outstanding commands.
a) + b) because something like this is really useful:
Um... how about this:

void foo()
{
    // setup all memory objects etc.

    wait_list wl;
    wait_list::guarantee wlg(wl);

    // send data to device
    wl.insert(cq.enqueue_write_buffer_async(devmem, 0, size, host_ptr));
    wl.insert(cq.enqueue_write_buffer_async(devmem2, 0, size, host_ptr2));

    // a kernel that reads devmem and devmem2 and writes to devmem
    wl.insert(cq.enqueue_task(kern, wl)); // Note: wl is copied by enqueue funcs

    // copy result back to host
    wl.insert(cq.enqueue_read_buffer_async(devmem, 0, size, host_ptr, wl));

    // wl.wait() would only be necessary if you wanted to access the results, here.

    // Enqueue an independent set of operations with another wait_list
    wait_list wl_b;
    wait_list::guarantee wlg_b(wl_b);

    // send data to device
    wl_b.insert(cq.enqueue_write_buffer_async(devmem_b, 0, size_b, host_ptr_b));

    // ...
}
With c) I have something like this in mind:
What about this?

{
    command_queue cq(cntx, dev);
    command_queue::guarantee cqg(cq);

    cq.enqueue_write_buffer_async(devmem, 0, size, host_ptr);
    transform(..., cq); // implicitly async
    cq.enqueue_read_buffer_async(...);

    // here automatic synchronization occurs
}

It does presume that command_queues are local and tied to related batches of computations. Those assumptions won't always hold.

Matt
Thomas M wrote:
If you are going to implement such RAII guards here's a short wish-list of features / guard classes:
I don't know much about the subject, but I did look (quickly) at SYCL, and it appears to have taken lifetime issues into account, so maybe it would be worth it to see how it solves the problem.
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Peter Dimov Sent: Tuesday, December 30, 2014 8:57 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
Thomas M wrote:
If you are going to implement such RAII guards here's a short wish-list of features / guard classes:
I don't know much about the subject, but I did look (quickly) at SYCL, and it appears to have taken lifetime issues into account, so maybe it would be worth it to see how it solves the problem.
They seem to use an approach similar to the host_mem<> container I suggested (sans copying):

"Memory objects can be constructed with or without attached host memory. If no host memory is attached at the point of construction, then destruction of that memory object is non-blocking. The user can optionally provide a storage object which specifies the behaviour on construction and destruction (see Storage section 2.7.1 below). If host memory is attached and the user does not supply a storage object, then the default behaviour is followed, which is that the destructor will block until any command groups operating on the memory object have completed, then, if the contents of the memory object is modified on a device those contents are copied back to host and only then does the destructor return."

The reason I like copying is that, if the memory is owned by the container, you can implement future<>-like blocking when there are pending writes to it. Also, when copying into it, the data can be made contiguous. But I suppose copying might be a high price to pay, for those benefits.

BTW, I only intended guarantees as a low-level mechanism. It was always my belief that the high-level interface would best be served by a host_mem<> container and some form of protection (either blocking or refcounting) embedded in the device containers.

Matt
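For comparison, the SYCL idiom being described looks roughly like this (a sketch based on the provisional spec wording above; exact names and constructor details may differ):

    {
        std::vector<float> host(n, 1.0f);

        // buffer constructed over attached host memory
        cl::sycl::buffer<float, 1> buf(host.data(), cl::sycl::range<1>(n));

        // ... submit command groups that read/write buf ...

    } // destructor blocks until those command groups complete, copies any
      // device-side modifications back to 'host', and only then returns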
On 30/12/14 14:48, Gruenke,Matt wrote:
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: Tuesday, December 30, 2014 7:37 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
If you are going to implement such RAII guards here's a short wish-list of features / guard classes:
a) make guards "transferable" across functions

I agree they should be movable, but it makes no sense for them to be copyable.

b) a container of guards and/or a guard for a wait_list as a whole

Hmmm... I can see the benefits (convenience). I'd make it a different type, though. I assume it should hold a reference to the list? Since the guarantee is designed to block when the wait_list goes out of scope, I think it's reasonable to assume its scope is a superset of the guarantee's.

c) a guard for a command-queue as a whole [possibly guards for other classes as well]

Why? Convenience? Unless you're using it as a shorthand for waiting on individual events or wait_lists, there's no need. The command_queue is internally refcounted. When the refcount goes to zero, the destructor will block on all outstanding commands.

a) + b) because something like this is really useful:

Um... how about this:
void foo() { // setup all memory objects etc.
wait_list wl; wait_list::guarantee wlg(wl);
// send data to device wl.insert(cq.enqueue_write_buffer_async(devmem, 0, size, host_ptr)); wl.insert(cq.enqueue_write_buffer_async(devmem2, 0, size, host_ptr2));
// a kernel that reads devmem and devmem2 and writes to devmem wl.insert(cq.enqueue_task(kern, wl)); // Note: wl is copied by enqueue funcs
// copy result back to host wl.insert(cq.enqueue_read_buffer_async(devmem, 0, size, host_ptr, wl));
// wl.wait() would only be necessary if you wanted to access the results, here.
// Enqueue an independent set of operations with another wait_list wait_list wl_b; wait_list::guarantee wlg_b(wl_b);
// send data to device wl_b.insert(cq.enqueue_write_buffer_async(devmem_b, 0, size_b, host_ptr_b));
// ... }
Maybe you can follow the task_region design (See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4088.pdf).
With c) I have something like this in mind:

What about this?
{
    command_queue cq(cntx, dev);
    command_queue::guarantee cqg(cq);

    cq.enqueue_write_buffer_async(devmem, 0, size, host_ptr);
    transform(..., cq); // implicitly async
    cq.enqueue_read_buffer_async(...);

    // here automatic synchronization occurs
}
It does presume that command_queues are local and tied to related batches of computations. Those assumptions won't always hold.

The same here.
Best, Vicente
Upon further consideration... -----Original Message----- From: Gruenke,Matt Sent: Tuesday, December 30, 2014 8:49 To: boost@lists.boost.org Subject: RE: [boost] Synchronization (RE: [compute] review)
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: Tuesday, December 30, 2014 7:37 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
c) a guard for a command-queue as whole [possibly guards for other classes as well]
Why? Convenience?
Unless you're using it as a shorthand for waiting on individual events or wait_lists, there's no need. The command_queue is internally refcounted. When the refcount goes to zero, the destructor will block on all outstanding commands.
I should've listened to myself more carefully. I think this means command_queue-level guarantees aren't necessary or useful, because you've either got:

1) A local command_queue, with a refcount of 1, in which case it will block upon destruction.

2) A copy of a nonlocal command_queue, in which case there may be unrelated commands in the queue.

It's up to Kyle if he wants it, but I'd skip the command_queue::guarantee, as it would likely lead to confusion, misuse, and generally sloppy code.

Matt
On 30/12/2014 15:01, Gruenke,Matt wrote:
Upon further consideration...
-----Original Message----- From: Gruenke,Matt Sent: Tuesday, December 30, 2014 8:49 To: boost@lists.boost.org Subject: RE: [boost] Synchronization (RE: [compute] review)
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: Tuesday, December 30, 2014 7:37 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
c) a guard for a command-queue as whole [possibly guards for other classes as well]
Why? Convenience?
Unless you're using it as a shorthand for waiting on individual events or wait_lists, there's no need. The command_queue is internally refcounted. When the refcount goes to zero, the destructor will block on all outstanding commands.
To some degree it's convenience, that's correct. I find this both more explicit and more straightforward to code:

command_queue::guarantee cqg(cq);
cq.enqueue_write_buffer_async(...);
transform(..., cq);
cq.enqueue_read_buffer_async(...);
...

than

wait_list wl;
wait_list::guarantee wlg(wl);
wl.insert(cq.enqueue_write_buffer_async(...));
wl.insert(cq.enqueue_task(kern, wl));
wl.insert(cq.enqueue_read_buffer_async(...));

I don't see anything wrong with a compact form which also emphasizes the grouped nature of all the following commands, if that's the situation; also, I don't really know what enqueue_task would look like exactly, but the STL-ish similarity is lost.
I should've listened to myself more carefully. I think this means command_queue-level guarantees aren't necessary or useful, because you've either got:
1) A local command_queue, with a refcount of 1, in which case it will block upon destruction.
2) A copy of a nonlocal command_queue, in which case there may be unrelated commands in the queue.
I cannot fully follow this. Let's presume the convenience aspect is appreciated; then one reason to offer these guards is to make sure that other objects, e.g. memory objects, still persist while an asynchronous operation might use them:

    std::vector<float> host_data(...);
    cq.enqueue_write_buffer_async(..., &host_data.front());
    // here run some algorithm(s)
    cq.enqueue_read_buffer_async(..., &host_data.front());

If the command_queue is local then it must be constructed after host_data, as otherwise at the time of the queue's destruction, i.e. when it waits for the enqueued operations to finish, host_data is already dead. To me an explicit guard which I can place flexibly into any block/scope is far clearer (and safer) to handle than the declaration ordering of "primary" objects.

As for the second point, I don't see how this alters the game. Every scope must ensure that its relevant operations have completed before resources required by those operations are released; if that means commands enqueued earlier (outside) must also complete, so be it. I am not saying that such a command-queue guarantee is always the right thing.

Thomas
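To make the ordering issue concrete, here is a minimal sketch, assuming (as claimed earlier in this thread) that a local command_queue waits for its outstanding commands when it is destroyed; devmem, bytes, ctx and dev are placeholders:

    #include <vector>
    #include <boost/compute/core.hpp>

    void broken(const boost::compute::context &ctx,
                const boost::compute::device &dev,
                boost::compute::buffer &devmem, std::size_t bytes)
    {
        boost::compute::command_queue cq(ctx, dev);   // declared first...
        std::vector<float> host_data(bytes / sizeof(float));
        cq.enqueue_write_buffer_async(devmem, 0, bytes, host_data.data());
        // ...so host_data is destroyed first, while cq's destructor may still
        // be waiting on a command that reads from it: a dangling host pointer.
    }

    void safe(const boost::compute::context &ctx,
              const boost::compute::device &dev,
              boost::compute::buffer &devmem, std::size_t bytes)
    {
        std::vector<float> host_data(bytes / sizeof(float));
        boost::compute::command_queue cq(ctx, dev);   // constructed after host_data,
        cq.enqueue_write_buffer_async(devmem, 0, bytes, host_data.data());
        // so it is destroyed (and waits) before host_data is released
    }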
Let's take another look at the big picture.

In its current form, the high-level interface of Boost.Compute has safe (if overkill) synchronization with host datastructures. There's currently insufficient synchronization of device containers. Finally, the low level interface lacks the RAII objects needed to facilitate exception-safe code.

In this thread, we've proposed:

1. Host memory containers, as a synchronization *optimization* of high-level operations involving host memory. This seems similar to the approach employed by SYCL (thanks to Peter Dimov).

2. You suggested device containers & their iterators could have the machinery to make them exception-safe. This could take the form of either refcounting, or synchronization.

3. RAII wrappers for events and wait_lists.

Remember, if Kyle simply adopts #2, then regardless of the status of #1, I believe the high-level interface will be safe. So, there should be no need to use #3 with high-level operations.

Considering the above, the existing synchronization behavior of command_queue, and the ordering dependency you mentioned (i.e. that all datastructures would need to be in scope before the guarantee on the command_queue), I conclude that it's not necessary and actually too error-prone to synchronize at the command_queue level.

And regarding wait_lists, the reason I agreed with your suggestion to offer guarantees for them is that they're the primary mechanism for defining command execution order. Since I think most code using the low level interface will probably be using wait_lists anyhow, there's an undeniable convenience to offering guarantees at that level.

As far as I'm aware, Kyle has expressed casual interest in supporting #1 (but not forcing it, presumably because this would break the nice iterator interface that many people like). I don't know where he stands on #2. And he's agreed to seriously consider #3 (in the form of a patch I've offered to submit).

I think we should focus on #2. I'd really like to know his stance on it.

Matt
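For concreteness, a minimal sketch of what #3 could look like; the class name and interface are invented here, and a real version would need more care (e.g. what to report if wait() fails during unwinding):

    #include <boost/compute/core.hpp>

    // Hypothetical RAII "guarantee": waits on the wrapped event when it goes
    // out of scope, even while an exception is propagating.
    class event_guarantee
    {
    public:
        event_guarantee() : m_armed(false) {}
        explicit event_guarantee(const boost::compute::event &e)
            : m_event(e), m_armed(true) {}

        event_guarantee(event_guarantee &&other)
            : m_event(other.m_event), m_armed(other.m_armed)
        {
            other.m_armed = false;
        }

        event_guarantee &operator=(event_guarantee &&other)
        {
            wait_if_armed();
            m_event = other.m_event;
            m_armed = other.m_armed;
            other.m_armed = false;
            return *this;
        }

        event_guarantee(const event_guarantee &) = delete;
        event_guarantee &operator=(const event_guarantee &) = delete;

        ~event_guarantee() { wait_if_armed(); }

    private:
        void wait_if_armed()
        {
            if (m_armed) {
                try { m_event.wait(); }   // block until the command completes
                catch (...) {}            // never let wait() throw from a destructor
            }
            m_armed = false;
        }

        boost::compute::event m_event;
        bool m_armed;
    };

Usage would be e.g. event_guarantee g(cq.enqueue_read_buffer_async(...)); the read is then guaranteed to have completed before g's scope exits, whether it exits normally or by exception.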
On Tue, Dec 30, 2014 at 5:31 PM, Gruenke,Matt
Let's take another look at the big picture.
In its current form, the high-level interface of Boost.Compute has safe (if overkill) synchronization with host datastructures. There's currently insufficient synchronization of device containers. Finally, the low level interface lacks the RAII objects needed to facilitate exception-safe code.
In this thread, we've proposed:
1. Host memory containers, as a synchronization *optimization* of high-level operations involving host memory. This seems similar to the approach employed by SYCL (thanks to Peter Dimov).
2. You suggested device containers & their iterators could have the machinery to make them exception-safe. This could take the form of either refcounting, or synchronization.
3. RAII wrappers for events and wait_lists.
Remember, if Kyle simply adopts #2, then regardless of the status of #1, I believe the high-level interface will be safe. So, there should be no need to use #3 with high-level operations.
Considering the above, the existing synchronization behavior of command_queue, and the ordering dependency you mentioned (i.e. that all datastructures would need to be in scope before the guarantee on the command_queue), I conclude that it's not necessary and actually too error-prone to synchronize at the command_queue level.
And regarding wait_lists, the reason I agreed with your suggestion to offer guarantees for them is that they're the primary mechanism for defining command execution order. Since I think most code using the low level interface will probably be using wait_lists anyhow, there's an undeniable convenience to offering guarantees at that level.
As far as I'm aware, Kyle has expressed casual interest in supporting #1 (but not forcing it, presumably because this would break the nice iterator interface that many people like). I don't know where he stands on #2. And he's agreed to seriously consider #3 (in the form of a patch I've offered to submit).
I think we should focus on #2. I'd really like to know his stance on it.
First off, I'd just like to thank you guys for bringing this issue up and actually proposing solutions. Currently I'm leaning towards solution #2 as well but I'll need some more time to think all of this over. Also, solution #3 is the least-intrusive and could also suit other use-cases so I think it would be worthwhile to pursue. -kyle
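To give #2 a concrete shape for discussion, a purely hypothetical sketch (nothing like this exists in Boost.Compute today): a device container that records the events of asynchronous operations still using its buffer and waits on them before releasing the storage.

    #include <vector>
    #include <boost/compute/core.hpp>

    // Hypothetical exception-safe device container (sketch for proposal #2).
    template <class T>
    class guarded_device_vector
    {
    public:
        guarded_device_vector(std::size_t n, const boost::compute::context &ctx)
            : m_buffer(ctx, n * sizeof(T))
        {
        }

        ~guarded_device_vector()
        {
            // The buffer must not be released while commands still use it,
            // even if we get here via stack unwinding after an exception.
            for (std::size_t i = 0; i < m_pending.size(); ++i) {
                try { m_pending[i].wait(); }
                catch (...) {}   // never throw from a destructor
            }
        }

        // Every asynchronous operation touching the buffer registers its event.
        void add_pending(const boost::compute::event &e) { m_pending.push_back(e); }

        const boost::compute::buffer &get_buffer() const { return m_buffer; }

    private:
        boost::compute::buffer m_buffer;
        std::vector<boost::compute::event> m_pending;
    };

Whether the real thing should track events like this or refcount the underlying memory object is exactly the refcounting-versus-synchronization choice mentioned in #2.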
On 31/12/2014 02:31, Gruenke,Matt wrote:
Let's take another look at the big picture.
In its current form, the high-level interface of Boost.Compute has safe (if overkill) synchronization with host datastructures. There's currently insufficient synchronization of device containers. Finally, the low level interface lacks the RAII objects needed to facilitate exception-safe code.
In this thread, we've proposed:
1. Host memory containers, as a synchronization *optimization* of high-level operations involving host memory. This seems similar to the approach employed by SYCL (thanks to Peter Dimov).
2. You suggested device containers & their iterators could have the machinery to make them exception-safe. This could take the form of either refcounting, or synchronization.
3. RAII wrappers for events and wait_lists.
Remember, if Kyle simply adopts #2, then regardless of the status of #1, I believe the high-level interface will be safe. So, there should be no need to use #3 with high-level operations.
Considering the above, the existing synchronization behavior of command_queue, and the ordering dependency you mentioned (i.e. that all datastructures would need to be in scope before the guarantee on the command_queue), I conclude that it's not necessary and actually too error-prone to synchronize at the command_queue level.
And regarding wait_lists, the reason I agreed with your suggestion to offer guarantees for them is that they're the primary mechanism for defining command execution order. Since I think most code using the low level interface will probably be using wait_lists anyhow, there's an undeniable convenience to offering guarantees at that level.
As far as I'm aware, Kyle has expressed casual interest in supporting #1 (but not forcing it, presumably because this would break the nice iterator interface that many people like). I don't know where he stands on #2. And he's agreed to seriously consider #3 (in the form of a patch I've offered to submit).
I think we should focus on #2. I'd really like to know his stance on it.
I don't consider the proposed ways to be mutually exclusive, or at least not as much as I interpret your post. To me it makes most sense to offer users several options, at various abstraction levels, so the choice can be made flexibly given the problem at hand.

With respect to the command_queue "guarantee" I keep my opinion that I consider it absolutely useful; I also don't see why it's more error-prone in general (see below). FWIW in my real code it's a rather frequent case to execute a set of operations on an (in-order) command-queue as a batch (copy input to device, run several kernels, copy output to host), and yes, then I do make the synchronization at the command-queue level at the end. Why should I fiddle around with a number of individual events (paying unnecessary OpenCL overhead for setting them all up, plus littering my host code) when all that matters is the execution of the set as a whole? Certainly not every problem is suited for doing so, but why make things unnecessarily complicated where it is applicable?

With respect to error-proneness (e.g. regarding the lifetime of other objects), how does a wait_list differ? All that matters is the point at which the guarantee was created:

    wait_list wl;
    wait_list::guarantee wlg(wl);
    wl.insert(cq.enqueue_write_buffer_async(...));
    wl.insert(cq.enqueue_...);

If some object was created after wlg but used in a subsequent asynchronous operation, then this blows up just as much here as it does for a command-queue guarantee; in either case those event(s) must be handled separately.

With respect to performance there can be two-fold advantages of synchronizing at the command-queue level: firstly, as already mentioned, not having to pay the OpenCL overhead for creating and waiting on a number of individual events; and secondly, for an out-of-order command queue, issuing explicit waits for all events, which must occur in some order, may cause unnecessary rearrangement of the queue's execution order (AFAIK the implementation is free to give preference to the just-issued wait or not).

I think there are very legitimate reasons why the plain OpenCL API allows synchronization based on scalar events, arrays of events, or other abstraction levels such as command queues. Why should C++-wrapped guarantees cover some of them but exclude the others?

cheers, Thomas
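For reference, the command-queue-level "wait at the end of the batch" that Thomas describes can be sketched in a few lines on top of command_queue::finish(); the guard class itself is invented for illustration:

    #include <boost/compute/core.hpp>

    // Hypothetical scope guard: finishes everything enqueued on the queue
    // before the enclosing scope exits, including during stack unwinding.
    class queue_finish_guard
    {
    public:
        explicit queue_finish_guard(boost::compute::command_queue &q)
            : m_queue(q)
        {
        }

        ~queue_finish_guard()
        {
            try { m_queue.finish(); }   // wait for the whole batch
            catch (...) {}              // never throw from a destructor
        }

    private:
        boost::compute::command_queue &m_queue;
    };

The caveat raised elsewhere in this thread still applies: everything used by the enqueued commands must outlive the guard, so the guard has to be the last object declared in the scope it protects.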
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: Wednesday, December 31, 2014 2:14 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
On 31/12/2014 02:31, Gruenke,Matt wrote:
I don't consider the proposed ways to be mutually exclusive, or at least not as much as I interpret your post.
They're not, nor did I mean to imply they were. I think they're very much complementary. Ideally, we'd get all three. I consider #2 the highest priority, because writing exception-safe code without it would be very difficult and cumbersome. #3 is similarly valuable, if you're directly using the command_queue. #1 is "only" an optimization - not a correctness/safety issue.
With respect to the command_queue "guarantee" I keep my opinion that I consider it absolutely useful; I also don't see why it's more error-prone in general (see below).
It's error-prone, because the guarantee must have a smaller scope than all of the containers it's intended to protect. You can't just instantiate new containers as you need them. That type of usage error isn't possible with event guarantees. You're correct to point out that this applies equally to wait_list guarantees. Hence, I am withdrawing my support for them.
With respect to performance there can be two-fold advantages of synchronizing at the command-queue level: firstly as already mentioned not having to pay the OpenCL overhead for creating and waiting on a number of individual events;
He's always creating the events. So, the only difference is whether you wait on them. Waiting should amount to a mutex-protected flag check (and a pthread_cond_wait(), if the event is incomplete). Locking and releasing mutexes is cheap, unless it actually blocks. In some tests I ran years ago, it was on the order of a hundred cycles on Linux, maybe less.
and secondly for an out-of-order command queue the issuing of explicit waits for all events, which must occur in some order, may cause unnecessary rearrangement/ordering of the queue's execution order (AFAIK it's the implementers freedom if it gives preference to the just-issued wait or not).
So, waiting on an event is going to change their scheduling? Call me a skeptic, but I doubt that. If it does happen, then a better approach would be to enqueue a marker and wait on that. Or just make an explicit call to command_queue::wait(), if you know there are no other commands in there.

In any case, the issue is probably moot. First, it seems like Boost.Compute won't be supporting out-of-order command queues. Second, unless you're using async copies and avoiding algorithms that implicitly synchronize for other reasons (i.e. to return results by value), you'll block on the results before you ever get near the end of scope (in the non-error case, that is).

I think our positions are fairly entrenched, at this point. But don't let me stop you from trying to convince Kyle. In the end, his is the only opinion that matters.

Matt
On 31/12/2014 09:15, Gruenke,Matt wrote:
I think our positions are fairly entrenched, at this point. But don't let me stop you from trying to convince Kyle. In the end, his is the only opinion that matters.
Let me throw in a proposal to dissolve our entrenchment ;): boost::compute::...::guarantee is a true guarantee and refers only to events (or whatever can fulfill a true guarantee); boost::compute::...::wait (... being e.g. an event, wait_list, command_queue, all sorts of stuff) does basically the same, i.e. wait, but the name makes clear it isn't a full guarantee. Thus it provides convenience but doesn't prevent you from shooting yourself in the foot if you are careless - sounds like perfect C++ philosophy ;). I am very open to different names (wait, waiter, wait_lock, wait_on_me, eaahhhh ... anything short + concise much appreciated). best, Thomas
I just wanted to remind anyone following this that, if Kyle adopts #2 (embedding synchronization or refcounting in device memory containers), these waits/guards/guarantees/etc. wouldn't be of interest to the majority of users. So, I think it's a lot of debate over what's essentially a corner case. And event guarantees get the job done, while anything else is just easier to use (and misuse). I hope you don't feel that's unfair, Thomas.

Matt
On 31/12/2014 11:00, Gruenke,Matt wrote:
I just wanted to remind anyone following this that, if Kyle adopts #2 (embedding synchronization or refcounting in device memory containers), these waits/guards/guarantees/etc. wouldn't be of interest to the majority of users. So, I think it's a lot of debate over what's essentially a corner case. And event guarantees get the job done, while anything else is just easier to use (and misuse).
Let me put it this way: the library should, by itself - that is, implicitly / hidden from the user - be as exception-safe as possible (I think we agree on that). Whatever gets us closest to that is perfectly fine with me. If I don't need to handle a single guarantee/wait etc. by myself, great. And as a side effect that wipes out probably the biggest source of misuse (= requiring the user to do something explicit in order to be protected).

Currently I just cannot see, technically speaking, how this can be done throughout, e.g. when the library interfaces with non-library objects (like plain void * host memory) or via the raw OpenCL objects; hence tools are useful which - to be applied explicitly by the user - help out here. Since there's a range of such interactions imaginable, with different preconditions, the tools offered should equally provide some diversity. I am against forcing users to a single but clumsy tool (an all-powered event-level guarantee) if the preconditions in a given application can be much more relaxed and something higher-level does the job as well.

cheers, Thomas
On 31/12/2014 12:51, Thomas M wrote:
Since there's a range of such interactions imaginable, with different preconditions, the tools offered should equally provide some diversity. I am against forcing users to a single but clumsy tool (an all-powered event-level guarantee) if the preconditions in a given application can be much more relaxed and something higher-level does the job as well.
Matt, here's an example where IMHO a plain event-level guarantee as the only tool won't do it:

    for (int i = 0; i < n; ++i) {
        cq.enqueue_write_buffer_async(devmem[i], 0, size[i], host_ptr[i]);
    }

If there's a true guarantee around the cq.enqueue_write_buffer_async:

    for (int i = 0; i < n; ++i) {
        boost::compute::guarantee guar(
            cq.enqueue_write_buffer_async(devmem[i], 0, size[i], host_ptr[i]));
    }

then the asynchronous nature is totally lost (it blocks on every iteration). If the guarantee can be "moved outside of the loop" to avoid the blocking, something like:

    boost::compute::guarantee_list guarList;
    for (int i = 0; i < n; ++i) {
        guarList.move_into(boost::compute::guarantee(
            cq.enqueue_write_buffer_async(devmem[i], 0, size[i], host_ptr[i])));
    }

then it's not a guarantee in your sense any more, because I can obviously also move it out to any improper scope, losing the protection. The way to go is a higher-level protection (e.g. waiting on a wait_list / command_queue); thus the asynchronous copy nature is kept, and it is the user's responsibility to place the protection at an "appropriate" scope.

Thomas
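One way to keep the copies asynchronous while still getting a single, well-defined blocking point, using only plain events (this reuses n, cq, devmem, size and host_ptr as placeholders from the example above):

    std::vector<boost::compute::event> pending;
    pending.reserve(n);
    for (int i = 0; i < n; ++i) {
        pending.push_back(
            cq.enqueue_write_buffer_async(devmem[i], 0, size[i], host_ptr[i]));
    }
    // ... other host work can overlap with the copies here ...
    for (std::size_t i = 0; i < pending.size(); ++i) {
        pending[i].wait();   // single blocking point after the whole batch
    }

This is not exception-safe on its own, which is precisely the gap the guarantee objects under discussion are meant to fill.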
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: December 31, 2014 12:11 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
On 31/12/2014 12:51, Thomas M wrote:
If the guarantee can be "moved outside of the loop" to avoid the blocking, something like:
It was always my intention that they be movable, settable, and default-constructible. That should allow an array (or scoped_array<>, if the size isn't known until runtime) to be defined outside the loop and initialized inside.
then it's not a guarantee in your sense any more, because I can obviously also move it out to any improper scope loosing the protection.
It's still a guarantee - just not necessarily a guarantee of what you want. It guarantees that the event will complete within *its* lifetime. And yes, making it movable and settable opens the door to misuse, but the simplest and most natural way to use it is still the correct way.

I feel like this issue has been receiving a disproportionate amount of focus. Most users of Boost.Compute will probably never use one of these constructs. Also, no one has yet closed the door to wait_list- or command_queue- level equivalents. I'm simply not going to write them. And even if they aren't eventually added to the library, you can still define them in your own code.

Honestly, I'm much more concerned about #2. While we argue over levels of convenience, that looming issue of exception safety still casts a long, dark shadow over the high level interface of Boost.Compute and its users.

Matt
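Assuming a movable, default-constructible guarantee along the lines of the event_guarantee sketched earlier in this thread, the loop case would then look roughly like this (illustrative only; the placeholders are the same as in Thomas's example):

    std::vector<event_guarantee> guards(n);   // default-constructed, unarmed
    for (int i = 0; i < n; ++i) {
        guards[i] = event_guarantee(
            cq.enqueue_write_buffer_async(devmem[i], 0, size[i], host_ptr[i]));
    }
    // nothing blocks inside the loop; all n events are waited on when guards
    // goes out of scope, or earlier if an exception unwinds past it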
On 31/12/2014 20:32, Gruenke,Matt wrote:
I feel like this issue has been receiving a disproportionate amount of focus. Most users of Boost.Compute will probably never use one of these constructs. Also, no one has yet closed the door to wait_list- or command_queue- level equivalents. I'm simply not going to write them. And even if they aren't eventually added to the library, you can still define them in your own code.
Honestly, I'm much more concerned about #2. While we argue over levels of convenience, that looming issue of exception safety still casts a long, dark shadow over the high level interface of Boost.Compute and its users.
Ok, let's settle this issue. When you implement #2 - and yes, that's certainly a complex task - please compile a thorough list of the cases where the library can't ensure exception safety on its own and share it here; we can then take a new look at the full-scale problem and think of an interface. Is that ok? Thomas
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M Sent: December 31, 2014 16:15 To: boost@lists.boost.org Subject: Re: [boost] Synchronization (RE: [compute] review)
When you implement #2,
I hadn't planned to do anything for #2. Kyle said he's thinking about it. Since you had some ideas, maybe you could sketch them up in an email, or submit a patch. If you're afraid it might be a waste of time, email Kyle directly and ask if he's interested.
please compile a thorough list when the library can't ensure exception safety on it's on and share it here; we can then take a new look at the full-scale problem and think of an interface. Is that ok?
That sounds like a request for Kyle. You could always get the ball rolling, and maybe he'd help.

Matt
On December 31, 2014 3:15:10 AM EST, "Gruenke,Matt"
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Thomas M
With respect to the command_queue "guarantee" I keep my opinion that I consider it absolutely useful; I also don't see why it's more error-prone in general (see below).
It's error-prone, because the guarantee must have a smaller scope than all of the containers it's intended to protect. You can't just instantiate new containers as you need them. That type of usage error isn't possible with event guarantees.
You're correct to point out that this applies equally to wait_list guarantees. Hence, I am withdrawing my support for them.
What if the guarantees were part of the construct in use, and only registered or lifetime-managed objects could be used with it? By that I mean that if you want synchronization, then you create a synchronizing command queue or wait list. If you want to use host memory with it, the API requires a host memory object that was previously registered with the queue/wait list, or a special host memory object with a ref count the queue/wait list can increment. The point is to have an interface that enforces correct lifetime management rather than relying on the user to remember to do certain things in a certain order.

Another thought is to get the guarantee from the queue/wait list and then pass it to constructors and APIs that should be protected. That doesn't preclude using unprotected objects and calls, unless the queue/wait list can verify that its guarantee object was used to create the object handed to it, or on the API call invoked on it, whenever there is an extant guarantee object.

___ Rob (Sent from my portable computation engine)
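A rough sketch of the registration idea, with every name invented for illustration: the wrapper hands out host buffers whose lifetime it can extend, and its asynchronous calls accept only those buffers.

    #include <memory>
    #include <vector>
    #include <boost/compute/core.hpp>

    // Hypothetical: host memory whose lifetime the queue wrapper can extend.
    struct registered_host_buffer
    {
        std::shared_ptr<std::vector<char> > storage;
    };

    class registering_queue
    {
    public:
        explicit registering_queue(const boost::compute::command_queue &q)
            : m_queue(q)
        {
        }

        registered_host_buffer register_host_memory(std::size_t bytes)
        {
            registered_host_buffer b;
            b.storage = std::make_shared<std::vector<char> >(bytes);
            return b;
        }

        // Only registered memory is accepted; the wrapper keeps a reference
        // until the commands are known to be done (tracking elided here).
        boost::compute::event write_async(const boost::compute::buffer &dst,
                                          const registered_host_buffer &src)
        {
            m_in_flight.push_back(src.storage);
            return m_queue.enqueue_write_buffer_async(
                dst, 0, src.storage->size(), src.storage->data());
        }

    private:
        boost::compute::command_queue m_queue;
        std::vector<std::shared_ptr<std::vector<char> > > m_in_flight;
    };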
On 29/12/2014 03:23, Kyle Lutz wrote:
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt
wrote: -----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 20:36 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt wrote:
My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?
Yes this is correct. In general, algorithms like transform() are asynchronous when the input/output ranges are both on the device and synchronous when one of the ranges is on the host. I'll work on better ways to allow asynchrony in the latter case. One of my current ideas is to add asynchronous memory-mapping support to the mapped_view class [1] which can then be used with any of the algorithms in an asynchronous fashion.
When you speak of input/output ranges on the host, what kinds of iterators do you refer to? Any input/output iterator kind (e.g. iterators from a std:: container; I just tried a std::vector with boost::compute::transform and it didn't compile when provided as the input range), or iterators that are part of your library? Thomas
On Mon, Dec 29, 2014 at 1:21 AM, Thomas M
On 29/12/2014 03:23, Kyle Lutz wrote:
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt
wrote: -----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 20:36 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt wrote:
My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?
Yes this is correct. In general, algorithms like transform() are asynchronous when the input/output ranges are both on the device and synchronous when one of the ranges is on the host. I'll work on better ways to allow asynchrony in the latter case. One of my current ideas is add asynchronous memory-mapping support to the mapped_view class [1] which can then be used with any of the algorithms in an asynchronous fashion.
When you speak of input/output ranges on the host, to what kinds of iterators do you refer to? Any input/output iterator kind (e.g. iterators from a std:: container -> just tried a std::vector on boost::compute::transform, didn't compile if provided as input range), or iterators that are part of your library?
In general, all of the algorithms operate on Boost.Compute iterators rather than STL iterators. The main exception to this rule is the boost::compute::copy() algorithm which copies between ranges of STL iterators on the host and Boost.Compute iterators on the device. Another exception is the boost::compute::sort() algorithm which can directly sort ranges of random-access iterators on the host (as long as the data is contiguous, e.g. std::vector<T>). I am working to add more direct support for host ranges to the other algorithms. Currently the best way to use the Boost.Compute algorithms together with host memory is with the mapped_view [1] class. -kyle [1] http://kylelutz.github.io/compute/boost/compute/mapped_view.html
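For readers following along, a small usage sketch of the mapped_view approach Kyle points to, assuming the constructor shown on its reference page (host pointer, element count, context):

    #include <vector>
    #include <boost/compute/core.hpp>
    #include <boost/compute/algorithm/transform.hpp>
    #include <boost/compute/container/mapped_view.hpp>
    #include <boost/compute/functional/math.hpp>

    namespace compute = boost::compute;

    int main()
    {
        compute::device dev = compute::system::default_device();
        compute::context ctx(dev);
        compute::command_queue queue(ctx, dev);

        std::vector<float> host(256, 4.0f);

        // wrap the host memory so the device algorithms can use it directly
        compute::mapped_view<float> view(host.data(), host.size(), ctx);

        // compute the square root of each element in place
        compute::transform(view.begin(), view.end(), view.begin(),
                           compute::sqrt<float>(), queue);

        // the synchronization question discussed in this thread: block here
        // before touching 'host' again on the CPU
        queue.finish();

        return 0;
    }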
On 29/12/2014 17:56, Kyle Lutz wrote:
On Mon, Dec 29, 2014 at 1:21 AM, Thomas M
wrote: On 29/12/2014 03:23, Kyle Lutz wrote:
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt
wrote: -----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 20:36 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt wrote:
My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?
Yes this is correct. In general, algorithms like transform() are asynchronous when the input/output ranges are both on the device and synchronous when one of the ranges is on the host. I'll work on better ways to allow asynchrony in the latter case. One of my current ideas is add asynchronous memory-mapping support to the mapped_view class [1] which can then be used with any of the algorithms in an asynchronous fashion.
When you speak of input/output ranges on the host, to what kinds of iterators do you refer to? Any input/output iterator kind (e.g. iterators from a std:: container -> just tried a std::vector on boost::compute::transform, didn't compile if provided as input range), or iterators that are part of your library?
In general, all of the algorithms operate on Boost.Compute iterators rather than STL iterators. The main exception to this rule is the boost::compute::copy() algorithm which copies between ranges of STL iterators on the host and Boost.Compute iterators on the device. Anther exception is the boost::compute::sort() algorithm which can directly sort ranges of random-access iterators on the host (as long as the data is contiguous, e.g. std::vector<T>). I am working to add more direct support for host ranges to the other algorithms. Currently the best way to use the Boost.Compute algorithms together with host memory is with the mapped_view [1] class.
I would find it very helpful to forbid invalid iterators as arguments as much as possible at compile time. For example, I can use a std::vector::iterator as the output range argument:

    std::vector<int> stdvec(100, 0);
    boost::compute::transform(some_begin(), some_end(), stdvec.begin(), ...);

I guess such a use would be invalid; if transform accepts only iterators from your library then I suppose such compile-time checks should be possible for both input and output (including a targeted error message). Thomas
On Mon, Dec 29, 2014 at 11:53 AM, Thomas M
On 29/12/2014 17:56, Kyle Lutz wrote:
On Mon, Dec 29, 2014 at 1:21 AM, Thomas M
wrote: On 29/12/2014 03:23, Kyle Lutz wrote:
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt
wrote: -----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 20:36 To: boost@lists.boost.org List Subject: Re: [boost] Synchronization (RE: [compute] review)
On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt wrote:
My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?
Yes this is correct. In general, algorithms like transform() are asynchronous when the input/output ranges are both on the device and synchronous when one of the ranges is on the host. I'll work on better ways to allow asynchrony in the latter case. One of my current ideas is add asynchronous memory-mapping support to the mapped_view class [1] which can then be used with any of the algorithms in an asynchronous fashion.
When you speak of input/output ranges on the host, to what kinds of iterators do you refer to? Any input/output iterator kind (e.g. iterators from a std:: container -> just tried a std::vector on boost::compute::transform, didn't compile if provided as input range), or iterators that are part of your library?
In general, all of the algorithms operate on Boost.Compute iterators rather than STL iterators. The main exception to this rule is the boost::compute::copy() algorithm which copies between ranges of STL iterators on the host and Boost.Compute iterators on the device. Anther exception is the boost::compute::sort() algorithm which can directly sort ranges of random-access iterators on the host (as long as the data is contiguous, e.g. std::vector<T>). I am working to add more direct support for host ranges to the other algorithms. Currently the best way to use the Boost.Compute algorithms together with host memory is with the mapped_view [1] class.
I would find it very helpful to forbid invalid iterators as arguments as much as possible at compile time. For example I can use a std::vector::iterator as output range argument:
std::vector<int> stdvec(100, 0); boost::compute::transform(some_begin(), some_end(), stdvec.begin(), ...);
I guess such a use would be invalid; if transform accepts only iterators from your library then I suppose such compile-time checks should be possible for both input and output (including targeted error message).
Fully agree. And Boost.Compute already has an internal `is_device_iterator` trait which we could use to check for proper device iterators. I've opened an issue for this [1]. -kyle [1] https://github.com/kylelutz/compute/issues/392
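Roughly what such a compile-time check could look like, sketched with a stand-in trait (the real is_device_iterator is an internal detail, so its header and exact spelling may differ):

    #include <type_traits>
    #include <vector>
    #include <boost/compute/iterator/buffer_iterator.hpp>

    // Illustrative stand-in for the internal trait Kyle mentions: true only
    // for iterators that refer to device memory.
    template <class Iterator>
    struct is_device_iterator_sketch : std::false_type {};

    template <class T>
    struct is_device_iterator_sketch<boost::compute::buffer_iterator<T> >
        : std::true_type {};

    // The algorithms could then reject host iterators up front, e.g.:
    //   static_assert(is_device_iterator_sketch<OutputIterator>::value,
    //                 "boost::compute::transform requires device iterators");

    int main()
    {
        static_assert(
            is_device_iterator_sketch<boost::compute::buffer_iterator<int> >::value,
            "buffer_iterator is a device iterator");
        static_assert(
            !is_device_iterator_sketch<std::vector<int>::iterator>::value,
            "std::vector's iterator is a host iterator");
        return 0;
    }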
On 2014-12-28 20:35, Kyle Lutz wrote:
On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt
wrote: -----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost@lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
Also, I agree with Thomas M that it'd be useful for operations to return events.
All asynchronous operations in the command queue class do return events. One of his comments was to also return events from the synchronous methods for consistency and I am working on adding this.
Well, what I had in mind was events for higher-order operations, like boost::compute::transform().
Yes, I would also like to have higher-level support for chaining together algorithms asynchronously. However, designing a generic and generally useful API for this is a complex task and may take some time (I've shied away from just adding an extra "_async()" function for all of the algorithm APIs as I think it could be done better/more-extensibly). Any ideas/proposals for this would be great to hear.
The approach that people seem to be converging on is using futures and continuations for this sort of thing. I think that works here. See for example HPX:
https://github.com/STEllAR-GROUP/hpx
Given that you already have a wait_list object, I imagine that the required primitives are already there in OpenCL to chain operations in arbitrary graphs.
John Bytheway
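The OpenCL primitives John mentions are indeed already exposed; a future/continuation layer would essentially wrap the pattern below. This is a sketch only: it assumes, per Kyle's note earlier in the thread, that the asynchronous enqueue methods (including the kernel launch) return events, and that passing 0 for the local work size lets the runtime choose it.

    #include <vector>
    #include <boost/compute/core.hpp>
    #include <boost/compute/utility/wait_list.hpp>

    namespace compute = boost::compute;

    // Chain write -> kernel -> read through events, with a single host-side
    // wait at the very end. 'kern' is an already-built kernel and 'dbuf' is a
    // device buffer large enough for host_in/host_out.
    void pipeline(compute::command_queue &queue, compute::kernel &kern,
                  compute::buffer &dbuf,
                  std::vector<float> &host_in, std::vector<float> &host_out)
    {
        const std::size_t bytes = host_in.size() * sizeof(float);

        compute::event write_done =
            queue.enqueue_write_buffer_async(dbuf, 0, bytes, host_in.data());

        compute::wait_list after_write;
        after_write.insert(write_done);
        compute::event kernel_done =
            queue.enqueue_1d_range_kernel(kern, 0, host_in.size(), 0, after_write);

        compute::wait_list after_kernel;
        after_kernel.insert(kernel_done);
        compute::event read_done =
            queue.enqueue_read_buffer_async(dbuf, 0, bytes, host_out.data(),
                                            after_kernel);

        read_done.wait();   // the only blocking call in the whole chain
    }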
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost@lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
I didn't notice any unit tests specifically intended to verify exception-safety. I'd want to know that had been thoroughly vetted, before I'd consider using Boost.Compute in production code.
There are already some tests which cover the error/exception paths in Boost.Compute (testing building invalid programs in "test_program.cpp" is the first that comes to mind). I'll work on adding more. Please let me know if there are any specific areas you'd like to see more heavily tested.
The only exception-related tests I saw seemed to be a small number that simply checked whether exceptions were being thrown when expected.

What I'm specifically concerned about is whether your containers and other wrappers are exception-safe. I don't want memory or resource leaks (or worse!) in my programs, when I handle recoverable exceptions and continue executing. I assume the bulk of your classes are simply RAII wrappers around the OpenCL C types, so they're probably fairly safe. I guess I'm more concerned about containers and anything with nontrivial state.

I don't know how much boost's unit test framework can help with this. Certainly, it can exercise a number of cases and report if there are any segfaults. But to check for resource leaks probably means also running them with a tool like valgrind, and examining the output. I'd enable file descriptor tracking, in hopes that it might detect some resource leaks down in vendor-specific code. Obviously, your library can only be as good as the underlying implementation, but the hope is that at least Boost.Compute doesn't add any leaks or errors of its own.

Matt
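A sketch of the kind of exception-path test being discussed, in ordinary Boost.Test style; the deliberately broken kernel source is the point of the test, and leak/descriptor checking of the kind Matt describes would then be done by running the resulting binary under valgrind.

    #define BOOST_TEST_MODULE ExceptionRecovery
    #include <boost/test/included/unit_test.hpp>

    #include <boost/compute/core.hpp>
    #include <boost/compute/exception.hpp>

    namespace compute = boost::compute;

    // Building an invalid program must throw, and the objects involved must
    // remain usable afterwards (no leaked or poisoned state).
    BOOST_AUTO_TEST_CASE(build_failure_is_recoverable)
    {
        compute::device dev = compute::system::default_device();
        compute::context ctx(dev);

        compute::program prog = compute::program::create_with_source(
            "__kernel void f( { this is not OpenCL C }", ctx);

        BOOST_CHECK_THROW(prog.build(), compute::opencl_error);

        // the context must still be usable after the failed build
        compute::buffer buf(ctx, 64);
        BOOST_CHECK_EQUAL(buf.size(), std::size_t(64));
    }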
On Sun, Dec 28, 2014 at 5:03 PM, Gruenke,Matt
-----Original Message----- I don't know how much boost's unit test framework can help with this. Certainly, it can exercise a number of cases and report if there are any segfaults. But to check for resource leaks probably means also running them with a tool like valgrind, and examining the output.
Obviously one could write tests to verify exception safety without using special tools. -- Emil Dotchevski Reverge Studios, Inc. http://www.revergestudios.com/reblog/index.php?n=ReCode
On Sun, Dec 28, 2014 at 5:03 PM, Gruenke,Matt
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost@lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
I didn't notice any unit tests specifically intended to verify exception-safety. I'd want to know that had been thoroughly vetted, before I'd consider using Boost.Compute in production code.
There are already some tests which for the error/exception paths in Boost.Compute (testing building invalid programs in "test_program.cpp" is the first that comes to mind). I'll work on adding more. Please let me know if there are any specific areas you'd like to see more heavily tested.
The only exception-related tests I saw seemed to be a small number that simply checked whether exceptions were being thrown when expected.
What I'm specifically concerned about is whether your containers and other wrappers are exception-safe. I don't want memory or resource leaks (or worse!) in my programs, when I handle recoverable exceptions and continue executing. I assume the bulk of your classes are simply RAII wrappers around the OpenCL C types, so they're probably fairly safe. I guess I'm more concerned about containers and anything with nontrivial state.
I don't know how much boost's unit test framework can help with this. Certainly, it can exercise a number of cases and report if there are any segfaults. But to check for resource leaks probably means also running them with a tool like valgrind, and examining the output. I'd enable file descriptor tracking, in hopes that it might detect some resource leaks down in vendor-specific code. Obviously, your library can only be as good as the underlying implementation, but the hope is that at least Boost.Compute doesn't add any leaks or errors of its own.
I have attempted as best as possible to make the library exception safe. I will work on improving the unit tests to explicitly check for exception safety and proper recovery. I'd welcome anyone to look over the code and, if possible, provide test-cases which show incorrect behavior. -kyle
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost@lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
Finally, based on the docs, it seems there's a bug in boost::compute::command_queue::enqueue_copy_image_to_buffer() and boost::compute::command_queue::enqueue_copy_buffer_to_image(). The region and origin parameters should be arrays, in the 2D and 3D cases. If this discrepancy is intentional, it should be noted. I haven't exhaustively reviewed the reference docs, so there might be other issues of this sort.
Can you be more specific as to what you consider to be a "bug" here? These APIs in the command_queue class are very "thin" wrappers on top of the clEnqueue* functions from the OpenCL C API. If you could provide a small test-case which demonstrates the issue and submit it to the bug-tracker [1] that would be great.
I can submit bugs, but this is really simple (and blatant). Now let's look at boost::compute::command_queue::enqueue_read_image(). It has the following overloads:

    void enqueue_read_image(const image2d & image, const size_t origin,
                            const size_t region, size_t row_pitch,
                            void * host_ptr,
                            const wait_list & events = wait_list());

    void enqueue_read_image(const image3d & image, const size_t origin,
                            const size_t region, size_t row_pitch,
                            size_t slice_pitch, void * host_ptr,
                            const wait_list & events = wait_list());

See how origin and region are size_t? Well, here's the OpenCL function they claim to wrap:

    cl_int clEnqueueReadImage(cl_command_queue command_queue,
                              cl_mem image,
                              cl_bool blocking_read,
                              const size_t *origin,
                              const size_t *region,
                              size_t row_pitch,
                              size_t slice_pitch,
                              void *ptr,
                              cl_uint num_events_in_wait_list,
                              const cl_event *event_wait_list,
                              cl_event *event);

See how origin and region are const size_t *? The reason is:

origin - Defines the (x, y, z) offset in pixels in the 1D, 2D, or 3D image, the (x, y) offset and the image index in the image array, or the (x) offset and the image index in the 1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D image array object, origin[2] must be 0. If image is a 1D image array object, origin[1] describes the image index in the 1D image array. If image is a 2D image array object, origin[2] describes the image index in the 2D image array.

region - Defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array, or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image array object, region[2] must be 1.

I have seen several such examples in boost::compute::command_queue, and I didn't even look very hard. But it would be good to sort out all of these problems before Boost.Compute is released as part of Boost, since fixing some may involve incompatible API changes. I guess you could add overloads in these cases, but it'd be nice if Boost.Compute didn't already have cruft right at the start.

Matt
On Sun, Dec 28, 2014 at 5:36 PM, Gruenke,Matt
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Kyle Lutz Sent: Sunday, December 28, 2014 14:42 To: boost@lists.boost.org List Subject: Re: [boost] [compute] review
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
Finally, based on the docs, it seems there's a bug in boost::compute::command_queue::enqueue_copy_image_to_buffer() and boost::compute::command_queue::enqueue_copy_buffer_to_image(). The region and origin parameters should be arrays, in the 2D and 3D cases. If this discrepancy is intentional, it should be noted. I haven't exhaustively reviewed the reference docs, so there might be other issues of this sort.
Can you be more specific as to what you consider to be a "bug" here? These APIs in the command_queue class are very "thin" wrappers on top of the clEnqueue* functions from the OpenCL C API. If you could provide a small test-case which demonstrates the issue and submit it to the bug-tracker [1] that would be great.
I can submit bugs, but this is really simple (and blatant). Now let's look at boost::compute::command_queue::enqueue_read_image(). It has the following overloads:
void enqueue_read_image(const image2d & image, const size_t origin, const size_t region, size_t row_pitch, void * host_ptr, const wait_list & events = wait_list());
void enqueue_read_image(const image3d & image, const size_t origin, const size_t region, size_t row_pitch, size_t slice_pitch, void * host_ptr, const wait_list & events = wait_list());
See how origin and region are size_t? Well, here's the OpenCL function they claim to wrap:
cl_int clEnqueueReadImage(cl_command_queue command_queue, cl_mem image, cl_bool blocking_read, const size_t *origin, const size_t *region, size_t row_pitch, size_t slice_pitch, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event );
See how origin and region are const size_t *? The reason is:
origin - Defines the (x, y, z) offset in pixels in the 1D, 2D, or 3D image, the (x, y) offset and the image index in the image array or the (x) offset and the image index in the 1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D image array object, origin[2] must be 0. If image is a 1D image array object, origin[1] describes the image index in the 1D image array. If image is a 2D image array object, origin[2] describes the image index in the 2D image array.
region - Defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image array object, region[2] must be 1.
I have seen several such examples, in boost::compute::command_queue, and I didn't even look very hard. But it would be good to sort out all of these problems before Boost.Compute is released as part of boost, since fixing some may involve incompatible API changes. I guess you could add overloads, in these cases, but it'd be nice if Boost.Compute didn't already have cruft right at the start.
Thanks for bringing this to my attention. This looks to be a bug in generating the Boost documentation from the doxygen comments. For example, the enqueue_read_image() method takes the origin argument as "const size_t origin[2]" but the generated documentation lists this as merely "const size_t origin". For now, until the documentation bug is fixed, I'd invite you to look directly at the header file [1]. If anyone more knowledgeable with the doxygen->quickbook toolchain could comment on this or knows of a work-around, that would be a great help. -kyle [1] https://github.com/kylelutz/compute/blob/master/include/boost/compute/comman...
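Based on the header signature Kyle describes (origin and region as size_t[2] in the 2D overload), a call would then look roughly like this; img, queue, width, height and host_pixels are placeholders:

    // 2D read: origin and region really are arrays, as in clEnqueueReadImage
    size_t origin[2] = { 0, 0 };
    size_t region[2] = { width, height };

    queue.enqueue_read_image(img, origin, region,
                             0,                    // row_pitch: 0 = tightly packed
                             host_pixels.data());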
participants (13)
- Andrey Semashev
- Antony Polukhin
- Emil Dotchevski
- Francois Duranleau
- Gruenke,Matt
- John Bytheway
- Kyle Lutz
- Mathias Gaunard
- Peter Dimov
- Rob Stewart
- Sebastian Schaetz
- Thomas M
- Vicente J. Botet Escriba