On 31/12/2014 02:31, Gruenke,Matt wrote:
Let's take another look at the big picture.
In its current form, the high-level interface of Boost.Compute has safe (if overkill) synchronization with host data structures. There's currently insufficient synchronization of device containers. Finally, the low-level interface lacks the RAII objects needed to facilitate exception-safe code.
In this thread, we've proposed:
1. Host memory containers, as a synchronization *optimization* of high-level operations involving host memory. This seems similar to the approach employed by SYCL (thanks to Peter Dimov).
2. You suggested device containers & their iterators could have the machinery to make them exception-safe. This could take the form of either refcounting, or synchronization.
3. RAII wrappers for events and wait_lists.
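For illustration, the RAII wrapper proposed in #3 could be sketched roughly as below. This is only a sketch of the pattern, not Boost.Compute API; `mock_event` and `event_guarantee` are hypothetical names, and `wait()` stands in for blocking on command completion:

```cpp
#include <functional>

// Hypothetical stand-in for an OpenCL event: wait() blocks until the
// associated command has completed.
struct mock_event {
    std::function<void()> on_wait;
    void wait() { if (on_wait) on_wait(); }
};

// Sketch of proposal #3: a guard that waits on its event when it leaves
// scope -- including during stack unwinding after an exception -- so host
// memory referenced by a pending command cannot be freed too early.
class event_guarantee {
public:
    explicit event_guarantee(mock_event& e) : event_(e) {}
    ~event_guarantee() { event_.wait(); }  // runs even if an exception is in flight
    event_guarantee(const event_guarantee&) = delete;
    event_guarantee& operator=(const event_guarantee&) = delete;
private:
    mock_event& event_;
};

// Demo: count how many waits fire when the scope unwinds.
int waits_fired() {
    int count = 0;
    mock_event e1{[&]{ ++count; }};
    mock_event e2{[&]{ ++count; }};
    {
        event_guarantee g1(e1);
        event_guarantee g2(e2);
        // an enqueue_*_async(...) call producing e1/e2 would go here
    }  // g2 then g1 wait, in reverse declaration order
    return count;
}
```

The same shape would apply to a wait_list guarantee; only the thing waited on in the destructor changes.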
Remember, if Kyle simply adopts #2, then regardless of the status of #1, I believe the high-level interface will be safe. So, there should be no need to use #3 with high-level operations.
Considering the above, the existing synchronization behavior of command_queue, and the ordering dependency you mentioned (i.e. that all data structures would need to be in scope before the guarantee on the command_queue), I conclude that synchronizing at the command_queue level is unnecessary and actually too error-prone.
And regarding wait_lists, the reason I agreed with your suggestion to offer guarantees for them is that they're the primary mechanism for defining command execution order. Since I think most code using the low level interface will probably be using wait_lists anyhow, there's an undeniable convenience to offering guarantees at that level.
As far as I'm aware, Kyle has expressed casual interest in supporting #1 (but not forcing it, presumably because this would break the nice iterator interface that many people like). I don't know where he stands on #2. And he's agreed to seriously consider #3 (in the form of a patch I've offered to submit).
I think we should focus on #2. I'd really like to know his stance on it.
I don't consider the proposed approaches to be mutually exclusive, or at least not as much as I read your post to suggest. To me it makes the most sense to offer users several options, at various abstraction levels, so the choice can be made flexibly for the problem at hand.

With respect to the command_queue "guarantee", I stand by my opinion that it is absolutely useful; I also don't see why it's more error-prone in general (see below). FWIW, in my real code it's quite common to execute a set of operations on an (in-order) command queue as a batch (copy input to device, run several kernels, copy output to host), and then I do synchronize at the command-queue level at the end. Why should I fiddle around with a number of individual events (paying unnecessary OpenCL overhead to set them all up, plus littering my host code) when all that matters is the execution of the operations as a set? Certainly not every problem is suited to this, but why make things unnecessarily complicated where it is applicable?

With respect to error-proneness (e.g. regarding the lifetime of other objects), how does a wait_list differ? All that matters is the point at which the guarantee was created:

    wait_list wl;
    wait_list::guarantee wlg(wl);
    wl.insert(cq.enqueue_write_buffer_async(...));
    wl.insert(cq.enqueue_...);

If some object is created after wlg but used in a subsequent asynchronous operation, then this blows up just as much here as it does for a command-queue guarantee; in either case those event(s) must be handled separately.
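The lifetime pitfall really is the same in both cases: whether the scope guard waits on a wait_list or calls something like command_queue::finish(), C++ destroys objects in reverse declaration order, so anything declared after the guarantee dies before the guarantee waits. A minimal destruction-order sketch (all types here are hypothetical mocks, not Boost.Compute API):

```cpp
#include <string>
#include <vector>

// Records the order in which destructors run.
struct destruction_log { std::vector<std::string> order; };

// Stands in for either kind of guarantee (wait_list- or queue-based):
// its destructor is where the blocking wait would happen.
struct scope_guarantee {
    destruction_log& log;
    ~scope_guarantee() { log.order.push_back("guarantee waits"); }
};

// Stands in for host memory referenced by a pending asynchronous command.
struct host_buffer {
    destruction_log& log;
    ~host_buffer() { log.order.push_back("buffer freed"); }
};

destruction_log trace_unsafe_order() {
    destruction_log log;
    {
        scope_guarantee g{log};  // guarantee declared first...
        host_buffer buf{log};    // ...buffer declared after the guarantee
        // an async write from `buf` enqueued here would still be pending
    }  // buf is destroyed *before* g waits -- the pending command dangles
    return log;
}
```

Declaring the buffer before the guarantee would make the wait happen first; the point is that this ordering constraint falls on the user identically for both kinds of guarantee.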
With respect to performance, synchronizing at the command-queue level can have two advantages: firstly, as already mentioned, not having to pay the OpenCL overhead of creating and waiting on a number of individual events; and secondly, for an out-of-order command queue, issuing explicit waits for all events, which must occur in some order, may cause unnecessary rearrangement of the queue's execution order (AFAIK it's the implementer's freedom whether or not to give preference to the just-issued wait).

I think there are very legitimate reasons why the plain OpenCL API allows synchronization based on scalar events, arrays of events, or other abstraction levels such as command queues. Why should C++-wrapped guarantees cover some of them but exclude the others?

cheers,
Thomas