
on Sun Dec 16 2012, Julian Gonggrijp <j.gonggrijp-AT-gmail.com> wrote:
In the degenerate case where the dependency graph is strictly linear, concurrency might be completely wasted unless you're able to take advantage of pipelining.
Isn't execution of linearly dependent operations always a pipeline?
Yes, but pipelining doesn't help if the input period exceeds the pipeline latency.
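To make that claim concrete: a pipeline only overlaps work when a new item arrives before the previous one has drained out the far end. A minimal back-of-the-envelope calculation (my own illustration, not from the thread) is:

```python
from math import ceil

def items_in_flight(input_period, stage_latencies):
    """Steady-state number of items concurrently inside a pipeline.

    input_period: time between successive inputs.
    stage_latencies: per-stage processing times; their sum is the
    end-to-end pipeline latency.
    """
    latency = sum(stage_latencies)
    # A new item enters every `input_period` units and each item stays
    # inside for `latency` units, so at most ceil(latency / period)
    # items ever overlap.
    return max(1, ceil(latency / input_period))

# Inputs arrive every 5 time units, two stages of 1 unit each:
# latency (2) < period (5), so only one item is ever in flight and
# the downstream stage idles -- pipelining buys nothing.
print(items_in_flight(5, [1, 1]))   # 1

# Inputs arrive every 1 unit: two items overlap, one per stage.
print(items_in_flight(1, [1, 1]))   # 2
```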
Assuming that, isn't every vertex in a dependency graph a pipeline by itself?
I suppose it's a degenerate pipeline, but I don't know what you're getting at.
Assuming that, isn't pipelining the entire purpose of any form of computational composition?
IMO the purpose of computational composition is to manage complexity through re-use.
And there's still a question of how that concurrency is carved up. If you have a complicated dependency graph and you can run several of these flow graphs in parallel, but each in a single thread (as described below), that seems likely to induce less synchronization than if you try to run the nodes of a single flow graph in parallel. Of course whether that decomposition can work depends on your problem.
You have a point. But: suppose that you have two operations, A and B, and you want to output B(A(item)) for each item in a collection of inputs. The alternatives you're suggesting are:

(1) have operation A executed by one thread and operation B by the other, then pipeline these threads;
(2) have two (or more) identical threads, each taking an equal share of the inputs and executing B o A.
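Scenario (1) can be sketched as two threads joined by a queue. This is a minimal sketch, with A and B as hypothetical placeholder operations:

```python
import queue
import threading

def A(x): return x + 1          # hypothetical first operation
def B(x): return x * 2          # hypothetical second operation

def pipeline(inputs):
    """Scenario (1): one thread runs A, a second runs B, connected
    by a queue of intermediate results."""
    mid, out = queue.Queue(), []
    SENTINEL = object()

    def stage_a():
        for x in inputs:
            mid.put(A(x))
        mid.put(SENTINEL)       # signal end of stream

    def stage_b():
        while True:
            y = mid.get()
            if y is SENTINEL:
                break
            out.append(B(y))

    ta = threading.Thread(target=stage_a)
    tb = threading.Thread(target=stage_b)
    ta.start(); tb.start()
    ta.join(); tb.join()
    return out

print(pipeline([1, 2, 3]))      # [4, 6, 8]
```

The single consumer preserves input order; the queue is exactly the synchronization point that scenario (2) avoids.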
In scenario (2) the threads don't need to communicate, so you're right that it will involve less waiting when generalized to more complex computations. But we're not taking into account that (1) is only one member of a much larger class of solutions to the same problem. If we realistically assume that one operation takes longer than the other, say that B takes three times as long as A, then the following network will prove superior to both (1) and (2), provided you have a CPU with at least four hardware threads:
(1b) We use four execution threads (or any multiple of four). The first takes the inputs, applies operation A to them and divides the intermediate results among the other three threads. Those three threads apply operation B, each to its own share of the intermediate results.
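Network (1b) can be sketched as one A-thread fanning out to three B-workers over a queue. Again A and B are placeholder operations of my own; the point is only the topology, with B costing roughly three times A so that all four threads stay busy:

```python
import queue
import threading

def A(x): return x + 1          # hypothetical fast operation
def B(x): return x * 2          # hypothetical slow operation (~3x A)

def fan_out(inputs, b_workers=3):
    """Network (1b): one producer thread applies A, several worker
    threads consume intermediate results and apply B."""
    mid, out = queue.Queue(), queue.Queue()
    SENTINEL = object()

    def a_stage():
        for x in inputs:
            mid.put(A(x))
        for _ in range(b_workers):   # one shutdown signal per worker
            mid.put(SENTINEL)

    def b_stage():
        while True:
            y = mid.get()
            if y is SENTINEL:
                break
            out.put(B(y))

    threads = [threading.Thread(target=a_stage)]
    threads += [threading.Thread(target=b_stage) for _ in range(b_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    results = []
    while not out.empty():
        results.append(out.get())
    return sorted(results)           # B-workers finish in any order

print(fan_out(range(4)))             # [2, 4, 6, 8]
```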
This also generalizes to more complex dependency graphs. In fact, the more operations are involved, the more likely you are to find significantly different execution times.
Yes.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
Software Development, Training
Clang/LLVM/EDG Compilers, C++, Boost