
Thanks for clarifying, Dave. I'm glad to know that we're mostly on the same page.

Dave Abrahams wrote:
on Sat Dec 15 2012, Julian Gonggrijp wrote:
[...] In part I'm puzzled by what seems to me to be your bolder claim: the possibility of statically wired asynchronous concurrent message passing.
I don't see why it shouldn't be possible.
In retrospect I do see why you didn't see it. Note the word "asynchronous". I'll get back to this point below.
[...] I believe we're not disagreeing at all. You seem to be implying that for very tightly interdependent computations static, single-threaded connections are most efficient, while for more coarse-grained computations concurrency is the best solution.
s/is/might be/
True.
In the degenerate case where the dependency graph is strictly linear, concurrency might be completely wasted unless you're able to take advantage of pipelining.
Isn't execution of linearly dependent operations always a pipeline? Assuming that, isn't every vertex in a dependency graph a pipeline by itself? Assuming that, isn't pipelining the entire purpose of any form of computational composition?
And there's still a question of how that concurrency is carved up. If you have a complicated dependency graph and you can run several of these flow graphs in parallel, but each in a single thread (as described below), that seems likely to induce less synchronization than if you try to run the nodes of a single flow graph in parallel. Of course whether that decomposition can work depends on your problem.
You have a point. But suppose that you have two operations, A and B, and you want to output B(A(item)) for each item out of a collection of inputs. The alternatives that you're suggesting are (1) have operation A executed by one thread and operation B by another, then pipeline these threads; or (2) have two (or more) identical threads, each taking an equal share of the inputs and executing B o A.

In scenario (2) the threads don't need to communicate, so you're right that it will involve less waiting if you generalize this to more complex computations. But now we're not taking into account that (1) is only one member of a much larger class of solutions to the same problem. If we realistically assume that one operation takes longer than the other, let's say that B takes three times the amount of time of A, then the following network will prove superior to both (1) and (2), on the condition that you have a CPU with at least four hardware threads:

(1b) We use four execution threads (or any multiple of four). The first takes the inputs, applies operation A to them and divides the intermediate results between the other three threads. The other three threads apply operation B, each to their own share of the intermediate results.

This also generalizes to more complex dependency graphs. In fact, the more operations are involved, the more likely you are to find significantly different execution times.
<schnipp>
:-)
[...] I find it very hard to imagine how this could be done, because at compile time it's impossible to predict how the instruction streams of both threads will be aligned during run-time. If something like that could actually be done, *please* enlighten me. :-)
Well, I have to admit, that sounds difficult. But you don't necessarily need more than a single-value buffer for each sub-computation, do you?
Indeed you don't; in fact I think you almost never do. Rather, the multiple-value buffer is mostly an optimization that enables asynchronous operation: the sender can push another value onto the queue even if the receiver hasn't popped the previous one yet, and vice versa. On top of that, every thread only needs to wait for its own preemptions, not for those of the other threads (unless preemption takes longer than filling/emptying the buffer).

The problem with a single-value buffer, whether static or dynamic, is really that it can't be asynchronous; production and consumption of a message must always take turns. I believe this largely defeats the purpose of concurrency.

Regardless, if you are to implement concurrent message passing with a static single-value buffer, you still have the problem that you'll somehow have to align the instruction streams of both threads. Now this seems less remote to me than the same thing with a multiple-value buffer, but I imagine that the act of aligning the instruction streams would add considerable overhead at run-time. After all, wouldn't that just be a barrier?
[...] You appear to believe that the same library should provide for both ways of composing computations, i.e. the static single-threaded one (like Accumulators) and the dynamic asynchronous one. I doubt that would be the right thing to do, because the modes of composition are not related in their underlying logic (contrary to the static and dynamic expressions in Xpressive) and because they impose very different requirements on the subcomputations that are to be composed.
You may be right.
[...] As the Accumulators mode of composition involves much tighter coupling between the operations, they need to have much more in common, e.g. features and dependencies in Accumulators.
Hmm, I don't think I buy that last part of the argument, but I also don't think we're very far apart now.
It might not be important enough to elaborate on. I agree that we're probably not far apart. -Julian