
Thanks for clarifying, Dave. I'm glad to know that we're mostly on the same page.

Dave Abrahams wrote:
on Sat Dec 15 2012, Julian Gonggrijp wrote:
[...] In part I'm puzzled by what seems to me to be your bolder claim: the possibility of statically wired asynchronous concurrent message passing.
I don't see why it shouldn't be possible.
In retrospect I do see why you didn't see it. Note the word "asynchronous". I'll get back to this point below.
[...] I believe we're not disagreeing at all. You seem to be implying that for very tightly interdependent computations static, single-threaded connections are most efficient, while for more coarse-grained computations concurrency is the best solution.
s/is/might be/
True.
In the degenerate case where the dependency graph is strictly linear, concurrency might be completely wasted unless you're able to take advantage of pipelining.
Isn't execution of linearly dependent operations always a pipeline? Assuming that, isn't every vertex in a dependency graph a pipeline by itself? Assuming that, isn't pipelining the entire purpose of any form of computational composition?
And there's still a question of how that concurrency is carved up. If you have a complicated dependency graph and you can run several of these flow graphs in parallel, but each in a single thread (as described below), that seems likely to induce less synchronization than if you try to run the nodes of a single flow graph in parallel. Of course whether that decomposition can work depends on your problem.
You have a point. But suppose that you have two operations, A and B, and you want to output B(A(item)) for each item out of a collection of inputs. The alternatives that you're suggesting are (1) have operation A executed by one thread and operation B by another, then pipeline these threads; or (2) have two (or more) identical threads, each taking an equal share of the inputs and executing B o A.

In scenario (2) the threads don't need to communicate, so you're right that it will involve less waiting if you generalize this to more complex computations. But now we're not taking into account that (1) is only one member of a much larger class of solutions to the same problem. If we realistically assume that one operation takes longer than the other, let's say that B takes three times the amount of time of A, then the following network will prove superior to both (1) and (2), on the condition that you have a CPU with at least four hardware threads:

(1b) We use four execution threads (or any multiple of four). The first takes the inputs, applies operation A to them and divides the intermediate results between the other three threads. The other three threads apply operation B, each to their own share of the intermediate results.

This also generalizes to more complex dependency graphs. In fact, the more operations are involved, the more likely you are to find significantly different execution times.
<schnipp>
:-)
[...] I find it very hard to imagine how this could be done, because at compile time it's impossible to predict how the instruction streams of both threads will be aligned during run-time. If something like that could actually be done, *please* enlighten me. :-)
Well, I have to admit, that sounds difficult. But you don't necessarily need more than a single-value buffer for each sub-computation, do you?
Indeed you don't; in fact I think you almost never do. Rather, the multiple-value buffer is mostly an optimization that enables asynchronous operation: the sender can push another value onto the queue even if the receiver hasn't popped the previous one yet, and vice versa. On top of that, every thread only needs to wait for its own preemptions, not for those of the other threads (unless preemption takes longer than filling/emptying the buffer).

The problem with a single-value buffer, whether static or dynamic, is really that it can't be asynchronous; production and consumption of a message must always take turns. I believe this largely defeats the purpose of concurrency.

Regardless, if you are to implement concurrent message passing with a static single-value buffer, you still have the problem that you'll somehow have to align the instruction streams of both threads. Now this seems less remote to me than the same thing with a multiple-value buffer, but I imagine that the act of aligning the instruction streams would add considerable overhead at run-time. After all, wouldn't that just be a barrier?
[...] You appear to believe that the same library should provide for both ways of composing computations, i.e. the static single-threaded one (like Accumulators) and the dynamic asynchronous one. I doubt that would be the right thing to do, because the modes of composition are not related in their underlying logic (contrary to the static and dynamic expressions in Xpressive) and because they impose very different requirements on the subcomputations that are to be composed.
You may be right.
[...] As the Accumulators mode of composition involves much tighter coupling between the operations, they need to have much more in common, e.g. features and dependencies in Accumulators.
Hmm, I don't think I buy that last part of the argument, but I also don't think we're very far apart now.
It might not be important enough to elaborate on. I agree that we're probably not far apart. -Julian