[rfc] getting the dataflow library ready for submission

Hello all,

I am getting close to having the Dataflow library ready to be submitted for review, and would like to respark some interest. The current version of the library offers a generic layer for dataflow programming that can be applied to various data transport mechanisms, and a Dataflow.Signals support layer for Boost.Signals that provides a number of general-purpose components and easy connectability.

In the near future I will try to put together more real-world examples of how this library can be useful, but for now the docs have two possibly interesting examples:

* an example of how to provide a support layer for another mechanism, with VTK as the victim.
* an example of how to use the Dataflow.Signals layer with Boost.Asio and Boost.Serialization to make a distributed dataflow network.

The docs are uploaded to: http://dancinghacker.com/code/dataflow/

Code is available from the boost sandbox (SOC/2007/signals), as well as the boost vault (Dataflow directory). All tests (with one exception) pass on GCC 4.0.1/darwin and MSVC8.

The docs (and the library to some extent) still need work, but I hope what's out there is good enough to paint a picture of what this library is about and how it might be useful. All feedback is welcome!

I will next work on an example that uses Boost.GIL to construct an image processing dataflow network.

Best regards,
Stjepan

Stjepan Rajko wrote:
I am getting close to having the Dataflow library ready to be submitted for review, and would like to respark some interest.
Hi Stjepan,

A couple of months ago I saw a talk by Bdale Garbee about the GNU Radio project (http://gnuradio.org/). They're doing various kinds of impressive-looking radio-related DSP on PCs. IIRC, they currently have something where they write Python code to glue together C++ building blocks using some sort of dataflow notation, but their next version will not require the Python layer. It may well have some useful material for you, or they may be interested in your code.

Personally, I'm interested in something that is superficially similar to this but with some make-like dependency tracking included. I imagine changes to one data structure "flowing" through to other structures that depend on it (there is a toy sketch at the end of this mail). This needs to work between threads, and so not changing the dependent data while it is in use is one of the challenges. (I only have the vaguest idea what I really mean by that.)

With computers getting more and more parallel all the time, we should all be looking for ways to make our code more concurrent! Providing a reasonably "natural" way to "plumb together" functions could be one way to do that.
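Here is the toy sketch of the make-like behaviour I have in mind - purely illustrative and single-threaded (the inter-thread part is exactly the bit I don't know how to do), and every name in it is invented:

#include <iostream>
#include <vector>

// A node caches a value and recomputes it on demand, make-style:
// only when something it depends on has changed. Assumes an acyclic graph.
struct node
{
    node() : dirty(true), value(0) {}
    virtual ~node() {}

    // Mark this node and everything downstream of it as out of date.
    void invalidate()
    {
        dirty = true;
        for (std::size_t i = 0; i < dependents.size(); ++i)
            dependents[i]->invalidate();
    }

    // Recompute only if out of date, then return the cached value.
    int get()
    {
        if (dirty) { value = compute(); dirty = false; }
        return value;
    }

    virtual int compute() { return value; }

    std::vector<node *> dependents;
    bool dirty;
    int value;
};

// A derived node whose value depends on two others.
struct sum : node
{
    sum(node &a, node &b) : a_(&a), b_(&b)
    {
        a.dependents.push_back(this);
        b.dependents.push_back(this);
    }
    int compute() { return a_->get() + b_->get(); }
    node *a_, *b_;
};

int main()
{
    node x, y;
    x.value = 1;
    y.value = 2;

    sum s(x, y);
    std::cout << s.get() << std::endl; // 3, computed on demand

    x.value = 10;
    x.invalidate(); // the change "flows" through to s
    std::cout << s.get() << std::endl; // 12, recomputed
}

Regards, Phil.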

On Nov 9, 2007 3:39 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Hi Stjepan,
A couple of months ago I saw a talk by Bdale Garbee about the GNU Radio project (http://gnuradio.org/). They're doing various kinds of
Great reference - thanks!
Personally, I'm interested in something that is superficially similar to this but with some make-like dependency tracking included. I
At some point in the future, I'm planning a "blueprint" layer which would provide a global representation of a dataflow network. Basically, each actual component would have a corresponding blueprint component, and those would be embedded in a BGL graph so you could run different sorts of analysis (as well as instantiate or serialize the actual network).

Also, if the components expose their threading properties, the blueprint could perhaps be automatically modified to include any needed threading-related guards before instantiating the actual dataflow network.
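To give an idea of the kind of analysis I mean, here is a rough sketch I put together for this mail (the "component" type is a made-up stand-in for a blueprint component; the graph and the algorithm are plain BGL):

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/topological_sort.hpp>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// A stand-in for a blueprint component; the real thing would carry
// ports, threading properties, etc.
struct component
{
    explicit component(const std::string &n = "") : name(n) {}
    std::string name;
};

typedef boost::adjacency_list<
    boost::vecS, boost::vecS, boost::directedS, component> network_graph;

int main()
{
    network_graph g;

    // A chain of three components: source -> filter -> sink.
    network_graph::vertex_descriptor source = add_vertex(component("source"), g);
    network_graph::vertex_descriptor filter = add_vertex(component("filter"), g);
    network_graph::vertex_descriptor sink = add_vertex(component("sink"), g);
    add_edge(source, filter, g);
    add_edge(filter, sink, g);

    // One possible analysis: a topological sort yields an order in which
    // the components could be instantiated or invoked.
    std::vector<network_graph::vertex_descriptor> order;
    boost::topological_sort(g, std::back_inserter(order));

    // topological_sort writes the vertices out in reverse topological order.
    for (std::size_t i = order.size(); i > 0; --i)
        std::cout << g[order[i - 1]].name << std::endl;
}

The real blueprint components would of course carry more information, but the BGL algorithms would apply in the same way.

Best regards, Stjepan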

On Nov 11, 2007 12:18 PM, Stjepan Rajko <stipe@asu.edu> wrote:
On Nov 9, 2007 3:39 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Personally, I'm interested in something that is superficially similar to this but with some make-like dependency tracking included. I
At some point in the future, I'm planning a "blueprint" layer which would provide a global representation of a dataflow network. Basically, each actual component would have a corresponding blueprint component, and those would be embedded in a BGL graph so you could run different sorts of analysis (as well as instantiate or serialize the actual network).
I have just started implementing this Blueprint layer - there is a thrown-together example which shows how you can use it for run-time reflection and dataflow network analysis via BGL at http://tinyurl.com/2xm832. Is this the sort of functionality you are looking for?

The Blueprint layer is being built on top of the generic Dataflow layer, so it could be used for any dataflow mechanism/framework with an appropriate support layer. The code is still rather ugly, but I'll keep working on it.

Regards, Stjepan

Stjepan Rajko wrote:
Hello all,
I am getting close to having the Dataflow library ready to be submitted for review, and would like to respark some interest. The current version of the library offers a generic layer for dataflow programming that can be applied to various data transport mechanisms, and a Dataflow.Signals support layer for Boost.Signals that provides a number of general-purpose components and easy connectability.

Hi Stjepan,

This looks like a very interesting library. Indeed, judging by the docs, they do need more work :)

I guess I'm a bit short-sighted, because in my mind, the process should be *extremely* simple: There would be a "processor" - to which you pass the data. The processor passes it to the left-most component (that is, the one that processes the first input). The left-most component processes it - let's say, in its operator(). When operator() finishes, the left-most component has processed the input and has generated one or more outputs. The processor will take those output(s) and pass them on to the next components, and so on. At the end, there will be some output(s).

In my mind, that should be all. Connecting the "dots" should be *extremely* simple. (Note: I don't really understand why you make use of the DATAFLOW_PORT_TRAITS... and other macros.)

Best,
John

-- http://John.Torjo.com -- C++ expert ... call me only if you want things done right

Hi John, On Nov 12, 2007 2:19 AM, John Torjo <john.groups@torjo.com> wrote:
I guess I'm a bit short-sighted, because in my mind, the process should be *extremely* simple:
I agree :-)
There would be a "processor" - to which you pass this data.
The processor passes this to the left-most component (that is, the one that processes the first input). The left-most component processes it - lets say, in its operator(). When operator() finishes, the left-most component has processed the input and has generated one or more outputs. The processor will take those output(s), and pass them on, to the next components, and so on.
Although the library currently provides no data transport mechanism of its own (it just provides a generic layer that can, to some extent, support a variety of actual data transport mechanisms, plus a developed Boost.Signals layer that comes with a bunch of components), a simple example mechanism like this would be interesting to add, and doing so would probably also force the generic layer to expand in good ways. Let me see if I can throw together something like this.
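Something along these lines, perhaps? Just a toy I threw together for this mail - every name in it is invented, and none of it is in the library:

#include <iostream>
#include <vector>

// A component processes one input in its operator() and produces one output.
struct component
{
    virtual ~component() {}
    virtual int operator()(int in) = 0;
};

struct doubler : component
{
    int operator()(int in) { return in * 2; }
};

struct printer : component
{
    int operator()(int in) { std::cout << in << std::endl; return in; }
};

// The "processor" pushes data through the chain, left to right.
class processor
{
public:
    processor &operator>>=(component &c)
    {
        chain_.push_back(&c);
        return *this;
    }

    int operator()(int in)
    {
        // Take each component's output and pass it on to the next one.
        for (std::size_t i = 0; i < chain_.size(); ++i)
            in = (*chain_[i])(in);
        return in;
    }

private:
    std::vector<component *> chain_;
};

int main()
{
    doubler d;
    printer p;

    processor proc;
    proc >>= d;
    proc >>= p;

    proc(21); // prints 42
}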
Connecting the "dots" should be *extremely* simple. (note: I don't really understand why you make usage of DATAFLOW_PORT_TRAITS... and other macros).
Yes, the macros are very poorly explained :-( . Basically (and this should be elaborated in the docs), the VTK example is focused on how you go about providing support for the VTK mechanism (i.e., that code just needs to be written once). After that's included, connecting the dots is (I hope) simple, i.e. you just do:

#include "vtk_dataflow_support.hpp" // this include file is "operators.hpp" in the VTK examples

// ... function scope:

// get some VTK components
vtkConeSource *cone = vtkConeSource::New();
vtkPolyDataMapper *coneMapper = vtkPolyDataMapper::New();
vtkActor *coneActor = vtkActor::New();
vtkRenderer *ren1 = vtkRenderer::New();
vtkRenderWindow *renWin = vtkRenderWindow::New();

// and now connect:
*cone >>= *coneMapper >>= *coneActor >>= *ren1 >>= *renWin;

// or, connect this way:
connect(cone, coneMapper);
connect(coneMapper, coneActor);
connect(coneActor, ren1);
connect(ren1, renWin);

The docs do need a lot of work... :-)

Thanks for your comments!

Stjepan

Stjepan Rajko wrote:
I am getting close to having the Dataflow library ready to be submitted for review, and would like to respark some interest. The current version of the library offers a generic layer for dataflow programming that can be applied to various data transport mechanisms, and a Dataflow.Signals support layer for Boost.Signals that provides a number of general-purpose components and easy connectability.
I just skimmed over the docs, and I have some questions:

1) How would I specify the control flow? I.e., how would I choose between "source driven" and "sink driven"?

2) Is there a mechanism that will give support for block-oriented processing, or will a function be called on every data item? To clarify: I do not mean a couple of primitive data items blocked together into a larger container which is itself a data item on its own, but something like what the fread() function is able to do (there is a small sketch at the end of this mail):
*) request a number of items
*) return as much as is currently available (may be less than wanted)
*) block-process what has been received
*) wait for more data

You are using VTK as an example, and so I suspect you are using demand-driven processing in your library. Are you also providing a caching and timestamping mechanism to hold intermediary results that need no updating?

Which special concepts for memory management are you suggesting (if at all)? Memory management is one of the crucial parts of dataflow programming. Will you provide a single-copy mechanism?

I have always found that in the dataflow paradigm these are the hard parts. Coming up with a "nice syntax" is not the hard part.
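I.e., semantics roughly like this (just a sketch to make clear what I mean; the names are invented):

#include <cstddef>

// fread()-like, block-oriented consumption: ask for up to n items, get
// back however many are currently available, process that block, then
// wait for more.
template <typename T>
struct block_source
{
    // Writes at most n items to buf and returns the number actually
    // written (which may be less than n; 0 could mean "none yet").
    virtual std::size_t read(T *buf, std::size_t n) = 0;
    virtual ~block_source() {}
};

Roland aka speedsnail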

Hi Roland, On Nov 12, 2007 4:00 AM, Roland Schwarz <roland.schwarz@chello.at> wrote:
I just skimmed over the docs, and I have some questions:
Great - thanks for taking the time!
1) How would I specify the control flow? I.e. How would I choose between "source driven" and "sink driven"?
Right now, the generic layer doesn't deal with this at all - it is completely up to the underlying mechanism (just to clarify, by "mechanism" I mean a particular dataflow framework / data transport mechanism, such as Boost.Signals or VTK).

The layer based on Boost.Signals is source driven - it sort of assumes that the dominant dataflow direction is from the caller to the callee. Although, if in actuality the data flowed the other way (via the return value), then it would be sink driven. The Dataflow.Signals example "pull-based networks" illustrates this alternative. On the other hand, I believe the VTK mechanism is sink driven (I am a complete newbie with VTK - I just learned enough to provide a small example support layer).
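To illustrate the two directions with plain Boost.Signals (a throwaway sketch for this mail, not Dataflow.Signals code):

#include <boost/signal.hpp>
#include <iostream>

void print_int(int x) { std::cout << x << std::endl; } // a sink
int produce_int() { return 42; }                       // a source

int main()
{
    // Source driven ("push"): the caller sends the data to the callee.
    boost::signal<void (int)> push;
    push.connect(&print_int);
    push(42); // data flows caller -> callee

    // Sink driven ("pull"): the caller requests the data, and it comes
    // back via the return value.
    boost::signal<int ()> pull;
    pull.connect(&produce_int);
    std::cout << pull() << std::endl; // data flows callee -> caller
}

In both cases the call travels in the same direction through the network - it is only the data that changes direction.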
2) Is there a mechanism that will give support for block-oriented processing, or will a function be called on every data item? To clarify: I do not mean a couple of primitive data items blocked together into a larger container which is itself a data item on its own, but something like what the fread() function is able to do:
*) request a number of items
*) return as much as is currently available (may be less than wanted)
*) block-process what has been received
*) wait for more data

You are using VTK as an example, and so I suspect you are using demand-driven processing in your library. Are you also providing a caching and timestamping mechanism to hold intermediary results that need no updating?

Which special concepts for memory management are you suggesting (if at all)? Memory management is one of the crucial parts of dataflow programming. Will you provide a single-copy mechanism?
About all of these - the generic dataflow layer is currently really small. It basically deals with connecting the dataflow network together, and that's it. Everything else is up to the underlying mechanism. I'm not sure whether the issues you mention should be dealt with at the generic level, but I guess we'll see as things get built on top of it.

As far as the Dataflow.Signals layer is concerned, it currently addresses little of the above features/issues. There is a "storage" component which can serve as a cache (but offers no timestamping). It would be useful to add components which deal with these issues, though (e.g., a timestamped storage, and a component (or boost::signals Combiner) which only forwards a signal when it has changed).

There is no memory management support here either - because of the nature of dataflow networks built on top of boost::signals (there is no global knowledge of the network - each component only holds connections to its sinks), it's basically all up to the components. There could be some memory management components added if necessary (just as there are now some threading-related components). And I'll think about fread-like processing components in Dataflow.Signals - that seems like it would be a good idea as well.

I'm not sure I understand the term "single-copy mechanism" - could you help me out with a reference?
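For example, a forward-only-when-changed component might look roughly like this (a sketch thrown together for this mail - it is not in the library):

#include <boost/signal.hpp>
#include <iostream>

// Forwards a value to its output signal only when it differs from the
// last value seen (invented for this mail; not a library component).
template <typename T>
class on_change
{
public:
    on_change() : has_last_(false) {}

    // Receive a value; forward it only if it changed.
    void operator()(const T &value)
    {
        if (!has_last_ || !(value == last_))
        {
            last_ = value;
            has_last_ = true;
            out(value);
        }
    }

    boost::signal<void (const T &)> out; // downstream connections

private:
    T last_;
    bool has_last_;
};

void print(const int &x) { std::cout << x << std::endl; }

int main()
{
    on_change<int> filter;
    filter.out.connect(&print);

    filter(1); // forwarded
    filter(1); // suppressed
    filter(2); // forwarded
}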
I have always found that in the dataflow paradigm these are the hard parts. Coming up with a "nice syntax" is not the hard part.
A while back, Tobias Schwinger suggested a dataflow framework which does deal with a lot of the issues you bring up here. It's discussed briefly in the "future directions" section of the documentation. I think that such a framework could be supported by the generic layer of this library, perhaps after it grows somewhat.

My personal goal in providing a generic layer is to build a visual programming environment on top of it. That way, anything that has a Dataflow support layer can have its networks manipulated visually, serialized, instantiated, etc. In the meantime, while there isn't much built on top of the generic layer, the "nice syntax" is unfortunately the biggest practical benefit.

Again, thanks for taking the time!

Best regards, Stjepan

How does the Dataflow lib address the (rather broad) topic of scheduling? I am reminded of the Ptolemy project from Berkeley.

On Nov 12, 2007 5:36 PM, Neal Becker <ndbecker2@gmail.com> wrote:
How does the Dataflow lib address the (rather broad) topic of scheduling? I am reminded of the Ptolemy project from Berkeley.
As it stands, not in a whole lot of ways :-) The generic dataflow layer doesn't address scheduling (except that you can invoke a component). The Dataflow.Signals layer included with the library (which uses Boost.Signals to transport data) is pretty much self-scheduling, since data transport is coupled with component invocation. The only functionality that remotely relates to scheduling is in the threading-related components (mutex, condition, and timed_storage).

Thanks for the Ptolemy reference! I was not aware of that project, and it looks very interesting.

Stjepan
participants (5):
- John Torjo
- Neal Becker
- Phil Endecott
- Roland Schwarz
- Stjepan Rajko