RFC - Updated MapReduce library

I have considerably revised and updated my MapReduce library and committed the changes to the Sandbox. I've also uploaded it to the Vault for easy access: http://www.boostpro.com/vault/index.php?action=downloadfile&filename=mapreduce_0_3.zip

Library documentation to date is available along with the code, or on my site at http://craighenderson.co.uk/mapreduce/

I am very interested to hear any comments on design, code, performance or any other area.

Regards
-- Craig
http://www.craighenderson.co.uk
http://www.siteupdatenotification.com

Craig Henderson wrote:
I am very interested to hear any comments on design, code, performance or any other area.
Some comments:

* Why do your function objects expose a static method instead of being real functors or polymorphic function objects? What's the rationale behind this choice?

* Why are timing stats embedded into the library? I may not want to time things, and I don't want to suffer whatever memory footprint those additional members impose.

* The interface is, on the user side, extremely verbose. I think some functions could extract more information from their input types to lessen the amount of type definition the user has to write. Using the result_of protocol here and there could also save people from having to memorize various unrelated names for inner types.

* Giving performance figures is OK. Comparing them to sensible existing solutions is better. How fast are you with respect to a similar, hand-coded application using Boost.Thread or even OpenMP? Instead of giving absolute times, a better metric could be cycles per processed element, as it gives an architecture-independent measure of performance.

* On the same topic, why not provide an OpenMP version on compilers that support it?

* Give examples of other types of application that can or can't benefit from MapReduce. Can I write image processing with it? HPC? Which characteristics should my original application have to benefit from MR? Counting and sorting words is OK for a tutorial, but it doesn't give an incentive on how/why to use this library.

Now it's time for my Google Syndrome rant. For intellectual honesty, it would be nice if people could stop delivering the false sense that everything is invented by Google Labs. MapReduce is only the latest incarnation of a 20+ year research effort on patterns for high-level parallelism (be it skeletons, parallel design patterns or Bulk Synchronous Parallelism). As it seems Google researchers can't be bothered to do their homework and provide a proper bibliography, it would be nice if someone actually did.
-- ___________________________________________ Joel Falcou - Assistant Professor PARALL Team - LRI - Universite Paris Sud XI Tel : (+33)1 69 15 66 35

joel wrote:
Some comments: * Why do your function objects expose a static method instead of being real functors or polymorphic function objects? What's the rationale behind this choice?
This interface has changed several times and I can't decide on the most appropriate. I have provided a base class to define the required types, hence the use of a function object. However, it is dangerous to use a real functor with an instance, because the map tasks are independent of each other and run in different threads. If they had instance data, then synchronization becomes an issue, but more importantly, it breaks the programming model. In a true distributed system, map tasks will run on separate machines and are therefore unable to share data. Support for this is intended in a later release of the library, so I need to keep the design pure.
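Craig's argument can be sketched in plain C++ (hypothetical names, not the library's actual headers): with a static method as the entry point, a map task has no instance through which state could leak between threads.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch (not the library's real headers): a map task whose
// entry point is a static method has no instance data, so two tasks
// running in different threads cannot accidentally share state through it.
struct word_count_map_task {
    // Everything arrives through parameters and leaves through 'emit';
    // there is nowhere to stash mutable per-task state.
    static void map(std::string const &key, std::string const &value,
                    std::vector<std::pair<std::string, unsigned>> &emit) {
        (void)key;  // unused in this example
        std::size_t pos = 0;
        while (pos < value.size()) {
            std::size_t end = value.find(' ', pos);
            if (end == std::string::npos)
                end = value.size();
            if (end > pos)
                emit.emplace_back(value.substr(pos, end - pos), 1u);
            pos = end + 1;
        }
    }
};
```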
* Why are timing stats embedded into the library? I may not want to time things, and I don't want to suffer whatever memory footprint those additional members impose.
These stats are very useful for research and testing, but I agree they are less important in a production environment. The timings need to be built into the library infrastructure because the library user does not have access to the required granularity of timing (without writing a bespoke schedule_policy). I can look at making the timing another policy class, but I don't think the overhead is really that significant, is it?
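Making the timing a policy could look roughly like this (a sketch with hypothetical names, not the library's actual design): the null policy's empty inline functions should compile away entirely, removing both the time members and the clock calls.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical sketch of a timing policy: the job runner is parameterised
// on a policy; choosing null_timing_policy removes both the timing member
// footprint and the clock calls.
struct null_timing_policy {
    void start() {}
    void stop() {}
    long long elapsed_us() const { return 0; }
};

struct chrono_timing_policy {
    std::chrono::steady_clock::time_point t0;
    long long total_us = 0;
    void start() { t0 = std::chrono::steady_clock::now(); }
    void stop() {
        total_us += std::chrono::duration_cast<std::chrono::microseconds>(
                        std::chrono::steady_clock::now() - t0)
                        .count();
    }
    long long elapsed_us() const { return total_us; }
};

template <typename TimingPolicy = null_timing_policy>
struct job_runner {
    TimingPolicy timing;
    template <typename Work>
    void run(Work work) {
        timing.start();  // a no-op for the null policy
        work();
        timing.stop();
    }
};
```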
* The interface is, on the user side, extremely verbose. I think some functions could extract more information from their input types to lessen the amount of type definition the user has to write. Using the result_of protocol here and there could also save people from having to memorize various unrelated names for inner types.
I'm disappointed you think this. I have worked really hard to make the interface as light as possible. If you compare the library interface to other implementations such as Phoenix, I hope you'll agree that this library is quite light. The minimum implementation is:

    struct map_task : public boost::mapreduce::map_task<k1, v1>
    {
        template<typename Runtime>
        static void map(Runtime &runtime, std::string const &/*key*/, value_type &value);
    };

    struct reduce_task : public boost::mapreduce::reduce_task<k2, v2>
    {
        template<typename Runtime, typename It>
        static void reduce(Runtime &runtime, std::string const &key, It it, It const ite);
    };

    int main()
    {
        typedef boost::mapreduce::job<map_task, reduce_task> job_t;
        boost::mapreduce::specification spec;
        boost::mapreduce::results result;
        boost::mapreduce::run<job_t>(spec, result);
    }

I am keen to make it lighter, though, if you can be specific with some suggestions?
* Giving performance figures is OK. Comparing them to sensible existing solutions is better. How fast are you with respect to a similar, hand-coded application using Boost.Thread or even OpenMP? Instead of giving absolute times, a better metric could be cycles per processed element, as it gives an architecture-independent measure of performance.
Agreed on the performance figures, and I'll provide some comparisons in the future. Jose on this list has helped with some comparisons against Phoenix, and the results are comparable for the WordCount example. You'll appreciate that I am limited to the machines I have access to, and Phoenix isn't available on my platform.
* On the same topic, why not provide an OpenMP version on compilers that support it?
Only that I am not familiar with OpenMP and haven't looked at it. It's unlikely that I'll be able to do this, but if someone in the Boost community would like to help out, I'd be delighted.
* Give examples of other types of application that can or can't benefit from MapReduce. Can I write image processing with it? HPC? Which characteristics should my original application have to benefit from MR? Counting and sorting words is OK for a tutorial, but it doesn't give an incentive on how/why to use this library.
In the documentation I did say that I am not providing a tutorial on programming in MapReduce, but maybe I will one day. I do, however, recognize that one example does not demonstrate the possibilities for the library, and I will be providing more samples in the future. Thanks for your feedback -- Craig

Craig Henderson wrote:
This interface has changed several times and I can't decide the most appropriate. I have provided a base class to define the required types, hence the use of a function object. However, it is dangerous to use a real functor with an instance because the Map Tasks are independent of each other and run in different threads. If they had instance data, then synchronization becomes an issue, but more importantly, it breaks the programming model. In a true distributed system, map tasks will run on separate machines, and therefore unable to share data. Support for will is intended for a later release of the library, so I need to keep the design pure.
Why not have each thread hold a local copy of the functor? Ideally, those are stateless anyway, and thus the copy is essentially free. Another thing: why not allow the use of anything that acts as a function and provides the correct interface? You'll face the dreaded legacy-code-reuse wall if your users can't take their years-old sequential functions and turn them into a mapper or reducer. Storing a boost::function inside the implementation to achieve this genericity is maybe a good idea.
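Joel's boost::function idea, sketched here with std::function and hypothetical names: a legacy sequential function becomes a mapper without modification, because the runner only cares about callability.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Callback used by mappers to emit an intermediate key/value pair.
using emit_fn = std::function<void(std::string const &, unsigned)>;

// A years-old sequential function, untouched.
void legacy_count_words(std::string const &line, emit_fn emit) {
    std::size_t pos = 0;
    while (pos < line.size()) {
        std::size_t end = line.find(' ', pos);
        if (end == std::string::npos)
            end = line.size();
        if (end > pos)
            emit(line.substr(pos, end - pos), 1u);
        pos = end + 1;
    }
}

// Sketch of Joel's suggestion (hypothetical, not the library's API):
// storing a std::function means free functions, functors and lambdas are
// all acceptable as mappers.
struct mapper {
    std::function<void(std::string const &, emit_fn)> fn;
    void run(std::string const &input, emit_fn emit) const { fn(input, emit); }
};
```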
These stats are very useful for research and testing, but I agree they are less important in a production environment. The timings need to be built into the library infrastructure because the library user does not have access to the required granularity of timing (without writing a bespoke schedule_policy). I can look at making the timing another policy class, but I don't think the overhead is really that significant, is it?
I like it when my phone phones and my toaster toasts. When I want a phone that toasts, I like to be able to explicitly decide so ;) Making it a policy is definitely better in my book.
I'm disappointed you think this. I have worked really hard to make the interface as light as possible. If you compare the library interface to other implementations such as Phoenix, I hope you'll agree that this library is quite light.
It's mainly about the need to write type::other_type::stuff and to have to check/remember which comes before what. A unified thing like result_of<type(user_type)>::type looks better.
I am keen to make it lighter, though, if you can be specific with some suggestions?
This sample code is maybe the FIRST thing to show the user, really, as it is far clearer about the intent and how the library is structured.
Agreed on the performance figures, and I'll provide some comparisons in the future. Jose on this list has helped with some comparison with Phoenix, and the results are comparable with the WordCount example. You'll appreciate that I am limited to the machines I have access to, and Phoenix isn't available on my platform.
I fully understand that and ...
Only that I am not familiar with openMP, and haven't looked at it. It's unlikely that I'll be able to do this, but if someone in the Boost community would like to help out, I'd be delighted.
... this is something I can contribute.
In the documentation I did say that I am not providing a tutorial on programming in MapReduce, but maybe I will one day. I do, however, recognize that one example does not demonstrate the possibilities for the library, and I will be providing more samples in the future.
A nice thing would be small-scale examples tied to tasks that one may have to do in a parallel way, demonstrating that the MapReduce approach adds value at some point (paraphrasing Murray Cole's statement to "show the payback"). -- ___________________________________________ Joel Falcou - Assistant Professor PARALL Team - LRI - Universite Paris Sud XI Tel : (+33)1 69 15 66 35

Joel wrote:
Why not have each thread hold a local copy of the functor? Ideally, those are stateless anyway, and thus the copy is essentially free. Another thing: why not allow the use of anything that acts as a function and provides the correct interface? You'll face the dreaded legacy-code-reuse wall if your users can't take their years-old sequential functions and turn them into a mapper or reducer. Storing a boost::function inside the implementation to achieve this genericity is maybe a good idea.
This is how it started :) I'm happy to change it back again if the consensus is that it is a cleaner interface. But doesn't it enable/encourage misuse of the separation of map tasks?
I like when my phone phones and my toaster toasts. When I want a phone that toast, I like to be able to explicitly decide so ;) Make it a policy is def. better in my book.
I agree that designation of tasks is better - I saw this as collecting metrics more than as a distinct activity. I'll add "metric_policy" to the to-do list.
It's mainly about the need to write type::other_type::stuff and to have to check/remember which comes before what. A unified thing like result_of<type(user_type)>::type looks better.
Sorry, but I'm not really understanding your point. Where in the user code is your objection? Can you cite an example and provide your preferred alternative?
... this is something I can contribute.
Thanks! It should be easy to implement an OpenMP schedule_policy within the current framework.
A nice thing would be small-scale examples tied to tasks that one may have to do in a parallel way, demonstrating that the MapReduce approach adds value at some point (paraphrasing Murray Cole's statement to "show the payback").
Yes, I'll add that too. Thanks -- Craig

Craig Henderson wrote:
This is how it started :) I'm happy to change it back again if consensus is that it is a cleaner interface. But doesn't it enable/encourage misuse of the separation of map tasks?
You can use tag dispatching, Concepts or SFINAE to check for some properties of the function object (e.g. number of parameters, etc.).
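Joel's SFINAE suggestion could be sketched like this (C++17, hypothetical names): reject at compile time anything that isn't callable with the signature a mapper must provide.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Sketch of a SFINAE check (hypothetical names): is F callable as
// f(key, value) with the signature a mapper must provide?
template <typename F, typename = void>
struct is_map_callable : std::false_type {};

template <typename F>
struct is_map_callable<
    F, std::void_t<decltype(std::declval<F &>()(
           std::declval<std::string const &>(),
           std::declval<std::string const &>()))>> : std::true_type {};

struct good_mapper {
    void operator()(std::string const &, std::string const &) {}
};

struct bad_mapper {
    void operator()(int) {}  // wrong parameter list: rejected
};

static_assert(is_map_callable<good_mapper>::value, "accepted");
static_assert(!is_map_callable<bad_mapper>::value, "rejected");
```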
Sorry, but I'm not really understanding your point. Where in the user code is your objection? Can you cite an example and provide your preferred alternative?
I'll write some stuff later on this issue.
Thanks ! It should be easy to implement an openMP schedule_policy within the current framework.
Well I was thinking of comparing your framework to hand written openMP.
Yes, I'll add that too.
___________________________________________ Joel Falcou - Assistant Professor PARALL Team - LRI - Universite Paris Sud XI Tel : (+33)1 69 15 66 35

Craig Henderson wrote:
This is how it started :) I'm happy to change it back again if the consensus is that it is a cleaner interface. But doesn't it enable/encourage misuse of the separation of map tasks?
You can use tag dispatching, Concepts or SFINAE to check for some properties of the function object (e.g. number of parameters, etc.).
I think we're misunderstanding each other. My point is that if it is an instance of a functor, then the developer may be tempted to assume the object carries state and store instance data. This is invalid, as the map task object is created, used and then destroyed. Any attempt to share data between map objects breaks the MapReduce design and its scalability properties (which need to be preserved for loyalty to the MapReduce paradigm and for future multi-machine scalability). Regards -- Craig

Craig Henderson wrote:
I think we're misunderstanding each other. My point is that if it is an instance of a functor, then the developer may be tempted to assume the object carries state and store instance data. This is invalid, as the map task object is created, used and then destroyed. Any attempt to share data between map objects breaks the MapReduce design and its scalability properties (which need to be preserved for loyalty to the MapReduce paradigm and for future multi-machine scalability).
Well, that's a problem with any kind of functor anyway. There is the same problem with Boost.Thread, where you should be aware of the internal copy of the functor object, etc. Just document what happens with the functor, I guess. -- ___________________________________________ Joel Falcou - Assistant Professor PARALL Team - LRI - Universite Paris Sud XI Tel : (+33)1 69 15 66 35
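Joel's Boost.Thread comparison holds in standard C++ as well (std::thread has the same copy semantics): the thread runs on its own copy of the functor, so mutating members never leaks back to the original.

```cpp
#include <cassert>
#include <thread>

// std::thread (like Boost.Thread) copies the functor it is given; the
// thread invokes its private copy, so mutating members there does not
// affect the original object.
struct counter {
    int n = 0;
    void operator()() { ++n; }  // increments the thread's copy only
};

int run_copy_demo() {
    counter c;
    std::thread t(c);  // c is copied into the thread
    t.join();
    return c.n;        // still 0: the original was never touched
}
```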

My point is that if it is an instance of a functor, then the developer may be tempted to assume the object carries state and store instance data. This is invalid, as the map task object is created, used and then destroyed. Any attempt to share data between map objects breaks the MapReduce design and its scalability properties (which need to be preserved for loyalty to the MapReduce paradigm and for future multi-machine scalability).
Well, that's a problem with any kind of functor anyway. There is the same problem with Boost.Thread, where you should be aware of the internal copy of the functor object, etc.

Just document what happens with the functor, I guess.
But my static method interface solves the problem. Is that not better than documenting a side-effect? Regards -- Craig

Craig Henderson wrote:
But my static method interface solves the problem. Is that not better than documenting a side-effect?
What will happen if you add static data? Won't it be shared and accessible through the map functor in the same way? -- ___________________________________________ Joel Falcou - Assistant Professor PARALL Team - LRI - Universite Paris Sud XI Tel : (+33)1 69 15 66 35

Craig Henderson wrote:
But my static method interface solves the problem. Is that not better than documenting a side-effect?
What will happen if you add static data? Won't it be shared and accessible through the map functor in the same way?
Yes, if it were used to share data between instances; but that is easier for the developer (library user) to understand than documenting a side-effect of doing something more natural, like adding instance data. The static interface surely discourages this activity? Regards -- Craig
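Joel's static-data objection, as a sketch: a static member is one object shared by every instance and every thread, so the static-method interface does not by itself rule out shared state.

```cpp
#include <cassert>

// A static data member is shared by all tasks regardless of whether the
// entry point is a static method; concurrent updates would still race.
struct counting_map_task {
    static int shared_total;  // one object for every task and thread
    static void map(int value) { shared_total += value; }
};
int counting_map_task::shared_total = 0;
```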

Craig Henderson wrote:
Yes, if it were used to share data between instances, but that is easier for the developer (library user) to understand than documenting a side-effect of doing something more natural like adding instance data. The static interface surely discourages this activity?
I think the point is moot. If the user wants to do something bad, he'll do it anyway. So better to stick to the commonly accepted form of functor, I think. -- ___________________________________________ Joel Falcou - Assistant Professor PARALL Team - LRI - Universite Paris Sud XI Tel : (+33)1 69 15 66 35

Craig Henderson wrote:
I think we're misunderstanding each other. My point is that if it is an instance of a functor, then the developer may be tempted to assume the object carries state and store instance data. This is invalid, as the map task object is created, used and then destroyed. Any attempt to share data between map objects breaks the MapReduce design and its scalability properties (which need to be preserved for loyalty to the MapReduce paradigm and for future multi-machine scalability).

On 09 August 2009 17:32, joel wrote:
Well, that's a problem with any kind of functor anyway. There is the same problem with Boost.Thread, where you should be aware of the internal copy of the functor object, etc.
I've thought about this a lot, and have concluded that a regular functor *is* more intuitive, as it is consistent with other libraries. To that end, I've updated the sandbox with the code changes in the library and the sample/test applications, as well as the documentation. To all users of the library, PLEASE NOTE the breaking change: map_task and reduce_task must now implement a function operator rather than static map and reduce methods. The functor signatures are the same as the previous static methods. Thanks -- Craig http://www.craighenderson.co.uk
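The breaking change, sketched side by side (the type names and the Runtime stand-in are hypothetical; real tasks derive from the library's map_task/reduce_task base classes and take the library's runtime object):

```cpp
#include <cassert>
#include <string>

// Stand-in for the library's runtime object, just to make the sketch
// self-contained.
struct dummy_runtime {
    int emitted = 0;
    void emit_intermediate(std::string const &, int) { ++emitted; }
};

// Before: a static method.
struct old_style_map_task {
    static void map(dummy_runtime &rt, std::string const &key, int value) {
        rt.emit_intermediate(key, value);
    }
};

// After: the identical signature as operator(), so the task behaves as an
// ordinary function object.
struct new_style_map_task {
    void operator()(dummy_runtime &rt, std::string const &key, int value) const {
        rt.emit_intermediate(key, value);
    }
};
```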

Craig Henderson wrote:
I have considerably revised and updated my MapReduce library and committed the changes to the Sandbox. I've also uploaded to the Vault for easy access
I am very interested to hear any comments on design, code, performance or any other area.
Hi Craig, Quoting from the start of your docs: "The Boost.MapReduce library is a MapReduce implementation across a plurality of CPU cores rather than machines." Isn't that rather missing the point of what MapReduce is supposed to be about? If I'm limited to one machine, I can write parallel code using the full repertoire of techniques. By re-designing my application to fit into the MapReduce pattern I can potentially scale it over multiple machines. But if I can't scale over multiple machines, why bother? Are you planning to support scaling over multiple machines in the future? Regards, Phil.

Hi Phil,
Quoting from the start of your docs:
"The Boost.MapReduce library is a MapReduce implementation across a plurality of CPU cores rather than machines."
Isn't that rather missing the point of what MapReduce is supposed to be about? If I'm limited to one machine, I can write parallel code using the full repertoire of techniques.
You can, and this is basically just another alternative technique. Writing multithreaded applications can be difficult, and is often done badly, so this library provides a framework to do the donkey-work and allow the developer to concentrate on solving their problem. Other libraries already exist for single-machine map/reduce (google for "phoenix mapreduce"), and there's an evaluation paper on it at http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf
By re-designing my application to fit into the MapReduce pattern I can potentially scale it over multiple machines. But if I can't scale over multiple machines, why bother?
In this scenario, then don't bother, indeed. But if you want easily to implement low-lock-contention multithreaded processing, then you might take a look.
Are you planning to support scaling over multiple machines in the future?
Yes, I am designing and developing a distributed file system aimed at achieving this (see http://craighenderson.co.uk/blog/index.php/tag/distributed-file-system/), or integration with any other DFS could do the same. The library is very much in its infancy, but I believe it is useful enough to be a part of Boost in its single-machine state. Regards -- Craig
participants (3)
- Craig Henderson
- joel
- Phil Endecott