
I am thinking now about how to progress my MapReduce library that is in the Boost Sandbox. I have completed the single-machine implementation; its performance is comparable to other libraries such as Phoenix (http://mapreduce.stanford.edu), and it has been tested by a few people on this list.

There has, however, been little interest in the library so far from Boost users/developers, which surprises me. I don't know whether that is because of the single-machine limitation, with people not seeing any value that MR can bring to multi-threaded programming.

So where do I go next with the library? Options that I see are:

1. Complete the documentation and submit the current library to Boost for a formal review. I'm not sure there is sufficient interest, though.
2. Continue to develop the library in the sandbox towards a multi-machine implementation and work towards submitting that for formal review. Is there interest in this?
3. Accept that Boost is not an appropriate forum for a MapReduce library and find an alternative community to develop the library.

Looking at the download figures from the Vault, there has been a reasonable number of downloads; of course, there is no way to tell the number of people looking at the sandbox implementation. Perhaps the library has some silent interest?

Please let me know your thoughts.

Thanks
-- Craig

Craig Henderson
http://www.craighenderson.co.uk
http://www.siteupdatenotification.com

Craig Henderson wrote:
> There has, however, been little interest in the library so far from Boost users/developers, which surprises me. I don't know if that is because of the single-machine limitation and people don't see any value that MR can bring to multi-threaded programming?
People may be lacking perspective on the problem. One thing that can help is solving a real-life problem with your library and showing the payoff.
> 2. Continue to develop the library in the sandbox to multi-machine implementation and work towards submitting that for formal review. Is there interest for this?
You mean over MPI or something?
> Looking at the download figures from the Vault, there has been a reasonable number of downloads; of course there is no way to tell the number of people looking at the sandbox implementation. Perhaps the library has some silent interest? Please let me know your thoughts.
Well, Boost is sometimes a slow-moving community, and with potential users away on summer break you may just have to wait a little longer.

--
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

>> There has, however, been little interest in the library so far from Boost users/developers, which surprises me. I don't know if that is because of the single-machine limitation and people don't see any value that MR can bring to multi-threaded programming?
> People may be lacking perspective on the problem. One thing that can help is solving a real-life problem with your library and showing the payoff.
I definitely agree. Concepts like map-reduce are still at the "heard of" stage for a lot of developers; most of them think "it's a nice idea", but when they need concurrency their immediate real-world thinking is basic threads. Showing simple tasks solved the classic way and the map-reduce way, along with the pros and cons of each approach, would be a nice thing to have.

Philippe
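
To make the "classic way versus map-reduce way" comparison concrete, here is a minimal sketch of a word count written in map-reduce style with plain C++11. It does not use the sandbox library's API. The names `Counts`, `map_chunk`, `reduce_into`, and `word_count` are made up for illustration, and `std::async` stands in for whatever scheduling a real library would do.

```cpp
#include <cstddef>
#include <future>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: word -> number of occurrences.
using Counts = std::map<std::string, std::size_t>;

// Map phase: count the whitespace-separated words in one chunk of text.
Counts map_chunk(const std::string& chunk) {
    Counts counts;
    std::istringstream in(chunk);
    std::string word;
    while (in >> word) ++counts[word];
    return counts;
}

// Reduce phase: merge one partial result into the running total.
void reduce_into(Counts& total, const Counts& partial) {
    for (const auto& kv : partial) total[kv.first] += kv.second;
}

// Launch one asynchronous map task per chunk, then reduce the partial
// results sequentially on the calling thread.
Counts word_count(const std::vector<std::string>& chunks) {
    std::vector<std::future<Counts>> futures;
    for (const auto& chunk : chunks)
        futures.push_back(std::async(std::launch::async, map_chunk, chunk));

    Counts total;
    for (auto& f : futures) reduce_into(total, f.get());
    return total;
}
```

The user-written parts (`map_chunk` and `reduce_into`) contain no threading code at all, which is the main contrast with a hand-rolled version that creates threads and guards a shared map with a mutex. The trade-off is that each map task builds its own partial map, which costs extra memory and a merge step.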

Philippe Vaucher wrote:
> I definitely agree. Concepts like map-reduce are still at the "heard of" stage for a lot of developers; most of them think "it's a nice idea", but when they need concurrency their immediate real-world thinking is basic threads. Showing simple tasks solved the classic way and the map-reduce way, along with the pros and cons of each approach, would be a nice thing to have.
I encourage you to see what Murray Cole and Marco Danelutto said about the exact same thing on skeletons in 1999-2005. The reasoning still holds.

-- Joel Falcou

Craig,

I'm very interested in progressing this library. I'm actually trying to convince my boss to jump on the MapReduce bandwagon, using Hadoop that is. Next for me is to use your lib for a work-related problem to show off its potential.

As for the Boost version, I could imagine creating a batch image processor using your MapReduce lib. I have a huge collection of images I need to resize to be able to upload them. We could make this as flexible as possible to run all kinds of operations.

Let me know what you think.

Christian

>> I definitely agree. Concepts like map-reduce are still at the "heard of" stage for a lot of developers; most of them think "it's a nice idea", but when they need concurrency their immediate real-world thinking is basic threads. Showing simple tasks solved the classic way and the map-reduce way, along with the pros and cons of each approach, would be a nice thing to have.
> I encourage you to see what Murray Cole and Marco Danelutto said about the exact same thing on skeletons in 1999-2005. The reasoning still holds.
I'm not sure what you meant by this. Were you referring to things like http://homepages.inf.ed.ac.uk/mic/ColePisa04.pdf ? Searching Google for what you said gives too many results about skeletons; I don't have time to read them all, and I'm not sure what to look for anyway. Thanks for pointing out existing material, though :)

Philippe

Philippe Vaucher wrote:
> I'm not sure what you meant by this. Were you referring to things like http://homepages.inf.ed.ac.uk/mic/ColePisa04.pdf ? Searching Google for what you said gives too many results about skeletons; I don't have time to read them all, and I'm not sure what to look for anyway.
The PISA talk is one of those I had in mind. I'll check my bibliolinks repository and share some. I can also share some of my own papers on the subject if needed.
> Thanks for pointing out existing material, though :)
Well, I always wondered why the Google guy who did MapReduce never checked those first and cited them ...

-- Joel Falcou

> I am thinking now about how to progress my MapReduce library that is in the Boost Sandbox. I have completed the single-machine implementation; its performance is comparable to other libraries such as Phoenix (http://mapreduce.stanford.edu) and has been tested by a few people on this list.
> 2. Continue to develop the library in the sandbox to multi-machine implementation and work towards submitting that for formal review. Is there interest for this?
FWIW, over at saga.cct.lsu.edu we have something similar in the works... There was even a GSoC project this year to implement MR on top of SAGA (SAGA == Simple API for Grid Applications, a platform-independent framework for multi-machine (cluster) operation).

Regards
Hartmut

On Mon, Aug 31, 2009 at 4:19 AM, Craig Henderson <cdm.henderson@googlemail.com> wrote:
> I am thinking now about how to progress my MapReduce library that is in the Boost Sandbox. I have completed the single-machine implementation; its performance is comparable to other libraries such as Phoenix (http://mapreduce.stanford.edu) and has been tested by a few people on this list.
> There has, however, been little interest in the library so far from Boost users/developers, which surprises me. I don't know if that is because of the single-machine limitation and people don't see any value that MR can bring to multi-threaded programming?
> So where do I go next with the library? Options that I see are: ... 2. Continue to develop the library in the sandbox to multi-machine implementation and work towards submitting that for formal review. Is there interest for this?
I would say 2 is the best option. People have been splitting tasks into multiple operations and combining the results on single machines for ages -- MapReduce doesn't really offer any innovation there. The innovation, and the buzz about it, is that it offers a reliable, general-purpose, and large-scale distributed implementation of this very basic idea. If you can accomplish that in this library, I think there will be _a lot_ more interest.

I think a lot of the MapReduce buzz also has to do with the services tied to it that further ease common scalability bottlenecks, the big ones being Google File System and BigTable. It's really just part of the bigger ecosystem.

--
Cory Nelson
http://int64.org

Cory Nelson said:
> People have been splitting tasks into multiple operations and combining the results on single machines for ages -- MapReduce doesn't really offer any innovation there.
Well, it provides a very easy framework for implementing parallel algorithms. Multithreading is hard and often done very badly; MR simplifies the task tremendously.
> The innovation, and the buzz about it, is that it offers a reliable, general-purpose, and large-scale distributed implementation of this very basic idea. If you can accomplish that in this library, I think there will be _a lot_ more interest.
> I think a lot of the MapReduce buzz also has to do with the services tied to it that further ease common scalability bottlenecks, the big ones being Google File System and BigTable. It's really just part of the bigger ecosystem.
Agreed. The difficulty is in defining where a library ends and the infrastructure begins. This library cannot (and should not, IMO) explode into a distributed file system (an extension to Boost.Filesystem) and a communications library (based on Boost.MPI or Boost.Asio). The library is the MapReduce algorithm, intended to sit on top of other infrastructure to provide an overall solution.

-- Craig
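
One way to picture the boundary Craig describes, as a sketch only: the algorithm is written against a small "runner" policy, and the infrastructure is whatever implements that policy. Everything here (`sequential_runner`, `async_runner`, `sum_of_squares`) is hypothetical and not taken from the sandbox library. A multi-machine runner built on Boost.MPI or Boost.Asio would be a further assumption and is not shown.

```cpp
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical infrastructure policy: runs every task on the calling thread.
struct sequential_runner {
    std::vector<int> run(const std::vector<std::function<int()>>& tasks) const {
        std::vector<int> results;
        for (const auto& task : tasks) results.push_back(task());
        return results;
    }
};

// Hypothetical infrastructure policy: runs each task through std::async
// and collects the results in submission order.
struct async_runner {
    std::vector<int> run(const std::vector<std::function<int()>>& tasks) const {
        std::vector<std::future<int>> futures;
        for (const auto& task : tasks)
            futures.push_back(std::async(std::launch::async, task));
        std::vector<int> results;
        for (auto& f : futures) results.push_back(f.get());
        return results;
    }
};

// The algorithm layer: it defines the map and reduce steps but delegates
// how the map tasks are executed to the Runner policy.
template <typename Runner>
int sum_of_squares(const std::vector<int>& input, const Runner& runner) {
    std::vector<std::function<int()>> tasks;
    for (int x : input) tasks.push_back([x] { return x * x; });  // map
    const std::vector<int> mapped = runner.run(tasks);
    return std::accumulate(mapped.begin(), mapped.end(), 0);     // reduce
}
```

Calling `sum_of_squares(values, sequential_runner{})` and `sum_of_squares(values, async_runner{})` gives the same result. Only the execution strategy changes, which is the sense in which the algorithm can sit on top of separately supplied infrastructure.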
participants (6)
- Christian Henning
- Cory Nelson
- Craig Henderson
- Hartmut Kaiser
- joel
- Philippe Vaucher