
Actual comment: Do you have actual performance data for non-trivial tasks? More precisely, I know from experience that this kind of implementation may suffer from some C++-induced overhead. How does it compare to hand-written pthread or boost::thread code?
I'm interested to see how this performs.
I have designed the class infrastructure to be as flexible as possible using templates. Job scheduling is a particular interest of mine, and is a policy that can be specified. The current 'library' includes two schedulers: mapreduce::schedule_policy::cpu_parallel, used in the example, which maximises the use of the CPU cores in the machine, and mapreduce::schedule_policy::sequential, which runs one Map task followed by one Reduce task. The latter is useful for debugging the algorithms.

What I haven't shown in the documentation is that intermediates::local_disk<> takes three template parameters, the last two of which are defaulted. These are for sorting and merging the intermediate results. The current implementation uses a crude system() call to the OS, which of course needs improving. Interestingly, it is the sorting that takes much of the time in my tests so far.

So, to answer your question, I don't have specific performance metrics and comparisons that I can share with you at this time. The principle of the library is that everything is templated (policy-based), so components can be swapped around and re-implemented to best suit the needs of the application. The supplied implementations provide the framework and a decent implementation of the policies, but they will not be optimal for all users. -- Craig