Re: [boost] [SORT] Parallel Algorithms

23 Dec 2014

      Francisco,

On Mon Dec 22 2014 at 5:04:17 PM Francisco José Tapia <fjtapia@gmail.com>
wrote:
...
...
I had been working about the memory used by the different algorithms,
specially the stable algorithms. I developed a low memory algorithm
circular_stable_sort, which use only a 10% more of memory, and the speed
decrease around 15% with small objects and with big objects is similar to
the merge_sort
Some people may find that useful.  What is the worst-case computational
complexity?  Is it a well-known algorithm?  The bar is going to be higher
(may require a full review) for previously unknown algorithms.
...
...
I was looking too, the TimSort algorithm. I found a C++ implementation
in https://github.com/gfx/cpp-TimSort .
I inserted this algorithm in the benchmark programs. With small objects,
the speed is slower than the stable sort and merge_sort, and have a good
performance with big objects, greater than 64 bits. It's logic the
decision
of Java and Python, because they must order objects, usually non
contiguous
and bigger than the naked number.
For the measure of the memory used by the program I use the command
/usr/bin/time -v program_to_execute. I take the benchmark_algorithm_04
(100000000 uint64_t numbers), and comment all the invocations to the
algorithms , except one, compile, and run in a terminal with the command
/usr/bin/time -v benchmark_sort_04.
I repeat this process with all the algorithms for to know the memory
used by each algorithm.
I suggest changing the struct you use for benchmarking to this:
template <uint32_t NN>
struct reburcio
{
  uint64_t M[NN];
  reburcio () {
    init();
  }

  void init() {
    for (int i = 0; i < NN; ++i) {
      M[i] = my_rand();
    }
  }

  bool operator < ( const reburcio &R) const{
    for (int i = 0; i < NN; ++i) {
      if (M[i] != R.M[i]) {
        return M[i] < R.M[i];
      }
    }
    return false;
  }
};

That way it's using the default copy and assignment operators, which copy
over the entire datastructure, and it initializes and compares all elements
of the array.  It appears that the benchmark you sent me only copies over
and compares the first element in the array, which might have caused some
of the large-element differences in performance you were seeing.

With this data structure, countertree intro_sort is competitive on my linux
system with std::sort on fully random data, and countertree parallel_sort
is competitive for 8, 16, and 256 bytes per element, but significantly
underperforms (10-15%) tbb for 32 and 64 bytes.
...
...
As you can see with a 256 bytes of size the indirect
parallel_stable_sort is 3 times faster. And the indirect_sort is a 25%
faster.
Yes, indirect sort should eventually be faster, given a large enough
element, especially if indirect sort is done with the fastest underlying
algorithm available.
...
...
The final solution, if at end is taken, will be a hybrid algorithm
depending of the size of the elements
That sounds reasonable for fixed-size elements.  What about variable-length
elements, like strings?  Will this only be for fixed-size elements?
...
...
I propose you the next. Let me 2 or 3 weeks and I will prepare a big
benchmark with all the algorithms, including the indirect version of the
stable and unstable sort. We will run the benchmarks over different
machines, and according the results, take a decision about if we propose
something to Boost, and which algorithm must be used in the hybrid
version.
If, at end, we decide don't propose nothing, I only can say this had
been a nice and exciting experience
I think indirect sorting is promising.
...
...
Mi intention is not to impose my code, I want provide to the final users
the best solution, if exist. With my code or with the code of other. I am
the first to clap the winner.
We are doing this because this parallel implementation don't need any
other library, only the standard. There are compilers without OpenMP or
TBB
...
support, or simply, the user don't want to load them for to make a sort.
Agreed, but OpenMP is very common (even Android supports it)!, so systems
without OpenMP are a very small niche in terms of total systems out there.
TBB is supported on intel-compatible processors, and the vast majority of
deployed processors where parallel sorting makes sense (won't drain the
battery really quickly) are intel-compatible.  If we're going to provide an
alternative, it should cost very little in terms of performance, and
ideally provide some kind of benefit besides eliminating a dependency.
...
...
I have in mind, several improvements of the algorithms. I will codify
and test, and if run well, I include in the algorithms. I want, too,
simplify the interface of indirect_sort, for to be used in a easy way by
any sort algorithm. If you are interested, I will prepare too a brief
document with the explanation and the internal algorithm.
I'd like to see that document.
Along with testing a struct that is an array of ints, you should probably
test variable-length string sorting, as string sorting is very common.
Merry Christmas,

Steve

Re: [boost] [SORT] Parallel Algorithms

Steven Ross