Re: [boost] [SORT] Parallel Algorithms

20 Mar 2015

      Hi Steven

 I have the final version of all the algorithms. I wrote 6 algorithms of
stable sort, and selected the fastest. On my computer is about 15% faster
than the GCC stable sort.

 I had problems too with an error in an algorithm mixed with an error in
the GCC dynamic memory system, Under certain sequence of request, the
system provide a non valid address and due this a segmentation fault. The
debug for to detect and correct had been a nightmare.

 The algorithms provided are :

 1.- sort : The algorithm is intro_sort. Don't need additional memory.
Performance similar to the GCC version and Windows version. Non stable

2.- parallel_sort : Decompose the global range in small ranges for to be
sorted with intro_sort. Don't need additional memory. Similar performance
than the TBB version and Microsoft version. Non Stable

3.- stable_sort : The algorithm is called smart_merge_sort, and is about
10-15% faster than the GCC stable sort version, The additional memory is an
half of the memory used by the data.( The GCC stable sort use the same
memory than the data)

4.- sample_sort : is a parallel stable sort. Have the same performance than
the TBB implementations, and much more faster than the GCC parallel
stable_sort. The additional memory is the same than the data

5.- parallel_stable_sort : This algorithm is based on the sample sort.
Increase the sample_sort time about a 5%, but the additional memory used is
an half of the memory used by the data.

 The functions are in the namespàce boost::sort All the algorithms are
exception safe, and use indirect sort when the size of the objects is big.
According to my benchmarks, the size limit for to change a indirect sort is
128 bytes.

 The function format of the parallel algorithms are

 algorithm ( first, last, compare , number_threads).

 The number of threads is passed with an struct NThreads which receives an
unsigned, and return an unsigned. The default construct of this struct is
filled with the HW threads

 At today I am testing on Windows with Visual Studio 2013 CTP 4, and
changing some things of the code due to the incompatibility with C++11
features

 In a few days I will document all the functions of the code for generate a
doxygen documentation, and can send you a version of the code.

 Then we must talk about how to design the benchmark. I think the starting
point can be your experience with your code

 The C++ committee propose a format for to call the parallel functions
described in

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf

it would be a good idea to examine, and decide if follow the proposal, if
we don't follow or if we provide the two interfaces for the parallel
algorithms. According with my first impression it's not very complex to
implement.

 Yours

 Francisco Tapia

2015-01-14 17:26 GMT+01:00 Francisco José Tapia <fjtapia@gmail.com>:
...
Hi Thorsten,
The idea about the indirect sort is very easy. Instead of sort the
elements, create a vector of pointers, and sort the vector of pointers,
with this you don't move the elements , move the pointers. And at end with
the vector of pointers sorted, move the data only 1 time.
In the stable sort, the algorithms usually need additional memory. The
GCC and TBB the same memory than the data, in my algorithms only 1 half.
Imagine you want to sort 10 million elements of 128 bytes each. With the
indirect sort you need 10 million pointers and additionally memory for to
allocate 5 million pointers.
The memory used by the data is 1280 Megas, but pointers and the
additional memory are 15 million * 8 = 120 Megas, around a 10% of the
memory used by the data.
In the first messages of this conversation, I sent a zip with code. In
this file you can find the code of the indirect sort. If don't understand,
something, please say me, and I will try to explain.
Paul,
about the name space, if at end, the results are satisfactory, and
decide to include in the library, only say me the name space and I will
put. For me it's indifferent . But, I need time for the new algorithm, and
design a complete test suite of the code, the benchmarks, for to compare
with others versions and the documentation.
If you want a benchmark with multi precision numbers, we can do without
problem, even if you want to check the speed with different sizes of the
numbers
Yours
Francisco