This message is to announce the new version of the Boost Sort Parallel algorithms. You can find it at http://github.com/fjtapia/sort_parallel. In the documentation you can see the description of the algorithms, examples of their invocation, and graphs of the benchmark results.

The benchmark itself is not included in the library. I created a separate repository, https://github.com/fjtapia/sort_parallel_benchmark, where you can find the benchmark code, the library, and the external code of the TBB parallel stable sort. It did not seem logical to include external code inside the library.

The most significant part of this version is the new parallel sort algorithm (internally named Block Indirect Sort).

Parallel sort algorithms fall into two groups:

- Algorithms that are very fast with many threads (GCC Parallel Sort, Microsoft Parallel Buffered Sort), but which use additional memory equal in size to the data.
- Algorithms without additional memory, but slower with many threads (TBB Parallel Sort, Microsoft Parallel Sort).

The new algorithm, invented*, designed and implemented by the author, provides the speed of GCC Parallel Sort with memory usage similar to TBB, and with the additional advantage that it does not need any library (OpenMP, TBB, Microsoft PPL ...). It only needs a C++11 compiler.

An example shows this: sorting 100 000 000 random 64-bit numbers on a server with 32 threads, the results obtained were

                           Time spent   Memory used
    GCC Parallel Sort        1.05 s       1560 MB
    TBB Parallel Sort        1.76 s        783 MB
    Block Indirect Sort      0.97 s        790 MB

In the benchmark repository you can find the results obtained on several machines:

https://github.com/fjtapia/sort_parallel_benchmark/tree/master/Linux64_GCC/r...
https://github.com/fjtapia/sort_parallel_benchmark/tree/master/Linux64_GCC/r...

In these files you can see that the real rival is GCC. TBB is clearly surpassed, due to its internal algorithm.

GCC divides the data into parts that are sorted independently by all the threads. The merge that follows is an easy process: it copies into an auxiliary memory of the same size as the data. All the threads are kept fed, but double the memory is needed.

The new algorithm does the same division and independent sorting, but for the merge it uses only a small buffer of 1024 elements; when a part is moved into the buffer, the algorithm must calculate which part must be copied into the place it left. It does the same work, but without the additional memory, and because of this it needs more CPU for the process.

Hyper-threading (HT) provides around 30% more power to the processor. Usually, on machines with HT enabled the new algorithm is a bit faster than GCC, and with HT disabled it is slower. TBB is slower except with very big objects, but then the bottleneck is not the algorithm, it is the data bus.

The new algorithm is a collection of algorithms which are bad when you have only one thread, but which are easily subdivided. This permits parallelizing the whole process and keeping all the threads fed.

The new algorithm is documented in two articles:

- block_indirect_sort_brief_en.pdf: introduction and benchmark graphs.
- block_indirect_sort_en.pdf: introduction, detailed description of the algorithm, and benchmarks.

If you want to run the benchmark on your machine, it is easy. In the benchmark repository you have the code used, plus shell scripts to compile and run it, saving the results in a Results.txt file, for Linux (GCC and CLANG) and Windows (Visual Studio 15). If you only want a quick taste of the invocation, see the sketch at the end of this message.

If you see any problem or mistake, something to improve, or you need any other information, please tell me.
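To give a quick flavour of the invocation, here is a minimal sketch that fills a vector with 100 000 000 random 64-bit numbers and times the sort, roughly what the benchmark above measures. The include path and the boost::sort::block_indirect_sort entry point shown here are indicative only (they follow the Boost.Sort style); check the documentation in the repository for the exact header and namespace of this release.

    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <vector>

    // Indicative header and entry point; see the repository documentation
    // for the exact include path and namespace of this release.
    #include <boost/sort/sort.hpp>

    int main()
    {
        // 100,000,000 random 64-bit numbers, as in the benchmark above
        // (reduce N if your machine has little memory).
        const std::size_t N = 100000000;
        std::vector<std::uint64_t> data(N);

        std::mt19937_64 gen(0);
        for (auto &x : data) x = gen();

        auto start = std::chrono::steady_clock::now();
        boost::sort::block_indirect_sort(data.begin(), data.end());
        auto stop = std::chrono::steady_clock::now();

        std::chrono::duration<double> secs = stop - start;
        std::cout << "sorted " << N << " elements in "
                  << secs.count() << " s\n";
        return 0;
    }

With only the iterator range, the sort is expected to use all the hardware threads available; the documentation also describes the overloads that take a comparison object and an explicit thread count.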
Yours,
   Francisco

* Invented: the algorithm was conceived, designed, and implemented from scratch, starting from an old idea. After reading hundreds of articles and books, I did not find anything similar. If you know of something similar, please tell me. But the important thing is not the author; it is providing a fast, robust, and easy-to-use algorithm to the community of programmers.