Hi Steven, thanks for your fast response. I want to clarify several points from your message. Please excuse my English; it is not my mother tongue, and I make errors and imprecisions.

TEST OBJECTS

The object used in the test is named Reburcio (a word from an exercise in my first Pascal book; it doesn't mean anything). The constructor, copy constructor and operator= copy all the elements with a loop. The only operation which used only the first element was the comparison. I wrote a new comparison which compares the sum of all the elements of the object (a small sketch of such an object is shown further below). The times are similar: you must add about 5% for the small objects, up to 15% for the big objects.

TIMSORT

The Tim Sort code is not mine. Tim Sort is used in Java and Android, and it is especially efficient with big objects (you can see this in benchmark_single_stable_sort_objects.txt). I use it for speed comparison. I didn't explore Tim Sort much, because I saw that the merge-based methods were faster, and I discarded the algorithm. If you think it deserves a second opportunity, we can look for new implementations and check their speed in order to make a decision.

I didn't measure the auxiliary memory used by Tim Sort. Wikipedia talks about O(N) auxiliary memory, but perhaps saying it is equal to the memory used by the data is not correct.

In the next benchmark programs, as Steven says, the data will be loaded from a file, and each program will execute only one algorithm. With this configuration, checking the memory used is simple. Right now it is very difficult, because the program executes all the algorithms, and it is not possible to know how much memory each algorithm uses.

DESIGN PHASE

We are now in the design phase. Nothing is fixed; we can comment on and change anything. The specification is simple: provide sort algorithms, stable and unstable, that run with one thread or with any number of threads, without any external library (such as OpenMP, TBB, CilkPlus...), only with the C++ standard. If you want to discuss this, we can. Ideas, suggestions and opinions are welcome; any help is welcome.

The final decision about whether the code is included in the Boost sort library depends on Steven, the creator of the library. After the design phase, with the decisions taken and the interfaces fixed, we will prepare the final version, redesign the test system, test in depth and create the documentation.

ABOUT MYSELF

This software comes from another project for the Boost library (Contertree, concurrent red-black trees with access by position, like vectors), which has been pending its final review for nearly two years.

When this work began, the implementation of the parallel non-stable sort had performance similar to the GCC version (based on OpenMP) and the TBB version. The single-threaded stable sort was slower than the GCC version, and the parallel stable sort was faster than the OpenMP-based GCC version. Then someone mentioned the TBB sample sort implementation, which, even though it is experimental, does not destroy the objects in the auxiliary memory and is not exception safe. We entered a crisis: that version was 20-30% faster than ours.

I want to show my admiration for TBB and their team; I think they are the best at concurrent software. I tried to improve. I would like to have more time, more equipment and better documentation in order to find the best options, but I don't have them, and I do this in my free time, after my work and my family. I designed six stable sort algorithms in order to select the fastest one (internally named smart_merge_sort, about 10% faster than the GCC version).
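Going back to the TEST OBJECTS section for a moment, this is a minimal sketch of an object of that kind: the constructor, copy constructor and operator= copy all the elements with a loop, and the comparison uses the sum of all the elements. The name, the element type and the use of operator< here are only my illustration, not the exact code of the benchmarks.

#include <cstdint>

// Sketch of a benchmark object: NN is the number of 64-bit elements, so the
// object size grows with NN and copying it touches every element.
template <uint32_t NN>
struct reburcio
{
    uint64_t M[NN];

    reburcio()
    {   for (uint32_t i = 0; i < NN; ++i) M[i] = i;
    }
    reburcio(const reburcio &R)
    {   for (uint32_t i = 0; i < NN; ++i) M[i] = R.M[i];
    }
    reburcio & operator=(const reburcio &R)
    {   for (uint32_t i = 0; i < NN; ++i) M[i] = R.M[i];
        return *this;
    }
    // new comparison: sum of all the elements, instead of only the first one
    uint64_t sum() const
    {   uint64_t S = 0;
        for (uint32_t i = 0; i < NN; ++i) S += M[i];
        return S;
    }
    bool operator<(const reburcio &R) const
    {   return sum() < R.sum();
    }
};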
I prepared a version of Sample Sort using only threads and atomic variables. In the benchmarks it has performance similar to the TBB version, uses half of the auxiliary memory used by TBB, and is exception safe. In my opinion, this parallel stable sort is reasonably good for inclusion in the final version. The important case, which shows the quality of the algorithm, is when the data are unsorted. I will try to implement the detection of already sorted elements, like in TBB, but I am not sure about the results. The internal loops of the algorithm are delicate, and a small change can produce big changes in performance, as happened with other algorithms.

We are in the design phase. We must test on other machines, big and small. Perhaps we will have to change or redesign things, but we need help. If someone tests the programs, please send me the result file and the machine description (processor, number of cores, speed, memory, cache) so we can draw conclusions, fix the code and move on to the next phase.

Thanks in advance

Francisco