Hi Steven, I thought all the program was checked, but obviously not. Now in the folder you have a shell script with the command for to compile and link with optimization ( by example make_benchmark_sort_03.sh), and a make_all.sh which compile all. Open a terminal in the folder, and compile and run All the programs use the HW thread and use all available. All the programs had been done for to run in a Linux machine with 4 G of RAM. The number of element involved in the sort is defined at the beginning of the program with the #define NELEM. You can change this value, and compile again. As suggested me, change the number generator and now use a Mersenne of the standard library. The stable sort is less used than the non stable sort, but sometimes is important. I developed because in my project have a proxy , which receives the operations over the elements defined by a key. The proxy sort, and send the operations to the corresponding thread, and it is not the same, read the value of a key and after delete that key, than delete the key and after try to read the value. I can't add a time mark for to know which arrived before In order to clarify the information obtained from the benchmarks : In the folder Modified the GCC parallel:sort and parallel:merge_sort, and the countertree::parallel_sort and countertree::parallel_merge_sort use the std::sort and the std::merge_sort. The utility of this folder is the comparison with the same programs of the folder Original, the result show the problems described in this message. In the folder Original the GCC parallel:sort and parallel:merge_sort, use the std::sort and the std::merge_sort., and the countertree::parallel_sort use countertree::intro_sort and countertree::parallel_merge_sort use countertree::merge_sort. SCALABILITY : Some algorithms don't scale well when the number of threads grow. Increasing the number of threads don't decrease the execution time. This lack of scalability is emphasized when the size of the elements to sort grows As say in the previous message, the stable sort, with 1 thread need 2.7, and with 4 HW threads need 8.49, around 3 times the time of 1 thread. And the countertree::merge_sort need 2.7 for to sort with 1 thread and with 4 HW threads need 4.89, secs ( See the [Original test_stress_sort.cpp ] ) This show countertree::merge_sort is a good for to use in a parallel SW and GCC std::stable_sort is bad. If you see the results of the countertree:parallel_merge_sort in the benchmarks in the folder Modified, are really bad compared with the same results in the folder Original. The difference is the internal algorithm. In the folder Modified use GCC std::stable_sort, and in the Original use countertree::merge_sort The GCC std::sort and countertree::intro_sort have similar performance with all the size of elements. But in the parallel process, when the size of the elements grow, the poor scalability appear. The next lines are are from Original benchmark_sort_05.cpp on my computer With 128 bytes elements Random elements, few repeated ----------------------- STL std::sort :2.9188 secs countertree::intro_sort :2.87864 secs parallel::sort :2.46113 secs tbb::parallel_sort :1.86458 secs countertree::parallel_sort:1.89263 secs With 256 bytes elements Random elements, few repeated ( rand() )----------------------- STL std::sort :3.03143 secs countertree::intro_sort :3.02733 secs parallel::sort :2.23231 secs tbb::parallel_sort :1.76186 secs countertree::parallel_sort:1.56509 secs As we can see the parallel process is ineffective with GCC parallel::sort with big objects. About the merge_sort algorithm. It use additionally half of the memory used by the data. But if you don't use additional memory, the speed drops. I didn't examine in deep the Algorithm used in GCC, I only codified my own algorithm. I will examine, and I will try to reduce the memory used, without lost of speed and scalability. About the server for to test. I don't have modern servers in my school. The budget cut eliminate the new servers. I have an I7 Xeon , but is a 3 years old machine with 4 HW threads, and in the benchmarks is slower than the I3 and I5 machined mounted this year. If someone can check and provide us the result, I will be grateful. The version 4.9.1 of GCC have better optimization than the 4.8.2. If you are using Ubuntu, the version 14.10 have the 4.9.1 version as predefined. Yours Francisco Tapia