[Disclaimer: long e-mail, thank you for your attention]

Hi to all,

Some months ago I tried to gain interest from sorting experts in implementing stable and in-place merge [O(N)] or sort [O(NlogN)] algorithms that don't need additional memory. My main use case for these algorithms was lowering the complexity guarantees of the range insertion functions offered by flat associative containers without any temporary allocation (the programmer does not expect allocations when capacity() is big enough to hold the new elements). On the other hand, I wanted those algorithms to be adaptive so that they could take advantage of the extra capacity already allocated by flat containers (a minimal usage sketch is shown just before the merge benchmarks below). Of course, if a good implementation is available in Boost, it could be used as existing practice to change the worst-case complexities of some STL algorithms when no additional memory (e.g. from get_temporary_buffer()) is available.

After reviewing several algorithms from papers and some open source implementations, I decided to implement one of them. The implemented algorithm is based on the GrailSort algorithm developed by Andrey Astrelin. The main idea behind my adaptive_sort algorithm is explained by Andrey in an article published on the Russian collaborative blog Habrahabr (http://habrahabr.ru/post/205290/). The algorithm is based on ideas explained by Huang and Langston in their paper "Fast Stable Merging and Sorting in Constant Extra Space" (1989-1992, http://comjnl.oxfordjournals.org/content/35/6/643.full.pdf). Since the implementation from https://github.com/Mrrl/GrailSort is macro-based, does not support non-memcopyable types and has an MIT license (which seems to be problematic for Boost), I implemented the algorithm from scratch following the article and taking ideas from Andrey's implementation. I added some new ideas and optimizations in order to reduce comparisons and element moves. I must say that the implementation was really hard for someone like me without experience in sorting algorithms; the complexity is huge compared with a classic mergesort algorithm.

The implementation is already in the develop and master branches, but the develop branch contains more tests and refactored, better documented code, so please check the code from the develop branch if interested. The implementation is still experimental and not production ready.

What about benchmarks? I've only experimented with the oldest Visual compilers to make sure the code correctly supports Boost.Move emulation and that the complexity guarantees hold. Note that the STL algorithms tested in the benchmark might not be as fast as current implementations. In general, the new algorithms need many more element moves than algorithms that allocate O(N) memory, but they are still quite fast when data movement is cheap (e.g. when a move constructor just copies a couple of pointers). These are the numbers for a VERY LIGHTWEIGHT object (a struct holding two integers, one for the key and another one to test stability; don't take the times very seriously, just pay attention to the number of comparisons/moves).
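Before the numbers, here is a minimal sketch of the flat-container insertion pattern mentioned above. It assumes the plain range/comparator overloads of boost::movelib::adaptive_sort and boost::movelib::adaptive_merge from the develop branch; flat_insert_range is a hypothetical helper written only for this e-mail, not part of any library:

// Minimal sketch (not library code): insert a range of new elements into an
// already sorted vector ("flat set"-style) with no temporary allocation.
#include <boost/move/algo/adaptive_sort.hpp>
#include <boost/move/algo/adaptive_merge.hpp>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

template<class T, class InputIt>
void flat_insert_range(std::vector<T> &sorted, InputIt first, InputIt last)
{
   const std::ptrdiff_t old_size = std::ptrdiff_t(sorted.size());
   //Append the new elements; reallocates only if capacity() is too small
   sorted.insert(sorted.end(), first, last);
   std::less<T> cmp;
   //Stable sort of the appended tail, no auxiliary allocation...
   boost::movelib::adaptive_sort(sorted.begin() + old_size, sorted.end(), cmp);
   //...then stable in-place merge of the two sorted subranges, again without allocating
   boost::movelib::adaptive_merge(sorted.begin(), sorted.begin() + old_size, sorted.end(), cmp);
}

int main()
{
   std::vector<int> v;
   v.reserve(16);                      //enough capacity: no allocation below
   const int base[]  = {1, 3, 5, 7};
   const int extra[] = {6, 2, 4, 8};
   v.assign(base, base + 4);
   flat_insert_range(v, extra, extra + 4);
   for(std::size_t i = 0; i != v.size(); ++i)
      assert(v[i] == int(i) + 1);      //v is now {1,2,3,4,5,6,7,8}
   return 0;
}

When capacity() is already big enough, the whole range insertion runs with zero allocations, which is exactly the guarantee I'd like flat containers to be able to offer.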
//////////////////////////
//////////////////////////
      STABLE MERGING
//////////////////////////
//////////////////////////

- N: number of elements
- NK: number of distinct keys (0 means all keys different)
- InplaceMerge: std::inplace_merge
- Sqrt2AdaptMerge: adaptive_merge with an auxiliary buffer of 2*sqrt(N) elements
- SqrtAdaptMerge: adaptive_merge with an auxiliary buffer of sqrt(N) elements
- AdaptMerge: adaptive_merge with no external buffer
- Cmp: number of comparisons / N
- Cpy: number of move constructions/assignments / N
- (ratio): time of execution / time needed by InplaceMerge

----------------------------------------------
Numbers produced by the bench_merge.cpp test in Boost.Move:
----------------------------------------------

- - N: 100001, NK: 511 - -
InplaceMerge     Cmp: 0.9989   Cpy:  1.5000   633.65us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2429   Cpy:  4.4209   910.16us ( 1.44)
SqrtAdaptMerge   Cmp: 4.2708   Cpy:  5.6712     1.63ms ( 2.57)
AdaptMerge       Cmp: 5.6014   Cpy:  8.7573     3.11ms ( 4.91)

- - N: 100001, NK: 2047 - -
InplaceMerge     Cmp: 0.9998   Cpy:  1.5000   604.77us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2449   Cpy:  4.4483   934.83us ( 1.55)
SqrtAdaptMerge   Cmp: 2.0131   Cpy:  4.7609     1.12ms ( 1.85)
AdaptMerge       Cmp: 2.7608   Cpy: 11.1419     2.83ms ( 4.67)

- - N: 100001, NK: 8191 - -
InplaceMerge     Cmp: 0.9999   Cpy:  1.5000   653.51us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2443   Cpy:  4.3179   943.86us ( 1.44)
SqrtAdaptMerge   Cmp: 1.4661   Cpy:  4.4348   998.62us ( 1.53)
AdaptMerge       Cmp: 1.7692   Cpy: 10.6292     2.54ms ( 3.89)

- - N: 100001, NK: 32767 - -
InplaceMerge     Cmp: 1.0000   Cpy:  1.5000   772.06us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2417   Cpy:  4.4317     1.03ms ( 1.33)
SqrtAdaptMerge   Cmp: 1.3320   Cpy:  4.4758     1.06ms ( 1.38)
AdaptMerge       Cmp: 1.5295   Cpy: 10.1233     2.44ms ( 3.16)

- - N: 100001, NK: 0 - -
InplaceMerge     Cmp: 1.0000   Cpy:  1.5000   881.88us ( 1.00)
Sqrt2AdaptMerge  Cmp: 1.2387   Cpy:  4.4124     1.16ms ( 1.31)
SqrtAdaptMerge   Cmp: 1.2933   Cpy:  4.4412     1.20ms ( 1.36)
AdaptMerge       Cmp: 1.4110   Cpy:  7.5794     1.93ms ( 2.19)

In general, with no extra memory and enough distinct keys, the algorithm needs about 7.5*N data moves and 1.5*N comparisons. Data locality is quite high, so with cheap-moving types the overall time can be just two to three times slower than std::inplace_merge. With fewer distinct keys, the algorithm might need up to 5*N comparisons and 11*N moves. Numbers improve if an auxiliary buffer of sqrt(N) elements is available. The comparison and move counts are even better than those shown in experimental results from several papers, but I still find them quite high. The algorithm needs some initial steps (e.g. unique key extraction) that are not amortized during the merge (adaptive_sort amortizes these initial steps much better). I've derived the merge algorithm from the sorting algorithm, so maybe there is room for improvement (I've invested much more time in the sorting algorithm).

//////////////////////////
//////////////////////////
      STABLE SORTING
//////////////////////////
//////////////////////////

Numbers when sorting:

- N: number of elements
- NK: number of distinct keys (0 means all keys different)
- MergeSort: a half-copying mergesort (it only needs N/2 auxiliary memory) implemented in boost/move/algo/detail/merge_sort.hpp, based on the paper "Fast mergesort implementation based on half-copying merge algorithm" by Cezary Juszczak (http://kicia.ift.uni.wroc.pl/algorytmy/mergesortpaper.pdf). Usually faster than std implementations while needing less auxiliary memory.
- StableSort: std::stable_sort
- HeapSort: std::heap_sort
- QuartAdaptSort: adaptive_sort with N/4 auxiliary memory
- Sqrt2AdaptSort: adaptive_sort with 2*sqrt(N) auxiliary memory
- SqrtAdaptSort: adaptive_sort with sqrt(N) auxiliary memory
- SqrtHAdaptSort: adaptive_sort with sqrt(N)/2 auxiliary memory
- AdaptSort: adaptive_sort with no auxiliary memory
- Cmp: number of comparisons / N
- Cpy: number of move constructions/assignments / N
- (ratio): time of execution / time needed by MergeSort

----------------------------------------------
Numbers produced by the bench_sort.cpp test in Boost.Move:
----------------------------------------------

- - N: 100001, NK: 511 - -
MergeSort        Cmp: 16.379   Cpy: 16.707    6.49ms ( 1.00)
StableSort       Cmp: 20.594   Cpy: 22.978   10.85ms ( 1.67)
HeapSort         Cmp: 16.992   Cpy: 22.098   10.98ms ( 1.69)
QuartAdaptSort   Cmp: 17.659   Cpy: 26.314    8.34ms ( 1.29)
Sqrt2AdaptSort   Cmp: 17.829   Cpy: 35.340    9.90ms ( 1.53)
SqrtAdaptSort    Cmp: 27.805   Cpy: 56.059   19.48ms ( 3.00)
SqrtHAdaptSort   Cmp: 27.356   Cpy: 60.375   20.35ms ( 3.14)
AdaptSort        Cmp: 27.352   Cpy: 67.889   21.83ms ( 3.36)

- - N: 100001, NK: 8191 - -
MergeSort        Cmp: 16.394   Cpy: 16.725    7.47ms ( 1.00)
StableSort       Cmp: 20.578   Cpy: 22.978   11.92ms ( 1.60)
HeapSort         Cmp: 16.992   Cpy: 22.099   11.91ms ( 1.59)
QuartAdaptSort   Cmp: 17.676   Cpy: 26.362    9.26ms ( 1.24)
Sqrt2AdaptSort   Cmp: 17.847   Cpy: 35.747   10.93ms ( 1.46)
SqrtAdaptSort    Cmp: 18.977   Cpy: 41.641   12.19ms ( 1.63)
SqrtHAdaptSort   Cmp: 17.076   Cpy: 57.799   15.47ms ( 2.07)
AdaptSort        Cmp: 17.090   Cpy: 67.179   17.37ms ( 2.33)

- - N: 100001, NK: 32767 - -
MergeSort        Cmp: 16.395   Cpy: 16.727    7.66ms ( 1.00)
StableSort       Cmp: 20.581   Cpy: 22.986   12.16ms ( 1.59)
HeapSort         Cmp: 16.994   Cpy: 22.099   11.96ms ( 1.56)
QuartAdaptSort   Cmp: 17.674   Cpy: 26.319    9.47ms ( 1.24)
Sqrt2AdaptSort   Cmp: 17.833   Cpy: 35.792   11.16ms ( 1.46)
SqrtAdaptSort    Cmp: 18.967   Cpy: 41.604   12.42ms ( 1.62)
SqrtHAdaptSort   Cmp: 17.065   Cpy: 57.756   15.71ms ( 2.05)
AdaptSort        Cmp: 17.081   Cpy: 67.131   18.32ms ( 2.39)

- - N: 100001, NK: 0 - -
MergeSort        Cmp: 16.399   Cpy: 16.731    7.76ms ( 1.00)
StableSort       Cmp: 20.521   Cpy: 22.920   12.23ms ( 1.58)
HeapSort         Cmp: 16.991   Cpy: 22.094   11.85ms ( 1.53)
QuartAdaptSort   Cmp: 17.671   Cpy: 26.380    9.50ms ( 1.22)
Sqrt2AdaptSort   Cmp: 17.825   Cpy: 35.653   11.17ms ( 1.44)
SqrtAdaptSort    Cmp: 18.954   Cpy: 41.545   12.42ms ( 1.60)
SqrtHAdaptSort   Cmp: 17.058   Cpy: 57.694   15.66ms ( 2.02)
AdaptSort        Cmp: 17.072   Cpy: 67.072   17.57ms ( 2.26)

In general, with no extra memory and enough distinct keys, the algorithm needs a nearly optimal number of comparisons and significantly more (about 4x) data moves. Data locality is quite high, so with cheap-moving types the overall time can be just two to three times slower than MergeSort. With fewer distinct keys the algorithm performs worse, but it's still quite competitive. When extra memory (e.g. sqrt(N) elements) is available, the algorithm is quite fast with cheap-moving types.

////////////////////////////////////////////

Final words:

The effort to implement the algorithms was big and I couldn't fix many open bugs in my libraries, so in the following months I don't expect to invest much time in this. At least I wanted to publicly share the committed implementation so that it can be improved by sorting experts. I don't claim these algorithms are groundbreaking or even useful for many programmers, but they might be a good starting point. I'm not happy with the implementation, as the algorithm is complicated and I haven't found a way to simplify it further without noticeable slowdowns. For those interested in the implementation and the algorithm: please contribute correct benchmarks and testing, as there are very few open source implementations of similar algorithms.
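As a possible starting point for such contributions, here is a minimal, self-contained sketch (NOT the actual bench_sort.cpp code): 'elem' mirrors the key/stability-tag struct described above and 'counting_less' reproduces the Cmp/N figures; both names are invented for this e-mail. Only the plain adaptive_sort(first, last, comp) overload is used; counting moves would additionally need a counting move constructor, and the extra-buffer overloads used by the *AdaptSort rows should be looked up in the develop headers (presumably boost/move/algo/adaptive_sort.hpp).

// Illustrative benchmark sketch, not the bench_sort.cpp shipped with Boost.Move
#include <boost/move/algo/adaptive_sort.hpp>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct elem
{
   int key;    //sorting key, NK distinct values in the benchmarks above
   int order;  //original position, used to verify stability
};

static std::size_t g_comparisons = 0;

struct counting_less
{
   bool operator()(const elem &l, const elem &r) const
   {  ++g_comparisons; return l.key < r.key;  }
};

int main()
{
   const std::size_t N = 100001, NK = 511;
   std::vector<elem> v(N);
   for (std::size_t i = 0; i != N; ++i) {
      v[i].key   = int(std::rand() % NK);
      v[i].order = int(i);
   }

   boost::movelib::adaptive_sort(v.begin(), v.end(), counting_less());

   //Verify ordering and stability, then report comparisons per element
   for (std::size_t i = 1; i != N; ++i) {
      if ( v[i-1].key >  v[i].key ||
          (v[i-1].key == v[i].key && v[i-1].order > v[i].order)) {
         std::printf("order/stability violation at %u\n", unsigned(i));
         return 1;
      }
   }
   std::printf("Cmp/N: %.3f\n", double(g_comparisons) / double(N));
   return 0;
}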
Any optimization, patch or bug report is welcome.

Best,

Ion