Adding my sorting algorithm, ska_sort to boost

Hi, a couple of years ago I wrote a generalized version of radix sort, called ska_sort. I gave a talk about it at CppNow, available here: https://www.youtube.com/watch?v=zqs87a_7zxw I have a new version of that algorithm that is both faster and more general. I can now sort almost anything. For example when I gave the talk I couldn't sort a std::vector<std::list<int>> because std::list doesn't have a random access interface, but now that works. (and it's faster than sorting the same thing with std::sort) I've also made it much easier to extend the algorithm with your own types. It was always easy if your type isn't too complicated, but now even weird types like std::variant are pretty easy to add. In my old version I had to go deep into the guts to add support for std::variant, now it's easy enough to do that that I'll add it to the documentation. I'm thinking of releasing the new version in boost, mainly so that it will be used more widely and so that it can be kept working in the long term. Since this is a long weekend in the US, I will try to as much of the work as possible this weekend. I'm currently trying to figure out how the boost documentation works (I saw that there are quickbook files in Boost.Sort, but how do I run the build for it?) and I've already made good progress on cleaning up my code and making it use the boost::sort namespace. Before I go any further I wanted to ask though: Does boost want my sorting algorithm? This would essentially deprecate spreadsort since this one is faster and more general and easier to use. To convince you that this is worth adding, I ran the benchmark that is included in the Boost.Sort source code. On random numbers here is how long all the algorithms took: std::sort: 9.04 pdqsort: 4.65 std::stable_sort: 11.57 spinsort: 5.42 flat_stable_sort: 7.26 spreadsort: 4.41 ska_sort: 2.00 There are a lot more bencmarks in the Boost.Sort code, and in general the results are this: If the data is already sorted or mostly sorted, other algorithms are faster, but if there are no easy patterns, ska_sort is always fastest. (I was going to include the entire table in this email, but the formatting breaks. I can attach all the results separately if anyone is interested) The Boost.Sort code also has benchmarks for sorting "objects" and "strings" and the results are similar there: On random data ska_sort is always fastest. I think this new version would be great in boost. It's not every day that a new sorting algorithm comes along that's literally twice as fast as what we had before. If we want this in boost, what would the next steps be? From the "submission process" document it sounds like there will be several stages of "Refinement" for the code. (or should this go through the Boost Library Incubator?) After I figure out how the documentation works I can upload a first version of this to github. (I want to get the documentation working so that we can talk about the interface) Let me know what you think, Malte -- Malte Skarupke malteskarupke@fastmail.fm

That would be a very useful addition indeed. -- Janek Kozicki, PhD. DSc. Arch. Assoc. Prof. Gdańsk University of Technology Faculty of Applied Physics and Mathematics Department of Theoretical Physics and Quantum Information -- pg.edu.pl/jkozicki (click English flag on top right) On 13 Oct 2019, 13:14 +0200, Malte Skarupke via Boost <boost@lists.boost.org>, wrote:
Hi, a couple of years ago I wrote a generalized version of radix sort, called ska_sort. I gave a talk about it at CppNow, available here: https://www.youtube.com/watch?v=zqs87a_7zxw
I have a new version of that algorithm that is both faster and more general. I can now sort almost anything. For example when I gave the talk I couldn't sort a std::vector<std::list<int>> because std::list doesn't have a random access interface, but now that works. (and it's faster than sorting the same thing with std::sort)
I've also made it much easier to extend the algorithm with your own types. It was always easy if your type isn't too complicated, but now even weird types like std::variant are pretty easy to add. In my old version I had to go deep into the guts to add support for std::variant, now it's easy enough to do that that I'll add it to the documentation.
I'm thinking of releasing the new version in boost, mainly so that it will be used more widely and so that it can be kept working in the long term.
Since this is a long weekend in the US, I will try to as much of the work as possible this weekend. I'm currently trying to figure out how the boost documentation works (I saw that there are quickbook files in Boost.Sort, but how do I run the build for it?) and I've already made good progress on cleaning up my code and making it use the boost::sort namespace.
Before I go any further I wanted to ask though:
Does boost want my sorting algorithm? This would essentially deprecate spreadsort since this one is faster and more general and easier to use. To convince you that this is worth adding, I ran the benchmark that is included in the Boost.Sort source code. On random numbers here is how long all the algorithms took: std::sort: 9.04 pdqsort: 4.65 std::stable_sort: 11.57 spinsort: 5.42 flat_stable_sort: 7.26 spreadsort: 4.41 ska_sort: 2.00
There are a lot more bencmarks in the Boost.Sort code, and in general the results are this: If the data is already sorted or mostly sorted, other algorithms are faster, but if there are no easy patterns, ska_sort is always fastest. (I was going to include the entire table in this email, but the formatting breaks. I can attach all the results separately if anyone is interested)
The Boost.Sort code also has benchmarks for sorting "objects" and "strings" and the results are similar there: On random data ska_sort is always fastest.
I think this new version would be great in boost. It's not every day that a new sorting algorithm comes along that's literally twice as fast as what we had before. If we want this in boost, what would the next steps be? From the "submission process" document it sounds like there will be several stages of "Refinement" for the code. (or should this go through the Boost Library Incubator?) After I figure out how the documentation works I can upload a first version of this to github. (I want to get the documentation working so that we can talk about the interface)
Let me know what you think,
Malte
-- Malte Skarupke malteskarupke@fastmail.fm
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Hi Malte, I see there hasn't been much response to your post yet. Malte Skarupke wrote:
a couple of years ago I wrote a generalized version of radix sort, called ska_sort. I gave a talk about it at CppNow, available here: https://www.youtube.com/watch?v=zqs87a_7zxw
I have a new version of that algorithm that is both faster and more general.
Before I go any further I wanted to ask though:
Does boost want my sorting algorithm? This would essentially deprecate spreadsort since this one is faster and more general and easier to use.
I think this new version would be great in boost. It's not every day that a new sorting algorithm comes along that's literally twice as fast as what we had before. If we want this in boost, what would the next steps be?
The obvious "next step" would be to see if it can be added to the existing Boost.Sort library. Doing that bypasses the public review process. But the author of that library may have other ideas. Personally, often what I need to sort is small or has some complex comparison function, making radix sort inappropriate. If I did have a large dataset, where the performance benefit would be useful, my first thought would be to find a parallel sort (std:: parallel execution policy is starting to appear in a few places now.) Despite that, I think that what you've got is worthwhile. Others may be interested in your old blog posts: https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/ Cheers, Phil.

On Oct 13, 2019, at 03:06, Malte Skarupke via Boost <boost@lists.boost.org> wrote:
Hi, a couple of years ago I wrote a generalized version of radix sort, called ska_sort. I gave a talk about it at CppNow, available here: https://www.youtube.com/watch?v=zqs87a_7zxw
I have a new version of that algorithm that is both faster and more general. I can now sort almost anything. For example when I gave the talk I couldn't sort a std::vector<std::list<int>> because std::list doesn't have a random access interface, but now that works. (and it's faster than sorting the same thing with std::sort)
This seems like a product of a large investment of effort. The big advantage over any previously available radix_sort is the ability to sort almost any datatype if a key maps to an integer somehow. Plus, an advantage in measured practical speed, and apparently no claims about Big O improvement. The average performance is O(n*b) where b is unknown and/or dependent on n ? Is the theoretical worst-case O(n^2) ? Why? Is it because of working in-place? (yes, I understand you fall back to std::sort if you hit this case) Is this software/algorithm fully documented? - I mean by another way than reading the source code. Maybe too many questions at once, but it was not clear after watching your talk at cppcon and the blog posts. Cheers, Kostas <https://www.youtube.com/watch?v=Nz1KZXbghj8>

Hi Malte, I am Francisco Tapia, one of the co-authors of the Sort library. We will examine your algorithm. But we must examine the internal parts of the algorithm, and evaluate the correctness of the algorithm, how is implemented the worst case and a couple of details. After, we check the speed with the benchmarks used in the library, with several sets of data. Since the near sorted to the fully unsorted, and from the single numbers to structs with heavy comparison functions. Now, I am overburdened by work, and I can't revise until December. But we will do and provide to you a reasoned response, and according to it, decide the inclusion or not in the library. As commented by Phil Endecott, the most interesting now is the parallel algorithms. Today, we don't have any parallel algorithm based on radix sort or any other hybrid algorithm. Because with small sets of data the differences are insignificant. But when you have big sets of data, use parallel algorithms, and then, the new single thread algorithms are in the middle, in the no man's land. If you have any doubt or question, contact me, and I will try to answer you. El mar., 15 oct. 2019 a las 23:06, Kostas Savvidis via Boost (< boost@lists.boost.org>) escribió:
On Oct 13, 2019, at 03:06, Malte Skarupke via Boost < boost@lists.boost.org> wrote:
Hi, a couple of years ago I wrote a generalized version of radix sort, called ska_sort. I gave a talk about it at CppNow, available here: https://www.youtube.com/watch?v=zqs87a_7zxw
I have a new version of that algorithm that is both faster and more general. I can now sort almost anything. For example when I gave the talk I couldn't sort a std::vector<std::list<int>> because std::list doesn't have a random access interface, but now that works. (and it's faster than sorting the same thing with std::sort)
This seems like a product of a large investment of effort. The big advantage over any previously available radix_sort is the ability to sort almost any datatype if a key maps to an integer somehow. Plus, an advantage in measured practical speed, and apparently no claims about Big O improvement.
The average performance is O(n*b) where b is unknown and/or dependent on n ?
Is the theoretical worst-case O(n^2) ? Why? Is it because of working in-place? (yes, I understand you fall back to std::sort if you hit this case)
Is this software/algorithm fully documented? - I mean by another way than reading the source code.
Maybe too many questions at once, but it was not clear after watching your talk at cppcon and the blog posts.
Cheers, Kostas
<https://www.youtube.com/watch?v=Nz1KZXbghj8>
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
participants (5)
-
Francisco José Tapia
-
Janek Kozicki
-
Kostas Savvidis
-
Malte Skarupke
-
Phil Endecott