[review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19
The formal review of the Sort library by Steven Ross starts today, November 10, and is scheduled to continue through November 19.

About the Sort library
======================

The Sort library implements hybrid sorting algorithms based on Spreadsort ( http://en.wikipedia.org/wiki/Spreadsort ), of which the author of the library, Steven Ross, is the inventor. The algorithm provides a sort that is faster than the O(n*log(n)) bound on comparison-based sorting. The library provides a generic implementation of high-speed sorting algorithms that outperform those in the C++ standard library in both average and worst-case performance. These algorithms work only on random-access iterators. They are hybrids using both radix and comparison-based sorting, specialized to common data types such as integers, floats, and strings. The algorithms are written generically and accept functors, enabling them to sort any object that can be processed like these basic data types.

Where to get it
===============

The library is available on GitHub at https://github.com/spreadsort/sort. The library is in modular Boost format and can be cloned to libs/sort under your local modular Boost directory. As the review manager, I have provided online documentation at: http://eldiener.github.io/sort

Review guidelines
=================

Reviews should be submitted to the developer list (boost@lists.boost.org), preferably with '[sort]' in the subject. If you don't wish to post to the list for some reason, or are not subscribed to it, you can send your review privately to me at 'eldiener at tropicsoft dot com'. If so, please let me know whether or not you'd like your review to be forwarded to the list.

For your review you may wish to consider the following questions:

- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any problems?
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
- Are you knowledgeable about the problem domain?

And finally, every review should attempt to answer this question:

- Do you think the library should be accepted as a Boost library?

Be sure to say this explicitly so that your other comments don't obscure your overall opinion. Even if you do not wish to give a full review, any technical comment regarding the library is welcome as part of the review period and will help me, as the review manager, decide whether the library should be accepted as a Boost library. Any questions about the use of the library are also welcome. I encourage all programmers with an interest in or knowledge of sorting algorithms to be part of the review or discussion of the Sort library as your time permits.
On 11/10/2014 08:37 AM, Edward Diener wrote:
The formal review of the Sort library by Steven Ross starts today, November 10 and is scheduled to continue through November 19th.
About the Sort library ==================
The Sort library is a library which implements hybrid sorting algorithms based on a Spreadsort ( http://en.wikipedia.org/wiki/Spreadsort ), of which the author of the library, Steven Ross, is the inventor.
Apologies if I'm overly critical, but we're asked to review an implementation of an algorithm from Steven's paper from 2002, and as of today, neither Google Scholar nor Citeseer can find any references to that paper. So, it's not an established algorithm by any means, Boost is generally not an algorithms review organization, and so I'm not sure it's good for Boost to review the proposed implementation at all.
- What is your evaluation of the potential usefulness of the library?
I am not sure the library is useful. Sorting 100 KB arrays of integers is a performance bottleneck for very few users. And if it truly is a performance bottleneck, one would probably ponder cache architecture, or use of a GPU, or a zillion other tricks. Like this paper from 2010 appears to do, for example: http://dl.acm.org/citation.cfm?doid=1807167.1807207
And finally, every review should attempt to answer this question:
- Do you think the library should be accepted as a Boost library?
I think the library should not be accepted. There's no reason to give Boost approval to an implementation of an algorithm that is neither an accepted classic nor an obvious breakthrough.

--
Vladimir Prus
CodeSourcery / Mentor Embedded
http://vladimirprus.com
On 10/11/2014 08:35, Vladimir Prus wrote:
I am not sure the library is useful. For very few users, sorting 100kb arrays of integers is a performance bottleneck. And if it's truly a performance bottleneck, one would probably ponder cache architecture, or use of GPU, or a zillion other tricks. Like this paper from 2010 appears to do, for example:
I have personally seen cases where sorting speed is a bottleneck. However, there are implementations that are still considerably faster than this library (I have one that is at least 3.5x faster than std::sort), and that fare better with small sizes. Nevertheless, it would be interesting to have a radix sorting library implemented in generic C++.
On 11/10/2014 10:54 AM, Mathias Gaunard wrote:
On 10/11/2014 08:35, Vladimir Prus wrote:
I am not sure the library is useful. For very few users, sorting 100kb arrays of integers is a performance bottleneck. And if it's truly a performance bottleneck, one would probably ponder cache architecture, or use of GPU, or a zillion other tricks. Like this paper from 2010 appears to do, for example:
I have personally seen cases where sorting speed is a bottleneck.
However, there are implementations that are still quite faster than this library (I have one that is at least 3.5 faster than std::sort), and that fares better with small sizes.
Nevertheless, it would be interesting to have a radix sorting library implemented in generic C++.
I suppose a Boost implementation of radix sort would be fine, as that's a fairly classic algorithm. What concerns me about this proposed library is that it is not exactly classic and not quite bleeding edge either.

--
Vladimir Prus
CodeSourcery / Mentor Embedded
http://vladimirprus.com
I have personally seen cases where sorting speed is a bottleneck.
However, there are implementations that are still quite faster than this library (I have one that is at least 3.5 faster than std::sort), and that fares better with small sizes.
I'd appreciate it if you could point me to specific implementations and/or data, so I can run a direct comparison. Did you test their relative speeds?

This library provides a significantly larger speedup on easy problems, such as evenly distributed random data, or where the data range is not much larger than the number of elements being sorted. This library is optimized and speed-tested for the worst case, so it provides a consistent speedup over comparison sorting on the vast majority of distributions, and provides a better worst-case guarantee than both conventional hybrid-radix and pure comparison-based approaches. Many hybrid-radix approaches do poorly with data distributions like that in https://github.com/spreadsort/sort/blob/master/example/alrbreaker.cpp , often running significantly slower (5X runtime) than pure comparison-based sorts. When integer_sort or float_sort run into a problem like that, they automatically revert to std::sort and end up having comparable speed to pure comparison-based sorting.

To restate the case: for any specific data distribution, it is possible to construct a faster sorting implementation than this one. The goal of this library is to provide most of the speedup of specialized radix-based or hybrid-radix algorithms, without sacrificing significant speed on problems where such approaches do poorly, and without having to write specialized code each time.
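[Editorial note: the hybrid strategy described above (MSD radix passes that revert to comparison sorting on small chunks) can be sketched in a few lines. This is an illustrative reimplementation of the general idea, not the library's actual code; the function name, bucket count, and fallback threshold are all arbitrary choices for the sketch.]

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative MSD byte-radix sort that falls back to std::sort on small
// buckets, mimicking the hybrid strategy described above (not library code).
void msd_hybrid_sort(std::vector<std::uint32_t>& v, std::size_t lo,
                     std::size_t hi, int shift = 24) {
    const std::size_t kFallback = 64;  // arbitrary threshold for illustration
    if (hi - lo <= kFallback || shift < 0) {
        std::sort(v.begin() + lo, v.begin() + hi);  // comparison fallback
        return;
    }
    // Count bucket sizes for the current byte.
    std::size_t count[256] = {};
    for (std::size_t i = lo; i < hi; ++i)
        ++count[(v[i] >> shift) & 0xFF];
    // Compute absolute bucket start offsets.
    std::size_t start[257];
    start[0] = lo;
    for (int b = 0; b < 256; ++b) start[b + 1] = start[b] + count[b];
    // Stable scatter via a copy of this range (simple, not in-place).
    std::vector<std::uint32_t> tmp(v.begin() + lo, v.begin() + hi);
    std::size_t next[256];
    std::copy(start, start + 256, next);
    for (std::uint32_t x : tmp) v[next[(x >> shift) & 0xFF]++] = x;
    // Recurse into each bucket on the next byte.
    for (int b = 0; b < 256; ++b)
        if (count[b] > 1) msd_hybrid_sort(v, start[b], start[b + 1], shift - 8);
}
```

The point of the fallback is visible in the first branch: on adversarial distributions that leave every bucket tiny, the function degrades into plain std::sort rather than paying radix overhead per tiny bucket.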
On 10 November 2014 23:53, Steven Ross
I have personally seen cases where sorting speed is a bottleneck.
However, there are implementations that are still quite faster than this library (I have one that is at least 3.5 faster than std::sort), and that fares better with small sizes.
I'd appreciate if you could point me to specific implementations and/or data, so I can run a direct comparison.
You could always compare against my implementation of LSD radix sort: https://github.com/jeremy-murphy/algorithm/tree/speed_test It's very 'by the book'. I haven't really looked at yours or my code for a while but I just posted my raw speed test results for the purpose of this discussion -- the formatting didn't work out too well but I hope you'll be able to follow it. Hope it is of some use. Cheers. Jeremy
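[Editorial note: for readers following the comparison, a "by the book" LSD radix sort looks roughly like the sketch below. This is a generic textbook version, not Jeremy's actual implementation; note the full-size buffer, which is where the extra memory copy discussed later in the thread comes from.]

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Textbook LSD radix sort on unsigned 32-bit keys: one stable counting
// pass per byte, least significant byte first, ping-ponging between the
// input and a same-size buffer (hence the 2X memory use).
void lsd_radix_sort(std::vector<std::uint32_t>& v) {
    std::vector<std::uint32_t> buf(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[257] = {};
        for (std::uint32_t x : v) ++count[((x >> shift) & 0xFF) + 1];
        for (int b = 0; b < 256; ++b) count[b + 1] += count[b];  // prefix sums
        for (std::uint32_t x : v) buf[count[(x >> shift) & 0xFF]++] = x;
        v.swap(buf);  // result of this pass becomes input to the next
    }
}
```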
Jeremy, On Mon Nov 10 2014 at 9:02:57 AM Jeremy Murphy <jeremy.william.murphy@gmail.com> wrote:
You could always compare against my implementation of LSD radix sort:
https://github.com/jeremy-murphy/algorithm/tree/speed_test
It's very 'by the book'.
I haven't really looked at yours or my code for a while but I just posted my raw speed test results for the purpose of this discussion -- the formatting didn't work out too well but I hope you'll be able to follow it. Hope it is of some use.
That's a nice and fast LSD radix sort (faster on some distributions than my integer_sort, slower on others). I instantiated it like this in my sample.cpp:

    else if (lsd) {
        vector B(array);
        stable_radix_sort(B.begin(), B.end(), array.begin());
    }
Some notes: when I feed std::sort and integer_sort 800MB of integers generated by the randomgen example program, they use 800MB of RAM. When I feed that same list of integers into your stable_radix_sort, it uses 3GB! Before that I tried 2GB of integers, and my 8GB Ubuntu system started swapping and hung. I understand using 2X the RAM because it's keeping a copy, but I don't understand why an LSD radix sort should need any more than 2X the RAM. My proposed library would probably run faster if I used a full copy of the data too, and copied the data directly instead of using a complex swap loop (I could also make it stable at some cost in speed). The downside is the increased RAM usage.

As another note, it appears that your function only supports unsigned sorting. You may want to support signed sorting.

As stated in the documentation, I think an LSD radix sort is a reasonable addition to the library, though I'm not sure it's a necessity.
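[Editorial note: on the signed-sorting point, the standard trick in radix sorting is to flip the sign bit so that two's-complement signed order matches unsigned order. The helper below is a sketch of that trick with a made-up name; whether and where Jeremy's code would hook it in is an assumption.]

```cpp
#include <cstdint>

// Map a two's-complement signed key to an unsigned key with the same
// ordering: flipping the sign bit moves all negative values below all
// non-negative values in unsigned comparison (and in radix order).
inline std::uint32_t signed_to_radix_key(std::int32_t x) {
    return static_cast<std::uint32_t>(x) ^ 0x80000000u;
}
```

An unsigned radix sort applied to keys transformed this way (and inverted afterwards, by the same XOR) yields correct signed order.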
On 10/11/2014 13:53, Steven Ross wrote:
I have personally seen cases where sorting speed is a bottleneck.
However, there are implementations that are still quite faster than this library (I have one that is at least 3.5 faster than std::sort), and that fares better with small sizes.
I'd appreciate if you could point me to specific implementations and/or data, so I can run a direct comparison.
The implementation I mentioned is based on these papers:

http://www.dia.eui.upm.es/asignatu/pro_par/articulos/AASort.pdf
http://www.cse.uconn.edu/~zshi/course/cse5302/ref/chhugani08sorting.pdf
http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/xiaochen_rocki_IA3_SC13.pdf

I did not test on random distributions other than uniform.
Thank you for pointing out these interesting papers, Mathias. On Tue Nov 11 2014 at 3:31:04 PM Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
I have personally seen cases where sorting speed is a bottleneck.
However, there are implementations that are still quite faster than this library (I have one that is at least 3.5 faster than std::sort), and that fares better with small sizes.

On 10/11/2014 13:53, Steven Ross wrote:
I'd appreciate if you could point me to specific implementations and/or data, so I can run a direct comparison.
The implementation I have that I mentioned is based on these papers:
http://www.dia.eui.upm.es/asignatu/pro_par/articulos/AASort.pdf http://www.cse.uconn.edu/~zshi/course/cse5302/ref/chhugani08sorting.pdf http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/xiaochen_rocki_IA3_SC13.pdf
I did not test on other random distributions than uniform.
These are interesting applications of SIMD instructions with impressive results; they might be difficult to port, but they look like something people should seriously consider integrating into std::sort on systems that support the required instructions.

On 10M evenly distributed random floats spread over all 32 bits available in the data type, I see a 243% speedup for float_sort vs std::sort. For smaller data ranges the speedup is greater.

Will these approaches work well on larger data types, such as 64 or 128-bit integers, doubles, and strings? It appears that they might be harder to generalize. For the algorithm that was 3.5X as fast on evenly distributed random data, what is its worst-case performance? Some of those algorithms appeared to be O(N*N) in the worst case. Does it sort in place, or require a copy of the memory?

If you have an open-source portable version of one of these approaches, and it's as fast as advertised, it might make sense to add it to Boost.
- What is your evaluation of the potential usefulness of the library?
I am not sure the library is useful. For very few users, sorting 100kb arrays of integers is a performance bottleneck. And if it's truly a performance bottleneck, one would probably ponder cache architecture, or use of GPU, or a zillion other tricks. Like this paper from 2010 appears to do, for example:
With regards to practicality: I applied the string_sort algorithm to bzip and saw a 40% reduction in total compression time. Would you like to see the source for that? There are problems where people sort billions of strings, integers, floats, or things that can be sorted like them, such as processing integrated circuit designs, where time is money.

One quote from the article you cite: "Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases". Merge sort is fundamentally less efficient than a smart hybrid-radix approach when CPU-sorting long keys, because it has to compare the entire length of the key, up to the point of difference, multiple times. The string_sort in this library passes through each character in the key only once (or twice, depending on how you count) during the radix-sorting portion, and then only compares what hasn't already been passed through in the final comparison-sorting stage, once the problem has been cut down to less than a specific constant size.

Do you happen to have a copy of the full article? I'm happy to apply any helpful concept to improve the library.
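[Editorial note: the one-pass-per-character property can be seen in a minimal MSD string radix sort. The sketch below is a simplified illustration of the general technique, not the library's string_sort; each character position is bucketed exactly once per string, and small buckets fall back to std::sort. The function name and threshold are invented for the sketch.]

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Minimal MSD string radix sort. Bucket 0 holds strings that end at the
// current depth (all equal within a recursion, so never re-sorted);
// buckets 1..256 hold each possible byte at that depth. Each character
// is examined once during the radix phase; std::sort handles tiny buckets.
void msd_string_sort(std::vector<std::string>& v, std::size_t lo,
                     std::size_t hi, std::size_t depth = 0) {
    const std::size_t kFallback = 16;  // arbitrary illustrative threshold
    if (hi - lo <= kFallback) {
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    std::vector<std::vector<std::string>> buckets(257);
    for (std::size_t i = lo; i < hi; ++i) {
        std::size_t b = depth < v[i].size()
            ? 1 + static_cast<unsigned char>(v[i][depth]) : 0;
        buckets[b].push_back(std::move(v[i]));
    }
    std::size_t pos = lo;
    for (std::size_t b = 0; b < 257; ++b) {
        std::size_t start = pos;
        for (auto& s : buckets[b]) v[pos++] = std::move(s);
        if (b > 0 && pos - start > 1)
            msd_string_sort(v, start, pos, depth + 1);  // next character
    }
}
```

Unlike merge sort, the recursion never revisits characters before `depth`: the shared prefix has already been consumed by the time a bucket is subsorted.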
And finally, every review should attempt to answer this question:
- Do you think the library should be accepted as a Boost library?
I think the library should not be accepted. There's no reason to give Boost approval to implementation of algorithm that is neither accepted classic nor an obvious breakthrough.
With regards to classic algorithms, this library provides an implementation similar to Adaptive Left Radix for integer_sort, which came after it but was more widely available, with the addition of better worst-case performance. It also provides an adaptation of American Flag Sort (a classic algorithm) to C++ strings and with better worst-case performance guarantees. Did you read the overview section of the documentation that came with the library?
Steven, On 11/10/2014 02:46 PM, Steven Ross wrote:
- What is your evaluation of the potential usefulness of the library?
I am not sure the library is useful. For very few users, sorting 100kb arrays of integers is a performance bottleneck. And if it's truly a performance bottleneck, one would probably ponder cache architecture, or use of GPU, or a zillion other tricks. Like this paper from 2010 appears to do, for example:
With regards to practicality: I applied the string_sort algorithm to bzip and saw a 40% reduction in total compression time. Would you like to see the source for that?
It's interesting, and would probably make a good addition to the documentation. Though bzip appears to be phased out these days, replaced by xz, which does not use the Burrows–Wheeler transform.
There are problems where people sort billions of strings, integers, floats, or things that can be sorted like them, such as processing integrated circuit designs, where time is money.
Integrated circuit design appears to meet the "few users" criterion. I doubt that companies in this business (such as my employer) would benefit much from a Boost library.
One quote from that article you cite: "Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases" Merge sort is fundamentally less efficient than a smart hybrid-radix approach when CPU sorting long keys because it has to compare the entire length of the key up to the point of difference multiple times, where the string_sort in this library only passes through each character in the key once (or twice depending on how you count) during the radix sorting portion, and then only compares what hasn't already been passed through in the final comparison-sorting stage once the problem has been cut down to less than a specific constant size.
The problem is that Boost does not seem to me to be a forum for discussing algorithm implementation. You propose a library that purports to do efficient sorting, based on a 2002 paper. Were there no papers in the subsequent 12 years with better algorithms?
Do you happen to have a copy of the full article? I'm happy to apply any helpful concept to improve the library.
The "full text" link at the top of that page works for me.
And finally, every review should attempt to answer this question:
- Do you think the library should be accepted as a Boost library?
I think the library should not be accepted. There's no reason to give Boost approval to implementation of algorithm that is neither accepted classic nor an obvious breakthrough.
With regards to classic algorithms, this library provides an implementation similar to Adaptive Left Radix for integer_sort, which came after it but was more widely available, with the addition of better worst-case performance. It also provides an adaptation of American Flag Sort (a classic algorithm) to C++ strings and with better worst-case performance guarantees.
I'm worried about "similar" and "adaptation" above. If I see that function whatever_sort implements an algorithm from paper N, with 200 citations, then I can assume enough people reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect.
Did you read the overview section of the documentation that came with the library?
Yes, but I'm not sure it addresses my concerns. -- Vladimir Prus CodeSourcery / Mentor Embedded http://vladimirprus.com
Vladimir,
It's interesting, and probably would make a good addition to documentation. Though bzip appears to be phased out these days, replaced by xz, which does not use Burrows–Wheeler transformation.
Will do (in the development branch).
One quote from that article you cite: "Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases" Merge sort is fundamentally less efficient than a smart hybrid-radix approach when CPU sorting long keys because it has to compare the entire length of the key up to the point of difference multiple times, where the string_sort in this library only passes through each character in the key once (or twice depending on how you count) during the radix sorting portion, and then only compares what hasn't already been passed through in the final comparison-sorting stage once the problem has been cut down to less than a specific constant size.
The problem is that it does not seem to me that Boost is a forum to discuss algorithm implementation. You propose a library that purports to do efficient sorting, based on 2002 paper. Were there no papers in subsequent 12 years with better algorithms?
I did an algorithm search in 2008 when preparing the boost submission. ALR came out, fairly similar to my paper. I have not seen much attention paid in academia to hybrid radix sorting.
Do you happen to have a copy of the full article? I'm happy to apply any helpful concept to improve the library.
The "full text" link at the top of that page works for me.
I hit a paywall at home, but accessed it at work. This is a comparison of parallel LSD radix sort vs Mergesort. The conclusion is that parallel LSD radix sort is slower with long keys. This is not surprising. MSD hybrid radix sort (string_sort) does much better than Mergesort or LSD radix sort with long keys. This library does not get into parallel implementation issues; it can be used as a subsort inside a parallel Mergesort, or you can split up the problem by MSD to parallelize it and use this library to subsort.
With regards to classic algorithms, this library provides an implementation similar to Adaptive Left Radix for integer_sort, which came after it but was more widely available, with the addition of better worst-case performance. It also provides an adaptation of American Flag Sort (a classic algorithm) to C++ strings and with better worst-case performance guarantees.
I'm worried about "similar" and "adaptation" above. If I see that function whatever_sort implements an algorithm from paper N, with 200 citations, then I can assume enough people reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect.
Since you've read the overview, how about just looking through the code and seeing if it works as described, or testing it? Can you ever be sure a concrete implementation correctly applies an abstract concept without looking at the implementation? These implementations are fast for the same reason ALR and AFS are. They have superior worst-case performance because they fall back to std::sort at a size that optimizes worst-case performance. The only other difference is that these algorithms scan to see if they can shrink the range each iteration, which improves performance for non-worst-case distributions with little impact on worst-case distributions.
On 11/10/2014 04:54 PM, Steven Ross wrote:
I did an algorithm search in 2008 when preparing the boost submission. ALR came out, fairly similar to my paper. I have not seen much attention paid in academia to hybrid radix sorting.
Okay, so suppose a third party were to implement a sorting algorithm for Boost. Why would they choose your paper over the 2008 paper?
I'm worried about "similar" and "adaptation" above. If I see that function whatever_sort implements an algorithm from paper N, with 200 citations, then I can assume enough people reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect. Since you've read the overview, how about just looking through the code and seeing if it works as described, or testing it? Can you ever be sure a concrete implementation correctly applies an abstract concept without looking at the implementation? These implementations are fast for the same reason ALR and AFS are. They have superior worst-case performance because they fall back to std::sort at a size that optimizes worst-case performance. The only other difference is that these algorithms scan to see if they can shrink the range each iteration, which improves performance for non-worst-case distributions with little impact on worst-case distributions.
Which "these algorithms"? ALR and AFS? If so, wouldn't that mean they are better?

--
Vladimir Prus
CodeSourcery / Mentor Embedded
http://vladimirprus.com
Vladimir,
On Mon, Nov 10, 2014 at 9:07 AM, Vladimir Prus
On 11/10/2014 04:54 PM, Steven Ross wrote:
I did an algorithm search in 2008 when preparing the boost submission. ALR came out, fairly similar to my paper. I have not seen much attention paid in academia to hybrid radix sorting.
Okay, so assuming a third party were to implement a sorting algorithm for Boost. Why they would choose your paper over the 2008 paper?
What 2008 paper? ALR came out later in 2002. The reason is better worst-case performance. Please see the alrbreaker example for a distribution that causes conventional MSD radix sorts to perform very poorly: overhead starts to dominate as they radix sort small chunks of data. This library avoids that problem by switching to std::sort once it has broken the problem into chunks small enough that overhead would otherwise dominate.
I'm worried about "similar" and "adaptation" above. If I see that function whatever_sort implements an algorithm from paper N, with 200 citations, then I can assume enough people reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect.
Since you've read the overview, how about just looking through the code and seeing if it works as described, or testing it? Can you ever be sure a concrete implementation correctly applies an abstract concept without looking at the implementation? These implementations are fast for the same reason ALR and AFS are. They have superior worst-case performance because they fall back to std::sort at a size that optimizes worst-case performance. The only other difference is that these algorithms scan to see if they can shrink the range each iteration, which improves performance for non-worst-case distributions with little impact on worst-case distributions.
Which "these algorithms"? ALR and AFS? If so, it would mean they are better?
These implementations (integer_sort, float_sort, string_sort) scan to see if they can shrink the range; integer_sort and float_sort find the maximum and minimum using a fast scan across the chunk of data being sorted, and MSD radix sort between the actual max and min. This reduces the size of the problem in many common cases while adding only minimal overhead in the worst case. string_sort does something similar, checking to see if all characters are identical, and if so, it skips on to the next character, only sorting once it encounters an actual difference. This is useful in cases such as hierarchical strings, where some chunks of the key are all the same. Again, the overhead of this scan is minimal.
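[Editorial note: for integer keys, the range-shrinking scan described above amounts to finding the true min and max and radix-sorting only the bits in which they differ. The sketch below illustrates that idea with an invented helper name; it is not the library's code.]

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Count how many low bits actually vary across the data, by scanning for
// the real min and max. Radix passes then only need to cover this many
// bits, shrinking the problem on common distributions at the cost of one
// cheap linear scan. Assumes v is non-empty.
int significant_bits(const std::vector<std::uint32_t>& v) {
    auto [mn, mx] = std::minmax_element(v.begin(), v.end());
    std::uint32_t diff = *mn ^ *mx;  // bits where min and max disagree
    int bits = 0;
    while (diff) { diff >>= 1; ++bits; }  // index of highest differing bit
    return bits;
}
```

For example, a million values all in [0, 255] need only one byte-sized radix pass, regardless of the key type's full width.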
On 11/10/2014 05:33 PM, Steven Ross wrote:
Which "these algorithms"? ALR and AFS? If so, it would mean they are better? These implementations (integer_sort, float_sort, string_sort) scan to see if they can shrink the range; integer_sort and float_sort find the maximum and minimum using a fast scan across the chunk of data being sorted, and MSD radix sort between the actual max and min. This reduces the size of the problem in many common cases while adding only minimal overhead in the worst case. string_sort does something similar, checking to see if all characters are identical, and if so, it skips on to the next character, only sorting once it encounters an actual difference. This is useful in cases such as hierarchical strings, where some chunks of the key are all the same. Again, the overhead of this scan is minimal.
This all sounds reasonable, but I'm not sure it's in scope for a Boost review. The primary offering of your library is that it's faster than std::sort for particular types of keys. There's no significant interface to discuss, or documentation, or anything; it's mostly about a better algorithm. So the question is, who is qualified to review your algorithm as an algorithm - you know, formal proof of correctness, formal complexity estimates, tests and their statistical validity, and the quality relative to other existing algorithms? I would not think it's a job for Boost. If you disagree, and the review manager takes your side as well, would it be reasonable for you to provide an updated paper, starting with a survey of existing algorithms (the original paper has none, merely saying that "Knuth was wrong about sorting" and further referring to Knuth's book published in 1997), and have that paper reviewed here?

--
Vladimir Prus
CodeSourcery / Mentor Embedded
http://vladimirprus.com
Vladimir Prus wrote:
If I see that function whatever_sort implements an algorithm from paper N, with 200 citations, then I can assume enough people reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect.
...
So the question, is who is is qualified to review your algorithm as an algorithm - you know, formal proof of correctness, formal complexity estimates, tests and statistical validity of them, and the quality relative to other existing algorithms. I would not think it's a job for Boost.
Demanding that the library be backed by a paper containing formal proof of correctness, formal complexity estimates, and 200 citations strikes me as grossly unfair. We've never done this to any library under consideration. Had we done so, none of them would've passed.

What we're interested in is:

- is the library useful?
- is the library high quality?
- is the author going to stick around and support it?

We do not, generally, require formal proofs for any of the above, no matter how much innovation the library contains. Raising the criteria for acceptance to absurd levels for innovative libraries treats innovation as a pollutant.

We determine whether the library works by testing it, and we determine the library's performance by timing it. We're empiricists. Of course, if Steven wants to provide formal proofs of correctness and complexity requirements, that would be very nice of him; I just don't think it's fair to demand it.
Peter Dimov-2 wrote
We determine whether the library works by testing it, and we determine the library's performance by timing it. We're empiricists.
Of course, if Steven wants to provide formal proofs of correctness and complexity requirements, that would be very nice of him; I just don't think it's fair to demand it.
+1. In our context, testing trumps proof. This is because all such "proofs" count operations of some sort, and operations (e.g. compare, relink, dereference, etc.) vary between algorithms. Also, I agree that we should stick to our stated standards.

Robert Ramey

PS - I was sort of disappointed that the Postman's sort wasn't referred to. It was published many years ago and has been cited numerous times. It has several times won a contest held by Microsoft Research for the world's fastest sort. To prove it, I have two medals hanging on my door knob (each with the name spelled wrong - in different ways).
Robert,
PS - I'm was sort of disappointed that the Postman's sort wasn't referred to. It was published many years ago and has been cited numerous times. It has several times won a contest held by Microsoft Research for the world's fastest sort. To prove it, I have two medals hanging on my door knob (each with the name spelled wrong - in different ways).
Postman's sort is one of the earlier efficient top-down MSD radix sorting implementations. It wasn't one of the academic papers I read, but it probably influenced the paper "Engineering Radix Sort" which was the primary reference for string_sort. I'd be happy to refer to it if that would be helpful.
On 11/10/2014 07:15 PM, Peter Dimov wrote:
Vladimir Prus wrote:
If I see that function whatever_sort implements an algorithm from paper N, with 200 citations, then I can assume enough people reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect.
...
So the question, is who is is qualified to review your algorithm as an algorithm - you know, formal proof of correctness, formal complexity estimates, tests and statistical validity of them, and the quality relative to other existing algorithms. I would not think it's a job for Boost.
Demanding that the library be backed by a paper containing formal proof of correctness, formal complexity estimates, and 200 citations, strikes me as grossly unfair.
We've never done this to any library under consideration. Had we done so, none of them would've passed.
I don't think we ever had a library whose primary value was algorithmic improvement, where the algorithm was also developed by the library author. For many libraries, algorithms and complexities are implementation details. For this one, it's the reason for the library to exist, and the presented performance charts only demonstrate a constant-factor difference with std::sort on an old compiler and an old CPU. It seems quite fair to ask that the main selling point of a library is well justified. -- Vladimir Prus CodeSourcery / Mentor Embedded http://vladimirprus.com
On Mon, Nov 10, 2014 at 11:15 AM, Peter Dimov
Vladimir Prus wrote:
If I see that function whatever_sort implements an algorithm from paper N, with 200 citations, then I can assume enough people reviewed the algorithm. If that function implements an adaptation, then I'm not really sure what complexity I can expect.
...
So the question is, who is qualified to review your algorithm as an algorithm - you know, formal proof of correctness, formal complexity estimates, tests and statistical validity of them, and the quality relative to other existing algorithms. I would not think it's a job for Boost.
Demanding that the library be backed by a paper containing formal proof of correctness, formal complexity estimates, and 200 citations, strikes me as grossly unfair.
We've never done this to any library under consideration. Had we done so, none of them would've passed.
What we're interested in is:
- is the library useful?
- is the library high quality?
- is the author going to stick around and support it?
We do not, generally, require formal proofs for any of the above, regardless of how much innovation the library contains.
Raising the criteria for acceptance to absurd levels for innovative libraries treats innovation as a pollutant.
We determine whether the library works by testing it, and we determine the library's performance by timing it. We're empiricists.
+1 for all of the above. --Beman
On 10 Nov 2014 at 10:35, Vladimir Prus wrote:
Apologies if I'm overly critical, but we're asked to review an implementation of an algorithm from Steven's paper from 2002, and as of today, neither Google Scholar nor Citeseer can find any references to that paper. So it's not an established algorithm by any means; Boost is generally not an algorithms review organization, and so I'm not sure it's good for Boost to review the proposed implementation at all.
I want to see scaling benchmarks comparing this implementation with the traditional STL containers for 0 to 10,000 items, for both single threaded and concurrent usage (even if that is simply a spinlock around the library). If these scaling benchmarks are not provided, I automatically vote no. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
I want to see scaling benchmarks comparing this implementation with the traditional STL containers for 0 to 10,000 items, for both single threaded and concurrent usage (even if that is simply a spinlock around the library).
If these scaling benchmarks are not provided, I automatically vote no.
Niall, I'm not sure precisely what you're asking for, and would like clarification: Are you asking me to sort 0, 1, 2, 4, ... 2^14 elements, input in this form:
vector<int>
vector<float>
vector<string>
And compare the speeds of std::sort vs integer_sort, float_sort, and string_sort, both sequentially, and doing multiple separate sorts in parallel?

What distribution do you want this to sort? Evenly distributed random, the variety of distributions (including evenly distributed random) used in tune.pl, or something else?

I'd like to note that if the input is <3000 elements, this library immediately falls back to std::sort, as that is the approximate crossover point at which hybrid radix sorting becomes faster. Differences for arrays smaller than that are likely to be due to the overhead of the size check plus the increase in executable size, which is likely to be difficult to separate out from noise.
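For concreteness, the scaling benchmark under discussion might be sketched as below. This is a sketch only: the names `time_int_sort`, `std_sorter`, and `run_scaling` are mine, and std::sort stands in for the library's integer_sort so the harness is self-contained; swapping in integer_sort at the marked point would produce the actual comparison.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Time one sort of n evenly distributed random ints with the given sorter.
double time_int_sort(std::size_t n, void (*sorter)(std::vector<int>&)) {
    std::mt19937 gen(42);                      // fixed seed for reproducible runs
    std::uniform_int_distribution<int> dist;   // evenly distributed random ints
    std::vector<int> data(n);
    for (int& x : data) x = dist(gen);
    auto start = std::chrono::steady_clock::now();
    sorter(data);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

void std_sorter(std::vector<int>& v) { std::sort(v.begin(), v.end()); }
// void spread_sorter(std::vector<int>& v) { integer_sort(v.begin(), v.end()); }

// Doubling sizes 1, 2, 4, ... 2^14; one "n <tab> seconds" line per size.
void run_scaling() {
    for (std::size_t n = 1; n <= (std::size_t(1) << 14); n *= 2)
        std::printf("%zu\t%.9f\n", n, time_int_sort(n, std_sorter));
}
```

Running the same loop once per sorter, and once per thread count, would yield the single-threaded and parallel scaling curves being requested.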
On 10 Nov 2014 at 8:18, Steven Ross wrote:
I want to see scaling benchmarks comparing this implementation with the traditional STL containers for 0 to 10,000 items, for both single threaded and concurrent usage (even if that is simply a spinlock around the library).
If these scaling benchmarks are not provided, I automatically vote no.
Niall, I'm not sure precisely what you're asking for, and would like clarification:
Are you asking me to sort 0, 1, 2, 4, ... 2^14 elements, input in this form: vector<int> vector<float> vector<string>
I'd be happy with int and string only personally. Float is nice though.
And compare the speeds of std::sort vs integer_sort, float_sort, and string_sort, both sequentially, and doing multiple separate sorts in parallel?
Sounds good. You'd get a graph of CPU cycle scaling to N looking like:
http://www.nedprod.com/programs/portable/nedtries/images/LogLogBitwiseTreesScaling.png
http://www.nedprod.com/programs/portable/nedtries/images/LogLogRedBlackTreesScaling.png
http://www.nedprod.com/programs/portable/nedtries/images/LogLogHashTableScaling.png
For your situation you just sort, so a combined graph showing a comparison to other algorithms makes much more sense.

Regarding parallel sort, you don't necessarily need to sort the same dataset. Rather, many parallel sorts show how well an algorithm coexists with other work on the same CPU. As an example of what I mean, a single thread doing red-black tree modification is fast, but two threads doing red-black tree modification see an exponential slowdown if the cache lines touch. That makes std::map extremely unsuited to multiple-thread use, even when protected by a spinlock.

Finally, and most important, these sorts of scaling benchmarks demonstrate proof you've thought about this sort of stuff. That stamps a great big fat mark of quality on your library. Anyone can claim anything about some bit of code of theirs. Not everyone can empirically demonstrate its truth in a way others can replicate easily for themselves.
What distribution do you want this to sort? Evenly distributed random, the variety of distributions (including evenly distributed random) used in tune.pl, or something else?
I think I remember seeing your algorithm makes good use of structure? If so, totally random, partially sorted and nearly sorted are all good distributions.
I'd like to note that if the input is <3000 elements, this library immediately falls back to std::sort, as that is the approximate crossover point at which hybrid radix sorting becomes faster. Differences for arrays smaller than that are likely to be due to the overhead of the size check + the increase in executable size, which is likely to be difficult to separate out from noise.
Now that is a very vital piece of new information. I'll be honest in saying that I am unsure how useful a > 3000 element algorithm can be. Maybe if you implement a no-alloc core loop like Antony suggested you can significantly improve on this? > 300 would be lovely.

I also have to admit a curiosity. My non-allocating bitwise tries library nedtries could be viewed as a way of very quickly nearly sorting items, even though I wrote it with the intention of it being an indexing algorithm rather than a sorting algorithm. An interesting side effect of the non-allocating in-place algorithm is that it is highly amenable to being multithreaded, i.e. you can fire N threads at it, and get ~N speedup if your keys are very random. Reading back that index would give you an almost sorted array, one upon which bubble sort would surely perform excellently. As you are proposing a full-on Boost.Sort library rather than just a single sort algorithm, I thought I should raise this perhaps even better performing sort algorithm which can make excellent use of multiple cores, something which is always hard to do for sort algorithms. I sadly have no need for such algorithms in my present work, and therefore can't allocate any time of my own to this sort of stuff. A shame. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
Niall,
On Mon Nov 10 2014 at 1:05:28 PM Niall Douglas
Sounds good. You'd get a graph of CPU cycle scaling to N looking like:
http://www.nedprod.com/programs/portable/nedtries/images/LogLogBitwiseTreesScaling.png
http://www.nedprod.com/programs/portable/nedtries/images/LogLogRedBlackTreesScaling.png
http://www.nedprod.com/programs/portable/nedtries/images/LogLogHashTableScaling.png
For your situation you just sort, so a combined graph showing a comparison to other algorithms makes much more sense.
I've created an integer_sort graph here (click the graph tab at the bottom): https://docs.google.com/spreadsheets/d/1fUvIJQPaAbsUv54RGgNd-hWmJY996qvocaw-... If that's what you're looking for, I can create the equivalent graph for string_sort too. The crossover point of about 1000 elements is quite visible, and both std::sort and integer_sort have some minor parallelization inefficiency, but I see no surprises, and expect none for string_sort.
On 11/13/2014 08:50 AM, Steven Ross wrote:
Niall,
On Mon Nov 10 2014 at 1:05:28 PM Niall Douglas
wrote: Sounds good. You'd get a graph of CPU cycle scaling to N looking like:
http://www.nedprod.com/programs/portable/nedtries/images/LogLogBitwiseTreesScaling.png
http://www.nedprod.com/programs/portable/nedtries/images/LogLogRedBlackTreesScaling.png
http://www.nedprod.com/programs/portable/nedtries/images/LogLogHashTableScaling.png
For your situation you just sort, so a combined graph showing a comparison to other algorithms makes much more sense.
I've created an integer_sort graph here (click the graph tab at the bottom): https://docs.google.com/spreadsheets/d/1fUvIJQPaAbsUv54RGgNd-hWmJY996qvocaw-...
If that's what you're looking for, I can create the equivalent graph for string_sort too. The crossover point of about 1000 elements is quite visible, and both std::sort and integer_sort have some minor parallelization inefficiency, but I see no surprises, and expect none for string_sort.
Steven, what input data distribution is this? Could you run the tests with worst-case data distribution for your algorithm? -- Vladimir Prus CodeSourcery / Mentor Embedded http://vladimirprus.com
Vladimir,
On Thu Nov 13 2014 at 2:07:53 AM Vladimir Prus
If that's what you're looking for, I can create the equivalent graph for string_sort too. The crossover point of about 1000 elements is quite visible, and both std::sort and integer_sort have some minor parallelization inefficiency, but I see no surprises, and expect none for string_sort.
Steven,
what input data distribution is this? Could you run the tests with worst-case data distribution for your algorithm?
The input data distribution was evenly distributed 32-bit random data. The absolute worst-case distribution I've been able to design for spreadsort is binaryalrbreaker with an ALR_THRESHOLD of 3. With 32-bit integers, it has 512 elements, and takes 15 microseconds for integer_sort, and 14 microseconds for std::sort. With 64-bit integers, it has 131072 elements, and takes 2.8ms for integer_sort, and 1.6ms for std::sort. If you add or subtract any elements, it won't be the worst-case distribution! If I simply double the number of elements, 64-bit integer_sort takes 5.35ms, and std::sort takes 4.56ms. I'm not sure what kind of graph would make sense with this type of situation. I'd also like to note that this distribution is specifically designed to make integer_sort slow, and cannot be scaled up without improving the relative speed of integer_sort.
On 11/15/2014 06:43 AM, Steven Ross wrote:
Vladimir,
On Thu Nov 13 2014 at 2:07:53 AM Vladimir Prus
wrote: If that's what you're looking for, I can create the equivalent graph for string_sort too. The crossover point of about 1000 elements is quite visible, and both std::sort and integer_sort have some minor parallelization inefficiency, but I see no surprises, and expect none for string_sort.
Steven,
what input data distribution is this? Could you run the tests with worst-case data distribution for your algorithm?
The input data distribution was evenly distributed 32-bit random data. The absolute worst-case distribution I've been able to design for spreadsort is binaryalrbreaker with an ALR_THRESHOLD of 3. With 32-bit integers, it has 512 elements, and takes 15 microseconds for integer_sort, and 14 microseconds for std::sort. With 64-bit integers, it has 131072 elements, and takes 2.8ms for integer_sort, and 1.6ms for std::sort. If you add or subtract any elements, it won't be the worst-case distribution! If I simply double the number of elements, 64-bit integer_sort takes 5.35ms, and std::sort takes 4.56ms. I'm not sure what kind of graph would make sense with this type of situation. I'd also like to note that this distribution is specifically designed to make integer_sort slow, and cannot be scaled up without improving the relative speed of integer_sort.
You've lost me here. Surely, for any N there is input that results in worst-case performance, and it should be possible to generate such input? -- Vladimir Prus CodeSourcery / Mentor Embedded http://vladimirprus.com
The input data distribution was evenly distributed 32-bit random data. The absolute worst-case distribution I've been able to design for spreadsort is binaryalrbreaker with an ALR_THRESHOLD of 3. With 32-bit integers, it has 512 elements, and takes 15 microseconds for integer_sort, and 14 microseconds for std::sort. With 64-bit integers, it has 131072 elements, and takes 2.8ms for integer_sort, and 1.6ms for std::sort. If you add or subtract any elements, it won't be the worst-case distribution! If I simply double the number of elements, 64-bit integer_sort takes 5.35ms, and std::sort takes 4.56ms. I'm not sure what kind of graph would make sense with this type of situation. I'd also like to note that this distribution is specifically designed to make integer_sort slow, and cannot be scaled up without improving the relative speed of integer_sort.
You've lost me here. Surely, for any N there is input that results in worst-case performance, and it should be possible to generate such input?
The worst-case input that I've been able to devise for spreadsort consists of data structured so that:
1) It uses the full width of the key, forcing radix sorting to use as many iterations as possible, maximizing overhead and the number of swaps.
2) Each radix iteration only cuts the problem in half, and involves swapping all the elements. Swaps are the slowest part of this algorithm, so maximizing the number of swaps is important.
Using restrictions 1 and 2, I get a minimum data size of 512 for 32-bit data and 131072 for 64-bit data (my binaryalrbreaker example does this if you set ALR_THRESHOLD to 3). The only way to increase size above that is to have more elements at the very bottom of the distribution, which amortizes the radix costs over more elements and rapidly makes the algorithm relatively more efficient compared to std::sort. Below that, it's just not the worst-case input.
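The "cuts the problem in half each pass" constraint can be illustrated with a toy generator. This is NOT the library's actual binaryalrbreaker; it is a simplified adversary for a hypothetical bit-at-a-time MSD radix sort, where n = 2^d keys covering all d top-bit prefixes force the maximum d = log2(n) passes, each splitting every group exactly in half.

```cpp
#include <cstdint>
#include <vector>

// Toy adversary (illustrative only): n = 2^d keys whose top d bits take every
// possible value, so a binary MSD radix sort splits each group exactly in
// half at every one of its d levels, maximizing passes for this n.
std::vector<std::uint32_t> make_binary_adversary(unsigned d) {
    std::vector<std::uint32_t> keys(std::size_t(1) << d);
    for (std::uint32_t i = 0; i < keys.size(); ++i)
        keys[i] = i << (32u - d);     // place the distinguishing bits at the top
    return keys;
}
```

With d = 9 this yields 512 elements, matching the 32-bit minimum size quoted above, though the real binaryalrbreaker also exercises the full key width.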
On Fri Nov 14 2014 at 10:43:48 PM Steven Ross
On Thu Nov 13 2014 at 2:07:53 AM Vladimir Prus
wrote:
what input data distribution is this? Could you run the tests with
worst-case data distribution for your algorithm?
The input data distribution was evenly distributed 32-bit random data. The absolute worst-case distribution I've been able to design for spreadsort is binaryalrbreaker with an ALR_THRESHOLD of 3. With 32-bit integers, it has 512 elements, and takes 15 microseconds for integer_sort, and 14 microseconds for std::sort. With 64-bit integers, it has 131072 elements, and takes 2.8ms for integer_sort, and 1.6ms for std::sort. If you add or subtract any elements, it won't be the worst-case distribution! If I simply double the number of elements, 64-bit integer_sort takes 5.35ms, and std::sort takes 4.56ms. I'm not sure what kind of graph would make sense with this type of situation. I'd also like to note that this distribution is specifically designed to make integer_sort slow, and cannot be scaled up without improving the relative speed of integer_sort.
In analyzing this data to see why it was slower with integer_sort than std::sort, I realized that it was creating the input data in reverse-sorted order. This ends up being a case where std::sort is substantially faster than usual (as branch prediction becomes very accurate), making it not a worst-case comparison. I changed the code to swap 10% of the inputs with random other inputs to break up the reverse-sorted order, and both integer_sort and std::sort slowed down to comparable speeds. I also regenerated the data (the least significant bits are randomized) and reran the sort 10 times in a loop (on new data each time) to reduce noise. The net result is:

std::sort elapsed time 0.049344
integer_sort elapsed time 0.046429
(results in seconds)

integer_sort is 6% faster on 64-bit integers, on my Ubuntu system, for integer_sort's worst-case distribution. The code is checked into the develop branch in example/binaryalrbreaker.cpp. From the
On 13 Nov 2014 at 5:50, Steven Ross wrote:
For your situation you just sort, so a combined graph showing a comparison to other algorithms makes much more sense.
I've created an integer_sort graph here (click the graph tab at the bottom): https://docs.google.com/spreadsheets/d/1fUvIJQPaAbsUv54RGgNd-hWmJY996qvocaw-...
If that's what you're looking for, I can create the equivalent graph for string_sort too. The crossover point of about 1000 elements is quite visible, and both std::sort and integer_sort have some minor parallelization inefficiency, but I see no surprises, and expect none for string_sort.
That is indeed a very useful graph which definitely should be near the first page of your docs. At a glance one can tell whether one should be interested in your algorithm or not. Three items:

First, I'd put everything in CPU cycles instead of seconds. This removes to some extent the effect of fast CPUs, or indeed of Intel CPUs, and lets the reader more quickly decide if this algorithm is worth using. I would assume you'll be mostly memory bound anyway, but I do see a small amount of logistic curve from LL cache effects in there. I would definitely state in your documentation the *exact* spec of the machine used for testing, including its RAM speed and bandwidth. I may be able to give you some benchmarks for your library on a quad core ARMv7 Tegra K1 board if you email me privately when you have a final benchmark package ready. You may find your algorithm has *very* different performance on an ARM :)

The second thing is that I would separate 4-threaded from single-threaded graphs, as too much information confuses. You could use the opportunity to put string sort on the same graph as int sort for single threaded.

And thirdly, I'd use more than 10,000 elements, as I understand that's where your algorithm really pays off. You can then say "at > 100,000 elements my algorithm is X.Y times the performance of std::sort".

I think, for me at least Steven, my only remaining problem is the library name. Boost.Sort implies a library with at least three useful sorting algorithms not in the STL. Either you add those, or rename the library. I'd much prefer the former.

Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
Niall,
On Thu Nov 13 2014 at 6:44:50 AM Niall Douglas
On 13 Nov 2014 at 5:50, Steven Ross wrote:
For your situation you just sort, so a combined graph showing a comparative to other algorithms makes much more sense.
I've created an integer_sort graph here (click the graph tab at the bottom): https://docs.google.com/spreadsheets/d/1fUvIJQPaAbsUv54RGgNd-hWmJY996qvocaw-ylnUDG0/edit?usp=sharing
That is indeed a very useful graph which definitely should be near the first page of your docs. At a glance one can tell whether one should be interested in your algorithm or not.
Three items: I'd put everything in CPU cycles instead of seconds. This removes to some extent the effect of fast CPUs, or indeed of Intel CPUs, and lets the reader more quickly decide if this algorithm is worth using. I would assume you'll be mostly memory bound anyway,
I'd like to clarify one point: for all these graphs, I changed the min_sort_size from 3000 to 2 in include/boost/sort/detail/constants.hpp, so that the algorithm won't just fall back to std::sort when the list is too small to sort efficiently using Spreadsort. If I had used a min_sort_size of 1000 (or 3000), the left side of the graph would just be flat, as it would fall back to std::sort.

I'm not sure how you count CPU cycles. CLOCKS_PER_SEC on my system is simply 1 million. I can assume some particular speed for them, but as cache misses probably dominate, the CPU speed is probably less important than bus latency and cache sizes. I think stating time in CPU seconds while also stating the CPU enables relative comparisons.

I've checked in the code in the develop branch in these files:
example/parallelint.cpp
example/parallelstring.cpp
And added graphs as you suggested (except for cycle counts) in the "Graphs" tab of this document: https://docs.google.com/spreadsheets/d/1fUvIJQPaAbsUv54RGgNd-hWmJY996qvocaw-...

I built them like this:
b2 --tune linkflags="-lboost_system -lboost_thread" parallelstring
and ran them like this:
./randomgen 16 16 1024
./parallelstring 1 100000 -std;./parallelstring 1 100000;./parallelstring 4 100000 -std;./parallelstring 4 100000

The two 16s mean use a 16-bit range on both the top and the bottom of the distribution (in other words, evenly distributed random 32-bit values). The 1024 means to write 1024 32-bit values to disk (4096 bytes). The 100000 means to loop 100000 times, to reduce the relative impact of thread overhead. The integer sorting is over 32-bit ints. The string sorting uses the same random data, but reads it in as strings. This has the interesting effect of creating random-length random strings with a mean length of 43 bytes. The only reasonable way to put them both on the same graph is to show the number of bytes sorted, as opposed to the number of array elements.
I was quite surprised to see that while the crossover point for integer_sort is around 1000 elements (give or take), the crossover point for string_sort is around 30 elements. I can create a separate, lower min_sort_size for string_sort, though I think a better value for the threshold would be around (1 << (8*sizeof(unsigned_char_type))), which is what the algorithm uses internally, and works out to 256 for conventional char strings, as creating more bins (1 per possible value) than there are items to sort rapidly becomes inefficient. Does anyone have a preference on this subject?
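The threshold rule being proposed here can be sketched as follows. The function name `use_comparison_fallback` is mine, not the library's; the sketch just encodes the stated idea that string_sort creates up to (1 << (8*sizeof(unsigned char))) = 256 bins, so sorting fewer items than bins should fall back to comparison sorting.

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch of the proposed string_sort fallback threshold: fall back to
// comparison sorting whenever there are fewer items than possible bins
// (256 for conventional char strings -- one bin per character value).
template <class RandomAccessIter>
bool use_comparison_fallback(RandomAccessIter first, RandomAccessIter last) {
    const std::size_t bin_count = std::size_t(1) << (8 * sizeof(unsigned char));
    return static_cast<std::size_t>(last - first) < bin_count;
}
```

For a wider radix character type the same expression would scale the threshold automatically, which is one argument for tying it to sizeof rather than a separate tuning constant.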
The second thing is that I would separate 4 threaded from single threaded graphs as too much information confuses. You could use the opportunity to put string sort on the same graph as int sort for single threaded.
done
And thirdly, I'd use more than 10,000 elements as I understand that's where your algorithm really pays off. You can then say "at > 100,000 elements my algorithm is X.Y times the performance of std::sort".
Actually, the algorithm does best at intervals of 2^(C*max_splits + log_mean_bin_size) elements, which works out to 2^13 and 2^24, and cases where it can use bucketsort because the range is limited. Since the data is 32-bit evenly distributed random in this case, it does best at about 2^13, and then a little better later at about 2^24 if I included it in the graph. In other words, the vast majority of the speedup this algorithm
provides for data that can't be bucketsorted is achieved by n=10,000.
I think, for me at least Steven, my only remaining problem is the library name. Boost.Sort implies a library with at least three useful sorting algorithms not in the STL. Either you add those, or rename the library. I'd much prefer the former.
What are people's preferences with regards to what algorithms to offer? Here are the main possibilities I'm willing to consider:
1) in-place Spreadsort, as already exists in the library.
2) copy-based (2x RAM) Spreadsort, which may be faster in general, and should be faster for mostly sorted data. This would also provide a stable sort option at a modest cost in speed.
3) LSD radix sort (2x RAM). It's better in some cases.
4) boost::thread or OpenMP based parallel implementations of any included algorithm.
My preference is just #1. Doing #4 right is probably quite a bit of work, and I'm not sure how generally useful it really would be (as there are many ways to run code in parallel).
The Sort library is a library which implements hybrid sorting algorithms based on a Spreadsort ( http://en.wikipedia.org/wiki/Spreadsort ), of which the author of the library, Steven Ross, is the inventor.
With a library name of Sort, I would think this would provide a variety of sorting algorithms. As it is now I believe this would be better named as the Spreadsort library, or as a Spreadsort feature of an existing library (such as boost Algorithm)
And finally, every review should attempt to answer this question:
- Do you think the library should be accepted as a Boost library?
No. This library has potential, but is too specific for what I would expect from a Sort library. I believe the only way to get something like this into Boost is to provide Spreadsort as one among many algorithms. This will also allow benchmarks and comparisons to be performed among the various algorithms directly, as well as provide a foundation and building blocks for more advanced usage. As an aside, I am very interested in Spreadsort! I would like to see it available, definitely. But certainly there must be situations where other algorithms are a better fit. -Adam D. Walling
Adam,
And finally, every review should attempt to answer this question:
- Do you think the library should be accepted as a Boost library?
No. This library has potential, but is too specific for what I would expect from a Sort library. I believe the only way to get something like this into Boost is to provide Spreadsort as one among many algorithms. This will also allow benchmarks and comparisons to be performed among the various algorithms directly, as well as provide a foundation and building blocks for more advanced usage.
As an aside, I am very interested in Spreadsort! I would like to see it available, definitely. But certainly there must be situations where other algorithms are a better fit.
I think LSD radix sort is a valid alternative because it's stable, and possibly a stable hybrid radix sort. Other possibilities are using sorting networks instead of insertionsort for really small lists, and an algorithm optimized for mostly sorted lists, but each of these has decreasing marginal benefit. The idea with this library is that you don't need to worry about the algorithm, it uses radix and/or comparison based sorting as best suits your problem. The Perl script tune.pl provides a basic benchmarking and parameter tuning setup. Do you have a specific use case you'd like to use another algorithm for?
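The LSD radix sort alternative mentioned here is a textbook algorithm, so a sketch is straightforward. This is a generic illustration, not code from the library: stable, byte-at-a-time, with a second buffer (the "2x RAM" cost).

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Stable LSD radix sort for 32-bit unsigned keys: four counting-sort passes,
// one per byte from least to most significant, ping-ponging between buffers.
void lsd_radix_sort(std::vector<std::uint32_t>& a) {
    std::vector<std::uint32_t> buf(a.size());          // the 2x RAM cost
    for (unsigned shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> count{};          // 256 bins, offset by one
        for (std::uint32_t x : a) ++count[((x >> shift) & 0xFF) + 1];
        for (std::size_t i = 1; i < count.size(); ++i)
            count[i] += count[i - 1];                  // prefix sums: bin offsets
        for (std::uint32_t x : a)                      // stable scatter pass
            buf[count[(x >> shift) & 0xFF]++] = x;
        a.swap(buf);
    }
}
```

Because each pass preserves the relative order of equal bytes, equal keys keep their input order, which is the stability property that comparison-based introsort lacks.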
2014-11-10 9:37 GMT+04:00 Edward Diener
The formal review of the Sort library by Steven Ross starts today, November 10 and is scheduled to continue through November 19th.
<...>
- What is your evaluation of the design?
API is good.
- What is your evaluation of the implementation?
Implementation is not tuned. For example, take a look here https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/spr... and here https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/spr... . Memory allocations occur at those points; however, they could easily be avoided. Here's a typical use case of the `size_bins` function: https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/str... . Using a stack-based buffer would significantly improve sort times on small datasets.
- What is your evaluation of the documentation?
OK
- What is your evaluation of the potential usefulness of the library?
Useful.
- Did you try to use the library? With what compiler? Did you have any problems?
Have not used it.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Quick reading.
- Are you knowledgeable about the problem domain?
Not a pro in radix based sorts.
And finally, every review should attempt to answer this question:
- Do you think the library should be accepted as a Boost library?
Not in its current state. And not as a separate library. This must be part of the Boost.Algorithm library. This library must be accepted only after some good optimization. As a result, I'd like to see this library outperforming std::sort even on small datasets (~150 elements). At the current point, std::sort and the sort from the subject library differ by a constant factor (see graph/windows_integer_sort.htm or graph/windows_string_sort.htm), not O(n*log(n)) vs O(n*log(k/s+s)). -- Best regards, Antony Polukhin
Antony,
Implementation is not tuned.
For example, take a look here https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/spr... and here https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/spr... . Memory allocations occur at those points, however they could be easily avoided. Here's a typical use case of `size_bins` function: https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/str....
Using a stack based buffer would significantly improve sort times on small datasets.
You have a good point that a stack-based buffer would eliminate memory allocation for the bin_sizes (thanks!). On large data sets, there are very few memory allocations, as it quickly hits the maximum size that can be used. My optimization has concentrated on minimizing memory overhead and optimizing speed on large data sets. Would you agree that a fixed stack-based array of bin_sizes, passed down into the recursive calls, would provide a good trade-off between speed and memory overhead?

It will still need memory allocation for the bin_cache, unless it computes the worst-case size based on the size of the input data type and the maximum number of bins per iteration. That could end up being a substantial chunk of RAM to put on the stack (10s of kilobytes!) for the non-functor versions of integer_sort and float_sort. It would be tricky to add for the functor versions. string_sort can iterate to an incredible depth in the worst case, so it just doesn't seem practical to put the bin_cache on the stack there. Do you think I should add the bin_cache to the stack where feasible for integer_sort and float_sort? It may add significant complication.
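The stack-based bin_sizes idea being discussed might look like the sketch below. The constants and function names are hypothetical (not the library's internals), and the recursive body is reduced to a std::sort placeholder; the point is that one fixed array is allocated in the top-level call and handed to every recursive step, so no recursion level touches the heap for its bin counts.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t max_splits = 11;                 // illustrative tuning value
constexpr std::size_t max_bins = std::size_t(1) << max_splits;

// Recursive step takes the caller's buffer instead of allocating its own.
// (Body is a placeholder; the point is the extra parameter.)
void spreadsort_rec(int* first, int* last, std::size_t* bin_sizes) {
    std::fill(bin_sizes, bin_sizes + max_bins, 0);     // reuse: clear, then count
    std::sort(first, last);                            // stand-in for the real pass
}

void spreadsort_entry(int* first, int* last) {
    std::size_t bin_sizes[max_bins];                   // ~16 KB on the stack, no heap
    spreadsort_rec(first, last, bin_sizes);
}
```

At 2048 entries of 8 bytes this is about 16 KB of stack per top-level call, which is the kind of trade-off Steven weighs above for the larger bin_cache.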
Not in its current state. And not as a separate library. This must be a part of Boost.Algorithm library.
That's how I originally planned to add it, but apparently that doesn't work so well with modular boost. I'm open to that possibility.
This library must be accepted only after some good optimization. As a result, I'd like to see this library outperforming std::sort even on small datasets (~150 elements). At the current point std::sort and sort from the subject library differ on some constant value (see graph/windows_integer_sort.htm or graph/windows_string_sort.htm), not O(n*log(n)) vs O(n*log(k/s+s)).
n * (k/s + s) is the worst case, rarely encountered for integer_sort on normal data. What ends up happening on most data is that it tends to run around log(n)/s expensive radix-based iterations, plus around s fast comparison-based iterations. So it works out to comparing log(n) vs C1*log(n)/s + s. This ends up looking like a constant, though it does get a bit faster as the "+s" term gets proportionately smaller with large n (and the asymptote in this regime is a constant). Eventually n gets close to k and the worst-case guarantee kicks in. See the left side of graph/osx_integer_sort.htm for examples of this faster bucket sorting. Note that in the speed vs size graphs each increment in the normal number of radix-based iterations causes a jump in relative performance: this can be seen in windows_integer_sort between 2^23 and 2^24, and between 2^13 and 2^14. string_sort shows similar changes in relative performance vs size as the number of radix-sorted bytes increases, though I should note that for string_sort, the worst case is much superior to comparison-based sorting, as each comparison may run over the entire length of the string, whereas radix-based sorting will only pass over each character once (or twice, if you include the scan for all characters being identical). I didn't use this worst case for comparison-based sorting when making the graphs, but I could if you'd like to see spectacular speedups.
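To make the pass-count comparison above concrete, here is a toy calculation. The cost constant C1 = 2 for radix passes and s = 9 are illustrative values picked for the sketch, not measurements from the library:

```cpp
#include <cassert>
#include <cmath>

// Toy model of the iteration-count argument: introsort does about log2(n)
// comparison passes, while Spreadsort does about log2(n)/s radix passes
// (each roughly C1 times as expensive as a comparison pass) plus about s
// fast comparison passes. C1 and s here are illustrative only.
double introsort_cost(double n) { return std::log2(n); }

double spreadsort_cost(double n, double s, double c1) {
    return c1 * std::log2(n) / s + s;
}
```

For n = 2^24, s = 9, C1 = 2 this gives roughly 14.3 model passes for Spreadsort against 24 for introsort, and the "+s" term shrinks proportionately as n grows, matching the behavior described above.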
I've added fasterbzip2 to the develop doc directory: https://github.com/spreadsort/sort/blob/develop/doc/fasterbzip2-1.0.5.zip, but haven't referenced it from any documentation. bzip2 has a bizarre, highly optimized sort, and stuffing the Spreadsort algorithm into it was non-trivial. Still, I see substantial speedup on data sets with high compression levels (and little speedup on random or near-random data that doesn't compress much).

Antony,
Implementation is not tuned.
For example, take a look here
https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/spr...
and here
https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/spr...
. Memory allocations occur at those points, however they could be easily avoided. Here's a typical use case of `size_bins` function:
https://github.com/spreadsort/sort/blob/master/include/boost/sort/detail/str... .
Using a stack based buffer would significantly improve sort times on small datasets.
The develop branch now has the change you suggested, and it sped up sort times (with the minimum threshold disabled) on 700 element arrays by 37%, making 700 elements the new crossover point. I'm going to test the impact on large arrays more, but so far I see no measurable impact. I could try using the tricks used in "Engineering Radix Sort" to logarithmically bound the bin_cache size, add additional stack when it is needed via co-recursion, and put the bin_cache on the stack (or I could just allocate the bin_cache on the stack every iteration). As int_min_split_count (min s) is tuned to 9, I doubt that I'll be able to get the crossover point below around 500 elements. Do you think bringing the crossover point down to 500 or so is worth significant additional code complexity or stack memory usage to avoid heap allocation completely?
Edward Diener wrote:
I have provided as the review manager online documentation at:
Hi Edward,

Please have a look at e.g. http://eldiener.github.io/sort/doc/string_sort.html and change template <foo> to template &lt;foo&gt;. Currently the template parameters are disappearing:

template inline void string_sort(RandomAccessIter first, RandomAccessIter last);

Regards,

Phil.
On 11/11/2014 10:25 AM, Phil Endecott wrote:
Edward Diener wrote:
I have provided as the review manager online documentation at:
Hi Edward,
Please have a look at e.g. http://eldiener.github.io/sort/doc/string_sort.html and change template <foo> to template &lt;foo&gt;. Currently the template parameters are disappearing:
template inline void string_sort(RandomAccessIter first, RandomAccessIter last);
Thanks for catching this. I have made the correction.
Edward Diener wrote:
The formal review of the Sort library by Steven Ross starts today, November 10 and is scheduled to continue through November 19th.
I studied this submission back in 2008 when Steven first presented
it; those exchanges can be found here:
http://thread.gmane.org/gmane.comp.lib.boost.devel/176781
There is much interesting discussion there and I suggest that the
review manager should have a good look at it, especially if the volume
of actual reviews now should be low.
I'm not going to attempt a full review at this time, but some
assorted comments follow:
Should Boost accept novel algorithms?
-------------------------------------
Vladimir Prus has suggested that Boost should not accept submissions
of new algorithms such as this one because we're not qualified to
review them. I disagree: personally, I feel more comfortable
reviewing a sorting algorithm than trying to understand things like
atomics or the dark corners of C++ TMP. There have been plenty of
other libraries that have been accepted despite containing novel
algorithms (e.g. the two geometry libraries).
Having said that, I would have been happier if the submission had
simply been a collection of sort algorithms copied from Knuth, or
if Steven's paper had hundreds of citations!
Is a "faster" serial sort algorithm useful?
-------------------------------------------
In 2008, Luke Simonson said:
anyone interested in runtime of an application that
spends a lot of time sorting large data volumes is
going to do it on an up-to-date machine, which means
4, 8, 16 or 32 cores
I think std::sort is a straw man to compare yourself
to, not because it isn't good at what it does, but
because the target use case in question of very large
data volumes is better suited to a parallel algorithm.
At that time I didn't entirely agree as I had some embedded applications
that could benefit from faster sorting of perhaps tens of thousands of
items. But in the intervening five years, even phones have gained
multiple CPUs. Today, if I needed faster sorting my first thought would
be to parallelise it.
Code Quality
------------
Back in 2008, the code had issues like calls to malloc() and free()
and a lack of exception safety. No doubt it has improved, but perhaps
any keen reviewers could look for any remaining issues.
Skipping common prefixes
------------------------
One undeniable complexity advantage of this algorithm compared to
std::sort<string> is that it doesn't need to re-compare common
string prefixes once it knows that they are equal. Intuitively,
one would expect a substantial speedup from this change if the
input contains many strings with common prefixes. But intuition
is not always correct. I did some experiments using the pathnames
of all the files on my computer as the test data; these certainly
have many common prefixes e.g. /usr/local/include/boost/. I
wrote a modified quicksort that would skip common prefixes (see
http://svn.chezphil.org/libpbe/trunk/include/string_qsort.hh )
and concluded that actually the intuition is wrong; comparing
bytes can be done 8-at-a-time on a 64-bit system, and the cost of
constantly re-comparing strings like /usr/local/include/boost/
must be small. So the intuition that Steven's algorithm would
be useful in applications where strings have long common prefixes
should not be followed without first measuring it.
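Phil's word-at-a-time point can be illustrated with memcmp, which libc implementations lower to word-wide compares on 64-bit targets. This helper is illustrative only; it is not the code at the truncated string_qsort link above:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>

// Three-way string comparison built on memcmp. Because memcmp compares
// machine words (8 bytes at a time on 64-bit systems), re-scanning a long
// shared prefix like "/usr/local/include/boost/" costs only a few word
// compares, which is the reason prefix-skipping buys less than intuition
// suggests.
int compare_with_word_steps(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    int c = std::memcmp(a.data(), b.data(), n);  // word-at-a-time internally
    if (c != 0) return c < 0 ? -1 : 1;
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;  // shorter string sorts first
}
```

A comparison sort built on this comparator re-touches common prefixes on every comparison, yet each re-touch is only a handful of word compares, consistent with the measurement described above.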
Does anyone sort a vector<int> ?
--------------------------------
I don't think I've ever needed to sort a vector<int>; there is normally
some associated data, e.g. I'm sorting a vector<struct> or vector
On 11/11/2014 11:28 AM, Phil Endecott wrote:
Edward Diener wrote:
The formal review of the Sort library by Steven Ross starts today, November 10 and is scheduled to continue through November 19th.
I studied this submission back in 2008 when Steven first presented it; those exchanges can be found here:
http://thread.gmane.org/gmane.comp.lib.boost.devel/176781
There is much interesting discussion there and I suggest that the review manager should have a good look at it, especially if the volume of actual reviews now should be low.
Thank you for the link to the above discussion.
snip...
Thanks for your feedback Phil.
Is a "faster" serial sort algorithm useful?
-------------------------------------------
In 2008, Luke Simonson said: anyone interested in runtime of an application that spends a lot of time sorting large data volumes is going to do it on an up-to-date machine, which means 4, 8, 16 or 32 cores I think std::sort is a straw man to compare yourself to, not because it isn't good at what it does, but because the target use case in question of very large data volumes is better suited to a parallel algorithm.
At that time I didn't entirely agree as I had some embedded applications that could benefit from faster sorting of perhaps tens of thousands of items. But in the intervening five years, even phones have gained multiple CPUs. Today, if I needed faster sorting my first thought would be to parallelise it.
Sure, you can gain speedup by parallelizing, but if you can use half as many cores to get the same speedup by using a faster subsort, then isn't that helpful? On phones, the biggest issue is usually battery usage, which is why many of them rarely use more than 1 CPU core at a time. On servers, frameworks like MapReduce or other non-shared-memory techniques are commonly used to distribute the problem in many smaller pieces across separate processes. This library provides a faster subsort that will work with every standard way to parallelize a CPU-based sorting problem except LSD radix sorting. Parallel sorting can be valuable, but because of all the potential applications, the variation in memory, runtime, and CPU usage constraints, and the level of shared-memory parallelization, there will be large variation across applications in which solution is best, complicating any library that attempts to meet these various needs. I'm not fundamentally opposed to adding library calls for the specific case of shared-memory parallel sorting, but it would add substantial complexity, and not cover all parallel-sorting needs.
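The "faster subsort drops into a parallel sort" point can be sketched as a minimal two-way split/subsort/merge. parallel_sort2 and the two-way split are my illustration, not this library's API; std::sort stands in for the subsort, and a faster serial sort would be passed in its place:

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Minimal shared-memory parallel sort: split the range in two, subsort each
// half (one on another thread), then merge. Any serial subsort with the
// usual (first, last) interface plugs into the subsort parameter.
template <class T, class SubSort>
void parallel_sort2(std::vector<T>& v, SubSort subsort) {
    auto mid = v.begin() + v.size() / 2;
    // Sort the first half on a separate thread while this thread sorts
    // the second half.
    auto fut = std::async(std::launch::async,
                          [&] { subsort(v.begin(), mid); });
    subsort(mid, v.end());
    fut.get();
    std::inplace_merge(v.begin(), mid, v.end());
}
```

The speedup of the whole thing scales with the speed of the serial subsort, which is the sense in which a faster subsort and parallelization are complementary rather than competing.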
Code Quality
------------
Back in 2008, the code had issues like calls to malloc() and free() and a lack of exception safety. No doubt it has improved, but perhaps any keen reviewers could look for any remaining issues.
There is now just a vector::resize of the bin_cache in the develop branch. I looked at eliminating that by using (more) stack memory, but it slowed down the library.
Skipping common prefixes
------------------------
One undeniable complexity advantage of this algorithm compared to std::sort<string> is that it doesn't need to re-compare common string prefixes once it knows that they are equal. Intuitively, one would expect a substantial speedup from this change if the input contains many strings with common prefixes. But intuition is not always correct. I did some experiments using the pathnames of all the files on my computer as the test data; these certainly have many common prefixes e.g. /usr/local/include/boost/. I wrote a modified quicksort that would skip common prefixes (see http://svn.chezphil.org/libpbe/trunk/include/string_qsort.hh ) and concluded that actually the intuition is wrong; comparing bytes can be done 8-at-a-time on a 64-bit system, and the cost of constantly re-comparing strings like /usr/local/include/boost/ must be small. So the intuition that Steven's algorithm would be useful in applications where strings have long common prefixes should not be followed without first measuring it.
Highly compressible data is one such application, and it works well there in bzip, but the length of the common prefixes was as much as hundreds of bytes. With more conventional data, you are correct, it isn't much of an advantage. I added that feature more to minimize the disadvantage created by otherwise doing a full radix sort on each byte when nothing is different, which is a common issue with radix sorts on strings (if there are better solutions, I'm open to them). The end result was consistently faster than comparison-based sorting when fed real data, including the hierarchical strings I tested with, but had no relative advantage on short hierarchies like the ones you describe vs normal strings.
Does anyone sort a vector<int> ?
--------------------------------
I don't think I've ever needed to sort a vector<int>; there is normally some associated data, e.g. I'm sorting a vector<struct> or vector
where one field of the struct is an int key. Or the key has multiple fields like an (x,y) pair, or there is some extra feature of the comparison such as being case-insensitive. The proposed API does provide a reasonable way to sort these more complex containers, i.e. by providing additional functors that (for string_sort) return the nth character of the key string. But what we don't know is whether its performance benefits relative to introsort also apply in these cases: as far as I've seen, the performance results are all for simple vector<int>-like cases. One could easily imagine that the time taken for copying would dominate, diluting the algorithmic benefits.
If you run the tune.pl script that is in the base directory of the library, you'll see the speedup on key+value sorts. They actually do fewer swaps than a comparison-based algorithm, because there are fewer iterations, so large chunks of data sort faster relative to std::sort with this approach than small chunks of data. Some excerpts from my latest tune.pl run (with no args) on Ubuntu:

Verifying integer_sort
runtime: 87.596183
std::sort time: 165.384112
speedup: 88.80%

Verifying float_sort
runtime: 85.212854
std::sort time: 182.711132
speedup: 114.42%

Verifying string_sort
runtime: 115.193192
std::sort time: 250.349607
speedup: 117.33%

Verifying integer_sort with separate key and data
runtime: 405.510518
std::sort time: 991.96294
speedup: 144.62%

Verifying string_sort with all functors
runtime: 139.453782
std::sort time: 311.68992
speedup: 123.51%

Verifying boost::sort on its custom-built worst-case distribution
runtime: 17.710653
std::sort time: 17.606037
speedup: -0.59%

I encourage all reviewers to run this script themselves to see how fast this library is on their system (the README describes how to do this, or you can just call ./tune.pl -small [-windows] from the libs/sort directory and wait a minute). A standard trick with a large data payload is to use a pointer to the data to minimize the amount of memory that must be swapped, so sorting with a really large data payload should be fairly unusual.
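The pointer trick mentioned above can be sketched as follows. The Record type, its sizes, and sort_by_key are made up for illustration; sorting indices works the same way:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Illustrative record with a small key and a large payload that we do not
// want to move around during sorting.
struct Record {
    int key;
    std::array<char, 256> payload;  // stays put; only pointers are swapped
};

// Sort pointers to the records instead of the records themselves: each
// swap moves one pointer (8 bytes) rather than a 260-byte record.
std::vector<const Record*> sort_by_key(const std::vector<Record>& records) {
    std::vector<const Record*> ptrs;
    ptrs.reserve(records.size());
    for (const Record& r : records) ptrs.push_back(&r);
    std::sort(ptrs.begin(), ptrs.end(),
              [](const Record* a, const Record* b) { return a->key < b->key; });
    return ptrs;
}
```

std::sort is used here only as a stand-in comparison sort; the point is that any sort, including the library's, benefits from swapping pointers instead of large payloads.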
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Edward Diener
Sent: 10 November 2014 05:38
To: boost@lists.boost.org
Cc: boost-users@lists.boost.org
Subject: [boost] [review] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19
The formal review of the Sort library by Steven Ross starts today, November 10 and is scheduled to continue through November 19th. For your review you may wish to consider the following questions:
- What is your evaluation of the design?
Well established history (though how much usage is less clear - but that is the point of adding it to Boost).
- What is your evaluation of the implementation?
Neatly templated, making it immediately useful to Boost C++ users (including Boost.Multiprecision). Although pinning down sort speed can be very tricky considering the very, very many confounding factors, well demonstrated and at least partly understood by Steven Ross, his evidence was convincing that in some cases, sorting with Spreadsort can be usefully (if not always dramatically) faster. With all sorts of sort, YMMV always applies. The only true evaluation methodology is suck'n'see.
- What is your evaluation of the documentation?
OK - though I'd like to see much more (and more recent) bibliography on sorting. And using Quickbook would give it a more familiar look'n'feel, and be more easily maintained. (Ask me for help?)
- What is your evaluation of the potential usefulness of the library?
Although some reviewers counter that those who need speed will parallelize or use a GPU, this algorithm allows the sorting novice to easily test whether the heuristics provided will be good enough, without the potentially precipitous learning curve of fancier sort tactics.
- Did you try to use the library?
No.
- How much effort did you put into your evaluation?
A quick reading.
- Are you knowledgeable about the problem domain?
No.
And finally, every review should attempt to answer this question:
- Do you think the library should be accepted as a Boost library?
Yes because it seems a useful addition to Boost. Boost needs more on sorting and this looks a good starting point. The author obviously knows more about sorting than the average user. A Boost.Sort library would encourage other template sort algorithms to be added. Steven Ross is willing to maintain the library, hopefully adding contributions from others. I'd like to see at least one more algorithm in order to justify the Boost.Sort library title. (Failing that the library should be called Boost.Spreadsort - but I'd much prefer Boost.Sort.)

Paul

---
Paul A. Bristow
Prizet Farmhouse Kendal UK LA8 8AB
+44 (0) 1539 561830
participants (13)

- Adam D. Walling
- Antony Polukhin
- Beman Dawes
- Edward Diener
- Jeremy Murphy
- Mathias Gaunard
- Niall Douglas
- Paul A. Bristow
- Peter Dimov
- Phil Endecott
- Robert Ramey
- Steven Ross
- Vladimir Prus