Proposed templated integer_sort

Would there be interest in an integer_sort call that is faster than std::sort? Spreadsort is an in-place hybrid radix-based and comparison-based algorithm with better worst-case and average-case (roughly 20-50%) performance than introsort on random integer data. The worst-case advantage is algorithmic, with Spreadsort taking the better of O(n(square_root(bit_sizeof(data) - log(n)))) and O(nlog(n)) time. For 32-bit integers, that's O(n(square_root(32 - log(n)))), so for a log(n) of 16 (65536 items), it's 4n (plus a small constant times n), vs 16n for introsort. For larger n, the advantage vs. pure comparison-based algorithms increases. In effect, Spreadsort smoothly transitions from pure MSB Radix sort to std::sort with its data cut up by a couple of iterations of MSB Radix sort (reducing the size of the log(n) factor), depending upon the distribution and size of the data.

I could write a templated version of Spreadsort for the Boost library if there is interest. It would work for anything that supports the << and - operators. It could also handle standard floats sorted as integers. Functors supporting << and - could make it more general. Multi-dimensional sorting is also sped up by Spreadsort. I could also write a similar string_sort, which should also be significantly faster than std::sort. This would look much like postal sort, just using introsort after a certain number of iterations. Is this of interest?

http://sourceforge.net/projects/spreadsort

Hi Steven,

If you can provide a generic implementation that can sort
1. integers, floats, and strings,
2. structs that should be sorted according to a data member that is an integer, float, or string, and
3. pointers to structs as in 2,
then I would be interested.

--Johan Råde

Steven Ross wrote:
I could write a templated version of Spreadsort for the Boost library if there is interest. It would work for anything that supports the << and - operators.
Yes, this would interest me. I have a fixed-point class that it could probably be used with. I have a feeling that you need a little bit more than just << and - though, don't you? Hmm, what would a (pure) radix sort need? Presumably, since yours is a hybrid method, you need everything that a comparison sort needs and everything that a radix sort needs. Maybe we should start by implementing a (sketch of a) radix sort template.

Johan mentioned sorting a vector of [pointers to] structs by some member of that struct; I also have some wrappers around std::sort (e.g. below) that do this and would be interested to see how you would implement it.

    template <typename obj_t, typename data_member_ptr_t>
    struct compare_data_member {
        data_member_ptr_t data_member_ptr;
        compare_data_member(data_member_ptr_t data_member_ptr_):
            data_member_ptr(data_member_ptr_) {}
        bool operator()(const obj_t& a, const obj_t& b) const {
            return a.*data_member_ptr < b.*data_member_ptr;
        }
    };

    template <typename iter_t, typename data_member_ptr_t>
    void sort_by(iter_t begin, iter_t end, data_member_ptr_t data_member_ptr) {
        std::sort(begin, end,
                  compare_data_member<typename iter_t::value_type,
                                      data_member_ptr_t>(data_member_ptr));
    }
It could also handle standard floats sorted as integers.
what exactly do you mean by this? As I understand it, if you treat an IEEE float as an integer for sorting it will almost work but the negative range will be backwards; you need to do some sign-mag vs. 2s-comp fixup to allow for it. But I bet std::c++ doesn't guarantee anything along those lines.
Functors supporting << and - could make it more general. Multi-dimensional sorting is also sped up by Spreadsort. I could also write a similar string_sort, that should also be significantly faster than std::sort. This would look much like postal sort, just using introsort after a certain number of iterations. Is this of interest?
Yes, that too. I'm particularly interested in algorithmically-efficient sorting of strings that frequently have common prefixes, e.g. filenames, URLs etc; I think would be a good case for this algorithm, wouldn't it?
I've recently been re-coding something where a std::map<> was using more memory than I wanted. I decided to instead put the data in a std::vector, sort it, and then search using the std::lower/upper_bound algorithms. This works well, but I did notice std::sort using a lot of time in the profile. So an improved sort would be appreciated.

This is off-topic but people might be interested: there are a couple of remaining problems with this sorted vector implementation. Firstly, the access pattern when you search is horrible; it starts by reading one element in the middle, then one at the 1/4 or 3/4 point, and so on; probably each of those is a page or cache-line that will never be accessed otherwise. This can be fixed by bit-reversed addressing (well, at least in the case where the vector size is a power of two, I think) and I'm considering a container-adaptor that bit-reverses the index. It makes normal iteration slower of course. The other problem is that if my data is a key-value pair, I waste a lot of cache space for the value parts of the elements that I only access the keys of (i.e. the nodes near the root of the tree). I could fix this by replacing the vector<pair<key,value>> with a pair<vector<key>,vector<value>>, but can I somehow get the memory layout of the latter with the syntax of the former?

Regards, Phil

Correction: I need the >> operator, not the << operator. It's used for division. I apologize for the error.

First, Johan: I would probably need to have three different methods: integer_sort, float_sort, and string_sort. What I have right now is a C non-templated version of integer_sort in the sourceforge Spreadsort project, so I would start with a generic integer_sort. float_sort is only slightly different. 2 and 3 could readily be handled by adding >>, -, and < operators to either the structure being sorted or a functor.

Phil: You are correct, I do require a comparison operator too (<). I assumed that, but it's good to make clear. You are also correct that negative floats have to be sorted differently. Also, the IEEE float casting operation works on most OSes, but not all, so my float_sort implementation would probably never make it to a standard. That said, it would still be useful on most platforms, and it's very similar to integer_sort, just requiring flipping the order of negative bins, which should have a trivial performance impact. I could sort fixed-point data, given a >> operator, which you should be able to implement.

Here is what is needed for a hybrid MSB radix sort and introsort combo:

bool operator<(const T &) : for comparison
size_t operator-(const T & lhs, const T & rhs) : a - b > 0 -> b > a. Note: - will always be used such that lhs >= rhs inside the algorithm.
T operator>>(const T &) : to divide the values in a quick and efficient manner that will preserve ordering: (a >> n) < (b >> n) -> a < b. Each item is allocated a bin number that corresponds to (x >> n) - (minval >> n), where the algorithm picks an ideal n.

That's all that should be necessary for integer_sort based upon Spreadsort, and for a cast float_sort. string_sort is a different beast. I would probably specialize it to strings and let the user specify a comparison operator initially, adding more generality once it has proved its worth.
Yes, it should work well if there are long common substrings. I'll leave that until after integer_sort.

As a little note, Phil, about lookups: if you keep the bin structure created by Spreadsort around and don't delete any of the sub-bins, you can do high-speed lookups using the bin structure that are much faster than binary search (radix-based lookup, like a trie). Back when Spreadsort wasn't cache-friendly and I split the data into n/8 bins, I found that doing lookups using that bin structure on random data was 3-5X as fast as a binary search. The idea works like this: you have an array of bins over the range MIN to MAX:

bin 0 points to the first value >= MIN
bin 1 points to the first value >= MIN + offset
bin 2 points to the first value >= MIN + 2*offset
...
bin n points to the first value >= MIN + n*offset
...
the last bin points to the first value >= MAX - offset

Then you find the bin index for "x" by taking (x - MIN)/offset, and do a binary search between that bin and the next bin. This can be made faster on integers by using the >> operator. This kind of lookup will basically have one cache miss (into the big list of bins), and then a binary search inside of one bin, which doesn't normally have many elements. If you use the Spreadsort-created bin structure, you might be able to get away with zero cache misses and more evenly divided bins, but you'll have more lookups. The most practical application would probably be a flat lookup into n/C bins, where C is small enough that most bins will fit in a memory page (256?). But I digress.

It sounds like there is interest in a generic integer_sort and I will start working on one. After that float_sort should be simple. string_sort will be a lot of work, but doable.

Steve

Hi Steven, I've had time to have a quick look at your code: Steven Ross wrote:
You are correct, I do require a comparison operator too (<). I assumed that, but it's good to make clear.
I could sort fixed-point data, given a >> operator, which you should be able to implement. Here is what is needed for a hybrid MSB radix sort and introsort combo:
bool operator<(const T &) : for comparison
size_t operator-(const T & lhs, const T & rhs) : a - b > 0 -> b > a. Note: - will always be used such that lhs >= rhs inside the algorithm.
T operator>>(const T &) : to divide the values in a quick and efficient manner that will preserve ordering: (a >> n) < (b >> n) -> a < b.
No, T T::operator>>(const T&) isn't what you're using; you need int T::operator>>(int). The shift amount is obviously an integer, not T, but furthermore you're using the result of the shift as a bin number. So if my fixed-point class lets you say e.g. assert(3.1415 >> 2 == 0.7854), i.e. it returns a T, then you can't use that to select a bin.

My fixed class provides a base() method to access the underlying integer type, and somehow your algorithm needs to be supplied with a functor that uses that base() method to extract the next few bits. I'm not sure how best to do that: it's easy to solve for specific cases but it's hard to be general without becoming so generic that the code is incomprehensible. You can draw parallels with generating CRCs or hash functions for arbitrary types [i.e. what tr1::hash functions don't do]: you need a simple concept that most types can satisfy for accessing their bit patterns. In your case you also need to get the bit patterns with some sort of most-to-least significant ordering.

Anyway, I suggest that you get it looking good for integers first and then we can all think about how to make it more generic. You might also like to prepare some benchmarks so that we can see how fast it really is and argue about the distribution of the values being sorted. I have a nice big file of GPS coordinates you can try it on...

Oh, since no-one else has mentioned it, there was someone else who proposed some sorting algorithms here a while ago. You might like to have a look at the list archive (search for "sorting library") and see what you can learn.

Phil.

Hi Phil,

No, T T::operator>>(const T&) isn't what you're using; you need int T::operator>>(int). The shift amount is obviously an integer, not T, but furthermore you're using the result of the shift as a bin number. So if my fixed-point class lets you say e.g. assert(3.1415 >> 2 == 0.7854), i.e. it returns a T, then you can't use that to select a bin.
Actually, as long as the - operator works as I specified (returning a size_t), my >> operator definition is correct. If >> worked as you specify, then I wouldn't need you to define a - operator. Do you think that would be more usable and/or efficient? Do you think calling the operator rightshift, and defaulting it to >>, would be cleaner? As a note, doing it your way, I'd actually need this: long T::operator>>(unsigned) (to handle 64-bit values correctly).

With that proviso, I think your suggestion is worth running with, as long as nobody sees a problem with limiting this sort to a maximum size of 64 bits. That saves people having to define an additional operator. Now all they need is this:

bool operator<(const T &) : for comparison
long operator>>(unsigned) : to divide the values in a quick and efficient manner that will preserve ordering: (a >> n) < (b >> n) -> a < b.
My fixed class provides a base() method to access the underlying integer type, and somehow your algorithm needs to be supplied with a functor that uses that base() method to extract the next few bits. I'm not sure how best to do that: it's easy to solve for specific cases but it's hard to be general without becoming so generic that the code is incomprehensible.
You could call >> on your base() integer and inline the operator, and the sorting should work pretty quickly.
Anyway, I suggest that you get it looking good for integers first and then we can all think about how to make it more generic. You might also like to prepare some benchmarks so that we can see how fast it really is and argue about the distribution of the values being sorted. I have a nice big file of GPS coordinates you can try it on...
sourceforge.net/projects/spreadsort has the source code in a form you can download, and a regression test setup which tests it on various random distributions trying different constant values (the defaults are pretty good). If you're interested in checking the speed, I suggest looking there (get the generic version). My testing across various platforms and different distributions resulted in a 30-50% speedup on most types of data. I saw no slowdowns. Or you can just copy your GPS coordinate file to input.txt, modify Sample.C to read your data in the right format, call "make spreadsort;./spreadsort", and see how it turns out. Call spreadsort -std if you want to compare vs. std::sort with the same setup. Processor-usage-based timing is integrated with the sample app.
Oh, since no-one else has mentioned it, there was someone else who proposed some sorting algorithms here a while ago. You might like to have a look at the list archive (search for "sorting library") and see what you can learn.
I took a look. It looked like a suggestion to put a bunch of different implementations in Boost, but not much happened. I have no objection to putting a plain radix_sort in Boost if someone can figure out a way to template it efficiently. My integer_sort idea is based on the idea of having one standard algorithm for sorting all integers, if you want to sort integers of any size a little faster than std::sort. The main problem I saw with the previous sorting suggestion is that it adds complexity, where ideally you should make it as easy as possible to choose the ideal algorithm for what you're doing. Introsort is easy to use, good in worst-case and average-case, and general, so I see the std::sort call as the ideal all sorting libraries should try to live up to, and if they can't come close, they shouldn't bother, unless their speedup is tremendous. I hope to give a decent speed boost with a minimum of additional effort by the user, and with a simple name that makes it clear that people should use it if they're sorting by a key that is an integer or can act like an integer, and not bother (use std::sort) otherwise.

Also, more on worst-case performance: I forgot about my MAX_SPLITS < log(n_prev) check in the GetMaxCount function (which was in the code I posted). The recursion check for a constant MAX_SPLITS is effectively this: log(n) > (LOG_CONST * bit_sizeof(remaining_range) / MAX_SPLITS). So for 64-bit data it would quit and call std::sort on the second iteration if the log of the number of items in the bin is less than (55 * 3 / 9) = 18, and then go down by 3s (15, 12, 9 -> 9 is the absolute minimum, at which point it will call std::sort regardless). So the worst case relative to std::sort would be log(n) = 19, and then still following the pattern I described, recursing 7 times then calling std::sort on log(n) = 9 items at the end.
The absolute worst-case distribution would be sorted in reverse order, recursively branching into two pieces at each level with the full width of the remaining distribution between them, until the second-to-last iteration, at which point it would split into the maximum-size bins on which std::sort is always recursively called (log(n) = 9). You need two roughly equal-sized pieces for worst-case performance because with just one piece and outliers, very little swapping would occur, and swapping dominates Spreadsort's runtime.

Given constant MAX_SPLITS (which speeds up the average case by around 50% by avoiding cache misses), the worst-case performance is the better of O(n(bit_sizeof(range)/MAX_SPLITS)) and O(nlog(n)). Without MAX_SPLITS, the worst case is as I originally described, but it would be slower in reality due to a bigger constant. In effect, Spreadsort picks the better of Radix sort and std::sort worst-case performance, and its average case tends to perform better than either for large ranges and large n (LSB Radix sort can be faster, depending on memory allocation overhead, for small ranges). std::sort is still faster for small n (<1000), which is why Spreadsort immediately calls std::sort if the provided vector is too small.

Thanks for your help Phil.

Steve

Steven Ross <spreadsort@gmail.com> wrote:
If you're interested in checking the speed, [...]
I've done a simple benchmark with about 2 million latitude values scaled to ints in the range +/-90 x 1e6. The input is approximately alphabetically ordered by country and then by placename within the country. Your algorithm seems to be about 13% faster than std::sort. This is with g++-4.3 -O3. Cheers, Phil.

Hi Phil, On Thu, Jul 3, 2008 at 5:58 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
I've done a simple benchmark with about 2 million latitude values scaled to ints in the range +/-90 x 1e6. The input is approximately alphabetically ordered by country and then by placename within the country. Your algorithm seems to be about 13% faster than std::sort. This is with g++-4.3 -O3.
Thanks for testing it. That's a little slower than I'd expect (but believable). It's possible that the default tuning values don't work well on your system. Are you running on Windows? I've seen poor performance (only small speedups) on Windows for some reason. On OSX, cygwin, and Linux I haven't seen a real (large) dataset with that small a speedup. If you can afford to leave it running for an hour or so, you might also try running tune.pl -skip_verify 2000000 to see if it changes the constants on your system, and rerun with the constants tune.pl generates. If the constants differ significantly, I'd be interested to see what they are.

If I understand you correctly you're sorting 2M elements over a 28-bit range. Most of my performance tests are with 40M elements over a 32-bit range, and my experience is that the speedup gradually increases between 10K and 10M elements, in roughly this fashion:

10K: no difference
100K: 10% faster
1M: 20% faster (not too far from your result)
10M: 30% faster (and 50+% speedup on some systems or with some types of data and large datasets)

I assume you are comparing times that the application displays, not total time which includes file I/O. If it isn't proprietary data, I'd also be willing to look at your testcase to see if there is something I can improve in either my algorithm or your testing of it. My development system is a G4 OSX laptop (yes, it's ancient), so results can be expected to differ a little.

One final note: I see an error of around +-0.03s with my timing approach. With a data set as small as 2M elements, which can be expected to sort in well under a second, that can significantly impact the percentage runtime difference. Spreadsort is designed for large n. If you have small n, std::sort is better, and between 1K and 1M, the speedup isn't much. If you have 100M elements, then it starts to make a significant difference.
That's the reason for the initial check to see if the number of elements is less than some constant value, in which case it uses std::sort. Steve

Steven Ross wrote:
On Thu, Jul 3, 2008 at 5:58 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
I've done a simple benchmark with about 2 million latitude values scaled to ints in the range +/-90 x 1e6. The input is approximately alphabetically ordered by country and then by placename within the country. Your algorithm seems to be about 13% faster than std::sort. This is with g++-4.3 -O3.
Thanks for testing it. That's a little slower than I'd expect (but believable). It's possible that the default tuning values don't work well on your system. Are you running on Windows?
No, this is Linux. It's a VIA C3, so it probably has less cache than many other systems.
If you can afford to leave it running for an hour or so, you might also try running tune.pl -skip_verify 2000000 to see if it changes the constants on your system, and rerun with the constants tune.pl generates.
I'll have a look at that. What do we think about "magic tuning parameters" in algorithms? My feeling is that things dependent on e.g. cache sizes (or number of processors etc) really need to be determined at run time to be useful to most people.
If I understand you correctly you're sorting 2M elements over a 28-bit range.
Something like that, yes.
Most of my performance tests are with 40M elements over a 32-bit range, and my experience is that the speedup gradually increases between 10K and 10M elements, in roughly this fashion:

10K: no difference
100K: 10% faster
1M: 20% faster (not too far from your result)
10M: 30% faster (and 50+% speedup on some systems or with some types of data and large datasets)
Hmm. Actually my application is more typically sorting of the order of 1e4 to 1e5 elements, within a smaller range.
I assume you are comparing times that the application displays, not total time which includes file I/O.
This is the elapsed (wall-clock) time between calling SpreadSort and it returning. No I/O.
If it isn't proprietary data, I'd also be willing to look at your testcase to see if there is something I can improve in either my algorithm or your testing of it.
OK, I've put a 14 Mbyte file called placenames.dat.bz2 in the directory http://chezphil.org/tmp/ (I've not written the URL in full to stop search engines from wasting their time on it). When un-bziped you'll find a tab-delimited 3-column text file. I got basically the same results using the latitude and longitude columns.
Spreadsort is designed for large n. If you have small n, std::sort is better, and between 1K and 1M, the speedup isn't much. If you have 100M elements, then it starts to make a significant difference. That's the reason for the initial check to see if the number of elements is less than some constant value, in which case it uses std::sort.
I'm unsure how many real-world problems there are that sort tens or hundreds of millions of elements in main memory. 1e8 elements takes ~ a gigabyte if you have just one pointer associated with it. Sorting this size of dataset from disk is a more common, but entirely different, problem. (But of course there are whole problem domains out there that I don't know anything about...) Phil.

On Thu, Jul 3, 2008 at 9:36 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
I'll have a look at that. What do we think about "magic tuning parameters" in algorithms? My feeling is that things dependent on e.g. cache sizes (or number of processors etc) really need to be determined at run time to be useful to most people.
The values I picked are pretty good across most systems. If there is a standard, quick way to find the size of the processor cache, then I could consider using it. By far the most important parameter, MAX_SPLITS, is dependent on the page size, and 9 corresponds to the number of bins that take up 4096 bytes on a 32-bit system, which is a common page size. introsort looks like it has a tuning parameter of 16 in SGI's implementation.

Most of my performance tests are with 40M elements over a 32-bit range, and my experience is that the speedup gradually increases between 10K and 10M elements, in roughly this fashion: 10K: no difference; 100K: 10% faster; 1M: 20% faster (not too far from your result); 10M: 30% faster (and 50+% speedup on some systems or with some types of data and large datasets)
Hmm. Actually my application is more typically sorting of the order of 1e4 to 1e5 elements, within a smaller range.
Then Spreadsort won't be very useful to you for that purpose, unless the range is small enough that bucketsort starts to become practical (range width on the order of n, or the data is spiky). Spreadsort is fast for small ranges, as is any decent bucketsort implementation.

Spreadsort is designed for large n. If you have small n, std::sort is better, and between 1K and 1M, the speedup isn't much. If you have 100M elements, then it starts to make a significant difference. That's the reason for the initial check to see if the number of elements is less than some constant value, in which case it uses std::sort.
I'm unsure how many real-world problems there are that sort tens or hundreds of millions of elements in main memory. 1e8 elements takes ~ a gigabyte if you have just one pointer associated with it. Sorting this size of dataset from disk is a more common, but entirely different, problem. (But of course there are whole problem domains out there that I don't know anything about...)
For those of us who need fast sorts of gigabytes of data, it is very important. For those who don't, it doesn't make much difference, and they're probably not under the same performance constraints. With regards to on-disk sorting, I also developed a fast (greater than order of magnitude improvement vs. conventional mergesort) algorithm for that years ago, but I'm not open-sourcing it at this time, so this isn't the place to discuss it. As an additional note, as it is radix-based, Spreadsort can readily run in parallel. Is that worth trying to put in boost? Steve

Steven Ross wrote:
On Thu, Jul 3, 2008 at 9:36 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
I'll have a look at that. What do we think about "magic tuning parameters" in algorithms? My feeling is that things dependent on e.g. cache sizes (or number of processors etc) really need to be determined at run time to be useful to most people.
The values I picked are pretty good across most systems. If there is a standard, quick way to find the size of the processor cache, then I could consider using it. By far the most important parameter MAX_SPLITS is dependent on the page size, and 9 corresponds to the number of bins that take up 4096 bytes on a 32-bit system, which is a common page size. introsort looks like it has a tuning parameter of 16 in SGI's implementation.
It would be really good to get an ATLAS-like tuning framework into Boost so people could configure these things properly at installation time. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Hi Phil, I investigated a little further about VIA C3s and downloaded your testcase, so I have more to add. On Thu, Jul 3, 2008 at 9:36 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
No, this is Linux. It's a VIA C3, so it probably has less cache than many other systems.
From my reading, it actually has a larger L1 cache (and smaller L2 cache), and for that purpose a MAX_SPLITS as high as 13 might be better. Increasing LOG_MIN_SPLIT_COUNT might also speed up your test a little. Those are probably the only consts that are worth changing.

std::sort performs very well as soon as the data is small enough to fit in the L1 cache. Spreadsort performs fewer iterations the larger MAX_SPLITS is, though if MAX_SPLITS exceeds the cache size, its performance gets worse by 50% or sometimes more (drops off a cliff). The best usage of Spreadsort is usually to cut the data up small enough for std::sort to fit in the L1 cache and finish sorting.
OK, I've put a 14 Mbyte file called placenames.dat.bz2 in the directory http://chezphil.org/tmp/ (I've not written the URL in full to stop search engines from wasting their time on it). When un-bziped you'll find a tab-delimited 3-column text file. I got basically the same results using the latitude and longitude columns.
What's the best way to read data like that from a file? fscanf("%f %f %s\n")? Won't that have problems if there is a space? I don't do much file parsing in C/C++.

Spreadsort is often even faster relative to std::sort when sorting key + data than when sorting just keys, because it performs fewer swaps than std::sort, and outside of swapping it just uses the keys. It also looks like you could use multikey sorting, which is a logical next step once I have integer_sort, float_sort, and string_sort fully implemented.

Steve

Steven Ross wrote:
From my reading, it actually has a larger L1 cache (and smaller L2 cache), and for that purpose a MAX_SPLITS as high as 13 might be better. Increasing LOG_MIN_SPLIT_COUNT might also speed up your test a little.
I'm afraid adjusting those makes it slower.
OK, I've put a 14 Mbyte file called placenames.dat.bz2 in the directory http://chezphil.org/tmp/ (I've not written the URL in full to stop search engines from wasting their time on it). When un-bziped you'll find a tab-delimited 3-column text file. I got basically the same results using the latitude and longitude columns.
What's the best way to read data like that from a file? fscanf("%f %f %s\n")? Won't that have problems if there is a space? I don't do much file parsing in C/C++.
Nor me. awk it into something else first. You can probably use fscanf("%f\t%f\t%[^\n]") or try cin >> lng >> lat; getline(cin,name);. Phil.

Hi Phil, I've attached a testcase. Call "make", and copy your data file to "input.txt", then call ./spreadsort, and compare the runtime to ./spreadsort -std. Please note that I made a couple modifications to SpreadSort.h so that non-integer types can be sorted using just >> and < operators, and that this testcase works that way. This sample sorts the data on the first row of data values only, and only prints out that first row to avoid confusion when diffing results. It also casts the floats to ints because I haven't written float_sort yet, and I don't have your fixed-point code to work with. On Fri, Jul 4, 2008 at 2:26 PM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
Steven Ross wrote:
From my reading, it actually has a larger L1 cache (and smaller L2 cache), and for that purpose a MAX_SPLITS as high as 13 might be better. Increasing LOG_MIN_SPLIT_COUNT might also speed up your test a little.
I'm afraid adjusting those makes it slower.
Using my attached testcase, I reproduced the minor speedup you saw using the constants I sent you. I tested different values and discovered that MAX_SPLITS was getting in the way of good performance, and that I was better off living with cache misses and doing fewer iterations. With MAX_SPLITS thus increased to 20 (8MB maximum bin memory), I saw a 34% speedup. I've attached the Constants.h that shows this speedup. Could you confirm this speedup on your end? Steve

Steven Ross wrote:
I've attached a testcase.
Running this code, I see a 36% improvement.
It also casts the floats to ints because I haven't written float_sort yet, and I don't have your fixed-point code to work with.
The benchmarks I was doing before were simply multiplying the float by 1e6 and rounding to an int. I was also only storing and sorting the actual number, not the associated data; I see that your code sorts everything. And I was measuring elapsed time rather than CPU time. I don't know how much difference any of that makes. When I simply plug the new Constants.h into my existing test I see a further slowdown; it's actually slower than std::sort.
A non-text attachment was scrubbed... Name: phil.tar.gz Type: application/x-gzip Size: 9248 bytes Desc: not available URL: <http://lists.boost.org/MailArchives/boost/attachments/20080704/61b21aed/attachment.gz>
These mailing list attachment archive links still don't work. Who is the mailman expert? They're accessible via gmane though. Phil.

On Sat, Jul 5, 2008 at 4:46 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
Steven Ross wrote:
I've attached a testcase.
Running this code, I see a 36% improvement.
Good; that confirms my results, and that Spreadsort can get a significant improvement with the right constants on your data.
It also casts the floats to ints because I haven't
written float_sort yet, and I don't have your fixed-point code to work with.
The benchmarks I was doing before were simply multiplying the float by 1e6 and rounding to an int. I was also only storing and sorting the actual number, not the associated data; I see that your code sorts everything. And I was measuring elapsed time rather than CPU time. I don't know how much difference any of that makes.
I've found CPU time to give slightly more stable results, particularly when there are processes running in the background. That said, I wouldn't worry about the difference.
When I simply plug the new Constants.h into my existing test I see a further slowdown; it's actually slower than std::sort.
There is a good reason for that: MAX_SPLITS is picked as the maximum number of bins that won't cause a cache miss. When that value is exceeded, lookups to find which bin an item belongs in are limited by memory latency. For this reason, my testing on sorting just keys (no data) finds that runtime increases 50-100% when MAX_SPLITS exceeds what can fit on a memory page. These differences clearly show that constant-picking is important, and that I should probably do some more work to try to gain some more performance.

At a high level, there are three main types of operation that take significant runtime in Spreadsort (outside of calls to std::sort):
1) Iterating over the data vector. This is done three times per iteration: once to find the max and min, once to count how many items are in each bin, and once as part of the swap loop.
2) Swapping data when it is out of order in the core sorting loop.
3) Random accesses into the bin array to find the correct bin. This is done twice per element: once to determine how many items belong in a bin, and a second time to decide which bin to swap an element into.

As long as the bins are small enough to avoid bin lookup cache misses, the only cache misses are the swap operations, so there is an average of 1 cache miss per element (assuming the data doesn't start in the right bins). When the number of bins increases beyond that point, there are an average of 3 cache misses per element, substantially increasing total runtime. On the other hand, fewer iterations are necessary the greater the number of bins, so if the bin count becomes large enough, it's theoretically possible for it to run faster. If the bin count becomes too close to n, overhead in processing bins can also impact performance. In my testing, as long as LOG_MEAN_BIN_SIZE >= 3, this issue is not significant.
When the key is the only data involved, swaps don't take much time, because they involve moving only a small amount of data. When data is added in addition to the key, swap time increases, eventually to the point where it's faster to do fewer iterations even at the expense of more cache misses on bin lookups. That explains why the cache-safe constants I sent you initially are faster when sorting just keys, but the higher bin counts are faster with key+data. As Spreadsort generally becomes faster relative to std::sort as the amount of non-key data increases, I normally optimize for the worst relative case, key-only data, and for that, a cache-friendly bin count is clearly superior. With that said, I may want to revisit the way I allocate the number of bins, and use a different approach for larger data types.

There are two decisions currently made based upon tuning constants in Spreadsort:
1) Recurse or call std::sort? This is pretty well handled by calling std::sort if the number of items is less than a modest constant (usually 256), checking for relative worst-case performance between the two, and picking the one with the better worst-case performance. It might be useful to call insertion sort for really small n if std::sort isn't well-optimized for tiny n (say 16), but that's about the only improvement I can think of.
2) How many bins should be used? Changing this can make significant differences in performance, as we just saw. I experimented a little with the constants for the example I sent you (see attached), and found a slightly modified set of constants that shows a 38% speed improvement on my system with my testcase. Most notable was that the ideal MAX_SPLITS was 15, roughly half of the log of the range. That suggests to me that it is fastest to split up the range half at a time in even-sized chunks for this data set, when the DATATYPE is large.
The current approach works like this (in pseudocode):

    logbincount = log(n) - LOG_MEAN_BIN_SIZE (normally 3)
    if(logbincount > MAX_SPLITS)
        logbincount = MAX_SPLITS

I'm thinking that if sizeof(T) > sizeof(Bin), then I should apply a different approach:

    logbincount = log(n) - LOG_MEAN_BIN_SIZE (normally 3)
    if(logbincount > log(range))
        logbincount = log(range); // this will complete sorting in just one iteration
    else if(logbincount > MAX_SPLITS) {
        if(logbincount >= log(range)/2)
            logbincount = (log(range) + 1)/2; // +1 so that odd numbers round up, not down
        else
            logbincount = MAX_SPLITS; // otherwise just minimize cache misses
    }

This is basically special-casing two situations:
1) The range can be divided in a single iteration, in which case we should just finish the job.
2) The range can be divided in two iterations, and it may often be faster to do it in two even steps.
In cases where there is more work to do, it won't try to take a shortcut, because that invites lots of unnecessary cache misses. I'm open to suggestions of better ways to handle this.
These mailing list attachment archive links still don't work. Who is the mailman expert? They're accessible via gmane though.
Is there a better way to exchange testcases, given that I don't have a personal website? Steve

on Sat Jul 05 2008, "Steven Ross" <spreadsort-AT-gmail.com> wrote:
These mailing list attachment archive links still don't work. Who is the mailman expert? They're accessible via gmane though.
Is there a better way to exchange testcases, given that I don't have a personal website?
The Boost subversion sandbox comes to mind -- Dave Abrahams BoostPro Computing http://www.boostpro.com

I've uploaded a new integer_sort release to http://sourceforge.net/projects/spreadsort/, and attached the source to this email. This uses random access iterators just like std::sort, and should be just as flexible. I will see if using some generic algorithms for iterators improves the performance a little; I'm seeing a noticeable speed penalty with iterators instead of pointers. I get the same performance as the old code when I pass in pointers as the iterators, but a roughly 15% increase in runtime when I use .begin() and .end(). I tried the performance optimization I discussed in a prior email, and it didn't work. I'm going to leave it as is, because I've found the old way to give a safe, steady performance improvement across platforms, even if said improvement isn't huge. Suggestions are welcome. Steve

on Sun Jul 06 2008, "Steven Ross" <spreadsort-AT-gmail.com> wrote:
I've uploaded a new integer_sort release to http://sourceforge.net/projects/spreadsort/, and attached the source to this email. This uses random access iterators just like std::sort,
FWIW, the requirements of std::sort are satisfied by quicksort, an algorithm that works on bidirectional iterators only (it just depends on partition and swap), and if you allow it to be adaptive (i.e. use temporary storage) it can work on forward iterators (see std::stable_partition) -- although the latter may not be any faster than copying into a vector, sorting that, and copying back.
and should be just as flexible. I will see if using some generic algorithms for iterators improves the performance a little; I'm seeing a noticeable speed penalty with iterators instead of pointers. I get the same performance as the old code when I pass in pointers as the iterators, but a roughly 15% increase in runtime when I use .begin() and .end().
What compiler are you testing? VC++ uses checked iterators by default, which slows everything down terribly. You might try putting -D_SECURE_SCL=0 in your command line. See http://einaros.blogspot.com/2007/02/iteration-techniques-put-on-benchmark.ht... -- Dave Abrahams BoostPro Computing http://www.boostpro.com

FWIW, the requirements of std::sort are satisfied by quicksort, an algorithm that works on bidirectional iterators only (it just depends on partition and swap), and if you allow it to be adaptive (i.e. use temporary storage) it can work on forward iterators (see std::stable_partition) -- although the latter may not be any faster than copying into a vector, sorting that, and copying back.
That may be true, but Introsort is significantly faster both on average and worst-case than Quicksort, and SGI's STL implementation uses Introsort, which uses a random access iterator. The first version of Spreadsort I wrote in 2001 allocated separate memory for each bin, and with a linked-list approach instead each sorting operation could be done in a single pass (as opposed to the three passes used in the version I'm turning into integer_sort), though without finding the maximum and minimum values. That would make more sense to use with a linked-list data set. It's not complicated to implement, but my testing shows that iterating through linked lists is slow because they often have cache misses, and besides, binary search only makes sense in a data structure that supports random access. Do you have a genuine desire for a real application to see a bidirectional (or unidirectional) iterator version of a hybrid sort? Spreadsort isn't much slower on a linked list than on an array, so it should do much better than Quicksort. I normally think of sorting linked lists as a rare niche application. In my own work I avoid linked lists because there is a better standard data structure for every problem I've run into.
and should be just as flexible. I will see if using some generic algorithms for iterators improves the performance a little; I'm seeing a noticeable speed penalty with iterators instead of pointers. I get the same performance as the old code when I pass in pointers as the iterators, but a roughly 15% increase in runtime when I use .begin() and .end().
What compiler are you testing? VC++ uses checked iterators by default, which slows everything down terribly. You might try putting -D_SECURE_SCL=0 in your command line. See
http://einaros.blogspot.com/2007/02/iteration-techniques-put-on-benchmark.ht...
I'm using g++ on an OSX laptop with a G4 processor (slow and old, in other words). I tried your suggested command line option and it made no visible difference, but thanks for the suggestion. If I sort using &(vec[0]), &(vec[0])+vec.size(), it still runs 15% faster than with vec.begin(), vec.end(). Steve

on Sun Jul 06 2008, "Steven Ross" <spreadsort-AT-gmail.com> wrote:
FWIW, the requirements of std::sort are satisfied by quicksort, an algorithm that works on bidirectional iterators only (it just depends on partition and swap), and if you allow it to be adaptive (i.e. use temporary storage) it can work on forward iterators (see std::stable_partition) -- although the latter may not be any faster than copying into a vector, sorting that, and copying back.
That may be true, but Introsort is significantly faster both on average and worst-case than Quicksort,
As I have always understood it (and as Musser's paper describes it), introsort *is* Quicksort until its behavior starts to stray from the average case, when it switches to heapsort as a fallback. How could it possibly have better average case complexity than Quicksort?
and SGI's STL implementation uses Introsort, which uses a random access iterator.
Yes, I know.
The first version of Spreadsort I wrote in 2001 allocated separate memory for each bin, and with a linked-list approach instead each sorting operation could be done in a single pass (as opposed to the three passes used in the version I'm turning into integer_sort), though without finding the maximum and minimum values. That would make more sense to use with a linked-list data set. It's not complicated to implement, but my testing shows that iterating through linked lists is slow because they often have cache misses, and besides, binary search only makes sense in a data structure that supports random access.
You might consider why the STL provides binary search for forward iterators, then ;-)
Do you have a genuine desire for a real application to see a bidirectional (or unidirectional) iterator version of a hybrid sort?
I'm just generally interested in algorithms. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

That may be true, but Introsort is significantly faster both on average and worst-case than Quicksort,
As I have always understood it (and as Musser's paper describes it), introsort *is* Quicksort until its behavior starts to stray from the average case, when it switches to heapsort as a fallback. How could it possibly have better average case complexity than Quicksort?
Three reasons:
1) Mean performance across all testcases does not mean just Quicksort's best-case. Heapsort is a good general algorithm, and improves the average performance even if it doesn't change much on random data.
2) Introsort also uses insertion_sort, which is really fast on small data sets. I used to use Quicksort for recursive calls from Spreadsort, then found that for sizes up to 16 it was much better to use insertion_sort. So for the final sorting of very small data sets after it's finished cutting them up, introsort is faster.
3) In benchmark testing I find introsort to be about 3X as fast as the qsort library call. That should be the strongest argument of all.
The first version of Spreadsort I wrote in 2001 allocated separate memory for each bin, and with a linked-list approach instead each sorting operation could be done in a single pass (as opposed to the three passes used in the version I'm turning into integer_sort), though without finding the maximum and minimum values. That would make more sense to use with a linked-list data set. It's not complicated to implement, but my testing shows that iterating through linked lists is slow because they often have cache misses, and besides, binary search only makes sense in a data structure that supports random access.
You might consider why the STL provides binary search for forward iterators, then ;-)
That might save a little iteration over a search over every element, but wouldn't it still be O(n)? I'd think it's there just because it's possible, not because it has much use. I have no problem with providing general capabilities just because they're possible.
Do you have a genuine desire for a real application to see a bidirectional (or unidirectional) iterator version of a hybrid sort?
I'm just generally interested in algorithms.
It would be at least a few hours of development plus testing to make sure it isn't broken. I'd be happy to tell you how to do it if you wish, but without a need, I don't see why I should bother. I have many things to do, and little time. Steve

on Sun Jul 06 2008, "Steven Ross" <spreadsort-AT-gmail.com> wrote:
That may be true, but Introsort is significantly faster both on average and worst-case than Quicksort,
As I have always understood it (and as Musser's paper describes it), introsort *is* Quicksort until its behavior starts to stray from the average case, when it switches to heapsort as a fallback. How could it possibly have better average case complexity than Quicksort?
Three reasons: 1) Mean performance across all testcases does not mean just Quicksort's best-case. Heapsort is a good general algorithm, and improves the average performance even if it doesn't change much on random data.
You're saying that by improving the worst cases, introsort improves the average. I guess the worst cases for median-of-3 quicksort are more common than I understood them to be? They would have to be pretty common to make introsort *significantly* faster on average, wouldn't they?
2) Introsort also uses insertion_sort, which is really fast on small data sets.
Yeah, but that's essentially an implementation detail optimization. Many Quicksort implementations do the same thing. I don't believe that changes the overall big-O behavior.
I used to use Quicksort for recursive calls from Spreadsort, then found that for sizes up to 16 it was much better to use insertion_sort. So for the final sorting of very small data sets after it's finished cutting them up, introsort is faster. 3) In benchmark testing I find introsort to be about 3X as fast as the qsort library call. That should be the strongest argument of all.
Well that's no surprise at all. You have to pay for a function call indirection with qsort.
The first version of Spreadsort I wrote in 2001 allocated separate memory for each bin, and with a linked-list approach instead each sorting operation could be done in a single pass (as opposed to the three passes used in the version I'm turning into integer_sort), though without finding the maximum and minimum values. That would make more sense to use with a linked-list data set. It's not complicated to implement, but my testing shows that iterating through linked lists is slow because they often have cache misses, and besides, binary search only makes sense in a data structure that supports random access.
You might consider why the STL provides binary search for forward iterators, then ;-)
That might save a little iteration over a search over every element, but wouldn't it still be O(n)?
It saves no iteration at all. It is O(n) iterator steps but O(log N) comparisons. If comparison is expensive enough, it adds up.
I'd think it's there just because it's possible, not because it has much use. I have no problem with providing general capabilities just because they're possible.
Do you have a genuine desire for a real application to see a bidirectional (or unidirectional) iterator version of a hybrid sort?
I'm just generally interested in algorithms.
It would be at least a few hours of development plus testing to make sure it isn't broken. I'd be happy to tell you how to do it if you wish, but without a need, I don't see why I should bother. I have many things to do, and little time.
I'm not trying to get you to do it. I'm just pointing out that std::sort is overconstrained and should at least be made to work on bidirectional iterators (or have its worst-case complexity bound tightened in the case of random access iterators, or both). -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Three reasons: 1) Mean performance across all testcases does not mean just Quicksort's best-case. Heapsort is a good general algorithm, and improves the average performance even if it doesn't change much on random data.
You're saying that by improving the worst cases, introsort improves the average. I guess the worst cases for median-of-3 quicksort are more common than I understood them to be? They would have to be pretty common to make introsort *significantly* faster on average, wouldn't they?
The best-case for median-of-3 quicksort assumes that the median-of-3 evenly bisects the data. Depending upon the distribution, this is not a reasonable assumption. If it divides the data roughly 1/4 - 3/4, it'll take roughly twice as long. I would expect such uneven divisions to be relatively common. If it divides 1/8-7/8, it'll take roughly three times as long, but will be even worse if it always happens to one side. These situations are far from the worst-case, but explain why even on average, Quicksort is somewhat worse than Introsort, because some distributions are uneven enough for Introsort to call Heapsort.
2) Introsort also uses insertion_sort, which is really fast on small data sets.
Yeah, but that's essentially an implementation detail optimization. Many Quicksort implementations do the same thing. I don't believe that changes the overall big-O behavior.
No, it doesn't, but the overhead of finding the median becomes more significant on overall performance on the smaller subsets, so it is a significant implementation detail.
3) In benchmark testing I find introsort to be about 3X as fast as the qsort library call. That should be the strongest argument of all.
Well that's no surprise at all. You have to pay for a function call indirection with qsort.
Then why hasn't the compiler/library provider inlined the qsort call so the uninformed don't have such a huge speed penalty? I'm not claiming Introsort is vastly superior to Quicksort; I'm just saying it is enough faster to be significant.
You might consider why the STL provides binary search for forward iterators, then ;-)
That might save a little iteration over a search over every element, but wouldn't it still be O(n)?
It saves no iteration at all. It is O(n) iterator steps but O(log N) comparisons. If comparison is expensive enough, it adds up.
That sounds like a corner-case to me; somebody with a situation like that might want to consider a radix sort, if they can radix-sort their data, and a trie-like structure to minimize comparisons would probably save some runtime. If radix-based operations were cheap enough relative to comparisons, they might want a pure-radix sort and radix-based lookup.
Do you have a genuine desire for a real application to see a bidirectional (or unidirectional) iterator version of a hybrid sort?
I'm not trying to get you to do it. I'm just pointing out that std::sort is overconstrained and should at least be made to work on bidirectional iterators (or have its worst-case complexity bound tightened in the case of random access iterators, or both).
I thought about what would be required to make integer_sort work with forward iterators:
1) I would have to either eliminate the check to see whether the data set is small enough that std::sort would be better, require the user to provide a count of the number of items being sorted, or be able to determine that the iterator isn't random-access and thus not do the check.
2) I would have to eliminate recursive calls to std::sort, and instead use an algorithm that works safely with the iterator I'm provided for recursive comparison-based sorting calls.
3) I would need an inlined function call for going forward n places from an iterator that takes constant time for a random access iterator, and the minimum number of operations necessary for a more limited forward iterator. That said, it would add an additional pass through the data for a non-random-access iterator.
Given all that, I can make integer_sort work off a forward iterator, without impairing random access iterator performance, and still have decent performance for a forward iterator. So if you can provide me all of these:
1) A way to determine that an iterator is not random-access.
2) An efficient generic comparison-based recursive algorithm that takes forward iterators and is competitive with Introsort in performance.
3) An efficient way to go forward n steps from an iterator that is just as fast as + n for a random access iterator, but also works for normal iterators.
then I could make integer_sort work with a plain forward iterator. I'd still have to be able to swap elements. For linked lists, a specialized version of Spreadsort would be significantly faster.

AMDG Steven Ross wrote:
I thought about what would be required to make integer_sort work with forward iterators: 1) I would have to either eliminate the check to see whether the data set is small enough that std::sort would be better, require the user to provide a count of the number of items being sorted, or be able to determine that the iterator isn't random-access and thus not do the check.
use std::distance
3) I would need an inlined function call for going forward n places from an iterator that takes constant time for a random access iterator, and the minimum number of operations necessary for a more limited forward iterator. With that said, it should add an additional pass through the data for a non-random-access iterator.
std::advance or boost::next(iter, n)
Given all that, I can make integer_sort work off a forward iterator, without impairing randomaccess iterator performance, and still having decent performance for a forward iterator. So if you can provide me all of these: 1) A way to determine that an iterator is not random-access.
For a random access iterator Iter typename std::iterator_traits<Iter>::iterator_category inherits from std::random_access_iterator_tag
3) An efficient way to go forward n steps from an iterator that is just as fast as + n for a random access iterator, but also works for normal iterators.
std::advance

In Christ, Steven Watanabe

on Mon Jul 07 2008, "Steven Ross" <spreadsort-AT-gmail.com> wrote:
Three reasons: 1) Mean performance across all testcases does not mean just Quicksort's best-case. Heapsort is a good general algorithm, and improves the average performance even if it doesn't change much on random data.
You're saying that by improving the worst cases, introsort improves the average. I guess the worst cases for median-of-3 quicksort are more common than I understood them to be? They would have to be pretty common to make introsort *significantly* faster on average, wouldn't they?
The best-case for median-of-3 quicksort assumes that the median-of-3 evenly bisects the data. Depending upon the distribution, this is not a reasonable assumption. If it divides the data roughly 1/4 - 3/4, it'll take roughly twice as long. I would expect such uneven divisions to be relatively common. If it divides 1/8-7/8, it'll take roughly three times as long, but will be even worse if it always happens to one side. These situations are far from the worst-case, but explain why even on average, Quicksort is somewhat worse than Introsort, because some distributions are uneven enough for Introsort to call Heapsort.
I'm having a hard time believing that the average case for QuickSort is as bad as you say. I have done a good deal of searching and I see quite a few test results showing introsort and quicksort have approximately identical average-case behaviors.
2) Introsort also uses insertion_sort, which is really fast on small data sets.
Yeah, but that's essentially an implementation detail optimization. Many Quicksort implementations do the same thing. I don't believe that changes the overall big-O behavior.
No, it doesn't, but the overhead of finding the median becomes more significant on overall performance on the smaller subsets, so it is a significant implementation detail.
It's not significant in the comparison with QS if QS implementations do it too!
3) In benchmark testing I find introsort to be about 3X as fast as the qsort library call. That should be the strongest argument of all.
Well that's no surprise at all. You have to pay for a function call indirection with qsort.
Then why hasn't the compiler/library provider inlined the qsort call so the uninformed don't have such a huge speed penalty?
Uh... because qsort is a 'C' library routine?
I'm not claiming Introsort is vastly superior to Quicksort; I'm just saying it is enough faster to be significant.
I don't believe that's true in the average case.
You might consider why the STL provides binary search for forward iterators, then ;-)
That might save a little iteration over a search over every element, but wouldn't it still be O(n)?
It saves no iteration at all. It is O(n) iterator steps but O(log N) comparisons. If comparison is expensive enough, it adds up.
That sounds like a corner-case to me; somebody with a situation like that might want to consider a radix sort, if they can radix-sort their data, and a trie-like structure to minimize comparisons would probably save some runtime. If radix-based operations were cheap enough relative to comparisons, they might want a pure-radix sort and radix-based lookup.
Not all data can be radix sorted. And anyway, what does sorting have to do with binary search?
Do you have a genuine desire for a real application to see a bidirectional (or unidirectional) iterator version of a hybrid sort?
I'm not trying to get you to do it. I'm just pointing out that std::sort is overconstrained and should at least be made to work on bidirectional iterators (or have its worst-case complexity bound tightened in the case of random access iterators, or both).
I thought about what would be required to make integer_sort work with forward iterators: 1) I would have to either eliminate the check to see whether the data set is small enough that std::sort would be better, require the user to provide a count of the number of items being sorted, or be able to determine that the iterator isn't random-access and thus not do the check. 2) I would have to eliminate recursive calls to std::sort, and instead use an algorithm that works safely using the iterator I'm provided for recursive comparison-based sorting calls. 3) I would need an inlined function call for going forward n places from an iterator that takes constant time for a random access iterator, and the minimum number of operations necessary for a more limited forward iterator. With that said, it should add an additional pass through the data for a non-random-access iterator.
Given all that, I can make integer_sort work off a forward iterator, without impairing random access iterator performance, and still have decent performance for a forward iterator. So if you can provide me all of these:
1) A way to determine that an iterator is not random-access.
2) An efficient generic comparison-based recursive algorithm that takes forward iterators and is competitive with Introsort in performance.
3) An efficient way to go forward n steps from an iterator that is just as fast as + n for a random access iterator, but also works for normal iterators.
Then I could make integer_sort work with a plain forward iterator. I'd still have to be able to swap elements. For linked lists, a specialized version of Spreadsort would be significantly faster.
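Requirements 1 and 3 above both fall out of iterator_traits: the category tag identifies non-random-access iterators at compile time, and tag dispatch gives a constant-time advance where possible. A sketch (names are illustrative; this mirrors what std::advance already does):

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Constant-time advance for random access iterators, linear stepping
// otherwise, selected at compile time by the iterator category tag.
template <class Iter>
void step_forward(Iter& it, int n, std::random_access_iterator_tag) {
    it += n;  // O(1)
}

template <class Iter>
void step_forward(Iter& it, int n, std::forward_iterator_tag) {
    while (n--) ++it;  // O(n); also matches bidirectional iterators
}

template <class Iter>
void step_forward(Iter& it, int n) {
    step_forward(it, n,
        typename std::iterator_traits<Iter>::iterator_category());
}
```

The same iterator_category trait answers requirement 1 directly: comparing it against std::random_access_iterator_tag tells you whether the "is the data set small enough for std::sort" check is affordable.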
-- Dave Abrahams BoostPro Computing http://www.boostpro.com

I've attached a fully functional revision that only has a 9% speed penalty from using random access iterator types instead of directly using pointers (as opposed to 15% before). std::sort has a 6% penalty in my tests, so 9% doesn't seem unreasonable to me. The only additional modifications I'm considering making at this point are whatever formatting Boost requires (where's the best place to look that up?). The question is: is this code sufficiently useful, fast, and easy to use to include in Boost? I think it is, because of its speedup across many different data types and its usage of only >> and <, but I'm looking for input. If this is accepted, I can add versions that accept a functor, plus float_sort and string_sort, but I'd like to go through the acceptance process with just this simpler code, so I know exactly what to do with the others. Steve

Steven Ross wrote:
I've uploaded a new integer_sort release to http://sourceforge.net/projects/spreadsort/, and attached the source to this email.
Suggestions are welcome.
Hi Steve, Some comments, mostly minor things I think:
- Boost generally uses .hpp for header files.
- Identifiers (including macros) starting with _ are reserved.
- I don't recall seeing much Hungarian Notation in Boost.
- Some way of sorting structs, or pointers to structs, by passing some sort of key access functor as an additional template parameter is needed.
- Use new and delete rather than *alloc and free.
- Can any exceptions occur? If so, is the dynamically allocated memory leaked? Could you use a smart pointer to avoid this?
- If you re-organise the recursion so that SpreadSortCore calls SpreadSortBins directly, rather than first returning to SpreadSortRec, can you limit the scope of the bins to SpreadSortCore?
- Does it work with uint64_t? I see some variables declared as 'long', which often can't store a uint64_t.
- I look forward to seeing float and string versions.
- If std::vector<int>::iterator really is slower than int* (and I suggest some investigation of the assembler to see why this is), then you could add a specialisation that converts vector::iterators into pointers.
- Is there a use-case for a pure radix sort, without the std::sort fallback? i.e. are there types that can't easily provide a comparison function, or distributions where the std::sort fallback is never useful?
Having said all that, personally I have little interest in an algorithm that has no benefit for fewer than millions of items. What do other people think about that aspect? Cheers, Phil.

Phil wrote:
Having said all that, personally I have little interest in an algorithm
that has no benefit for fewer than millions of items. What do other people think about that aspect?
I frequently sort gigabytes, tens of gigabytes or even hundreds of gigabytes of geometry data as a pre-requisite to scanline over the data. However, a faster sort would only benefit me if the speedup were reliably 2X or more. A 30% speedup in sort would disappear with the effect of Amdahl's Law to something under 10%. I get better speedup just by adopting each new version of the compiler. For me, the modest size and tenuous nature (depends upon architecture specific constants) of the speedup does not justify any effort to integrate and carry the dependency to the library that provides it. I have been following the thread only out of academic interest. I would end up using the algorithm only if it were integrated into std::sort and applied in a transparent manner where appropriate. Bringing the algorithm to boost might be a good first step in that direction. Template metaprogramming techniques could be used to detect at compile time whether a data type fits the model for spreadsort. From that standpoint I'd like to see a wrapper around spreadsort/std::sort that selects the appropriate algorithm for arbitrary T and allows the user to be completely abstracted away from even knowing about the existence of spreadsort. The next step from there is to try to push it into the standard itself. I wouldn't expect that to ever happen though, and am not too optimistic about how worthwhile it would be to everyone if it did happen. The speedup is just not consistently large enough to be worth caring about in the context of an application that does more than just sorting data. Luke
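The Amdahl's-law arithmetic behind "30% becomes under 10%" can be made concrete. This is a worked illustration, not Luke's actual profile: the fraction p = 0.3 is an assumed share of runtime spent sorting.

```cpp
#include <cassert>

// If sorting is fraction p of total runtime and the sort alone gets
// factor s faster, the whole application speeds up by
//   1 / ((1 - p) + p / s).
inline double overall_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
// Example: sorting is 30% of runtime and the sort runs 30% faster
// (s = 1/0.7). Overall gain: 1 / (0.7 + 0.21) = 1.0989..., i.e. just
// under a 10% application-level speedup.
```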

Phil wrote:
Having said all that, personally I have little interest in an algorithm that has no benefit for fewer than millions of items. What do other people think about that aspect?
I forgot to mention, anyone interested in runtime of an application that spends a lot of time sorting large data volumes is going to do it on an up-to-date machine, which means 4, 8, 16 or 32 cores today and many more very soon. There is more than one source for a parallel_sort that should beat both std::sort and introsort by an order of magnitude on the machines that I use for large data sets, and if I were really interested in sorting faster I'd be using the threaded algorithms. I think std::sort is a straw man to compare yourself to, not because it isn't good at what it does, but because the target use case in question of very large data volumes is better suited to a parallel algorithm. Parallelize spreadsort, compare yourself to the good parallel sorting algorithms and suddenly the whole issue will become a lot more relevant and applicable to what is currently going on in the industry. Luke

On Mon, Jul 7, 2008 at 12:40 PM, Simonson, Lucanus J < lucanus.j.simonson@intel.com> wrote:
Phil wrote:
Having said all that, personally I have little interest in an algorithm
that has no benefit for fewer than millions of items. What do other people think about that aspect?
I frequently sort gigabytes, tens of gigabytes or even hundreds of gigabytes of geometry data as a pre-requisite to scanline over the data. However, a faster sort would only benefit me if the speedup were reliably 2X or more. A 30% speedup in sort would disappear with the effect of Amdahl's Law to something under 10%. I get better speedup just by adopting each new version of the compiler. For me, the modest size and tenuous nature (depends upon architecture specific constants) of the speedup does not justify any effort to integrate and carry the dependency to the library that provides it. I have been following the thread only out of academic interest.
Maybe I should specify a little more about the performance: As the ratio of keysize to n decreases (and as the data size increases), the speedup of Spreadsort vs. std::sort improves. A 30% speedup is roughly what I get at around 10 million random integer keys with no data on an OSX system with an old G4 processor. Linux with Intel processors shows a speedup of around 40% with the same data (and Windows around 15%). The speedup becomes larger as n increases, because keysize/n decreases and more radix sorting occurs. Spreadsort balances O(n(keysize/constant)) radix-based sorting with O(nlog(n)) comparison-based sorting, and what effectively happens as n increases is that the O(n(keysize/constant)) factor dominates, and the unit runtime of Spreadsort stays roughly constant, while that of std::sort goes up with log(n). So they're roughly at parity when log(n) = 13 (8K), but by the time log(n) = 26 (64M), Spreadsort is roughly 40% faster. Projecting that out, at log(n) = 34 (16G), it should be around 66% faster. Not 2X, but not trivial either.
I fail to see how Amdahl's law applies to Spreadsort, as it's not parallel, but non-sorting tasks in your application obviously won't be sped up by Spreadsort. Spreadsort already calls std::sort automatically if the count is less than a minimum value, which last time I checked was 8192. The idea is to have an integer_sort that calls the better algorithm for the provided data, whichever it is, so that part of what you want is already there. To actually make it part of std::sort, I'd need to have float, integer, and string versions, check for which applies, and if none apply, call the current std::sort. That will require significant work.
With regards to running this in parallel, here's how you do it:
1) Find the maximum and minimum of a list in parallel in FindExtremes. This would seem easy to run in parallel, as you just divide the list up evenly, and merge the results at the end.
2) The only fully serial part is finding the size of the bins to create in the first iteration. This could be made parallel by allocating external memory for bins, but at the cost of locking code, or merging bins from each processor, in addition to the memory allocation, which probably isn't worth it with a small number of processors. If you did insert into bins at this point, the algorithm would be fully parallel, and it could skip the next two steps.
3) Allocate bins (there are normally n/512 bins in the first iteration, so this is fast, even if serial).
4) In the swap loop, as soon as a bin is filled, fire it off on a separate thread. As you cut the data into 512-or-so pieces in the first iteration, each successive iteration should now be running efficiently in parallel. How many swaps are required to fill the first bin is not reliable, so this step is not fully parallel.
As to std::sort being a straw man, I'd be happy to compare against a faster general-case serial sorting algorithm (if said algorithm exists). Improvements in serial sorting algorithms should mostly apply to parallel algorithms, and nothing stops you from running Spreadsort in parallel. I'd be happy to look into developing parallel versions once integer_sort, float_sort, and string_sort are in Boost, and hopefully I have enough money to buy a test machine with enough cores to be worth testing on.
In reply to David Abrahams: Sorting and searching are analogs of each other; radix sorting is like a trie structure, and many comparison-based algorithms are like a binary tree. So the two are related. You can even use the Spreadsort bin structure (if you don't delete it) to do high-speed lookups into the array; in the past I found this to be much faster than binary search. The Judy Array data structure is based upon a similar idea.
If you have a generic Quicksort or some other fast sort implementation that doesn't have O(n*n) worst-case behavior and that can be used on any forward-iterator, I'd be happy to call that on non-random-access iterators inside of integer_sort, thus allowing integer_sort to work on any forward-iterator. In reply to Phil: Thank you for your detailed suggestions. I will implement those that make sense (probably most of them), and reply in detail later. Yes, a pure radix_sort may be useful, though you can always write a comparison function if you can describe a radix sort (I don't see how you can sort anything where you don't know if a < b; I think that's part of the definition of sorting in Knuth's Sorting And Searching book). An LSB external radix_sort is both fast (for small keys and large n) and stable, so that could clearly be useful to generalize, but the tricky part is how to generalize it (they normally use bit masks)? Spreadsort is a hybrid that uses an MSB radix sort, and is able to identify for itself whether the distribution is such that pure radix sorting or comparison-based sorting is better, so I don't see the point of writing a separate MSB radix sort.
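Step 1 of the parallel plan (FindExtremes) really is the easy divide-and-merge case described: split the range, find each half's extremes concurrently, merge. A sketch in modern C++ (std::async postdates this discussion); assumes at least two elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Parallel min/max: the second half is scanned on another thread while
// this thread scans the first half, then the two results are merged.
inline std::pair<int, int> find_extremes(const std::vector<int>& v) {
    const std::size_t mid = v.size() / 2;
    std::future<std::pair<int, int> > upper = std::async(
        std::launch::async, [&v, mid] {
            auto mm = std::minmax_element(v.begin() + mid, v.end());
            return std::make_pair(*mm.first, *mm.second);
        });
    auto mm = std::minmax_element(v.begin(), v.begin() + mid);
    std::pair<int, int> top = upper.get();
    return std::make_pair(std::min(*mm.first, top.first),
                          std::max(*mm.second, top.second));
}
```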

To actually make it part of std::sort, I'd need to have float, integer, and string versions, check for which applies, and if none apply, call the current std::sort. That will require significant work.
Yes, that is the work I had in mind. But as I said, I'm not sure it would be worthwhile because nobody stands to benefit from a faster serial sorting algorithm for large data sets.
2) The only fully serial part is finding the size of the bins to create in the first iteration. This could be made parallel by allocating external memory for bins, but at the cost of locking code, or merging bins from each processor, in addition to the memory allocation, which probably isn't worth it with a small number of processors. If you did insert into bins at this point, the algorithm would be fully parallel, and it could skip the next two steps.
I disagree with you here. You could easily split the input into kernels of data and process each one as a task to compute the size of bins required by that kernel and join the results.
3) Allocate bins (there are normally n/512 bins in the first iteration, so this is fast, even if serial)
Allocation of bins would happen as a serial stage in your pipeline, yes, but be very fast and not harm performance.
4) In the swap loop, as soon as a bin is filled, fire it off on a separate thread. As you cut the data into 512-or-so pieces in the first iteration, each successive iteration should now be running efficiently in parallel. How many swaps are required to fill the first bin is not reliable, so this step is not fully parallel.
You are thinking about threading in a thread-oriented instead of task-oriented way. If you recursively create threads you will oversubscribe your machine and the threads will contend for cores. 512 threads is way, way, way too many. That's not a recipe for success. Instead of creating threads you want to create tasks and have a task scheduler that assigns them to threads. The first iteration of spreadsort can be fully parallel because if you cut the original input into enough kernels and process them as tasks you can keep all your cores busy until all the bins are filled, then proceed with parallel recursion into each bin.
As to std::sort being a straw man, I'd be happy to compare against a faster general-case serial sorting algorithm (if said algorithm exists). Improvements in serial sorting algorithms should mostly apply to parallel algorithms, and nothing stops you from running Spreadsort in parallel. I'd be happy to look into developing parallel versions once integer_sort, float_sort, and string_sort are in Boost, and hopefully I have enough money to buy a test machine with enough cores to be worth testing on.
You sort of missed my point about std::sort being a straw man. It is the best general case serial sorting algorithm. It isn't a good benchmark to compare against, however, in a world with parallel sorting algorithms and multi-core machines that are quickly becoming ubiquitous. Nobody should ever consider investigating using a faster serial algorithm for sorting large data sets when they would be much better served to apply the freely available implementation of a parallel algorithm. The thing stopping me from running spreadsort in parallel is that you haven't released a threaded version of the algorithm with a nice, clean, generic API. I would never parallelize it myself, because if I wanted a parallel sorting algorithm I'd use one off the shelf. Also, it stands to reason that there is at least a 50/50 chance parallel spreadsort would under-perform existing parallel sorting implementations. Whether it turns out to be better or worse than the best parallel implementation is what I'm interested in finding out. Marginally better than std::sort for large n only tells me that you are at a good point to start thinking about threading the algorithm since you have bottomed out on serial performance improvements. Luke
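The task-oriented bin counting Luke describes, split the input into kernels, histogram each kernel as an independent task, join the results, can be sketched as follows. The shift, bin count, and chunk count are illustrative parameters, not Spreadsort's actual tuning constants, and std::async stands in for a real task scheduler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Each chunk is histogrammed as an independent task; the per-chunk
// histograms are then summed serially. No locking is needed because
// each task writes only its own counts.
inline std::vector<std::size_t> parallel_bin_counts(
        const std::vector<unsigned>& data,
        unsigned shift, std::size_t bin_count, std::size_t chunks) {
    std::vector<std::future<std::vector<std::size_t> > > tasks;
    const std::size_t chunk_size = (data.size() + chunks - 1) / chunks;
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t lo = c * chunk_size;
        const std::size_t hi = std::min(data.size(), lo + chunk_size);
        tasks.push_back(std::async(std::launch::async,
            [&data, shift, bin_count, lo, hi] {
                std::vector<std::size_t> counts(bin_count, 0);
                for (std::size_t i = lo; i < hi; ++i)
                    ++counts[(data[i] >> shift) % bin_count];
                return counts;
            }));
    }
    std::vector<std::size_t> total(bin_count, 0);
    for (auto& t : tasks) {
        std::vector<std::size_t> part = t.get();
        for (std::size_t b = 0; b < bin_count; ++b)
            total[b] += part[b];
    }
    return total;
}
```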

Simonson, Lucanus J wrote:
Phil wrote:
Having said all that, personally I have little interest in an algorithm
that has no benefit for fewer than millions of items. What do other people think about that aspect?
[SNIP]
hmmm - have you looked at postman's sort. This was the subject of an article in 1992 C user's journal and subsequently available as a commercial product. see www.rrsd.com. Robert Ramey

hmmm - have you looked at postman's sort. This was the subject of an article in 1992 C user's journal and subsequently available as a commercial product. see www.rrsd.com.
Yes. As I believe I mentioned in a previous posting, string_sort will probably look much like postman's sort; the main difference would probably be worst-case behavior handling. If you (or anyone else) would like to write a generic string_sort, you're welcome to. I'm mostly interested in doing it to create a suite of hybrid sorting algorithms.

Hi Phil, I haven't gotten around to making changes yet (weekends are when I have time), but I still have some feedback. - Use new and delete rather than *alloc and free.
- Can any exceptions occur? If so, is the dynamically allocated memory leaked? Could you use a smart pointer to avoid this?
Using *alloc, the allocation functions return NULL on failure, and that situation is handled. With new and delete I'll have to catch memory exceptions (and write a constructor for the Bin structure) and probably use smart pointers. I could also convert to using only a fixed-size bin array on the stack; if you look at what the default constants do, the bin array ends up being of size MAX_SPLITS. If it doesn't impact performance, I'll be happy to do it the C++ way. Otherwise I'll have to investigate what's going on in detail.
- Does it work with uint64_t? I see some variables declared as 'long', which often can't store a uint64_t.
There exists the problem of how to identify and deal with negatives. I try to avoid forcing any particular return data type, but in the case of divMin and divMax, they are used in performance-critical loops, so they should be used as constants without any extra computation. They are already divided by 512, so an unsigned should fit inside a signed data type without trouble. Should I use an int64_t? - If std::vector<int>::iterator really is slower than int* (and I suggest
some investigation of the assembler to see why this is), then you could add a specialisation that converts vector::iterators into pointers.
Good point. I can look at that, though it could be due to my ancient system. I see the same performance problem with std::sort, and for a vector that isn't vector<bool> (why would anybody sort a vector<bool>?) I should be able to convert to a pointer. I haven't read assembler since college, and I don't even remember how to obtain the assembler for C++ source. I do also have a Windows development environment if finding assembler on that is easier.
- Is there a use-case for a pure radix sort, without the std::sort fallback? i.e. are there types that can't easily provide a comparison function, or distributions where the std::sort fallback is never useful?
A stable, external LSB Radix Sort could be templated by overriding the & operator, and passing in the key size in bits as an argument. This would be particularly useful to those who need a stable sort, and for those with small keys (32 bits or less) and memory to spare, it should be quite fast for large n, running in O(n(keysize/bincount)) operations, and without recursion. The ideal bincount would be close to Spreadsort's MAX_SPLITS, probably between 9 and 11. It would probably be best to have a radix_sort and a float_radix_sort. I make no claim to be an expert on LSB Radix sort, and would welcome someone else writing it. If I did write it, I'd probably take some fast open-source implementation and just template it. Steve
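The stable external LSB radix sort described above can be sketched generically: templated on a type supporting >> and &, with the key size in bits passed as an argument. This is an illustrative sketch for unsigned keys only (negatives would need the sign handling discussed elsewhere in the thread); radix_bits plays the role of MAX_SPLITS.

```cpp
#include <cstddef>
#include <vector>

// Stable LSB radix sort using an external buffer: one counting pass
// and one stable scatter pass per radix_bits-wide digit, no recursion.
template <class T>
void lsb_radix_sort(std::vector<T>& data, unsigned key_bits,
                    unsigned radix_bits = 8) {
    const std::size_t bins = std::size_t(1) << radix_bits;
    std::vector<T> buffer(data.size());
    for (unsigned shift = 0; shift < key_bits; shift += radix_bits) {
        std::vector<std::size_t> count(bins, 0);
        for (std::size_t i = 0; i < data.size(); ++i)
            ++count[(data[i] >> shift) & (bins - 1)];
        // prefix sums give each bin's starting offset
        std::size_t offset = 0;
        for (std::size_t b = 0; b < bins; ++b) {
            const std::size_t c = count[b];
            count[b] = offset;
            offset += c;
        }
        // stable scatter into the external buffer
        for (std::size_t i = 0; i < data.size(); ++i)
            buffer[count[(data[i] >> shift) & (bins - 1)]++] = data[i];
        data.swap(buffer);
    }
}
```

Runtime is O(n * key_bits / radix_bits), matching the O(n(keysize/bincount)) figure above when "bincount" is read as bits per pass.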

Steven Ross wrote:
- Use new and delete rather than *alloc and free. - Can any exceptions occur? If so, is the dynamically allocated memory leaked? Could you use a smart pointer to avoid this?
I was referring to exceptions in general between allocation and de-allocation, not just during allocation itself.
Using *alloc, they should return NULL if allocation failed, and that situation is handled. With new and delete I'll have to catch memory exceptions
See new(nothrow).
- Does it work with uint64_t? I see some variables declared as 'long', which often can't store a uint64_t.
There exists the problem of how to identify and deal with negatives. I try to avoid forcing any particular return data type
Why not just use Iterator::value_type (or whatever it's called; or some iterator_traits thing) everywhere?
They are already divided by 512, so an unsigned should fit inside a signed data type without trouble.
Dividing a 64-bit value by 512 does not make it small enough to fit in a 32-bit long. (Have I misunderstood you?) Phil.

On Tue, Jul 8, 2008 at 9:25 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
Steven Ross wrote:
- Use new and delete rather than *alloc and free.
- Can any exceptions occur? If so, is the dynamically allocated memory leaked? Could you use a smart pointer to avoid this?
I was referring to exceptions in general between allocation and de-allocation, not just during allocation itself.
I guess the user-defined types could have exceptions in their >> and < methods, so I'll have to make it exception-safe. Good point.
Using *alloc, they should return NULL if allocation failed, and that
situation is handled. With new and delete I'll have to catch memory exceptions
See new(nothrow).
Right, thanks for the reminder. That + smart pointers should make sure there are no memory leaks with an exception. I could use a try-catch, but std::sort doesn't, so I'll just let the user catch exceptions from their own operator overloads.
- Does it work with uint64_t? I see some variables declared as 'long',
which often can't store a uint64_t.
There exists the problem of how to identify and deal with negatives. I try to avoid forcing any particular return data type
Why not just use Iterator::value_type (or whatever it's called; or some iterator_traits thing) everywhere?
They are already divided
by 512, so an unsigned should fit inside a signed data type without trouble.
Dividing a 64-bit value by 512 does not make it small enough to fit in a 32-bit long. (Have I misunderstood you?)
I was suggesting using an int64_t, which will hold a 64-bit value, so yes, there was a misunderstanding. The problem is that I need to use the return type of the user's >> method, and with the same code supporting any-size data and both signed and unsigned integers, I need a data type that can support all the different possibilities. Is there some way I can grab the return type of the user's >> method and use it directly? Otherwise, I think int64_t should work as long as it's fast until people start using 128-bit values.

Steven Ross wrote:
- Does it work with uint64_t? I see some variables declared as 'long', which often can't store a uint64_t.
I was suggesting using an int64_t The problem is that I need to use the return type of the user's >> method
I think it's reasonable to assume that T::operator>>(int) returns T, for integer and integer-like classes. (Can anyone think of any exceptions to that? Has anyone defined an integer-like concept that covers this yet?) Hence my comment about using the iterator::value_type:
Why not just use Iterator::value_type (or whatever it's called; or some iterator_traits thing) everywhere?
Phil.

On Jul 8, 2008, at 5:01 PM, Steven Ross wrote:
I was suggesting using an int64_t, which will hold a 64-bit value, so yes, there was a misunderstanding. The problem is that I need to use the return type of the user's >> method, and with the same code supporting any-size data and both signed and unsigned integers, I need a data type that can support all the different possibilities. Is there some way I can grab the return type of the user's >> method and use it directly? Otherwise, I think int64_t should work as long as it's fast until people start using 128-bit values.
Why not use boost::uintmax_t? -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

On Wed, Jul 9, 2008 at 12:42 AM, Daryle Walker <darylew@hotmail.com> wrote:
Why not use boost::uintmax_t?
All these type problems can be resolved by using a trick from the standard library. I call >> on the first element inside the integer_sort call, and then template the rest of the calls on the type of the returned class. This extra call is made exactly once, and Spreadsort isn't used on less than 4000 elements, so the runtime overhead is constant and trivial. I've also added a 1% speed improvement, based upon something I learned from investigating Phil's testcase, and verified across my test suite. It's not much, but they add up. Spreadsort is not done being optimized. I'm not done implementing Phil's suggestions, but the latest version is attached.
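The "template the rest of the calls on the type returned by >>" trick can be shown in miniature. These names are illustrative stand-ins, not the actual Spreadsort internals: the outer call applies >> to the first element exactly once, the compiler deduces the key type from that single expression, and the rest of the work is templated on it.

```cpp
#include <cstddef>
#include <vector>

// The real algorithm would use Div for all its bin arithmetic; here
// the implementation just reports the deduced type's size.
template <class Iter, class Div>
std::size_t sort_impl(Iter, Iter, Div) {
    return sizeof(Div);
}

template <class Iter>
std::size_t deduce_key_size(Iter first, Iter last) {
    if (first == last) return 0;
    return sort_impl(first, last, *first >> 0);  // the one extra >> call
}
```

Because built-in >> yields the (promoted) left operand's type, a range of long long deduces a 64-bit key type while a range of int deduces a 32-bit one, with no hard-coded 'long' anywhere.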

Hi Phil, My responses are inline. On Mon, Jul 7, 2008 at 11:53 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
Some comments, mostly minor things I think:
- Boost generally uses .hpp for header files.
Done
- Identifiers (including macros) starting with _ are reserved. - I don't recall seeing much Hungarian Notation in Boost.
I can remove the Hungarian notation. Is there any suggested notation, or should I just use names that make sense, and try to be similar to standard libraries? __last and __first were copied from std::sort. Is there a good reason I shouldn't use the same notation?
- Some way of sorting structs, or pointers to structs, by passing some sort of key access functor as an additional template parameter is needed.
I'll do that as soon as the rest of this is acceptable. There is no sense in copying code that isn't ready yet (and the additional template parameter is handled by a duplicate method). - Use new and delete rather than *alloc and free. Done. The runtime seems more variable now, but on average is the same within my margin of error. - Can any exceptions occur? If so, is the dynamically allocated memory
leaked? Could you use a smart pointer to avoid this?
I'd like to use a smart pointer, but is there one that will call delete[]? As this is a very simple situation, I could also write my own trivial one, copying part of the implementation out of More Effective C++. - If you re-organise the recursion so that SpreadSortCore calls
SpreadSortBins directly, rather than first returning to SpreadSortRec, can you limit the scope of the bins to SpreadSortCore?
Yes. In fact, I eliminated both SpreadSortCore and SpreadSortBins, now that SpreadSort is just a wrapper around SpreadSortRec (I found that had a slightly positive impact on performance), so SpreadSortRec calls itself and the BinArray is scoped just to it.
- Does it work with uint64_t? I see some variables declared as 'long', which often can't store a uint64_t.
I used the templating off a return value trick from the standard library to handle this.
- I look forward to seeing float and string versions.
Once integer_sort is finished, float_sort shouldn't take long. I want it finished first, to avoid modifying the same code in multiple places. string_sort will come after that.
- If std::vector<int>::iterator really is slower than int* (and I suggest some investigation of the assembler to see why this is), then you could add a specialisation that converts vector::iterators into pointers.
Is there a simple way to check if an iterator is a vector iterator? Thanks again. Steve

Steven Ross wrote:
I can remove the Hungarian notation. Is there any suggested notation, or should I just use names that make sense, and try to be similar to standard libraries? __last and __first were copied from std::sort. Is there a good reason I shouldn't use the same notation?
[lib.global.names] 17.4.3.1.2 Global names
1 Certain sets of names and function signatures are always reserved to the implementation:
- Each name that contains a double underscore (__) or begins with an underscore followed by an upper-case letter (2.11) is reserved to the implementation for any use.
- Each name that begins with an underscore is reserved to the implementation for use as a name in the global namespace. 165)

__last and __first were copied from std::sort. Is there a good reason I shouldn't use the same notation?
Yes; you're not writing the standard library.
I'd like to use a smart pointer, but is there one that will call delete[]?
With the way that you've now structured the code, I think your choice is between a boost::scoped_array (the difference between scoped_ptr and scoped_array is delete vs. delete[]), or to avoid dynamic allocation by using a std::vector. If the number of bins is knowable at compile time then a std::tr1::array or C-style array is another choice, but some people might complain about having so much data on the stack.
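Of those options, the std::vector route gives delete[]-style cleanup for free and is exception-safe by construction: the bins are reclaimed on every exit path, including when a user-supplied >> or < operator throws mid-sort. A toy sketch (Bin is a stand-in for the real bin structure):

```cpp
#include <cstddef>
#include <vector>

struct Bin { std::size_t count; };

// The vector is destroyed (and its storage freed) on every exit path,
// normal return or exception, with no explicit delete[] anywhere.
inline std::size_t demo_bins(std::size_t n) {
    std::vector<Bin> bins(n);  // value-initialized: counts start at 0
    for (std::size_t i = 0; i < n; ++i)
        bins[i].count = i;
    return bins.back().count;
}
```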
Is there a simple way to check if an iterator is a vector iterator?
Something like this (probably full of syntax errors....)

template <typename Iter>
void SpreadSort(Iter begin, Iter end)
{ .... generic code .... }

template <typename T>
void SpreadSort< std::vector<T>::iterator >(
    std::vector<T>::iterator begin,
    std::vector<T>::iterator end)
{ SpreadSort(&(*begin), &(*end)); }

I have previously asked, probably on this list, about whether there's a good way to detect iterators that point to contiguous storage. The answer seems to be that there isn't, except for explicitly detecting those that are guaranteed to do so (as above). Phil.

AMDG Phil Endecott wrote:
Something like this (probably full of syntax errors....)
template <typename Iter> void SpreadSort(Iter begin, Iter end) { .... generic code .... }
template <typename T> void SpreadSort< std::vector<T>::iterator >( std::vector<T>::iterator begin, std::vector<T>::iterator end) { SpreadSort(&(*begin), &(*end)); }
That doesn't work. The compiler can't deduce T. In Christ, Steven Watanabe

Steven Watanabe wrote:
Phil Endecott wrote:
Something like this (probably full of syntax errors....)
template <typename Iter> void SpreadSort(Iter begin, Iter end) { .... generic code .... }
template <typename T> void SpreadSort< std::vector<T>::iterator >( std::vector<T>::iterator begin, std::vector<T>::iterator end) { SpreadSort(&(*begin), &(*end)); }
That doesn't work. The compiler can't deduce T.
True. Well, in this case since the SpreadSort template only works for integers then Steve could just specialise for std::vector<char,short,int,long,long long,etc>. Otherwise I guess something using enable_if is possible, but I'm not going to stick my neck out and guess how to write it with all these experts reading... Phil.
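One way to get the effect Phil is after without the non-deducible partial specialization is to overload on the container itself; a sketch, with std::sort standing in for the real SpreadSort body and all names illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Generic iterator path: any random-access range.
template <typename Iter>
void spread_sort(Iter first, Iter last) {
    std::sort(first, last);  // placeholder for the real algorithm
}

// Container overload: std::vector storage is contiguous, so hand the
// pointer-based path raw pointers directly.  T is deduced here, which
// it cannot be from std::vector<T>::iterator.
template <typename T>
void spread_sort(std::vector<T>& v) {
    if (!v.empty())
        spread_sort(&v[0], &v[0] + v.size());
}
```

This sidesteps the deduction problem entirely, at the cost of a second entry point in the interface.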

I've cleaned up the names so as not to use a leading _ for anything except the #ifndef and not use Hungarian notation. On Thu, Jul 10, 2008 at 9:21 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
I'd like to use a smart pointer, but is there one that will call delete[]?
With the way that you've now structured the code, I think your choice is between a boost::scoped_array (the difference between scoped_ptr and scoped_array is delete vs. delete[]), or to avoid dynamic allocation by using a std::vector. If the number of bins is knowable at compile time then a std::tr1::array or C-style array is another choice, but some people might complain about having so much data on the stack.
I decided to use a vector, and while I was at it, use a single vector for the entire integer_sort call, so as to limit the number of allocations/deallocations. I did use the &(vec[index]) trick to get the bins, but the vector is defined to be based upon an array, so that works. If I only used indices into the vector, I would need an extra addition in the inner swap loop, which is expensive in terms of performance (about 2%).
Is there a simple way to check if an iterator is a vector iterator?
I have previously asked, probably on this list, about whether there's a good way to detect iterators that point to contiguous storage. The answer seems to be that there isn't, except for explicitly detecting those that are guaranteed to do so (as above).
My conclusion on the * usage for vector sorting is that the best way to handle it would be to provide a vector::integer_sort member (much as std::list provides its own sort), and people "in the know" can use the &(vec[0]) trick. Are there any more suggestions for the core algorithm? Once this is finalized, I'll add functors, and once that's finalized, I'll move on to float_sort. Steve

on Fri Jul 11 2008, "Steven Ross" <spreadsort-AT-gmail.com> wrote:
I've cleaned up the names so as not to use a leading _ for anything except the #ifndef
That one's illegal too. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Steven Ross wrote:
An embedded and charset-unspecified text was scrubbed... Name: spreadsort.hpp URL: <http://lists.boost.org/MailArchives/boost/attachments/20080711/c5ef0f58/attachment.ksh> (Archive link not working, as usual.)
<vector.h> --> <vector>
vector<> --> std::vector<>

Are you certain that resizing the bins_cache vector when you do is safe, from the point of view of subsequent iterations of the loop which continue to use a pointer to the first element from before the resize()? Phil.

On Fri, Jul 11, 2008 at 10:43 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
Are you certain that resizing the bins_cache vector when you do is safe, from the point of view of subsequent iterations of the loop which continue to use a pointer to the first element from before the resize()?
Oops. Good point. The bins pointer is no longer safe as soon as a recursive call is made; it didn't crash because reallocs didn't change the vector location. I should use the vector (which will stay valid) when I start making recursive calls, not the bins pointer, and the performance impact should be trivial, as that's a loop over the bins (n/512), not all items as the other loops are. I'll fix that. I'll also fix the minor syntax issues you and David pointed out. Anything else?

Steven Ross wrote:
Are there any more suggestions for the core algorithm?
Log2 can be done in O(log Nbits) steps, rather than O(Nbits).

I spent a while trying to understand this bit of code:

//Swap into place
//This dominates runtime, especially in the swap; std::sort calls are the other main time-user
RandomAccessIter current = first;
for(unsigned u = 0; current != last; ++u) {
  for(current_bin = (bins + ((*current >> log_divisor) - div_min));
      (current_bin->count > u);
      current_bin = bins + ((*current >> log_divisor) - div_min))
    std::swap(*current, *(current_bin->current_position++));
  //Now that we've found the item belonging in this position, increment the bucket count
  if(current_bin->current_position == current++)
    ++(current_bin->current_position);
}

You don't help understanding by changing the meaning of bin.count half way through :-(.

Anyway, I think you're doing more work than necessary here. For each bin, the elements between the start of the bin and bin.current_position will be correct and can be skipped over; this will be on average half of the elements. This doesn't save any swaps, but it does save the shifting and some comparisons. Pseudo-code:

for each bin {
  for each element between this bin's current_position and its end {
    while (1) {
      determine the bin this element should be in
      if it's in the right bin { break }
      else {
        swap it with the element at the right bin's current_position
        increment that bin's current_position
      }
    }
  }
}

Have I understood this correctly? Phil.

On Sat, Jul 12, 2008 at 8:48 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
Steven Ross wrote:
Are there any more suggestions for the core algorithm?
Log2 can be done in O(log Nbits) steps, rather than O(Nbits).
I agree, but do you know a fast way to do so?
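The usual trick is to binary-search the bit position by halving the shift, giving O(log Nbits) steps; a sketch (the name `rough_log2` is illustrative, and — worth noting — later in this thread Steven reports that a binary-search-style roughlog2 actually ran slower than the simple loop on his compiler, so it needs benchmarking):

```cpp
#include <cassert>

// floor(log2(x)) for x > 0 in five steps instead of up to 32 shifts:
// test the top half of the remaining bits, shift it down if non-empty,
// and accumulate the shift amounts.
inline unsigned rough_log2(unsigned x) {
    unsigned result = 0;
    if (x >> 16) { x >>= 16; result += 16; }
    if (x >> 8)  { x >>= 8;  result += 8; }
    if (x >> 4)  { x >>= 4;  result += 4; }
    if (x >> 2)  { x >>= 2;  result += 2; }
    if (x >> 1)  {           result += 1; }
    return result;
}
```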
I spent a while trying to understand this bit of code:
//Swap into place
//This dominates runtime, especially in the swap; std::sort calls are the other main time-user
RandomAccessIter current = first;
for(unsigned u = 0; current != last; ++u) {
  for(current_bin = (bins + ((*current >> log_divisor) - div_min));
      (current_bin->count > u);
      current_bin = bins + ((*current >> log_divisor) - div_min))
    std::swap(*current, *(current_bin->current_position++));
  //Now that we've found the item belonging in this position, increment the bucket count
  if(current_bin->current_position == current++)
    ++(current_bin->current_position);
}
You don't help understanding by changing the meaning of bin.count half way through :-(.
Would you like to
Anyway, I think you're doing more work than necessary here. For each bin, the elements between the start of the bin and bin.current_position will be correct and can be skipped over; this will be on average half of the elements. This doesn't save any swaps but it does save the shifting and some comparisons. Pseudo-code:
for each bin {
  for each element between this bin's current_position and its end {
    while (1) {
      determine the bin this element should be in
      if it's in the right bin { break }
      else {
        swap it with the element at the right bin's current_position
        increment that bin's current_position
      }
    }
  }
}
Have I understood this correctly?
Good analysis. Yes, you have, but there's a much faster way to do that. I already tried your technique some time back, and it works well on my system, but because of the extra nested loop it does not optimize well on Intel processors (at least the way I coded it). There's a better way to do it, attached, and this time I ran my regression tests, and it's about 11% faster (they're not quite done yet). Basically I did what you're talking about, but without the extra loop over the bins, just adding the else case:

for(unsigned u = 0; current != last;) {
  for(current_bin = (bins + ((*current >> log_divisor) - div_min));
      (current_bin->count > u);
      current_bin = bins + ((*current >> log_divisor) - div_min))
    std::swap(*current, *(current_bin->current_position++));
  //Now that we've found the item belonging in this position, increment the bucket count
  if(current_bin->current_position == current++) {
    ++(current_bin->current_position);
    ++u;
  }
  else { //skipping forward to the first unfinished element in the bin
    current = current_bin->current_position;
    u = current_bin->current_position - first;
  }
}

I'll try your way again just to see if it's any faster than my newest optimization. Steve

Steven Ross wrote:
Are there any more suggestions for the core algorithm?
for each bin {
  for each element between this bin's current_position and its end {
    while (1) {
      determine the bin this element should be in
      if it's in the right bin { break }
      else {
        swap it with the element at the right bin's current_position
        increment that bin's current_position
      }
    }
  }
}

I'm considering the case where the type is relatively large and swap takes time proportional to its size, e.g. a large struct. Currently your code (and my version above) will, on average, swap each element twice: once to what's probably the wrong bin and then into the right bin. If swap itself uses a temporary then each element is copied on average 3 times. Maybe the compiler can optimise some of that, and the cache may make some of those copies faster than others. But I guess that it can probably be improved to nearer one copy per element. Here's the idea:

We want to move element A to the position currently occupied by element B. We could swap them, but moving B to A's position is probably not useful. So let's first move B to a good position for it, and then move A into its space. Of course some other element C will be occupying the position where we want to put B, so we have to keep going until we find an element that wants to be in the position where A was.

It's relatively easy to code a recursive version of this:

void circulate(Iter i) {
  // Do an N-way circular swap that eventually puts the element at i in its right place,
  // and puts an element that wants to be at i there.
  *i = circulate_rec(i, i);
}

T circulate_rec(Iter source, Iter current) {
  // Move the element at current to its right position, and return an element
  // that can be stored at source.
  // Find current's right position:
  Iter target = right_bin_for(*current).current_position++;
  // Base case: if the element at target would be happy at position source, that's easy:
  if (right_bin_for(*target) == bin_of(source)) {
    tmp = *target;
  } else {
    // Otherwise, first move target to a good position
    tmp = circulate_rec(source, target);
  }
  // Now actually copy the element at current to the vacant target
  *target = *current;
  return tmp;
}

The problem with that is that you might recurse a lot and fill your stack, so in practice you'd want to limit it to a few iterations and then give up and put a "wrong" value back at the source. Also you need to be certain that 'tmp' gets some sort of return value optimisation and isn't copied during each return. An explicitly unrolled iterative version would probably be best:

Iter a = ...;
Iter b = right_bin_for(*a).current_position++;
if (right_bin_for(*b) != bin_of(a)) {
  Iter c = right_bin_for(*b).current_position++;
  if (right_bin_for(*c) != bin_of(a)) {
    Iter d = right_bin_for(*c).current_position++;
    if (right_bin_for(*d) != bin_of(a)) {
      Iter e = right_bin_for(*d).current_position++;
      tmp = *e;
      *e = *d;
    } else {
      tmp = *d;
    }
    *d = *c;
  } else {
    tmp = *c;
  }
  *c = *b;
} else {
  tmp = *b;
}
*b = *a;
*a = tmp;

Whether it actually improves performance will depend on quite how large the structs are. It certainly won't help with ints. So let's see your version that sorts structs! Phil.
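The cycle-following idea can be shown self-contained with a toy in which an element's "right bin" is simply its value's index; this is an illustrative sketch of the one-copy-per-placement scheme, not code from the thread's attachments:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy cycle-following placement: for a permutation of 0..n-1, element v
// belongs at index v.  Each cycle holds one displaced value "in hand"
// and drops it into its home, picking up whatever was there — one copy
// per element placed, plus one temporary per cycle, instead of the ~3
// copies per element that repeated swapping costs.
void place_by_cycles(std::vector<int>& a) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        int in_hand = a[i];
        while (static_cast<std::size_t>(in_hand) != i) {
            int displaced = a[in_hand];  // occupant of in_hand's home
            a[in_hand] = in_hand;        // drop in_hand into its home
            in_hand = displaced;         // keep following the cycle
        }
        a[i] = in_hand;                  // cycle closes back at i
    }
}
```

In the real algorithm the "home" lookup would be the bin computation and `current_position` bookkeeping rather than the value itself.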

On Sat, Jul 12, 2008 at 11:20 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
We want to move element A to the position currently occupied by element B. We could swap them, but moving B to A's position is probably not useful. So let's first move B to a good position for it, and then move A into its space. Of course some other element C will be occupying the position where we want to put B, so we have to keep going until we find an element that wants to be in the position where A was.
Whether it actually improves performance will depend on quite how large the structs are. It certainly won't help with ints. So let's see your version that sorts structs!
This is an excellent idea. You need n + 2 copies to definitely put n elements in their correct position, and if you're lucky, put n + 1 elements in the correct position. With a swap, n = 1, so you need 3 copies. Most of the speedup will probably occur with a small n (n = 4 to 8?). I will try an explicitly unrolled loop with a small number of elements to see how well this works. I have the struct code commented out in sample.c, but there is no code change in spreadsort.hpp required to handle larger structs.

I don't really need to swap the elements into position; what I need is the fastest way to rearrange elements that are not in the correct order, but whose destinations I know, in a minimum number of copies and checks for the correct position. The prior optimization we discussed, where I posted my new code for the swap loop, gave a net 16% speed improvement across my regressions, to the point where my regressions ran in less than half the time that std::sort took on them (for 4e7 elements), so we're getting substantial improvement. I intend to try your version of that change at some point, but I'll have to verify that it also works well on Intel processors, and I'm skeptical from prior experience. Your latest suggestion is more promising and novel, so I will try it first.

Your idea may even make an improvement for integers, if I code it right, because if you think about it, most of the swapping time is spent doing roughly 512 swaps before an appropriate element for the position is found. If I can find a way to parameterize the number of copies "n" based upon sizeof(*RandomAccessIter), then I could tune this to the size of the data type, but even without that, a small number of iterations of this approach should get a sizeable speedup across all data types.
I had considered some ideas of this type in the past, but got stuck on the extra memory required, and thought that might add overhead, but with a small constant number of elements, the overhead should be minimal. In effect, this placement algorithm (as improved with my last posting) works in two stages:

First stage, repeated n/512 times: swap until the right element is in position (roughly 512 times), then go to the next element. This stage should dominate runtime now, and it's the one your latest suggestion would improve.

Second stage: run along the bins, skipping to the first element that is out of place, and do a few swaps for each out-of-place element until it is in the correct position. This stage is now very fast.

I was working on fixing my worst-case calculation to be more accurate, which is most important for worst-case performance, but gave a consistent .1-.2% improvement in my regressions, and gave LOG_CONST more meaning. I'll include that fix with my next posting, but I intend to try out your idea next time I work on integer_sort, and I expect to see a substantial improvement. Thanks again Phil. I think when we're done optimizing we'll have a much faster algorithm. Steve

On Sat, Jul 12, 2008 at 11:20 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
The problem with that is that you might recurse a lot and fill your stack, so in practice you'd want to limit it to a few iterations and then give up and put a "wrong" value back at the source. Also you need to be certain that 'tmp' gets some sort of return value optimisation and isn't copied during each return. An explicitly unrolled iterative version would probably be best:
Whether it actually improves performance will depend on quite how large the structs are. It certainly won't help with ints. So let's see your version that sorts structs!
Hi Phil, I implemented this change, using a 3-way swap (n = 2; 4 copies, 2 items placed in the right position) and sent it through my regressions, and saw a 7% speed improvement. It required also iterating over the bins, as you originally suggested. What I also discovered was that I could get a 6% speed improvement on my original algorithm just by replacing std::swap with an explicitly inlined swap and changing which item gets assigned to tmp in the swap (the order matters!), so the improvement (on integers) of your multi-way swap on its own is about 1%. By doing the 3-way swap I should get the majority of the speedup of your technique, as the marginal speedup rapidly degrades as you increase n. For my testing on integers, a 4-way swap seemed slower than a 3-way swap; my guess is that the 3-way swap is easier to optimize. I will upload two sample apps to www.sourceforge.net/projects/spreadsort in the file release; one with integers, and one with integer + data. I've attached two files: philspreadsort.hpp spreadsort.hpp (plus an updated constants.h) I'd welcome people comparing to see which is faster, and by how much, on various platforms. I'll note that I can add a memory optimization to philspreadsort.hpp (count no longer has to be in the bins for speed), but haven't done so for now. Assuming philspreadsort.hpp is at least as fast on average as spreadsort.hpp, I'll run with philspreadsort.hpp, on the assumption that the swapping of larger data types will be faster. We're getting close to making integer_sort faster than std::sort for 1000 elements! Steve
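The 3-way swap being compared here is, in essence, a three-element rotation done in four copies; a minimal sketch of the primitive, with an illustrative name (the posted philspreadsort.hpp presumably inlines this into the placement loop):

```cpp
#include <cassert>

// Three-element circular rotation: 4 copies total (n = 2 in the
// "n + 2 copies to place n elements" accounting), versus 6 copies for
// two std::swaps.  After the call: a gets b's value, b gets c's, c gets a's.
template <typename T>
inline void rotate3(T& a, T& b, T& c) {
    T tmp = a;  // copy 1: save the value displaced out of a
    a = b;      // copy 2
    b = c;      // copy 3
    c = tmp;    // copy 4: the saved value lands in the vacated slot
}
```

The assignment order matters for the optimizer, which matches Steven's observation that which item gets assigned to tmp changes the measured speed.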

On Sat, Jul 12, 2008 at 11:20 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
> We want to move element A to the position currently occupied by element B. We could swap them, but moving B to A's position is probably not useful. So let's first move B to a good position for it, and then move A into its space. Of course some other element C will be occupying the position where we want to put B, so we have to keep going until we find an element that wants to be in the position where A was.
>
> The problem with that is that you might recurse a lot and fill your stack, so in practice you'd want to limit it to a few iterations and then give up and put a "wrong" value back at the source. Also you need to be certain that 'tmp' gets some sort of return value optimisation and isn't copied during each return. An explicitly unrolled iterative version would probably be best.
>
> Whether it actually improves performance will depend on quite how large the structs are. It certainly won't help with ints. So let's see your version that sorts structs!

I found that an explicitly inlined swap was as fast as your 3-way swap and multi-level loop (combined) on Intel systems for integers, but since the 3-way swap should be faster with larger data types, and the 3-way swap is faster on my system, I'm running with it. I found that the primary time-limiting steps are: 1) the random accesses used to find the "remote" element in swapping, and 2) bin lookups.

The problem with a multi-way swap is that it risks wasting its last bin lookup, and does no fewer bin lookups per item put in the correct place. The random accesses aren't avoided either; the main speedup is from reducing the processing overhead of the actual copy operation, which is minor with ints. Since I saw a small but tangible speedup with a 3-way swap (but not 4-way or higher), I ran with it.
I made some other speed enhancements that are possible with a multi-level loop, including splitting up the bins into separate pointers and sizes. It would be nice if I could tell the processor to cache the sizes first, then dump them from the core cache back to a lower-level cache and cache the pointers; I'm not aware of any way to do so. This splitting optimization, in addition to improving runtime, reduces the memory overhead of the algorithm. I also think it's more readable, as the meaning of the sizes doesn't change.

I'm not aware of any other optimizations to try; I've made all of the ones that worked, and dumped the ones that didn't. I tried an explicit binary-search-style roughlog2, but found that it actually hurt performance; I think the control structures are much slower than a very simple while loop of at most 32 steps. If anyone has suggestions for improvement, I'm open to them. Otherwise, I will consider this the final version of the core algorithm, and if acceptable, will move on to adding functor support, float_sort, and (eventually) string_sort. I have seen a net 22% speed improvement relative to the first version I posted on my system; a significant portion of this is due to Phil's suggestions, so thanks Phil! Steve

I've added support for functors and verified it with my keyplusdatasample.c, available from http://sourceforge.net/project/showfiles.php?group_id=229739 (spreadsort project). I've also attached the updated integer_sort source. Some design decision notes, which I'd welcome comments on:

I'm using pointers to iterate over the bins where I can safely do so (everywhere but recursion). For some pointer values, the bins pointer can be offset by subtracting div_min from it ahead of using it for bin lookups. This shows a noticeable speed improvement (around 2%), but requires checking whether this offset is safe to do, is a hack, and requires an if/else with largely duplicate code. My conclusion is that such a hack is not appropriate for Boost code.

One disadvantage I found with the 3-way swap is that it requires a public default constructor data_type() to be efficient. If I code it so that it only uses a copy constructor (by moving the creation of tmp inside the if and else, and thus making most of the code dependent upon the conditional check), it is substantially slower. Some object models deliberately eliminate the public default constructor. To stay more general, should I use the conventional 2-way swap approach, even though it is slightly (around 1%) slower?

I'm only using constructs that compile on my old g++ compiler. That means I can't use some of the newer additions to the standard, but I'm doing this in the interest of being able to compile on as many systems as possible.

After all my optimizations, the impact of pointers vs. random access iterators has increased as a percentage of runtime (it's over 10% now). I'd be interested to hear if this issue is visible on other systems, or just mine, and other suggestions for what might be the problem. I tried an earlier suggestion of a compiler option, and it had no speed impact, though that might be because my compiler treats that option differently.

I'll start working on float_sort now.

My current float_sort correctly sorts positive floats with a speedup comparable to integer_sort's. I'll deal with negative floats soon, but one issue popped up in testing that I wasn't expecting: NaN's integer representation is a high positive number, yet with the float < operator it is treated as a very small (smallest?) number. Unless I special-case NaN, the results of std::sort and float_sort will differ. Do I need to special-case NaN? Also, there will only be a functor version of float_sort; I don't want to confuse the normal float >> (whatever that does) with the operation Spreadsort requires, and in my testing a good functor doesn't make a noticeable performance difference versus a member function. Given these issues, and the fact that the speedup is no greater than for integer_sort, is float_sort still of interest? Steve

Here's float_sort, which requires a RightShift functor. For this functor, RightShift(x, 0) >> n must equal RightShift(x, n), which is stricter than what integer_sort requires. float_sort sorts floats as integers, but flips the negative bins around to get the exact same sorting order as std::sort. Positive floats sort fine via integer_sort. A functor is required because of the cast operation; I don't see any generic way to code a general cast operation. The RightShift functor looks like this:

struct rightshift {
  int operator()(const DATATYPE &x, const unsigned offset) const {
    return *((int *)&x) >> offset;
  }
};

Note the cast. NaNs get dumped in the last bucket by default, without special-casing. std::sort dumps them randomly in the file because they evaluate as equivalent to everything, which is a little odd, and behavior I have no intention of emulating.

In my testing, float_sort takes 61% of the time that std::sort takes on 40MB files, so the speedup is substantial. float_sort takes about 10% longer than integer_sort, while std::sort takes about 20% longer on floats than ints. I'll pretty this up a bit and add a version that takes a comparison functor, then move on to string_sort, which will take more work. That should complete the sorting library I intend to propose adding to Boost. Steve
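For reference, the order-preserving key transform that float_sort's bit-level view relies on can be sketched self-contained. This one-pass flip (invert all bits of negatives, set the sign bit of positives) is a common alternative to sorting as signed ints and reversing the negative bins afterwards, as float_sort does; the name `float_key` is illustrative, and it assumes 32-bit IEEE floats:

```cpp
#include <cassert>
#include <cstring>

// Map a float's bits to an unsigned key whose integer ordering matches
// float ordering (NaNs excluded).  Negative floats are sign-magnitude,
// so they sort backwards as raw bits; inverting all their bits fixes
// both the order within negatives and their position below positives.
inline unsigned int float_key(float x) {
    unsigned int u;
    std::memcpy(&u, &x, sizeof(u));  // aliasing-safe bit copy
    if (u & 0x80000000u)
        return ~u;                   // negative: reverse the order
    return u | 0x80000000u;          // positive: place above all negatives
}
```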

AMDG Steven Ross wrote:
A functor is required because of the cast operation; I don't see any generic way to code a general cast operation.
How about using a traits template?
NaNs get dumped in the last bucket by default, without special-casing. std::sort dumps them randomly in the file because they evaluate as equivalent to everything, which is a little odd, and behavior I have no intention to emulate.
The standard states that, "For the algorithms to work correctly, comp has to induce a strict weak ordering on the values." This is not true for NaNs, therefore, IMO, you don't need to worry about them. In Christ, Steven Watanabe

On Tue, Jul 29, 2008 at 8:39 PM, Steven Watanabe <watanabesj@gmail.com>wrote:
Steven Ross wrote:
A functor is required because of the cast operation; I don't see any generic way to code a general cast operation.
How about using a traits template?
That's worth considering. Can I use is_floating_point to mean that the entire value_type can be safely cast to an integer? I could cast a built-in float/double type to a built-in signed integer type of the same size, but this wouldn't work for user-defined data types being sorted on a float key (unless I cast the result of their user-defined >> operator, which would be weird), so effectively it would be a special case for built-in data types. float_sort works by taking a signed-magnitude floating point number, casting it to a 2's complement integer, and then reversing the negative buckets to fix the order. Since the usage that I can come up with for a functor-free version is so specialized, I would prefer to skip it, but if the application is more general than I imagine, then I'll consider it seriously.

I've thought about some prior suggestions for making integer_sort/float_sort/string_sort apply to any forward_iterator, and came to this conclusion: Spreadsort (the underlying algorithm) is optimized on the assumption that memory allocation is slow and best minimized, random accesses are somewhat (but not greatly) slower than comparisons, and comparisons are slower than iterating across the provided iterators. What should be done if these assumptions are substantially off:

If memory allocation is fast and memory reasonably available, use an LSB radix sort or the original non-in-place variant of Spreadsort.
If random accesses are really slow, an entirely different algorithm is called for, such as the file sorting version of Spreadsort (not open source), or Mergesort.
If random accesses are fast, MAX_SPLITS can be eliminated (or made infinite) and Spreadsort will run in theta(n) time on continuous integrable function distributions.
If comparisons are extremely slow, use a pure radix sort, such as LSB radix sort; Spreadsort isn't optimized for this situation. If they're just a little slow, tune Spreadsort to prefer radix sorting.
If iteration is slow, then MAX_SPLITS should be eliminated, and the non-in-place Spreadsort should probably be used.

Based upon these optimization assumptions, I don't think my in-place Spreadsort algorithm is appropriate for non-random-access sorting in a situation where a programmer is using an appropriate data structure for their problem; a different and probably specialized algorithm would be called for. The basic idea behind integer_sort/float_sort/string_sort is to provide hybrid generic algorithms that are substantially faster for most common sorting problems than std::sort. The assumption is that for more specialized tasks, a different off-the-shelf algorithm will probably be better. Do you recommend any particular reference on traits, such as Mr. Abrahams' book?

The standard states that, "For the algorithms to work correctly, comp has to induce a strict weak ordering on the values." This is not true for NaNs, therefore, IMO, you don't need to worry about them.
Thanks for checking, Steve, I won't worry about them then.

AMDG Steven Ross wrote:
How about using a traits template?
That's worth considering. Can I use is_floating_point to mean that the entire value_type can be safely cast to an integer?
I'm thinking of something more like:

template<class T> struct default_right_shift;

template<> struct default_right_shift<float> {
  typedef rightshift type;
};

// ...

In Christ, Steven Watanabe

On Wed, Jul 30, 2008 at 10:46 AM, Steven Watanabe <watanabesj@gmail.com> wrote:
I'm thinking of something more like:
template<class T> struct default_right_shift;
template<> struct default_right_shift<float> { typedef rightshift type; };
Thanks for the suggestion Steve. I thought about it, and decided that asking the user to create a template for every override they wish to make is probably too much for the average user. Instead, I created a float_sort call that takes first, last, and a data type to cast to. Performance is not significantly superior to the one that takes a functor, and I added a wrapper casting_float_sort that casts the data type to an int (so the user only needs to pass first and last). Pointer type-casts that float_sort is dependent upon must be used with caution.

I also created a string_sort which correctly handles embedded null characters, with functor and reverse versions. I wonder if it would be a good idea to create a cstring_sort that treats a null character as a string terminator (and handles NULL pointers), which should be noticeably faster? This would enable an apples-to-apples comparison of speed with highly optimized C-string sorting algorithms. That said, I'm not sure whether cstring_sort is appropriate for Boost.

I've made some new optimizations. integer_sort and float_sort are both roughly twice as fast as std::sort on my test suite of randomized data with different distributions, and have roughly theta(n * square_root(log(n))) performance, so their relative speed slowly improves as data sets get larger, with a crossover point of roughly 1000 elements. string_sort has O(n * length) runtime much like Postal Sort (which it resembles), versus O(n * log(n) * length) for std::sort, so though it has significant constant overhead, it rapidly becomes faster as data sets become larger: a little less than twice as fast on random data, and roughly twice as fast on real text (a Dickens novel). string_sort sacrifices a small constant overhead on random data for superior performance in situations where there are long matching substrings; this does not impact its asymptotic performance, better or worse. string_sort is not appropriate for characters over 2 bytes in size, and is optimized for 1-byte characters.
The runtime crossover point for string_sort is roughly 32000 elements. All these algorithms are in-place, with less than linear memory overhead. Testing was done on 10 million element data sets for integer_sort and float_sort on OS X. For string_sort, the data read in was 40MB in size, but it was converted to strings of varying lengths, and additional embedded nulls were added for testing, so the actual data set size is larger.

I encountered a compiler optimization bug with float_sort (but not integer_sort) on Cygwin with -O2 and -O3 options. Does anyone know the best place to report that? For now, I'm having float_sort call std::sort on Cygwin.

I'm still running regression tests and doing a little commenting and testcase cleanup, but once all that is done my Spreadsort-based library suite should be done. Does this sound like something that is worth adding to Boost? I plan to upload source here when I'm done testing.

AMDG Steven Ross wrote:
I encountered a compiler optimization bug with float_sort (but not integer_sort) on Cygwin with -O2 and -O3 options. Does anyone know the best place to report that? For now, I'm having float_sort call std::sort on Cygwin.
I don't think this is a compiler bug. The standard forbids accessing floats as ints. In Christ, Steven Watanabe

On Mon, Dec 15, 2008 at 10:16 AM, Steven Watanabe <watanabesj@gmail.com> wrote:
AMDG
Steven Ross wrote:
I encountered a compiler optimization bug with float_sort (but not integer_sort) on Cygwin with -O2 and -O3 options. Does anyone know the best place to report that? For now, I'm having float_sort call std::sort on Cygwin.
I don't think this is a compiler bug. The standard forbids accessing floats as ints.
That's a reasonable restriction for higher-level languages and for people who don't know what they're doing, but quickly radix sorting a floating-point number requires treating it as an integer (viewing its bits as one would an integer's). To avoid the casting restriction on the data type, I cast the pointer instead, then dereference it. Is there a better way to do this? Interestingly, I get a 5X speedup with -O1 optimizations for floating-point sorting on Cygwin vs. std::sort, so my algorithm should be quite useful on that platform if I can get it to compile correctly. The actual problem appears to be y = x++ compiling as y = ++x, causing an incorrect memory access, based upon some printf statements I added in the area where it was crashing. It's notable that putting a printf in the same scope as the increment causes the problem to disappear, which, along with the other evidence, strongly suggests some form of compiler bug. I've attached my latest source for reference.

Steven Ross wrote:
On Mon, Dec 15, 2008 at 10:16 AM, Steven Watanabe wrote:
AMDG
Steven Ross wrote:
I encountered a compiler optimization bug with float_sort (but not integer_sort) on Cygwin with -O2 and -O3 options. Does anyone know the best place to report that? For now, I'm having float_sort call std::sort on Cygwin.
I don't think this is a compiler bug. The standard forbids accessing floats as ints.
That's a reasonable restriction for higher-level languages and for people who don't know what they're doing, but quickly radix sorting a floating-point number requires treating it as an integer (viewing its bits as one would an integer's). To avoid the casting restriction on the data type, I cast the pointer instead, then dereference it. Is there a better way to do this? Interestingly, I get a 5X speedup with -O1 optimizations for floating-point sorting on Cygwin vs. std::sort, so my algorithm should be quite useful on that platform if I can get it to compile correctly. The actual problem appears to be y = x++ compiling as y = ++x, causing an incorrect memory access, based upon some printf statements I added in the area where it was crashing. It's notable that putting a printf in the same scope as the increment causes the problem to disappear, which, along with the other evidence, strongly suggests some form of compiler bug.
This is not a compiler bug but a manifestation of undefined behavior. In general, you cannot access an object with an lvalue of different type than that it has been declared with. There are a few exceptions to this rule, listed in 3.10/15. Accessing a float through an int pointer does not fall under those exceptions, so you get undefined behavior.
I've attached my latest source for reference.
I suggest modifying the CastFloatIter() function to use memcpy for accessing floats as integers, as shown below: WARNING: not tested!
<code>
//Casts a RandomAccessIter to the specified data type
template<class cast_type, class RandomAccessIter>
inline cast_type CastFloatIter(const RandomAccessIter & floatiter)
{
    BOOST_STATIC_ASSERT( sizeof( cast_type ) == sizeof( *floatiter ) );
    cast_type dst;
    std::memcpy( &dst, &*floatiter, sizeof dst );
    return dst;
}
</code>
instead of
<code>
return *((cast_type *)(&(*floatiter)));
</code>
which is not legal if e.g. cast_type is int and *floatiter is float. HTH Best Regards, Gevorg

Gevorg Voskanyan wrote:
I suggest modifying CastFloatIter() function to use memcpy for accessing floats as integers, as shown below:
Hmm, I don't think it's any better than reinterpret casting. (It actually does the reinterpret cast when converting float const * to void const *...) I'd suggest using a union instead, for the purpose of treating a float as integer:
template <typename To, typename From>
inline To bitwise_cast(From const & from)
{
    BOOST_STATIC_ASSERT(sizeof(To) == sizeof(From));
    union { To to; From from; } u;
    u.from = from;
    return u.to;
}
Usage example: http://codepad.org/7gNXqFHO -- Pyry Jahkola pyry.jahkola@gmail.com

On Tue, Dec 16, 2008 at 3:03 PM, Pyry Jahkola <pyry.jahkola@gmail.com> wrote:
Gevorg Voskanyan wrote:
I suggest modifying CastFloatIter() function to use memcpy for accessing floats as integers, as shown below:
Hmm, I don't think it's any better than reinterpret casting.
Yes it is. It has the nice property of not leading to UB. And the result is well defined (the bitwise representation of the source is copied into the destination).
(It actually does the reinterpret cast when converting float const * to void const *...)
Not really.
I'd suggest using a union instead, for the purpose of treating a float as integer:
template <typename To, typename From>
inline To bitwise_cast(From const & from)
{
    BOOST_STATIC_ASSERT(sizeof(To) == sizeof(From));
    union { To to; From from; } u;
    u.from = from;
    return u.to;
}
This is also undefined behavior even if it is explicitly supported by most compilers as an extension. It is not necessarily better than a memcpy and might actually be slower. -- gpd

On Tue, Dec 16, 2008 at 10:19, Giovanni Piero Deretta <gpderetta@gmail.com> wrote:
On Tue, Dec 16, 2008 at 3:03 PM, Pyry Jahkola <pyry.jahkola@gmail.com> wrote:
I'd suggest using a union instead, for the purpose of treating a float as integer:
template <typename To, typename From>
inline To bitwise_cast(From const & from)
{
    BOOST_STATIC_ASSERT(sizeof(To) == sizeof(From));
    union { To to; From from; } u;
    u.from = from;
    return u.to;
}
This is also undefined behavior even if it is explicitly supported by most compilers as an extension. It is not necessarily better than a memcpy and might actually be slower.
Empirically, putting the following into the LLVM online demo (http://llvm.org/demo/) shows that, in llvm-g++ at least, the (intermediate) code generated from the memcpy is optimal:
#include <string.h>
int foo(float const y)
{
    int x;
    memcpy(&x, &y, sizeof(int));
    return x;
}
define i32 @_Z3foof(float %y) nounwind readnone {
entry:
    %0 = bitcast float %y to i32 ; <i32> [#uses=1]
    ret i32 %0
}
So since the memcpy is always safe and doesn't break the aliasing rules, it certainly seems the best approach.

On Tue, Dec 16, 2008 at 7:34 AM, Scott McMurray <me22.ca+boost@gmail.com>wrote:
Empirically, putting the following into the LLVM online demo (http://llvm.org/demo/) shows that, in llvm-g++ at least, the (intermediate) code generated from the memcpy is optimal:
#include <string.h>
int foo(float const y)
{
    int x;
    memcpy(&x, &y, sizeof(int));
    return x;
}
define i32 @_Z3foof(float %y) nounwind readnone {
entry:
    %0 = bitcast float %y to i32 ; <i32> [#uses=1]
    ret i32 %0
}
So since the memcpy is always safe and doesn't break the aliasing rules, it certainly seems the best approach.
Thanks, Scott; I've added you to the list of contributors. I made that change after some string_sort improvements, and found:
1) float_sort now works on Cygwin.
2) There is no measurable performance penalty relative to my prior casting approach.
3) My Cygwin testcase that was crashing runs over 100X as fast with float_sort as with std::sort, generating exactly the same (correct) results.
My speculation for the cause of the incredible speedup, based upon some other testcases where float_sort runs in 40% less time than std::sort, is that floating-point comparisons are implemented extremely inefficiently on my Cygwin test system; the testcase that is 100X as fast is purely radix-sorted by float_sort, while the ones that are only modestly faster make some recursive calls to std::sort, which uses comparisons. Based upon these results, it looks to me like an LSB radix sort is the best approach on Cygwin for sorting sets of floating-point numbers larger than some basic minimum size (1000 elements?), as long as the additional memory overhead of a duplicate vector is acceptable. It could also just be my old processor (a refurbished Windows system purchased from Dell over 5 years ago); modern systems might not have this problem. Does anyone have a competing theory as to why pure radix sorting is so much faster for floating-point numbers on an old Cygwin system?

I've finished optimizing and testing Spreadsort (integer_sort/float_sort/string_sort), a hybrid radix-based/comparison-based algorithm. It is now available from: http://sourceforge.net/projects/spreadsort I have tested it using an (included) automated test script on Mac OSX and Cygwin with a range of different distributions of random data on 20MB data sets. It is roughly twice as fast as std::sort on average across all three data types and many distributions, with the main exception of a 28X average performance improvement for float_sort on Cygwin. Average runtime appears to be roughly theta(n*square_root(log(n))); worst-case is more complicated but better than O(nlog(n)), due to the radix-based portion. Test it and see for yourself how it performs on your system. At this point I'm looking for suggestions on formatting it to submit as a Boost.Algorithm.Sorting namespace. I would appreciate comments that aid that process. I am also interested in knowing whether people would like a parallel_sort integrated in Boost.Algorithm.Sorting with Spreadsort.

on Thu Jan 08 2009, "Steven Ross" <spreadsort-AT-gmail.com> wrote:
I am also interested in knowing whether people would like a parallel_sort integrated in Boost.Algorithm.Sorting with Spreadsort.
The more algorithms the better, I always say :-) -- Dave Abrahams BoostPro Computing http://www.boostpro.com

AMDG Steven Ross wrote:
I've finished optimizing and testing Spreadsort (integer_sort/float_sort/string_sort), a hybrid radix-based/comparison-based algorithm. It is now available from: http://sourceforge.net/projects/spreadsort I have tested it using an (included) automated test script on Mac OSX and Cygwin with a range of different distributions of random data on 20MB data sets. It is roughly twice as fast as std::sort on average across all three data types and many distributions, with the main exception of a 28X average performance improvement for float_sort on Cygwin. Average runtime seems to be proportional to roughly theta(n*square_root(log(n))); worst-case is more complicated but better than O(nlog(n)), due to the radix-based portion. Test it and see for yourself how it performs on your system. At this point I'm looking for suggestions on formatting it to submit as a Boost.Algorithm.Sorting namespace. I would appreciate comments that aid that process. I am also interested in knowing whether people would like a parallel_sort integrated in Boost.Algorithm.Sorting with Spreadsort
Well, the normal directory structure for boost would be:
boost/
    algorithm/
        sorting/
            spreadsort.hpp
libs/
    algorithm/
        sorting/
            test/
                Jamfile.v2
                ...
            doc/
                ...
                index.html
BTW, the code doesn't compile with msvc 9.0 or gcc 4.3.0; error logs attached. In Christ, Steven Watanabe
...found 64 targets...
...updating 22 targets...
gcc.compile.c++ bin\gcc-4.3.0\release\sample.o
samples\sample.cpp: In function 'int main(int, const char**)':
samples\sample.cpp:19: error: 'strcmp' was not declared in this scope
"g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\sample.o" "samples\sample.cpp"
...failed gcc.compile.c++ bin\gcc-4.3.0\release\sample.o...
...skipped <pbin\gcc-4.3.0\release>spreadsort.exe for lack of <pbin\gcc-4.3.0\release>sample.o...
gcc.compile.c++ bin\gcc-4.3.0\release\rightshiftsample.o
samples\rightshiftsample.cpp: In function 'int main(int, const char**)':
samples\rightshiftsample.cpp:22: error: 'strcmp' was not declared in this scope
"g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\rightshiftsample.o" "samples\rightshiftsample.cpp"
...failed gcc.compile.c++ bin\gcc-4.3.0\release\rightshiftsample.o...
...skipped <pbin\gcc-4.3.0\release>rightshift.exe for lack of <pbin\gcc-4.3.0\release>rightshiftsample.o...
gcc.compile.c++ bin\gcc-4.3.0\release\floatsample.o samples\floatsample.cpp: In function 'int main(int, const char**)': samples\floatsample.cpp:21: error: 'strcmp' was not declared in this scope samples\/../spreadsort.hpp: In function 'cast_type CastFloatIter(const RandomAccessIter&) [with cast_type = int, RandomAccessIter = __gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >]': samples\/../spreadsort.hpp:705: instantiated from 'void FloatSortRec(RandomAccessIter, RandomAccessIter, std::vector<RandomAccessIter, std::allocator<_Tp1> >&, unsigned int, std::vector<unsigned int, std::allocator<unsigned int> >&) [with RandomAccessIter = __gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, div_type = int, data_type = float]' samples\/../spreadsort.hpp:945: instantiated from 'void FloatSort(RandomAccessIter, RandomAccessIter, cast_type, data_type) [with RandomAccessIter = __gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, cast_type = int, data_type = float]' samples\/../spreadsort.hpp:974: instantiated from 'void float_sort_cast(RandomAccessIter, RandomAccessIter, cast_type) [with RandomAccessIter = __gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, cast_type = int]' samples\/../spreadsort.hpp:983: instantiated from 'void float_sort_cast_to_int(RandomAccessIter, RandomAccessIter) [with RandomAccessIter = __gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >]' samples\floatsample.cpp:62: instantiated from here samples\/../spreadsort.hpp:424: error: 'memcpy' was not declared in this scope "g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\floatsample.o" "samples\floatsample.cpp" ...failed gcc.compile.c++ bin\gcc-4.3.0\release\floatsample.o... ...skipped <pbin\gcc-4.3.0\release>floatsort.exe for lack of <pbin\gcc-4.3.0\release>floatsample.o... 
gcc.compile.c++ bin\gcc-4.3.0\release\shiftfloatsample.o samples\shiftfloatsample.cpp: In function 'int main(int, const char**)': samples\shiftfloatsample.cpp:30: error: 'strcmp' was not declared in this scope samples\/../spreadsort.hpp: In function 'cast_type MemCast(const DATA_TYPE&) [with DATA_TYPE = float, cast_type = int]': samples\shiftfloatsample.cpp:17: instantiated from here samples\/../spreadsort.hpp:434: error: 'memcpy' was not declared in this scope "g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\shiftfloatsample.o" "samples\shiftfloatsample.cpp" ...failed gcc.compile.c++ bin\gcc-4.3.0\release\shiftfloatsample.o... ...skipped <pbin\gcc-4.3.0\release>shiftfloatsort.exe for lack of <pbin\gcc-4.3.0\release>shiftfloatsample.o... gcc.compile.c++ bin\gcc-4.3.0\release\floatfunctorsample.o samples\floatfunctorsample.cpp: In function 'int main(int, const char**)': samples\floatfunctorsample.cpp:30: error: 'strcmp' was not declared in this scope samples\/../spreadsort.hpp: In function 'cast_type MemCast(const DATA_TYPE&) [with DATA_TYPE = float, cast_type = int]': samples\floatfunctorsample.cpp:17: instantiated from here samples\/../spreadsort.hpp:434: error: 'memcpy' was not declared in this scope "g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\floatfunctorsample.o" "samples\floatfunctorsample.cpp" ...failed gcc.compile.c++ bin\gcc-4.3.0\release\floatfunctorsample.o... ...skipped <pbin\gcc-4.3.0\release>floatfunctorsort.exe for lack of <pbin\gcc-4.3.0\release>floatfunctorsample.o... gcc.compile.c++ bin\gcc-4.3.0\release\stringsample.o samples\stringsample.cpp: In function 'int main(int, const char**)': samples\stringsample.cpp:21: error: 'strcmp' was not declared in this scope "g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." 
-c -o "bin\gcc-4.3.0\release\stringsample.o" "samples\stringsample.cpp" ...failed gcc.compile.c++ bin\gcc-4.3.0\release\stringsample.o... ...skipped <pbin\gcc-4.3.0\release>stringsort.exe for lack of <pbin\gcc-4.3.0\release>stringsample.o... gcc.compile.c++ bin\gcc-4.3.0\release\reversestringsample.o samples\reversestringsample.cpp: In function 'int main(int, const char**)': samples\reversestringsample.cpp:23: error: 'strcmp' was not declared in this scope "g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\reversestringsample.o" "samples\reversestringsample.cpp" ...failed gcc.compile.c++ bin\gcc-4.3.0\release\reversestringsample.o... ...skipped <pbin\gcc-4.3.0\release>reversestringsort.exe for lack of <pbin\gcc-4.3.0\release>reversestringsample.o... gcc.compile.c++ bin\gcc-4.3.0\release\charstringsample.o samples\charstringsample.cpp: In function 'int main(int, const char**)': samples\charstringsample.cpp:35: error: 'strcmp' was not declared in this scope "g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\charstringsample.o" "samples\charstringsample.cpp" ...failed gcc.compile.c++ bin\gcc-4.3.0\release\charstringsample.o... ...skipped <pbin\gcc-4.3.0\release>charstringsort.exe for lack of <pbin\gcc-4.3.0\release>charstringsample.o... gcc.compile.c++ bin\gcc-4.3.0\release\stringfunctorsample.o samples\stringfunctorsample.cpp: In function 'int main(int, const char**)': samples\stringfunctorsample.cpp:34: error: 'strcmp' was not declared in this scope "g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\stringfunctorsample.o" "samples\stringfunctorsample.cpp" ...failed gcc.compile.c++ bin\gcc-4.3.0\release\stringfunctorsample.o... ...skipped <pbin\gcc-4.3.0\release>stringfunctorsort.exe for lack of <pbin\gcc-4.3.0\release>stringfunctorsample.o... 
gcc.compile.c++ bin\gcc-4.3.0\release\reversestringfunctorsample.o samples\reversestringfunctorsample.cpp: In function 'int main(int, const char**)': samples\reversestringfunctorsample.cpp:34: error: 'strcmp' was not declared in this scope "g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\reversestringfunctorsample.o" "samples\reversestringfunctorsample.cpp" ...failed gcc.compile.c++ bin\gcc-4.3.0\release\reversestringfunctorsample.o... ...skipped <pbin\gcc-4.3.0\release>reversestringfunctorsort.exe for lack of <pbin\gcc-4.3.0\release>reversestringfunctorsample.o... gcc.compile.c++ bin\gcc-4.3.0\release\keyplusdatasample.o samples\keyplusdatasample.cpp: In function 'int main(int, const char**)': samples\keyplusdatasample.cpp:30: error: 'strcmp' was not declared in this scope "g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\keyplusdatasample.o" "samples\keyplusdatasample.cpp" ...failed gcc.compile.c++ bin\gcc-4.3.0\release\keyplusdatasample.o... ...skipped <pbin\gcc-4.3.0\release>keyplusdata.exe for lack of <pbin\gcc-4.3.0\release>keyplusdatasample.o... ...failed updating 11 targets... ...skipped 11 targets... ...found 65 targets... ...updating 8 targets... compile-c-c++ bin\msvc-9.0express\release\threading-multi\floatsample.obj floatsample.cpp samples\floatsample.cpp(26) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. 
C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\stdio.h(237) : see declaration of 'fopen' samples\floatsample.cpp(43) : error C3861: 'fpclassify': identifier not found samples\floatsample.cpp(43) : error C2050: switch expression not integral samples\floatsample.cpp(44) : error C2065: 'FP_NAN' : undeclared identifier samples\floatsample.cpp(44) : error C2051: case expression not constant samples\floatsample.cpp(45) : error C2065: 'FP_ZERO' : undeclared identifier samples\floatsample.cpp(45) : error C2051: case expression not constant samples\floatsample.cpp(51) : warning C4065: switch statement contains 'default' but no 'case' labels call "C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat" x86 >nul cl /Zm800 -nologo @"bin\msvc-9.0express\release\threading-multi\floatsample.obj.rsp" ...failed compile-c-c++ bin\msvc-9.0express\release\threading-multi\floatsample.obj... ...skipped <pbin\msvc-9.0express\release\threading-multi>floatsort.exe for lack of <pbin\msvc-9.0express\release\threading-multi>floatsample.obj... compile-c-c++ bin\msvc-9.0express\release\threading-multi\shiftfloatsample.obj shiftfloatsample.cpp samples\shiftfloatsample.cpp(35) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. 
C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\stdio.h(237) : see declaration of 'fopen' samples\shiftfloatsample.cpp(52) : error C3861: 'fpclassify': identifier not found samples\shiftfloatsample.cpp(52) : error C2050: switch expression not integral samples\shiftfloatsample.cpp(53) : error C2065: 'FP_NAN' : undeclared identifier samples\shiftfloatsample.cpp(53) : error C2051: case expression not constant samples\shiftfloatsample.cpp(54) : error C2065: 'FP_ZERO' : undeclared identifier samples\shiftfloatsample.cpp(54) : error C2051: case expression not constant samples\shiftfloatsample.cpp(60) : warning C4065: switch statement contains 'default' but no 'case' labels call "C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat" x86 >nul cl /Zm800 -nologo @"bin\msvc-9.0express\release\threading-multi\shiftfloatsample.obj.rsp" ...failed compile-c-c++ bin\msvc-9.0express\release\threading-multi\shiftfloatsample.obj... ...skipped <pbin\msvc-9.0express\release\threading-multi>shiftfloatsort.exe for lack of <pbin\msvc-9.0express\release\threading-multi>shiftfloatsample.obj... compile-c-c++ bin\msvc-9.0express\release\threading-multi\floatfunctorsample.obj floatfunctorsample.cpp samples\floatfunctorsample.cpp(35) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. 
C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\stdio.h(237) : see declaration of 'fopen' samples\floatfunctorsample.cpp(52) : error C3861: 'fpclassify': identifier not found samples\floatfunctorsample.cpp(52) : error C2050: switch expression not integral samples\floatfunctorsample.cpp(53) : error C2065: 'FP_NAN' : undeclared identifier samples\floatfunctorsample.cpp(53) : error C2051: case expression not constant samples\floatfunctorsample.cpp(54) : error C2065: 'FP_ZERO' : undeclared identifier samples\floatfunctorsample.cpp(54) : error C2051: case expression not constant samples\floatfunctorsample.cpp(60) : warning C4065: switch statement contains 'default' but no 'case' labels call "C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat" x86 >nul cl /Zm800 -nologo @"bin\msvc-9.0express\release\threading-multi\floatfunctorsample.obj.rsp" ...failed compile-c-c++ bin\msvc-9.0express\release\threading-multi\floatfunctorsample.obj... ...skipped <pbin\msvc-9.0express\release\threading-multi>floatfunctorsort.exe for lack of <pbin\msvc-9.0express\release\threading-multi>floatfunctorsample.obj... compile-c-c++ bin\msvc-9.0express\release\threading-multi\randomGen.obj randomGen.cpp samples\randomGen.cpp(24) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\stdio.h(237) : see declaration of 'fopen' samples\randomGen.cpp(40) : error C3861: 'isnan': identifier not found samples\randomGen.cpp(43) : error C3861: 'isnan': identifier not found call "C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat" x86 >nul cl /Zm800 -nologo @"bin\msvc-9.0express\release\threading-multi\randomGen.obj.rsp" ...failed compile-c-c++ bin\msvc-9.0express\release\threading-multi\randomGen.obj... 
...skipped <pbin\msvc-9.0express\release\threading-multi>randomgen.exe for lack of <pbin\msvc-9.0express\release\threading-multi>randomGen.obj... ...failed updating 4 targets... ...skipped 4 targets...

Thanks for testing it, Steve. I was unable to reproduce the errors you encountered, but I eliminated all references to NaNs and strcmp. I also changed the code to use std::memcpy and include <cstring>. I believe this should resolve the problems you encountered. I've added HTML documentation. The new version is at: http://sourceforge.net/projects/spreadsort I'd appreciate feedback on any issues. I believe the only thing this is missing now is test integration with Boost, which I'll try to do with a Jamfile. When I submit a proposed library, should I include the multi-level directory structure for it as it would appear in the boost hierarchy? Regards, Steve Ross On Thu, Jan 8, 2009 at 6:29 PM, Steven Watanabe <watanabesj@gmail.com> wrote:
AMDG
Steven Ross wrote:
I've finished optimizing and testing Spreadsort (integer_sort/float_sort/string_sort), a hybrid radix-based/comparison-based algorithm. It is now available from: http://sourceforge.net/projects/spreadsort I have tested it using an (included) automated test script on Mac OSX and Cygwin with a range of different distributions of random data on 20MB data sets. It is roughly twice as fast as std::sort on average across all three data types and many distributions, with the main exception of a 28X average performance improvement for float_sort on Cygwin. Average runtime seems to be proportional to roughly theta(n*square_root(log(n))); worst-case is more complicated but better than O(nlog(n)), due to the radix-based portion. Test it and see for yourself how it performs on your system. At this point I'm looking for suggestions on formatting it to submit as a Boost.Algorithm.Sorting namespace. I would appreciate comments that aid that process. I am also interested in knowing whether people would like a parallel_sort integrated in Boost.Algorithm.Sorting with Spreadsort
Well, the normal directory structure for boost would be
boost/
    algorithm/
        sorting/
            spreadsort.hpp
libs/
    algorithm/
        sorting/
            test/
                Jamfile.v2
                ...
            doc/
                ...
                index.html
BTW, the code doesn't compile with msvc 9.0 or gcc 4.3.0. error logs attached.
In Christ, Steven Watanabe
"g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\reversestringfunctorsample.o" "samples\reversestringfunctorsample.cpp"
...failed gcc.compile.c++ bin\gcc-4.3.0\release\reversestringfunctorsample.o... ...skipped <pbin\gcc-4.3.0\release>reversestringfunctorsort.exe for lack of <pbin\gcc-4.3.0\release>reversestringfunctorsample.o... gcc.compile.c++ bin\gcc-4.3.0\release\keyplusdatasample.o samples\keyplusdatasample.cpp: In function 'int main(int, const char**)': samples\keyplusdatasample.cpp:30: error: 'strcmp' was not declared in this scope
"g++-4.3.0" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -DNDEBUG -I"." -c -o "bin\gcc-4.3.0\release\keyplusdatasample.o" "samples\keyplusdatasample.cpp"
...failed gcc.compile.c++ bin\gcc-4.3.0\release\keyplusdatasample.o... ...skipped <pbin\gcc-4.3.0\release>keyplusdata.exe for lack of <pbin\gcc-4.3.0\release>keyplusdatasample.o... ...failed updating 11 targets... ...skipped 11 targets...
...found 65 targets... ...updating 8 targets... compile-c-c++ bin\msvc-9.0express\release\threading-multi\floatsample.obj floatsample.cpp samples\floatsample.cpp(26) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\stdio.h(237) : see declaration of 'fopen' samples\floatsample.cpp(43) : error C3861: 'fpclassify': identifier not found samples\floatsample.cpp(43) : error C2050: switch expression not integral samples\floatsample.cpp(44) : error C2065: 'FP_NAN' : undeclared identifier samples\floatsample.cpp(44) : error C2051: case expression not constant samples\floatsample.cpp(45) : error C2065: 'FP_ZERO' : undeclared identifier samples\floatsample.cpp(45) : error C2051: case expression not constant samples\floatsample.cpp(51) : warning C4065: switch statement contains 'default' but no 'case' labels
call "C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat" x86
nul cl /Zm800 -nologo @"bin\msvc-9.0express\release\threading-multi\floatsample.obj.rsp"
...failed compile-c-c++ bin\msvc-9.0express\release\threading-multi\floatsample.obj... ...skipped <pbin\msvc-9.0express\release\threading-multi>floatsort.exe for lack of <pbin\msvc-9.0express\release\threading-multi>floatsample.obj... compile-c-c++ bin\msvc-9.0express\release\threading-multi\shiftfloatsample.obj shiftfloatsample.cpp samples\shiftfloatsample.cpp(35) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\stdio.h(237) : see declaration of 'fopen' samples\shiftfloatsample.cpp(52) : error C3861: 'fpclassify': identifier not found samples\shiftfloatsample.cpp(52) : error C2050: switch expression not integral samples\shiftfloatsample.cpp(53) : error C2065: 'FP_NAN' : undeclared identifier samples\shiftfloatsample.cpp(53) : error C2051: case expression not constant samples\shiftfloatsample.cpp(54) : error C2065: 'FP_ZERO' : undeclared identifier samples\shiftfloatsample.cpp(54) : error C2051: case expression not constant samples\shiftfloatsample.cpp(60) : warning C4065: switch statement contains 'default' but no 'case' labels
call "C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat" x86
nul cl /Zm800 -nologo @"bin\msvc-9.0express\release\threading-multi\shiftfloatsample.obj.rsp"
...failed compile-c-c++ bin\msvc-9.0express\release\threading-multi\shiftfloatsample.obj... ...skipped <pbin\msvc-9.0express\release\threading-multi>shiftfloatsort.exe for lack of <pbin\msvc-9.0express\release\threading-multi>shiftfloatsample.obj... compile-c-c++ bin\msvc-9.0express\release\threading-multi\floatfunctorsample.obj floatfunctorsample.cpp samples\floatfunctorsample.cpp(35) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\stdio.h(237) : see declaration of 'fopen' samples\floatfunctorsample.cpp(52) : error C3861: 'fpclassify': identifier not found samples\floatfunctorsample.cpp(52) : error C2050: switch expression not integral samples\floatfunctorsample.cpp(53) : error C2065: 'FP_NAN' : undeclared identifier samples\floatfunctorsample.cpp(53) : error C2051: case expression not constant samples\floatfunctorsample.cpp(54) : error C2065: 'FP_ZERO' : undeclared identifier samples\floatfunctorsample.cpp(54) : error C2051: case expression not constant samples\floatfunctorsample.cpp(60) : warning C4065: switch statement contains 'default' but no 'case' labels
call "C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat" x86
nul cl /Zm800 -nologo @"bin\msvc-9.0express\release\threading-multi\floatfunctorsample.obj.rsp"
...failed compile-c-c++ bin\msvc-9.0express\release\threading-multi\floatfunctorsample.obj... ...skipped <pbin\msvc-9.0express\release\threading-multi>floatfunctorsort.exe for lack of <pbin\msvc-9.0express\release\threading-multi>floatfunctorsample.obj... compile-c-c++ bin\msvc-9.0express\release\threading-multi\randomGen.obj randomGen.cpp samples\randomGen.cpp(24) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\stdio.h(237) : see declaration of 'fopen' samples\randomGen.cpp(40) : error C3861: 'isnan': identifier not found samples\randomGen.cpp(43) : error C3861: 'isnan': identifier not found
call "C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat" x86
nul cl /Zm800 -nologo @"bin\msvc-9.0express\release\threading-multi\randomGen.obj.rsp"
...failed compile-c-c++ bin\msvc-9.0express\release\threading-multi\randomGen.obj... ...skipped <pbin\msvc-9.0express\release\threading-multi>randomgen.exe for lack of <pbin\msvc-9.0express\release\threading-multi>randomGen.obj... ...failed updating 4 targets... ...skipped 4 targets...
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

How do I run Boost's automated testing on a single unit test, and how do I create a new test directory in BOOST_ROOT/libs/algorithm/sorting such that it gets executed? If someone could just explain how I run libs/algorithm/string's automated tests, I'd be quite happy to copy how its tests are written, and modify them to test sorting correctness. I've spent over 7 hours so far trying to figure this out, and it's starting to get ridiculous. I've read through the documentation of bjam and the Boost test library and they don't seem to explain this.

AMDG Steven Ross wrote:
How do I run Boost's automated testing on a single unit test,
If you have run mytest.cpp ; (e.g.) in the Jamfile you can run this test with bjam mytest
and how do I create a new test directory in BOOST_ROOT/libs/algorithm/sorting such that it gets executed?
You don't need to deal with this yet. The full regression tests are controlled by status/Jamfile.v2 which just contains a list of libraries.
If someone could just explain how I run libs/algorithm/string's automated tests, I'd be quite happy to copy how its tests are written, and modify them to test sorting correctness. I've spent over 7 hours so far trying to figure this out, and it's starting to get ridiculous. I've read through the documentation of bjam and the Boost test library and they don't seem to explain this.
# run all tests
bjam
# run the first test
bjam trim

The fifth argument to the run rule is the name of the target. The name defaults to the name of the first source file. Also, the test-suite is unnecessary. The StringAlgo Jamfile could just as easily be written as

run trim_test.cpp : : : : trim ;
run conv_test.cpp : : : : conv ;
#...

In Christ, Steven Watanabe

algorithm_sorting.tar.gz is now in the boost vault, and is formatted for submission. I'd appreciate it if people could test it on their compiler and give me some feedback. It's not quite ready for formal review yet; I'd like a test on MSVC++ first. On Tue, Jan 13, 2009 at 9:02 AM, Steven Watanabe <watanabesj@gmail.com>wrote:
If you have run mytest.cpp ; (e.g.) in the Jamfile
you can run this test with
bjam mytest
Thanks. I'd already tried bjam, but what I figured out is that a link error: /usr/bin/ld: unknown flag: --start-group is preventing the executable from being built, so the tests don't run. So instead I link the .o file that was built by bjam, run it, and it runs my tests. Some questions:

1) I have a perl tuning/regression script that I've used to verify correctness, performance, and tune some tuning constants. One of these tuning constants (MAX_SPLITS) has a 10+% impact on performance, because it impacts cache misses and how many recursive stages are executed. 10-12 is the range of ideal values for MAX_SPLITS on Intel processors, and 10-11 is ideal on my old Motorola PowerPC processor. The penalty for too large a value is prohibitive (20+%), where the penalty for a value 1 too small is generally less than 10%. Based upon that, picking a value of 10 or 11 should be decent across most systems. For ideal performance, this should probably be configured to the processor it is run on. The other 3 constants have much less impact and can reasonably have the same value across all systems. Should I include the tune.pl script with the library (and if so, where, libs?), or should I do all the performance tuning myself?

2) Should I have a separate constants.hpp to contain these constants (as now), or define them directly in the spreadsort.hpp source file? If I define them directly in the source file, tune.pl won't work, but it makes it easier for users to just copy spreadsort.hpp on its own.

3) Should I use configuration information to set the constants, or just pick some decent values and stick with them?

4) Edouard is writing a separate parallel_sort, and our intention is that it can take a separate sorting algorithm (such as integer_sort) as an argument. Are we better off submitting our source for final review separately or together? They're intended to be in the same library.
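The role a MAX_SPLITS-style cap plays can be sketched as follows. This is a hypothetical helper, not the library's actual logic: the idea is that one radix pass splits on at most MAX_SPLITS bits, so with a value of 11 a pass uses at most 2^11 = 2048 bins, which is what keeps the bin array cache-resident.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical tuning constant; the thread suggests 10-12 is ideal on
// Intel and 10-11 on PowerPC.
const unsigned max_splits = 11;

// Choose how many bits to radix on in one pass: roughly log2(n) bits
// (one bin per few elements), but never more than max_splits, so the
// bin count stays bounded regardless of n.
unsigned bits_for_pass(std::size_t n) {
    unsigned log_n = 0;
    while ((std::size_t(1) << log_n) < n)
        ++log_n;
    return std::min(log_n, max_splits);
}
```

With this shape, too large a cap blows the cache on every pass (the prohibitive 20+% penalty), while a cap one too small only costs an extra recursion level.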

Steven Ross wrote:
algorithm_sorting.tar.gz is now in the boost vault, and is formatted for submission. I'd appreciate it if people could test it on their compiler and give me some feedback.
Seems to be here: http://www.boostpro.com/vault/index.php?action=downloadfile&filename=algorithm_sorting.tar.gz&directory=&

Some first impressions follow. I will try to find time to look at it more carefully at some point, but I can't promise.

You seem to have quite a lot of code that's duplicated for the functor and non-functor versions. Can't this be eliminated by using e.g. std::less as a default for the compare template parameter? (Isn't this what std::sort does?)

I suggest that you get rid of the tab characters in the source. I personally find your lines rather long, but that's a matter of taste.

It would be good if it were possible to automatically detect whether a floating point type is one for which the cast-to-integer trick works. It would be great if C++ defined some macro or similar that indicated whether float and double were IEEE 754 or not. (Hmm, is there some sort of type_traits thing to do this?) Maybe it's possible to get a "good guess" answer by checking whether the bit pattern is as expected for some known constant; would someone like to suggest a good way to check this at compile time? You could then fall back to std::sort if it fails.

I think you can use typename RandomAccessIter::value_type instead of passing *first in order to determine data_type.

I'm not sure how useful your functor versions are, since they still seem to depend on the "integer-ness" or "float-ness" of the sorted type. Apart from sorting backwards, what use cases are there? Using functors would be more useful if it let you sort types that are neither integers nor floats; for example, fixed-point types, pointers-to-integers, or structs containing a key field to be sorted on. I think that the key here is that you make assumptions about the return type of operator>> or the right_shift functor which are not true in those cases. It may be that you just need to rename that functor and document what it needs to do. I think it's really "get N bits from key".
I would have thought that the floating-point algorithm would just be an instance of the generic algorithm with custom functors, with a bit of specialisation to cope with the reverse-order-for-negatives thing.

One of the things that your string sort does is to not compare known-equal initial substrings. Does anyone know if there is a variant of a conventional sort (e.g. quicksort) that does this?

Some more documentation of the performance benefit would be useful. How about some graphs?

Cheers, Phil.
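As a concrete illustration of the cast-to-integer trick under discussion — a sketch, not the library's actual MemCast — an IEEE 754 float can be mapped through memcpy to a same-size integer whose ordering matches the float ordering, after adjusting for the sign-magnitude representation of negatives:

```cpp
#include <cstdint>
#include <cstring>

// Map an IEEE 754 float to an int32_t such that integer comparison gives
// the same ordering as float comparison (treating -0.0 as below +0.0).
int32_t float_to_ordered_int(float f) {
    int32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));  // bit copy; avoids aliasing UB
    // Positive floats already compare correctly as signed integers.
    // Negative floats are sign-magnitude, so their raw bit patterns sort
    // in reverse; flipping the magnitude bits (while keeping the sign bit
    // set) restores the order.
    return bits < 0 ? (~bits | INT32_MIN) : bits;
}
```

This is the property that lets float_sort treat floats as integers, and the is_iec559/sizeof static assertions mentioned later in the thread are exactly the preconditions this mapping needs.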

AMDG Phil Endecott wrote:
It would be good if it were possible to automatically detect whether a floating point type is one for which the cast-to-integer trick works. It would be great if C++ defined some macro or similar that defined whether float and double were IEEE 754 or not. (Hmm, is there some sort of type_traits thing to do this?).
Are you thinking of std::numeric_limits<>::is_iec559? In Christ, Steven Watanabe

On Wed, Jan 14, 2009 at 10:26 AM, Steven Watanabe <watanabesj@gmail.com>wrote:
AMDG Phil Endecott wrote:
It would be good if it were possible to automatically detect whether a floating point type is one for which the cast-to-integer trick works. It would be great if C++ defined some macro or similar that defined whether float and double were IEEE 754 or not. (Hmm, is there some sort of type_traits thing to do this?).
Are you thinking of std::numeric_limits<>::is_iec559?
I've added these static assertions in the casting methods:

BOOST_STATIC_ASSERT(sizeof(cast_type) == sizeof(data_type));
BOOST_STATIC_ASSERT(std::numeric_limits<data_type>::is_iec559);
BOOST_STATIC_ASSERT(std::numeric_limits<cast_type>::is_integer);

That should guarantee I'm casting an IEEE floating-point number to an integer of the same size. Also, using toolset=darwin resolved my bjam issue, and my tests run and pass now. Thanks for your help Steve.

Steven Ross wrote:
On Wed, Jan 14, 2009 at 10:26 AM, Steven Watanabe <watanabesj@gmail.com>wrote:
Phil Endecott wrote:
It would be good if it were possible to automatically detect whether a floating point type is one for which the cast-to-integer trick works. It would be great if C++ defined some macro or similar that defined whether float and double were IEEE 754 or not. (Hmm, is there some sort of type_traits thing to do this?).
Are you thinking of std::numeric_limits<>::is_iec559?
I've added these static assertions in the casting methods:

BOOST_STATIC_ASSERT(sizeof(cast_type) == sizeof(data_type));
BOOST_STATIC_ASSERT(std::numeric_limits<data_type>::is_iec559);
BOOST_STATIC_ASSERT(std::numeric_limits<cast_type>::is_integer);

That should guarantee I'm casting an IEEE floating-point number to an integer of the same size.
I was imagining that you'd use these to enable a specialisation, e.g. using enable_if, rather than in asserts. Phil.

On Mon, Feb 2, 2009 at 1:47 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
I've added these static assertions in the casting methods:
BOOST_STATIC_ASSERT(sizeof(cast_type) == sizeof(data_type));
BOOST_STATIC_ASSERT(std::numeric_limits<data_type>::is_iec559);
BOOST_STATIC_ASSERT(std::numeric_limits<cast_type>::is_integer);

That should guarantee I'm casting an IEEE floating-point number to an integer of the same size.
I was imagining that you'd use these to enable a specialisation, e.g. using enable_if, rather than in asserts.
I suppose I could just call std::sort for situations where any of those asserts would fail, but I think a compile error is more appropriate. Can you think of a usage case with a standards-compliant C++ compiler where someone would want to use float_sort but those assertions fail for a reason other than user error? If not, I believe telling the user that they're using it wrong via a static assertion failure is more appropriate.

I've uploaded a new algorithm_sorting.zip to www.boostpro.com/vault, which resolves all the issues mentioned in past emails (tabs, data_type, graphs, etc). I believe it contains everything necessary for a Boost library at this point. Suggestions are welcome, as are performance results. I'd particularly appreciate it if people could run tune.pl on their systems and send me the output (tune.pl determines ideal constants).

I made my performance graphs on OSX. I have graphs of runtime vs. file size, and other graphs of runtime vs. data range (0 to 32 bits). The data range graphs are interesting, showing both the importance of MAX_SPLITS and how Spreadsort takes advantage of small ranges (clumpy data) for better performance.

I've thought about Phil's enable_if suggestion more, and think it might be possible to use a combination of my different algorithms to make a spread_sort that first checks if an iterator's value_type is an integer; if so, it uses integer_sort. Otherwise it checks whether it's a float; if so, it sorts with float_sort. Finally it checks whether it is a string; if so, it sorts with string_sort, and if not, it falls back on std::sort. Does that sound like a useful generalization, or just an unwieldy mess? I think most people know what they're sorting, but a generic spread_sort could make generic programming a little easier.

On Mon, Feb 2, 2009 at 1:47 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
I've added these static assertions in the casting methods:
BOOST_STATIC_ASSERT(sizeof(cast_type) == sizeof(data_type));
BOOST_STATIC_ASSERT(std::numeric_limits<data_type>::is_iec559);
BOOST_STATIC_ASSERT(std::numeric_limits<cast_type>::is_integer);

That should guarantee I'm casting an IEEE floating-point number to an integer of the same size.
I was imagining that you'd use these to enable a specialisation, e.g. using enable_if, rather than in asserts.
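The spread_sort dispatch idea proposed above can be sketched with type traits. This is only a shape sketch under stated assumptions: modern std::is_integral / std::is_floating_point stand in for the Boost type traits of the era, and every branch falls back to std::sort here so the sketch compiles standalone — the real library would substitute integer_sort, float_sort, and string_sort.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

// Dispatch on the iterator's value_type, as proposed: integers to one
// algorithm, floats to another, everything else (e.g. strings) to a
// fallback. All branches use std::sort here so this sketch is runnable.
template <class Iter>
void spread_sort(Iter first, Iter last) {
    typedef typename std::iterator_traits<Iter>::value_type T;
    if (std::is_integral<T>::value) {
        std::sort(first, last);  // would dispatch to integer_sort
    } else if (std::is_floating_point<T>::value) {
        std::sort(first, last);  // would dispatch to float_sort
    } else {
        std::sort(first, last);  // strings: string_sort; else std::sort
    }
}
```

The enable_if formulation Phil suggests would move this selection to overload resolution at compile time rather than a runtime branch, but the value_type-driven choice is the same.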

AMDG Steven Ross wrote:
I've uploaded a new algorithm_sorting.zip to www.boostpro.com/vault, which resolves all the issues I found mentioned in past emails (tabs, data_type, graphs, etc). I believe it contains everything necessary for a Boost library at this point. Suggestion are welcome, as are performance results. I'd particularly appreciate it if people could run tune.pl on their systems and send me the output (tune.pl determines ideal constants).
I would appreciate it if you used bjam instead of make in tune.pl, so that I can test it with msvc. I've attached a Jamfile. In Christ, Steven Watanabe

local properties = ;
if --tune in [ modules.peek : ARGV ]
{
    properties = <location>. <variant>release ;
}
project spreadsort
    : source-location samples
    : requirements <include>../.. $(properties)
    ;
exe spreadsort : sample.cpp ;
exe alreadysorted : alreadysorted.cpp ;
exe rightshift : rightshiftsample.cpp ;
exe reverseintsort : reverseintsample.cpp ;
exe floatsort : floatsample.cpp ;
exe shiftfloatsort : shiftfloatsample.cpp ;
exe floatfunctorsort : floatfunctorsample.cpp ;
exe stringsort : stringsample.cpp ;
exe reversestringsort : reversestringsample.cpp ;
exe charstringsort : charstringsample.cpp ;
exe stringfunctorsort : stringfunctorsample.cpp ;
exe reversestringfunctorsort : reversestringfunctorsample.cpp ;
exe keyplusdata : keyplusdatasample.cpp ;
exe randomgen : randomgen.cpp ;

Thanks, Steve. That Jamfile works for building it (debug). A couple of issues:

1) When I try "bjam release" on my system with toolset=darwin (the correct toolset), I get all kinds of errors from the -gdwarf-2 option.

2) How do I find (or move) the executables bjam builds? It's building them in ../../../bin.v2/libs/algorithm/sorting/darwin-4.0.1/debug/, which won't work on other systems if I provide a link in tune.pl.

I could convert the tuning script to another language (or even C++) if that's necessary, though that might take me a while. I picked Perl because it's easy and available on most platforms.

On Mon, Feb 2, 2009 at 8:45 PM, Steven Watanabe <watanabesj@gmail.com> wrote:
AMDG
Steven Ross wrote:
I've uploaded a new algorithm_sorting.zip to www.boostpro.com/vault, which resolves all the issues I found mentioned in past emails (tabs, data_type, graphs, etc). I believe it contains everything necessary for a Boost library at this point. Suggestion are welcome, as are performance results. I'd particularly appreciate it if people could run tune.pl on their systems and send me the output (tune.pl determines ideal constants).
I would appreciate it if you used bjam instead of a make in tune.pl, so that I can test it with msvc. I've attached a Jamfile.

AMDG Steven Ross wrote:
Thanks, Steve. That Jamfile works for building it (debug). A couple issues: 1) When I try "bjam release" on my system with toolset=darwin (the correct toolset), I get all kinds of errors from the -gdwarf-2 option.
It looks like this option is being added unconditionally on darwin.jam line 332. What exactly is the error? What is the command line being used? Can you figure out what the command line should be?
2) How do I find (or move) the executables bjam builds? It's building them in ../../../bin.v2/libs/algorithm/sorting/darwin-4.0.1/debug/, which won't work on other systems if I provide a link in tune.pl.
I included a --tune option that uses the release variant and forces everything to be built in the directory of the Jamfile.
I could convert the tuning script to another language (or even C++) if that's necessary, though that might take me a while. I picked PERL because it's easy and available on most platforms.
Perl is fine. It's just a Makefile, hard coded to use gcc, that's a problem. In Christ, Steven Watanabe

On Tue, Feb 3, 2009 at 9:34 AM, Steven Watanabe <watanabesj@gmail.com>wrote:
1) When I try "bjam release" on my system with toolset=darwin (the correct
toolset), I get all kinds of errors from the -gdwarf-2 option.
It looks like this option is being added unconditionally on darwin.jam line 332. What exactly is the error?
hundreds of lines of this: /var/tmp//cco8nWgI.s:11638:Complex expression. Absolute segment assumed. /var/tmp//cco8nWgI.s:11640:Complex expression. Absolute segment assumed. /var/tmp//cco8nWgI.s:11644:Complex expression. Absolute segment assumed.
What is the command line being used?
"g++" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -dynamic -no-cpp-precomp -gdwarf-2 -Wno-long-double -fPIC -DBOOST_ALL_NO_LIB=1 -DNDEBUG -I"../.." -I"../../.." -c -o "../../../bin.v2/libs/algorithm/sorting/darwin-4.0.1/release/sample.o" "samples/sample.cpp"
Can you figure out what the command line should be?
I deleted -gdwarf-2 from darwin.jam, and tune.pl is working fine with bjam now.
2) How do I find (or move) the executables bjam builds? It's building
them in ../../../bin.v2/libs/algorithm/sorting/darwin-4.0.1/debug/, which won't work on other systems if I provide a link in tune.pl.
I included a --tune option that uses the release variant and forces everything to be built in the directory of the Jamfile.
tune.pl is now using that Jamfile, and I've uploaded a revised algorithm_sorting.zip to the Boost Vault. Thanks for the help making tune.pl work with Jam; now it's ready for people to try. I'm expecting Windows results to be worse than LINUX/UNIX/CYGWIN, and would like to improve them. Looking at my performance vs. range graph, combined with some odd results from tune.pl, inspired me to make an optimization to Spreadsort that makes it significantly faster for 11-bit and 21-bit ranges (MAX_SPLITS + 1 and MAX_SPLITS*2 + 1). Those two graph points are now out of date; I'll update them eventually.

The graphs are updated now. I also resolved an MSVC issue due to a missing include <limits>. I've reuploaded algorithm_sorting.zip to the Boost Vault. A friend of mine ran tune.pl with MSVC and found these results:

integer_sort is 30% faster than std::sort
string_sort is 88% faster than std::sort
float_sort is 460% faster than std::sort

He also found that the ideal MAX_SPLITS on Windows is larger. On his system 12 was ideal, but 16 was close. On some Windows systems MAX_SPLITS 16 will probably be ideal, due to whatever 16-bit optimizations are in the OS. The value of MAX_SPLITS is important to the performance of this algorithm; the other values aren't as critical, and can usually be left alone. On OSX/Darwin I get roughly a 2X speedup across all three algorithms.

Questions:
1) Is it a good idea to automatically detect the OS, and for Windows set MAX_SPLITS to 16, and for all other OSes set it to 10? Or should I just leave it always 10 and let the user modify it if they are performance-conscious, using either the tune.pl script or manually?
2) Does anyone have any issues with this library? Besides comments on MAX_SPLITS, I think this is done.

Steve

I'm using #ifdef MSVC to set MAX_SPLITS to a different value on Windows MSVC. I'm currently working on making my samples run cleanly without warnings on Windows. fopen gives warnings in MSVC (see below), and I've run into file I/O errors on Windows (which do not occur on Linux). Here are some Windows compile warnings with fopen:

samples\alreadysorted.cpp(34) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files (x86)\Microsoft Visual Studio 8\VC\INCLUDE\stdio.h(234) : see declaration of 'fopen'
samples\alreadysorted.cpp(65) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files (x86)\Microsoft Visual Studio 8\VC\INCLUDE\stdio.h(234) : see declaration of 'fopen'
samples\alreadysorted.cpp(67) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files (x86)\Microsoft Visual Studio 8\VC\INCLUDE\stdio.h(234) : see declaration of 'fopen'

What is the best platform-independent file I/O approach to use for writing and reading bytes? ofstream writes 4031 bytes when I tell it to write 4000 on Windows, so that doesn't seem appropriate.

On Wed, Feb 11, 2009 at 7:48 PM, Michael Marcin <mike.marcin@gmail.com> wrote:
Steven Ross wrote:
Questions:
1) Is it a good idea to automatically detect the OS, and for windows set MAX_SPLITS to 16, and for all other OSes set it to 10?
Yes
-- Michael Marcin

AMDG Steven Ross wrote:
I'm using #ifdef MSVC to set MAX_SPLITS to a different value on Windows MSVC.
I'm currently working on making my samples run cleanly without warnings on Windows. fopen gives warnings in MSVC (see below), and I've run into file I/O errors on Windows (which do not occur on Linux). Here are some Windows compile warnings with fopen:
samples\alreadysorted.cpp(34) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files (x86)\Microsoft Visual Studio 8\VC\INCLUDE\stdio.h(234) : see declaration of 'fopen' samples\alreadysorted.cpp(65) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files (x86)\Microsoft Visual Studio 8\VC\INCLUDE\stdio.h(234) : see declaration of 'fopen' samples\alreadysorted.cpp(67) : warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. C:\Program Files (x86)\Microsoft Visual Studio 8\VC\INCLUDE\stdio.h(234) : see declaration of 'fopen'
What is the best platform-independent file I/O approach to use for writing and reading bytes? ofstream writes 4031 bytes when I tell it to write 4000 on Windows, so that doesn't seem appropriate.
Feel free to ignore these warnings. In Christ, Steven Watanabe

AMDG Steven Ross wrote:
What is the best platform-independent file I/O approach to use for writing and reading bytes? ofstream writes 4031 bytes when I tell it to write 4000 on Windows, so that doesn't seem appropriate.
You probably need to open the file in binary mode. In Christ, Steven Watanabe
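A minimal sketch of the suggested fix (hypothetical helper and file name): opening with std::ios::binary stops the Windows text-mode layer from expanding '\n' bytes to "\r\n", which is exactly how a 4000-byte write can land on disk as 4031 bytes.

```cpp
#include <fstream>
#include <vector>

// Write n bytes (all '\n', the worst case for text-mode translation) in
// binary mode, then reopen and report the on-disk size. In binary mode
// the size equals n on every platform; without std::ios::binary, Windows
// would inflate it by one byte per '\n'.
long write_and_measure(const char* path, std::size_t n) {
    std::vector<char> buf(n, '\n');
    {
        std::ofstream out(path, std::ios::out | std::ios::binary);
        out.write(&buf[0], static_cast<std::streamsize>(buf.size()));
    }  // close the file before measuring
    std::ifstream in(path, std::ios::in | std::ios::binary | std::ios::ate);
    return static_cast<long>(in.tellg());
}
```

The same flag applies on the read side: reading a binary file without std::ios::binary would collapse "\r\n" pairs and return fewer bytes than are on disk.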

Thanks Steve, that was the problem. Oddly, now that I've resolved these file I/O issues, MSVC works best with similar constants to gcc, and it is slightly faster. Now, on MSVC I see a 27% speedup for integer_sort, 6X speedup for float_sort, and 136% speedup for string_sort (relative to std::sort).

The new version is in the boost vault, named algorithm_sorting.zip. As this zip file was built on my new (purchased for this project) Windows system, it no longer has those annoying MacOS tag files in it. I verified it works in gcc with my old mac system.

To make the tuning script cross-platform, I send undesired output to a .tunelog file (/dev/null doesn't work on windows), I have duplicate rm/del commands, and I modify the user's path to have "." in it for Windows (then remove it when the script finishes). For some reason I haven't figured out, it leaves a time.txt file in the directory instead of deleting it on Windows, but I don't think that's a big deal.

I'm currently rerunning performance tests (trying to see if I can eliminate the #ifdef MSVC), and once that is done and I've made some graphs for Windows performance, I should be ready for a formal submission.

Steven Ross

On Sat, Apr 18, 2009 at 1:24 PM, Steven Watanabe <watanabesj@gmail.com> wrote:
AMDG
Steven Ross wrote:
What is the best platform-independent file I/O approach to use for writing and reading bytes? ofstream writes 4031 bytes when I tell it to write 4000 on Windows, so that doesn't seem appropriate.
You probably need to open the file in binary mode.
In Christ, Steven Watanabe
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Thanks for the tip Steve, but I'm not sure how to tell for sure that I'm in Windows inside PERL. The .tunelog works for now, without adding excessive clutter.

I've finished a set of default constants that incurs a roughly 5% performance penalty on Windows, and roughly a 10% penalty versus optimal settings on MacOSX. That avoids the hassle of users needing to tune for themselves, and I figure it's good enough. I have a final draft up on the boost vault named algorithm_sorting.zip. These were my findings:

On Mac OSX, I get a 60% to 3X speedup on large data sets, depending on the problem type. Floats and integers sorted at comparable speeds.

On Windows: integer_sort is about 20% faster than std::sort, but integer_sort on already-sorted data takes 3X as long to run as std::sort. The pattern of these differences, where integer_sort is roughly 50% faster on highly random data but (relatively) slower on low-randomness data, suggests that std::sort on Windows has some form of optimization for mostly-sorted data. Worst-case performance is 50% better for integer_sort than std::sort. integer_sort has the better average and worst-case performance, but std::sort has superior best-case performance. Sorting an integer key plus string data, integer_sort is 3.5X as fast as std::sort; this is because integer_sort uses asymptotically fewer swaps, and swaps take longer the larger the data type.

float_sort is roughly 7X faster than std::sort, but this follows a similar pattern to integer_sort: the worst case for float_sort is 13X as fast as std::sort on the same case, and float_sort is faster on all tests except the one where all data elements are identical, the best case for both algorithms. Because of this huge difference in performance (which I believe is due to extremely slow float comparisons on Windows), the ideal LOG_CONST for float_sort on Windows is probably 1 (where on Mac OS it's about the same as for integer_sort, the default of 4).
I found that floating-point sorting with std::sort on a new Windows system took as much as 50X longer than on an old Macintosh Altivec (G4) laptop, so it's not so much that float_sort is fast as that std::sort is slow. string_sort is roughly 2.3X faster than std::sort. This doesn't seem to follow any particular pattern; it's consistently faster. Comments and suggestions are welcome. I should be ready for a formal submission soon. Steven Ross On Sun, Apr 19, 2009 at 6:53 AM, Steven Watanabe <watanabesj@gmail.com> wrote:
AMDG
Steven Ross wrote:
To make the tuning script cross-platform, I send undesired output to a .tunelog file (/dev/null doesn't work on windows)
On windows you can use >nul.
In Christ, Steven Watanabe
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

In reply to Steve Watanabe: Thanks. I'd already tried bjam, but what I figured out is that a link error: /usr/bin/ld: unknown flag: --start-group is preventing the executable from being built, so the tests don't run. So instead I link the .o file that was built by bjam, run it, and it runs my tests. This ought to just work. I presume that you are using gcc. What OS? I am using gcc 4.0.1 on Mac OSX 10.4 (PPC). I attempted to install gcc 4.3.2, and it failed, even though I added the libraries it stated it was dependent upon. Apple only provides binaries past gcc 4.0 for 10.5, so I'm inclined to believe that 10.4 is missing necessary support in the OS for the latest gcc libraries. On Wed, Jan 14, 2009 at 9:51 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
<http://www.boostpro.com/vault/index.php?action=downloadfile&filename=algorithm_sorting.tar.gz&directory=&>Some first impressions follow. I will try to find time to look at it more carefully at some point but I can't promise.
You seem to have quite a lot of code that's duplicated for the functor and non-functor versions. Can't this be eliminated by using e.g. std::less as a default for the compare template parameter? (Isn't this what std::sort does?)
In the version of the STL I downloaded to look at, there is the functor version of std::sort and the non-functor version. The code is entirely duplicated at all levels where comparisons are used, because the functor adds overhead (it takes up space on the stack, if nothing else). This overhead is small (around 2-5%), but measurable. Considering how much effort I went to squeezing out every 1% improvement I could, I would consider it wasteful to accept that kind of penalty just to reduce code size in the library. I've combined functions between the functor and non-functor versions where I could without sacrificing measurable performance. My philosophy is that I (as the library developer) go to the effort to make things as fast and flexible as possible, so that those who include the library don't have to worry about any of that.
I suggest that you get rid of the tab characters in the source. I personally find your lines rather long, but that's a matter of taste.
I'll make sure to remove the ones I missed, thanks for the reminder. I'll also shorten lines over 80 characters.
I think you can use things like typename RandomAccessIter::value_type instead of passing *first in order to determine data_type.
I tried that, but it didn't compile on my system (MacOSX PPC gcc4.0.1). *first works just the same, though I agree, ::value_type is prettier. I'll see if I can find out how to make it work on my system, and thus hopefully everybody's system, but no guarantees.
I'm not sure how useful your functor-versions are since they still seem to depend on the "integer-ness" or "float-ness" of the sorted type. Apart from sorting backwards, what use cases are there? Using functors would be more useful if it let you sort types that are neither integers nor floats; for example, fixed-point types or pointers-to-integers or structs containing a key field to be sorted on.
Did you look at my KeyPlusData sample or the index.html?
integer_sort can sort anything with an integer or integer-convertible key,
float_sort can sort anything with a floating-point key, and
string_sort can sort anything with a string key, including (though not tested yet) wide character strings.
That's what the functor versions are there for: sorting classes or structures based upon a key. All you have to do is define the correct functor. I believe that most single-key data types can be sorted by one of these three functions and the appropriate functors. Fixed-point types can sort as an integer of the same size in bytes. The only real requirements are that the shift return a data type that, when subtracted from an item of the same type, returns something convertible to a size_t; that if (b - a) < (c - a) then b < c; and that bitshifting works as division by a power of two would, reducing the difference by a power of two per shift. Pointers-to-integers need a shift functor: *x >> offset; and a compare functor: *x < *y. KeyPlusData shows an example of a struct with a key field.
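As an illustrative sketch (names hypothetical, not taken from the library's samples), the functor approach described above for sorting a struct on an integer key might look like:

```cpp
// Hypothetical struct with an integer key plus payload, sorted via a shift
// functor (extracts bits of the key) and a comparison functor (fallback).
struct DataType {
    int key;
    // ... payload members ...
};

struct rightshift {
    int operator()(const DataType &x, unsigned offset) const {
        return x.key >> offset;
    }
};

struct lessthan {
    bool operator()(const DataType &x, const DataType &y) const {
        return x.key < y.key;
    }
};

// Usage would resemble:
// integer_sort(vec.begin(), vec.end(), rightshift(), lessthan());
```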
I think that the key here is that you make assumptions about the return type of operator>> or the right_shift functor which are not true in those cases. It may be that you just need to rename that functor and document what it needs to do. I think it's really "get N bits from key". I would have thought that the floating-point algorithm would just be an instance of the generic algorithm with custom functors, with a bit of specialisation to cope with the reverse-order-for-negatives thing.
Positive floats can be sorted just like integers, with the right functor. Negative floats have to be sorted backwards. I didn't see any efficient way to write a functor to handle negative floats. Hence, a mostly separate algorithm, but positive float sorting shares code with integer_sort. The nice thing about how I wrote float_sort is that it sorts floats almost as fast as integers (I see differences of under 10%).
One of the things that your string sort does is to not compare known-equal initial substrings. Does anyone know if there is a variant of a conventional sort (e.g. quicksort) that does this?
I'm pretty sure that's a radix-sorting trick; it's incompatible with only taking a comparison function, and requires paying attention to the radix index. Plenty of radix sorts do it. The speedup I attained was on the order of 5%; more than enough to be worth doing, but not a huge gain. String comparisons are cache-friendly, so they're still fast relative to swap operations on my test systems.
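The prefix-skipping idea can also be expressed in comparison-based form (a hedged illustration, not the library's code): once every string in a bucket is known to share its first char_offset characters, any fallback comparison can begin at char_offset instead of 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Fallback comparison that skips a known-equal prefix. Assumes every string
// is at least char_offset characters long (or exactly the shared prefix),
// which the radix recursion would guarantee.
void sort_skipping_prefix(std::vector<std::string> &v, std::size_t char_offset) {
    std::sort(v.begin(), v.end(),
              [char_offset](const std::string &a, const std::string &b) {
                  return a.compare(char_offset, std::string::npos,
                                   b, char_offset, std::string::npos) < 0;
              });
}
```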
Some more documentation of the performance benefit would be useful. How about some graphs?
Okay. It will vary substantially from system to system (such as my odd 28X float_sort speedup on Cygwin, where on my system it's roughly 2X as fast, or the minimal speed improvement of string_sort on Cygwin, vs 2X on my system), across different compilers, and depending upon the data distribution. I could graph runtime vs. file size on evenly distributed random data on multiple platforms for each of the 3 algorithms relative to std::sort. Does that sound like what you're looking for? My experience is that Spreadsort tends to be a little faster on unevenly distributed data than evenly distributed data, as it starts bucketsorting; my performance tests include various distributions, but they take a long time to run and it would be impractical and probably confusing to the reader to graph their results. I'll look at std::numeric_limits<>::is_iec559 and see if that does what I want.

AMDG Steven Ross wrote:
In reply to Steve Watanabe: Thanks. I'd already tried bjam, but what I figured out is that a link error: /usr/bin/ld: unknown flag: --start-group is preventing the executable from being built, so the tests don't run. So instead I link the .o file that it was built by bjam, run it, and it runs my tests. This ought to just work. I presume that you are using gcc. What OS? I am using gcc 4.0.1 on Mac OSX10.4 (PPC). I attempted to install gcc4.3.2, and it failed, even though I added the libraries it stated it was dependent upon. Apple only provides binaries past gcc4.0 for 10.5, so I'm inclined to believe that 10.4 is missing necessary support in the OS for the latest gcc libraries.
Ok. This is an annoying problem with Boost.Build that keeps coming up. Add toolset=darwin and it ought to work. Also, did you get a message like this: warning: No toolsets are configured. warning: configuring default toolset "gcc" ... ? If so, the default toolset should be darwin--which is an easy fix. In Christ, Steven Watanabe

Steven Ross wrote:
You seem to have quite a lot of code that's duplicated for the functor and non-function versions. Can't this be eliminated by using e.g. std::less as a default for the compare template parameter? (Isn't this what std::sort does?)
In the version of the STL I downloaded to look at, there is the functor version of std::sort and the non-functor version. The code is entirely duplicated at all levels where comparisons are used, because the functor adds overhead (it takes up space on the stack, if nothing else). This overhead is small (around 2-5%), but measurable.
OK. I would hope that a class with no state in it would take no space, but I suppose compilers may be dumb about that sometimes.
My philosophy is that I (as the library developer) go to the effort to make things as fast and flexible as possible, so that those who include the library don't have to worry about any of that.
That's a good attitude, but there's also the "library reviewer" to factor in :-)
I think you can use things like typename RandomAccessIter::value_type instead of passing *first in order to determine data_type.
I tried that, but it didn't compile on my system (MacOSX PPC gcc4.0.1).
That's surprising. Maybe you could distil it down to a minimal example and we can all scratch our heads over it.
I'm not sure how useful your functor-versions are since they still seem to depend on the "integer-ness" or "float-ness" of the sorted type. Apart from sorting backwards, what use cases are there? Using functors would be more useful if it let you sort types that are neither integers nor floats; for example, fixed-point types or pointers-to-integers or structs containing a key field to be sorted on.
Did you look at my KeyPlusData sample or the index.html?
Ah sorry, I didn't find index.html; I found README and assumed that was the limit of the documentation...
That's what the functor versions are there for: sorting classes or structures based upon a key. All you have to do is define the correct functor.
I still believe that "rightshift" is the wrong name for this functor in these cases.
Fixed-point types can sort as an integer of the same size in bytes. The only real requirement is that they must return a data type that when subtracted from an item of the same type returns something convertible to a size_t, that if (b - a) < (c - a) then b < c, and that bitshifting works as division by a power of two would, reducing the difference by a power of two per shift.
Can you clarify that a bit? Say I have this:

class fixed {
    ....
    some_integer_type impl;
    ....
    bool operator<(const fixed& other) const;
    fixed operator>>(unsigned int shift_mt) const;
};

What do I need to do?
I would have thought that the floating-point algorithm would just be an instance of the generic algorithm with custom functors, with a bit of specialisation to cope with the reverse-order-for-negatives thing.
Positive floats can be sorted just like integers, with the right functor. Negative floats have to be sorted backwards. I didn't see any efficient way to write a functor to handle negative floats.
Can't you just pass a "greater" functor instead of "less"?
I'll look at std::numeric_limits<>::is_iec559 and see if that does what I want.
It looks promising to me! Phil.

AMDG Phil Endecott wrote:
I think you can use things like typename RandomAccessIter::value_type instead of passing *first in order to determine data_type.
I tried that, but it didn't compile on my system (MacOSX PPC gcc4.0.1).
That's surprising. Maybe you could distil it down to a minimal example and we can all scratch our heads over it.
Shouldn't it be typename std::iterator_traits<RandomAccessIter>::value_type? In Christ, Steven Watanabe
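A minimal sketch of this suggestion: std::iterator_traits works uniformly for class-type iterators and for raw pointers, where typename RandomAccessIter::value_type only compiles for class types.

```cpp
#include <cstddef>
#include <iterator>

// Determine the value type without dereferencing *first. Works for both
// std::vector<T>::iterator and a plain T*, because iterator_traits is
// specialized for pointers.
template <class RandomAccessIter>
std::size_t value_type_size(RandomAccessIter) {
    typedef typename std::iterator_traits<RandomAccessIter>::value_type data_type;
    return sizeof(data_type);
}
```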

On Wed, Jan 14, 2009 at 17:48, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
I would hope that a class with no state in it would take no space, but I suppose compilers may be dumb about that sometimes.
Doesn't the standard require that it have a unique address, essentially preventing that?
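For what it's worth, the standard does give every complete object a nonzero size, but the empty base optimization lets a stateless functor cost nothing when stored as a base class rather than a member; a hedged sketch (illustrative names):

```cpp
// An empty (stateless) comparison functor.
struct less_functor {
    bool operator()(int a, int b) const { return a < b; }
};

// Stored as a member, the empty functor occupies at least one byte,
// typically padded out to a full word.
struct holds_member {
    less_functor comp;
    int *data;
};

// Stored as a base class, the empty base optimization usually makes it free.
struct holds_base : less_functor {
    int *data;
};
```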
I think you can use things like typename RandomAccessIter::value_type instead of passing *first in order to determine data_type.
I tried that, but it didn't compile on my system (MacOSX PPC gcc4.0.1).
That's surprising. Maybe you could distil it down to a minimal example and we can all scratch our heads over it.
Maybe you can use the MPL metafunction version, with its possible workarounds? typename boost::iterator_value<RandomAccessIter>::type

On Wed, Jan 14, 2009 at 2:48 PM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
That's what the functor versions are there for: sorting classes or
structures based upon a key. All you have to do is define the correct functor.
I still believe that "rightshift" is the wrong name for this functor in these cases.
It's an arithmetic right shift, or alternatively division by a power of 2. What do you want to call it?
Fixed-point types can sort as an integer of the same size in bytes. The
only real requirement is that they must return a data type that when subtracted from an item of the same type returns something convertible to a size_t, that if (b - a) < (c - a) then b < c, and that bitshifting works as division by a power of two would, reducing the difference by a power of two per shift.
Can you clarify that a bit? Say I have this:
class fixed {
    ....
    some_integer_type impl;
    ....
    bool operator<(const fixed& other) const;
    fixed operator>>(unsigned int shift_mt) const;
};
What do I need to do?
Your < operator is fine, and usable without modification. If your - operator returns something castable to a size_t, then you don't need any functors; that will do (return impl - right.impl;). If your - operator doesn't return something castable to a size_t, then you need to define a rightshift functor that does. How about:

struct rightshift {
    some_integer_type operator()(const fixed &x, unsigned offset) {
        return x.impl >> offset;
    }
};
I would
have thought that the floating-point algorithm would just be an instance of the generic algorithm with custom functors, with a bit of specialisation to cope with the reverse-order-for-negatives thing.
Positive floats can be sorted just like integers, with the right functor. Negative floats have to be sorted backwards. I didn't see any efficient way to write a functor to handle negative floats.
Can't you just pass a "greater" functor instead of "less"?
Yes, but that would hurt efficiency.

Steven Ross wrote:
On Wed, Jan 14, 2009 at 2:48 PM, Phil Endecott wrote:
I still believe that "rightshift" is the wrong name for this functor in these cases.
It's an arithmetic right shift, or alternatively division by a power of 2.
No, not in general, e.g.:
class fixed {
    ....
    some_integer_type impl;
    ....
    bool operator<(const fixed& other) const;
    fixed operator>>(unsigned int shift_mt) const;
};
What do I need to do?
you need to define a rightshift functor that does. How about:

struct rightshift {
    some_integer_type operator()(const fixed &x, unsigned offset) {
        return x.impl >> offset;
    }
};
This is not performing a right shift on the fixed value itself but rather on the integer that it uses as its implementation. Hence my preference for another name. Maybe you consider this too subtle to worry about. Maybe it is. Phil.

On Thu, Jan 15, 2009 at 9:12 AM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
you need to define a rightshift functor that does. How about:
struct rightshift {
    some_integer_type operator()(const fixed &x, unsigned offset) {
        return x.impl >> offset;
    }
};
This is not performing a right shift on the fixed value itself but rather on the integer that it uses as its implementation. Hence my preference for another name. Maybe you consider this too subtle to worry about. Maybe it is.
It is not performing a right shift on the object, only on the underlying data, but if the class did support casting to an integer, then its >> operator would suffice. So, it is a replacement for an arithmetic right shift, and the user may write it any way they want that provides an even division by two, rounded consistently. It is also critical that this method be fast, as it is used in multiple loops of the integer_sort algorithm, so the user should be encouraged to use a fast operator (not division). Can you think of a better name for this than rightshift?

Earlier I wrote:
One of the things that your string sort does is to not compare known-equal initial substrings. Does anyone know if there is a variant of a conventional sort (e.g. quicksort) that does this?
I've had a quick play and come up with this: http://svn.chezphil.org/libpbe/trunk/include/string_qsort.hh Of course the benefit depends on how many long common initial substrings there are. Testing with the output of "find / -print" (i.e. the pathnames of every file on my system, which will have lots of common initial substrings) it is a bit slower than std::sort. So the advantage of reducing the number of character comparisons [and that reduction is substantial] does not make up for the absence of whatever other cleverness std::sort does that I haven't replicated. But I thought I'd post it anyway, in case anyone is interested or can see how it could be improved. Phil.

On Wed, Jan 14, 2009 at 2:27 PM, Phil Endecott < spam_from_boost_dev@chezphil.org> wrote:
http://svn.chezphil.org/libpbe/trunk/include/string_qsort.hh
Of course the benefit depends on how many long common initial substrings there are. Testing with the output of "find / -print" (i.e. the pathnames of every file on my system, which will have lots of common initial substrings) it is a bit slower than std::sort. So the advantage of reducing the number of character comparisons [and that reduction is substantial] does not make up for the absence of whatever other cleverness std::sort does that I haven't replicated. But I thought I'd post it anyway, in case anyone is interested or can see how it could be improved. <http://lists.boost.org/mailman/listinfo.cgi/boost>
Out of curiosity, did you test string_sort on the same example? string_sort guarantees a reduction of at least 1 character in length every iteration, which a comparison-based algorithm will not do. Also, you should try insertion sort on small data sets (size 16 or so); that should give your test algorithm a little speed boost.
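The small-data-set suggestion above can be sketched like this (threshold and names illustrative):

```cpp
#include <algorithm>

// Insertion sort via rotate: for ranges below a small threshold (the
// "size 16 or so" mentioned above), this typically beats recursing further,
// since it is cache-friendly and has almost no overhead on tiny inputs.
template <class Iter>
void insertion_sort(Iter first, Iter last) {
    for (Iter i = first; i != last; ++i) {
        // Move *i back to its sorted position within [first, i).
        std::rotate(std::upper_bound(first, i, *i), i, i + 1);
    }
}
```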

AMDG Steven Ross wrote:
algorithm_sorting.tar.gz is now in the boost vault, and is formatted for submission. I'd appreciate it if people could test it on their compiler and give me some feedback. It's not quite ready for formal review yet; I'd like a test on MSVC++ first.
The tests pass with msvc 8.0 express and msvc 9.0 express.
If you have run mytest.cpp ; (e.g.) in the Jamfile
you can run this test with
bjam mytest
Thanks. I'd already tried bjam, but what I figured out is that a link error: /usr/bin/ld: unknown flag: --start-group is preventing the executable from being built, so the tests don't run. So instead I link the .o file that it was built by bjam, run it, and it runs my tests.
This ought to just work. I presume that you are using gcc. What OS?
1) I have a perl tuning/regression script that I've used to verify correctness, performance, and tune some tuning constants. One of these tuning constants (MAX_SPLITS) has a 10+% impact on performance, because it impacts cache misses and how many recursive stages are executed. 10-12 is the range of ideal values for MAX_SPLITS on Intel Processors, and 10-11 is ideal on my old Motorola PowerPC processor. The penalty for too large a value is prohibitive (20+%), where the penalty for a value 1 too small is generally less than 10%. Based upon that, picking a value of 10 or 11 should be decent across most systems. For ideal performance, this should probably be configured to the processor it is run on. The other 3 constants have much less impact and can reasonably have the same value across all systems. Should I include the tune.pl script with the library (and if so, where, libs?), or should I do all the performance tuning myself?
If you include it, it should go under libs. My inclination would be to include the script even if you do have good default values.
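For context, MAX_SPLITS bounds how many bits are processed per radix pass, i.e. how many bins are created; a hedged sketch of the relationship (not the library's code), which is why too-large values pay a prohibitive cache-miss penalty:

```cpp
#include <cstddef>

// MAX_SPLITS caps bits-per-pass; 2^MAX_SPLITS bins must fit comfortably
// in cache. 11 is used here as the cross-platform default discussed above.
const unsigned MAX_SPLITS = 11;

// Number of bins for one radix pass over a key range of log_range bits.
inline std::size_t bin_count(unsigned log_range) {
    unsigned splits = (log_range < MAX_SPLITS) ? log_range : MAX_SPLITS;
    return static_cast<std::size_t>(1) << splits;
}
```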
2) Should I have a separate constants.hpp to contain these constants (as now), or define them directly in the spreadsort.hpp source file? If I define them directly in the source file, tune.pl won't work, but it makes it easier for users to just copy spreadsort.hpp on its own.
I would go for the separate header, but it's up to you.
3) Should I use configuration information to set the constants , or just pick some decent values and stick with them?
It's up to you. What exactly do you mean by configuration information?
4) Edouard is writing a separate parallel_sort, and our intention is that it can take a separate sorting algorithm (such as integer_sort etc.) as an argument. Are we better off submitting our source for final review separately or together? They're intended to be in the same library.
It doesn't make a difference to me... In Christ, Steven Watanabe

Steven, Your subject line was pretty much designed not to get the attention of the people who can help you ;-) on Tue Jan 13 2009, Steven Ross <spreadsort-AT-gmail.com> wrote:
How do I run Boost's automated testing on a single unit test,
$ cd directory/containing/test/Jamfile/or/Jamfile.v2 $ bjam name-of-test
and how do I create a new test directory in BOOST_ROOT/libs/algorithm/sorting such that it gets executed?
Integrate it into $BOOST_ROOT/status/Jamfile.v2 similarly to how the other test suites are integrated.
If someone could just explain how I run libs/algorithm/string's automated tests,
$ cd $BOOST_ROOT/libs/algorithm/string/test/ $ bjam
I'd be quite happy to copy how its tests are written, and modify them to test sorting correctness. I've spent over 7 hours so far trying to figure this out, and it's starting to get ridiculous. I've read through the documentation of bjam and the Boost test library and they don't seem to explain this.
Boost.Test has basically nothing to do with the question, and the bjam docs won't help you because bjam is a low-level build tool. You need the Boost.Build docs. Boost.Build is a build system that uses bjam. HTH, -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Giovanni Piero Deretta wrote:
To: boost@lists.boost.org Sent: Tuesday, December 16, 2008 7:19:52 PM Subject: Re: [boost] Proposed templated integer_sort
On Tue, Dec 16, 2008 at 3:03 PM, Pyry Jahkola wrote:
Gevorg Voskanyan wrote:
I suggest modifying CastFloatIter() function to use memcpy for accessing floats as integers, as shown below:
Hmm, I don't think it's any better than reinterpret casting.
Yes it is. It has the nice property of not leading to UB. And the result is well defined (the bitwise representation of the source is copied into the destination).
(It actually does the reinterpret cast when converting float const * to void const *...)
Not really.
I'd suggest using a union instead, for the purpose of treating a float as integer:
template <typename To, typename From>
inline To bitwise_cast(From const & from) {
    BOOST_STATIC_ASSERT(sizeof(To) == sizeof(From));
    union { To to; From from; } u;
    u.from = from;
    return u.to;
}
This is also undefined behavior even if it is explicitly supported by most compilers as an extension. It is not necessarily better than a memcpy and might actually be slower.
-- gpd
Thanks Giovanni, I could not reply any better. Best Regards, Gevorg
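A memcpy-based version along the lines Gevorg suggested might look like this (a sketch; decent optimizers typically turn the memcpy into a single register move):

```cpp
#include <cstdint>
#include <cstring>

// Well-defined alternative to reinterpret_cast type punning: copy the
// float's bytes into a same-sized integer. The bit pattern is preserved,
// and no aliasing rules are violated.
inline std::int32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "size mismatch");
    std::int32_t result;
    std::memcpy(&result, &f, sizeof(result));
    return result;
}
```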

on Tue Dec 16 2008, "Giovanni Piero Deretta" <gpderetta-AT-gmail.com> wrote:
On Tue, Dec 16, 2008 at 3:03 PM, Pyry Jahkola <pyry.jahkola@gmail.com> wrote:
Gevorg Voskanyan wrote:
I suggest modifying CastFloatIter() function to use memcpy for accessing floats as integers, as shown below:
Hmm, I don't think it's any better than reinterpret casting.
Yes it is. It has the nice property of not leading to UB.
If so, that surprises me. Is it really guaranteed that all valid floating-point bit patterns are also valid integer bit patterns? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Tuesday 16 December 2008 16:17:10 David Abrahams wrote:
on Tue Dec 16 2008, "Giovanni Piero Deretta" <gpderetta-AT-gmail.com> wrote:
On Tue, Dec 16, 2008 at 3:03 PM, Pyry Jahkola <pyry.jahkola@gmail.com> wrote:
Gevorg Voskanyan wrote:
I suggest modifying CastFloatIter() function to use memcpy for accessing floats as integers, as shown below:
Hmm, I don't think it's any better than reinterpret casting.
Yes it is. It has the nice property of not leading to UB.
If so, that surprises me. Is it really guaranteed that all valid floating-point bit patterns are also valid integer bit patterns?
Not true; I am aware of at least one embedded platform where int is 24 bits whereas float on the same platform is 32 bits. As far as I can tell, the port of gcc that is used on this platform works fine for C; don't know about C++. Regards, Ravi

On Tue, Dec 16, 2008 at 18:29, Ravi <lists_ravi@lavabit.com> wrote:
On Tuesday 16 December 2008 16:17:10 David Abrahams wrote:
on Tue Dec 16 2008, "Giovanni Piero Deretta" <gpderetta-AT-gmail.com> wrote:
On Tue, Dec 16, 2008 at 3:03 PM, Pyry Jahkola <pyry.jahkola@gmail.com> wrote:
Gevorg Voskanyan wrote:
I suggest modifying CastFloatIter() function to use memcpy for accessing floats as integers, as shown below:
Hmm, I don't think it's any better than reinterpret casting.
Yes it is. It has the nice property of not leading to UB.
If so, that surprises me. Is it really guaranteed that all valid floating-point bit patterns are also valid integer bit patterns?
Not true; I am aware of at least one embedded platform where int is 24 bits whereas float on the same platform is 32 bits. As far as I can tell, the port of gcc that is used on this platform works fine for C; don't know about C++.
Given the static assert, it's still safe, since not compiling prevents UB. As for the bit patterns, I can't think of any bit patterns that cannot be created by legal <<ing and ++ing, so I'm not convinced that invalid integer bit patterns exist (at least in unsigned types -- I don't know what ones-complement signed ints do with ~0, for example). That said, how is it used in practise? Aliasing allows reading through an unsigned char*, the size of which is always a factor of the size of any other type, so would it be better to implement everything to always go through that? I know naïve radix sorts will often go CHAR_BIT bits at a time to take advantage of that, but I figure the better implementation may well need something more complex.

On Tue, Dec 16, 2008 at 3:42 PM, Scott McMurray <me22.ca+boost@gmail.com>wrote:
As for the bit patterns, I can't think of any bit patterns that cannot be created by legal <<ing and ++ing, so I'm not convinced that invalid integer bit patterns exist (at least in unsigned types -- I don't know what ones-complement signed ints do with ~0, for example).
An IEEE floating-point number maps nicely to a signed int of the same size for the purposes of sorting. The main issue is that negative floats increase in the opposite order from the integer with the same bits, so I sort them in reverse (if you look at the source). Positive floats sort in the same order as the positive integer with the same bit representation. -0 ends up being less than 0 this way, though for std::sort they are equal. NaNs sort in a definite order during the radix-sorting portion, unlike std::sort, which dumps them arbitrarily throughout the resulting file. I assume that developers won't mind these discrepancies for -0 and NaNs. This trick is commonly used by radix sorting implementations, and is especially valuable because floating-point compares are inefficiently implemented on some platforms. I should note that I named the operation "casting_float_sort" because I was concerned that not all theoretical platforms may support the cast. If people are concerned about sorting some non-IEEE floating-point number, they can use float_sort, which requires the user to specify the type they cast the float to, or the functor versions, where the user defines the >> operation (where the cast is). All versions of float_sort still assume that negatives need to be sorted in reverse order; it's a method specialized to sorting IEEE floating-point numbers as integers. Anything that can be sorted in a normal fashion based upon an integer return from the >> operator (or equivalent functor) can be sorted by integer_sort.
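The reverse-order-for-negatives handling described here is, in key-transform form, the standard radix-sort trick (a hedged sketch, not the library's implementation): flip all bits of negative floats and just the sign bit of non-negative ones, so the resulting unsigned keys sort in the same order as the floats.

```cpp
#include <cstdint>
#include <cstring>

// Map an IEEE-754 float's bits to an unsigned key whose ascending order
// matches the float's ascending order. Negatives have all bits flipped
// (reversing their order), positives have only the sign bit flipped.
inline std::uint32_t float_key(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```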
That said, how is it used in practise? Aliasing allows reading through an unsigned char*, the size of which is always a factor of the size of any other type, so would it be better to implement everything to always go through that? I know naïve radix sorts will often go CHAR_BIT bits at a time to take advantage of that, but I figure the better implementation may well need something more complex.
I could do that, but I would often need one more iteration than is strictly necessary, and I wouldn't be able to reuse integer_sort to sort the positive floats. I have no reason to believe that going 8 bits at a time would be any faster, unless it can save me a memory operation to work around a casting restriction that isn't enforced on all (any?) platforms.

On Wed, Dec 17, 2008 at 18:40, Steven Ross <spreadsort@gmail.com> wrote:
I could do that, but I would often need one more iteration than is strictly necessary, and I wouldn't be able to reuse integer_sort to sort the positive floats. I have no reason to believe that going 8 bits at a time would be any faster, unless it can save me a memory operation to work around a casting restriction that isn't enforced on all (any?) platforms.
Uh, the casting restriction is "enforced". GCC's optimizer takes advantage of the aliasing rules, so breaking them is a great way for your code to stop working after optimization in GCC. There are warnings for it, too:

    int foo(float f) { return *(int*)&f; }

    aliasing_demo.cxx: In function 'int foo(float)':
    aliasing_demo.cxx:2: warning: dereferencing type-punned pointer will break strict-aliasing rules

And as I mentioned earlier (Dec 16th), decent optimizers convert the memcpy into the bitcast you wanted in the first place, so since there's nothing to save, the 8-bits-at-a-time version does sound like a bad idea.
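A strict-aliasing-safe rewrite of the foo() above would look like the sketch below (foo_safe is an illustrative name). Mainstream optimizers compile the memcpy down to the same single register move as the cast, which is the point being made in this message.

```cpp
#include <cstring>

// Same bit-pattern extraction as foo(), but through memcpy, which is
// well-defined behavior; GCC and Clang optimize it to a plain move.
int foo_safe(float f) {
    static_assert(sizeof(int) == sizeof(float), "assumes 32-bit int and float");
    int i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}
```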

On Tue, Dec 16, 2008 at 10:17 PM, David Abrahams <dave@boostpro.com> wrote:
on Tue Dec 16 2008, "Giovanni Piero Deretta" <gpderetta-AT-gmail.com> wrote:
On Tue, Dec 16, 2008 at 3:03 PM, Pyry Jahkola <pyry.jahkola@gmail.com> wrote:
Gevorg Voskanyan wrote:
I suggest modifying the CastFloatIter() function to use memcpy for accessing floats as integers.
Hmm, I don't think it's any better than reinterpret casting.
Yes it is. It has the nice property of not leading to UB.
If so, that surprises me. Is it really guaranteed that all valid floating-point bit patterns are also valid integer bit patterns?
Well, in general I do not think so. I should have said that the result is implementation defined. But it is defined on most common architectures and their compilers. -- gpd

Steven Ross wrote:
Would there be interest in an integer_sort call that is faster than std::sort? Spreadsort is an in-place hybrid radix-based and comparison-based algorithm with better worst-case and average-case (roughly 20-50%) performance than introsort on random integer data. The worst-case advantage is algorithmic, with Spreadsort taking the better of O(n(square_root(bit_sizeof(data) - log(n)))) and O(nlog(n)) time. For 32-bit integers, that's O(n(square_root(32 - log(n)))), so for a log(n) of 16 (65536 items), it's 4n (plus a small constant times n), vs 16n for introsort. For larger n, the advantage vs. pure comparison-based algorithms increases. In effect, Spreadsort smoothly transitions from pure MSB Radix sort to std::sort with its data cut up by a couple iterations of MSB Radix sort (reducing the size of the log(n) factor), depending upon the distribution and size of the data.
I could write a templated version of Spreadsort for the Boost library if there is interest.
The world needs more generic algorithms, so I'm interested purely on that basis. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

I've attached a .h source file and constant definitions for a generic version of integer_sort that sorts an array; it can also sort a vector via the array-conversion trick &(vec[0]). I will work on making it look and act as much like std::sort as possible, but anyone interested in trying it or making comments can start with this. I uploaded the code to http://sourceforge.net/projects/spreadsort under the generic branch, along with sample source for timing comparisons, a random number generator, and a regression test script (tune.pl), in case anyone would like to see the results for themselves. Once I get this in shape, I would like to confirm that this is something the community would like to have in Boost. Steve

Steven Ross wrote:
For larger n, the advantage vs. pure comparison-based algorithms increases. In effect, Spreadsort smoothly transitions from pure MSB Radix sort to std::sort with its data cut up by a couple iterations of MSB Radix sort (reducing the size of the log(n) factor), depending upon the distribution and size of the data.
The analysis seems a little too optimistic, and appears to rest on the assumption that the input is evenly distributed. To analyze the complexity you need to think like an adversary who knows your algorithm and intentionally creates an input that makes it perform poorly. In the worst case your algorithm would be slower than std::sort: a data set with a cluster of values that all fall within the same bin, plus a handful of outliers that give rise to the unfortunate binning, would cost the runtime of std::sort plus the runtime of one or more recursive stages of profitless binning. So the algorithm should be expected to be faster than std::sort alone for some inputs, but slower for others. Luke
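As a concrete instance of the input Luke describes, one could generate it along these lines (an illustrative sketch; the function name and constants are not from the library or the thread):

```cpp
#include <cstdint>
#include <vector>

// Illustrative adversarial input: two extreme outliers force the first
// binning pass to span the full 32-bit range, while the remaining values
// cluster so tightly that nearly all of them land in a single bin,
// making that pass do little useful splitting.
std::vector<std::uint32_t> clustered_with_outliers(std::size_t n) {
    std::vector<std::uint32_t> v;
    v.reserve(n);
    v.push_back(0u);           // low outlier
    v.push_back(0xFFFFFFFFu);  // high outlier
    for (std::size_t i = 2; i < n; ++i)
        v.push_back(0x80000000u + (i % 256));  // dense mid-range cluster
    return v;
}
```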

On Mon, Jun 30, 2008 at 12:26 PM, Simonson, Lucanus J < lucanus.j.simonson@intel.com> wrote:
Steven Ross wrote:
For larger n, the advantage vs. pure comparison-based algorithms increases. In effect, Spreadsort smoothly transitions from pure MSB Radix sort to std::sort with its data cut up by a couple iterations of MSB Radix sort (reducing the size of the log(n) factor), depending upon the distribution and size of the data.
The analysis seems a little too optimistic and seems to be based upon the assumption that the input is evenly distributed. To analyze the complexity you need to think like an adversary who knows your algorithm and intentionally creates an input that will make it perform poorly. In the worst case your algorithm would be slower than std::sort because a data set with a cluster of values that all fall within the same bin and a handful of outliers that give rise to the unfortunate binning would result in runtime for std::sort + runtime for one or more recursive stages of profitless binning. So, the algorithm should be expected to be faster than std::sort alone for some inputs, but slower for others.
The worst case is actually binning that only marginally splits up the data each iteration, in such a fashion that std::sort must be called on an only slightly reduced list at the end. This corresponds to the square_root(bit_sizeof(T)) plus a constant number of std::sort iterations situation. Because each radix-based sorting iteration is significantly slower than an introsort iteration, this situation may actually be slightly slower than introsort (depending on the system configuration), but the worst-case performance is still nicely bounded.

The situation you describe can actually perform quite well. Under it, the data is fully radix sorted after bit_sizeof(T)/x iterations, where x is the lesser of MAX_SPLITS (usually 9) and log(n). MAX_SPLITS is picked for performance reasons and is not limited in any fundamental way, but even with 9 as the divisor, 64-bit chunky data with outliers as you describe will be fully radix sorted after 8 iterations. It's important to note that Spreadsort will give up and call std::sort on the current bin, for iterations after the first, if log(remaining_range) * LOG_CONST > log(n_before_sort) * log(n_in_bin) (LOG_CONST is normally 3). So for chunky data as you describe: if the data is 64 bits long, we just completed the first iteration, log(remaining_range) = 64 - 9 = 55, log(n_before_sort) = 16, and log(n_in_bin) = 16 (very few outliers), then the comparison is 55 * 3 > 16 * 16, i.e. 165 > 256, which is false, so it would continue radix-sorting all the way to the end. If n were smaller (log(n) of 12 or less), it would stop and call std::sort.
If, on the other hand, the bins only slowly diminished in size, starting with log(n) = 16:

second iteration: 15 (16 x 15 = 240 > 165)
third iteration: 14 (15 x 14 = 210 > 46 x 3 = 138)
fourth iteration: 13 (14 x 13 = 182 > 37 x 3 = 111)
fifth iteration: 12 (13 x 12 = 156 > 28 x 3 = 84)
sixth iteration: 11 (12 x 11 = 132 > 19 x 3 = 57)
seventh iteration: 10 (11 x 10 = 110 > 10 x 3 = 30)
eighth iteration: hits the minimum sort size (9) and calls std::sort on n = 512 chunks.

That's the worst case. Note that if LOG_CONST were larger or n diminished in size faster, std::sort would be called earlier; a LOG_CONST of 5 would significantly improve worst-case performance relative to std::sort for 64-bit integers. Note also that for larger n, the worst-case runtime does not increase. Technically, bit_sizeof(T)/MAX_SPLITS and square_root(bit_sizeof(T)) are both worst-case performance situations; which one is more applicable as "worst" depends on the value of LOG_CONST and the size of n. For large n, bit_sizeof(T)/MAX_SPLITS is accurate, but on the other hand, MAX_SPLITS can be increased to any value up to log(n)/8 with reasonable performance, which is why I don't like to use the bit_sizeof(T)/MAX_SPLITS form. Using a smaller MAX_SPLITS avoids cache misses.
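The give-up test walked through above can be sketched as a small predicate (names are illustrative; the actual implementation lives in the Spreadsort source):

```cpp
// Sketch of the heuristic from the discussion above: give up on radix
// sorting the current bin and call std::sort when
//   log(remaining_range) * LOG_CONST > log(n_before_sort) * log(n_in_bin).
// Returns true when radix sorting should continue.
bool keep_radix_sorting(unsigned log_remaining_range,
                        unsigned log_n_before,
                        unsigned log_n_in_bin) {
    const unsigned LOG_CONST = 3;  // the usual tuning value per the message
    return log_remaining_range * LOG_CONST <= log_n_before * log_n_in_bin;
}
```

Plugging in the 64-bit example from the message: after the first iteration, log(remaining_range) = 55 and both log(n) values are 16, so 165 <= 256 and radix sorting continues; at log(n) = 12 the product drops to 144 and the algorithm falls back to std::sort.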
participants (15)
- Daryle Walker
- David Abrahams
- Gevorg Voskanyan
- Giovanni Piero Deretta
- Johan Råde
- Michael Marcin
- Phil Endecott
- Pyry Jahkola
- Ravi
- Robert Ramey
- Scott McMurray
- Simonson, Lucanus J
- Steven Ross
- Steven Watanabe
- Thorsten Ottosen