
22 Jul 2008, 2:40 p.m.
On Sat, Jul 12, 2008 at 11:20 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:

> We want to move element A to the position currently occupied by element B.
> We could swap them, but moving B to A's position is probably not useful.
> So let's first move B to a good position for it, and then move A into its
> space. Of course some other element C will be occupying the position where
> we want to put B, so we have to keep going until we find an element that
> wants to be in the position where A was.
>
> The problem with that is that you might recurse a lot and fill your stack,
> so in practice you'd want to limit it to a few iterations and then give up
> and put a "wrong" value back at the source. Also you need to be certain
> that 'tmp' gets some sort of return value optimisation and isn't copied
> during each return. An explicitly unrolled iterative version would probably
> be best:
>
> Whether it actually improves performance will depend on quite how large the
> structs are. It certainly won't help with ints. So let's see your version
> that sorts structs!

I found that an explicitly inlined swap was as fast as your 3-way swap and multi-level loop (combined) on Intel systems for integers, but since the 3-way swap should be faster with larger data types, and it is faster on my system, I'm running with it.

I found that the primary time-limiting steps are:
1) the random accesses used to find the "remote" element in swapping
2) bin lookups

The problem with a multi-way swap is that it risks wasting its last bin lookup, and it does no fewer bin lookups per item put in the correct place. The random accesses aren't avoided either; the main speedup comes from reducing the processing overhead of the actual copy operation, which is minor with ints. Since I saw a small but tangible speedup with a 3-way swap (but not with 4-way or higher), I ran with it.
I made some other speed enhancements that are possible with a multi-level loop, including splitting the bins into separate arrays of pointers and sizes. It would be nice if I could tell the processor to cache the sizes first, then dump them from the core cache back to a lower-level cache and cache the pointers instead, but I'm not aware of any way to do so. This splitting optimization, in addition to improving runtime, reduces the memory overhead of the algorithm. I also think it's more readable, as the meaning of the sizes doesn't change.

I'm not aware of any other optimizations to try; I've made all of the ones that worked and dropped the ones that didn't. I tried an explicit binary-search-style roughlog2, but found that it actually hurt performance; I think its control structures are much slower than a very simple while loop of at most 32 steps.

If anyone has suggestions for improvement, I'm open to them. Otherwise, I will consider this the final version of the core algorithm and, if it's acceptable, move on to adding functor support, float_sort, and (eventually) string_sort.

I have seen a net 22% speed improvement relative to the first version I posted on my system; a significant portion of this is due to Phil's suggestions, so thanks Phil!

Steve