
Steven Ross wrote:
Are there any more suggestions for the core algorithm?
for each bin { for each element between this bin's current_position and its end { while (1) { determine the bin this element should be in if it's in the right bin { break } else { swap it with the element at the right bin's current_position increment that bin's current_position } } } } I'm considering the case where the type is relatively large and swap takes time proportional to its size, e.g. a large struct. Currently your code (and my version above) will, on average, swap each element twice: once to what's probably the wrong bin and then into the right bin. If swap itself uses a temporary then each element is copied on average 3 times. Maybe the compiler can optimise some of that, and the cache may make some of those copies faster than others. But I guess that it can probably be improved to nearer one copy per element. Here's the idea: We want to move element A to the position currently occupied by element B. We could swap them, but moving B to A's position is probably not useful. So let's first move B to a good position for it, and then move A into its space. Of course some other element C will be occupying the position where we want to put B, so we have to keep going until we find an element that wants to be in the position where A was. It's relatively easy to code a recursive version of this: void circulate(Iter i) { // Do an N-way circular swap that eventually puts the element at i in its right place, // and puts an element that wants to be at i there. *i = circulate_rec(i,i); } T circulate_rec(Iter source, Iter current) { // Move the element at current to its right position, and return an element // that can be stored at source. // Find current's right position: Iter target = right_bin_for(*current).current_position++; // Base case: if the element at target would be happy at position source, that's easy: if (right_bin_for(*target) == bin_of(source)) { tmp = *target; } else { // Otherwise, first move target to a good position tmp = circulate_rec(source,target); } // Now actually copy the element at current to the vacant target *target = *current; return tmp; } The problem with that is that you might recurse a lot and fill your stack, so in practice you'd want to limit it to a few iterations and then give up and put a "wrong" value back at the source. Also you need to be certain that 'tmp' gets some sort of return value optimisation and isn't copied during each return. An explicitly unrolled iterative version would probably be best: Iter a = ...; Iter b = right_bin_for(*a).current_position++; if (right_bin_for(*b) != bin_of(a)) { Iter c = right_bin_for(*b).current_position++; if (right_bin_for(*c) != bin_of(a)) { Iter d = right_bin_for(*c).current_position++; if (right_bin_for(*d != bin_of(a)) { Iter e = right_bin_for(*d); tmp = *e; *e = *d; } else { tmp = *d; } *d = *c; } else { tmp = *c; } *c = *b; } else { tmp = *b; } *b = *a; *a = tmp; Whether it actually improves performance will depend on quite how large the structs are. It certainly won't help with ints. So let's see your version that sorts structs! Phil.