
Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
> We generate something along the lines of
> float tmp = 0.f; for(int i ....) tmp += d[i] + e[i];
> for(int i ...) f[i] = b[i] + 3 * c[i] + tmp;

Will NT2 fuse the loops to get rid of the temporary? Does it do strip-mining or other such things (beyond that needed for vectorization)? Does NT2 try to generate a loop nest with the appropriate loops interchanged to improve performance?

Again, these are not criticisms, I'm simply trying to get a better grasp of it. I am really, really interested in this. Abstracting loops for HPC is a really good idea, in my mind.

It would be best if there was an option to leave the resulting loops scalar in case the user wants to try to have the compiler vectorize them.

-Dave
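
To make the questions above concrete, here is a minimal scalar sketch of the two quoted loops, plus a strip-mined form of the second one. It is plain C++ and uses no NT2 API. The function names, the use of std::vector, and the strip width are assumptions made only for illustration, and nothing here is meant to describe what NT2 actually emits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the two quoted loops written out as plain scalar code.
// Assumes all four inputs have the same length.
std::vector<float> eval_two_loops(const std::vector<float>& b,
                                  const std::vector<float>& c,
                                  const std::vector<float>& d,
                                  const std::vector<float>& e)
{
    const std::size_t n = b.size();
    assert(c.size() == n && d.size() == n && e.size() == n);

    // First loop: reduce d + e into a scalar temporary.
    float tmp = 0.f;
    for (std::size_t i = 0; i < n; ++i) tmp += d[i] + e[i];

    // Second loop: element-wise expression that reads the finished reduction.
    std::vector<float> f(n);
    for (std::size_t i = 0; i < n; ++i) f[i] = b[i] + 3 * c[i] + tmp;
    return f;
}

// Hypothetical helper: same computation, with the second loop strip-mined
// into blocks of `strip` iterations (the width is an arbitrary assumption).
std::vector<float> eval_strip_mined(const std::vector<float>& b,
                                    const std::vector<float>& c,
                                    const std::vector<float>& d,
                                    const std::vector<float>& e,
                                    std::size_t strip = 4)
{
    const std::size_t n = b.size();
    assert(c.size() == n && d.size() == n && e.size() == n);
    assert(strip > 0);

    float tmp = 0.f;
    for (std::size_t i = 0; i < n; ++i) tmp += d[i] + e[i];

    std::vector<float> f(n);
    for (std::size_t s = 0; s < n; s += strip) {        // outer loop over strips
        const std::size_t end = std::min(s + strip, n); // last strip may be short
        for (std::size_t i = s; i < end; ++i)           // inner loop within a strip
            f[i] = b[i] + 3 * c[i] + tmp;
    }
    return f;
}
```

As written, the second loop needs the completed value of tmp before it can store any f[i], so these two loops cannot simply be merged into a single pass. Both helpers leave the loops scalar, which is the form a compiler's auto-vectorizer would see.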