
Joel Falcou <joel.falcou@gmail.com> writes:
On 11/06/11 11:17, David A. Greene wrote:
Mathias Gaunard <mathias.gaunard@ens-lyon.org> writes:
Making data parallelism simpler is the goal of NT2. And we do that by removing loops and pointers entirely.
First off, I want to apologize for sparking some emotions. That was not my intent. I am deeply sorry for not expressing myself well.
We all fell into some blatant miscommunication there, I guess ;)
NT2 sounds very interesting! Does it generate the loops given calls into generic code?
Basically yes: you express container-based, semantic-driven code using a Matlab-like syntax (+ more in cases where Matlab doesn't provide anything suitable), and the various evaluation points generate loop nests with properties derived from information carried by the container type and its settings (storage order, data-sharing status, etc.).
This is super-cool! Anything to help the programmer restructure code (or generate the loops correctly in the first place) is a huge win.
The evaluation is then done by forwarding the expression to a hierarchical layer of architecture-dependent meta-programs that, at each step, strip the expression of its important high-level semantic information and help generate the proper piece of code.
You mean machine intrinsics here, yes? This is where I think the compiler might often do better. If the compiler is good. :) It's a little odd that "important" information would be stripped. I know this is not a discussion of NT2, but for the curious, can you explain this? Thanks!
I assume the rest of the discussion concerns a program written with the correct algorithm in terms of complexity, right?
By correct algorithm, you mean an algorithm structured to expose data parallelism? If so, yes, I think that's right.
- Programmer tries to run the compiler on it, examines code
- Code sometimes (maybe most of the time) executes poorly
- If not, done
Yes.
- Programmer restructures loop nest to expose parallelism
  - Try compiler directives first, if available (tell compiler which loops to interchange, where to cache block, blocking factors, which loops to collapse, etc.)
  - Otherwise, hand-restructure (ouch!)
If compilers allow for such information to be carried, yes.
Right. Many don't, and in those cases boost.simd is a great alternative.
- Programmer tries compiler again on restructured loop nest
- Code may execute poorly
- If not, done
Yes
- Programmer adds directives to tell the compiler which loops to vectorize, which to leave scalar, etc.
- Code may still execute poorly
- If not, done
Again, provided such a compiler is available on said platform.
Of course.
- Programmer uses boost.simd to write vector code at a higher level than provided compiler intrinsics
Yes and using a proper range based interface instead of a mess of for loops.
Yep!
Does that seem like a reasonable use case?
Yes. What we failed to clarify is that for a large share of people, the compilers available on their systems provide no way to do steps #2 and #3. And for these people, what they see is a world in which they are on their own dealing with this.
Oh absolutely. But I think that such people should be aware that code generated by boost.simd may not be the best for their hardware implementation IF they have access to a good compiler later. In those cases, though, I suppose replacing, say, pack<float> with float everywhere should get most of the original scalar code back. There may yet be a little more cleanup to do but isn't that the case with _every_ HPC code? :) :) :) -Dave