
On 11/03/11 22:57, Simonson, Lucanus J wrote:
The problems come from load/store, which if you emit too many the compiler generally doesn't optimize them away. You need to somehow manage which varaibles are in register and which are in memory and if you get that wrong it is hard to right that wrong. For one thing alias analysis won't be successful enough to do what the programmer could. For another, when it comes to intrinsics the compiler puts some trust in the programmer's choices and doesn't always try to second guess. Loading and storing memory can get complicated in a great hurry and you generally want to rearrange your algorithm and data structures for more efficient memory access. Just aligning things is a very big deal and it isn't always enough to say "oh, the allocator will take care of that for me." There are a lot of complications once you get into the details of it. The compiler can sometimes do amazing things, but other times produce a really weak effort and it will never replace your algorithm with a better one based on knowledge of the instruction set. There might be good explanations for why the compiler can't emit the code you expected, but that doesn't change the fact that it often dissapoints.
Alignment is one issue, the only thigns it solves it that you can be sure you can mova* instead of movu* without thinking about it. For data rearrangement, this is a whole different level of stuff and we solved it by *not* havign to care at the SIMD DSL level, the upper array-like interface take care of this usign some fusion adaptation and transformation mumbo-jumbo. It is vry crucial that you separate the concerns there. The SIMD DSL should only help you lay down your SIMD data-flow algorithm and help you map it to w/e IS you may have in a platform independent way, provide you tools for iterating SIMD chunk by SIMD chunk over arbitrary memory and have the significant pattern matcher in place for your ISA. Now, all the other algorithmic level transform and other memory layout and access problems have to be solved at an upper level (in w/e abstraction you need matrix, array, big int, image, polynoms etc) because the very core optimizations you may want are expressed in a proper way in the high-level domain, not the low level one. Same goes for unrolling or stripmining loop nests, it is not the job of the SIMD DSL. It is somethign that has uses in soem domain but the other. The SIMD DSL should provide an interface tat make writing such numerical code the same wether it is indeed in SIMD or scalar mode. I really think it is this problem you encountered by trying to make your SIMD DSL do too much at its level, something that ultimately fails. The only such a beast should do is hide the intricacies of setting up load/store from memory into SIMD registers, uniformize function and operators across ISA, provide SIMD based iterator and range travrsal support. It should not do many more but provide, mainly in its pattern matching facility or in its way to define a function, ways to be extended for any new subtle low level combination of operations leading to optimized SIMD block or IS support. See below for the rest of what can be done.
To implement optimal infinite prcision integer arithmetic with lrb vector instructions (or altavec) you have tons of goodies to help you in the instruction set like multiply and store high result and the carry add borrow sub and gather scatter, but if you abstract the instruction set away you don't even get the basics right and all the goodies are out of reach entirely because you need to craft your algorithm to fit the features of the instruction set to make use of them.
At some point, it all depends of the way you did your homework. Having a one-to-one mapping between one SIMD extension instruction to one operator only works so far for classical operations. However, and we agree on this, if IS 1 has a funky 'add-two-vectors-and-take-the-sqrt-of-them' fused operations and not the other, it is rather trivial to have some function like sqrt_sum(a,b) and have it using the intrinsic in ISA1 and do it "by the book" in IS 2 support. In the same way, your pattern matcher will depends on the IS and do the proper contraction if needed. Our stance on this matter was not to say "abstract to the common lowest denominator of features" but try to see all SIMD IS as an aggregate pool of features and fill the gap of IS with less features than others. Our approach was to seggregate functions into "toolbox" with different aims and providing mid-level functionnality that get locally optimized with w/e IS support. Now, w/e the IS, you can call it and depending on what you have it generates the proper algorithm. Now if i was about to rewrite some large integer function or some crypto algorithm, what I will do is look at the proper SIMD way to do it, seekign for the most fused possible intrinsic for the job (for crypto, IIRC SSE4 has some specific stuff for ex., so let's sue them in the sse4 version). And after that, ho and behold, generic programming strikes again: we will probably take a few time to lift/abstract/generalize said algorithm in proper conceptualized series of operations. Said operations will then be mapped into new function in the SIMD DSL (and prolly in a sequential fashion too). Rewrite the algorithm operating on a parametric type T, use said new functions and watch the code generated be either SIMD or scalar depending on the ISs selected. In terms of integer stuff at hand, this means the high level algorithm should be expressed as a combination of lifted concepts where the actual bit twiddling can be aggregated in logical, semantic-rich, functions operating on w/e. Let the SIMD DSL provide such function, implement them as properly needed (use the proper specialized itnrinsic or add pattern replacer) and voila. No need to dumb down the feature set of the SIMD DSL, it will grow naturally as cocnept requiring it and some special stuff will arise from algorthimic analysis, much more like normal code. And int his case, you'll see that these problem of optimality actually lessens as they willbe captured at high level by your concept based analysis. I bet we have a lot of stuff to exchange and argue on the subject as we seems to have taken radically different angle on the matter, I will be glad if we can go over our first premises of arguments and maybe try to see if our respective experiences in some concrete area can't be shared for the greater good of everyone. Especially, if you have actual algorithm you see as something hard that didn't fit your design (and by concrete I mean a snapshot of code not a mere description in words were both of us can misinterpret parts (see the CSE fiasco above :p) so we can look at the problem in our framework and see where problem arise. Hoping to help :)