[boost] SIMD EDSL blues (was [XInt] Some after thoughts)

12 Mar 2011

      On 11/03/11 22:57, Simonson, Lucanus J wrote:
...
The problems come from load/store, which if you emit too many the compiler generally doesn't optimize them away.  You need to somehow manage which varaibles are in register and which are in memory and if you get that wrong it is hard to right that wrong.  For one thing alias analysis won't be successful enough to do what the programmer could.  For another, when it comes to intrinsics the compiler puts some trust in the programmer's choices and doesn't always try to second guess.  Loading and storing memory can get complicated in a great hurry and you generally want to rearrange your algorithm and data structures for more efficient memory access.  Just aligning things is a very big deal and it isn't always enough to say "oh, the allocator will take care of that for me."  There are a lot of complications once you get into the details of it.  The compiler can sometimes do amazing things, but other times produce a really weak effort and it will never replace your algorithm with a
   better one based on knowledge of the instruction set.  There might be good explanations for why the compiler can't emit the code you expected, but that doesn't change the fact that it often dissapoints.
Alignment is one issue, the only thigns it solves it that you can be 
sure you can mova* instead of movu* without thinking about it. For data 
rearrangement, this is a whole different level of stuff and we solved it 
by *not* havign to care at the SIMD DSL level, the upper array-like 
interface take care of this usign some fusion adaptation and 
transformation mumbo-jumbo. It is vry crucial that you separate the 
concerns there. The SIMD DSL should only help you lay down your SIMD 
data-flow algorithm and help you map it to w/e IS you may have in a 
platform independent way, provide you tools for iterating SIMD chunk by 
SIMD chunk over arbitrary memory and have the significant pattern 
matcher in place for your ISA. Now, all the other algorithmic level 
transform and other memory layout and access problems have to be solved 
at an upper level (in w/e abstraction you need matrix, array, big int, 
image, polynoms etc) because the very core optimizations you may want 
are expressed in a proper way in the high-level domain, not the low 
level one. Same goes for unrolling or stripmining loop nests, it is not 
the job of the SIMD DSL. It is somethign that has uses in soem domain 
but the other. The SIMD DSL should provide an interface tat make writing 
such numerical code the same wether it is indeed in SIMD or scalar mode.

I really think it is this problem you encountered by trying to make your 
SIMD DSL do too much at its level, something that ultimately fails. The 
only such a beast should do is hide the intricacies of setting up 
load/store from memory into SIMD registers, uniformize function and 
operators across ISA, provide SIMD based iterator and range travrsal 
support. It should not do many more but provide, mainly in its pattern 
matching facility or in its way to define a function, ways to be 
extended for any new subtle low level combination of operations leading 
to optimized SIMD block or IS support. See below for the rest of what 
can be done.
...
To implement optimal infinite prcision integer arithmetic with lrb vector instructions (or altavec) you have tons of goodies to help you in the instruction set like multiply and store high result and the carry add borrow sub and gather scatter, but if you abstract the instruction set away you don't even get the basics right and all the goodies are out of reach entirely because you need to craft your algorithm to fit the features of the instruction set to make use of them.
At some point, it all depends of the way you did your homework. Having a 
one-to-one mapping between one SIMD extension instruction to one 
operator only works so far for classical operations. However, and we 
agree on this, if IS 1 has a funky 
'add-two-vectors-and-take-the-sqrt-of-them' fused operations and not the 
other, it is rather trivial to have some function like sqrt_sum(a,b) and 
have it using the intrinsic in ISA1 and do it "by the book" in IS 2 
support. In the same way, your pattern matcher will depends on the IS 
and do the proper contraction if needed. Our stance on this matter was 
not to say "abstract to the common lowest denominator of features" but 
try to see all SIMD IS as an aggregate pool of features and fill the gap 
of IS with less features than others. Our approach was to seggregate 
functions into "toolbox" with different aims and providing mid-level 
functionnality that get locally optimized with w/e IS support. Now, w/e 
the IS, you can call it and depending on what you have it generates the 
proper algorithm. Now if i was about to rewrite some large integer 
function or some crypto algorithm, what I will do is look at the proper 
SIMD way to do it, seekign for the most fused possible intrinsic for the 
job (for crypto, IIRC SSE4 has some specific stuff for ex., so let's sue 
them in the sse4 version). And after that, ho and behold, generic 
programming strikes again: we will probably take a few time to 
lift/abstract/generalize said algorithm in proper conceptualized series 
of operations. Said operations will then be mapped into new function in 
the SIMD DSL (and prolly in a sequential fashion too). Rewrite the 
algorithm operating on a parametric type T, use said new functions and 
watch the code generated be either SIMD or scalar depending on the ISs 
selected.

In terms of integer stuff at hand, this means the high level algorithm 
should be expressed as a combination of lifted concepts where the actual 
bit twiddling can be aggregated in logical, semantic-rich, functions 
operating on w/e. Let the SIMD DSL provide such function, implement them 
as properly needed (use the proper specialized itnrinsic or add pattern 
replacer) and voila.

No need to dumb down the feature set of the SIMD DSL, it will grow 
naturally as cocnept requiring it and some special stuff will arise from 
algorthimic analysis, much more like normal code. And int his case, 
you'll see that these problem of optimality actually lessens as they 
willbe captured at high level by your concept based analysis.

I bet we have a lot of stuff to exchange and argue on the subject as we 
seems to have taken radically different angle on the matter, I will be 
glad if we can go over our first premises of arguments and maybe try to 
see if our respective experiences in some concrete area can't be shared 
for the greater good of everyone. Especially, if you have actual 
algorithm you see as something hard that didn't fit your design (and by 
concrete I mean a snapshot of code not a mere description in words were 
both of us can misinterpret parts (see the CSE fiasco above :p) so we 
can look at the problem in our framework and see where problem arise.

Hoping to help :)

[boost] SIMD EDSL blues (was [XInt] Some after thoughts)

Joel Falcou