[XInt] Some afterthoughts

OK, as the ET-based rambling went nowhere and some people (especially Steven) had some valid points about the simplicity of the underlying representation, I am about to propose the following plan. The premise is still that I believe an ET layer is worthwhile at some point. So here it is:
1/ Restructure XInt so there is a generic way to access the needed internals of the big-integer representation, through a terse Concept providing a way to specify how bits are stored and retrieved.
2/ Build the xinteger_t type on top of this generic layer.
Let's stop here. At this point, we fulfill the needs expressed by Chad and Steven: a simple and generic large-integer library ready to be used. This I can see as a very valid candidate for acceptance right now. So, what goes on afterward? By using the concept and interface, we (and by "we", maybe I mean only me for the moment) can do the following:
3/ Implement an EDSL on top of the interface defined in 1/. By reusing the low-level memory and computation interface we can build an ET-enabled large integer that behaves the same way as the original large-integer type in 2/ but allows us to add "fancy" optimisations.
4/ Have a way to say a large integer uses 128-bit SIMD registers, and implement a SIMD-aware large integer.
Or any combination of 3+4, or any other similar extension. So, does the effort required by 1/ (definition of a minimal concept for handling the representation of large integers) and 2/ seem valid? I think this actually gives everything to everyone: the simplicity-concerned public and the "fancy" public. Of course, the only thing Chad may have to do is 1+2, leaving 3+ to people interested in providing such extensions later. How does it sound?
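To make 1/ concrete, here is a minimal sketch of what such a storage Concept might reduce to; every name below is invented for illustration, not taken from XInt:

#include <cstddef>
#include <boost/cstdint.hpp>

// Archetype of a hypothetical storage Concept: the terse set of
// operations algorithms rely on, independent of how a backend keeps its bits.
struct magnitude_storage_archetype
{
    typedef boost::uint32_t digit_type;            // the backend picks its digit

    std::size_t       size() const;                // digits currently in use
    void              resize(std::size_t n);       // grow/shrink the magnitude
    digit_type        get(std::size_t i) const;    // read digit i
    void              set(std::size_t i, digit_type d);  // write digit i

    digit_type*       data();                      // contiguous access, so
    digit_type const* data() const;                // backends can bulk-process
};

// Both xinteger_t (step 2/) and a later ET layer (step 3/) would be
// written against the concept, never against a concrete representation:
template<class Storage>
void add_magnitudes(Storage& out, Storage const& a, Storage const& b);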

On Thu, 10 Mar 2011 10:06:43 +0100 falcou <Joel.Falcou@lri.fr> wrote:
OK, as the ET-based rambling went nowhere and some people (especially Steven) had some valid points about the simplicity of the underlying representation, I am about to propose the following plan. [...] How does it sound?
As a battle plan, it sounds viable. Of course, the battle plan is always the first casualty of the battle. -- Chad Nelson Oak Circle Software, Inc. * * *

falcou wrote:
OK, as the ET-based rambling went nowhere and some people (especially Steven) had some valid points about the simplicity of the underlying representation, I am about to propose the following plan. The premise is still that I believe an ET layer is worthwhile at some point. So here it is:
1/ Restructure XInt so there is a generic way to access the needed internals of the big-integer representation, through a terse Concept providing a way to specify how bits are stored and retrieved.
2/ Build the xinteger_t type on top of this generic layer.
Let's stop here. At this point, we fulfill the needs expressed by Chad and Steven: a simple and generic large-integer library ready to be used. This I can see as a very valid candidate for acceptance right now. So, what goes on afterward?
By using the concept and interface, we (and by "we", maybe I mean only me for the moment) can do the following:
3/ Implement an EDSL on top of the interface defined in 1/. By reusing the low-level memory and computation interface we can build an ET-enabled large integer that behaves the same way as the original large-integer type in 2/ but allows us to add "fancy" optimisations.
4/ Have a way to say a large integer uses 128-bit SIMD registers, and implement a SIMD-aware large integer.
Or any combination of 3+4, or any other similar extension.
So, does the effort required by 1/ (definition of a minimal concept for handling the representation of large integers) and 2/ seem valid? I think this actually gives everything to everyone: the simplicity-concerned public and the "fancy" public. Of course, the only thing Chad may have to do is 1+2, leaving 3+ to people interested in providing such extensions later.
How does it sound?
I support this idea. Having an expert do the ET is a good plan, and since you know proto you should feel free to use it to the full extent you see fit. Big-integer expression templates are exactly the sort of thing proto was intended to address. My only objection was forcing Chad to use proto, and it wasn't as strong an objection as it came across yesterday; I just got a little over-enthusiastic, since I find this subject matter very engaging. My rhetorical devices could as easily have been turned in the other direction, and my argument was strongly worded, but weak. Taking an idea to its extreme and burning a strawman doesn't prove anything.
Throughout this review I never saw anyone make the case for COW that seems most relevant to me. Most usage of built-in integer types in a generic context uses pass-by-value explicitly rather than using the type traits. This means that libraries like ratio and random, or whatever else, will probably pass multi-precision integer types by value in all of their functions. COW is the only way to mitigate all that copying around. Perhaps this was so obvious, or covered so early in the review, that it went unstated recently, but I think ET on COW-enabled (or COW+move-enabled) objects is a perfectly valid design choice.
We need the ET code to have access to the low-level algorithms so that it can make optimizations such as stack allocation of temporaries, of course. It is cleaner if the library provides a good interface for that access. Regards, Luke
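To illustrate the pass-by-value argument, here is a hypothetical COW integer (not the XInt interface) next to the kind of generic function ratio or random would write:

#include <vector>
#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>

// Hypothetical COW big integer: copies share the digit buffer until a write.
class cow_integer
{
    boost::shared_ptr<std::vector<unsigned> > digits_;

    void detach()                        // clone only when about to mutate
    {
        if (!digits_.unique())
            digits_ = boost::make_shared<std::vector<unsigned> >(*digits_);
    }

public:
    cow_integer() : digits_(boost::make_shared<std::vector<unsigned> >()) {}
    // the implicit copy constructor just bumps a reference count: O(1)

    cow_integer& operator+=(cow_integer const& rhs)
    {
        detach();                        // only writers pay for a deep copy
        (void)rhs;                       // digit-wise addition elided
        return *this;
    }
};

// Generic code in the style of ratio/random: arguments taken by value.
template<typename Int>
Int sum(Int a, Int b)                    // two "copies" per call...
{
    a += b;                              // ...but the only deep copy happens here
    return a;
}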

On 3/10/2011 11:14 AM, Simonson, Lucanus J wrote: [...]
Throughout this review I never saw anyone make the case for COW that seems most relevant to me.
Agreed.
Most usage of built-in integer types in a generic context uses pass-by-value explicitly rather than using the type traits.
I'm guessing you mean call_traits, right?
This means that libraries like ratio and random or whatever else will probably pass multi-precision integer types by value in all of their functions.
Yes, that's unfortunate.
COW is the only way to mitigate all that copying around.
Well...not the only way, but the least intrusive way.
Perhaps this was so obvious, or covered so early in the review, that it went unstated recently, but I think ET on COW-enabled (or COW+move-enabled) objects is a perfectly valid design choice.
I agree that the above example is actually a valid *and* compelling argument for CoW. However, as I've stated before, I believe any CoW infrastructure would be best built on top of a non-CoW infrastructure, since non-CoW indeed has its own advantages, and building a non-CoW infrastructure on top of a CoW infrastructure likely compromises many of the non-CoW advantages.
We need the ET code to have access to the low-level algorithms so that it can make optimizations such as stack allocation of temporaries, of course. It is cleaner if the library provides a good interface for that access.
Agreed. - Jeff

Jeffrey Lee Hellrung, Jr. wrote:
On 3/10/2011 11:14 AM, Simonson, Lucanus J wrote: [...]
Throughout this review I never saw anyone make the case for COW that seems most relevant to me.
Agreed.
Most usage of built-in integer types in a generic context uses pass-by-value explicitly rather than using the type traits.
I'm guessing you mean call_traits, right?
Yes, I misspoke.
This means that libraries like ratio and random or whatever else will probably pass multi-precision integer types by value in all of their functions.
Yes, that's unfortunate.
COW is the only way to mitigate all that copying around.
Well...not the only way, but the least intrusive way.
I was thinking of ways that can be accomplished within the scope of the xint library itself, without forcing everyone else's code to change.
Perhaps this was so obvious, or covered so early in the review, that it went unstated recently, but I think ET on COW-enabled (or COW+move-enabled) objects is a perfectly valid design choice.
I agree that the above example is actually a valid *and* compelling argument for CoW.
However, as I've stated before, I believe any CoW infrastructure would be best built on top of a non-CoW infrastructure, since non-CoW indeed has its own advantages, and building a non-CoW infrastructure on top of a CoW infrastructure likely compromises many of the non-CoW advantages.
I hate reference counting schemes as a general rule because they always seem so reasonable and straightforward at the beginning but tend to lead to a lot of problems in the end. I could write a country western song about reference counting, broken hearts and bitter regret.... Even so, there is a good rationale for having it in this case.
However, I would prefer, as a design consideration, to separate COW out from the integer class and perhaps have a boost.cow library that allows you to declare COW handles to objects of any type, rather than tightly coupling that general capability to the implementation of this library. That way people can use COW or non-COW (perhaps move-enabled) infinite precision integers by composition. It does complicate the ET system if it needs to handle COW-wrapped handles to infinite precision integer objects, but that complication is well within the scope of what proto can easily handle, so I'm not concerned. Regards, Luke
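A sketch of the generic handle described above; no boost.cow library exists, so everything below is hypothetical:

#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>

// Hypothetical generic COW handle: wraps *any* value type, so COW and
// non-COW integers can coexist by composition.
template<typename T>
class cow
{
    boost::shared_ptr<T> p_;

public:
    explicit cow(T const& value = T()) : p_(boost::make_shared<T>(value)) {}

    T const& read() const { return *p_; }    // shared, zero-copy access

    T& write()                               // clone before exposing mutable state
    {
        if (!p_.unique())
            p_ = boost::make_shared<T>(*p_);
        return *p_;
    }
};

// e.g. cow<xinteger_t> n; ... n.write() += 1;  // deep copy only if shared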

2011/3/10 Simonson, Lucanus J <lucanus.j.simonson@intel.com>:
I hate reference counting schemes as a general rule because they always seem so reasonable and straightforward at the beginning but tend to lead to a lot of problems in the end. I could write a country western song about reference counting, broken hearts and bitter regret....
Please do, we could sing it at BoostCon ;-)
Joachim

Joachim Faulhaber wrote:
2011/3/10 Simonson, Lucanus J <lucanus.j.simonson@intel.com>:
I hate reference counting schemes as a general rule because they always seem so reasonable and straightforward at the beginning but tend to lead to a lot of problems in the end. I could write a country western song about reference counting, broken hearts and bitter regret....
Please do, we could sing it at BoostCon ;-)
Well...

I lost my job
And my truck broke down
And my dog died and
My girlfriend left me
All because I
<chorus>Put my faith in reference counting</chorus>
Instead of the good lord above!

<refrain>Yes my luck ran out when the reference count went to -1.</refrain>

Now I promise I'll
<chorus>Swear off of reference counting</chorus>
And quit drinking
And shave
Just as soon as I get sober!

<refrain>Yes my luck ran out when the reference count went to -1.</refrain>

But in the mean time
I need to think of
A really good excuse
'cause I don't know how to explain
<chorus>all my reference counting problems</chorus>
to my wife!

<refrain>Yes my luck ran out when the reference count went to -1.</refrain>
<refrain>Yes my luck ran out when the reference count went to -1.</refrain>

Will that do or does it need to be longer? Regards, Luke

2011/3/11 Simonson, Lucanus J <lucanus.j.simonson@intel.com>:
Joachim Faulhaber wrote:
2011/3/10 Simonson, Lucanus J <lucanus.j.simonson@intel.com>:
I hate reference counting schemes as a general rule because they always seem so reasonable and straightforward at the beginning but tend to lead to a lot of problems in the end. I could write a country western song about reference counting, broken hearts and bitter regret....
Please do, we could sing it at BoostCon ;-)
Well... I lost my job And my truck broke down And my dog died and My girlfriend left me All because I <chorus>Put my faith in reference counting</chorus> Instead of the good lord above!
<refrain>Yes my luck ran out when the reference count went to -1.</refrain>
Now I promise I'll <chorus>Swear off of reference counting</chorus> And quit drinking And shave Just as soon as I get sober!
<refrain>Yes my luck ran out when the Reference count went to -1.</refrain>
But in the mean time I need to think of A really good excuse 'cause I don't know how to explain <chorus> all my reference counting problems </chorus> to my wife!
<refrain>Yes my luck ran out when the Reference count went to -1.</refrain>
<refrain>Yes my luck ran out when the Reference count went to -1.</refrain>
Will that do or does it need to be longer?
I'm deeply impressed. I will never even think of using ref counting again! Your song rocks :-D Joachim

On 10/03/2011 23:44, Simonson, Lucanus J wrote:
I hate reference counting schemes as a general rule because they always seem so reasonable and straightforward at the beginning but tend to lead to a lot of problems in the end. I could write a country western song about reference counting, broken hearts and bitter regret....
I personally refer to it as the mad COW disease.

On 03/10/2011 01:06 AM, falcou wrote:
[snip]
4/ Have a way to say a large integer uses 128-bit SIMD registers, and implement a SIMD-aware large integer.
This last item seems particularly useful, but I don't see how it has anything to do with a potential restructuring. Since non-bitwise operations aren't explicitly single instruction multiple data operations, really what we are talking about is optimizing the library so as to make the most use of the processor capabilities, which may include SSE instructions, etc. (Presumably the performance of GMP already reflects such optimizations.)
Maybe what you have in mind is letting the digit type be a template parameter, and then substituting in a user-defined type (or in the case of some compilers, a compiler-defined non-standard type) that serves as a 128-bit unsigned integer. I'm not convinced that this level of abstraction is compatible with generation of optimal code, though. Furthermore, this abstraction doesn't seem particularly useful, as the only purpose I can imagine of specifying a non-default digit type would be for this particular optimization.
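For concreteness, the parameterization Jeremy describes might look roughly like this; the names and the 128-bit typedef are assumptions for illustration, not XInt's interface:

#include <vector>
#include <boost/cstdint.hpp>

// Hypothetical sketch: the digit becomes a template parameter.
template<typename Digit = boost::uint32_t>
class basic_integer
{
    std::vector<Digit> digits_;   // little-endian sequence of digits
    // ... arithmetic written generically against Digit ...
};

typedef basic_integer<> integer;  // portable default

#if defined(__GNUC__) && defined(__x86_64__)
// Some compilers expose a non-standard 128-bit unsigned integer type.
typedef basic_integer<__uint128_t> wide_digit_integer;
#endif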

Jeremy Maitin-Shepard wrote:
On 03/10/2011 01:06 AM, falcou wrote:
[snip]
4/ Have a way to say a large integer uses 128-bit SIMD registers, and implement a SIMD-aware large integer.
This last item seems particularly useful, but I don't see how it has anything to do with a potential restructuring. Since non-bitwise operations aren't explicitly single instruction multiple data operations, really what we are talking about is optimizing the library so as to make the most use of the processor capabilities, which may include SSE instructions, etc. (Presumably the performance of GMP already reflects such optimizations.)
Maybe what you have in mind is letting the digit type be a template parameter, and then substituting in a user-defined type (or in the case of some compilers, a compiler-defined non-standard type) that serves as a 128-bit unsigned integer. I'm not convinced that this level of abstraction is compatible with generation of optimal code, though. Furthermore, this abstraction doesn't seem particularly useful, as the only purpose I can imagine of specifying a non-default digit type would be for this particular optimization.
I can attest that it is generally not compatible with generation of optimal code. Others may disagree, but I've had some unique experience with trying to create a C++ abstraction on vector instruction sets. Besides the fact that I couldn't do some necessary AST transformations in the ETs of our DSEL for vector operations, because there is no way to identify whether two arguments to an expression are the same variable, only the same type, there were many, many other problems that caused the reality to fall short of our expectations and promising early results. Frankly, if you want fast SSE-accelerated code you should do it by hand in assembler.
For infinite precision integers in particular you need access to the carry bit generated by vector arithmetic, and to other instruction-level features. Just replacing int with int_128_t isn't going to get you very far. You need to hand-unroll loops and generally put a lot of thought into what you are doing at the instruction level to get the best implementation, and everything needs to be driven by benchmark results. Even intrinsics don't give you the amount of control needed to get maximum performance, because the compiler can't run benchmarks to guide its choices the way you can. (Well, the Intel compiler actually can, to some extent...)
Instead of writing assembler we should provide points of customization that allow the user to override all the algorithms with their own implementation. That could be GMP, their own hand-coded assembly, or some high-level code that has been optimized through Intel Parallel Building Blocks, for example, which accomplishes with vector type abstraction in C++ what ETs cannot, through the use of a JIT compilation step. Regards, Luke

On 03/10/2011 04:06 PM, Simonson, Lucanus J wrote:
I can attest that it is generally not compatible with generation of optimal code. Others may disagree, but I've had some unique experience with trying to create a C++ abstraction on vector instruction sets. Besides the fact that I couldn't do some necessary AST transformations in the ETs of our DSEL for vector operations, because there is no way to identify whether two arguments to an expression are the same variable, only the same type, there were many, many other problems that caused the reality to fall short of our expectations and promising early results. Frankly, if you want fast SSE-accelerated code you should do it by hand in assembler. For infinite precision integers in particular you need access to the carry bit generated by vector arithmetic, and to other instruction-level features. Just replacing int with int_128_t isn't going to get you very far. You need to hand-unroll loops and generally put a lot of thought into what you are doing at the instruction level to get the best implementation, and everything needs to be driven by benchmark results. Even intrinsics don't give you the amount of control needed to get maximum performance, because the compiler can't run benchmarks to guide its choices the way you can. (Well, the Intel compiler actually can, to some extent...) Instead of writing assembler we should provide points of customization that allow the user to override all the algorithms with their own implementation. That could be GMP, their own hand-coded assembly, or some high-level code that has been optimized through Intel Parallel Building Blocks, for example, which accomplishes with vector type abstraction in C++ what ETs cannot, through the use of a JIT compilation step.
Relying on "customization points" to customize the performance, rather than behavior, runs counter to the whole point of a library. The idea, after all, is to provide a fast arbitrary precision integer library, not to merely provide an interface, and then ask the user to write the implementation. Editing the source code itself is the appropriate customization point for optimizing the implementation. Any such performance optimizations should just be submitted back and included in the official version of the library, rather than forcing all users to individually waste their time reimplementing the optimizations. If possible, a compile-time/preprocessor option to replace the existing C++ implementation with calls to GMP, for instance, would seem to be quite useful and also compatible with Boost licensing.
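For instance, such a compile-time option might look like the following; the macro name and wrapper function are invented here, and the only real API used is GMP's mpz_add:

// Hypothetical compile-time backend switch; BOOST_XINT_USE_GMP is not
// a real XInt macro.
#if defined(BOOST_XINT_USE_GMP)
#include <gmp.h>

inline void add_magnitudes(mpz_t out, const mpz_t a, const mpz_t b)
{
    mpz_add(out, a, b);   // delegate to GMP's tuned kernels
}
#else
// ... portable C++ digit loops go here ...
#endif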

Jeremy Maitin-Shepard wrote:
On 03/10/2011 04:06 PM, Simonson, Lucanus J wrote:
I can attest that it is generally not compatible with generation of optimal code. Others may disagree, but I've had some unique experience with trying to create a C++ abstraction on vector instruction sets. Besides the fact that I couldn't do some necessary AST transformations in the ETs of our DSEL for vector operations, because there is no way to identify whether two arguments to an expression are the same variable, only the same type, there were many, many other problems that caused the reality to fall short of our expectations and promising early results. Frankly, if you want fast SSE-accelerated code you should do it by hand in assembler. For infinite precision integers in particular you need access to the carry bit generated by vector arithmetic, and to other instruction-level features. Just replacing int with int_128_t isn't going to get you very far. You need to hand-unroll loops and generally put a lot of thought into what you are doing at the instruction level to get the best implementation, and everything needs to be driven by benchmark results. Even intrinsics don't give you the amount of control needed to get maximum performance, because the compiler can't run benchmarks to guide its choices the way you can. (Well, the Intel compiler actually can, to some extent...) Instead of writing assembler we should provide points of customization that allow the user to override all the algorithms with their own implementation. That could be GMP, their own hand-coded assembly, or some high-level code that has been optimized through Intel Parallel Building Blocks, for example, which accomplishes with vector type abstraction in C++ what ETs cannot, through the use of a JIT compilation step.
Relying on "customization points" to customize the performance, rather than behavior, runs counter to the whole point of a library. The idea, after all, is to provide a fast arbitrary precision integer library, not to merely provide an interface, and then ask the user to write the implementation. Editing the source code itself is the appropriate customization point for optimizing the implementation. Any such performance optimizations should just be submitted back and included in the official version of the library, rather than forcing all users to individually waste their time reimplementing the optimizations. If possible, a compile-time/preprocessor option to replace the existing C++ implementation with calls to GMP, for instance, would seem to be quite useful and also compatible with Boost licensing.
I don't believe it is practical for a Boost library to try to compete with GMP on performance, and there are good reasons why people who do wouldn't contribute their code back to us. I agree that a compile-time option to use GMP is a good thing, but if we do that we might as well generalize it so that people who have an implementation that is compatible with their commercial license can use that instead. LGPL is not for everyone. Regards, Luke

On 11/03/11 01:06, Simonson, Lucanus J wrote:
I can attest that it is generally not compatible with generation of optimal code. Others may disagree, but I've had some unique experience with trying to create a C++ abstraction on vector instruction sets. Besides the fact that I couldn't do some necessary AST transformations in the ETs of our DSEL for vector operations, because there is no way to identify whether two arguments to an expression are the same variable, only the same type, there were many, many other problems that caused the reality to fall short of our expectations and promising early results. Frankly, if you want fast SSE-accelerated code you should do it by hand in assembler.
Except not... We disagree on this point, but we won't re-run that play again on this subject. Wait for May 15.
For infinite precision integers in particular you need access to the carry bit generated by vector arithmetic, and to other instruction-level features. Just replacing int with int_128_t isn't going to get you very far.
Of course; again you put words in my mouth. The underlying representation can still be some int arrays; you just need to work on them using a range adaptor outputting SIMD registers filled with data on the fly. It is a specialization of the memory allocator and of the algorithm.
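A guess at what such a range adaptor could look like, sketched with plain SSE2 intrinsics; the class and function names are invented (this is not nt2 or XInt code), and a carry-free operation (bitwise OR) is used since SSE offers no vector carry:

#include <emmintrin.h>   // SSE2
#include <cstddef>
#include <boost/cstdint.hpp>

// Hypothetical adaptor: walks a plain digit array, yielding 128-bit
// packs (4 x 32-bit digits) filled on the fly.
class simd_digit_range
{
    boost::uint32_t const* first_;
    std::size_t            packs_;   // number of whole 4-digit packs

public:
    simd_digit_range(boost::uint32_t const* p, std::size_t ndigits)
        : first_(p), packs_(ndigits / 4) {}

    std::size_t size() const { return packs_; }

    __m128i load(std::size_t i) const    // fill a register on the fly
    {
        return _mm_loadu_si128(
            reinterpret_cast<__m128i const*>(first_ + 4 * i));
    }
};

// A carry-free digit operation, 4 digits at a time:
inline void or_digits(boost::uint32_t* out,
                      simd_digit_range const& a, simd_digit_range const& b)
{
    for (std::size_t i = 0; i != a.size(); ++i)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 4 * i),
                         _mm_or_si128(a.load(i), b.load(i)));
}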

Joel Falcou wrote:
For infinite precision integers in particular you need access to the carry bit generated by vector arithmetic, and to other instruction-level features. Just replacing int with int_128_t isn't going to get you very far.
Of course; again you put words in my mouth. The underlying representation can still be some int arrays; you just need to work on them using a range adaptor outputting SIMD registers filled with data on the fly. It is a specialization of the memory allocator and of the algorithm.
I said some will disagree to acknowledge that this was my opinion and not an absolute truth. I was trying to speak for myself. Regards, Luke

On 11/03/2011 01:06, Simonson, Lucanus J wrote:
For infinite precision integers in particular you need access to the carry bit generated by vector arithmetic, and to other instruction-level features.
SSE provides no means whatsoever to obtain the carry bits when doing addition of vectors. Altivec does, however. I don't see how that changes anything with regard to the user anyway. Those are details of the internals of the 128-bit integer type.

Mathias Gaunard wrote:
On 11/03/2011 01:06, Simonson, Lucanus J wrote:
For infinite precision integers in particular you need access to the carry bit generated by vector arithmetic, and to other instruction-level features.
SSE provides no means whatsoever to obtain the carry bits when doing addition of vectors.
Altivec does, however.
I don't see how that changes anything with regard to the user anyway. Those are details of the internals of the 128-bit integer type.
Shouldn't the infinite precision algorithm use the carry bit if the hardware supports it? Do we program to the lowest common denominator and then claim optimality? Only in the cases where the lowest common denominator is sufficient to achieve optimality. That is the exception, not the rule, I'm afraid. I am basing my opinion on experience with the Larrabee vector instruction set, not SSE.
http://drdobbs.com/architecture-and-design/216402188?pgno=2
"Vector Arithmetic, Logical, and Shift Instructions The arithmetic, logical, and shift vector instructions include everything you'd expect: add, subtract, add with carry, subtract with borrow, multiply, round, clamp, max, min, absolute max, logical-or, logical-and, logical-xor, logical-shift and arithmetic-shift by a per-element variable number of bits, and conversions among floats, doubles, and signed and unsigned int32s."
I'll call out add with carry and subtract with borrow from that list, which is not exhaustive, by the way. There are a ton of instructions, and some are impossible to map an expression to using expression templates, due to language limitations. I said vector instruction sets are a moving target at the outset. If you are designing based on SSE you are solving yesterday's problems. Regards, Luke

On 11/03/11 20:29, Simonson, Lucanus J wrote:
I am basing my opinion on experience with the Larrabee vector instruction set, not SSE.
http://drdobbs.com/architecture-and-design/216402188?pgno=2
"Vector Arithmetic, Logical, and Shift Instructions The arithmetic, logical, and shift vector instructions include everything you'd expect: add, subtract, add with carry, subtract with borrow, multiply, round, clamp, max, min, absolute max, logical-or, logical-and, logical-xor, logical-shift and arithmetic-shift by a per-element variable number of bits, and conversions among floats, doubles, and signed and unsigned int32s."
I'll call out add with carry and subtract with borrow from that list, which is not exhaustive, by the way. There are a ton of instructions, and some are impossible to map an expression to using expression templates, due to language limitations.
I said vector instruction sets are a moving target at the outset. If you are designing based on SSE you are solving yesterday's problems.
Sorry not to have a Larrabee to do stuff with, really.
More seriously, I fail to see a vector operation that is "limited by the language". I am very eager to see the one you speak of.

Joel Falcou wrote:
On 11/03/11 20:29, Simonson, Lucanus J wrote:
I am basing my opinion on experience with the Larrabee vector instruction set, not SSE.
http://drdobbs.com/architecture-and-design/216402188?pgno=2
"Vector Arithmetic, Logical, and Shift Instructions The arithmetic, logical, and shift vector instructions include everything you'd expect: add, subtract, add with carry, subtract with borrow, multiply, round, clamp, max, min, absolute max, logical-or, logical-and, logical-xor, logical-shift and arithmetic-shift by a per-element variable number of bits, and conversions among floats, doubles, and signed and unsigned int32s."
I'll call out add with carry and subtract with borrow from that list, which is not exhaustive, by the way. There are a ton of instructions, and some are impossible to map an expression to using expression templates, due to language limitations.
I said vector instruction sets are a moving target at the outset. If you are designing based on SSE you are solving yesterday's problems.
Sorry not to have a Larrabee to do stuff with, really.
More seriously, I fail to see a vector operation that is "limited by the language". I am very eager to see the one you speak of.
If you have the same variable in more than one place in an expression, you can map that expression to different/fewer instructions than if they were all different variables in different places. There is no way to perform that mapping using expression templates, because it cannot be detected in template metaprogramming which variables are the same, only that they have the same type. Therefore you cannot map expressions to the optimal instructions using template metaprogramming. In general, three-register instructions that modify one register, and perhaps also use it as an input, map to expressions where the variable is on both sides of the assignment operator. Think fused multiply-add.
At first we thought we could do anything with expression templates. It is not true. We can only perform logic on types at compile time. You can't implement a compiler for runtime variables as a template metaprogram, only a compiler for types. I already gave the example of common subexpression elimination being unimplementable in expression templates. Rather than a contrived example, it is actually the most general and most easily understood example of the problem. It is a pretty important optimization in and of itself, but it is also widely understood, at least by readers of this list. Regards, Luke
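A minimal illustration of the limitation Luke describes, using toy expression-template types rather than proto:

// Toy expression-template types, just to show the problem.
struct plus; struct times;
template<class L, class Op, class R> struct expr { L l; R r; };
struct var { double v; };

typedef expr<expr<var, times, var>, plus, var> axb_plus_c;

axb_plus_c make1(var a, var b, var c)   // conceptually a*b + c
{
    expr<var, times, var> ab = { a, b };
    axb_plus_c e = { ab, c };
    return e;
}

axb_plus_c make2(var a, var b)          // conceptually a*b + a
{
    expr<var, times, var> ab = { a, b };
    axb_plus_c e = { ab, a };           // `a` reused, yet the TYPE is
    return e;                           // identical to make1's result
}
// A metaprogram sees only axb_plus_c in both cases; the reuse of `a`,
// which would justify a fused or shorter instruction sequence, is
// invisible at compile time.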

On 11/03/11 21:04, Simonson, Lucanus J wrote:
If you have the same variable in more than one place in an expression, you can map that expression to different/fewer instructions than if they were all different variables in different places. There is no way to perform that mapping using expression templates, because it cannot be detected in template metaprogramming which variables are the same, only that they have the same type. Therefore you cannot map expressions to the optimal instructions using template metaprogramming.
In general, three-register instructions that modify one register, and perhaps also use it as an input, map to expressions where the variable is on both sides of the assignment operator. Think fused multiply-add.
At first we thought we could do anything with expression templates. It is not true. We can only perform logic on types at compile time. You can't implement a compiler for runtime variables as a template metaprogram, only a compiler for types. I already gave the example of common subexpression elimination being unimplementable in expression templates. Rather than a contrived example, it is actually the most general and most easily understood example of the problem. It is a pretty important optimization in and of itself, but it is also widely understood, at least by readers of this list.
Well, I dunno which compiler you use, but usually, when you do stuff properly, the ETs are optimized by the compiler itself afterward; it usually does this for you. We made the test in nt2 with a shifted-load-based FIR using a very simple TMP logic, and well... it never used more registers than we thought it should. As for madd, well... we do pattern matching on the AST structure, re-emit a madd node and let the thingy do its job afterward. I fail to see where the failure is, but it is important to have the ET emit rather trivial code and let the compiler polish the code generation afterward. Under GCC, the CSE stuff is done by the compiler as if it was normal code. We noticed loop unrolling done through matrix ETs and other such effects. Which compiler did you use? How is your ET-based code written?

Joel Falcou wrote:
On 11/03/11 21:04, Simonson, Lucanus J wrote:
If you have the same variable in more than one place in an expression, you can map that expression to different/fewer instructions than if they were all different variables in different places. There is no way to perform that mapping using expression templates, because it cannot be detected in template metaprogramming which variables are the same, only that they have the same type. Therefore you cannot map expressions to the optimal instructions using template metaprogramming.
In general, three-register instructions that modify one register, and perhaps also use it as an input, map to expressions where the variable is on both sides of the assignment operator. Think fused multiply-add.
At first we thought we could do anything with expression templates. It is not true. We can only perform logic on types at compile time. You can't implement a compiler for runtime variables as a template metaprogram, only a compiler for types. I already gave the example of common subexpression elimination being unimplementable in expression templates. Rather than a contrived example, it is actually the most general and most easily understood example of the problem. It is a pretty important optimization in and of itself, but it is also widely understood, at least by readers of this list.
Well, I dunno which compiler you use, but usually, when you do stuff properly, the ETs are optimized by the compiler itself afterward; it usually does this for you. We made the test in nt2 with a shifted-load-based FIR using a very simple TMP logic, and well... it never used more registers than we thought it should. As for madd, well... we do pattern matching on the AST structure, re-emit a madd node and let the thingy do its job afterward. I fail to see where the failure is, but it is important to have the ET emit rather trivial code and let the compiler polish the code generation afterward. Under GCC, the CSE stuff is done by the compiler as if it was normal code. We noticed loop unrolling done through matrix ETs and other such effects. Which compiler did you use? How is your ET-based code written?
The problems come from load/store: if you emit too many, the compiler generally doesn't optimize them away. You need to somehow manage which variables are in registers and which are in memory, and if you get that wrong it is hard to right that wrong. For one thing, alias analysis won't be successful enough to do what the programmer could. For another, when it comes to intrinsics the compiler puts some trust in the programmer's choices and doesn't always try to second-guess. Loading and storing memory can get complicated in a great hurry, and you generally want to rearrange your algorithm and data structures for more efficient memory access. Just aligning things is a very big deal, and it isn't always enough to say "oh, the allocator will take care of that for me." There are a lot of complications once you get into the details of it. The compiler can sometimes do amazing things, but other times it produces a really weak effort, and it will never replace your algorithm with a better one based on knowledge of the instruction set. There might be good explanations for why the compiler can't emit the code you expected, but that doesn't change the fact that it often disappoints.
I know you may object that first I say the problem is having the same variable in multiple places to map to instructions like madd, and now I'm saying the problem is load and store. Trust me, these problems are very much interrelated. To implement optimal infinite precision integer arithmetic with LRB vector instructions (or Altivec) you have tons of goodies to help you in the instruction set, like multiply-and-store-high-result, add with carry, subtract with borrow, and gather/scatter, but if you abstract the instruction set away you don't even get the basics right, and all the goodies are out of reach entirely, because you need to craft your algorithm to fit the features of the instruction set to make use of them. Regards, Luke

On 11/03/11 22:57, Simonson, Lucanus J wrote:
The problems come from load/store: if you emit too many, the compiler generally doesn't optimize them away. You need to somehow manage which variables are in registers and which are in memory, and if you get that wrong it is hard to right that wrong. For one thing, alias analysis won't be successful enough to do what the programmer could. For another, when it comes to intrinsics the compiler puts some trust in the programmer's choices and doesn't always try to second-guess. Loading and storing memory can get complicated in a great hurry, and you generally want to rearrange your algorithm and data structures for more efficient memory access. Just aligning things is a very big deal, and it isn't always enough to say "oh, the allocator will take care of that for me." There are a lot of complications once you get into the details of it. The compiler can sometimes do amazing things, but other times it produces a really weak effort, and it will never replace your algorithm with a better one based on knowledge of the instruction set. There might be good explanations for why the compiler can't emit the code you expected, but that doesn't change the fact that it often disappoints.
Alignment is one issue; the only thing it solves is that you can be sure you can use mova* instead of movu* without thinking about it. Data rearrangement is a whole different level of stuff, and we solved it by *not* having to care at the SIMD DSL level; the upper array-like interface takes care of this using some fusion adaptation and transformation mumbo-jumbo. It is very crucial that you separate the concerns there. The SIMD DSL should only help you lay down your SIMD data-flow algorithm, help you map it to w/e IS you may have in a platform-independent way, provide you tools for iterating SIMD chunk by SIMD chunk over arbitrary memory, and have the significant pattern matcher in place for your ISA.
Now, all the other algorithmic-level transforms and the other memory layout and access problems have to be solved at an upper level (in w/e abstraction you need: matrix, array, big int, image, polynomials, etc.), because the very core optimizations you may want are expressed properly in the high-level domain, not the low-level one. Same goes for unrolling or strip-mining loop nests; it is not the job of the SIMD DSL. It is something that has uses in some domains but not others. The SIMD DSL should provide an interface that makes writing such numerical code the same whether it is indeed in SIMD or scalar mode.
I really think this is the problem you encountered: trying to make your SIMD DSL do too much at its level, something that ultimately fails. The only things such a beast should do are hide the intricacies of setting up loads/stores from memory into SIMD registers, uniformize functions and operators across ISs, and provide SIMD-based iterator and range traversal support. It should not do much more, but it should provide, mainly in its pattern-matching facility or in its way to define a function, ways to be extended for any new subtle low-level combination of operations leading to an optimized SIMD block, or for new IS support. See below for the rest of what can be done.
To implement optimal infinite precision integer arithmetic with LRB vector instructions (or Altivec) you have tons of goodies to help you in the instruction set, like multiply-and-store-high-result, add with carry, subtract with borrow, and gather/scatter, but if you abstract the instruction set away you don't even get the basics right, and all the goodies are out of reach entirely, because you need to craft your algorithm to fit the features of the instruction set to make use of them.
At some point, it all depends on how well you did your homework. Having a one-to-one mapping between one SIMD extension instruction and one operator only works so far, for classical operations. However, and we agree on this, if IS 1 has a funky 'add-two-vectors-and-take-the-sqrt-of-them' fused operation and the other doesn't, it is rather trivial to have some function like sqrt_sum(a,b), have it use the intrinsic for IS 1, and do it "by the book" in the IS 2 support. In the same way, your pattern matcher will depend on the IS and do the proper contraction if needed. Our stance on this matter was not to say "abstract to the lowest common denominator of features" but to try to see all SIMD ISs as an aggregate pool of features and fill the gaps of ISs with fewer features than others. Our approach was to segregate functions into "toolboxes" with different aims, providing mid-level functionality that gets locally optimized with w/e IS support. Now, w/e the IS, you can call it and, depending on what you have, it generates the proper algorithm.
Now if I were about to rewrite some large integer function or some crypto algorithm, what I would do is look at the proper SIMD way to do it, seeking the most fused possible intrinsics for the job (for crypto, IIRC SSE4 has some specific stuff, for example, so let's use them in the SSE4 version). And after that, lo and behold, generic programming strikes again: we will probably take a little time to lift/abstract/generalize said algorithm into a properly conceptualized series of operations. Said operations will then be mapped onto new functions in the SIMD DSL (and probably in a sequential fashion too). Rewrite the algorithm operating on a parametric type T, use said new functions, and watch the generated code be either SIMD or scalar depending on the ISs selected.
In terms of the integer stuff at hand, this means the high-level algorithm should be expressed as a combination of lifted concepts where the actual bit twiddling can be aggregated into logical, semantic-rich functions operating on w/e. Let the SIMD DSL provide such functions, implement them as properly needed (use the proper specialized intrinsic or add a pattern replacer), and voila. No need to dumb down the feature set of the SIMD DSL; it will grow naturally as concepts requiring it and some special stuff arise from algorithmic analysis, much like normal code. And in this case, you'll see that these problems of optimality actually lessen, as they will be captured at a high level by your concept-based analysis.
I bet we have a lot of stuff to exchange and argue about on the subject, as we seem to have taken radically different angles on the matter. I would be glad if we can go over our first premises of argument and maybe see if our respective experiences in some concrete areas can't be shared for the greater good of everyone. Especially if you have an actual algorithm you see as something hard that didn't fit your design (and by concrete I mean a snapshot of code, not a mere description in words where both of us can misinterpret parts (see the CSE fiasco above :p)), so we can look at the problem in our framework and see where problems arise. Hoping to help :)

Joel Falcou wrote: ...
In terms of the integer stuff at hand, this means the high-level algorithm should be expressed as a combination of lifted concepts where the actual bit twiddling can be aggregated into logical, semantic-rich functions operating on w/e. Let the SIMD DSL provide such functions, implement them as properly needed (use the proper specialized intrinsic or add a pattern replacer), and voila.
No need to dumb down the feature set of the SIMD DSL; it will grow naturally as concepts requiring it and some special stuff arise from algorithmic analysis, much like normal code. And in this case, you'll see that these problems of optimality actually lessen, as they will be captured at a high level by your concept-based analysis.
I bet we have a lot of stuff to exchange and argue about on the subject, as we seem to have taken radically different angles on the matter. I would be glad if we can go over our first premises of argument and maybe see if our respective experiences in some concrete areas can't be shared for the greater good of everyone. Especially if you have an actual algorithm you see as something hard that didn't fit your design (and by concrete I mean a snapshot of code, not a mere description in words where both of us can misinterpret parts (see the CSE fiasco above :p)), so we can look at the problem in our framework and see where problems arise.
Hoping to help :)
I'm not so sure we took that different an angle on it; perhaps more a different point of view on the results, based on different circumstances. Providing a more productive syntax than the intrinsics themselves was pretty much what we had in mind for the SIMD ET library. I think the fact that the compiler was pre-production in our case meant that everywhere your experience tells you the compiler should do the right thing, our experience was that it often did not, and we kept having to go into the generated assembly code and figure out what went wrong. The other very significant disadvantage the compiler put us at in our ET implementation efforts was that it was using an old version of the EDG frontend, which had quadratic compile time for template instantiation. This problem also gives me headaches with the Polygon library when compiled with the Intel compiler; the compile time really is excessive. This made using the ET wrapper too inconvenient.
Also, given that we weren't abstracting away any of the complexity of the instruction set, using the SIMD library wasn't any less complicated than intrinsics, but the syntax errors were harder to understand. The user couldn't expect to write templated code that worked for int or float types, replace int with simd_type, and get anything reasonable except in the very simplest of cases, since that limits us to only instructions that map to operations on basic types and to algorithms that aren't branchy. Given that designing the vector algorithms was hard with or without the wrapper, requiring a strong understanding of the instruction set, it wasn't clear that a more productive syntax than the intrinsics themselves was really going to be of much benefit, since that wasn't where time was being spent anyway. Regards, Luke

On 15/03/11 20:47, Simonson, Lucanus J wrote:
I'm not so sure we took that different an angle on it; perhaps more a different point of view on the results, based on different circumstances. Providing a more productive syntax than the intrinsics themselves was pretty much what we had in mind for the SIMD ET library. I think the fact that the compiler was pre-production in our case meant that everywhere your experience tells you the compiler should do the right thing, our experience was that it often did not, and we kept having to go into the generated assembly code and figure out what went wrong. The other very significant disadvantage the compiler put us at in our ET implementation efforts was that it was using an old version of the EDG frontend, which had quadratic compile time for template instantiation.
Ah well, blame the compiler then :p
Also, given that we weren't abstracting away any of the complexity of the instruction set, using the SIMD library wasn't any less complicated than intrinsics, but the syntax errors were harder to understand. The user couldn't expect to write templated code that worked for int or float types, replace int with simd_type, and get anything reasonable except in the very simplest of cases, since that limits us to only instructions that map to operations on basic types and to algorithms that aren't branchy. Given that designing the vector algorithms was hard with or without the wrapper, requiring a strong understanding of the instruction set, it wasn't clear that a more productive syntax than the intrinsics themselves was really going to be of much benefit, since that wasn't where time was being spent anyway.
Well, our stuff just works as a drop-in replacement. Branches have to be abstracted away, either using some where(c,a,b) construct that maps onto w/e is there, or by using horizontal branching. As a test, you can check that all our trigonometric functions are basically using the same stuff for both SIMD and scalar, with great performance. We also have a Range interface on them, so you can just ditch them into a regular std algorithm and have your cake. I think you got distracted by working with an unstable compiler. I can tell it works when you look at it in a sane environment.
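The where(c,a,b) construct Joel mentions can be illustrated as a branchless per-lane select; this raw SSE2 spelling is an assumption for illustration, not nt2's implementation:

#include <emmintrin.h>

// Lane-wise mask ? a : b, with no branch.
inline __m128i where(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a),      //  mask & a
                        _mm_andnot_si128(mask, b));  // ~mask & b
}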

On Fri, Mar 11, 2011 at 3:04 PM, Simonson, Lucanus J < lucanus.j.simonson@intel.com> wrote:
If you have the same variable in more than one place in an expression, you can map that expression to different/fewer instructions than if they were all different variables in different places. There is no way to perform that mapping using expression templates, because it cannot be detected in template metaprogramming which variables are the same, only that they have the same type. Therefore you cannot map expressions to the optimal instructions using template metaprogramming.
Even if a given compiler can't do such further optimizations for you, I wouldn't consider it the end of the road for expression templates. Not that I would necessarily suggest the following solution for all situations, but you could always attach a type-tag to a reused variable when forming an expression, for instance:
some_type a, b, c;
some_tagger< a_tag > a_( a );
c = a_ * b + a_;
Since a_ here has a unique tag "a_tag" associated with it, the hypothetical DSEL can be made to treat the two instances of a_ as the same variable and would therefore know to use specific instructions in such a case. Again, I'm not proposing that this is a great solution, as it requires some effort on the part of the user, but by type-tagging objects that appear multiple times in a single expression you can do the kinds of things you are talking about. -- -Matt Calabrese
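A sketch of how Matt's tagging idea gives the type system the identity information; all names here are hypothetical:

// Hypothetical tagged wrapper: the tag is part of the type, so
// metaprogramming can now ask "same variable?" as a type question.
template<class Tag, class T>
struct tagged_ref { T& ref; };

template<class L, class R>
struct same_variable { static const bool value = false; };

template<class Tag, class T, class U>
struct same_variable<tagged_ref<Tag, T>, tagged_ref<Tag, U> >
{ static const bool value = true; };

// same_variable< tagged_ref<a_tag, X>, tagged_ref<a_tag, X> >::value
// is true, which is exactly what the untagged scheme cannot express.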

On 11/03/11 22:05, Matt Calabrese wrote:
Even if a given compiler can't do such further optimizations for you, I wouldn't consider it the end of the road for expression templates. Not that I would necessarily suggest the following solution for all situations, but you could always attach a type-tag to a reused variable when forming an expression, for instance:
some_type a, b, c;
some_tagger< a_tag > a_( a );
c = a_ * b + a_;
Since a_ here has a unique tag "a_tag" associated with it, the hypothetical DSEL can be made to treat the two instances of a_ as the same variable and would therefore know to use specific instructions in such a case. Again, I'm not proposing that this is a great solution, as it requires some effort on the part of the user, but by type-tagging objects that appear multiple times in a single expression you can do the kinds of things you are talking about.
This is a usual solution, yes. It requires a bit of housekeeping, but it is usually what we do. In nt2, matrices can even be tagged using an MPL string.
However, the compiler should do CSE, ET or not, as stated in my last post.

On 11/03/2011 22:05, Matt Calabrese wrote:
Even if a given compiler can't do such further optimizations for you, I wouldn't consider it the end of the road for expression templates. Not that I would necessarily suggest the following solution for all situations, but you could always attach a type-tag to a reused variable when forming an expression, for instance:
some_type a, b, c;
some_tagger< a_tag > a_( a );
c = a_ * b + a_;
Since a_ here has a unique tag "a_tag" associated with it, the hypothetical DSEL can be made to treat the two instances of a_ as the same variable and would therefore know to use specific instructions in such a case. Again, I'm not proposing that this is a great solution, as it requires some effort on the part of the user, but by type-tagging objects that appear multiple times in a single expression you can do the kinds of things you are talking about.
In practice, such tagging on SIMD registers is not necessary. On bigger quantities, such as a vector, a matrix, or something like that, you can afford to simply do a runtime test if no such tagging is present.

At Fri, 11 Mar 2011 12:04:15 -0800, Simonson, Lucanus J wrote:
More seriously, I fail to see a vector operation that is "limited by the language". I am very eager to see the one you speak of.
If you have the same variable in more than one place in an expression
sometimes called "aliasing"
you can map that expression to different/fewer instructions than if they were all different variables in different places. There is no way to perform that mapping using expression templates, because it cannot be detected in template metaprogramming which variables are the same, only that they have the same type. Therefore you cannot map expressions to the optimal instructions using template metaprogramming.
I beg to differ, at least most of the time. You can detect aliasing at runtime and choose an optimal implementation on that basis, and when the sameness of the two variables is information available to the compiler (as it often is), constant-folding and dead-code-elimination will throw away the test and the dead branch, leaving only the optimal code. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
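A sketch of the dispatch Dave describes, with carry handling elided since only the aliasing test matters here:

// Runtime aliasing test; the branch folds away when the compiler can
// see whether both pointers are the same.
inline void assign_sum(unsigned* out, unsigned const* a, unsigned const* b,
                       unsigned n)
{
    if (a == b) {
        for (unsigned i = 0; i != n; ++i)   // "x + x": a doubling loop
            out[i] = 2 * a[i];
    } else {
        for (unsigned i = 0; i != n; ++i)   // general path
            out[i] = a[i] + b[i];
    }
}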

Dave Abrahams wrote:
At Fri, 11 Mar 2011 12:04:15 -0800, Simonson, Lucanus J wrote:
More seriously, I fail to see a vector operation that is "limited by the language". I am very eager to see the one you speak of.
If you have the same variable in more than one place in an expression
sometimes called "aliasing"
you can map that expression to different/fewer instructions than if they were all different variables in different places. There is no way to perform that mapping using expression templates, because it cannot be detected in template metaprogramming which variables are the same, only that they have the same type. Therefore you cannot map expressions to the optimal instructions using template metaprogramming.
I beg to differ, at least most of the time. You can detect aliasing at runtime and choose an optimal implementation on that basis, and when the sameness of the two variables is information available to the compiler (as it often is), constant-folding and dead-code-elimination will throw away the test and the dead branch, leaving only the optimal code.
Clever. I like how you think. If the two values happened not to be aliases I think the compiler would leave in the compare and branch, which isn't what we would want. Regards, Luke

On 11/03/2011 20:29, Simonson, Lucanus J wrote:
Mathias Gaunard wrote:
On 11/03/2011 01:06, Simonson, Lucanus J wrote:
For infinite precision integers in particular you need access to the carry bit generated by vector arithmetic, and to other instruction-level features.
SSE provides no means whatsoever to obtain the carry bits when doing addition of vectors.
Altivec does, however.
I don't see how that changes anything with regard to the user anyway. Those are details of the internals of the 128-bit integer type.
Shouldn't the infinite precision algorithm use the carry bit if the hardware supports it? Do we program to the lowest common denominator and then claim optimality?
Everything would be dictated by the interface the algorithms expect, and by how they detect and handle overflows. Then it's simply a matter of implementing the interface using the best possible set of instructions. It's true that the right interface for dealing with carries is not entirely obvious; experience with writing XInt should help the author come up with one, though.
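One possible shape for such a carry interface, purely as an assumption of what a customization point could look like, not XInt's actual API:

#include <boost/cstdint.hpp>

// Hypothetical customization point: the portable fallback widens to 64
// bits to recover the carry; a platform with a hardware carry flag (or
// Altivec's vector carry) would specialize it without touching the
// digit-loop algorithms written on top of it.
inline boost::uint32_t add_with_carry(boost::uint32_t a, boost::uint32_t b,
                                      boost::uint32_t& carry)  // in/out: 0 or 1
{
    boost::uint64_t t = boost::uint64_t(a) + b + carry;
    carry = boost::uint32_t(t >> 32);
    return boost::uint32_t(t);
}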

On 11/03/2011 00:36, Jeremy Maitin-Shepard wrote:
Maybe what you have in mind is letting the digit type be a template parameter, and then substituting in a user-defined type (or in the case of some compilers, a compiler-defined non-standard type) that serves as a 128-bit unsigned integer. I'm not convinced that this level of abstraction is compatible with generation of optimal code, though. Furthermore, this abstraction doesn't seem particularly useful, as the only purpose I can imagine of specifying a non-default digit type would be for this particular optimization.
That's what I had in mind when I suggested the idea. I don't think there are really missed optimizations, but I may be wrong about this. Otherwise, this abstraction is also necessary to allow arbitrary ranges as input.
participants (10)
- Chad Nelson
- Dave Abrahams
- falcou
- Jeffrey Lee Hellrung, Jr.
- Jeremy Maitin-Shepard
- Joachim Faulhaber
- Joel Falcou
- Mathias Gaunard
- Matt Calabrese
- Simonson, Lucanus J