Algebraic abstractions related to (big) integers in C++

Dear (big) integer enthusiasts,

The goal of this message is to reflect on the two's complement representation of negative integers, from a relatively abstract algebraic perspective. Some of the questions that arise in this context are whether the two's complement representation is well suited for

(i) a fixed width integer data type
(ii) a fixed point number data type
(iii) a variable width integer data type
(iv) a floating point number data type

Circumstantial evidence suggests that it's well suited for (i) and (ii), but not well suited for (iv). The following quote from Bill Gosper (1972)

"By this strategy, consider the universe, or, more precisely, algebra [...] therefore algebra is run on a machine (the universe) which is twos-complement."
(<http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item154>)

could be seen as evidence in favor of using two's complement representation for (iii). However, XInt and other bigint libraries took the opposite decision, so there is strong circumstantial evidence against using two's complement representation for (iii).

Let me start with a disclaimer that I'm not an expert in algebra. The following thoughts emerged when I tried to understand the semi-direct product. I wanted to see how it works for a concrete "small" group, so I tried to write the quaternion group (the one with the 8 elements 1, -1, i, -i, j, -j, k, -k) as a semi-direct product. This didn't work, because the quaternion group is not a semi-direct product. But after some time I managed to "decompose" the quaternion group into abelian groups in a different way. I subsequently learned that this sort of problem is called the group extension problem (<http://en.wikipedia.org/wiki/Group_extension>). It's a solved problem, but I would have to learn some homological algebra to understand its solution.

Let's look at the "ring of integers" (denoted by "Z") and the "ring of integers modulo m" (denoted by "Z_m").
We want to "approximate" the "ring of integers" by the "ring of integers modulo m". One of the advantages of this approach is that it's relatively easy to represent the "ring of integers modulo m" in a computer, especially for "m = 2^n" (more concretely "m = 2^32" for int32_t and "m = 2^64" for int64_t). There exists exactly one non-trivial ring homomorphism "p : Z -> Z_m" from the "ring of integers" into the "ring of integers modulo m". (The integer "1" is mapped to the corresponding "1", and the additive group of the "ring of integers modulo m" is generated by "1".)

A good way to describe how we "approximate" "Z" by "Z_m" is to define a map "q : Z_m -> Z" such that "p(q(i)) = i for all i". Because the map "q" is not uniquely determined by the condition "p(q(i)) = i", let's add the additional constraint "-m/2 <= q(i) < m/2". It's easily checked that this uniquely determines the map "q". This map "q" is not a ring homomorphism, but that's OK, as there does not exist any non-trivial ring homomorphism from "Z_m" into "Z". Still, we should try to measure the deviation of "q" from a ring homomorphism by taking a look at "S(a, b) := q(a) + q(b) - q(a + b)" and "P(a, b) := q(a) q(b) - q(a b)". It's easily checked that "p(S(a, b)) = 0" and "p(P(a, b)) = 0". This shows that "S(a, b)" and "P(a, b)" are multiples of "m". A quick estimation shows "-1 <= S(a, b)/m <= 1" and "-(m+1)/4 < P(a, b)/m <= (m+2)/4". This doesn't look too bad, but still bad enough to question whether the constraint "-m/2 <= q(i) < m/2" is really the way to go.

Instead of the constraint "-m/2 <= q(i) < m/2", we could also have used the constraint "0 <= q(i) < m" or the constraint "-m/2 < q(i) <= m/2". One reason to prefer the constraint "-m/2 <= q(i) < m/2" over the constraint "-m/2 < q(i) <= m/2" is that we get exactly the two's complement representation of signed integers for "m = 2^n", which was the very reason we investigated these algebraic relations in the first place.
It's also more consistent with the "a <= x < b" representation of ranges so typical for C/C++.

I want to give a third reason why the constraint "-m/2 <= q(i) < m/2" should be preferred over the constraint "-m/2 < q(i) <= m/2". Let's define a "round to nearest" division and the corresponding modulo operation for the "ring of integers" by "rdiv(a, b) := floor(1/2 + a/b)" and "rmod(a, b) := a - b rdiv(a, b)". It's easily checked that "-m/2 <= rmod(i, m) < m/2". It could be objected that implementing "round to nearest" by "i = floor(0.5 + x);" is only correct for non-negative values of "x", and that the correct implementation seems to be "i = x >= 0 ? floor(x + 0.5) : ceil(x - 0.5);". But I don't think that the more complicated definition of "round to nearest" is always "more" correct than the simpler one. At least when manipulating polygons, there can be situations where the simpler definition turns out to be the "more" correct one.

But let's investigate the constraint "0 <= q(i) < m" now. A quick estimation shows "0 <= S(a, b)/m <= 1" and "0 <= P(a, b)/m < m-1". So "S(a, b)/m" can be stored in a single "carry bit", and "P(a, b)/m" can be stored as a value of "Z_m". It is therefore not uncommon to have access to "S(a, b)/m" in machine language via a "carry bit", and to "P(a, b)/m" via a machine instruction that computes "a b" and "P(a, b)/m" at the same time and stores them in different registers. (My experiences with machine language are from the time before 1996, so pardon me if these statements are not fully correct.)

The statement about the carry bit is not fully correct, because the carry bit is normally not just a result of an addition in machine language, but is also taken into account during the addition. So if "c" is the carry bit before the addition, the addition will return the value "p(c) + a + b" in "Z_m" and the new carry bit "(c + q(a) + q(b) - q(p(c) + a + b))/m".
It's probably even possible to write this "add_with_carry" operation as a function in C or as a template in C++. But I don't really believe that many compilers will notice the equivalence between a template like the following and the "add_with_carry" machine instruction.

template <typename Uint>
typename boost::enable_if<boost::is_unsigned<Uint>, bool>::type
add_with_carry(Uint& a, const Uint& b, bool c) {
    a += (c != 0);
    a += b;
    // with a carry-in the sum wraps iff a <= b, without iff a < b
    return (c != 0) ? (a <= b) : (a < b);
}

And I'm pretty certain that there doesn't exist any compiler that will notice the equivalence between a template like the following and the mentioned machine instruction for a multiplication returning the complete result. I'm not even certain myself that they are equivalent (but I could write a test to verify it for me, if I really wanted to be certain).

template <typename Uint>
typename boost::enable_if<boost::is_unsigned<Uint>, Uint>::type
multiply_full_result(Uint& a, const Uint& b) {
    const int half = sizeof(Uint) * CHAR_BIT / 2;  // shift by bits, not bytes
    const Uint mask = Uint(-1) >> half;
    Uint upper_a = a >> half, lower_a = a & mask;
    Uint upper_b = b >> half, lower_b = b & mask;
    Uint t = upper_a * lower_b + ((lower_a * lower_b) >> half);
    Uint upper = upper_a * upper_b + (t >> half);
    t = lower_a * upper_b + (t & mask);  // the cross terms can carry into the upper half
    upper += (t >> half);
    a *= b;
    return upper;
}

I'm starting to get problems with the length of this message now, because it is quite reasonable to expect that the number of people who will even try to read the message gets lower and lower as the message gets longer and longer. And this is already my third start from scratch trying to come up with something reasonable in size and readability. My first attempt directly started with p-adic numbers and p-adic integers, then tried to explain how the two's complement representation is related to 2-adic integers.
It noted that two's complement is indeed an approximation in the 2-adic metric, but this normally doesn't buy us anything, because the 2-adic metric is almost never the relevant metric for the problem domain at hand. This is also one of the reasons why the two's complement representation should be avoided for floating point numbers: the relevant metric in that case is the normal metric of the real numbers.

My second attempt took the opposite approach, and started with the C++ statement "std::size_t n = std::max(1, c.size());", noting that it will provoke a compile error. After some discussion and additional examples like the expressions "c.size()/-2" and "-c.size()/2", it claimed that Haskell was able to solve these problems by not assigning any concrete type to "2" at all, but just the abstract "Num" typeclass. One part of the problem meant here is described in the following statement: "What I consider especially strange from an algebraic point of view is to automatically promote a signed integer type to its corresponding unsigned integer type when an arithmetic expression contains both signed and unsigned integer types. Shouldn't it be the other way round? Or wouldn't it be even better if there were no automatic conversion at all? At least from an algebraic point of view, any natural number is also an integer, not vice versa like in C."

So let's try to come to some sort of conclusion now, even though not all the semantic issues I wanted to discuss have been raised, and nearly no solutions for the raised issues have been presented. It looks to me now like it can be advantageous to implement operations for signed (variable or fixed width) integer data types in terms of operations for unsigned integer data types. This can be done independently of whether the two's complement representation is used or not.
However, we should still be careful to provide the correct semantics for signed integer types, even if we implement them in terms of unsigned integer types. It is a questionable design when the TTMath bignum library lets its signed integer type derive publicly from its unsigned integer type ("template<uint value_size> class Int : public UInt<value_size> { ... };"), because the relationship between signed and unsigned integer types is not an "is a" relationship as required for public inheritance. (Think about it: a signed integer is really not an unsigned integer, even if the unsigned integer type models the "ring of integers modulo m" instead of the "natural numbers".)

It is also questionable when a relatively new language like the D programming language cripples its "Usual Arithmetic Conversions" by stating "The signed type is converted to the unsigned type." as rule 4.4. Even though it is true that an integer can be converted to an "integer modulo m" (remember the ring homomorphism "p : Z -> Z_m"), are there really good reasons to apply such a conversion automatically? For the C/C++ language, the answer is definitely yes: "backward compatibility". But I don't think that this reason would have applied to D.

And please pardon me if I simply press the send button now.

Regards, Thomas

honestly i didn't get the _whole_ point of this message but i have some comments on it

Thomas wrote on Sunday, March 27, 2011 at 3:21:04:
...However, XInt and other bigint libraries took the opposite decision, so there is strong circumstantial evidence against using two's complement representation for (iii)...
i wondered myself why chad didn't use two's complement, because it seemed straightforward to me, but i never mentioned it. eventually chad stated that a sign-magnitude implementation suits non-trivial operations better, and i see no reason to protest
...So I tried to write the Quaternion group (the one with the 8 elements 1, -1, i, -i, j, -j, k, -k) as a semi-direct product. This didn't work, because the Quaternion group is not a semi-direct product...
perhaps you meant octonions, since quaternions have {1, i, j, k} elements in contrast to what you mentioned
...This is also one of the reasons why the two's complement representation should be avoided for floating point numbers, because the relevant metric in this case is the normal metric of the real numbers.
afaik IEEE 754 / ISO/IEC 60559 standards describe just that sign-magnitude representation which is widely used in the majority of computer processors

what i understood is that you suggest carefully thinking about the signed/unsigned behavior of integers when one develops an integer algebra library

but i think that in C++ we have no choice but to follow the historical tradition and to treat integers like our parents did

otherwise such a way may introduce confusion to users and render well-known idioms inapplicable

however i do not object to beginning to treat integers in a semantically more correct way even if it has consequences

-- Pavel

P.S. if you notice a grammar mistake or weird phrasing in my message please point it out

pavel wrote:
honestly i didn't get the _whole_ point of this message but i have some comments on it
I liked your comments. I didn't expect anybody to even try to "get the _whole_ point of this message". I realized that the things I don't like about integers in C++ (the signed to unsigned promotion, the behavior of the modulo operator for negative numbers, the missing "standard library" functions for "add_with_carry" and "multiply_full_result") couldn't be changed anyway, no matter how convincingly I proved that they are unfortunate. And with respect to the two's complement representation, I realized that I also didn't know the answer to the question that was bothering me. So I could just as well be honest and tell about my actual algebraic thoughts (and their origin), even knowing that "practical arguments would resonate more deeply with this list".
Thomas wrote on Sunday, March 27, 2011 at 3:21:04:
...However, XInt and other bigint libraries took the opposite decision, so there is strong circumstantial evidence against using two's complement representation for (iii)...
i wondered myself why chad didn't use two's complement because it seemed straightforward to me but i never mentioned it
eventually chad stated that a sign-magnitude implementation suits non-trivial operations better and i see no reason to protest
I guess many of us wondered, but none of us was really sure whether the two's complement representation would really be better than the sign-magnitude representation. It would have been just an implementation detail, but some people requested separation of algorithms and data representation, including actual access to the underlying representation. I initially thought that the two's complement representation would be the better representation, but I changed my mind while writing the message (because I finally took a deeper look at XInt and related libraries, and read some blogs and Wikipedia articles). I'm not yet sure whether the two's complement representation must actually be avoided, as it must be for floating point numbers.
...So I tried to write the Quaternion group (the one with the 8 elements 1, -1, i, -i, j, -j, k, -k) as a semi-direct product. This didn't work, because the Quaternion group is not a semi-direct product...
perhaps you meant octonions since quaternions have {1, i, j, k} elements in contrast to what you mentioned
I mean the "quaternion group", not the quaternions: "In group theory, the quaternion group is a non-abelian group of order 8, isomorphic to a certain eight-element subset of the quaternions under multiplication."
...This is also one of the reasons why the two's complement representation should be avoided for floating point numbers, because the relevant metric in this case is the normal metric of the real numbers.
afaik IEEE 754 / ISO/IEC 60559 standards describe just that sign-magnitude representation which is widely used in the majority of computer processors
Isn't it nice that at least some things are done right? You may ask what the purpose of theory is then, because when practice got it right, theory can't deliver anything new, and when practice got it wrong, you have to live with it anyway. But I very much prefer the situations where theory can't deliver anything new over the situations where you have to live with it anyway.
what i understood is that you suggest to carefully think of signed/unsigned behavior of integers when one develops an integer algebra library
but i think that in C++ we have no choice but to follow the historical tradition and to treat integers like our parents did
I'm not so sure. We happily accept the compile errors of std::min and std::max when mixing signed and unsigned numbers. The problem is that literals like "1" and "23" have the concrete type "int", which forces the "signed to unsigned promotion" upon us. If we could solve this problem (like Haskell did with the help of its powerful type system), we could think about fixing the "signed to unsigned promotion".
otherwise such a way may introduce confusion to users and render well-known idioms inapplicable
You can always cast explicitly (but of course you don't want to be forced to do it too often).
however i do not object to beginning to treat integers in a semantically more correct way even if it has consequences
The introduction of some of the missing "standard library" functions like "modulo", "add_with_carry" or "multiply_full_result" would be a good start, and would cause nearly no harm. Of course it would only make sense if it really enabled writing fast bigint libraries in a portable way without falling back to assembler.

Regards, Thomas

Thomas Klimpel wrote:
however i do not object to beginning to treat integers in a semantically more correct way even if it has consequences
The introduction of some of the missing "standard library" functions like "modulo", "add_with_carry" or "multiply_full_result" would be a good start, and would cause nearly no harm. Of course it would only make sense if it really enabled writing fast bigint libraries in a portable way without falling back to assembler.
There is nothing stopping us from writing a boost library that does just that, implementing it as portable C++ that is slow and providing inline assembler (or intrinsics based) implementations for common platforms. Then we can make a proposal to add these to the std library. Anyone want to suggest it as a GSOC project they are willing to mentor? I can't mentor more than one project, unfortunately.

Regards, Luke

Lucanus wrote on Wednesday, March 30, 2011 at 1:02:32:
There is nothing stopping us from writing a boost library that does just that, implementing it as portable C++ that is slow and providing inline assembler (or intrinsics based) implementations for common platforms. Then we can make a proposal to add these to the std library.
Anyone want to suggest it as a GSOC project they are willing to mentor? I can't mentor more than one project, unfortunately.
i second this

i think that ideally a feature like 'add_with_carry()' should be available as a compiler intrinsic

-- Pavel

P.S. if you notice a grammar mistake or weird phrasing in my message please point it out

pavel wrote:
i think that ideally a feature like 'add_with_carry()' should be available as a compiler intrinsic
I had a look at this for the XInt review, when I wrote some ARM assembler here: http://article.gmane.org/gmane.comp.lib.boost.devel/215565

The relevant code is this snippet to do a 96-bit add:

"adds %0, %3, %6\n"  // Add and set carry flag
"adcs %1, %4, %7\n"  // Add with carry-in and set carry flag
"adc  %2, %5, %8\n"  // Add with carry-in

The difficulty is that there is only one carry flag and that it is implicit. So in a compiler intrinsic for add-with-carry you need to either:

- Keep the carry implicit, and somehow rely on the compiler not inserting any instructions that change it in the generated instruction stream (which seems impractical), or
- Make it explicit, but since there is only one place that it can be stored, the compiler will have a challenge to generate code, or
- Transfer it to and from a general-purpose register, which will greatly reduce the speed (back to what you can get without assembler or intrinsics).

I came to the conclusion that it is better to write multi-word addition code (like the above) in assembler for each platform. I believe that the issues are similar on other architectures that have a flag register, but maybe others can confirm or deny that.

Any thoughts anyone?

Phil.

Phil Endecott wrote:
pavel wrote:
i think that ideally a feature like 'add_with_carry()' should be available as a compiler intrinsic
I had a look at this for the XInt review, when I wrote some ARM assembler here:
http://article.gmane.org/gmane.comp.lib.boost.devel/215565
The relevant code is this snippet to do a 96-bit add:
"adds %0, %3, %6\n" // Add and set carry flag "adcs %1, %4, %7\n" // Add with carry-in and set carry flag "adc %2, %5, %8\n" // Add with carry-in.
The difficulty is that there is only one carry flag and that it is implicit. So in a compiler intrinsic for add-with-carry you need to either:

- Keep the carry implicit, and somehow rely on the compiler not inserting any instructions that change it in the generated instruction stream (which seems impractical), or
- Make it explicit, but since there is only one place that it can be stored, the compiler will have a challenge to generate code, or
- Transfer it to and from a general-purpose register, which will greatly reduce the speed (back to what you can get without assembler or intrinsics).
I came to the conclusion that it is better to write multi-word addition code (like the above) in assembler for each platform.
I believe that the issues are similar on other architectures that have a flag register, but maybe others can confirm or deny that.
Any thoughts anyone?
This is the add with carry instruction I'm most familiar with using: http://software.intel.com/en-us/articles/prototype-primitives-guide/

ADC_PI - Add Int32 Vectors with Carry

Performs an element-by-element three-input addition between int32 vector v1, int32 vector v3, and the corresponding bit of vector mask k2. The carry from the sum for the nth element is written into the nth bit of vector mask k2_res.

_M512I _mm512_adc_pi(_M512I v1, __mmask k2, _M512I v3, __mmask *k2_res)
_M512I _mm512_mask_adc_pi(_M512I v1, __mmask k1, __mmask k2, _M512I v3, __mmask *k2_res)

It is a 16-wide 32-bit integer add with carry that stores the resulting output carry bits in a 16-bit mask register and accepts a mask register as the carry in as well.

Regards, Luke

Phil wrote on Thursday, March 31, 2011 at 4:51:18:
I came to the conclusion that it is better to write multi-word addition code (like the above) in assembler for each platform.
I believe that the issues are similar on other architectures that have a flag register, but maybe others can confirm or deny that.
Any thoughts anyone?
taking into account luke's message about adc_pi, i think that it's a perfect opportunity to abstract multiword addition (with carry) like this:

unsigned add_with_carry(unsigned *dest, const unsigned *src, size_t size);

the returned value is intended to be an overflow flag (read: the carry of the most significant word addition), that is the returned value is 0 or 1 (or even 2)

obviously, inside the function one may implement any incredible (asm?) code (possibly involving sse/avx/altivec/etc.)

-- Pavel

P.S. if you notice a grammar mistake or weird phrasing in my message please point it out

pavel wrote on Thursday, March 31, 2011 at 19:59:25:
unsigned add_with_carry(unsigned *dest, const unsigned *src, size_t size);
the returned value is intended to be an overflow flag (read: carry of the result of the most significant word addition)
that is the returned value is 0 or 1 (or even 2)
well, i got carried away with the "2" value

-- Pavel

P.S. if you notice a grammar mistake or weird phrasing in my message please point it out

pavel wrote:
taking into account luke's message about adc_pi, i think that it's a perfect opportunity to abstract multiword addition (with carry) like this:
unsigned add_with_carry(unsigned *dest, const unsigned *src, size_t size);
the returned value is intended to be an overflow flag (read: carry of the result of the most significant word addition)
What you describe is the sort of thing I (and I presume others) had in mind when we suggested that arithmetic algorithms be decoupled from data structures, i.e. you would have

template <typename ITER, typename CONST_ITER>
carry_type add_with_carry(ITER dest_begin, ITER dest_end, CONST_ITER src_begin, CONST_ITER src_end);

with a general-purpose implementation, and then specialise it:

carry_type add_with_carry(uint32_t* dest_begin, uint32_t* dest_end, const uint32_t* src_begin, const uint32_t* src_end);

and provide an asm implementation for the specialisation.

However, this doesn't suit the SIMD instruction that Luke described. That does multiple adds in parallel, and has carry-in and carry-out to each, but those carries are NOT propagated from one adder to its neighbour in that same instruction. So to use that instruction efficiently, say you are adding 4 multi-word numbers together:

R = A + B + C + D;

You would do

(tmp1, carries1) = A + B;
(tmp2, carries2) = tmp1 + C + carries1;
(tmp3, carries3) = tmp2 + D + carries2;
R = tmp3 + carries3;

Note that the +s in the first 3 lines use the SIMD instruction Luke described, and the temporaries are incomplete. In the last line, the carries are added in (with the carry potentially propagating across the whole width); this may need one instruction per word. In hardware this is called a "carry save adder"; I'm not sure if the SIMD people use the same terminology.

Anyway, the point is that even a multi-word addition abstraction is not specialisable in a way that fully exploits SIMD instructions. Doing that probably needs expression templates.

Regards, Phil.

Phil wrote on Thursday, March 31, 2011 at 20:41:08:
What you describe is the sort of thing I (and I presume others) had in mind when we suggested that arithmetic algorithms be decoupled from data structures...
not exactly. the thing i propose here is a concrete function (to be part of the runtime library)
i.e. you would have

template <typename ITER, typename CONST_ITER>
carry_type add_with_carry(ITER dest_begin, ITER dest_end, CONST_ITER src_begin, CONST_ITER src_end);
with a general-purpose implementation, and then specialise it:
carry_type add_with_carry(uint32_t* dest_begin, uint32_t* dest_end, const uint32_t* src_begin, const uint32_t* src_end);
and provide an asm implementation for the specialisation.
what is the reason for a template function? what is the semantics of the function (what does it do)? what are the requirements for the iterators? the underlying types? the containers?
However, this doesn't suit the SIMD instruction that Luke described. That does multiple adds in parallel, and has carry-in and carry-out to each, but those carries are NOT propagated from one adder to its neighbour in that same instruction. So to use that instruction efficiently, say you are adding 4 multi-word numbers together:
R = A + B + C + D;
You would do
(tmp1, carries1) = A + B;
(tmp2, carries2) = tmp1 + C + carries1;
(tmp3, carries3) = tmp2 + D + carries2;
R = tmp3 + carries3;
Note that the +s in the first 3 lines use the SIMD instruction Luke described, and the temporaries are incomplete. In the last line, the carries are added in (with the carry potentially propagating across the whole width); this may need one instruction per word.
In hardware this is called a "carry save adder"; I'm not sure if the SIMD people use the same terminology.
i figured out myself that this simd instruction operates "vertically" rather than "horizontally", but thanks for the explanation anyway
Anyway, the point is that even a multi-word addition abstraction is not specialisable in a way that fully exploits SIMD instructions. Doing that probably needs expression templates.
for the "a + b + c" case -- possibly the question here is: is it practical? but i guess this is far from the original "nice to have add_with_carry abstraction" discussion -- Pavel P.S. if you notice a grammar mistake or weird phrasing in my message please point it out

Phil Endecott wrote:
pavel wrote:
taking into account luke's message about adc_pi, i think that it's a perfect opportunity to abstract multiword addition (with carry) like this:
unsigned add_with_carry(unsigned *dest, const unsigned *src, size_t size);
the returned value is intended to be an overflow flag (read: carry of the result of the most significant word addition)
What you describe is the sort of thing I (and I presume others) had in mind when we suggested that arithmetic algorithms be decoupled from data structures, i.e. you would have
template <typename ITER, typename CONST_ITER>
carry_type add_with_carry(ITER dest_begin, ITER dest_end, CONST_ITER src_begin, CONST_ITER src_end);
with a general-purpose implementation, and then specialise it:
carry_type add_with_carry(uint32_t* dest_begin, uint32_t* dest_end, const uint32_t* src_begin, const uint32_t* src_end);
and provide an asm implementation for the specialisation.
However, this doesn't suit the SIMD instruction that Luke described. That does multiple adds in parallel, and has carry-in and carry-out to each, but those carries are NOT propagated from one adder to its neighbour in that same instruction. So to use that instruction efficiently, say you are adding 4 multi-word numbers together:
R = A + B + C + D;
You would do
(tmp1, carries1) = A + B;
(tmp2, carries2) = tmp1 + C + carries1;
(tmp3, carries3) = tmp2 + D + carries2;
R = tmp3 + carries3;
Note that the +s in the first 3 lines use the SIMD instruction Luke described, and the temporaries are incomplete. In the last line, the carries are added in (with the carry potentially propagating across the whole width); this may need one instruction per word.
In hardware this is called a "carry save adder"; I'm not sure if the SIMD people use the same terminology.
Anyway, the point is that even a multi-word addition abstraction is not specialisable in a way that fully exploits SIMD instructions. Doing that probably needs expression templates.
Implementing the add_with_carry function as the multiword add operation using intrinsics would probably be at least 200 lines of code once you handle all combinations of aligned and unaligned source and dest, plus unrolling the loop, plus handling a partially populated vector in the last iteration of the loop, plus shifting and adding the carry back in and looping on that if it itself results in a carry, then propagating the last carry forward to the next iteration of the main loop as the carry in. If the function returns a carry you need to extend the result by one word.

Phil's forwarding of carries to the next add using the carry-in argument is significantly better than calling the add_with_carry function three times, because not only do you save instructions for adding carries in, you also save on load and store, because you read all four inputs simultaneously and store the final result only once, never reading it again.

Implementing N-argument addition/subtraction of multiword integers as a template using intrinsics and selecting that algorithm using an expression template is feasible, but now we are talking about thousands of lines of code. It would be a very fun project, though.

Regards, Luke

pavel wrote:
Phil wrote on Thursday, March 31, 2011 at 4:51:18:
I came to the conclusion that it is better to write multi-word addition code (like the above) in assembler for each platform.
I believe that the issues are similar on other architectures that have a flag register, but maybe others can confirm or deny that.
Any thoughts anyone?
taking into account luke's message about adc_pi, i think that it's a perfect opportunity to abstract multiword addition (with carry) like this:
unsigned add_with_carry(unsigned *dest, const unsigned *src, size_t size);
the returned value is intended to be an overflow flag (read: carry of the result of the most significant word addition)
that is the returned value is 0 or 1 (or even 2)
obviously, inside the function one may implement any incredible (asm?) code (possibly involving sse/avx/altivec/etc.)
So you actually tried to "start" a C interface for a bigint library. It's actually an interesting idea to think about a pure C interface for such bigint functionality, because the best case scenario would be that compiler vendors provide efficient implementations of such an interface. However, coming up with a good C interface here might be a challenge similar to designing a complete bigint library, possibly even more difficult.

The answer Phil gave you quite surprised me. Phil Endecott wrote:
What you describe is the sort of thing I (and I presume others) had in mind when we suggested that arithmetic algorithms be decoupled from data structures"
It is indeed a good idea, but I had always interpreted this suggestion the other way round: I thought that the main application was to implement more algorithms (for example different versions of multiplication), but reusing the existing algorithms definitely also makes sense. For Phil's use case, however, restricting the integer types accepted by the algorithms might be important (only unsigned types for example, but maybe even more restrictive).

Regards, Thomas
participants (4)
- pavel
- Phil Endecott
- Simonson, Lucanus J
- Thomas Klimpel