My review of Decimal

older
Reminder to submit Boost.Decimal...

Ivan Matek

21 Jan 2025 21 Jan '25

12:53 p.m.

Hi everybody, PDF size exceeds ML limit so here is google drive link https://drive.google.com/file/d/1kH9WkeTDxN7FFqM_g7hV_-e3kJPlTGBr/view

Show replies by date

Glen Fernandes

21 Jan 21 Jan

1:52 p.m.

Ivan Matek wrote:

...

Hi everybody,

PDF size exceeds ML limit so here is google drive link https://drive.google.com/file/d/1kH9WkeTDxN7FFqM_g7hV_-e3kJPlTGBr/view

I would like to have this PDF on the Boost site somewhere, if there are no objections. I can take care of that and reply with the link. When people revisit past reviews in the Boost archives: https://lists.boost.org/Archives/boost/2025/01/date.php It would be good to not lose this, in case the Google Drive link goes away. Especially since a lot of effort was spent on it. And, of course, thank you for writing a review. Glen

Vinnie Falco

3:33 p.m.

On Tue, Jan 21, 2025 at 5:52 AM Glen Fernandes via Boost < boost@lists.boost.org> wrote:

...

Ivan Matek wrote:

...
Hi everybody,

PDF size exceeds ML limit so here is google drive link https://drive.google.com/file/d/1kH9WkeTDxN7FFqM_g7hV_-e3kJPlTGBr/view

I would like to have this PDF on the Boost site somewhere, if there are no objections. I can take care of that and reply with the link.

I would prefer if we ensure that all reviews are posted in text to the mailing list. Otherwise, it will be inaccessible to tooling which analyzes the mailing list, reviews in particular for historical archiving. To the review wizards, would it be possible to make this a rule? Thanks

Matt Borland

1:56 p.m.

...

Hi everybody,

...

PDF size exceeds ML limit so here is google drive link https://drive.google.com/file/d/1kH9WkeTDxN7FFqM_g7hV_-e3kJPlTGBr/view

Since this is a PDF I'll just pick out an add headers to the parts I'll address C++14 - There's a bifurcation here between people who want to Boost to support ancient standards and toolchains forever, and those who want cutting edge. In the implementation side C++14 has been a happy medium where I don't see complaints anymore about supporting older standards, and it's more functional than 03 or 11 with similarly more complaint toolchains. For a math lib there's no real advantage to requiring say 20. We barely use the STL as it is. Header Layout - https://github.com/cppalliance/decimal/issues/804 Correctness - Bugs are not uncommon during these reviews especially given the size of the library. I believe we have addressed, fixed, or in the process of fixing things you have brought up. General Comments Counting Digits - It's easy to say a binary search tree is a naive implementation. A CLZ based implementation from blog posts by Daniel Lemire and Junekey Jeon can be found on branch `better_count_dig`. I found it actually benchmarked worse on a number of platforms. "Only interesting design question is if functions producing new value should be void returning and modify inplace argument passed by reference(as for example std::ranges::sort does) or they should be returning a value." - I don't see any advantage to this as it would be a serious departure from expectations and norms. Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those. Maybe for the fast types? Those are still reasonably small structs. We can try a few. Thank you for your review and high-quality bug reports. Matt

Peter Dimov

2:14 p.m.

Matt Borland wrote:

...

Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those.

Trivially copyable types up to 2*uint64_t in size are passed in registers (on non-Windows x86-64), so pass by reference will be a performance regression, probably a significant one.

Ivan Matek

2:59 p.m.

On Tue, Jan 21, 2025 at 2:52 PM Glen Fernandes <glen.fernandes@gmail.com> wrote:

...

I would like to have this PDF on the Boost site somewhere, if there are no objections. I can take care of that and reply with the link.

When people revisit past reviews in the Boost archives: https://lists.boost.org/Archives/boost/2025/01/date.php

If you could do that it would be appreciated although I guess that for at least next few years my drive link is gonna stay alive. On Tue, Jan 21, 2025 at 3:14 PM Peter Dimov via Boost <boost@lists.boost.org> wrote:

...

Matt Borland wrote:

...
Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those.

Trivially copyable types up to 2*uint64_t in size are passed in registers (on non-Windows x86-64), so pass by reference will be a performance regression, probably a significant one.

If you have time I suggest to read the PDF. I know your time is valuable, but I just think it is faster if we are on same page. In this particular case I tried to be pretty clear in document about this. I was not talking about small types. Also not all functions take 1 argument, in PDF I explicitly used copysign as example. And it *may* be possible that even if decimal128 pass by value is faster decimal128_fast pass by value is not. On Tue, Jan 21, 2025 at 3:36 PM John Maddock via Boost < boost@lists.boost.org> wrote:

...

It may be worth the experiment, but I'd be surprised if adding a level of indirection is of benefit over passing arguments by value in a couple of registers: possibly there may be benefit for functions accepting many arguments, but I'm assuming there aren't too many of those?

John.

I am a man of data. :) If there are realistic benchmarks saying passing "large" decimals is fine I will believe them. Till then I believe it is not optimal. Also library seems to prefer to pass by reference in certain cases, e.g. contrast this 2 "opposite" functions and their argument types constexpr auto isfinite(decimal128_fast val) noexcept -> bool { #ifndef BOOST_DECIMAL_FAST_MATH return val.significand_.high < detail::d128_fast_inf_high_bits; #else static_cast<void>(val); return true; #endif } constexpr auto not_finite(const decimal128_fast& val) noexcept -> bool { #ifndef BOOST_DECIMAL_FAST_MATH return val.significand_.high >= detail::d128_fast_inf_high_bits; #else static_cast<void>(val); return false; #endif }

Peter Dimov

3:20 p.m.

Ivan Matek wrote:

...

Trivially copyable types up to 2*uint64_t in size are passed in registers (on non-Windows x86-64), so pass by reference will be a performance regression, probably a significant one.

If you have time I suggest to read the PDF. I know your time is valuable, but I just think it is faster if we are on same page.

I did read the PDF.

...

In this particular case I tried to be pretty clear in document about this. I was not talking about small types.

So, what types were you talking about? decimal32_fast is this public: using significand_type = std::uint_fast32_t; using exponent_type = std::uint_fast8_t; private: significand_type significand_ {}; exponent_type exponent_ {}; bool sign_ {}; which is 64 bit. decimal64_fast is this public: using significand_type = std::uint_fast64_t; using exponent_type = std::uint_fast16_t; private: significand_type significand_ {}; exponent_type exponent_ {}; bool sign_ {}; which is 128 bits. Only decimal128_fast doesn't fit in two registers.

...

Also not all functions take 1 argument, in PDF I explicitly used copysign as example.

x86-64 uses up to 6 registers for parameter passing (RDI, RSI, RDX, RCX, R8, R9), which means that up to three 128 bit trivially copyable types can be passed in registers when pass by value is used. copysign only needs two.

...

And it may be possible that even if decimal128 pass by value is faster decimal128_fast pass by value is not.

It's true that it's probably better to pass decimal128_fast by reference, but the API inconsistency is probably not worth it.

Ivan Matek

4:04 p.m.

On Tue, Jan 21, 2025 at 4:20 PM Peter Dimov <pdimov@gmail.com> wrote:

...

Only decimal128_fast doesn't fit in two registers.

...
Also not all functions take 1 argument, in PDF I explicitly used copysign as example.

x86-64 uses up to 6 registers for parameter passing (RDI, RSI, RDX, RCX, R8, R9), which means that up to three 128 bit trivially copyable types can be passed in registers when pass by value is used.

I got a bit confused to be honest with all this I should have not gone and wrote reply before trying this out on godbolt, my apologies for noise.

But here are 2 things I believe are not correct in your response. Passing up to three 128 bit trivial types in registers is not property of X86-64, but of Linux ABI, Windows ABI is different. please see this slide: https://youtu.be/PCP3ckEqYK8?feature=shared&t=1183 On my machine decimal64_fast is not 128 bit because typedef unsigned long int uint_fast16_t; static_assert(sizeof(decimal64_fast) == 24); if you do not believe me here is godbolt: https://godbolt.org/z/xPzYTP6cM It is kind of ironic that Windows has it as 128bit, but it can not pass in register types larger than 64bits.

Peter Dimov

4:41 p.m.

Ivan Matek wrote:

...

On Tue, Jan 21, 2025 at 4:20 PM Peter Dimov <pdimov@gmail.com <mailto:pdimov@gmail.com> > wrote:

Only decimal128_fast doesn't fit in two registers.

...
Also not all functions take 1 argument, in PDF I explicitly used copysign as example.

x86-64 uses up to 6 registers for parameter passing (RDI, RSI, RDX, RCX, R8, R9), which means that up to three 128 bit trivially copyable types can be passed in registers when pass by value is used.

I got a bit confused to be honest with all this I should have not gone and wrote reply before trying this out on godbolt, my apologies for noise.

But here are 2 things I believe are not correct in your response. Passing up to three 128 bit trivial types in registers is not property of X86-64, but of Linux ABI, Windows ABI is different.

Yes, that's why I said "non-Windows x86-64". :-)

...

On my machine decimal64_fast is not 128 bit because typedef unsigned long int uint_fast16_t; static_assert(sizeof(decimal64_fast) == 24); if you do not believe me here is godbolt: https://godbolt.org/z/xPzYTP6cM

That's true, and it's actually not passed in registers for this reason. https://godbolt.org/z/qs5z6Th9e Interesting. Looks like this is caused by the use of uint_fast16_t, which is actually uint64_t. Maybe not the best choice. https://godbolt.org/z/8fczrzbEY is better. Although as I said in my other e-mail, maybe the exponent and the significand should just be packed into a single uint64_t.

Ivan Matek

4:57 p.m.

On Tue, Jan 21, 2025 at 5:41 PM Peter Dimov <pdimov@gmail.com> wrote:

...

Ivan Matek wrote:

...
On Tue, Jan 21, 2025 at 4:20 PM Peter Dimov <pdimov@gmail.com <mailto:pdimov@gmail.com> > wrote:

Only decimal128_fast doesn't fit in two registers.

> Also not all functions take 1 argument, in PDF I > explicitly used copysign as example.

x86-64 uses up to 6 registers for parameter passing (RDI, RSI, RDX, RCX, R8, R9), which means that up to three 128 bit trivially copyable types can be passed in registers when pass by value is used.

I got a bit confused to be honest with all this I should have not gone and wrote reply before trying this out on godbolt, my apologies for noise.

But here are 2 things I believe are not correct in your response. Passing up to three 128 bit trivial types in registers is not property of X86-64, but of Linux ABI, Windows ABI is different.

Yes, that's why I said "non-Windows x86-64". :-)

Sorry, will try to have larger context *window* when responding in the future. :)

...

...
On my machine decimal64_fast is not 128 bit because typedef unsigned long int uint_fast16_t; static_assert(sizeof(decimal64_fast) == 24); if you do not believe me here is godbolt: https://godbolt.org/z/xPzYTP6cM

That's true, and it's actually not passed in registers for this reason.

https://godbolt.org/z/qs5z6Th9e

Interesting. Looks like this is caused by the use of uint_fast16_t, which is actually uint64_t. Maybe not the best choice.

I actually thought about writing in review about if using std:: fast types makes sense or not, but I was already tired :).

...

From what I know they offer no benefit for most modern architectures, and more importantly even if some type is actually "fastest at least 16 bit integer" there is no guarantee that what std:: picked is actually that type. E.g. I can imagine that for case we are discussing here fastest type is uint16_t or uint32_t but library implementers picked uint64_t 20 years ago and now it is baked in forever(hello ABI my old friend). There is also the fact that fastest type may not be fastest for all compute benchmarks you can imagine, I presume on modern CPUs this does not matter, but maybe for some old CPU for add one type is fastest, for mul other is...

...

https://godbolt.org/z/8fczrzbEY is better.

As I wrote above I have no faith in fastness of _fast_t types in std:: so I agree.

Peter Dimov

5:06 p.m.

Ivan Matek wrote:

...

There is also the fact that fastest type may not be fastest for all compute benchmarks you can imagine, I presume on modern CPUs this does not matter, but maybe for some old CPU for add one type is fastest, for mul other is...

On modern compilers, the knowledge the compiler derives from knowing that the value fits in 16 bits (when uint16_t is used) is generally much more valuable (because it allows the backend to pick the right instructions) than when uint_fast16_t = uint64_t is used. Basically, _fast types are almost never fast. Let's hope this curse doesn't afflict Decimal _fast types as well. :-)

Ivan Matek

5:56 p.m.

On Tue, Jan 21, 2025 at 6:06 PM Peter Dimov <pdimov@gmail.com> wrote:

...

Basically, _fast types are almost never fast. Let's hope this curse doesn't afflict Decimal _fast types as well. :-)

Thank you for confirming this, now I regret not writing that also in review. :)

To recap the discussion wrt pass by reference: points we agree on(wrt 64 bit x86): - if we pass by value on Windows we are out of luck for most decimal types since there is ABI limitation for types greater than 8 bytes - use of uint16_fast_t in implementation pushed decimal64_fast over the limit of Linux ABI(16 bytes), _fast std:: types are not fast, would be nice to change this even if will not help on Windows points we disagree on: - pass by reference for large types (and if necessary mutate inplace) is still my prefered API. If we do want value returning functions then for large types I would prefer to pass args by const reference. Not trying to change your mind, just recapping above discussion, doing all sizeof and ABI math is tricky.

Matt Borland

6:12 p.m.

On Tuesday, January 21st, 2025 at 12:56 PM, Ivan Matek via Boost <boost@lists.boost.org> wrote:

...

On Tue, Jan 21, 2025 at 6:06 PM Peter Dimov pdimov@gmail.com wrote:

...

...
Basically, _fast types are almost never fast. Let's hope this curse doesn't afflict Decimal _fast types as well. :-)

...

...
Thank you for confirming this, now I regret not writing that also in

...

review. :)

...

To recap the discussion wrt pass by reference: points we agree on(wrt 64 bit x86):

...

- if we pass by value on Windows we are out of luck for most decimal types since there is ABI limitation for types greater than 8 bytes - use of uint16_fast_t in implementation pushed decimal64_fast over the limit of Linux ABI(16 bytes), _fast std:: types are not fast, would be nice to change this even if will not help on Windows

...

points we disagree on:

...

- pass by reference for large types (and if necessary mutate inplace) is still my prefered API. If we do want value returning functions then for large types I would prefer to pass args by const reference.

...

Not trying to change your mind, just recapping above discussion, doing all sizeof and ABI math is tricky.

Here's a data point on macOS ARM64: Benchmarks on Current Develop: ===== Comparisons ===== comparisons<dec32_fast >: 555534 us (s=29999985) comparisons<dec64_fast >: 680204 us (s=29999985) comparisons<dec128_fast>: 598125 us (s=29999985) ===== Addition ===== Addition<dec32_fast >: 1112121 us Addition<dec64_fast >: 1282197 us Addition<dec128_fast>: 6967490 us ===== Subtraction ===== Subtraction<dec32_fast >: 930937 us Subtraction<dec64_fast >: 1127780 us Subtraction<dec128_fast>: 3645142 us ===== Multiplication ===== Multiplication<dec32_fast >: 776308 us Multiplication<dec64_fast >: 1145653 us Multiplication<dec128_fast>: 17949983 us ===== Division ===== Division<dec32_fast >: 904825 us Division<dec64_fast >: 1659034 us Division<dec128_fast>: 1648759 us Every uint_fastXX_t or int_fastXX replaced by uintXX_t or intXX_t: ===== Comparisons ===== comparisons<dec32_fast >: 586511 us comparisons<dec64_fast >: 709211 us comparisons<dec128_fast>: 657634 us ===== Addition ===== Addition<dec32_fast >: 1121786 us Addition<dec64_fast >: 1289409 us Addition<dec128_fast>: 7020519 us ===== Subtraction ===== Subtraction<dec32_fast >: 963632 us Subtraction<dec64_fast >: 1160965 us Subtraction<dec128_fast>: 3695792 us ===== Multiplication ===== Multiplication<dec32_fast >: 800159 us Multiplication<dec64_fast >: 1179959 us Multiplication<dec128_fast>: 18459046 us ===== Division ===== Division<dec32_fast >: 928643 us Division<dec64_fast >: 1683552 us Division<dec128_fast>: 1684860 us So we'd need to investigate all the different ABIs and switch the type on platform. I'll note that my primary development is on ARM Mac. Another win for Intel's complete vertical integration. Matt

Ivan Matek

6:26 p.m.

On Tue, Jan 21, 2025 at 7:12 PM Matt Borland <matt@mattborland.com> wrote:

...

Every uint_fastXX_t or int_fastXX replaced by uintXX_t or intXX_t:

Interesting that it is slower. Could be just random optimizer noise... but

without a deep dive to look into generated asm and profile it is a bit tricky to know why. I would also try the middle ground(32 bit) in cases where this change went from 64 bit integer to 16 bit integer.

Peter Dimov

6:37 p.m.

Matt Borland wrote:

...

Here's a data point on macOS ARM64: ... So we'd need to investigate all the different ABIs and switch the type on platform.

ARM64 also passes up to 128 bits in registers (x0-x7, up to four 128 bit types): https://godbolt.org/z/f6nGvx3zM vs https://godbolt.org/z/ds74Excq5 This may make the synthetic benchmarks slightly slower, but I don't think it's going to be faster overall in real code, because +50% size overhead will inevitably translate into more cache pressure.

John Maddock

2:36 p.m.

...

General Comments

Counting Digits - It's easy to say a binary search tree is a naive implementation. A CLZ based implementation from blog posts by Daniel Lemire and Junekey Jeon can be found on branch `better_count_dig`. I found it actually benchmarked worse on a number of platforms. It's worth emphasizing that simple code will always run faster for simple problems than a theoretically better but more complex algorithm:

the classic example is that a linear search will generally beat a binary search when the number of items are low. There should probably be a comment in the code indicating that the alternative has been tried and was found to be slower though.

...

"Only interesting design question is if functions producing new value should be void returning and modify inplace argument passed by reference(as for example std::ranges::sort does) or they should be returning a value." - I don't see any advantage to this as it would be a serious departure from expectations and norms.

Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those. Maybe for the fast types? Those are still reasonably small structs. We can try a few.

It may be worth the experiment, but I'd be surprised if adding a level of indirection is of benefit over passing arguments by value in a couple of registers: possibly there may be benefit for functions accepting many arguments, but I'm assuming there aren't too many of those? John.

Matt Borland

2:45 p.m.

On Tue, Jan 21, 2025 at 09:36, John Maddock via Boost <boost@lists.boost.org> wrote: > General Comments > > Counting Digits - It's easy to say a binary search tree is a naive implementation. A CLZ based implementation from blog posts by Daniel Lemire and Junekey Jeon can be found on branch `better_count_dig`. I found it actually benchmarked worse on a number of platforms. It's worth emphasizing that simple code will always run faster for simple problems than a theoretically better but more complex algorithm: the classic example is that a linear search will generally beat a binary search when the number of items are low. There should probably be a comment in the code indicating that the alternative has been tried and was found to be slower though. > "Only interesting design question is if functions producing new value should be void returning and > modify inplace argument passed by reference(as for example std::ranges::sort does) or they > should be returning a value." - I don't see any advantage to this as it would be a serious departure from expectations and norms. > > Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those. Maybe for the fast types? Those are still reasonably small structs. We can try a few. It may be worth the experiment, but I'd be surprised if adding a level of indirection is of benefit over passing arguments by value in a couple of registers: possibly there may be benefit for functions accepting many arguments, but I'm assuming there aren't too many of those? The only ones like that are implementations of fundamental operations like add, sub, mul, div. Those take 2 bools and 4 integers as parameters since we’ve already decoded the decimal number John. _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

200

Age (days ago)

200

Last active (days ago)

List overview

Download

16 comments

6 participants

participants (6)

Glen Fernandes
Ivan Matek
John Maddock
Matt Borland
Peter Dimov
Vinnie Falco