My review of Decimal

Hi everybody, PDF size exceeds ML limit so here is google drive link https://drive.google.com/file/d/1kH9WkeTDxN7FFqM_g7hV_-e3kJPlTGBr/view

Ivan Matek wrote:
Hi everybody,
PDF size exceeds ML limit so here is google drive link https://drive.google.com/file/d/1kH9WkeTDxN7FFqM_g7hV_-e3kJPlTGBr/view
I would like to have this PDF on the Boost site somewhere, if there are no objections. I can take care of that and reply with the link. When people revisit past reviews in the Boost archives: https://lists.boost.org/Archives/boost/2025/01/date.php It would be good to not lose this, in case the Google Drive link goes away. Especially since a lot of effort was spent on it. And, of course, thank you for writing a review. Glen

On Tue, Jan 21, 2025 at 5:52 AM Glen Fernandes via Boost < boost@lists.boost.org> wrote:
Ivan Matek wrote:
Hi everybody,
PDF size exceeds ML limit so here is google drive link https://drive.google.com/file/d/1kH9WkeTDxN7FFqM_g7hV_-e3kJPlTGBr/view
I would like to have this PDF on the Boost site somewhere, if there are no objections. I can take care of that and reply with the link.
I would prefer if we ensure that all reviews are posted in text to the mailing list. Otherwise, it will be inaccessible to tooling which analyzes the mailing list, reviews in particular for historical archiving. To the review wizards, would it be possible to make this a rule? Thanks

Hi everybody,
PDF size exceeds ML limit so here is google drive link https://drive.google.com/file/d/1kH9WkeTDxN7FFqM_g7hV_-e3kJPlTGBr/view
Since this is a PDF I'll just pick out an add headers to the parts I'll address C++14 - There's a bifurcation here between people who want to Boost to support ancient standards and toolchains forever, and those who want cutting edge. In the implementation side C++14 has been a happy medium where I don't see complaints anymore about supporting older standards, and it's more functional than 03 or 11 with similarly more complaint toolchains. For a math lib there's no real advantage to requiring say 20. We barely use the STL as it is. Header Layout - https://github.com/cppalliance/decimal/issues/804 Correctness - Bugs are not uncommon during these reviews especially given the size of the library. I believe we have addressed, fixed, or in the process of fixing things you have brought up. General Comments Counting Digits - It's easy to say a binary search tree is a naive implementation. A CLZ based implementation from blog posts by Daniel Lemire and Junekey Jeon can be found on branch `better_count_dig`. I found it actually benchmarked worse on a number of platforms. "Only interesting design question is if functions producing new value should be void returning and modify inplace argument passed by reference(as for example std::ranges::sort does) or they should be returning a value." - I don't see any advantage to this as it would be a serious departure from expectations and norms. Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those. Maybe for the fast types? Those are still reasonably small structs. We can try a few. Thank you for your review and high-quality bug reports. Matt

Matt Borland wrote:
Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those.
Trivially copyable types up to 2*uint64_t in size are passed in registers (on non-Windows x86-64), so pass by reference will be a performance regression, probably a significant one.

On Tue, Jan 21, 2025 at 2:52 PM Glen Fernandes
I would like to have this PDF on the Boost site somewhere, if there are no objections. I can take care of that and reply with the link.
When people revisit past reviews in the Boost archives: https://lists.boost.org/Archives/boost/2025/01/date.php
If you could do that it would be appreciated although I guess that for at
least next few years my drive link is gonna stay alive.
On Tue, Jan 21, 2025 at 3:14 PM Peter Dimov via Boost
Matt Borland wrote:
Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those.
Trivially copyable types up to 2*uint64_t in size are passed in registers (on non-Windows x86-64), so pass by reference will be a performance regression, probably a significant one.
If you have time I suggest to read the PDF. I know your time is valuable, but I just think it is faster if we are on same page. In this particular case I tried to be pretty clear in document about this. I was not talking about small types. Also not all functions take 1 argument, in PDF I explicitly used copysign as example. And it *may* be possible that even if decimal128 pass by value is faster decimal128_fast pass by value is not. On Tue, Jan 21, 2025 at 3:36 PM John Maddock via Boost < boost@lists.boost.org> wrote:
It may be worth the experiment, but I'd be surprised if adding a level of indirection is of benefit over passing arguments by value in a couple of registers: possibly there may be benefit for functions accepting many arguments, but I'm assuming there aren't too many of those?
John.
I am a man of data. :) If there are realistic benchmarks saying passing "large" decimals is fine I will believe them. Till then I believe it is not optimal. Also library seems to prefer to pass by reference in certain cases, e.g. contrast this 2 "opposite" functions and their argument types constexpr auto isfinite(decimal128_fast val) noexcept -> bool { #ifndef BOOST_DECIMAL_FAST_MATH return val.significand_.high < detail::d128_fast_inf_high_bits; #else static_cast<void>(val); return true; #endif } constexpr auto not_finite(const decimal128_fast& val) noexcept -> bool { #ifndef BOOST_DECIMAL_FAST_MATH return val.significand_.high >= detail::d128_fast_inf_high_bits; #else static_cast<void>(val); return false; #endif }

Ivan Matek wrote:
Trivially copyable types up to 2*uint64_t in size are passed in registers (on non-Windows x86-64), so pass by reference will be a performance regression, probably a significant one.
If you have time I suggest to read the PDF. I know your time is valuable, but I just think it is faster if we are on same page.
I did read the PDF.
In this particular case I tried to be pretty clear in document about this. I was not talking about small types.
So, what types were you talking about? decimal32_fast is this public: using significand_type = std::uint_fast32_t; using exponent_type = std::uint_fast8_t; private: significand_type significand_ {}; exponent_type exponent_ {}; bool sign_ {}; which is 64 bit. decimal64_fast is this public: using significand_type = std::uint_fast64_t; using exponent_type = std::uint_fast16_t; private: significand_type significand_ {}; exponent_type exponent_ {}; bool sign_ {}; which is 128 bits. Only decimal128_fast doesn't fit in two registers.
Also not all functions take 1 argument, in PDF I explicitly used copysign as example.
x86-64 uses up to 6 registers for parameter passing (RDI, RSI, RDX, RCX, R8, R9), which means that up to three 128 bit trivially copyable types can be passed in registers when pass by value is used. copysign only needs two.
And it may be possible that even if decimal128 pass by value is faster decimal128_fast pass by value is not.
It's true that it's probably better to pass decimal128_fast by reference, but the API inconsistency is probably not worth it.

On Tue, Jan 21, 2025 at 4:20 PM Peter Dimov
Only decimal128_fast doesn't fit in two registers.
Also not all functions take 1 argument, in PDF I explicitly used copysign as example.
x86-64 uses up to 6 registers for parameter passing (RDI, RSI, RDX, RCX, R8, R9), which means that up to three 128 bit trivially copyable types can be passed in registers when pass by value is used.
I got a bit confused to be honest with all this I should have not gone and wrote reply before trying this out on godbolt, my apologies for noise.
But here are 2 things I believe are not correct in your response. Passing up to three 128 bit trivial types in registers is not property of X86-64, but of Linux ABI, Windows ABI is different. please see this slide: https://youtu.be/PCP3ckEqYK8?feature=shared&t=1183 On my machine decimal64_fast is not 128 bit because typedef unsigned long int uint_fast16_t; static_assert(sizeof(decimal64_fast) == 24); if you do not believe me here is godbolt: https://godbolt.org/z/xPzYTP6cM It is kind of ironic that Windows has it as 128bit, but it can not pass in register types larger than 64bits.

Ivan Matek wrote:
On Tue, Jan 21, 2025 at 4:20 PM Peter Dimov
mailto:pdimov@gmail.com > wrote: Only decimal128_fast doesn't fit in two registers.
Also not all functions take 1 argument, in PDF I explicitly used copysign as example.
x86-64 uses up to 6 registers for parameter passing (RDI, RSI, RDX, RCX, R8, R9), which means that up to three 128 bit trivially copyable types can be passed in registers when pass by value is used.
I got a bit confused to be honest with all this I should have not gone and wrote reply before trying this out on godbolt, my apologies for noise.
But here are 2 things I believe are not correct in your response. Passing up to three 128 bit trivial types in registers is not property of X86-64, but of Linux ABI, Windows ABI is different.
Yes, that's why I said "non-Windows x86-64". :-)
On my machine decimal64_fast is not 128 bit because typedef unsigned long int uint_fast16_t; static_assert(sizeof(decimal64_fast) == 24); if you do not believe me here is godbolt: https://godbolt.org/z/xPzYTP6cM
That's true, and it's actually not passed in registers for this reason. https://godbolt.org/z/qs5z6Th9e Interesting. Looks like this is caused by the use of uint_fast16_t, which is actually uint64_t. Maybe not the best choice. https://godbolt.org/z/8fczrzbEY is better. Although as I said in my other e-mail, maybe the exponent and the significand should just be packed into a single uint64_t.

On Tue, Jan 21, 2025 at 5:41 PM Peter Dimov
Ivan Matek wrote:
On Tue, Jan 21, 2025 at 4:20 PM Peter Dimov
mailto:pdimov@gmail.com > wrote: Only decimal128_fast doesn't fit in two registers.
> Also not all functions take 1 argument, in PDF I > explicitly used copysign as example.
x86-64 uses up to 6 registers for parameter passing (RDI, RSI, RDX, RCX, R8, R9), which means that up to three 128 bit trivially copyable types can be passed in registers when pass by value is used.
I got a bit confused to be honest with all this I should have not gone and wrote reply before trying this out on godbolt, my apologies for noise.
But here are 2 things I believe are not correct in your response. Passing up to three 128 bit trivial types in registers is not property of X86-64, but of Linux ABI, Windows ABI is different.
Yes, that's why I said "non-Windows x86-64". :-)
Sorry, will try to have larger context *window* when responding in the future. :)
On my machine decimal64_fast is not 128 bit because typedef unsigned long int uint_fast16_t; static_assert(sizeof(decimal64_fast) == 24); if you do not believe me here is godbolt: https://godbolt.org/z/xPzYTP6cM
That's true, and it's actually not passed in registers for this reason.
https://godbolt.org/z/qs5z6Th9e
Interesting. Looks like this is caused by the use of uint_fast16_t, which is actually uint64_t. Maybe not the best choice.
I actually thought about writing in review about if using std:: fast types makes sense or not, but I was already tired :).
From what I know they offer no benefit for most modern architectures, and more importantly even if some type is actually "fastest at least 16 bit integer" there is no guarantee that what std:: picked is actually that type. E.g. I can imagine that for case we are discussing here fastest type is uint16_t or uint32_t but library implementers picked uint64_t 20 years ago and now it is baked in forever(hello ABI my old friend). There is also the fact that fastest type may not be fastest for all compute benchmarks you can imagine, I presume on modern CPUs this does not matter, but maybe for some old CPU for add one type is fastest, for mul other is...
https://godbolt.org/z/8fczrzbEY is better.
As I wrote above I have no faith in fastness of _fast_t types in std:: so I agree.

Ivan Matek wrote:
There is also the fact that fastest type may not be fastest for all compute benchmarks you can imagine, I presume on modern CPUs this does not matter, but maybe for some old CPU for add one type is fastest, for mul other is...
On modern compilers, the knowledge the compiler derives from knowing that the value fits in 16 bits (when uint16_t is used) is generally much more valuable (because it allows the backend to pick the right instructions) than when uint_fast16_t = uint64_t is used. Basically, _fast types are almost never fast. Let's hope this curse doesn't afflict Decimal _fast types as well. :-)

On Tue, Jan 21, 2025 at 6:06 PM Peter Dimov
Basically, _fast types are almost never fast. Let's hope this curse doesn't afflict Decimal _fast types as well. :-)
Thank you for confirming this, now I regret not writing that also in review. :)
To recap the discussion wrt pass by reference: points we agree on(wrt 64 bit x86): - if we pass by value on Windows we are out of luck for most decimal types since there is ABI limitation for types greater than 8 bytes - use of uint16_fast_t in implementation pushed decimal64_fast over the limit of Linux ABI(16 bytes), _fast std:: types are not fast, would be nice to change this even if will not help on Windows points we disagree on: - pass by reference for large types (and if necessary mutate inplace) is still my prefered API. If we do want value returning functions then for large types I would prefer to pass args by const reference. Not trying to change your mind, just recapping above discussion, doing all sizeof and ABI math is tricky.

On Tuesday, January 21st, 2025 at 12:56 PM, Ivan Matek via Boost
On Tue, Jan 21, 2025 at 6:06 PM Peter Dimov pdimov@gmail.com wrote:
Basically, _fast types are almost never fast. Let's hope this curse doesn't afflict Decimal _fast types as well. :-)
Thank you for confirming this, now I regret not writing that also in
review. :)
To recap the discussion wrt pass by reference: points we agree on(wrt 64 bit x86):
- if we pass by value on Windows we are out of luck for most decimal types since there is ABI limitation for types greater than 8 bytes - use of uint16_fast_t in implementation pushed decimal64_fast over the limit of Linux ABI(16 bytes), _fast std:: types are not fast, would be nice to change this even if will not help on Windows
points we disagree on:
- pass by reference for large types (and if necessary mutate inplace) is still my prefered API. If we do want value returning functions then for large types I would prefer to pass args by const reference.
Not trying to change your mind, just recapping above discussion, doing all sizeof and ABI math is tricky.
Here's a data point on macOS ARM64:
Benchmarks on Current Develop:
===== Comparisons =====
comparisons

On Tue, Jan 21, 2025 at 7:12 PM Matt Borland
Every uint_fastXX_t or int_fastXX replaced by uintXX_t or intXX_t:
Interesting that it is slower. Could be just random optimizer noise... but
without a deep dive to look into generated asm and profile it is a bit tricky to know why. I would also try the middle ground(32 bit) in cases where this change went from 64 bit integer to 16 bit integer.

Matt Borland wrote:
Here's a data point on macOS ARM64: ... So we'd need to investigate all the different ABIs and switch the type on platform.
ARM64 also passes up to 128 bits in registers (x0-x7, up to four 128 bit types): https://godbolt.org/z/f6nGvx3zM vs https://godbolt.org/z/ds74Excq5 This may make the synthetic benchmarks slightly slower, but I don't think it's going to be faster overall in real code, because +50% size overhead will inevitably translate into more cache pressure.

General Comments
Counting Digits - It's easy to say a binary search tree is a naive implementation. A CLZ based implementation from blog posts by Daniel Lemire and Junekey Jeon can be found on branch `better_count_dig`. I found it actually benchmarked worse on a number of platforms. It's worth emphasizing that simple code will always run faster for simple problems than a theoretically better but more complex algorithm:
the classic example is that a linear search will generally beat a binary search when the number of items are low. There should probably be a comment in the code indicating that the alternative has been tried and was found to be slower though.
"Only interesting design question is if functions producing new value should be void returning and modify inplace argument passed by reference(as for example std::ranges::sort does) or they should be returning a value." - I don't see any advantage to this as it would be a serious departure from expectations and norms.
Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those. Maybe for the fast types? Those are still reasonably small structs. We can try a few.
It may be worth the experiment, but I'd be surprised if adding a level of indirection is of benefit over passing arguments by value in a couple of registers: possibly there may be benefit for functions accepting many arguments, but I'm assuming there aren't too many of those? John.

On Tue, Jan 21, 2025 at 09:36, John Maddock via Boost <boost@lists.boost.org> wrote: > General Comments > > Counting Digits - It's easy to say a binary search tree is a naive implementation. A CLZ based implementation from blog posts by Daniel Lemire and Junekey Jeon can be found on branch `better_count_dig`. I found it actually benchmarked worse on a number of platforms. It's worth emphasizing that simple code will always run faster for simple problems than a theoretically better but more complex algorithm: the classic example is that a linear search will generally beat a binary search when the number of items are low. There should probably be a comment in the code indicating that the alternative has been tried and was found to be slower though. > "Only interesting design question is if functions producing new value should be void returning and > modify inplace argument passed by reference(as for example std::ranges::sort does) or they > should be returning a value." - I don't see any advantage to this as it would be a serious departure from expectations and norms. > > Taking decimal types by reference instead of by value - fundamentally the decimalXX types are std::uint32_t, std::uint64_t, and struct { uint64_t hi, uint64_t lo }. I don't think you'll see any performance improvements with those. Maybe for the fast types? Those are still reasonably small structs. We can try a few. It may be worth the experiment, but I'd be surprised if adding a level of indirection is of benefit over passing arguments by value in a couple of registers: possibly there may be benefit for functions accepting many arguments, but I'm assuming there aren't too many of those? The only ones like that are implementations of fundamental operations like add, sub, mul, div. Those take 2 bools and 4 integers as parameters since we’ve already decoded the decimal number John. _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
participants (6)
-
Glen Fernandes
-
Ivan Matek
-
John Maddock
-
Matt Borland
-
Peter Dimov
-
Vinnie Falco