Re: [boost] Decimal: Formal review begins

I mentioned this to Matt, but I would like to see benchmarks comparing boost::decimal64_fast to a Decimal64 backed by Intel's DFP library, which is what I know to be in actual use today: https://www.intel.com/content/www/us/en/developer/articles/tool/intel-decima...

Not that the benchmarks comparing to GCC's Decimal64 (which is based on the libbid which ships with libgcc) aren't useful, I personally don't know anyone using that today.

Benchmarks should ideally also include Intel's compiler, because at least one of the relevant parties who motivated me to suggest the Decimal64 library (to Vinnie as a potential project) do use the Intel C++ compiler (and their Fortran compiler) for areas where they perform better.

(The non-fast versions don't matter to me. I don't know anyone who would want to use them.)

Glen

On Monday, January 20th, 2025 at 12:48 PM, Glen Fernandes via Boost wrote:
I mentioned this to Matt, but I would like to see benchmarks comparing boost::decimal64_fast to a Decimal64 backed by Intel's DFP library, which is what I know to be in actual use today:
https://www.intel.com/content/www/us/en/developer/articles/tool/intel-decima...
Not that the benchmarks comparing to GCC's Decimal64 (which is based on the libbid which ships with libgcc) aren't useful, I personally don't know anyone using that today.
Benchmarks should ideally also include Intel's compiler, because at least one of the relevant parties who motivated me to suggest the Decimal64 library (to Vinnie as a potential project) do use the Intel C++ compiler (and their Fortran compiler) for areas where they perform better.
(The non-fast versions don't matter to me. I don't know anyone who would want to use them).
Glen
Here are some preliminary results:
All tests run on an i9-11900k with Ubuntu 24.04 and the Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205). The Intel benchmarks are written in C but should be a faithful port. Bottom line up front is the Intel library is an order of magnitude faster.
Intel
===== Comparisons =====
Comparisons<Decimal32>: 73831 us
Comparisons<Decimal64>: 76725 us
===== Addition =====
Addition<Decimal32>: 81544 us
Addition<Decimal64>: 86667 us
===== Subtraction =====
Subtraction<Decimal32>: 81736 us
Subtraction<Decimal64>: 86587 us
===== Multiplication =====
Multiplication<Decimal32>: 81631 us
Multiplication<Decimal64>: 86939 us

Proposal and built-in types
===== Comparisons =====
comparisons<float>: 73013 us
comparisons<double>: 104019 us
comparisons<…>: 562165 us
comparisons<…>: 592997 us

On Mon, Jan 20, 2025 at 3:11 PM Matt Borland wrote:
On Monday, January 20th, 2025 at 12:48 PM, Glen Fernandes wrote:
I mentioned this to Matt, but I would like to see benchmarks comparing boost::decimal64_fast to a Decimal64 backed by Intel's DFP library, which is what I know to be in actual use today:
https://www.intel.com/content/www/us/en/developer/articles/tool/intel-decima...
Not that the benchmarks comparing to GCC's Decimal64 (which is based on the libbid which ships with libgcc) aren't useful, I personally don't know anyone using that today.
Benchmarks should ideally also include Intel's compiler, because at least one of the relevant parties who motivated me to suggest the Decimal64 library (to Vinnie as a potential project) do use the Intel C++ compiler (and their Fortran compiler) for areas where they perform better.
(The non-fast versions don't matter to me. I don't know anyone who would want to use them).
Here are some preliminary results:
All tests run on an i9-11900k with Ubuntu 24.04 and the Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205). The Intel benchmarks are written in C but should be a faithful port. Bottom line up front is the Intel library is an order of magnitude faster.
Thank you Matt. This reconciles with what I know about everyone using the Intel library (especially the BSL proclamation about it being 10 times faster). These numbers should be added to the documentation.

I think there is some work to be done to make Boost.Decimal viable. i.e. We don't want to be an order of magnitude slower than what everyone else is actually using today.

Glen

On Mon, 20 Jan 2025, 21:28 Glen Fernandes via Boost wrote:
On Mon, Jan 20, 2025 at 3:11 PM Matt Borland wrote:
On Monday, January 20th, 2025 at 12:48 PM, Glen Fernandes wrote:
I mentioned this to Matt, but I would like to see benchmarks comparing boost::decimal64_fast to a Decimal64 backed by Intel's DFP library, which is what I know to be in actual use today:
https://www.intel.com/content/www/us/en/developer/articles/tool/intel-decima...
Not that the benchmarks comparing to GCC's Decimal64 (which is based on the libbid which ships with libgcc) aren't useful, I personally don't know anyone using that today.
Benchmarks should ideally also include Intel's compiler, because at least one of the relevant parties who motivated me to suggest the Decimal64 library (to Vinnie as a potential project) do use the Intel C++ compiler (and their Fortran compiler) for areas where they perform better.
(The non-fast versions don't matter to me. I don't know anyone who would want to use them).
Here are some preliminary results:
All tests run on an i9-11900k with Ubuntu 24.04 and the Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205). The Intel benchmarks are written in C but should be a faithful port. Bottom line up front is the Intel library is an order of magnitude faster.
Thank you Matt. This reconciles with what I know about everyone using the Intel library (especially the BSL proclamation about it being 10 times faster). These numbers should be added to the documentation.
I think there is some work to be done to make Boost.Decimal viable. i.e. We don't want to be an order of magnitude slower than what everyone else is actually using today.
Glen

Glen, thanks for pointing this out. In light of this new piece of data, I'm afraid I will be adding a new condition to my ACCEPT vote: the library should be optimized to be within the same order of magnitude of performance as the Intel one.

Regards,
Ruben.

On Mon, Jan 20, 2025 at 12:53 PM Ruben Perez via Boost < boost@lists.boost.org> wrote:
Glen, thanks for pointing this out. In light of this new piece of data, I'm afraid I will be adding a new condition to my ACCEPT vote: the library should be optimized to be within the same order of magnitude of performance as the Intel one.
Disclaimer: Alliance employee

I think it's good that Glen pointed out we need to be benchmarking and comparing ourselves to the current pack leaders, but I'm hesitant to make performance targets acceptance criteria.

If you actually sit down and look at how long Intel's DFP library has been around, it seems like version 1.0 came out in 2009. That gives the library a literal 16 years of a head start here. Performance is also a moving target and, what's more, hyper-optimized code is hard to read for us normies who aren't experts in computational techniques.

I think for a Boost review, we should be mindful of the current performance and where it stands. But we should also remind ourselves that we're reviewing the author as well. The question comes down to: do we trust that Matt and Christopher are going to iterate on the implementation? Personally, I'd say that I trust Matt because I've conversed with him in the past. So that's at least half the battle right there.

Hash2 has functions much slower than OpenSSL's implementations, which are all these behemoths written using hand-crafted assembly. But we're actively working on closing the gap. And what's more, Unordered didn't always have the fastest hash tables either. Instead, the library slowly evolved over time as new techniques and research became published and used by industry.
From what I can tell, Decimal has well-defined use-cases, its license is more permissive than its competition, it's constexpr, it's C++-first. Saying "just use the C library" always sounds good on paper, but it winds up being a lot lamer than we'd like it to be in practice.
To me, Boost review has boiled down to vibes. Do we trust the vibes that Boost is better off with this library, and do we trust that the authors are going to stick around and evolve the library? In my view, this library solves problems people would want Boost to solve, so I'd say I give a recommendation to ACCEPT this library.

This is a no-frills library that doesn't really have much cognitive overhead, so reviewing it is always kind of tricky. But in these cases, I think a fresh coat of paint just on the interface is a good start. And also, the amount of tests is exactly what I'd expect, and I think we should remember that the lion's share of the work went into that.

- Christian

On Mon, Jan 20, 2025 at 10:32 PM Christian Mazakas via Boost < boost@lists.boost.org> wrote:
Hash2 has functions much slower than OpenSSL's implementations, which are all these behemoths written using hand-crafted assembly. But we're actively working on closing the gap.
I have a fear this discussion is gonna go offtopic, but I do think it is important to remember that Hash2 was designed as a framework for "binding" user types with algorithms, with some default implementations of algorithms. So although I was not delighted that it is much slower than alternatives, it was much less problematic there.

As for Unordered: my history knowledge is bad, but was not Unordered the library that originally inspired C++11 unordered_map? *First Release 1.36.0*

Are you talking about speed compared to antique Google hash map implementations like google::dense_hash_map and google::sparse_hash_map?

On Mon, Jan 20, 2025 at 1:48 PM Ivan Matek wrote:
I have a fear this discussion is gonna go offtopic, but I do think it is important to remember that Hash2 was designed as a framework for "binding" user types with algorithms, with some default implementations of algorithms. So although I was not delighted that it is much slower than alternatives, it was much less problematic there.
As for Unordered: My history knowledge is bad, but was not Unordered the library that originally inspired C++11 unordered_map? *First Release 1.36.0* Are you talking about speed compared to antique Google hash map implementations like google::dense_hash_map and google::sparse_hash_map?
Ha ha, in my opinion, it's not a good Boost review without going incredibly off-topic and waxing philosophical.

My point was, there could've been a strong case to be made against the library for not being as fast as C libraries that have had like a decade and a half of optimizations. I am happy to hear that you didn't think it was problematic, however. I love software engineering in terms of goals, and there's a lot of cool stuff to be done in the hashing space and we're always trying to make it faster.

I was referring to comparisons with absl::flat_hash_map or any other Swiss Tables implementation. One of these days, I intend to actually write a benchmark comparing Rust's tables as well, because they also use Swiss Tables.

I guess my point was, libraries are temporal things and they very rarely stay the same as the library that was reviewed. That's why I think that even though performance obviously matters, I'm hesitant to make it a criterion for acceptance. But others may feel differently, which is the whole point of Boost review.

- Christian

On Mon, Jan 20, 2025 at 3:26 PM Joaquín M López Muñoz via Boost < boost@lists.boost.org> wrote:
Smartest thing you ever said. Thanks

On 1/21/25 02:25, Joaquín M López Muñoz via Boost wrote:
Joaquín, I think there's some issue with your mail client that prevents some of your messages from coming through to the ML. It's not the first time I've noticed empty messages from you. Perhaps you should enable plain text content in your mail client?

On 21/01/2025 at 3:11, Andrey Semashev via Boost wrote:
On 1/21/25 02:25, Joaquín M López Muñoz via Boost wrote:
Joaquín, I think there's some issue with your mail client that prevents some of your messages from coming through to the ML. It's not the first time I've noticed empty messages from you. Perhaps you should enable plain text content in your mail client?
Thanks for the heads up. It's my phone email client; it seems to switch to some HTML format of sorts when I include a link in the response. Re-sent, hopefully it'll work now.

Joaquin M Lopez Munoz

On 20/01/2025 at 22:48, Ivan Matek via Boost wrote:
As for Unordered: My history knowledge is bad, but was not Unordered the library that originally inspired C++11 unordered_map? *First Release 1.36.0*
No, Boost.Unordered first appeared in 2008, whereas the seminal paper for C++ unordered associative containers is from 2003: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1456.html
Are you talking about speed compared to antique Google hash map implementations like google::dense_hash_map and google::sparse_hash_map?
No, as compared to the latest Abseil flat_hash_map. You can find some benchmarks here:
https://www.boost.org/doc/libs/1_87_0/libs/unordered/doc/html/unordered.html...
https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/

Joaquin M Lopez Munoz

On Tue, Jan 21, 2025 at 9:08 AM Joaquin M López Muñoz via Boost < boost@lists.boost.org> wrote:
On 20/01/2025 at 22:48, Ivan Matek via Boost wrote:
As for Unordered: My history knowledge is bad, but was not Unordered the library that originally inspired C++11 unordered_map? *First Release 1.36.0*
No, Boost.Unordered first appeared in 2008, whereas the seminal paper for C++ unordered associative containers is from 2003:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1456.html
Are you talking about speed compared to antique Google hash map implementations like google::dense_hash_map and google::sparse_hash_map?
No, as compared to the latest Abseil flat_hash_map. You can find some benchmarks here:
https://www.boost.org/doc/libs/1_87_0/libs/unordered/doc/html/unordered.html... https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/
Joaquin M Lopez Munoz
Thank you for correcting me. I remembered many Boost libraries got merged into C++11, and that probably caused me to mix this up. For those interested in the history, here is a list of libraries that made it into C++11: https://bannalia.blogspot.com/2024/05/wg21-boost-and-ways-of-standardization...

On 21 Jan 2025, at 18:11, Ivan Matek wrote:
I remembered many Boost libraries got merged into C++11, and that probably caused me to mix this up. For those interested in the history, here is a list of libraries that made it into C++11: https://bannalia.blogspot.com/2024/05/wg21-boost-and-ways-of-standardization...
Well, it was me who wrote that article :)

Joaquin M Lopez Munoz

On Mon, Jan 20, 2025 at 4:32 PM Christian Mazakas wrote:
If you actually sit down and look at how long Intel's DFP library has been around, it seems like version 1.0 came out in 2009.
That gives the library a literal 16 years of a head start here.
Just one note about that: the performance won't have changed much since 2009, i.e. the few releases since have been minor fixes: https://www.netlib.org/misc/intel/README.txt

My personal experience started after 2015, so I wouldn't know if the two releases in 2011 had some surprise performance improvements that were undocumented.
But we should also remind ourselves that we're also reviewing the author as well. The question comes down to: do we trust that Matt and Christopher are going to iterate on the implementation?
Personally, I'd say that I trust Matt because I've conversed with him in the past. So that's at least half the battle right there.
For the record, I hope nothing I've said gives anyone the impression that I don't trust Matt or Christopher.
To me, Boost review has boiled down to vibes. Do we trust the vibes that Boost is better off with this library and do we trust that the authors are going to stick around and evolve the library?
In my view, this library solves problems people would want Boost to solve so I'd say I give a recommendation to ACCEPT this library.
I am also out of touch with all the vibes these days. :) Glen

On Mon, Jan 20, 2025 at 12:53 PM Ruben Perez via Boost < boost@lists.boost.org> wrote:
Glen, thanks for pointing this out. In light of this new piece of data, I'm afraid I will be adding a new condition to my ACCEPT vote: the library should be optimized to be within the same order of magnitude of performance as the Intel one.
Disclaimer: Alliance employee
I think it's good that Glen pointed out we need to be benchmarking and comparing ourselves to the current pack leaders but I'm hesitant to make performance targets acceptance criteria.
My concern is comparing a portable library against a fully vertically integrated one. I would expect a vendor library compiled with the same vendor's compiler on the same vendor's hardware to be the fastest available. What about anyone on non-x64 machines?

Is there room for performance improvement in the library? For sure. Will we ever be fastest on x64? Probably not. Are we the fastest on an M1 Mac? To my knowledge, yes.
If you actually sit down and look at how long Intel's DFP library has been around, it seems like version 1.0 came out in 2009.
That gives the library a literal 16 years of a head start here.
Performance is also a moving target and what's more, hyper-optimized code is hard to read for us normies who aren't experts in computational techniques. I think for a Boost review, we should be mindful of the current performance and where it stands. But we should also remind ourselves that we're also reviewing the author as well. The question comes down to: do we trust that Matt and Christopher are going to iterate on the implementation?
Personally, I'd say that I trust Matt because I've conversed with him in the past. So that's at least half the battle right there.
Hash2 has functions much slower than OpenSSL's implementations, which are all these behemoths written using hand-crafted assembly. But we're actively working on closing the gap. And what's more, Unordered didn't always have the fastest hash tables either. Instead, the library slowly evolved over time as new techniques and research became published and used by industry.
From what I can tell, Decimal has well-defined use-cases, its license is more permissive than its competition, it's constexpr, it's C++-first. Saying "just use the C library" always sounds good on paper, but it winds up being a lot lamer than we'd like it to be in practice.
Yes, we seamlessly provide everything you'd expect in a C++ library today. Ivan Matek has even been submitting issues using std::print. That's pretty modern support.
To me, Boost review has boiled down to vibes. Do we trust the vibes that Boost is better off with this library and do we trust that the authors are going to stick around and evolve the library?
In my view, this library solves problems people would want Boost to solve so I'd say I give a recommendation to ACCEPT this library.
This is a no-frills library that doesn't really have much cognitive overhead, so reviewing it is always kind of tricky. But in these cases, I think a fresh coat of paint just on the interface is a good start. And also, the amount of tests is exactly what I'd expect and I think we should remember that the lion's share of the work went into that.
- Christian

> All tests run on an i9-11900k with Ubuntu 24.04 and the Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205). The Intel benchmarks are written in C but should be a faithful port. Bottom line up front is the Intel library is an order of magnitude faster.

> Thank you Matt. This reconciles with what I know about everyone using the Intel library (especially the BSL proclamation about it being 10 times faster). These numbers should be added to the documentation.

> I think there is some work to be done to make Boost.Decimal viable. i.e. We don't want to be an order of magnitude slower than what everyone else is actually using today.

This post caused us to think and we just kicked around a few ideas.

In the intermediate period, it would be difficult to not reach the speed published by the Intel primitive routines.

One idea was to offer the opportunity to wrap a particular backend with a higher-level abstraction, akin to boost::multiprecision::number which can wrap GMP or use our own BSL backends. This would, however, be a new design.

We also noticed the Intel LIB simply used float, double, long double and/or Intel's _Quad to perform elementary and special functions. Fair enough, fast too.

So today we offer:
* BSL licensing
* Portability
* Boost-level testing

At the juncture of this review, I'm not sure if a 10-fold speedup would be achievable. Maybe we would/will find a big bottleneck in the backend integral routines. But I fear table-lookup will always have the advantage.

- Chris

>> I think there is some work to be done to make Boost.Decimal viable. i.e. We don't want to be an order of magnitude slower than what everyone else is actually using today.

> This post caused us to think and we just kicked around a few ideas.

> In the intermediate period, it would be difficult to not reach the speed published by the Intel primitive routines.

Sorry for the silly typo. Difficult to not reach means of course difficult to REACH.

- Chris

> be an order of magnitude slower than what everyone else is actually using today

I won't dispute the stark 10-fold comparison, but will draw attention to the phrase

"... what everyone else is actually using today"

Please recall this is everyone on Intel Arch. And fairly enough, that's a lot, but not everyone.

- Chris

Matt Borland wrote:
Here are some preliminary results:
All tests run on an i9-11900k with Ubuntu 24.04 and the Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205). The Intel benchmarks are written in C but should be a faithful port. Bottom line up front is the Intel library is an order of magnitude faster.
Intel
===== Comparisons =====
Comparisons<Decimal32>: 73831 us
Comparisons<Decimal64>: 76725 us
...
Proposal and built-in types
===== Comparisons =====
comparisons<float>: 73013 us
comparisons<double>: 104019 us
comparisons<…>: 562165 us
comparisons<…>: 592997 us
I took a look at operator<:

https://github.com/cppalliance/decimal/blob/069c26a111b33123e6ce5c34b4bb0077...
https://github.com/cppalliance/decimal/blob/069c26a111b33123e6ce5c34b4bb0077...
https://github.com/cppalliance/decimal/blob/069c26a111b33123e6ce5c34b4bb0077...

and I think it can be improved somewhat - unless I'm missing something subtle.

Assuming normalized numbers (which should be the case) of the same sign, operator< is equivalent (NaN notwithstanding) to comparing the two integers

eeeeeeee... mmmmmmmmmm...

I'm not quite sure whether zero needs to be a special case here. There's special case logic in operator< for it, but I'm not positive it's needed.

This makes me wonder whether { uintNN_t exp_and_mant; bool sign; } wouldn't be a better representation for _fast, although that would depend on how important comparisons are, performance-wise.

We can actually apply a similar trick to BID encoding, by first checking whether the exponents start with something other than 11. In that case, we can use a single compare without having to unpack. (And if not, do something else clever :-) )
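[To make the packed-compare idea concrete, here is a minimal sketch. The field widths, the bias, and the assumption that significands are always scaled to a fixed digit count are illustrative only, not the library's actual representation; zero and NaN handling are omitted, per the caveats above.]

#include <cstdint>

// Hypothetical layout: biased exponent in the top 8 bits, significand in
// the low 56 bits, sign kept separately. Assumes every non-zero value is
// normalized so the significand has a fixed number of decimal digits.
struct dec_fast_sketch
{
    std::uint64_t exp_and_mant; // (exponent << 56) | significand
    bool sign;
};

// Same-sign, non-NaN magnitude comparison is one unsigned integer compare.
inline bool less_magnitude(dec_fast_sketch a, dec_fast_sketch b)
{
    return a.exp_and_mant < b.exp_and_mant;
}

inline bool less(dec_fast_sketch a, dec_fast_sketch b)
{
    if (a.sign != b.sign)
        return a.sign; // negative < positive (zeros need separate care)
    return a.sign ? less_magnitude(b, a)  // both negative: reverse
                  : less_magnitude(a, b); // both positive
}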

Assuming normalized numbers (which should be the case)...
This is not the case for the decimalXX types. IEEE 754 requires proper handling of cohorts (e.g. 1e-1 has a different bit pattern than 10e-2) so we have to normalize them before we can compare. The fast types normalize in the constructor to get rid of this effect.
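[A toy illustration of the cohort problem and why the fast types normalize up front. This sketches one possible normalization convention, scaling every non-zero significand to 16 digits; it is not the library's actual routine, and it assumes the input significand already has at most 16 digits.]

#include <cstdint>
#include <utility>

constexpr std::uint64_t ten_pow_15 = 1000000000000000ULL;

// Collapse a cohort to a single representation by scaling the significand
// into [10^15, 10^16).
constexpr std::pair<std::uint64_t, int> normalize(std::uint64_t sig, int exp)
{
    if (sig == 0)
        return {0, 0}; // collapse the zero cohort too
    while (sig < ten_pow_15)
    {
        sig *= 10;
        --exp;
    }
    return {sig, exp};
}

// 1e-1 and 10e-2 are the same value but different bit patterns; after
// normalization they have the same representation, so bitwise comparison
// becomes meaningful.
static_assert(normalize(1, -1) == normalize(10, -2), "cohort collapsed");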
...of the same sign, operator< is equivalent (NaN notwithstanding) to comparing the two integers
eeeeeeee... mmmmmmmmmm...
I'm not quite sure whether zero needs to be a special case here. There's special case logic in operator< for it, but I'm not positive it's needed.
This makes me wonder whether { uintNN_t exp_and_mant; bool sign; } wouldn't be a better representation for _fast, although that would depend on how important comparisons are, performance-wise.
We can actually apply a similar trick to BID encoding, by first checking whether the exponents start with something other than 11. In that case, we can use a single compare without having to unpack.
(And if not, do something else clever :-) )

Matt Borland wrote:
Assuming normalized numbers (which should be the case)...
This is not the case for the decimalXX types. IEEE 754 requires proper handling of cohorts (e.g. 1e-1 has a different bit pattern than 10e-2) so we have to normalize them before we can compare.
Amazing. This makes the Intel DFP numbers even more impressive.

Some random observations:

https://github.com/cppalliance/decimal/blob/069c26a111b33123e6ce5c34b4bb0077...

The class doesn't need to be marked as EXPORT, because all its members are constexpr (or templated), and hence, nothing is actually exported.

https://github.com/cppalliance/decimal/blob/069c26a111b33123e6ce5c34b4bb0077...

The special inf value is higher than the special NaN values, which causes isnan() to perform a biased compare (https://godbolt.org/z/78P73r986) instead of a straight one (https://godbolt.org/z/ebr3EsxWf). (I've changed the exponent type here to uint16_t to avoid the pessimization identified earlier.)

I'm not quite sure whether zero needs to be a special case here. There's special case logic in operator< for it, but I'm not positive it's needed.
Specifically, here:

https://github.com/cppalliance/decimal/blob/069c26a111b33123e6ce5c34b4bb0077...

it looks like -0.0 is considered less than +0.0, but is this correct? -0 and +0 are equal.

On Tuesday, January 21st, 2025 at 12:50 PM, Peter Dimov via Boost wrote:
I'm not quite sure whether zero needs to be a special case here. There's special case logic in operator< for it, but I'm not positive it's needed.
Specifically,
https://github.com/cppalliance/decimal/blob/069c26a111b33123e6ce5c34b4bb0077...
here it looks like -0.0 is considered less than +0.0, but is this correct? -0 and +0 are equal.
I don't remember why the distinction was made, but I just looked through IEEE 754-2019 and there is no difference in the Decimal case. Funnily enough, I asked Claude 3.5 Sonnet, and it linked me to my own documentation.

Matt

Assuming normalized numbers (which should be the case) of the same sign, operator< is equivalent (NaN notwithstanding) to comparing the two integers
eeeeeeee... mmmmmmmmmm...
I'm not quite sure whether zero needs to be a special case here. There's special case logic in operator< for it, but I'm not positive it's needed.
The current operator< is this: https://godbolt.org/z/Tj7Me5aW3

This is a sketch of what I was thinking about: https://godbolt.org/z/9M1cevYef

I haven't run this against the test suite, so it might not be quite correct, but it looks correct to me. :-)

It assumes that zeroes are normalized. The current implementation seems to not assume that; it handles 0e+7, but since the _fast types are supposed to be normalized, this should never happen, should it?
This makes me wonder whether { uintNN_t exp_and_mant; bool sign; } wouldn't be a better representation for _fast, although that would depend on how important comparisons are, performance-wise.
Should look somewhat like this: https://godbolt.org/z/bqWKxbdjd

On Tuesday, January 21st, 2025 at 2:15 PM, Peter Dimov via Boost wrote:
Assuming normalized numbers (which should be the case) of the same sign,
operator< is equivalent (NaN notwithstanding) to comparing the two integers
eeeeeeee... mmmmmmmmmm...
I'm not quite sure whether zero needs to be a special case here. There's special case logic in operator< for it, but I'm not positive it's needed.
The current operator< is this:
This is a sketch of what I was thinking about:
I haven't run this against the test suite, so it might not be quite correct, but it looks correct to me. :-)
It assumes that zeroes are normalized. The current implementation seems to not assume that; it handles 0e+7, but since the _fast types are supposed to be normalized, this should never happen, should it?
For the fast types everything is normalized. Originally I was just skipping the decoding step by storing in the struct, but then, after profiling, adding normalization to the constructor provided significant gains. I guess I never removed all the old assumptions.
This makes me wonder whether { uintNN_t exp_and_mant; bool sign; } wouldn't be a better representation for _fast, although that would depend on how important comparisons are, performance-wise.
Should look somewhat like this:

The current operator< is this:
https://godbolt.org/z/Tj7Me5aW3
This is a sketch of what I was thinking about:
https://godbolt.org/z/9M1cevYef
I haven't run this against the test suite, so it might not be quite correct, but it looks correct to me. :-)
It assumes that zeroes are normalized. The current implementation seems to not assume that; it handles 0e+7, but since the _fast types are supposed to be normalized, this should never happen, should it?
And here's another option, using bitfields: https://godbolt.org/z/Y7asaMczz GCC seems to like this better because it's able to figure out that the exponent and the significand are adjacent in memory, so the shift+or can be optimized out.
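[For what it's worth, a sketch of that bitfield layout, with illustrative widths. On the usual little-endian ABIs, GCC allocates bit-fields from the low bits up, so the exponent and significand share one 64-bit word and the shift+or that builds the packed key can fold into a single load.]

#include <cstdint>

struct dec_fast_bits
{
    std::uint64_t mant : 56; // significand, low bits
    std::uint64_t exp  : 8;  // biased exponent, directly above it
    bool sign;
};

inline bool less_magnitude(dec_fast_bits a, dec_fast_bits b)
{
    // Conceptually a lexicographic (exp, mant) compare; because the two
    // fields are adjacent in one word, the compiler can turn each side
    // into a single 64-bit load instead of an explicit shift+or.
    const std::uint64_t ka = (std::uint64_t(a.exp) << 56) | a.mant;
    const std::uint64_t kb = (std::uint64_t(b.exp) << 56) | b.mant;
    return ka < kb;
}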
This makes me wonder whether { uintNN_t exp_and_mant; bool sign; } wouldn't be a better representation for _fast, although that would depend on how important comparisons are, performance-wise.
Should look somewhat like this:

For me the interesting thing about comparisons, besides Peter's improvements, is that the benefits of normalization seem to depend on how many times the user does a certain operation.

User does:

auto x = a * b * c * d * e;
x < y;

versus user does:

auto x = a * b;
x < y;

Ivan Matek wrote:
For me the interesting thing about comparisons beside Peter improvements is that benefits from normalization seem to depend on how many times user does certain operation.
user does auto x = a * b * c * d * e; x < y;
Multiplication probably won't suffer much, but addition with different signs might take quite a hit because of normalization of the result. One could in theory engage the type system here and make operator+ for decimal64_fast return decimal64_fast_non_normalized, which will then only be normalized once on the final store to x. Who knows whether this will be worth doing or not, perf-wise.
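[A toy sketch of that idea, under a "strip trailing zeros" normalization convention and with hypothetical names (dec_fast, dec_raw); overflow handling is ignored. It is not the library's design, just an illustration of deferring the normalization to the final store.]

#include <cstdint>

struct dec_raw
{
    std::int64_t sig; // possibly non-normalized
    int exp;
};

struct dec_fast
{
    std::int64_t sig; // invariant: normalized (here: no trailing zeros)
    int exp;

    constexpr dec_fast(std::int64_t s, int e) : sig(s), exp(e) { normalize(); }
    constexpr dec_fast(dec_raw r) : sig(r.sig), exp(r.exp) { normalize(); } // one normalize, at the end

private:
    constexpr void normalize()
    {
        if (sig == 0) { exp = 0; return; }
        while (sig % 10 == 0) { sig /= 10; ++exp; }
    }
};

// Align exponents and add, deliberately skipping normalization.
constexpr dec_raw add_raw(std::int64_t s1, int e1, std::int64_t s2, int e2)
{
    while (e1 > e2) { s1 *= 10; --e1; } // toy alignment, ignores overflow
    while (e2 > e1) { s2 *= 10; --e2; }
    return {s1 + s2, e1};
}

constexpr dec_raw operator+(dec_fast a, dec_fast b) { return add_raw(a.sig, a.exp, b.sig, b.exp); }
constexpr dec_raw operator+(dec_raw a, dec_fast b)  { return add_raw(a.sig, a.exp, b.sig, b.exp); }

// dec_fast x = a + b + c; // two raw additions, a single normalization on the store to x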

On Tue, Jan 21, 2025 at 9:02 PM Peter Dimov wrote:
Ivan Matek wrote:
For me the interesting thing about comparisons beside Peter improvements is that benefits from normalization seem to depend on how many times user does certain operation.
user does auto x = a * b * c * d * e; x < y;
Multiplication probably won't suffer much, but addition with different signs might take quite a hit because of normalization of the result.
One could in theory engage the type system here and make operator+ for decimal64_fast return decimal64_fast_non_normalized, which will then only be normalized once on the final store to x.
Yeah, that would be nice, but as you said, maybe overkill.
Also, we could have an ugly way of doing the multiplication with a function, but again, as you said, without benchmarks I'm not sure it is worth it:
auto x = mult(a, b, c, d, e);
On Tue, Jan 21, 2025 at 8:52 PM Peter Dimov via Boost wrote:
Also, see how much better is_zero becomes.
wow...

Ivan Matek wrote:
On Tue, Jan 21, 2025 at 8:52 PM Peter Dimov via Boost wrote:
Also, see how much better is_zero becomes.
wow...
In fairness the same logic can be applied to the current representation: https://godbolt.org/z/d13nWx8Wq It's two comparisons instead of one for the exponent+significand, but otherwise much the same.

On Tue, Jan 21, 2025 at 8:32 PM Peter Dimov via Boost wrote:
And here's another option
Implementations of the _fast type without expensive decoding of implied bits, and our discussions about ABI, made me wonder... I know the authors currently want to only implement IEEE 754, so this is not directly related to the review of the current library; it is more of a discussion of future potential developments.

Do you think there could be a benefit from additional non-IEEE 754 decimal types? Let's use decimal64 as an example. IEEE 754 does plenty of clever tricks such as implied digits to pack as much data as possible into 64 bits. If we want to go around the performance overhead of decoding that clever encoding, we must use more than 64 bits of data. This has size issues and, as we discussed before, potential function call overhead issues.

Now what if we flip the design restrictions? So instead of asking "How many bytes do we need to represent decimal64 so that operations are fast?", we ask "What is the largest precision/range of values we can put inside 64 bits while making sure operations are fast?" Here we have a tradeoff: are we gonna "steal" from the exponent or the mantissa? So the design is not clear, but that does not matter hugely for the general discussion here.

For interchange this format is bad, because for example a C# application will not know how to decode an invented format. Now you may say that nobody will use this because it is nonstandard (afaik IEEE 754 demands multiples of 32 bits encoded as they prescribe). I think people could be convinced if the speed benefits are there.

Let me elaborate on my speculation: most of the time people do not pick decimal64 because they need the max values/max precision that decimal64 offers. They pick decimal64 because they need values/precision over what decimal32 offers. This is where decimal60 (or whatever the name is) would fit in.
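[A purely hypothetical illustration of the bit-budget tradeoff; these widths are one arbitrary choice, not a proposal, and the exact layout guarantees are ABI-dependent.]

#include <cstdint>

// An unpacked 64-bit layout with no IEEE 754 combination-field tricks:
// every field is directly addressable, at the cost of fewer usable digits
// and a format nothing else can decode.
struct decimal60_sketch
{
    std::uint64_t significand : 53; // up to ~9.0e15, roughly 15 decimal digits
    std::uint64_t exponent    : 10; // biased
    std::uint64_t sign        : 1;
};

static_assert(sizeof(decimal60_sketch) == sizeof(std::uint64_t),
              "same storage as decimal64, different tradeoff");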

This makes me wonder whether { uintNN_t exp_and_mant; bool sign; } wouldn't be a better representation for _fast, although that would depend on how important comparisons are, performance-wise.
Should look somewhat like this:
Here's the current operator==: https://godbolt.org/z/qc54e3hxT

It's not that bad, but (1) it's wrong because it says -0 != +0, and (2) is_zero doesn't optimize well.

Here's operator== when using packed exponent+mantissa: https://godbolt.org/z/sE7f3fWf5

The number of instructions in op== doesn't decrease much, but in practice it should be much faster because of the early reject that will be taken most of the time. Also, see how much better is_zero becomes.
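[Roughly why is_zero collapses: a sketch, assuming an all-zero canonical pattern for zero; NaN/infinity handling omitted.]

#include <cstdint>

struct dec_packed
{
    std::uint64_t exp_and_mant; // (biased exponent << 56) | significand
    bool sign;
};

// With one canonical bit pattern per value, zero testing is a single
// integer compare, ignoring the sign so that -0 == +0.
inline bool is_zero(dec_packed d)
{
    return d.exp_and_mant == 0; // assumes zero normalizes to all-zero bits
}

inline bool equals(dec_packed a, dec_packed b)
{
    if (a.exp_and_mant != b.exp_and_mant)
        return false; // the early reject taken most of the time
    return a.sign == b.sign || is_zero(a); // signs only matter off zero
}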

On Tuesday, January 21st, 2025 at 2:51 PM, Peter Dimov via Boost wrote:
This makes me wonder whether { uintNN_t exp_and_mant; bool sign; }
wouldn't be a better representation for _fast, although that would depend on how important comparisons are, performance-wise.
Should look somewhat like this:
Here's the current operator==:
It's not that bad but (1) it's wrong because it says -0 != +0 and (2) is_zero doesn't optimize well.
Here's operator== when using packed exponent+mantissa:
The number of instructions in op== doesn't decrease much, but in practice it should be much faster because of the early reject that will be taken most of the time.
Also, see how much better is_zero becomes.
This design could be worth investigation. We looked at bitfields originally, but since they require C++20 for constexpr support it wasn't worth pursuing.

Matt Borland wrote:
This design could be worth investigation. We looked at bitfields originally, but since they require C++20 for constexpr support it wasn't worth pursuing.

It seems only the member initializer is C++20, which can easily be avoided: https://godbolt.org/z/5v6sYaMj4
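[For reference, the C++20-only part is the default member initializer on a bit-field; initializing through a constexpr constructor works in C++14. A minimal sketch:]

#include <cstdint>

struct packed
{
    std::uint64_t mant : 56;
    std::uint64_t exp  : 8;

    // No default member initializers (C++20 for bit-fields); initialize
    // here instead, which is fine from C++14 on.
    constexpr packed(std::uint64_t m, std::uint64_t e) : mant(m), exp(e) {}
};

static_assert(packed(42, 3).exp == 3, "usable in constant expressions pre-C++20");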
participants (12)

- Alexander Grund
- Andrey Semashev
- Christian Mazakas
- Christopher Kormanyos
- Glen Fernandes
- Ivan Matek
- Joaquin M López Muñoz
- Joaquín M López Muñoz
- Matt Borland
- Peter Dimov
- Ruben Perez
- Vinnie Falco