Regarding a comparison to Intel's implementation (BID-based), which is used both in GCC's decimal64 and by Bloomberg: that's especially important because it uses some impressively large tables to get its performance (which also bloats binaries).
So far we don't rely on giant tables. Chris added a QEMU-emulated STM board to our CI that checks our ROM usage, among other things. I'll work on some benchmarks and see if it's worth building out the above non-IEEE 754 decimal32.
I have added benchmarks for the GCC built-in _Decimal32, _Decimal64, and _Decimal128. For the operations add, sub, mul, and div, the geometric mean of the runtime ratios (boost.decimal runtime / GCC runtime) is:

decimal32: 0.932
decimal64: 1.750
decimal128: 4.837

It's interesting that for every operation GCC's _Decimal64 is faster than its _Decimal32, whereas our run time increases with size. In any event, I should be able to reuse all of the existing boost::decimal::decimal32 implementations of the basic operations (since they already benchmark faster than the reference) in a class that directly stores the sign, exp, and significand, to see if it's noticeably faster.

Matt
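For anyone who wants to reproduce the summary numbers, here is a minimal sketch of how a geometric mean of runtime ratios can be computed. The function name `geometric_mean` is made up for this illustration and is not taken from the benchmark code. It sums logarithms rather than multiplying the ratios directly, which is a common way to avoid overflow or underflow in the product.

```cpp
#include <cmath>
#include <vector>

// Illustrative helper (hypothetical name, not from the benchmark suite).
// Each entry is one operation's ratio: boost.decimal runtime / GCC runtime.
// Assumes a non-empty vector of strictly positive ratios.
inline double geometric_mean(const std::vector<double>& ratios)
{
    double log_sum = 0.0;
    for (double r : ratios)
    {
        log_sum += std::log(r);  // log of the product = sum of the logs
    }
    // exp(mean of logs) equals the n-th root of the product
    return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```

With the ratio defined as boost.decimal runtime / GCC runtime, a result below 1 means boost.decimal was faster on balance across the operations, and a result above 1 means GCC was faster.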
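To make the "directly stores the sign, exp, and significand" idea concrete, here is a rough sketch of what such a type could look like. Everything in it is an assumption for illustration: the name `fast_decimal32`, the field widths, and the `multiply` function are hypothetical and are not the library's actual design. It truncates instead of applying IEEE 754 rounding, and it ignores NaN, infinity, and exponent overflow or underflow.

```cpp
#include <cstdint>

// Hypothetical unpacked layout: the fields are stored as-is, so no BID
// decoding is needed before arithmetic and no re-encoding afterwards.
// Represents (-1)^negative * significand * 10^exponent.
struct fast_decimal32
{
    std::uint32_t significand;  // kept to at most 7 decimal digits
    std::int16_t  exponent;     // unbiased power of ten
    bool          negative;
};

// Largest 7-digit significand, matching decimal32's precision.
constexpr std::uint32_t max_significand = 9999999u;

// Sketch of multiplication: multiply significands in 64 bits, add the
// exponents, and combine the signs. Excess low digits are truncated
// (rounded toward zero), which is a simplification for this sketch.
inline fast_decimal32 multiply(fast_decimal32 a, fast_decimal32 b)
{
    std::uint64_t sig = static_cast<std::uint64_t>(a.significand) * b.significand;
    int exp = a.exponent + b.exponent;

    // Drop low-order digits until the result fits in 7 digits again.
    while (sig > max_significand)
    {
        sig /= 10;
        ++exp;
    }

    return fast_decimal32{static_cast<std::uint32_t>(sig),
                          static_cast<std::int16_t>(exp),
                          a.negative != b.negative};
}
```

The trade-off in a layout like this is size: the struct is larger than the 32-bit packed encoding, in exchange for skipping the decode and encode steps around each operation. Whether that is a net win is exactly what the benchmarks would have to show.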