[multiprecision] Getting the best performance on windows for uint128_t etc

Neill Clift

11 Jul 2019 11 Jul '19

5:04 p.m.

Hi, I develop a program for generating optimal addition chains. I normally use 64 bit integers but in order to attack some conjectures I compile up my program with boost using uint128_t. In order to get some functionality I need I used the backend access to limbs. For example I need population count/hamming weight/binary digit count. I did this like this: static LPTYPER bits(NTYPE n) /*++ Calculates the number of 1 bits in a number. --*/ { LPTYPER b = 0; std::size_t size = n.backend().size(); boost::multiprecision::limb_type* p = n.backend().limbs(); for (std::size_t i = 0; i < size; i++) { b += (LPTYPER)__popcnt64(p[i]); } return b; } I use the same sort of logic to calculate Floor(Log2(n)) (number of largest bit set), to clear the bottom say b bits of an integer. Is this a reasonable (though probably not idea) way to go? I noted that the limbs are 32 bits on Windows and this leads me to my second question. How can I get the internals of boost using bigger chunks like 64 bit? It seems I need a compiler supporting __int64. I have access to the intel compiler (19) and MSVC but neither seem to change this. I tried LLVM but that seemed quite slow. I may not have put enough effort into investigating LLVM up to this point. Various bit operations seem overly slow in boost. For example I use sequences like this: Control &= Control + 1; Mask = ((Control - 1) ^ Control) >> 1; Control |= Mask; These seem to be bottlenecks under boost. That sequence above is meant to generate masks that straddle the first run of zero bits in control so 1010100000111111_2 produces a mask of 11111111111_2. Thanks. Neill. Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

Attachments:

attachment.html (text/html — 5.2 KB)

Show replies by date

degski

12 Jul 12 Jul

4:38 a.m.

On Thu, 11 Jul 2019 at 22:15, Neill Clift via Boost-users < boost-users@lists.boost.org> wrote:

...

I use the same sort of logic to calculate Floor(Log2(n)) (number of largest bit set), to clear the bottom say b bits of an integer.

Is this a reasonable (though probably not idea) way to go?

Iff you're gonna fiddle with the limbs anyway, why not just set the bits to zero, directly.

...

I noted that the limbs are 32 bits on Windows and this leads me to my second question. How can I get the internals of boost using bigger chunks like 64 bit?

You'll have to compile the code for 64-bits (x64), then you'll get 64-bit limbs. It seems I need a compiler supporting __int64.

...

You should use the std-types std::int64_t and std::uint64_t (#include <cstdint>). I tried LLVM but that seemed quite slow. I may not have put enough effort

...

into investigating LLVM up to this point.

Clang/LLVM (and gcc) has a builtin type (in 64-bit mode only) as an extension __uint128_t (pass (-Xclang) -fforce-enable-int128). degski -- @realdegski https://edition.cnn.com/interactive/2019/06/middleeast/saudi-teen-death-pena... "Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding

Neill Clift

5:15 a.m.

...

Iff you're gonna fiddle with the limbs anyway, why not just set the bits to zero, directly.

I am unsure what you mean by this. For clearing the bottom b set bits in a value I do assign them to zero unless I won’t be clearing them all. Then I use the packed deposit instruction of x86. This is a bit complicated but it has a huge effect on performance for the 64 bit code.

...

You'll have to compile the code for 64-bits (x64), then you'll get 64-bit limbs.

I am compiling with 64 bit and I get 32 bit limbs. It seems I need a compiler supporting __int64.

...

You should use the std-types std::int64_t and std::uint64_t (#include <cstdint>).

Sorry I misspoke here. My reading suggests the compiler needs to support __int128 to get 64 bit limbs. I compile normally as 64 bit and use 64 bit unsigned ints.

...

Clang/LLVM (and gcc) has a builtin type (in 64-bit mode only) as an extension __uint128_t (pass (-Xclang) -fforce-enable-int128).

I’ll give that a go thanks. Neill.

John Maddock

7:38 a.m.

On 11/07/2019 18:04, Neill Clift via Boost-users wrote:

...

Hi,

I develop a program for generating optimal addition chains. I normally use 64 bit integers but in order to attack some conjectures I compile up my program with boost using uint128_t.

In order to get some functionality I need I used the backend access to limbs. For example I need population count/hamming weight/binary digit count. I did this like this:

static

LPTYPER

bits(NTYPE n)

/*++

Calculates the number of 1 bits in a number.

--*/

{

                LPTYPER b = 0;

                std::size_t size = n.backend().size();

boost::multiprecision::limb_type* p = n.backend().limbs();

                for (std::size_t i = 0; i < size; i++) {

                                b += (LPTYPER)__popcnt64(p[i]);

                }

                return b;

}

If you're writing platform/compiler specific code anyway, you could rely on the fact that uint128_t will have an even number of limbs, and reinterpret_cast the limb pointer to a pointer to uint64_t's and have things still work - for performance reasons multiprecision does this internally when initializing from a type twice the width of a limb. There are also typedefs: boost::multiprecision::detail::limb_type, and boost::multiprecision::detail::double_limb_type. Which could help here. Aside: adding a native population count function might well be a useful addition to the library...

...

I use the same sort of logic to calculate Floor(Log2(n)) (number of largest bit set), to clear the bottom say b bits of an integer.

Is this a reasonable (though probably not idea) way to go?

I noted that the limbs are 32 bits on Windows and this leads me to my second question. How can I get the internals of boost using bigger chunks like 64 bit?

The limb size is always half the width of the largest compiler supported integer type - this is a necessary condition to implement arithmetic entirely within the language. So on GCC we have __int128 and so the limb size is 64-bits, but on msvc there's nothing wider than a 64 bit integer type, so the limbs are 32 bits. If clang-win can have __int128 support enabled, then defining BOOST_HAS_INT128 will activate it's use and you'll get 64-bit limbs in general, though in this specific case boost::uint128_t will become a thin wrapper around unsigned __int128 (ie will have only one limb). Of course in this specific case, if you have unsigned __int128 then you don't really need Boost.Multiprecision unless it's to ensure portability of the program to platforms without that type.

...

It seems I need a compiler supporting __int64. I have access to the intel compiler (19) and MSVC but neither seem to change this. I tried LLVM but that seemed quite slow. I may not have put enough effort into investigating LLVM up to this point.

Various bit operations seem overly slow in boost. For example I use sequences like this:

                                Control &= Control + 1;

                                Mask = ((Control - 1) ^ Control) >> 1;

                                Control |= Mask;

These seem to be bottlenecks under boost. That sequence above is meant to generate masks that straddle the first run of zero bits in control so 1010100000111111_2 produces a mask of 11111111111_2.

Well I'm always open to suggestions for performance improvements, but there's not much low hanging fruit in the bitwise operations which are pretty simple on the whole. BTW I'm not sure they help here, but there are a few operations like lsb/msb which use compiler intrinsics on msvc should they help. HTH, John.

...

Thanks.

Neill.

Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org https://lists.boost.org/mailman/listinfo.cgi/boost-users

--- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus

degski

8:12 a.m.

On Fri, 12 Jul 2019 at 10:38, John Maddock via Boost-users < boost-users@lists.boost.org> wrote:

...

Aside: adding a native population count function might well be a useful addition to the library...

Native clz/ctz (_BitScanReverse/_BitScanForward) might be useful as well.

...

The limb size is always half the width of the largest compiler supported integer type - this is a necessary condition to implement arithmetic entirely within the language.

This seems a high price to pay [I guess the 'within the language' is key here]. At some point in time [when 64-bit started to be a general thing] there was some info regarding this on gmplib.org, which basically stated that 64-bit limbs [on x64] were 4 times faster than 32-bit limbs. GMP (MPIR) does 64-bit limbs on Windows, no problem [but yes it's C]. If clang-win can have __int128 support enabled clang-cl can/will have __int128 support enabled by passing it '-Xclang -fforce-enable-int128', i.e. pass '-fforce-enable-int128' to clang++. degski -- @realdegski https://edition.cnn.com/interactive/2019/06/middleeast/saudi-teen-death-pena... "Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding

John Maddock

10:21 a.m.

On 12/07/2019 09:12, degski via Boost-users wrote:

...

On Fri, 12 Jul 2019 at 10:38, John Maddock via Boost-users <boost-users@lists.boost.org <mailto:boost-users@lists.boost.org>> wrote:

Aside: adding a native population count function might well be a useful addition to the library...

Native clz/ctz (_BitScanReverse/_BitScanForward) might be useful as well.

That's done as msb/lsb (most/least significant bit).

...

The limb size is always half the width of the largest compiler supported integer type - this is a necessary condition to implement arithmetic entirely within the language.

This seems a high price to pay [I guess the 'within the language' is key here]. At some point in time [when 64-bit started to be a general thing] there was some info regarding this on gmplib.org <http://gmplib.org>, which basically stated that 64-bit limbs [on x64] were 4 times faster than 32-bit limbs. GMP (MPIR) does 64-bit limbs on Windows, no problem [but yes it's C].

That's only possible with assembly level add/multiply/divide. Ah OK, just looked at MPIR on win64 with the mpir_gc backend, and they do use 64-bit limbs, but rely on GNU's longlong.h for add/subtract/multiply which appears not to use inline assembly for msvc, but instead breaks the inputs into high/low parts and then does schoolbook multplication. I would expect that to be quite a bit slower than just using 32-bit limbs in the first place, but no doubt you would make gains on the bitwise operations, likewise if you're using an assembly level backend. Best, John.

...

If clang-win can have __int128 support enabled

clang-cl can/will have __int128 support enabled by passing it '-Xclang -fforce-enable-int128', i.e. pass '-fforce-enable-int128' to clang++.

degski -- @realdegski https://edition.cnn.com/interactive/2019/06/middleeast/saudi-teen-death-pena... "Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding

_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org https://lists.boost.org/mailman/listinfo.cgi/boost-users

--- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus

degski

11:01 a.m.

On Fri, 12 Jul 2019 at 13:21, John Maddock via Boost-users < boost-users@lists.boost.org> wrote:

...

That's only possible with assembly level add/multiply/divide.

Right. Ah OK, just looked at MPIR on win64 with the mpir_gc backend, and they

...

do use 64-bit limbs, but rely on GNU's longlong.h for add/subtract/multiply which appears not to use inline assembly for msvc, but instead breaks the inputs into high/low parts and then does schoolbook multplication.

The mpir_gc ([g]eneric [c]) backend will by definition not use assembly, one needs to build something a bit closer to one's hardware (run the mpir_config.py script [in the msvc sub-folder] to configure, generate solution files and build) to get any performance. It requires yasm as assembler [installed and properly integrated into VS20XX]. Altogether it's neither obvious nor trivial to get something [a binary] that works well. degski -- @realdegski https://edition.cnn.com/interactive/2019/06/middleeast/saudi-teen-death-pena... "Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding

Stian Zeljko Vrba

14 Jul 14 Jul

9:23 a.m.

...

That's only possible with assembly level add/multiply/divide.

For the record, MSVC has compiler intrinsics for double-wide arithmetics, both signed and unsigned. For example: https://docs.microsoft.com/en-us/cpp/intrinsics/mul128?view=vs-2019 -- Stian ________________________________ From: Boost-users <boost-users-bounces@lists.boost.org> on behalf of John Maddock via Boost-users <boost-users@lists.boost.org> Sent: Friday, July 12, 2019 12:21:30 PM To: degski via Boost-users Cc: John Maddock Subject: Re: [Boost-users] [multiprecision] Getting the best performance on windows for uint128_t etc On 12/07/2019 09:12, degski via Boost-users wrote:

...

On Fri, 12 Jul 2019 at 10:38, John Maddock via Boost-users <boost-users@lists.boost.org <mailto:boost-users@lists.boost.org>> wrote:

Aside: adding a native population count function might well be a useful addition to the library...

Native clz/ctz (_BitScanReverse/_BitScanForward) might be useful as well.

That's done as msb/lsb (most/least significant bit).

...

The limb size is always half the width of the largest compiler supported integer type - this is a necessary condition to implement arithmetic entirely within the language.

This seems a high price to pay [I guess the 'within the language' is key here]. At some point in time [when 64-bit started to be a general thing] there was some info regarding this on gmplib.org <http://gmplib.org>, which basically stated that 64-bit limbs [on x64] were 4 times faster than 32-bit limbs. GMP (MPIR) does 64-bit limbs on Windows, no problem [but yes it's C].

...

If clang-win can have __int128 support enabled

clang-cl can/will have __int128 support enabled by passing it '-Xclang -fforce-enable-int128', i.e. pass '-fforce-enable-int128' to clang++.

degski -- @realdegski https://edition.cnn.com/interactive/2019/06/middleeast/saudi-teen-death-pena... "Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding

_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org https://lists.boost.org/mailman/listinfo.cgi/boost-users

--- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org https://lists.boost.org/mailman/listinfo.cgi/boost-users

2196

Age (days ago)

2199

Last active (days ago)

List overview

Download

7 comments

4 participants

participants (4)

degski
John Maddock
Neill Clift
Stian Zeljko Vrba