unordered_map 32bit VS 64bit

Hi, I'm porting a project from Linux 32-bit to Linux 64-bit. While doing so I noticed a dramatic performance degradation with unordered_map. My platform is Ubuntu 10.04.3 64-bit running on an Intel Xeon X3220, but I tested a similar program on Windows XP SP2 64-bit running on an Intel i5 and I get the same degradation. My Boost version is 1_46_1.

I use the following program to test unordered_map - hope I didn't make any copy mistakes :-).

#include <stdio.h>
#include <sys/time.h>
#include <boost/cstdint.hpp>
#include <boost/unordered_map.hpp>

int main(int argc, char *argv[])
{
    struct timeval tv_before, tv_after;
    boost::unordered_map<boost::int32_t, boost::int32_t> boost_map;

    boost_map[1] = 2;
    boost_map[3] = 4;

    gettimeofday(&tv_before, NULL);
    for (int i = 0; i < 100000; i++)
        boost_map.find(1);
    gettimeofday(&tv_after, NULL);

    printf("Boost time is %f \n",
           (double)(((tv_after.tv_sec - tv_before.tv_sec) * 1000000 +
                     (double)(tv_after.tv_usec - tv_before.tv_usec)) / 100000));
    return 0;
}

I use the following compile commands:
64-bit: g++ -O3 perf_test.cpp -o perf_test_64bit
32-bit: g++ -m32 -O3 perf_test.cpp -o perf_test_32bit

When I run the 32-bit program I get 0.5 sec. When I run the 64-bit program I get 1.54 sec.

Any ideas?

Avikam Bara-nes wrote:
I'm porting a project from Linux 32-bit to Linux 64-bit. While doing so I noticed a dramatic performance degradation with unordered_map. My platform is Ubuntu 10.04.3 64-bit running on an Intel Xeon X3220, but I tested a similar program on Windows XP SP2 64-bit running on an Intel i5 and I get the same degradation. My Boost version is 1_46_1.
I presume that you've ensured that there's sufficient memory available for the 64b process? That is, it isn't suffering from swapping, is it?
I use the following program to test unordered_map - hope I didn't make any copy mistakes :-).
#include <stdio.h>
#include <sys/time.h>
#include <boost/cstdint.hpp>
#include <boost/unordered_map.hpp>

int main(int argc, char *argv[])
{
    struct timeval tv_before, tv_after;
    boost::unordered_map<boost::int32_t, boost::int32_t> boost_map;
Out of curiosity, what happens if you use int64_t instead of int32_t?
I use the following compile commands:
64-bit: g++ -O3 perf_test.cpp -o perf_test_64bit
32-bit: g++ -m32 -O3 perf_test.cpp -o perf_test_32bit
When I run the 32-bit program I get 0.5 sec. When I run the 64-bit program I get 1.54 sec.
Have you tried -O2 or -O1 to see if they make any difference?

Rob Stewart
Software Engineer, Dev Tools & Components
Susquehanna International Group, LLP
http://www.sig.com

On 25 January 2012 12:33, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Avikam Bara-nes wrote:
I'm porting a project from Linux 32-bit to Linux 64-bit. While doing so I noticed a dramatic performance degradation with unordered_map. My platform is Ubuntu 10.04.3 64-bit running on an Intel Xeon X3220, but I tested a similar program on Windows XP SP2 64-bit running on an Intel i5 and I get the same degradation. My Boost version is 1_46_1.
I presume that you've ensured that there's sufficient memory available for the 64b process? That is, it isn't suffering from swapping, is it?
FWIW I had a quick look at this earlier, and it appears to be mainly caused by a check to see if the container is empty before searching (after removing it, the benchmark was about the same for 32 and 64 bit). The check is there because an empty unordered_map doesn't allocate any memory (a feature request from a while back). I'm a bit surprised that it's so costly, considering the other things that find has to do. I suspect it's something to do with the optimizer. I'll look into it in more detail later.
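In outline, the shape of find() under discussion looks something like this (a simplified, self-contained sketch for illustration only, not the actual Boost.Unordered source; the struct and member names are made up):

#include <cstddef>

struct node { int key; int value; node* next; };

struct sketch_map
{
    std::size_t size_;
    std::size_t bucket_count_;
    node**      buckets_;   // stays NULL until the first insert, so...

    node* find(int k, std::size_t hash_value) const
    {
        if (size_ == 0)     // ...this check must run before touching the buckets
            return 0;
        node* n = buckets_[hash_value % bucket_count_];  // modulus: a 64-bit div on x86-64
        while (n && n->key != k)
            n = n->next;
        return n;
    }
};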

I used both gcc 4.4.3 and 4.6.2 and got the same results.

On Jan 25, 2012 4:56 PM, "Olaf van der Spek" <ml@vdspek.org> wrote:
On Wed, Jan 25, 2012 at 8:52 AM, Avikam Bara-nes <avikamb@gmail.com> wrote:
Any ideas ???
Why doesn't the optimizer prune the entire loop? What compiler version are you running?
-- Olaf

Avikam Bara-nes wrote:
I'm porting a project from Linux 32-bit to Linux 64-bit. While doing so I noticed a dramatic performance degradation with unordered_map. [benchmark program quoted in full above]
When I run the 32-bit program I get 0.5 sec. When I run the 64-bit program I get 1.54 sec. Any ideas?
My guess would be that there is a 64-bit division for transforming the hash into a bucket index, which is killing performance (though one would wonder why there are no peephole optimizations for instructions like MOD(1, x) in the codegen). You could try this out by changing the bucket count and hash code to uint32_t in the 64-bit build. Most compilers will transform a division by a constant into a multiplication. Maybe some toolchains would replace a division by the bucket count with a MUL when doing value-range PGO, if the numbers are reasonably stable (i.e. the bucket count grows geometrically with a factor of ~2 or so, so that there is only a handful of divisors).
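A quick way to see the division cost in isolation is a micro-benchmark in the style of the original test. The sketch below is my own throwaway code, not from the thread; the volatile divisor is there so the compiler cannot strength-reduce the division to a multiply:

#include <stdio.h>
#include <sys/time.h>
#include <boost/cstdint.hpp>

int main()
{
    struct timeval before, after;
    volatile boost::uint32_t div32 = 53;   // runtime divisor, like a small prime bucket count
    volatile boost::uint64_t div64 = 53;
    boost::uint32_t sum32 = 0;
    boost::uint64_t sum64 = 0;
    const long iterations = 100000000;
    long i;

    gettimeofday(&before, NULL);
    for (i = 0; i < iterations; i++)
        sum32 += (boost::uint32_t)i % div32;
    gettimeofday(&after, NULL);
    printf("32-bit modulus: %ld us (checksum %u)\n",
           (after.tv_sec - before.tv_sec) * 1000000L + (after.tv_usec - before.tv_usec),
           (unsigned)sum32);

    gettimeofday(&before, NULL);
    for (i = 0; i < iterations; i++)
        sum64 += (boost::uint64_t)i % div64;
    gettimeofday(&after, NULL);
    printf("64-bit modulus: %ld us (checksum %llu)\n",
           (after.tv_sec - before.tv_sec) * 1000000L + (after.tv_usec - before.tv_usec),
           (unsigned long long)sum64);
    return 0;
}

Built the same way as the original program (with and without -m32), the 64-bit loop should come out noticeably slower, since DIV on 64-bit operands has considerably higher latency than on 32-bit operands on this class of hardware.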

On 25 January 2012 19:05, Grund, Holger (ISGT) <Holger.Grund@morganstanley.com> wrote:
My guess would be that there is a 64-bit division for transforming the hash into a bucket index, which is killing performance (though one would wonder why there are no peephole optimizations for instructions like MOD(1, x) in the codegen).
You're right. I tried removing all the modulus calculations from the container (fine in this case since it's only ever looking for hash values < 4 - I'm also using a modified benchmark which looks up i % 4 rather than 1, to stop things getting optimised away), and the difference became much smaller. I didn't realise how expensive they are on 64-bit computers. I think I might change the design of the container to reduce the number of times they're used, although that will make some other things a bit slower.
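For reference, the modified timing loop is along these lines (my reconstruction of the change described above, not the exact code used):

    gettimeofday(&tv_before, NULL);
    for (int i = 0; i < 100000; i++)
        boost_map.find(i % 4);   // vary the key so the lookup can't be hoisted out of the loop
    gettimeofday(&tv_after, NULL);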

On Thu, Jan 26, 2012 at 3:35 AM, Daniel James <dnljms@gmail.com> wrote:
On 25 January 2012 19:05, Grund, Holger (ISGT) <Holger.Grund@morganstanley.com> wrote:
My guess would be that there is a 64-bit division for transforming the hash into a bucket index, which is killing performance (though one would wonder why there are no peephole optimizations for instructions like MOD(1, x) in the codegen).
You're right. I tried removing all the modulus calculations from the container (fine in this case since it's only ever looking for hash values < 4 - I'm also using a modified benchmark which looks up i % 4 rather than 1, to stop things getting optimised away), and the difference became much smaller. I didn't realise how expensive they are on 64-bit computers. I think I might change the design of the container to reduce the number of times they're used, although that will make some other things a bit slower.
Is it possible to always use a power-of-2 number of buckets and bitwise operations instead of modulus division?

On 26 January 2012 07:11, Andrey Semashev <andrey.semashev@gmail.com> wrote:
Is it possible to always use a power-of-2 number of buckets and bitwise operations instead of modulus division?
Since the container can use arbitrary hash functions, and the standard's quality requirements for hash functions aren't very high, it's problematic. It might be possible to use a mix function on the hash value and then use a power of 2, although that adds to the platform specific issues, which I've been trying to avoid - I've had enough problems dealing with varying compilers. I'll have to look into how much faster it is (if at all). I originally developed this on 32 bit machines and it didn't seem worth it there.
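For illustration, the mix-then-mask scheme would look roughly like this. This is only a sketch: the mixer is the well-known 64-bit finalizer from MurmurHash3, the function names are made up, and none of it is what Boost.Unordered currently does:

#include <cstddef>
#include <boost/cstdint.hpp>

// Scramble the hash so that even an identity hash spreads over the low bits.
inline boost::uint64_t mix64(boost::uint64_t h)
{
    h ^= h >> 33; h *= 0xff51afd7ed558ccdull;
    h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ull;
    h ^= h >> 33;
    return h;
}

// With a power-of-two bucket count, the bucket index is a mask, not a division.
inline std::size_t bucket_index(std::size_t hash_value, std::size_t bucket_count)
{
    return (std::size_t)(mix64(hash_value) & (bucket_count - 1));
}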

On Thu, Jan 26, 2012 at 11:29 AM, Daniel James <dnljms@gmail.com> wrote:
On 26 January 2012 07:11, Andrey Semashev <andrey.semashev@gmail.com> wrote:
Is it possible to always use a power-of-2 number of buckets and bitwise operations instead of modulus division?
Since the container can use arbitrary hash functions, and the standard's quality requirements for hash functions aren't very high, it's problematic. It might be possible to use a mix function on the hash value and then use a power of 2, although that adds to the platform specific issues, which I've been trying to avoid - I've had enough problems dealing with varying compilers. I'll have to look into how much faster it is (if at all). I originally developed this on 32 bit machines and it didn't seem worth it there.
I think the problem of insufficient quality of a hash function should not be solved by the container itself. Those users who provide good hash functions will just waste CPU cycles if additional hash value mixing is done within the container. IMHO, the best way to address this problem is to provide a set of "good" hash functions for common types (I believe functional/hash already does this) and possibly a wrapper function that just does bit mixing in a user-provided hash function. Boost.Unordered docs should mention these tools and advise to use them to get better performance.
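Something along these lines, perhaps - a hypothetical sketch of the wrapper idea (the class name and the mixing constant are illustrative only; nothing like this currently exists in Boost):

#include <cstddef>
#include <boost/functional/hash.hpp>
#include <boost/unordered_map.hpp>

// Adapts any hash functor by scrambling its result before the container uses it.
template <class Hash>
struct mixed_hash
{
    Hash hash;
    template <class T>
    std::size_t operator()(T const& x) const
    {
        std::size_t h = hash(x);
        h ^= h >> 16;
        h *= (std::size_t)0x9e3779b97f4a7c15ull;   // truncated on 32-bit, still an odd constant
        h ^= h >> 16;
        return h;
    }
};

// Hypothetical usage:
// boost::unordered_map<int, int, mixed_hash< boost::hash<int> > > map;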

On 26 January 2012 07:51, Andrey Semashev <andrey.semashev@gmail.com> wrote:
I think the problem of insufficient quality of a hash function should not be solved by the container itself. Those users who provide good hash functions will just waste CPU cycles if additional hash value mixing is done within the container.
That's really an issue for the standard (a cop out I know, but I don't want to get into this debate).
IMHO, the best way to address this problem is to provide a set of "good" hash functions for common types (I believe functional/hash already does this) and possibly a wrapper function that just does bit mixing in a user-provided hash function. Boost.Unordered docs should mention these tools and advise to use them to get better performance.
Actually functional/hash doesn't. It's good enough for the standard, but no better. For numbers that fit into the hash value, it just returns them unchanged which is fine for a prime number of buckets but not for power of 2 containers.
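To make that concrete, a quick check of the identity behaviour (assuming the Boost versions discussed in this thread; positive integers that fit into std::size_t come back as themselves):

#include <cassert>
#include <boost/functional/hash.hpp>

int main()
{
    boost::hash<int> h;
    // The low bits of the hash are just the low bits of the value, which is
    // harmless with a prime bucket count but poor with a power-of-two one.
    assert(h(1) == 1u);
    assert(h(12345) == 12345u);
    return 0;
}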

On Thu, Jan 26, 2012 at 12:16 PM, Daniel James <dnljms@gmail.com> wrote:
On 26 January 2012 07:51, Andrey Semashev <andrey.semashev@gmail.com> wrote:
IMHO, the best way to address this problem is to provide a set of "good" hash functions for common types (I believe functional/hash already does this) and possibly a wrapper function that just does bit mixing in a user-provided hash function. Boost.Unordered docs should mention these tools and advise to use them to get better performance.
Actually functional/hash doesn't. It's good enough for the standard, but no better. For numbers that fit into the hash value, it just returns them unchanged which is fine for a prime number of buckets but not for power of 2 containers.
Well, we probably better fix functional/hash then?

On 26 January 2012 11:16, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On Thu, Jan 26, 2012 at 12:16 PM, Daniel James <dnljms@gmail.com> wrote:
Actually functional/hash doesn't. It's good enough for the standard, but no better. For numbers that fit into the hash value, it just returns them unchanged which is fine for a prime number of buckets but not for power of 2 containers.
Well, we probably better fix functional/hash then?
No, it's deliberate. It meets the standard's requirements. No more, no less.

On Thu, Jan 26, 2012 at 3:29 PM, Daniel James <dnljms@gmail.com> wrote:
On 26 January 2012 11:16, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On Thu, Jan 26, 2012 at 12:16 PM, Daniel James <dnljms@gmail.com> wrote:
Actually functional/hash doesn't. It's good enough for the standard, but no better. For numbers that fit into the hash value, it just returns them unchanged which is fine for a prime number of buckets but not for power of 2 containers.
Well, we probably better fix functional/hash then?
No, it's deliberate. It meets the standard's requirements. No more, no less.
I don't see why it shouldn't do better than required by the standard. It's quite normal for Boost components to extend the standard and provide superior solutions. Anyway, if existing functional/hash functions are not suitable for the task, we can add the bit mixing wrapper and recommend its usage with Boost.Unordered.

On 26 January 2012 12:46, Andrey Semashev <andrey.semashev@gmail.com> wrote:
I don't see why it shouldn't do better than required by the standard. It's quite normal for Boost components to extend the standard and provide superior solutions.
A few emails ago you were worried about wasted cycles.
Anyway, if existing functional/hash functions are not suitable for the task, we can add the bit mixing wrapper and recommend its usage with Boost.Unordered.
boost::hash should work well with std::unordered_map as it's defined in the standard and boost::unordered_map should work well with std::hash as it's defined in the standard. If that isn't the case then they're failures. These are generic components, they should just work without a wrapper or any such fuss.

Hi, sorry but I'm a bit confused.

1. What should I do?
2. Is there a solution?
3. Is there a workaround?

Thanks in advance.

On Thu, Jan 26, 2012 at 6:26 PM, Daniel James <dnljms@gmail.com> wrote:
On 26 January 2012 12:46, Andrey Semashev <andrey.semashev@gmail.com> wrote:
I don't see why it shouldn't do better than required by the standard. It's quite normal for Boost components to extend the standard and provide superior solutions.
A few emails ago you were worried about wasted cycles.
Anyway, if existing functional/hash functions are not suitable for the task, we can add the bit mixing wrapper and recommend its usage with Boost.Unordered.
boost::hash should work well with std::unordered_map as it's defined in the standard and boost::unordered_map should work well with std::hash as it's defined in the standard. If that isn't the case then they're failures. These are generic components, they should just work without a wrapper or any such fuss.
-- Avikam Bara-Nes

On 6 February 2012 09:10, Avikam Bara-nes <avikamb@gmail.com> wrote:
Hi, sorry but I'm a bit confused.

1. What should I do?
2. Is there a solution?
3. Is there a workaround?
I'm going to try to do something about this for 1.50 (i.e. to be released in 3 or 4 months). I tried gcc 4.7's implementation and it's pretty similar to boost 1.46, a tad faster at lookup. Your benchmark might perform better but it wouldn't if you always look up different values (as you would in real code). I don't know about libc++ or Visual C++'s implementation. If you can't wait for 1.50 then your best bet is to use something like google's dense hash containers, and try to make sure your hash function works well with them. Or if you feel like being brave, I'll add the new version to trunk around the time 1.49 is released.
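For what it's worth, the dense hash container mentioned above is used roughly like this - a sketch from memory of the google-sparsehash API, so check its documentation; in particular it requires you to designate a key value that will never be inserted:

#include <google/dense_hash_map>

int main()
{
    google::dense_hash_map<int, int> map;
    map.set_empty_key(-1);   // required before any insert: a key that is never used
    map[1] = 2;
    map[3] = 4;
    google::dense_hash_map<int, int>::const_iterator it = map.find(1);
    return it != map.end() ? 0 : 1;
}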
participants (6)
- Andrey Semashev
- Avikam Bara-nes
- Daniel James
- Grund, Holger (ISGT)
- Olaf van der Spek
- Stewart, Robert