[unordered] request for ways to get more space-efficient allocations

Hi Jeremy,

Your implementation is great. I have one problem though, and that is that the number of buckets is way larger than it has to be. I find that in many applications I know the size of the map I'm going to construct. Furthermore, I also find that N elements of type T* or unsigned int can be put in a hash map/set with N (or very close to N) buckets without getting a max load factor above 1 (so the distribution of elements is perfect). This is probably because my T* objects point to consecutive memory locations or because my integers fall in a narrow range like [0,N]. Nevertheless, in such situations the container will allocate a huge number of buckets --- far more than needed, wasting too much memory.

I tried expanding your list of primes so that it contains more numbers in the small range, and in particular such that it contains numbers closely above powers of 2. It happens frequently that power-of-2 sizes come up. Here's my new list:

static const std::size_t prime_list[] = { 5ul, 11ul, 17ul, 29ul, 37ul, 53ul, 67ul, 79ul, 97ul, 131ul, 193ul, 257ul, 389ul, 512ul, 769ul, 1031ul, 1543ul, 2053ul, 3079ul, 6151ul, 12289ul, 24593ul, 49157ul, 98317ul, 196613ul, 393241ul, 786433ul, 1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul, 50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul, 1610612741ul, 3221225473ul, 4294967291ul };

For the test I conducted, not once did I get a max load factor above 1 for containers of size 4, 5, 7, 9 etc. up to about 512.

Another alternative would be to allow users to control the initial bucket count explicitly in the constructor. I think I might like this better. Something like

boost::unordered_set<T*> the_set( boost::unordered::bucket_count(32u) );
assert( the_set.bucket_count() == 32u );

Comments?

-Thorsten
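
A minimal sketch of the kind of experiment described above, assuming the constructor's bucket-count hint and the default hash (boost::hash for small unsigned integers is effectively the identity); the exact bucket counts and worst-case loads depend on the implementation's prime list:

    #include <boost/unordered_set.hpp>
    #include <algorithm>
    #include <cstddef>
    #include <iostream>

    int main()
    {
        for (std::size_t n = 4; n <= 512; ++n)
        {
            boost::unordered_set<unsigned> s(n);      // bucket-count hint: at least n
            for (unsigned i = 0; i < n; ++i)
                s.insert(i);                          // n consecutive keys

            std::size_t worst = 0;                    // worst bucket occupancy
            for (std::size_t b = 0; b < s.bucket_count(); ++b)
                worst = std::max(worst, s.bucket_size(b));

            std::cout << n << " elements / " << s.bucket_count()
                      << " buckets, worst bucket: " << worst << "\n";
        }
    }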

On 19/04/2008, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
Hi Jeremy,
Jeremy wrote the original version, I'm responsible for the library nowadays.
I tried expanding your list of primes so that it contains more numbers in the small range, and in particular such that it contains numbers closely above powers of 2. It happens frequently that power-of-2 sizes come up.
The theory is that you get fewer collisions when the size is as far from a power of 2 as possible. But I haven't looked into the research this is based on, so I'm not sure how trustworthy it is.
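
A tiny illustration of that theory, assuming the pointer value itself is used as the hash: keys that are all multiples of 8 (e.g. addresses of 8-byte-aligned objects) land in only two of sixteen power-of-two buckets, but spread across nearly all of a 17-bucket prime-sized table:

    #include <cstddef>
    #include <iostream>
    #include <set>

    int main()
    {
        std::set<std::size_t> pow2_used, prime_used;
        for (std::size_t key = 0; key < 16 * 8; key += 8)  // 16 aligned "pointers"
        {
            pow2_used.insert(key % 16);    // power-of-two bucket count
            prime_used.insert(key % 17);   // prime bucket count
        }
        std::cout << pow2_used.size() << " of 16 buckets used vs "
                  << prime_used.size() << " of 17\n";      // prints: 2 ... 16
    }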
Here's my new list:
static const std::size_t prime_list[] = { 5ul, 11ul, 17ul, 29ul, 37ul, 53ul, 67ul, 79ul, 97ul, 131ul, 193ul, 257ul, 389ul, 512ul, 769ul, 1031ul, 1543ul, 2053ul, 3079ul, 6151ul, 12289ul, 24593ul, 49157ul, 98317ul, 196613ul, 393241ul, 786433ul, 1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul, 50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul, 1610612741ul, 3221225473ul, 4294967291ul };
It was requested that I use smaller increments during the review. I haven't looked at it yet, but I created a ticket and I should get round to it before 1.36: http://svn.boost.org/trac/boost/ticket/1710
For the test I conducted, not once did I get a max load factor above 1 for containers of size 4, 5, 7, 9 etc up to about 512.
It's written so that it rehashes whenever an insert will cause the maximum load factor to be exceeded. So you never will. The main performance worry is that it will rehash too often, so it might be necessary either to let it exceed the maximum load factor a little, or to ensure that it always grows by at least a certain percentage. Although for larger sizes you'll generally find that the memory used by the nodes (the element size, one or two pointers, and the memory-management overhead for each node) is much larger than the size of the bucket array (one pointer per bucket plus the overhead for a single allocation).
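
A hedged sketch of the growth rule just described -- illustrative only, not the actual Boost.Unordered source. It returns the bucket count to grow to before an insert, or 0 if no rehash is needed, and grows by at least a factor of two so rehashes stay infrequent:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>

    std::size_t buckets_needed_before_insert(std::size_t size,
                                             std::size_t bucket_count,
                                             float max_load_factor)
    {
        if (size + 1 <= max_load_factor * bucket_count)
            return 0;                               // insert fits within the limit
        std::size_t required = static_cast<std::size_t>(
            std::ceil((size + 1) / max_load_factor));
        // Growing by at least a fixed factor keeps rehashes infrequent; the real
        // container would then round the result up to a prime from its list.
        return std::max(required, bucket_count * 2);
    }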
Another alternative would be to allow users to control the initial bucket count explicitly in the constructor. I think I might like this better. Something like
boost::unordered_set<T*> the_set( boost::unordered::bucket_count(32u) ); assert( the_set.bucket_count() == 32u );
The problem is that performance will typically degrade when the bucket count isn't a prime number. Although if you have a high quality hash function this shouldn't be the case. If you did this, would you want the bucket count to remain fixed, regardless of how large the container gets? An alternative that has some appeal is letting the user specify a policy class to control the bucket count. Although I'm not sure how that would play with the standard. Daniel

Daniel James skrev:
On 19/04/2008, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
Hi Jeremy,
Jeremy wrote the original version, I'm responsible for the library nowadays.
Of course, sorry about that.
I tried expanding your list of primes so that it contains more numbers in the small range, and in particular such that it contains numbers closely above powers of 2. It happens frequently that power-of-2 sizes come up.
The theory is that you get fewer collisions when the size is as far from a power of 2 as possible. But I haven't looked into the research this is based on, so I'm not sure how trustworthy it is.
Here's my new list:
static const std::size_t prime_list[] = { 5ul, 11ul, 17ul, 29ul, 37ul, 53ul, 67ul, 79ul, 97ul, 131ul, 193ul, 257ul, 389ul, 512ul, 769ul, 1031ul, 1543ul, 2053ul, 3079ul, 6151ul, 12289ul, 24593ul, 49157ul, 98317ul, 196613ul, 393241ul, 786433ul, 1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul, 50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul, 1610612741ul, 3221225473ul, 4294967291ul };
It was requested that I use smaller increments during the review. I haven't looked at it yet, but I created a ticket and I should get round to it before 1.36:
http://svn.boost.org/trac/boost/ticket/1710
For the test I conducted, not once did I get a max load factor above 1 for containers of size 4, 5, 7, 9 etc up to about 512.
It's written so that it rehashes whenever an insert will cause the maximum load factor to be exceeded. So you never will. The main performance worry is that it will rehash too often
Right.
so it might be necessary either to let it exceed the maximum load factor a little, or to ensure that it always grows by at least a certain percentage.
Although for larger sizes you'll generally find that the memory used by the nodes (the element size, one or two pointers, and the memory-management overhead for each node) is much larger than the size of the bucket array (one pointer per bucket plus the overhead for a single allocation).
Ok. I wrongly assumed that the buckets were not just a pointer to something.
Another alternative would be to allow users to control the initial bucket count explicitly in the constructor. I think I might like this better. Something like
boost::unordered_set<T*> the_set( boost::unordered::bucket_count(32u) ); assert( the_set.bucket_count() == 32u );
The problem is that performance will typically degrade when the bucket count isn't a prime number. Although if you have a high quality hash function this shouldn't be the case.
If you did this, would you want the bucket count to remain fixed, regardless of how large the container gets?
My typical use case is that I will fill the container once (and I know how many elements will be put in there). Then this object is used again and again for lookup. In this case, allocating room for 2-3 times what is needed seems like overkill; allocating perhaps only 20%-30% extra would be more fair.
An alternative that has some appeal is letting the user specify a policy class to control the bucket count. Although I'm not sure how that would play with the standard.
Poor, probably. I guess for cases where I know the integers fall in the range, say, [0,64], I can just use vector<optional<unsigned>>. -Thorsten
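
For the narrow-integer-range case mentioned above, a direct-indexed table is indeed hard to beat; a minimal sketch with keys known to lie in [0, 64], where presence is modelled by the optional being set:

    #include <boost/optional.hpp>
    #include <iostream>
    #include <vector>

    int main()
    {
        // One slot per possible key in [0, 64]; lookup is a single index.
        std::vector<boost::optional<unsigned> > table(65);

        table[17] = 17u;                              // "insert"
        table[42] = 42u;

        if (table[17])                                // "find"
            std::cout << "17 is present, value " << *table[17] << "\n";
        if (!table[18])
            std::cout << "18 is absent\n";
    }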

On 21/04/2008, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
Daniel James skrev:
An alternative that has some appeal is letting the user specify a policy class to control the bucket count. Although I'm not sure how that would play with the standard.
Poor, probably.
I guess for cases where I know the integers fall in the range, say, [0,64], I can just use vector<optional<unsigned>>.
Thinking about it some more, I could create a trait class for the hash function.

template <class T> struct is_good_hash : public boost::mpl::false_ {};
template <> struct is_good_hash<groovy_hash> : public boost::mpl::true_ {};

And use power of 2 tables when the trait is true.

Daniel
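
One way such a trait could then be consumed, sketched with boost::enable_if; is_good_hash and groovy_hash are the hypothetical names from the post, and this is only one possible dispatch mechanism:

    #include <boost/mpl/bool.hpp>
    #include <boost/utility/enable_if.hpp>
    #include <cstddef>

    template <class T> struct is_good_hash : boost::mpl::false_ {};
    struct groovy_hash;                               // hypothetical hash functor
    template <> struct is_good_hash<groovy_hash> : boost::mpl::true_ {};

    // Default: prime bucket counts, index by modulo.
    template <class Hash, class Enable = void>
    struct bucket_index
    {
        static std::size_t index(std::size_t h, std::size_t n) { return h % n; }
    };

    // Known-good hash: power-of-two bucket counts, index by masking.
    template <class Hash>
    struct bucket_index<Hash,
        typename boost::enable_if<is_good_hash<Hash> >::type>
    {
        static std::size_t index(std::size_t h, std::size_t n) { return h & (n - 1); }
    };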

Daniel James skrev:
On 21/04/2008, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
Daniel James skrev:
An alternative that has some appeal is letting the user specify a policy class to control the bucket count. Although I'm not sure how that would play with the standard.
Poor, probably.
I guess for cases where I know the integers fall in the range, say, [0,64], I can just use vector<optional<unsigned>>.
Thinking about it some more, I could create a trait class for the hash function.
template <class T> struct is_good_hash : public boost::mpl::false_ {}; template <> struct is_good_hash<groovy_hash> : public boost::mpl::true_ {};
And use power of 2 tables when the trait is true.
As for the integers case, I think the vector approach is actually more appropriate. It's harder to use with pointers, though. I'd rather just see a way to tell the container "have this initial bucket capacity no matter what"; then I can find a good size that avoids rehashing and comes close to the number of needed buckets (for some application-specific definition of "close"). For example, with a max of 24 pointers, it is probably not necessary to go for 53 buckets. This can matter when the number of hash maps is in the thousands or millions (as it often is in my case). -Thorsten
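
A rough back-of-envelope figure for the overhead being described, assuming one pointer per bucket on a 64-bit platform, that ~29 buckets would suffice for 24 elements, and a million of these small maps:

    #include <cstddef>
    #include <iostream>

    int main()
    {
        const std::size_t maps  = 1000000;                    // a million small hash maps
        const std::size_t extra = (53 - 29) * sizeof(void*);  // unneeded bucket pointers per map
        std::cout << "wasted bucket memory: "
                  << maps * extra / (1024 * 1024) << " MiB\n"; // roughly 183 MiB on 64-bit
    }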

On Mon, Apr 21, 2008 at 4:39 AM, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
My typical use case is that I will fill the container once (and I know how many elements will be put in there). Then this object is used again and again for lookup. In this case, allocating room for 2-3 times what is needed seems like overkill; allocating perhaps only 20%-30% extra would be more fair.
Would there be interest in a perfect_hash object to handle this use case specifically? My impression is that perfect hashes are well understood and routinely used, which might make them a candidate for boost. Johnicholas
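
For reference, the defining property such a perfect_hash object would guarantee by construction is easy to state and to check for a fixed key set -- a toy verification sketch, not a real perfect-hash construction algorithm:

    #include <cstddef>
    #include <set>
    #include <vector>

    // True iff 'hash' maps every key in 'keys' to a distinct bucket, i.e. the
    // hash is "perfect" for this particular key set and bucket count.
    template <class Key, class Hash>
    bool is_perfect(const std::vector<Key>& keys, Hash hash, std::size_t buckets)
    {
        std::set<std::size_t> used;
        for (std::size_t i = 0; i < keys.size(); ++i)
            if (!used.insert(hash(keys[i]) % buckets).second)
                return false;                 // two keys collided in one bucket
        return true;
    }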

On Sat, Apr 19, 2008 at 12:26 PM, Daniel James <daniel_james@fmail.co.uk> wrote:
The theory is that you get fewer collisions when the size is as far from a power of 2 as possible. But I haven't looked into the research this is based on, so I'm not sure how trustworthy it is.
That is what I was first told in college. Anyway, Java's HashTable (at least Sun's implementation) uses the opposite approach: hash table sizes are powers of two. A power-of-two number of buckets implies that the container is more sensitive to a bad hash function, but the 'modulo' operation can be performed as a bitwise AND with (size - 1), which is much faster. To minimize the impact of a bad hash function Java applies a second (fast, bitwise-operations) hashcode conversion after the user-provided function. (Note: Java VM modulo and integer division operations are really slow; some real-life testing should be performed in C++.) David
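
Roughly what that looks like translated to C++: the supplemental mix below follows one of Sun's published HashMap versions as commonly reproduced, so treat the exact shift amounts as illustrative rather than authoritative; the point is the cheap post-mix followed by a mask instead of a modulo:

    #include <cstddef>

    // Cheap bit-mixing pass applied after the user-provided hash function.
    inline std::size_t supplemental_mix(std::size_t h)
    {
        h ^= (h >> 20) ^ (h >> 12);
        return h ^ (h >> 7) ^ (h >> 4);
    }

    // With a power-of-two bucket count, the "modulo" is a single AND.
    inline std::size_t bucket_of(std::size_t raw_hash, std::size_t bucket_count)
    {
        return supplemental_mix(raw_hash) & (bucket_count - 1);
    }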

On 21/04/2008, David Rodríguez Ibeas <dibeas@ieee.org> wrote:
Anyway, Java's HashTable (at least Sun's implementation) uses the opposite approach: hash table sizes are powers of two. A power-of-two number of buckets implies that the container is more sensitive to a bad hash function, but the 'modulo' operation can be performed as a bitwise AND with (size - 1), which is much faster. To minimize the impact of a bad hash function Java applies a second (fast, bitwise-operations) hashcode conversion after the user-provided function.
Do you mean something like Thomas Wang's integer hash function (or maybe exactly like it)? http://www.concentric.net/~Ttwang/tech/inthash.htm It's a possibility, although I'll probably have to provide a fallback for when std::size_t isn't known to be 32 or 64 bit (is there anyone using such a platform nowadays?). The main appeal of using prime numbers is that they don't require any assumptions about the platform. And since most STL implementations use them, hash functions are likely to have been written with them in mind.
(Note: Java VM modulo and integer division operations are really slow; some real-life testing should be performed in C++.)
At the moment I'm mostly concentrating on implementing the library (move semantics, emplace, equality functions, getting it working on different platforms etc.) and haven't done much work on performance. I'll get round to it eventually, but if anyone fancies starting on this now, it could be worked on independently. Daniel
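
For completeness, one widely circulated version of Thomas Wang's 32-bit mix; worth double-checking against the linked page before relying on it, but something like this could be used when std::size_t is known to be 32 bits wide:

    #include <boost/cstdint.hpp>

    // Thomas Wang's hash32shift, as commonly reproduced from the page linked
    // above; verify against the original before use.
    inline boost::uint32_t wang_hash32(boost::uint32_t key)
    {
        key = ~key + (key << 15);   // key = (key << 15) - key - 1
        key =  key ^ (key >> 12);
        key =  key + (key << 2);
        key =  key ^ (key >> 4);
        key =  key * 2057;          // key = (key + (key << 3)) + (key << 11)
        key =  key ^ (key >> 16);
        return key;
    }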

Thorsten Ottosen wrote:
I tried expanding your list of primes so that it contains more numbers in the small range, and in particular such that it contains numbers closely above powers of 2. It happens frequently that power-of-2 sizes come up.
Here's my new list:
static const std::size_t prime_list[] = { 5ul, 11ul, 17ul, 29ul, 37ul, 53ul, 67ul, 79ul, 97ul, 131ul, 193ul, 257ul, 389ul, 512ul, 769ul,
512ul means 521ul?
1031ul, 1543ul, 2053ul, 3079ul, 6151ul, 12289ul, 24593ul, 49157ul, 98317ul, 196613ul, 393241ul, 786433ul, 1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul, 50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul, 1610612741ul, 3221225473ul, 4294967291ul };
Bo Persson

Bo Persson skrev:
Thorsten Ottosen wrote:
I tried expanding your list of primes so that it contains more numbers in the small range, and in particular such that it contains numbers closely above powers of 2. It happens frequently that power-of-2 sizes come up.
Here's my new list:
static const std::size_t prime_list[] = { 5ul, 11ul, 17ul, 29ul, 37ul, 53ul, 67ul, 79ul, 97ul, 131ul, 193ul, 257ul, 389ul, 512ul, 769ul,
512ul means 521ul?
yes :-) -Thorsten
participants (5): Bo Persson, Daniel James, David Rodríguez Ibeas, Johnicholas Hines, Thorsten Ottosen