
Daniel James wrote:
On 19/04/2008, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
Hi Jeremy,
Jeremy wrote the original version; I'm responsible for the library nowadays.
Of course, sorry about that.
I tried expanding your list of primes so that it contains more numbers in the small range and, in particular, so that it contains numbers closely above powers of 2. Powers of 2 come up frequently.
The theory is that you get fewer collisions when the size is as far from a power of 2 as possible. But I haven't looked into the research this is based on, so I'm not sure how trustworthy it is.
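To illustrate that theory (a minimal sketch, not boost::unordered code): pointer values are typically multiples of 8, so with a power-of-2 bucket count they collapse onto a fraction of the buckets, while a nearby prime spreads them out with no collisions at all.

#include <cstddef>
#include <iostream>
#include <set>

// Count how many distinct buckets 1000 pointer-like "hash values"
// (all multiples of 8) land in for a given bucket count.
std::size_t distinct_buckets(std::size_t bucket_count)
{
    std::set<std::size_t> used;
    for (std::size_t i = 0; i < 1000; ++i)
        used.insert((i * 8) % bucket_count);  // bucket = hash % count
    return used.size();
}

int main()
{
    std::cout << distinct_buckets(1024) << '\n';  // 128: 7/8 of the buckets stay empty
    std::cout << distinct_buckets(1031) << '\n';  // 1000: the nearby prime spreads them all out
}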
Here's my new list:
static const std::size_t prime_list[] = {
    5ul, 11ul, 17ul, 29ul, 37ul, 53ul, 67ul, 79ul, 97ul, 131ul, 193ul,
    257ul, 389ul, 521ul, 769ul, 1031ul, 1543ul, 2053ul, 3079ul, 6151ul,
    12289ul, 24593ul, 49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
    1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul, 50331653ul,
    100663319ul, 201326611ul, 402653189ul, 805306457ul, 1610612741ul,
    3221225473ul, 4294967291ul };
It was requested that I use smaller increments during the review. I haven't looked at it yet, but I created a ticket and I should get round to it before 1.36:
http://svn.boost.org/trac/boost/ticket/1710
For the test I conducted, not once did I get a max load factor above 1 for containers of sizes 4, 5, 7, 9, etc., up to about 512.
It's written so that it rehashes whenever an insert would cause the maximum load factor to be exceeded. So you never will exceed it. The main performance worry is that it will rehash too often,
Right.
so it might be necessary either to let it exceed the maximum load factor a little, or to ensure that it always grows by at least a certain percentage.
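As an illustration of that trade-off (a sketch only, not the actual boost::unordered code), the growth decision could combine both ideas: pick the smallest prime that satisfies the maximum load factor, but never grow by less than a factor of two.

#include <algorithm>
#include <cstddef>

// Abbreviated prime table in the spirit of prime_list above.
static const std::size_t primes[] = {
    53ul, 97ul, 193ul, 389ul, 769ul, 1543ul, 3079ul, 6151ul, 12289ul };
static const std::size_t n_primes = sizeof(primes) / sizeof(primes[0]);

std::size_t next_bucket_count(std::size_t size, float max_load_factor,
                              std::size_t current_buckets)
{
    // Smallest count keeping size / buckets <= max_load_factor...
    std::size_t wanted =
        static_cast<std::size_t>(size / max_load_factor) + 1;
    // ...but at least double the current count, so rehashes stay rare.
    wanted = std::max(wanted, current_buckets * 2);
    const std::size_t* p =
        std::lower_bound(primes, primes + n_primes, wanted);
    return p != primes + n_primes ? *p : primes[n_primes - 1];
}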
Although for larger sizes you'll generally find that the memory use of the nodes (the element size, one or two pointers and the memory management overhead for each node) is much larger than the size of the bucket array (one pointer per bucket and the overhead for a single allocation).
Ok. I wrongly assumed that the buckets were not a pointer to something.
Another alternative would be to allow users to control the initial bucket count explicitly in the constructor. I think I might like this better. Something like:
boost::unordered_set<T*> the_set( boost::unordered::bucket_count(32u) );
assert( the_set.bucket_count() == 32u );
The problem is that performance will typically degrade when the bucket count isn't a prime number, although with a high-quality hash function this shouldn't be the case.
If you did this, would you want the bucket count to remain fixed, regardless of how large the container gets?
My typical use case is that I fill the container once (and I know how many elements will be put in there), and then the object is used again and again for lookup. In this case, allocating room for 2-3 times the need seems like overkill; allocating perhaps only 20%-30% extra would be more reasonable.
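For what it's worth, that pattern can be approximated with the existing interface (a sketch, using the usual bucket-count-hint constructor rather than the hypothetical bucket_count parameter above):

#include <boost/unordered_set.hpp>
#include <cstddef>

// Fill once with a known element count, then use for lookups only.
// Asking for ~25% extra buckets up front avoids any rehash during
// the fill while wasting far less than 2-3x the needed space.
boost::unordered_set<int> make_lookup_table(const int* first, const int* last)
{
    std::size_t n = static_cast<std::size_t>(last - first);
    boost::unordered_set<int> the_set(n + n / 4);  // bucket count hint
    the_set.max_load_factor(1.0f);                 // tolerate fuller buckets
    the_set.insert(first, last);
    return the_set;
}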
An alternative that has some appeal is letting the user specify a policy class to control the bucket count, although I'm not sure how that would play with the standard.
Poor, probably. I guess for cases where I know the integers fall in the range, say, [0,64], I can just use vector<optional<unsigned>>. -Thorsten
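For reference, a sketch of that direct-addressing fallback (a hypothetical helper, not a library facility): the key itself is the index, so there is no hashing, no collisions, and no bucket count to tune.

#include <boost/optional.hpp>
#include <vector>

// Direct-addressed table for keys in [0, 64]: slot k holds the value
// for key k, or is empty if the key is absent.
std::vector<boost::optional<unsigned> > table(65);

void insert(unsigned key, unsigned value) { table[key] = value; }

bool contains(unsigned key) { return table[key].is_initialized(); }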