
Daniel James wrote:
On 19/04/2008, Thorsten Ottosen <thorsten.ottosen@dezide.com> wrote:
Hi Jeremy,
Jeremy wrote the original version; I'm responsible for the library nowadays.
Of course, sorry about that.
I tried expanding your list of primes so that it contains more numbers in the small range and, in particular, so that it contains numbers closely above powers of 2. Powers of 2 come up frequently.
The theory is that you get fewer collisions when the size is as far from a power of 2 as possible. But I haven't looked into the research this is based on, so I'm not sure how trustworthy it is.
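To illustrate that theory (a minimal sketch, not boost::unordered code): pointer values are typically multiples of 8, so with a power-of-2 bucket count they collapse onto a fraction of the buckets, while a nearby prime spreads them out with no collisions at all.

#include <cstddef>
#include <iostream>
#include <set>

// Count how many distinct buckets 1000 pointer-like "hash values"
// (all multiples of 8) land in for a given bucket count.
std::size_t distinct_buckets(std::size_t bucket_count)
{
    std::set<std::size_t> used;
    for (std::size_t i = 0; i < 1000; ++i)
        used.insert((i * 8) % bucket_count);  // bucket = hash % count
    return used.size();
}

int main()
{
    std::cout << distinct_buckets(1024) << '\n';  // 128: 7/8 of the buckets stay empty
    std::cout << distinct_buckets(1031) << '\n';  // 1000: the nearby prime spreads them all out
}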
Here's my new list:
static const std::size_t prime_list[] = {
    5ul, 11ul, 17ul, 29ul, 37ul, 53ul, 67ul, 79ul, 97ul, 131ul, 193ul,
    257ul, 389ul, 521ul, 769ul, 1031ul, 1543ul, 2053ul, 3079ul, 6151ul,
    12289ul, 24593ul, 49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
    1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul, 50331653ul,
    100663319ul, 201326611ul, 402653189ul, 805306457ul, 1610612741ul,
    3221225473ul, 4294967291ul };
It was requested that I use smaller increments during the review. I haven't looked at it yet, but I created a ticket and I should get round to it before 1.36:
http://svn.boost.org/trac/boost/ticket/1710
For the test I conducted, not once did I get a max load factor above 1 for containers of sizes 4, 5, 7, 9, etc., up to about 512.
It's written so that it rehashes whenever an insert would cause the maximum load factor to be exceeded. So you never will exceed it. The main performance worry is that it will rehash too often,
Right.
so it might be necessary either to let it exceed the maximum load factor a little, or to ensure that it always grows by at least a certain percentage.
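As an illustration of that trade-off (a sketch only, not the actual boost::unordered code), the growth decision could combine both ideas: pick the smallest prime that satisfies the maximum load factor, but never grow by less than a factor of two.

#include <algorithm>
#include <cstddef>

// Abbreviated prime table in the spirit of prime_list above.
static const std::size_t primes[] = {
    53ul, 97ul, 193ul, 389ul, 769ul, 1543ul, 3079ul, 6151ul, 12289ul };
static const std::size_t n_primes = sizeof(primes) / sizeof(primes[0]);

std::size_t next_bucket_count(std::size_t size, float max_load_factor,
                              std::size_t current_buckets)
{
    // Smallest count keeping size / buckets <= max_load_factor...
    std::size_t wanted =
        static_cast<std::size_t>(size / max_load_factor) + 1;
    // ...but at least double the current count, so rehashes stay rare.
    wanted = std::max(wanted, current_buckets * 2);
    const std::size_t* p =
        std::lower_bound(primes, primes + n_primes, wanted);
    return p != primes + n_primes ? *p : primes[n_primes - 1];
}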
Although for larger sizes you'll generally find that the memory use of the nodes (the element size, one or two pointers and the memory management overhead for each node) is much larger than the size of the bucket array (one pointer per bucket and the overhead for a single allocation).
Ok. I wrongly assumed that the buckets were not a pointer to something.
Another alternative would be to allow users to control the initial bucket count explicitly in the constructor. I think I might like this better. Something like:
boost::unordered_set<T*> the_set( boost::unordered::bucket_count(32u) );
assert( the_set.bucket_count() == 32u );
The problem is that performance will typically degrade when the bucket count isn't a prime number, although with a high-quality hash function this shouldn't be the case.
If you did this, would you want the bucket count to remain fixed, regardless of how large the container gets?
My typical use case is that I fill the container once (and I know how many elements will be put in there), and then the object is used again and again for lookup. In this case, allocating room for 2-3 times the need seems like overkill; allocating perhaps only 20%-30% extra would be more reasonable.
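For what it's worth, that pattern can be approximated with the existing interface (a sketch, using the usual bucket-count-hint constructor rather than the hypothetical bucket_count parameter above):

#include <boost/unordered_set.hpp>
#include <cstddef>

// Fill once with a known element count, then use for lookups only.
// Asking for ~25% extra buckets up front avoids any rehash during
// the fill while wasting far less than 2-3x the needed space.
boost::unordered_set<int> make_lookup_table(const int* first, const int* last)
{
    std::size_t n = static_cast<std::size_t>(last - first);
    boost::unordered_set<int> the_set(n + n / 4);  // bucket count hint
    the_set.max_load_factor(1.0f);                 // tolerate fuller buckets
    the_set.insert(first, last);
    return the_set;
}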
An alternative that has some appeal is letting the user specify a policy class to control the bucket count, although I'm not sure how that would play with the standard.
Poor, probably. I guess for cases where I know the integers fall in the range, say, [0,64], I can just use vector<optional<unsigned>>. -Thorsten
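For reference, a sketch of that direct-addressing fallback (a hypothetical helper, not a library facility): the key itself is the index, so there is no hashing, no collisions, and no bucket count to tune.

#include <boost/optional.hpp>
#include <vector>

// Direct-addressed table for keys in [0, 64]: slot k holds the value
// for key k, or is empty if the key is absent.
std::vector<boost::optional<unsigned> > table(65);

void insert(unsigned key, unsigned value) { table[key] = value; }

bool contains(unsigned key) { return table[key].is_initialized(); }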