
Daniel James wrote:
Peter Dimov wrote:
This is not an assumption. It is the specification of hash_value that determines how it is to be used. Currently it does not offer any guarantees for the entropy of any particular bits, so the client of hash_value is expected to not do something stupid, like throw away 28 of its 32 bits.
From google's sparse hash table docs:
"(As a side note, using a table size that's a power of two has several advantages, including the speed of calculating (i % table_size). On the other hand, power-of-two tables are not very forgiving of a poor hash function. Make sure your hash function is a good one! There are plenty of dos and don'ts on the web (and in Knuth), for writing hash functions.)"
It seems intuitively obvious to me that we should stick with the original requirements on hash_value, that is, "equivalent inputs produce equal outputs, distinct inputs likely produce distinct outputs". That's because the hash table components are outnumbered by the hash_value overloads by several orders of magnitude, so it makes sense to shift the responsibility for producing a good distribution away from hash_value.

As for power-of-two table sizes being a source of speedup, this is obviously only true when hv % 2^n doesn't produce collisions. If the key set doesn't already have this property, the cost of fixing it (in the general case, where it is not known which bits vary, something like x ^ (x % p1 + x * p2) needs to be used) may cancel out the speedup obtained by using a 2^n table size.

On the other hand, it doesn't seem to cost us anything to apply x + (x >> 3) in our pointer hash function, so we might as well do it.
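Something along these lines, for example. This is only a sketch: the conversion of the pointer to std::size_t and the exact shift amount are assumptions, not a finished implementation.

    #include <cstddef>

    template <class T> std::size_t hash_value( T * p )
    {
        std::size_t x = reinterpret_cast< std::size_t >( p );

        // Pointers tend to be aligned, so the low few bits are usually zero;
        // folding in higher bits costs one shift and one add, and makes
        // hv % 2^n usable directly as a bucket index.
        return x + ( x >> 3 );
    }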