
On 2/1/2012 3:59 AM, Daniel James wrote:
and values that differ by a small amount end up in nearby bins -- which can cause excess collisions for data with some correlational structure.
The STL hash containers are pretty much required to be implemented using a chained hash table.
I've seen that claim, and I might well be missing something, but I believe that coalesced hashing would also work. Deletion (erasure) algorithms that retain an expectation of a few probes up to very high load factors are complicated and somewhat costly, though (as I remember -- it's been a long time since I worked with this) still asymptotically within the required bounds. I don't know of any shortcut for resizing beyond rehashing all the entries, but the requirements are consistent with that, and resizing a growing table needs to be done less often for coalesced hashing because of its good performance up to much higher load factors than external chaining. The cost is a less trivial implementation (but that is what libraries are for) and a higher, though still constant, expected cost of element erasure. The gain is frequently much better memory usage. (A rough sketch of the insert/find side of coalesced hashing is at the bottom of this message.)

In any case, how collisions are handled is irrelevant to my point. If it wasn't clear, the structure I was describing as possibly "correlational" is structure in the data being hashed, not the structure of the container.

Let's take an example. Say I'm mapping usernames to some kind of object, and usernames consist of a last name plus a number appended in order whenever there is a duplicate -- a common practice. Seeing a last name at all raises the probability that the last name is a common one, so the presence of "cooper3" makes it more likely that there will be a "cooper4". If the hashing scheme hashes those "adjacent" usernames to adjacent hash values/bins, then whenever "james23" happens to collide with "cooper3", "james24" will also collide with "cooper4". Consequently, under these conditions, moderate load factors will give somewhat worse than the constant-time performance expected from a truly uniform hash.

For what it is worth, FNV1 hashing suffers from precisely this correlational sensitivity under these circumstances (with FNV1 the next value of the last byte may end up in either the next /or/ the previous bin, but it does so consistently, which can make the third name in such a sequence more likely to collide with the first), which is why the so-called "slightly better" distribution properties of FNV1a, which corrects this defect, are not as trivial as its developers imply.

Topher
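
P.S. Here is a small, untested sketch (plain C++, nothing Boost-specific) showing the effect on the username example above; the constants are the published 64-bit FNV offset basis and prime, and the 16384-bucket table size is just an arbitrary power of two for illustration. Because FNV1's last operation on the final byte is the XOR, two names that differ only in their last character produce hashes that differ only in the bits where those characters differ, so with a power-of-two bucket count the FNV1 buckets for "cooper3" and "cooper4" (and for "james23" and "james24") land within a few bins of each other -- and if "james23" lands in the same bucket as "cooper3", "james24" necessarily lands in the same bucket as "cooper4". FNV1a's trailing multiply mixes that final byte through the whole word.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Published 64-bit FNV constants.
const std::uint64_t FNV_OFFSET = 14695981039346656037ULL;
const std::uint64_t FNV_PRIME  = 1099511628211ULL;

// FNV-1: multiply, then XOR.  The final byte is XORed in after the last
// multiply, so strings differing only in that byte give hashes differing
// only in those bits.
std::uint64_t fnv1(const std::string& s) {
    std::uint64_t h = FNV_OFFSET;
    for (unsigned char c : s) {
        h *= FNV_PRIME;
        h ^= c;
    }
    return h;
}

// FNV-1a: XOR, then multiply.  The final multiply spreads the last byte
// through the whole word.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = FNV_OFFSET;
    for (unsigned char c : s) {
        h ^= c;
        h *= FNV_PRIME;
    }
    return h;
}

int main() {
    const std::size_t buckets = 16384;  // arbitrary power of two, for illustration
    for (const char* name : {"cooper3", "cooper4", "james23", "james24"}) {
        std::printf("%-8s  FNV1 bucket %5zu   FNV1a bucket %5zu\n",
                    name,
                    static_cast<std::size_t>(fnv1(name) % buckets),
                    static_cast<std::size_t>(fnv1a(name) % buckets));
    }
    return 0;
}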
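
And the coalesced-hashing sketch I promised above -- again untested and deliberately minimal: fixed capacity, no erase, no resize, which are exactly the parts I described as fiddly. The only point is that collision chains can live inside the slot array itself, with no separately allocated nodes, and that chains rooted at different home slots may coalesce.

#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Fixed-capacity coalesced hash table: chains are threaded through the slot
// array via 'next' indices rather than through separately allocated nodes.
template <class Key, class Value>
class coalesced_table {
    struct slot {
        Key key;
        Value value;
        bool occupied;
        std::ptrdiff_t next;   // index of the next slot in this chain, -1 = end
    };
    std::vector<slot> slots_;
    std::ptrdiff_t free_hint_; // scan downward from here when hunting for an empty slot

    std::size_t home(const Key& k) const { return std::hash<Key>()(k) % slots_.size(); }

public:
    explicit coalesced_table(std::size_t capacity)
        : slots_(capacity, slot{Key(), Value(), false, -1}),
          free_hint_(static_cast<std::ptrdiff_t>(capacity) - 1) {}

    void insert(const Key& k, const Value& v) {
        std::ptrdiff_t i = static_cast<std::ptrdiff_t>(home(k));
        if (slots_[i].occupied) {
            // Walk the chain reachable from the home slot; update in place on a match.
            while (true) {
                if (slots_[i].key == k) { slots_[i].value = v; return; }
                if (slots_[i].next < 0) break;
                i = slots_[i].next;
            }
            // Take any empty slot and link it to the tail of this chain.  This is
            // where chains rooted at different home slots can coalesce.
            while (free_hint_ >= 0 && slots_[free_hint_].occupied) --free_hint_;
            if (free_hint_ < 0) throw std::length_error("table full: a rehash is needed");
            slots_[i].next = free_hint_;
            i = free_hint_;
        }
        slots_[i].key = k;
        slots_[i].value = v;
        slots_[i].occupied = true;
        slots_[i].next = -1;
    }

    // Returns a pointer to the stored value, or a null pointer if absent.
    const Value* find(const Key& k) const {
        std::ptrdiff_t i = static_cast<std::ptrdiff_t>(home(k));
        if (!slots_[i].occupied) return nullptr;
        for (; i >= 0; i = slots_[i].next)
            if (slots_[i].key == k) return &slots_[i].value;
        return nullptr;
    }
};

Used as, say, coalesced_table<std::string, int> t(1 << 16), it behaves like a crude fixed-size map; a real implementation would rehash into a larger table once the free-slot hunt fails, and would need the more involved erasure logic I mentioned.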