
On 2/1/2012 3:59 AM, Daniel James wrote:
and values that differ by a small amount end up in nearby bins -- which can cause excess collisions for data with some correlational structure.
The STL hash containers are pretty much required to be implemented using a chained hash table.
I've seen that claim, and I might well be missing something, but I believe that coalesced hashing would also work. Deletion (erasure) algorithms that retain an expectation of a few probes up to very high load factors are complicated and somewhat costly, though (as I remember -- it's been a long time since I worked with this) still asymptotically within the required bounds. I don't know of any shortcut for resizing beyond rehashing all the entries, but the requirements are consistent with that, and resizing a growing table needs to be done less often for coalesced hashing because of its good performance up to much higher load factors than external chaining. The cost is a less trivial implementation (but that is what libraries are for) and a higher, though still constant, expected cost of element erasure. The gain is frequently much better memory usage. (A rough sketch of the insert/find side of coalesced hashing is at the bottom of this message.)

In any case, how collisions are handled is irrelevant to my point. If it wasn't clear, the structure I was describing as possibly "correlational" is structure in the data being hashed, not the structure of the container.

Let's take an example. Say I'm mapping usernames to some kind of object, and usernames consist of a last name plus a number appended in order whenever there is a duplicate -- a common practice. Seeing a last name at all raises the probability that the last name is a common one, so the presence of "cooper3" makes it more likely that there will be a "cooper4". If the hashing scheme hashes those "adjacent" usernames to adjacent hash values/bins, then whenever "james23" happens to collide with "cooper3", "james24" will also collide with "cooper4". Consequently, under these conditions, moderate load factors will give somewhat worse than the constant-time performance expected from a truly uniform hash.

For what it is worth, FNV1 hashing suffers from precisely this correlational sensitivity under these circumstances (with FNV1 the next value of the last byte may end up in either the next /or/ the previous bin, but it does so consistently, which can make the third name in such a sequence more likely to collide with the first), which is why the so-called "slightly better" distribution properties of FNV1a, which corrects this defect, are not as trivial as its developers imply.

Topher
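
P.S. Here is a small, untested sketch (plain C++, nothing Boost-specific) showing the effect on the username example above; the constants are the published 64-bit FNV offset basis and prime, and the 16384-bucket table size is just an arbitrary power of two for illustration. Because FNV1's last operation on the final byte is the XOR, two names that differ only in their last character produce hashes that differ only in the bits where those characters differ, so with a power-of-two bucket count the FNV1 buckets for "cooper3" and "cooper4" (and for "james23" and "james24") land within a few bins of each other -- and if "james23" lands in the same bucket as "cooper3", "james24" necessarily lands in the same bucket as "cooper4". FNV1a's trailing multiply mixes that final byte through the whole word.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Published 64-bit FNV constants.
const std::uint64_t FNV_OFFSET = 14695981039346656037ULL;
const std::uint64_t FNV_PRIME  = 1099511628211ULL;

// FNV-1: multiply, then XOR.  The final byte is XORed in after the last
// multiply, so strings differing only in that byte give hashes differing
// only in those bits.
std::uint64_t fnv1(const std::string& s) {
    std::uint64_t h = FNV_OFFSET;
    for (unsigned char c : s) {
        h *= FNV_PRIME;
        h ^= c;
    }
    return h;
}

// FNV-1a: XOR, then multiply.  The final multiply spreads the last byte
// through the whole word.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = FNV_OFFSET;
    for (unsigned char c : s) {
        h ^= c;
        h *= FNV_PRIME;
    }
    return h;
}

int main() {
    const std::size_t buckets = 16384;  // arbitrary power of two, for illustration
    for (const char* name : {"cooper3", "cooper4", "james23", "james24"}) {
        std::printf("%-8s  FNV1 bucket %5zu   FNV1a bucket %5zu\n",
                    name,
                    static_cast<std::size_t>(fnv1(name) % buckets),
                    static_cast<std::size_t>(fnv1a(name) % buckets));
    }
    return 0;
}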
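
And the coalesced-hashing sketch I promised above -- again untested and deliberately minimal: fixed capacity, no erase, no resize, which are exactly the parts I described as fiddly. The only point is that collision chains can live inside the slot array itself, with no separately allocated nodes, and that chains rooted at different home slots may coalesce.

#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Fixed-capacity coalesced hash table: chains are threaded through the slot
// array via 'next' indices rather than through separately allocated nodes.
template <class Key, class Value>
class coalesced_table {
    struct slot {
        Key key;
        Value value;
        bool occupied;
        std::ptrdiff_t next;   // index of the next slot in this chain, -1 = end
    };
    std::vector<slot> slots_;
    std::ptrdiff_t free_hint_; // scan downward from here when hunting for an empty slot

    std::size_t home(const Key& k) const { return std::hash<Key>()(k) % slots_.size(); }

public:
    explicit coalesced_table(std::size_t capacity)
        : slots_(capacity, slot{Key(), Value(), false, -1}),
          free_hint_(static_cast<std::ptrdiff_t>(capacity) - 1) {}

    void insert(const Key& k, const Value& v) {
        std::ptrdiff_t i = static_cast<std::ptrdiff_t>(home(k));
        if (slots_[i].occupied) {
            // Walk the chain reachable from the home slot; update in place on a match.
            while (true) {
                if (slots_[i].key == k) { slots_[i].value = v; return; }
                if (slots_[i].next < 0) break;
                i = slots_[i].next;
            }
            // Take any empty slot and link it to the tail of this chain.  This is
            // where chains rooted at different home slots can coalesce.
            while (free_hint_ >= 0 && slots_[free_hint_].occupied) --free_hint_;
            if (free_hint_ < 0) throw std::length_error("table full: a rehash is needed");
            slots_[i].next = free_hint_;
            i = free_hint_;
        }
        slots_[i].key = k;
        slots_[i].value = v;
        slots_[i].occupied = true;
        slots_[i].next = -1;
    }

    // Returns a pointer to the stored value, or a null pointer if absent.
    const Value* find(const Key& k) const {
        std::ptrdiff_t i = static_cast<std::ptrdiff_t>(home(k));
        if (!slots_[i].occupied) return nullptr;
        for (; i >= 0; i = slots_[i].next)
            if (slots_[i].key == k) return &slots_[i].value;
        return nullptr;
    }
};

Used as, say, coalesced_table<std::string, int> t(1 << 16), it behaves like a crude fixed-size map; a real implementation would rehash into a larger table once the free-slot hunt fails, and would need the more involved erasure logic I mentioned.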