Re: [boost] [review] hash functions

13 Mar 2005

      On Mar 13, 2005, at 2:04 PM, Peter Dimov wrote:
...
Topher Cooper wrote:
...
Fundamentally, what makes for a good hash function is that things that
are equal (however you are defining equality) always hash to the same
value, while any small difference between two things should cause them
to hash to completely different values.  In an arbitrary program would
I expect an int[3] to be equal to a struct { short, short, short }
when the values happened to be equal?  No.  What if [...]
Since the keys in an unordered_map are of the same type, it is usually 
not possible for one of them to contain int[3] and for another to hold 
struct { short, short, short }.
A discriminated union type (boost::variant) that is able to hold 
either type can easily hash_combine the index of the currently active 
type to avoid these accidental collisions. Similarly, a 
boost::any-alike can hash_combine the address of a structure that 
uniquely identifies the type, or typeid().name(). This mirrors their 
implementation of operator==.
I agree that the current design may not be well suited for a general 
purpose hash function that needs to be able to sort objects of 
arbitrary types into buckets. I'm not sure how often this occurs in 
practice. Do you have examples?
Actually, I agree that if the issue is only vectors, pairs and 
sequences that the pragmatics argues to this approach.  This behavior, 
though it is contrary to the spirit of object oriented programming (at 
least as I've interpreted that concept for 20+ years) is harmless.  
Different types (kinds, classes) of objects are different except as 
specified by is-a inheritance.  I would still argue that it is not 
behavior to be sought, but it is harmless if it is the most convenient 
way to code things.

I am made very uncomfortable, however when we put structs (and 
therefore, by a primary identity in the C++ language, classes) into the 
mix.  Here we may have inheritance relationship (not a union) where two 
different types with different meanings to the quantities can end up 
introducing a tendency for rather different things to hash to the same 
values.  I presented the example of two different classes representing 
polar and Cartesian representation of complex numbers (and, yes, I have 
written packages where these two representations co-exist and intermix 
to allow optimizations).  Of course for this example, you would clearly 
want a custom hash, but the point is, subclassing does not make this 
situation that bizarre.  What about a table containing structs for 
Resources, which could represent either employees or pieces of office 
equipment where the characteristics of interest in each happen to all 
be represented as a bunch of booleans, that happen to be the same 
length?

I've perhaps ended up over-stating things.  Let me summarize the 
nitty-gritty: I see no good reason to promote this as *desirable* 
behavior, and I worry about it occasionally producing some odd results. 
  Good hashing errs on the side of not building in a bias towards 
inappropriate collisions IF it can be conveniently avoided.

Topher