Re: [boost] [review] hash functions

In-Reply-To: <001201c527e4$7405efa0$6501a8c0@pdimov2> pdimov@mmltd.net (Peter Dimov) wrote (abridged):
In my experience, if you want two different things to hash the same, you need to write their hash functions with that in mind. It is a special case which justifies not using the default hash function.
Maybe, but this is not an argument for making the hash values different.
True. Arguments for that have been given elsewhere. One argument is to maximise variation (and for types like bool it is desperately needed). Another is to allow treating of 32-bit values as bytes. For example, size_t hash_range( const int *first, const int *last ) { return hash_range( static_cast<const unsigned char *>(first), static_cast<const unsigned char *>(last) ); } may be a good implementation on some platforms. It does 4 times as many mixing steps (assuming sizeof(int) == 4) as the proposed algorithm. It is, I think, how the designer of the proposed hash_combine algorithm expected it to be used. However, it yields hash values for int * different to the values for char *, so with the current design aims we can't use it.
But doesn't this mean that hash_value( pair<int,int> ) also needs a seed argument, if you want it to match hash_range( int[2] ) ?
No. The seed version of hash_range is used when you want to compute the hash function of a sequence, but you don't have access to the whole sequence at once. It's goal is not to provide a way to change the initial seed, but to make it possible to continue from an intermediate seed.
But suppose I have pair< pair<int,int>, pair<int,int> > ? Shouldn't that yield the same hash value as int[4] ? -- Dave Harris, Nottingham, UK

On Tue, 15 Mar 2005 16:20 +0000 (GMT), Dave Harris <brangdon@cix.compulink.co.uk> wrote:
Another is to allow treating of 32-bit values as bytes. For example,
size_t hash_range( const int *first, const int *last ) { return hash_range( static_cast<const unsigned char *>(first), static_cast<const unsigned char *>(last) ); }
may be a good implementation on some platforms. It does 4 times as many mixing steps (assuming sizeof(int) == 4) as the proposed algorithm. It is, I think, how the designer of the proposed hash_combine algorithm expected it to be used.
However, it yields hash values for int * different to the values for char *, so with the current design aims we can't use it.
It also has the undesirable side-effect of making the hash value depend on the machine endianness. The same set of integers would hash to different values on (say) Sparc and Intel machines. -- Caleb Epstein caleb dot epstein at gmail dot com

Dave Harris wrote:
But suppose I have pair< pair<int,int>, pair<int,int> > ? Shouldn't that yield the same hash value as int[4] ?
I don't expect a pair< pair<int, int>, pair<int, int> > to be frequently refactored into an int[4]. So no, I don't think that it should necessarily have the same hash value as an int[4]. Similarly, I don't think that vector< vector<int> > should necessarily have the same hash value as its flattened vector<int> counterpart.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Peter Dimov wrote: | Dave Harris wrote: | |> |> But suppose I have pair< pair<int,int>, pair<int,int> > ? Shouldn't |> that yield the same hash value as int[4] ? | | | I don't expect a pair< pair<int, int>, pair<int, int> > to be frequently | refactored into an int[4]. So no, I don't think that it should | necessarily have the same hash value as an int[4]. Similarly, I don't | think that vector< vector<int> > should necessarily have the same hash | value as its flattened vector<int> counterpart. Just to comment on this specifically, you really, really should have a vector<vector<int> > hash to the same value as it's flatterned conterpart. I have previously written a program where i had a set of vector<vector<int> >s, where when flatterned they were all the same, and I don't think it's that uncommon.. Chris -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (MingW32) iD8DBQFDAMeJ3FpDzErifpIRAqrVAJ4jZP0gG5DEttSIccejCW3RzuTdywCfRC5I xaPcjBfTeA6+3sBxKMKPAk8= =Lrj9 -----END PGP SIGNATURE-----

Chris Jefferson wrote:
Peter Dimov wrote:
Dave Harris wrote:
But suppose I have pair< pair<int,int>, pair<int,int> > ? Shouldn't that yield the same hash value as int[4] ?
I don't expect a pair< pair<int, int>, pair<int, int> > to be frequently refactored into an int[4]. So no, I don't think that it should necessarily have the same hash value as an int[4]. Similarly, I don't think that vector< vector<int> > should necessarily have the same hash value as its flattened vector<int> counterpart.
Just to comment on this specifically, you really, really should have a vector<vector<int> > hash to the same value as it's flatterned conterpart.
Just to clarify, did you mean "should not have" here? Because...
I have previously written a program where i had a set of vector<vector<int> >s, where when flatterned they were all the same, and I don't think it's that uncommon..
... this part seems to imply that these vectors should not have the same hash value, or the unordered_set will degrade into a linked list.

Peter Dimov wrote:
Dave Harris wrote:
But suppose I have pair< pair<int,int>, pair<int,int> > ? Shouldn't that yield the same hash value as int[4] ?
I don't expect a pair< pair<int, int>, pair<int, int> > to be frequently refactored into an int[4]. So no, I don't think that it should necessarily have the same hash value as an int[4]. Similarly, I don't think that vector< vector<int> > should necessarily have the same hash value as its flattened vector<int> counterpart.
They definitely should not for vector< vector<int> >. boost::hash should reflect std::equal_to, and std::equal_to does not flatten sequences. And since we're treating pairs like ranges of two values, nested pairs should mirror nested vectors. Of course, when using a custom equality function object which flattens sequences, a custom hash function with this behaviour is correct.
participants (5)
-
brangdon@cix.compulink.co.uk
-
Caleb Epstein
-
Chris Jefferson
-
Daniel James
-
Peter Dimov