Re: [boost] [review] hash functions

13 Mar 2005

      In-Reply-To: <d0kr8r$gl6$1@sea.gmane.org>
nesotto@cs.auc.dk (Thorsten Ottosen) wrote (abridged):
...
It's a pleasure to announce that the review of Daniel James' hash 
functions library starts today (8th of March) and ends the 12th
of March.
I see I have missed the deadline. Oh, well.

Several queries have been raised during the review, some of which have 
merit. Firstly, I think we have some agreement that the signature:

    size_t hash_range( It first, It last );

is wrong and there should be some way to provide an initial seed. I am not 
yet decided on what the best signature should be; I am tempted towards a 
3rd argument with a default value, but I can also see some merit in 
mimicking hash_combine() by making the seed the first argument.

Secondly, the proposed implementation for hashing pointers produces 
undefined behaviour because it subtracts pointers not in the same array. 
We have some agreement that using reinterpret_cast would be better.

Thirdly, I think there is also a legitimate query over whether the default 
hash_range should provide better quality hashes for arrays of bytes, eg 
along the lines discussed in:
   http://www.azillionmonkeys.com/qed/hash.html

With boost interface is more important than implementation, but apparently 
implementations of the hash library will not be allowed to vary. My 
impression is that the proposed implementation of hash_combine:
    seed ^= hash_value(v) + (seed << 6) + (seed >> 2);

was designed to combine 8-bit character values; do we have any evidence it 
also performs well to combine 32-bit values? Perhaps hash_value(v) would 
be better treated as an array of 4 bytes?

Fourthly, there seems to be a design philosophy behind the current 
proposal which is not stated explicitly in the documentation. It 
apparently favours some notion of safety over speed and hash quality, 
where "safety" means that different objects should have the same hash 
values if they look at all alike to someone possibly.

For example, the proposal has no specialisation for chars at all. I can't 
tell if that is an oversight, or whether it is a design aim that:

    char chars[] = { 1, 2, 3, 4 };
    int ints[] = { 1, 2, 3, 4 };

    assert( hash_range( chars, chars+4 ) == hash_range( ints, ints+4 ) );

I gather it is a design aim that hash_value( pair<int,int> ) match 
hash_range( int[2] ). Presumably if hash_range takes a seed argument, this 
means hash_value must too, else how can it match it?

I am not sure about all this philosophy. I had kinda hoped that a boost 
hashing library would offer good quality industrial-strength hashing; that 
it would save me having to do my own research. Regardless, I think the 
aims and guarantees of the proposed library should be documented more 
explicitly.

In conclusion, I feel that if we given only 5 days to review a proposal, 
that proposal needs to be spot on, uncontroversial, with all issues nailed 
accurately in advance. I don't think that is the case here. Although I 
think all the problems could be resolved, I don't think we have enough 
time to do that in the given review period. If I have to decide now, I'd 
say:

   This should not be accepted as a Boost library.

Stock questions: I have not tried to compile the code. I have looked at 
the documentation, the source, and followed some of the references. I have 
raised design questions here and followed the discussion. Total time is a 
few hours.

I am not an expert on hashing, but I have implemented my own hashing 
library. I have encountered practical problems due to different objects 
having the same hash values.

-- Dave Harris, Nottingham, UK

Re: [boost] [review] hash functions

brangdon＠cix.compulink.co.uk