Re: [boost] Re: [review] hash functions

15 Mar 2005

      In-Reply-To: <d11prg$uvt$1@sea.gmane.org>
daniel@calamity.org.uk (Daniel James) wrote (abridged):
...
...
Firstly, I think we have some agreement that the signature:
size_t hash_range( It first, It last );
is wrong
I didn't get the impression of any agreement about that.
...
and there should be some way to provide an initial seed.
Although, I agree with this.
That's all I meant. I knew there wasn't much agreement over whether side 
effects are bad :-)

My review was a lot more conservative than my comments in discussion.
...
...
We have some agreement that using reinterpret_cast would be better.
Yes, and this will be fixed.
When I wrote it seemed there were still unresolved issues. In fact, I 
don't believe anyone has yet posted an acceptable hash_value for pointers. 
Presumably we want to shift out bits which are always 0 for alignment 
reasons.
...
It is proposed that the standard explicitly specifies the hash
functions (and I've been criticized for not doing that) but part
of the purpose of this library is to get real world experience
on what the specification should be. And change it, if necessary.
OK.
...
I do intend to spend more time looking into alternatives in the future 
(and will look into adding a constant to hash_combine very soon).
OK. Although I suggested it, I'm not sure it's the best fix, as compared 
with using non-zero initial seeds.

    const size_t default_seed = 0x9e3779b9; // Or whatever.

    size_t hash_range( It first, It last, seed=default_seed );

A signature like that provides a hook to introduce variation between 
similar types if the author desires, and also allows ranges to be hashed 
as subranges with the output of one subrange feeding into the next. The 
default value allows the issue to be safely ignored by naive users. So 
it's a good idea anyway, and since it also fixes the problem with zeros 
maybe we don't need to fix that twice.

This does leave the issue of other places where hash_combine is called. It 
would be sufficient if we could be sure that all used a non-zero initial 
seed, eg:

    size_t hash_value( std::pair<int,int> p ) {
        size_t hash = default_seed;
        combine_hash( hash, p.first );
        combine_hash( hash, p.second );
        return hash;
    }

and the documentation would encourage that, but we can't really be sure of 
it. Another approach is to change all the primitive functions. For 
example:

    size_t hash_value( int i ) {
        return static_cast<unsigned int>(i) + default_seed;
    }

Introducing the non-zero constant here is surely as efficient as in 
combine_hash, if not more so. Admittedly it doesn't eliminate the zero, it 
just moves it around, but it should massively reduce its dynamic 
frequency. (Actually it will eliminate the zero where it matters most, ie 
for types like bool, char and short which are smaller than size_t.)
...
Did you see Peter's original proposal?
Yes, I had followed most of the links.
...
...
For example, the proposal has no specialisation for chars at all. I 
can't tell if that is an oversight, or whether it is a design
aim [...]
I don't think it's either. Peter?
Peter has now said it is a design aim. That even you weren't sure 
illustrates my point :-)
...
When hash_range is called with a seed argument, it would be equivalent 
to calling hash_combine for pair<int, int>.
If I understand the principles being advocated here, pair< pair<int,int>, 
pair<int,int> > should have the same hash_value as int[4]. How do you 
achieve that without giving hash_value( pair ) an initial seed?

-----
I find myself increasingly drawn to the idea of a hasher object. The main 
signature should be like:

    template <typename T>
    Hasher &hash( Hasher &hasher, const T &t );

with implementations looking like, eg:

    template <typename First, typename Second>
    Hasher &hash( Hasher &hasher, const pair<First,Second> &p ) {
        hash( hasher, first );
        hash( hasher, second );
        return hash;
    }

Then hash_value, if we still need it, would be like:

    template <typename T>
    size_t hash_value( T &t ) {
        Hasher hasher;
        hash( hasher, t );
        hasher.finalise();
        return hasher.value();
    }

but would rarely be called directly. This solves the problem of the 
default value, allows us to store intermediate results bigger than size_t 
(which may be important for cryptographic-strength hashing), and the 
finalise() step provides a hook for the final mixing/avalanching which 
some algorithms use (eg Paul Hsieh's).
...
Perhaps it would be better spelt as:
template <class It>
void hash_combine(std::size_t& seed, It begin, It end);
Yes, I considered that. I wasn't sure whether overloading the name would 
help or hinder - hash_range and hash_combine are not substitutable for 
each other.
...
Well, we did discus the type of review before it started, and nobody 
raised any objections.
I don't object to quick reviews in principle (although if someone can only 
contribute, eg, on Sunday mornings, a 5-day period is going to leave them 
out completely).
...
...
I am not an expert on hashing, but I have implemented my own hashing 
library. I have encountered practical problems due to different 
objects having the same hash values.
Is this in C++?
Yes. The main problem was zeros. My hash primitives produced zeros and my 
hash_combine turned zeros into more zeros. Just as with the initial boost 
proposal.
...
I would have thought that type safety would have helped avoid
such problems.
This was specifically an issue with optional or alternate types. For 
example, I had something similar to:

    size_t MyType::hash_value() const {
        return use_a ? hash_value(a) : hash_value(b);
    }

where a and b could be different types. We can avoid that mistake in 
library classes like boost:variant, but that doesn't prevent it recurring 
in user code.

When I am dealing with types without much intrinsic variation, eg bools, I 
found it important to exploit such variation as I did have as much as 
possible. Hashing false and 0 to different values helped.
...
Thank you for reviewing the library and bringing up several good points.
My pleasure.

-- Dave Harris, Nottingham, UK

Re: [boost] Re: [review] hash functions

brangdon＠cix.compulink.co.uk