
In-Reply-To: <d11prg$uvt$1@sea.gmane.org> daniel@calamity.org.uk (Daniel James) wrote (abridged):
Firstly, I think we have some agreement that the signature:
size_t hash_range( It first, It last );
is wrong
I didn't get the impression of any agreement about that.
and there should be some way to provide an initial seed.
Although, I agree with this.
That's all I meant. I knew there wasn't much agreement over whether side effects are bad :-) My review was a lot more conservative than my comments in discussion.
We have some agreement that using reinterpret_cast would be better.
Yes, and this will be fixed.
When I wrote it seemed there were still unresolved issues. In fact, I don't believe anyone has yet posted an acceptable hash_value for pointers. Presumably we want to shift out bits which are always 0 for alignment reasons.
It is proposed that the standard explicitly specifies the hash functions (and I've been criticized for not doing that) but part of the purpose of this library is to get real world experience on what the specification should be. And change it, if necessary.
OK.
I do intend to spend more time looking into alternatives in the future (and will look into adding a constant to hash_combine very soon).
OK. Although I suggested it, I'm not sure it's the best fix, as compared with using non-zero initial seeds. const size_t default_seed = 0x9e3779b9; // Or whatever. size_t hash_range( It first, It last, seed=default_seed ); A signature like that provides a hook to introduce variation between similar types if the author desires, and also allows ranges to be hashed as subranges with the output of one subrange feeding into the next. The default value allows the issue to be safely ignored by naive users. So it's a good idea anyway, and since it also fixes the problem with zeros maybe we don't need to fix that twice. This does leave the issue of other places where hash_combine is called. It would be sufficient if we could be sure that all used a non-zero initial seed, eg: size_t hash_value( std::pair<int,int> p ) { size_t hash = default_seed; combine_hash( hash, p.first ); combine_hash( hash, p.second ); return hash; } and the documentation would encourage that, but we can't really be sure of it. Another approach is to change all the primitive functions. For example: size_t hash_value( int i ) { return static_cast<unsigned int>(i) + default_seed; } Introducing the non-zero constant here is surely as efficient as in combine_hash, if not more so. Admittedly it doesn't eliminate the zero, it just moves it around, but it should massively reduce its dynamic frequency. (Actually it will eliminate the zero where it matters most, ie for types like bool, char and short which are smaller than size_t.)
Did you see Peter's original proposal?
Yes, I had followed most of the links.
For example, the proposal has no specialisation for chars at all. I can't tell if that is an oversight, or whether it is a design aim [...] I don't think it's either. Peter?
Peter has now said it is a design aim. That even you weren't sure illustrates my point :-)
When hash_range is called with a seed argument, it would be equivalent to calling hash_combine for pair<int, int>.
If I understand the principles being advocated here, pair< pair<int,int>, pair<int,int> > should have the same hash_value as int[4]. How do you achieve that without giving hash_value( pair ) an initial seed? ----- I find myself increasingly drawn to the idea of a hasher object. The main signature should be like: template <typename T> Hasher &hash( Hasher &hasher, const T &t ); with implementations looking like, eg: template <typename First, typename Second> Hasher &hash( Hasher &hasher, const pair<First,Second> &p ) { hash( hasher, first ); hash( hasher, second ); return hash; } Then hash_value, if we still need it, would be like: template <typename T> size_t hash_value( T &t ) { Hasher hasher; hash( hasher, t ); hasher.finalise(); return hasher.value(); } but would rarely be called directly. This solves the problem of the default value, allows us to store intermediate results bigger than size_t (which may be important for cryptographic-strength hashing), and the finalise() step provides a hook for the final mixing/avalanching which some algorithms use (eg Paul Hsieh's).
Perhaps it would be better spelt as:
template <class It> void hash_combine(std::size_t& seed, It begin, It end);
Yes, I considered that. I wasn't sure whether overloading the name would help or hinder - hash_range and hash_combine are not substitutable for each other.
Well, we did discus the type of review before it started, and nobody raised any objections.
I don't object to quick reviews in principle (although if someone can only contribute, eg, on Sunday mornings, a 5-day period is going to leave them out completely).
I am not an expert on hashing, but I have implemented my own hashing library. I have encountered practical problems due to different objects having the same hash values.
Is this in C++?
Yes. The main problem was zeros. My hash primitives produced zeros and my hash_combine turned zeros into more zeros. Just as with the initial boost proposal.
I would have thought that type safety would have helped avoid such problems.
This was specifically an issue with optional or alternate types. For example, I had something similar to: size_t MyType::hash_value() const { return use_a ? hash_value(a) : hash_value(b); } where a and b could be different types. We can avoid that mistake in library classes like boost:variant, but that doesn't prevent it recurring in user code. When I am dealing with types without much intrinsic variation, eg bools, I found it important to exploit such variation as I did have as much as possible. Hashing false and 0 to different values helped.
Thank you for reviewing the library and bringing up several good points.
My pleasure. -- Dave Harris, Nottingham, UK