Re: [boost] [review] hash functions

In-Reply-To: <d0kr8r$gl6$1@sea.gmane.org> nesotto@cs.auc.dk (Thorsten Ottosen) wrote (abridged):
It's a pleasure to announce that the review of Daniel James' hash functions library starts today (8th of March) and ends the 12th of March.
I see I have missed the deadline. Oh, well. Several queries have been raised during the review, some of which have merit.

Firstly, I think we have some agreement that the signature:

    size_t hash_range( It first, It last );

is wrong and there should be some way to provide an initial seed. I am not yet decided on what the best signature should be; I am tempted towards a third argument with a default value, but I can also see some merit in mimicking hash_combine() by making the seed the first argument.

Secondly, the proposed implementation for hashing pointers produces undefined behaviour because it subtracts pointers not in the same array. We have some agreement that using reinterpret_cast would be better.

Thirdly, I think there is also a legitimate query over whether the default hash_range should provide better quality hashes for arrays of bytes, eg along the lines discussed in: http://www.azillionmonkeys.com/qed/hash.html

With Boost, interface is more important than implementation, but apparently implementations of the hash library will not be allowed to vary. My impression is that the proposed implementation of hash_combine:

    seed ^= hash_value(v) + (seed << 6) + (seed >> 2);

was designed to combine 8-bit character values; do we have any evidence it also performs well to combine 32-bit values? Perhaps hash_value(v) would be better treated as an array of 4 bytes?

Fourthly, there seems to be a design philosophy behind the current proposal which is not stated explicitly in the documentation. It apparently favours some notion of safety over speed and hash quality, where "safety" means that different objects should have the same hash values if they might look alike to someone. For example, the proposal has no specialisation for chars at all. I can't tell whether that is an oversight, or whether it is a design aim that:

    char chars[] = { 1, 2, 3, 4 };
    int ints[] = { 1, 2, 3, 4 };
    assert( hash_range( chars, chars+4 ) == hash_range( ints, ints+4 ) );

I gather it is a design aim that hash_value( pair<int,int> ) match hash_range( int[2] ). Presumably if hash_range takes a seed argument, this means hash_value must too, else how can it match it?

I am not sure about all this philosophy. I had kinda hoped that a Boost hashing library would offer good-quality, industrial-strength hashing; that it would save me having to do my own research. Regardless, I think the aims and guarantees of the proposed library should be documented more explicitly.

In conclusion, I feel that if we are given only 5 days to review a proposal, that proposal needs to be spot on, uncontroversial, with all issues nailed down accurately in advance. I don't think that is the case here. Although I think all the problems could be resolved, I don't think we have enough time to do that in the given review period. If I have to decide now, I'd say: This should not be accepted as a Boost library.

Stock questions: I have not tried to compile the code. I have looked at the documentation, the source, and followed some of the references. I have raised design questions here and followed the discussion. Total time is a few hours. I am not an expert on hashing, but I have implemented my own hashing library. I have encountered practical problems due to different objects having the same hash values.

-- Dave Harris, Nottingham, UK

Dave Harris wrote:
My impression is that the proposed implementation of hash_combine: seed ^= hash_value(v) + (seed << 6) + (seed >> 2);
was designed to combine 8-bit character values; do we have any evidence it also performs well to combine 32-bit values?
(seed << 6) + (seed >> 2) does "neatly" expand an 8-bit value into a 14-bit one, but the seed is not necessarily an 8-bit value. Think of the above as seed ^= hv + seed * 64.25. But no, we don't have hard data on how it performs on 32-bit values.
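In code, the combining step being discussed is, as a self-contained sketch (std::hash used here as a stand-in for the proposal's hash_value):

    #include <cstddef>
    #include <functional>

    // The combining step under discussion; std::hash is a stand-in for the
    // proposal's hash_value().
    template <class T>
    void hash_combine(std::size_t& seed, const T& v)
    {
        // (seed << 6) + (seed >> 2) == seed*64 + seed/4, hence "seed * 64.25"
        seed ^= std::hash<T>()(v) + (seed << 6) + (seed >> 2);
    }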

Dave Harris wrote:
Fourthly, there seems to be a design philosophy behind the current proposal which is not stated explicitly in the documentation. It apparently favours some notion of safety over speed and hash quality, where "safety" means that different objects should have the same hash values if they might look alike to someone.
"Safety" is not the correct way to put it. "Predictability" or maybe "stability" is the proper word. This is also the reason to fix the implementation, not usually the case when aiming for standardization. The problem is that hash functions vary wildly in quality. Leaving the default function unspecified (or implementation defined) is good if you are lucky and the implementation you happen to use is good. It's a tradeoff. A fixed and predictable default means that you don't have to measure the quality of the hash function for every implementation to which you port your program. Your original tests are still valid. The other side of predictability - that the hash function is not affected by types or the particular representation of a sequence of values - means that you don't necessarily have to re-run the performance benchmark of your unordered_maps when you refactor the key to use int[3] instead of struct { short, short, short }, or a wstring instead of a string.

On Mar 13, 2005, at 11:13 AM, Peter Dimov wrote:
implementation to which you port your program. Your original tests are still valid.
The other side of predictability - that the hash function is not affected by types or the particular representation of a sequence of values - means that you don't necessarily have to re-run the performance benchmark of your unordered_maps when you refactor the key to use int[3] instead of struct { short, short, short }, or a wstring instead of a string.
Fundamentally, what makes for a good hash function is that things that are equal (however you are defining equality) always hash to the same value, while any small difference between two things should cause them to hash to completely different values.

In an arbitrary program, would I expect an int[3] to be equal to a struct { short, short, short } when the values happened to be equal? No. What if the change in representation you spoke of was to rotate the coordinate axes for my vector: would I expect values with the same coordinates in the two different coordinate systems, represented by different structs, to be equal? No. If I have two structs representing complex numbers, one using a Cartesian representation and the other polar, would I expect them to be treated as equal because their coordinates are equal? No -- I would be appalled if they were. A general-purpose hash function that behaves like this is a weak one at best (and in some situations it can be a horrendously bad one).

Your argument is that developers who are making one specific change in representation, out of the hundreds of possible reasonable ones, should be allowed to get away with being slipshod (and yes, not re-running the performance benchmark even with this deliberate weakening would be slipshod, though justifiable under many conditions). I don't buy it.

Now, if you made an argument that the benefits of strengthening it were, in practice, too minor to consider, and the costs (to, e.g., seed the hash combination sequence with a class-dependent value) were too high, then I might agree. Doing something that is theoretically "wrong" because it is pragmatically the "right" thing to do is good engineering. Arguing that something that is theoretically "wrong" is in some real sense the right thing to do is bad engineering, bad math and bad science.

Topher

Topher Cooper wrote:
On Mar 13, 2005, at 11:13 AM, Peter Dimov wrote:
implementation to which you port your program. Your original tests are still valid.
The other side of predictability - that the hash function is not affected by types or the particular representation of a sequence of values - means that you don't necessarily have to re-run the performance benchmark of your unordered_maps when you refactor the key to use int[3] instead of struct { short, short, short }, or a wstring instead of a string.
Fundamentally, what makes for a good hash function is that things that are equal (however you are defining equality) always hash to the same value, while any small difference between two things should cause them to hash to completely different values. In an arbitrary program would I expect an int[3] to be equal to a struct { short, short, short } when the values happened to be equal? No. What if [...]
The intent of the proposal is to provide a reasonable default when tr1::unordered_map is used (and to provide the tools to build such a reasonable default for user-defined compound types). It is not an attempt to build the ultimate hash function (a hopeless task).

Since the keys in an unordered_map are of the same type, it is usually not possible for one of them to contain an int[3] and for another to hold a struct { short, short, short }. A discriminated union type (boost::variant) that is able to hold either type can easily hash_combine the index of the currently active type to avoid these accidental collisions. Similarly, a boost::any-alike can hash_combine the address of a structure that uniquely identifies the type, or typeid().name(). This mirrors their implementations of operator==.

I agree that the current design may not be well suited for a general-purpose hash function that needs to be able to sort objects of arbitrary types into buckets. I'm not sure how often this occurs in practice. Do you have examples?
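A hedged sketch of that suggestion, with std::variant and std::hash standing in for boost::variant and the proposal's hash_value(), assuming every alternative is itself hashable:

    #include <cstddef>
    #include <functional>
    #include <type_traits>
    #include <variant>

    // Mix the index of the active alternative into the seed first, so that
    // equal-looking values of different alternatives don't collide by design.
    template <class... Ts>
    std::size_t variant_hash(const std::variant<Ts...>& v)
    {
        std::size_t seed = v.index();  // discriminate on the active type
        std::visit([&seed](const auto& x) {
            using T = std::decay_t<decltype(x)>;
            seed ^= std::hash<T>()(x) + (seed << 6) + (seed >> 2);
        }, v);
        return seed;
    }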

On Mar 13, 2005, at 2:04 PM, Peter Dimov wrote:
Topher Cooper wrote:
Fundamentally, what makes for a good hash function is that things that are equal (however you are defining equality) always hash to the same value, while any small difference between two things should cause them to hash to completely different values. In an arbitrary program would I expect an int[3] to be equal to a struct { short, short, short } when the values happened to be equal? No. What if [...]
Since the keys in an unordered_map are of the same type, it is usually not possible for one of them to contain int[3] and for another to hold struct { short, short, short }.
A discriminated union type (boost::variant) that is able to hold either type can easily hash_combine the index of the currently active type to avoid these accidental collisions. Similarly, a boost::any-alike can hash_combine the address of a structure that uniquely identifies the type, or typeid().name(). This mirrors their implementation of operator==.
I agree that the current design may not be well suited for a general purpose hash function that needs to be able to sort objects of arbitrary types into buckets. I'm not sure how often this occurs in practice. Do you have examples?
Actually, I agree that if the issue is only vectors, pairs and sequences, then the pragmatics argue for this approach. This behavior, though it is contrary to the spirit of object-oriented programming (at least as I've interpreted that concept for 20+ years), is harmless. Different types (kinds, classes) of objects are different except as specified by is-a inheritance. I would still argue that it is not behavior to be sought, but it is harmless if it is the most convenient way to code things.

I am made very uncomfortable, however, when we put structs (and therefore, by a primary identity in the C++ language, classes) into the mix. Here we may have an inheritance relationship (not a union) where two different types, with different meanings to the quantities, can end up introducing a tendency for rather different things to hash to the same values. I presented the example of two different classes representing polar and Cartesian representations of complex numbers (and, yes, I have written packages where these two representations co-exist and intermix to allow optimizations). Of course, for this example you would clearly want a custom hash, but the point is, subclassing does not make this situation that bizarre. What about a table containing structs for Resources, which could represent either employees or pieces of office equipment, where the characteristics of interest in each happen to all be represented as a bunch of booleans of the same length?

I've perhaps ended up over-stating things. Let me summarize the nitty-gritty: I see no good reason to promote this as *desirable* behavior, and I worry about it occasionally producing some odd results. Good hashing errs on the side of not building in a bias towards inappropriate collisions IF it can be conveniently avoided.

Topher

Topher Cooper wrote:
I am made very uncomfortable, however, when we put structs (and therefore, by a primary identity in the C++ language, classes) into the mix. Here we may have an inheritance relationship (not a union) where two different types, with different meanings to the quantities, can end up introducing a tendency for rather different things to hash to the same values. I presented the example of two different classes representing polar and Cartesian representations of complex numbers (and, yes, I have written packages where these two representations co-exist and intermix to allow optimizations). Of course, for this example you would clearly want a custom hash, but the point is, subclassing does not make this situation that bizarre. What about a table containing structs for Resources, which could represent either employees or pieces of office equipment, where the characteristics of interest in each happen to all be represented as a bunch of booleans of the same length?
Think about how such a class would be stored in the proposed unordered associative containers. If you use unordered_set<base*> then the hash function will be called for the pointer, not the class. If you use unordered_set<boost::shared_ptr<base> >, then it depends on how Peter implements the hash function for shared_ptr, but I expect it will be the pointer again. If you use unordered_set<boost::variant<...> > or unordered_set<boost::any> then a correctly written hash function for boost::variant or boost::any will be fine.

If Thorsten writes ptr_unordered_set<base> then there might be a problem with boost::hash<base>, although the same would equally apply to std::equal_to<base>. boost::hash<X> and boost::hash<Y> (where X != Y) aren't designed to be used together.

Daniel James wrote:
If you use unordered_set<base*> then the hash function will be called for the pointer, not the class. If you use unordered_set<boost::shared_ptr<base> >, then it depends on how Peter implements the hash function for shared_ptr, but I expect it will be the pointer again.
It will have to be the pointer, since the key has to be const. Only a hash function for a 'deep-const' pointer will be able to use the pointee.

"Daniel James" <daniel@calamity.org.uk> wrote in message news:d1409a$kc1$1@sea.gmane.org... | If Thorsten writes ptr_unordered_set<base> then there might be a problem | with boost::hash<base>. Although it will equally apply to | std::equal_to<base>. yeah, I defiintely will do so at some point. I'm thinking ptr_unordered_set sounds worse than unordered_ptr_set, but the former would be more of what we expect. Anyway, ptr_XX generally forwards to the underlying object and not the pointers, but when it comes to hash-functions I can see no good reason for this not to be based on the pointer itself. -Thorsten

I am made very uncomfortable, however, when we put structs (and therefore, by a primary identity in the C++ language, classes) into the mix. Here we may have an inheritance relationship (not a union) where two different types, with different meanings to the quantities, can end up introducing a tendency for rather different things to hash to the same values.
How so? There is no automatic hashing of classes or structs (since there is no reflection in C++), and different classes can't be stored in the same hash container anyway. It's true that a vector<int> and a deque<int> with the same contents would hash to the same value, but again, vectors and deques can't be stored in the same indexed container, and I rather like the predictability that this offers: if I change my code from vector<int> to deque<int> I would expect hashes to behave in the same way (I don't care whether they produce the same values, but I do care that the risk of collisions is the same in either case, given that they are holding the same data). We could use a different seed value for each T ( hash_value(&typeid(T)) ? ), but that seems like overkill.
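For concreteness, such a per-type seed might look like the following sketch; note that the address of a type_info object is not guaranteed unique across shared library boundaries:

    #include <cstddef>
    #include <functional>
    #include <typeinfo>

    // A per-type seed: hash the address of T's type_info object.
    template <class T>
    std::size_t type_seed()
    {
        return std::hash<const void*>()(static_cast<const void*>(&typeid(T)));
    }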
I presented the example of two different classes representing polar and Cartesian representation of complex numbers (and, yes, I have written packages where these two representations co-exist and intermix to allow optimizations).
Can you store these in the same container? If so then how could any implementation tell them apart anyway?
Of course for this example, you would clearly want a custom hash, but the point is, subclassing does not make this situation that bizarre. What about a table containing structs for Resources, which could represent either employees or pieces of office equipment where the characteristics of interest in each happen to all be represented as a bunch of booleans, that happen to be the same length?
Again, either the tables are the same type, and no implementation can tell them apart anyway, or they are different types, and they can't be mixed in the same container anyway. Would this concern be fixed by adding some information to the documentation?

John.

"Topher Cooper" <topher@topherc.net> wrote in message news:5d405343dda154b3e8190cf9087b52a1@topherc.net... | | On Mar 13, 2005, at 11:13 AM, Peter Dimov wrote: | > implementation to which you port your program. Your original tests are | > still valid. | > | > The other side of predictability - that the hash function is not | > affected by types or the particular representation of a sequence of | > values - means that you don't necessarily have to re-run the | > performance benchmark of your unordered_maps when you refactor the key | > to use int[3] instead of struct { short, short, short }, or a wstring | > instead of a string. | > | | Fundamentally, what makes for a good hash function is that things that | are equal (however you are defining equality) always hash to the same | value, while any small difference between two things should cause them | to hash to completely different values. In an arbitrary program would | I expect an int[3] to be equal to a struct { short, short, short } when | the values happened to be equal? No. What if the change in | representation you spoke of was to rotate the coordinate axes for my | vector, would I expect values with the same coordinates in the two | different coordinate-systems, represented by different structs to be | equal? No. If I have two structs representing complex numbers, one | using a Cartesian representation and the other polar would I expect | them to be treated as equal because their coordinates are equal? No -- | I would be appalled if they were. can you point to any examples where this is a problem in real code? I mean, the hashing is all about looking up some element, as long as I get it, then what do I care how its gotten? In particular, we mostly have a container of *one* type, so what does it matter that a different type could have a similar hash index? br -Thorten

Dave Harris wrote:
Firstly, I think we have some agreement that the signature:
size_t hash_range( It first, It last );
is wrong
I didn't get the impression of any agreement about that.
and there should be some way to provide an initial seed.
I do agree with this, though.
Secondly, the proposed implementation for hashing pointers produces undefined behaviour because it subtracts pointers not in the same array. We have some agreement that using reinterpret_cast would be better.
Yes, and this will be fixed.
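Presumably the fix amounts to something along these lines (a sketch, with std::uintptr_t assumed available):

    #include <cstddef>
    #include <cstdint>

    // Hash a pointer by converting it to an integer, avoiding the undefined
    // behaviour of subtracting pointers that are not in the same array.
    template <class T>
    std::size_t pointer_hash(T* p)
    {
        return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p));
    }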
Thirdly, I think there is also a legitimate query over whether the default hash_range should provide better quality hashes for arrays of bytes, eg along the lines discussed in: http://www.azillionmonkeys.com/qed/hash.html
Well, I was the one who pointed to that one. I'll keep quiet in the future ;)
With Boost, interface is more important than implementation, but apparently implementations of the hash library will not be allowed to vary.
This isn't exactly true. It is proposed that the standard explicitly specifies the hash functions (and I've been criticized for not doing that) but part of the purpose of this library is to get real world experience on what the specification should be. And change it, if necessary.
My impression is that the proposed implementation of hash_combine: seed ^= hash_value(v) + (seed << 6) + (seed >> 2);
was designed to combine 8-bit character values; do we have any evidence it also performs well to combine 32-bit values? Perhaps hash_value(v) would be better treated as an array of 4 bytes?
I do intend to spend more time looking into alternatives in the future (and will look into adding a constant to hash_combine very soon).
Fourthly, there seems to be a design philosophy behind the current proposal which is not stated explicitly in the documentation. It apparently favours some notion of safety over speed and hash quality, where "safety" means that different objects should have the same hash values if they might look alike to someone.
Did you see Peter's original proposal? It's issue 6.18 in: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1756.pdf Looking back at the documentation I should make the link more obvious. Originally, I had a link in the first sentence, but that slipped while editing it. Maybe (if it's okay with Peter) I should copy some of it into the documentation.
For example, the proposal has no specialisation for chars at all. I can't tell if that is an oversight, or whether it is a design aim that:
char chars[] = { 1, 2, 3, 4 };
int ints[] = { 1, 2, 3, 4 };
assert( hash_range( chars, chars+4 ) == hash_range( ints, ints+4 ) );
I don't think it's either. Peter?
I gather it is a design aim that hash_value( pair<int,int> ) match hash_range( int[2] ). Presumably if hash_range takes a seed argument, this means hash_value must too, else how can it match it?
When hash_range is called with a seed argument, it would be equivalent to calling hash_combine for pair<int, int>. Perhaps it would be better spelt as:

    template <class It> void hash_combine(std::size_t& seed, It begin, It end);
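A sketch of that spelling, with std::hash standing in for the proposal's per-element hash_value():

    #include <cstddef>
    #include <functional>
    #include <iterator>

    // Fold every element of [begin, end) into an existing seed.
    template <class It>
    void hash_combine(std::size_t& seed, It begin, It end)
    {
        typedef typename std::iterator_traits<It>::value_type value_type;
        for (; begin != end; ++begin)
            seed ^= std::hash<value_type>()(*begin) + (seed << 6) + (seed >> 2);
    }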
I am not sure about all this philosophy. I had kinda hoped that a Boost hashing library would offer good-quality, industrial-strength hashing; that it would save me having to do my own research. Regardless, I think the aims and guarantees of the proposed library should be documented more explicitly.
In conclusion, I feel that if we are given only 5 days to review a proposal, that proposal needs to be spot on, uncontroversial, with all issues nailed down accurately in advance. I don't think that is the case here. Although I think all the problems could be resolved, I don't think we have enough time to do that in the given review period.
Well, we did discuss the type of review before it started, and nobody raised any objections. I'm sorry that you didn't feel it was enough. This does not have to be the end of the discussion.

On a related note, I am going to be requesting a review for the unordered associative containers soon. It has been suggested that since they are based on TR1 they should have a shorter-than-normal review (but longer than this one). So I guess you think it should have a full-length review? Although, it might turn out to be less controversial, since it doesn't add anything to the specification - which is the reverse of what I thought.
If I have to decide now, I'd say:
This should not be accepted as a Boost library.
I'm sorry to hear that. Thank you for reviewing the library and bringing up several good points.
I am not an expert on hashing, but I have implemented my own hashing library. I have encountered practical problems due to different objects having the same hash values.
Is this in C++? I would have thought that type safety would have helped avoid such problems. To use the hash function object with a polymorphic type you'll need to write your own function which takes account of the type. Such a function could be written for boost::variant and similar types.

Daniel
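Such a user-written function might look like the following sketch; 'shape' and 'hash_members' are invented names, not part of the proposal, and the dynamic type is mixed in via typeid:

    #include <cstddef>
    #include <functional>
    #include <typeinfo>

    // A user-written hash for a polymorphic hierarchy that takes the
    // dynamic type into account before combining the per-class value hash.
    struct shape
    {
        virtual ~shape() {}
        virtual std::size_t hash_members() const = 0;  // per-class value hash
    };

    std::size_t shape_hash(const shape& s)
    {
        std::size_t seed = std::hash<const void*>()(&typeid(s));  // dynamic type
        seed ^= s.hash_members() + (seed << 6) + (seed >> 2);
        return seed;
    }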

Daniel James wrote:
Dave Harris wrote:
For example, the proposal has no specialisation for chars at all. I can't tell if that is an oversight, or whether it is a design aim that:

    char chars[] = { 1, 2, 3, 4 };
    int ints[] = { 1, 2, 3, 4 };
assert( hash_range( chars, chars+4 ) == hash_range( ints, ints+4 ) );
I don't think it's either. Peter?
Sequences of the same values produce the same hash value by design. In a language where character values are different from integer values this would not imply char[] and int[] compatibility, of course, but a C++ "char" is not necessarily a character, and "unsigned int" is not necessarily a non-character.

"Dave Harris" <brangdon@cix.compulink.co.uk> wrote in message news:memo.437373@cix.compulink.co.uk... | In conclusion, I feel that if we given only 5 days to review a proposal, | that proposal needs to be spot on, uncontroversial, with all issues nailed | accurately in advance. I don't think that is the case here. Although I | think all the problems could be resolved, I don't think we have enough | time to do that in the given review period. Then *please* continue the discussion in this thread. All those interersted will take part of the discussion and I won't make any review results until I feel this particular discussion is over. best regards Thorsten, review manager