[functional/hash] integer hash function

i was quite amazed that the default hash function for integers seems to be the identity function. from boost/functional/hash/hash.hpp:

inline std::size_t hash_value(int v) { return static_cast<std::size_t>(v); }

while this is probably the most efficient hash function, it is also the worst possible. is this done intentionally?

tim

On 4 October 2011 15:03, Tim Blechmann <tim@klingt.org> wrote:
inline std::size_t hash_value(int v) { return static_cast<std::size_t>(v); }
while this is probably the most efficient hash function, it is also the worst possible. is this done intentionally?
Why is this the "worst possible"? It never has any collisions (if sizeof(int) <= sizeof(size_t)), and typical use is mod-ing it with a prime number to determine the bucket for an unordered container. gcc also uses this for trivial hashes.

--
Nevin ":-)" Liber <nevin@eviloverlord.com> (847) 691-1404
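A minimal sketch of the bucketing Nevin describes, under the assumption that the container picks a prime bucket count; bucket_index is an illustrative name, not part of Boost:

#include <cstddef>

// Identity hash, as in boost/functional/hash/hash.hpp.
inline std::size_t hash_value(int v) { return static_cast<std::size_t>(v); }

// Illustrative bucket selection: with a prime bucket count (e.g. 97),
// even clustered or strided keys spread across the buckets reasonably well.
inline std::size_t bucket_index(int key, std::size_t bucket_count)
{
    return hash_value(key) % bucket_count;
}

With a prime modulus, keys following a fixed stride still land in distinct buckets unless the stride happens to be a multiple of the bucket count.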

On 10/04/2011 10:03 PM, Tim Blechmann wrote:
i was quite amazed that the default hash function for integers seems to be the identity function:
from boost/functional/hash/hash.hpp: inline std::size_t hash_value(int v) { return static_cast<std::size_t>(v); }
while this is probably the most efficient hash function, it is also the worst possible. is this done intentionally?
It's only bad if you know your possible values are not equally distributed over the whole range of "int". Which is not something a default hashing function can know.

Mathias Gaunard wrote:
On 10/04/2011 10:03 PM, Tim Blechmann wrote:
i was quite amazed that the default hash function for integers seems to be the identity function:
from boost/functional/hash/hash.hpp: inline std::size_t hash_value(int v) { return static_cast<std::size_t>(v); }
while this is probably the most efficient hash function, it is also the worst possible. is this done intentionally?
It's only bad if you know your possible values are not equally distributed over the whole range of "int".
The distribution of the input is irrelevant. It's always mirrored in the distribution of the output (regardless of the hash function, as long as there are no collisions) because there is 1:1 correspondence between the input and output values. The problem arises when output bits are discarded. In the pathological case, if the input is 0..31 and one discards the bottom 5 bits, the result is always 0. But the default hash function can't know that you're going to discard bits, or how to fix that for you.
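A worked version of the pathological case Peter describes, assuming a (hypothetical) container that forms the bucket index by shifting away the low bits; this is illustrative only, not how Boost's containers work:

#include <cstddef>
#include <iostream>

int main()
{
    // Keys 0..31 hashed with the identity function; a container that
    // derives the bucket index by discarding the bottom 5 bits puts
    // every key into bucket 0.
    for (int v = 0; v < 32; ++v)
    {
        std::size_t h = static_cast<std::size_t>(v); // identity hash
        std::cout << v << " -> bucket " << (h >> 5) << '\n'; // always 0
    }
}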

The distribution of the input is irrelevant. It's always mirrored in the distribution of the output (regardless of the hash function, as long as there are no collisions) because there is 1:1 correspondence between the input and output values. The problem arises when output bits are discarded. In the pathological case, if the input is 0..31 and one discards the bottom 5 bits, the result is always 0. But the default hash function can't know that you're going to discard bits, or how to fix that for you.
fwiw, there are algorithms like knuth's multiplicative hash, which are almost as fast as a trivial hash function, but provide at least a certain degree of hash distribution ... this is what i'm now using for my use case

tim
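For reference, a sketch of a Knuth-style multiplicative hash of the kind Tim mentions (not his actual code; the constant and width shown are one common choice):

#include <cstdint>

// 2654435761u is Knuth's 32-bit golden-ratio constant (roughly 2^32 / phi).
// The multiplication (mod 2^32) mixes low input bits into the high output
// bits, so the result tolerates later truncation far better than the
// identity hash, at the cost of a single multiply.
inline std::uint32_t knuth_hash(std::uint32_t v)
{
    return v * 2654435761u;
}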

On 5 Oct 2011, at 05:54, Peter Dimov wrote:
Mathias Gaunard wrote:
On 10/04/2011 10:03 PM, Tim Blechmann wrote:
i was quite amazed that the default hash function for integers seems to be the identity function:
from boost/functional/hash/hash.hpp: inline std::size_t hash_value(int v) { return static_cast<std::size_t>(v); }
while this is probably the most efficient hash function, it is also the worst possible. is this done intentionally?
It's only bad if you know your possible values are not equally distributed over the whole range of "int".
The distribution of the input is irrelevant. It's always mirrored in the distribution of the output (regardless of the hash function, as long as there are no collisions) because there is 1:1 correspondence between the input and output values. The problem arises when output bits are discarded. In the pathological case, if the input is 0..31 and one discards the bottom 5 bits, the result is always 0. But the default hash function can't know that you're going to discard bits, or how to fix that for you.
A default hash function certainly can shuffle bits, to create a better distribution. It all depends on what we want the hash function to do: simply reduce the risk of collisions, or also have low correlation between the distribution of input and output data. At the moment Boost has chosen to do only the first. There are applications where people also want the second, and adding it would carry a small cost. The question is whether Boost should do that. I would say not, as it is simple to add one, a shuffle/distribution function, which can be applied to the hash, if the user wants one. It might still be worth documenting this weakness however.

Chris
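One possible shape for the shuffle/distribution step Chris has in mind, applied on top of the existing hash. This is a sketch only; the constants are the 64-bit finalizer from MurmurHash3, used here purely as an example, not anything Boost provides:

#include <cstdint>

// Bit-mixing finalizer (MurmurHash3, 64-bit variant).  Applied after
// hash_value, it spreads a single-bit input difference across all bits
// of the result, so truncating or masking the hash stays well behaved.
inline std::uint64_t mix_bits(std::uint64_t h)
{
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}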

Christopher Jefferson wrote:
A default hash function certainly can shuffle bits, to create a better distribution.
What does "better distribution" mean?
It might still be worth documenting this weakness however.
This is not a weakness. The definition of a hash function is that equal inputs produce the same output and different inputs should produce different outputs. This is exactly what this function does. If an algorithm doesn't work well with this hash function, this is a sign that the algorithm itself has a weakness, because it doesn't work well with functions that satisfy the requirements for a hash function. In other words, the algorithm places additional requirements on its hash function, requirements that a hash function author has no way of knowing.

Now, to be honest, I know what the perceived defect is: that the default hash function is not cryptographic enough. In particular, that a single bit change in the input results in a predictable single bit change in the output, whereas one would like it to result in random bit changes all over the size_t. But, again, if your algorithm requires this property, it is not generic and will not necessarily work with a hash function written by someone who is not aware of this requirement. If you require the output bits to be properly shuffled, do that yourself; this is something that all hash functions may potentially need. Which is basically what you said. :-)
I would say not, as it is simple to add one, a shuffle/distribution function, which can be applied to the hash, if the user wants one.
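A sketch of the "do that yourself" approach: wrap boost::hash in a hasher that applies a bit-mixing finalizer and hand that hasher to the container. mixed_hash and mixed_set are hypothetical names; only boost::hash and boost::unordered_set are actual Boost components here:

#include <boost/functional/hash.hpp>
#include <boost/unordered_set.hpp>
#include <cstddef>
#include <cstdint>

// Hasher that post-processes boost::hash with the MurmurHash3 64-bit
// finalizer before handing the value to the container.
struct mixed_hash
{
    std::size_t operator()(int v) const
    {
        std::uint64_t h = boost::hash<int>()(v);
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return static_cast<std::size_t>(h);
    }
};

// Container using the mixed hasher instead of the default boost::hash<int>.
typedef boost::unordered_set<int, mixed_hash> mixed_set;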
participants (5)
- Christopher Jefferson
- Mathias Gaunard
- Nevin Liber
- Peter Dimov
- Tim Blechmann