
In-Reply-To: <001001c529b2$4841da80$6501a8c0@pdimov2> pdimov@mmltd.net (Peter Dimov) wrote (abridged):
> Alignment is only going to be a problem if the lower sizeof(size_t) * CHAR_BIT bits of the value produced by reinterpret_cast do not contain enough entropy. I think that in practice the hash function will be good enough.

It partly depends on how the hash_value is used. If the value is always a multiple of 8, say, and it is used with a bucket count which is a power of 2, then only 1 out of every 8 buckets will get used. Entropy in the higher bits doesn't help.

You may be assuming that no-one will use a bucket count which is not a prime number. I'd rather not make that assumption if we can reasonably avoid it.

It shouldn't matter if the values are pushed through hash_combine a few times, but that might not always happen.

-- Dave Harris, Nottingham, UK
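
As a rough illustration of the effect Dave describes, here is a minimal sketch, assuming hash values taken directly from 8-byte aligned pointers and a bucket count of 32 (both numbers are chosen only for illustration, not taken from the thread):

    #include <cstddef>
    #include <iostream>
    #include <set>

    int main()
    {
        // Stand-ins for hash values produced by reinterpret_cast of
        // 8-byte aligned pointers: every value is a multiple of 8.
        std::size_t const bucket_count = 32;   // power of two, for illustration
        std::set<std::size_t> used;

        for (std::size_t i = 0; i < 1000; ++i)
        {
            std::size_t hv = i * 8;            // entropy only above the low 3 bits
            used.insert(hv % bucket_count);    // hv % 2^n keeps just the low n bits
        }

        // Prints "4 of 32 buckets used": only 1 in 8 buckets is ever hit,
        // no matter how much entropy the higher bits carry.
        std::cout << used.size() << " of " << bucket_count << " buckets used\n";
    }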

Dave Harris wrote:
> In-Reply-To: <001001c529b2$4841da80$6501a8c0@pdimov2> pdimov@mmltd.net (Peter Dimov) wrote (abridged):
>> Alignment is only going to be a problem if the lower sizeof(size_t) * CHAR_BIT bits of the value produced by reinterpret_cast do not contain enough entropy. I think that in practice the hash function will be good enough.
> It partly depends on how the hash_value is used. If the value is always a multiple of 8, say, and it is used with a bucket count which is a power of 2, then only 1 out of every 8 buckets will get used. Entropy in the higher bits doesn't help.
> You may be assuming that no-one will use a bucket count which is not a prime number. I'd rather not make that assumption if we can reasonably avoid it.
This is not an assumption. It is the specification of hash_value that determines how it is to be used. Currently it does not offer any guarantees for the entropy of any particular bits, so the client of hash_value is expected to not do something stupid, like throw away 28 of its 32 bits.

Peter Dimov wrote:
> This is not an assumption. It is the specification of hash_value that determines how it is to be used. Currently it does not offer any guarantees for the entropy of any particular bits, so the client of hash_value is expected to not do something stupid, like throw away 28 of its 32 bits.
From google's sparse hash table docs:

"(As a side note, using a table size that's a power of two has several advantages, including the speed of calculating (i % table_size). On the other hand, power-of-two tables are not very forgiving of a poor hash function. Make sure your hash function is a good one! There are plenty of dos and don'ts on the web (and in Knuth), for writing hash functions.)"

But they seem to be using SGI's hash function, which isn't very good for power of two hash tables. I tried changing their benchmarks (which use increasing integers for values, so there are very few collisions) to only use multiples of 8 and performance dropped as you'd expect.

The problem here is that a hash function which works well with such a container would probably punish hash tables which don't discard bits. A better approach to using a power of 2 might be to apply a further transformation to the hash value, which might take into account the current hash table size.

Daniel
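
One possible form of the further transformation Daniel mentions is a minimal sketch along these lines, assuming a table of size 2^bits and Knuth-style multiplicative hashing (the constant and the name fold_to_bucket are illustrative, not an interface proposed anywhere in the thread):

    #include <cstddef>
    #include <cstdint>

    // Reduce a hash value to an index into a table of size 2^bits by
    // multiplying with a large odd constant and keeping the top bits,
    // so that entropy anywhere in hv can influence the bucket index.
    inline std::size_t fold_to_bucket(std::uint64_t hv, unsigned bits)
    {
        std::uint64_t const k = 0x9E3779B97F4A7C15ull;               // ~2^64 / golden ratio
        return static_cast<std::size_t>((hv * k) >> (64 - bits));    // requires 1 <= bits <= 63
    }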

Daniel James wrote:
> Peter Dimov wrote:
>> This is not an assumption. It is the specification of hash_value that determines how it is to be used. Currently it does not offer any guarantees for the entropy of any particular bits, so the client of hash_value is expected to not do something stupid, like throw away 28 of its 32 bits.
> From google's sparse hash table docs:
> "(As a side note, using a table size that's a power of two has several advantages, including the speed of calculating (i % table_size). On the other hand, power-of-two tables are not very forgiving of a poor hash function. Make sure your hash function is a good one! There are plenty of dos and don'ts on the web (and in Knuth), for writing hash functions.)"
It seems intuitively obvious to me that we should stick with the original requirements on hash_value, that is, "equivalent inputs produce equal outputs, distinct inputs likely produce distinct outputs". That's because the hash table components are outnumbered by the hash_value overloads by several orders of magnitude, so it makes sense to shift the responsibility to produce a good distribution away from hash_value.

As for table sizes being a power of two being a source of speedup, this is obviously only true when hv % 2^n doesn't produce collisions. If the key set doesn't already have this property, the cost of fixing it (in the general case, where it is not known which bits vary, something like x ^ (x % p1 + x * p2) needs to be used) may cancel the speedup obtained by using a 2^n table size.

On the other hand, it doesn't seem to cost us anything to apply x + (x >> 3) to our pointer hash function, so we might as well do it.
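
A minimal sketch of what applying x + (x >> 3) to the pointer hash could look like, illustrating the idea only (this is not the exact Boost.Hash source, and it assumes size_t is wide enough to hold a pointer value, as on common platforms):

    #include <cstddef>

    // Illustrative pointer hash: take the address bits and add in a
    // right-shifted copy, so the low bits zeroed by 8-byte alignment
    // are filled with varying higher bits.
    template <class T>
    std::size_t hash_pointer(T* p)
    {
        std::size_t x = reinterpret_cast<std::size_t>(p);
        return x + (x >> 3);
    }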

Peter Dimov wrote:
> As for table sizes being a power of two being a source of speedup, this is obviously only true when hv % 2^n doesn't produce collisions. If the key set doesn't already have this property, the cost of fixing it (in the general case, where it is not known which bits vary, something like x ^ (x % p1 + x * p2) needs to be used) may cancel the speedup obtained by using a 2^n table size.
I agree. This wasn't clear in my last mail, but when I wrote:
> A better approach to using a power of 2 might be to apply a further transformation to the hash value, which might take into account the current hash table size.
I meant that a power of 2 hash table should do this, and might be able to do it better since it knows how many bits are required.
> On the other hand, it doesn't seem to cost us anything to apply x + (x >> 3) to our pointer hash function, so we might as well do it.
Okay, I'll do that.
participants (3)
- brangdon@cix.compulink.co.uk
- Daniel James
- Peter Dimov