[hash] regular behaviour of hash function for double values

Hi, the following code:

    double fStart = 31337;
    double fEnd = 65536;
    double fStep = fStart - fEnd;
    fStep /= 512;
    for (int i = 0; i < 512; i++)
    {
        std::cout << std::hex << boost::hash<double>()(fStart + i * fStep) << std::endl;
    }

leads to the output of surprisingly regular hash values (see end of mail). This seems to occur mainly if a power of two number of points is sampled equidistantly within an interval. Especially since some hash table implementations (e.g. intel's threading building blocks) only use the last few bits to determine the bucket, this can lead to very bad caching behaviour for the rather common case of regularly sampled points (equidistant samplings of power of 2 are e.g. common in computer graphics and numerical algorithms).

Are you aware of this? Would you consider it a bug?

Best regards,
Kai Schroeder

Output of code using Boost 1.48 (similar behaviour can be observed for other start and end values and other power of 2 step counts):

ff9f20000000000f e5042f900000000f 52f13f200000000f 297ac8b00000000f 48721e400000000f 2f9b8dd00000000f 3df0b1600000000f d19cf2f00000000f 9201c800000000f 9409aa100000000f 7cb3fba00000000f d7a42b300000000f b15982c00000000f 5f7900500000000f e0ba85e00000000f 255ee5700000000f cec119000000000f 955bd6900000000f 6a58f4200000000f 843a37b00000000f 590797400000000f 7e8db4d00000000f 1721b6600000000f 37acf9f00000000f df5f65800000000f 88b1a5100000000f 494a0a00000000f bb1082300000000f 1ea62bc00000000f ...
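For illustration, a minimal sketch (not from the original mail) of the effect being described: a table with a power-of-two bucket count that, like TBB, chooses the bucket from the low bits collapses these 512 distinct keys into very few buckets. The bucket count of 512 is an assumption made for the example.

    #include <boost/functional/hash.hpp>
    #include <iostream>
    #include <set>

    int main()
    {
        double fStart = 31337, fEnd = 65536;
        double fStep = (fStart - fEnd) / 512;
        const std::size_t bucket_count = 512;              // power of two, TBB-style
        std::set<std::size_t> buckets;
        for (int i = 0; i < 512; ++i)
        {
            std::size_t h = boost::hash<double>()(fStart + i * fStep);
            buckets.insert(h & (bucket_count - 1));        // bucket chosen from the low bits only
        }
        // With 512 distinct keys and 512 buckets a well-mixed hash would occupy
        // most buckets; the regular low bits above occupy far fewer.
        std::cout << buckets.size() << " of " << bucket_count << " buckets used\n";
    }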

On Tue, Jan 31, 2012 at 4:10 PM, Kai Schroeder <kaischroeder3@googlemail.com> wrote:
Especially since some hash table implementations (e.g. intel's threading building blocks) only use the last few bits to determine the bucket, this can lead to very bad caching behaviour for the rather common case of regularly sampled points (equidistant samplings of power of 2 are e.g. common in computer graphics and numerical algorithms).
Are you aware of this? Would you consider it a bug?
Yes, sounds like a bug in TBB. I don't think it's a good idea to discard bits. Olaf

I have found that this behaviour of boost hash is described in the documentation: http://www.boost.org/doc/libs/1_48_0/doc/html/hash/rationale.html

So, this is by design. This rationale is understandable, however it can lead to difficult-to-find performance problems, as in our case. Btw.: std::hash in MSVC10 does not show this behaviour. Well, I tend to agree that this is a bug in TBB, still I would personally prefer a better hash function (even if it is a bit slower) by default. Maybe a fast_hash option could be provided as an alternative.

Best regards,
Kai

On 1/31/12, Olaf van der Spek <ml@vdspek.org> wrote:
On Tue, Jan 31, 2012 at 4:10 PM, Kai Schroeder <kaischroeder3@googlemail.com> wrote:
Especially since some hash table implementations (e.g. intel's threading building blocks) only use the last few bits to determine the bucket, this can lead to very bad caching behaviour for the rather common case of regularly sampled points (equidistant samplings of power of 2 are e.g. common in computer graphics and numerical algorithms).
Are you aware of this? Would you consider it a bug?
Yes, sounds like a bug in TBB. I don't think it's a good idea to discard bits.
Olaf

[Kai Schroeder]
I have found that this behaviour of boost hash is described in the documentation: http://www.boost.org/doc/libs/1_48_0/doc/html/hash/rationale.html
So, this is by design. This rationale is understandable, however it can lead to difficult-to-find performance problems, as in our case. Btw.: std::hash in MSVC10 does not show this behaviour. Well, I tend to agree that this is a bug in TBB, still I would personally prefer a better hash function (even if it is a bit slower) by default. Maybe a fast_hash option could be provided as an alternative.
In VC11, we've reimplemented std::hash. We now use FNV-1a for everything, which is incredibly fast (significantly faster than before, according to my measurements) and pretty good at mixing bits. STL
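For reference, a minimal sketch of 64-bit FNV-1a over the bytes of an object, using the published FNV constants; this only illustrates the algorithm STL names, it is not MSVC's actual implementation.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // FNV-1a over an arbitrary byte range (64-bit offset basis and prime).
    std::uint64_t fnv1a(const unsigned char* p, std::size_t n)
    {
        std::uint64_t h = 14695981039346656037ULL;   // offset basis
        for (std::size_t i = 0; i < n; ++i)
        {
            h ^= p[i];                               // xor first...
            h *= 1099511628211ULL;                   // ...then multiply (the "a" in FNV-1a)
        }
        return h;
    }

    // E.g. applied to a double's object representation.
    std::uint64_t hash_double(double d)
    {
        unsigned char buf[sizeof d];
        std::memcpy(buf, &d, sizeof d);
        return fnv1a(buf, sizeof buf);
    }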

On 31 January 2012 16:19, Kai Schroeder <kaischroeder3@googlemail.com> wrote:
Well, I tend to agree that this is a bug in TBB
I actually don't. If TBB requires a better quality hash function, then that's fine. I also wouldn't use boost::hash or std::hash with google's hash tables.

On 31 Jan 2012, at 19:58, Daniel James wrote:
On 31 January 2012 16:19, Kai Schroeder <kaischroeder3@googlemail.com> wrote:
Well, I tend to agree that this is a bug in TBB
I actually don't. If TBB requires a better quality hash function, then that's fine. I also wouldn't use boost::hash or std::hash with google's hash tables.
It would be easy to provide a boost::hash_shuffle that could be applied to any boost::hash and provide this stronger requirement (that there is no correlation between (a-b) and (hash(a)-hash(b))). This would avoid the need to re-write all the existing hash functions. Chris
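One way such a wrapper could look; hash_shuffle is only the hypothetical name from the post above, and the post-mix used here is the MurmurHash3 64-bit finalizer, chosen purely as an example of a bit-mixing step.

    #include <boost/functional/hash.hpp>
    #include <cstddef>
    #include <cstdint>

    // Hypothetical adaptor: runs an extra bit-mixing step over the result of any
    // hash function, so nearby inputs no longer produce nearby hash values.
    template <class Hash>
    struct hash_shuffle
    {
        Hash hash;

        template <class T>
        std::size_t operator()(const T& x) const
        {
            std::uint64_t h = hash(x);
            h ^= h >> 33;                     // MurmurHash3 fmix64 finalizer
            h *= 0xff51afd7ed558ccdULL;
            h ^= h >> 33;
            h *= 0xc4ceb9fe1a85ec53ULL;
            h ^= h >> 33;
            return static_cast<std::size_t>(h);
        }
    };

    // Usage: hash_shuffle<boost::hash<double> >()(3.14);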

On 31 January 2012 20:31, Christopher Jefferson <chris@bubblescope.net> wrote:
On 31 Jan 2012, at 19:58, Daniel James wrote:
On 31 January 2012 16:19, Kai Schroeder <kaischroeder3@googlemail.com> wrote:
Well, I tend to agree that this is a bug in TBB
I actually don't. If TBB requires a better quality hash function, then that's fine. I also wouldn't use boost::hash or std::hash with google's hash tables.
It would be easy to provide a boost::hash_shuffle that could be applied to any boost::hash and provide this stronger requirement (that there is no correlation between (a-b) and (hash(a)-hash(b))). This would avoid the need to re-write all the existing hash functions.
Feel free to do so. Although, I'd probably write it to use better algorithms where possible, and just use boost::hash as a fallback for when they're not. I think I've mentioned before that I regret calling it 'boost::hash'; 'boost::functional::hash' would have been a better name.

Btw., to go back to the original post, I might change the mix function for floats, it might be problematic as it is. Also, the float hash is actually a bit slow because it doesn't rely on the binary representation of floating point numbers. But when the binary representation is known to be safely hashable, it might be better to hash that. I already do that for cygwin because the floating point functions weren't available there. The problem is accounting for the different representations, which I don't want to spend too much time dealing with.
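A minimal sketch of that approach, assuming IEEE-754 64-bit doubles; a portable implementation would also have to handle long double and platforms with other floating point representations, which is the accounting Daniel refers to.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Hash a double via its bit pattern. -0.0 compares equal to 0.0, so it has to
    // be normalised first or two equal keys would hash differently.
    std::size_t hash_double_bits(double d)
    {
        if (d == 0.0)
            d = 0.0;                          // collapse -0.0 onto +0.0
        std::uint64_t bits;
        std::memcpy(&bits, &d, sizeof bits);  // assumes sizeof(double) == 8
        return static_cast<std::size_t>(bits ^ (bits >> 32));  // fold for 32-bit size_t
    }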

on Tue Jan 31 2012, Daniel James <dnljms-AT-gmail.com> wrote:
On 31 January 2012 20:31, Christopher Jefferson <chris@bubblescope.net> wrote:
On 31 Jan 2012, at 19:58, Daniel James wrote:
On 31 January 2012 16:19, Kai Schroeder <kaischroeder3@googlemail.com> wrote:
Well, I tend to agree that this is a bug in TBB
I actually don't. If TBB requires a better quality hash function, then that's fine. I also wouldn't use boost::hash or std::hash with google's hash tables.
It would be easy to provide a boost::hash_shuffle that could be applied to any boost::hash and provide this stronger requirement (that there is no correlation between (a-b) and (hash(a)-hash(b))). This would avoid the need to re-write all the existing hash functions.
Feel free to do so. Although, I'd probably write it to use better algorithms where possible, and just use boost::hash as a fallback for when they're not. I think I've mentioned before that I regret calling it 'boost::hash'; 'boost::functional::hash' would have been a better name.
Would you mind explaining why? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 1 February 2012 15:09, Dave Abrahams <dave@boostpro.com> wrote:
on Tue Jan 31 2012, Daniel James <dnljms-AT-gmail.com> wrote:
Feel free to do so. Although, I'd probably write it to use better algorithms where possible, and just use boost::hash as a fallback for when they're not. I think I've mentioned before that I regret calling it 'boost::hash'; 'boost::functional::hash' would have been a better name.
Would you mind explaining why?
There has been talk of other, more comprehensive, hashing libraries since then which would like to use the namespace 'boost::hash', but it isn't available. It feels wrong to me that such a slight library grabbed the name. Since it was accepted, boost has moved more towards putting classes in sub-namespaces. 'functional' isn't that big a deal; it could be 'boost::container::hash', or 'boost::unordered::hash'.

on Tue Jan 31 2012, Olaf van der Spek <ml-AT-vdspek.org> wrote:
On Tue, Jan 31, 2012 at 4:10 PM, Kai Schroeder <kaischroeder3@googlemail.com> wrote:
Especially since some hash table implementations (e.g. intel's threading building blocks) only use the last few bits to determine the bucket, this can lead to very bad caching behaviour for the rather common case of regularly sampled points (equidistant samplings of power of 2 are e.g. common in computer graphics and numerical algorithms).
Are you aware of this? Would you consider it a bug?
Yes, sounds like a bug in TBB. I don't think it's a good idea to discard bits.
Perhaps not, but it seems to me that we have a problem, too. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Tuesday, January 31, 2012 11:19:29 Dave Abrahams wrote:
on Tue Jan 31 2012, Olaf van der Spek <ml-AT-vdspek.org> wrote:
On Tue, Jan 31, 2012 at 4:10 PM, Kai Schroeder
<kaischroeder3@googlemail.com> wrote:
Especially since some hash table implementations (e.g. intel's threading building blocks) only use the last few bits to determine the bucket, this can lead to very bad caching behaviour for the rather common case of regularly sampled points (equidistant samplings of power of 2 are e.g. common in computer graphics and numerical algorithms).
Are you aware of this? Would you consider it a bug?
Yes, sounds like a bug in TBB. I don't think it's a good idea to discard bits.
Perhaps not, but it seems to me that we have a problem, too.
In case you missed it, there was this [1] conversation which basically boiled down to where bit mixing of the hash value should be implemented - in the hash function or the container itself. The standard doesn't mandate this, AFAICT, but my understanding is that it's the hash function's job.

[1] <http://news.gmane.org/find-root.php?group=gmane.comp.lib.boost.devel&article=227609>

Andrey Semashev wrote:
In case you missed it, there was this [1] conversation which basically boiled down to where bit mixing of the hash value should be implemented - in the hash function or the container itself. The standard doesn't mandate this, AFAICT,
At least that is how I understood that conversation.
but my understanding is that it's the hash function's job.
What do you mean by "my understanding"? Your understanding of the standard, or your understanding of the best solution, or your understanding of that conversation? You also advocated your "understanding" during that conversation. However, I got the impression that the other party didn't really want to discuss that: Daniel James wrote:
That's really an issue for the standard (a cop out I know, but I don't want to get into this debate).
Regards, Thomas

On Tuesday, January 31, 2012 18:33:25 Thomas Klimpel wrote:
but my understanding is that it's the hash function's job.
What do you mean by "my understanding"? Your understanding of the standard, or your understanding of the best solution, or your understanding of that conversation? You also advocated your "understanding" during that conversation.
The standard doesn't favor or preclude either approach, so either way the implementation would be conforming. So, you might say it's my understanding of both the standard and the "best" solution. I realize that my opponent had a different opinion on this and I didn't convince him.
However, I got the impression that the other party didn't really want to discuss that: Daniel James wrote:
That's really an issue for the standard (a cop out I know, but I don't want to get into this debate).
I remember. And it saddens me.

On 31 January 2012 17:57, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On Tuesday, January 31, 2012 18:33:25 Thomas Klimpel wrote:
but my understanding is that it's the hash function's job.
What do you mean by "my understanding"? Your understanding of the standard, or your understanding of the best solution, or your understanding of that conversation? You also advocated your "understanding" during that conversation.
The standard doesn't favor or preclude either approach, so either way the implementation would be conforming. So, you might say it's my understanding of both the standard and the "best" solution. I realize that my opponent had a different opinion on this and I didn't convince him.
The requirements for the hash function are:

For two different values t1 and t2, the probability that h(t1) and h(t2) compare equal should be very small, approaching 1.0 / numeric_limits<size_t>::max().

There's no requirement about how h(t1) and h(t2) are distributed so it must be the container's responsibility, since the container must work well with any hash function that meets these requirements. How else can you interpret it?

On Tuesday, January 31, 2012 18:55:21 Daniel James wrote:
The requirements for the hash function are:
For two different values t1 and t2, the probability that h(t1) and h(t2) compare equal should be very small, approaching 1.0 / numeric_limits<size_t>::max().
There's no requirement about how h(t1) and h(t2) are distributed
Right.
so it must be the container's responsibility, since the container must work well with any hash function that meets these requirements.
Wrong, there's no such requirement. It must work, not necessarily well.
How else can you interpret it?
Simple: it is possible to design a hash function that will cause suboptimal performance with the container. No matter how you implement the container, with bit mixing or not, this statement still holds.

I understand your reasoning. You're trying to reduce the probability of the unordered container being inefficient because of an inefficient hash function. This approach also has the benefit of simplifying the definition of hash functions. What I'm saying is that this pessimization is the wrong thing to do because you make users pay for what they don't need. Most of the time, when you deal with hash containers, you have a pretty good idea of the nature of the keys and the better ways to hash them. There are times when the key values already have a good distribution (e.g. the keys are GUIDs) and there is no need for additional bit mixing. For other cases there are known good hashing algorithms, so you often just use the one that best fits your needs. And for standard types I would expect std::hash to do the right thing. I know, the standard doesn't spell that out explicitly (which is a bad thing), but this is how it works in practice (from my experience).

On 31 January 2012 19:28, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On Tuesday, January 31, 2012 18:55:21 Daniel James wrote:
so it must be the container's responsibility, since the container must work well with any hash function that meets these requirements.
Wrong, there's no such requirement. It must work, not necessarilly well.
Wow. I give up.
And for standard types I would expect std::hash to do the right thing. I know, the standard doesn't spell out that explicitly (which is a bad thing), but this is how it works in practice (from my experience).
From libc++:
    template <>
    struct _LIBCPP_VISIBLE hash<int>
        : public unary_function<int, size_t>
    {
        _LIBCPP_INLINE_VISIBILITY
        size_t operator()(int __v) const _NOEXCEPT {return static_cast<size_t>(__v);}
    };

Last time I tried it, libstdc++ was similar. I do want the containers to work well with popular implementations of std::hash.

On Tuesday, January 31, 2012 20:06:23 Daniel James wrote:
From libc++:
    template <>
    struct _LIBCPP_VISIBLE hash<int>
        : public unary_function<int, size_t>
    {
        _LIBCPP_INLINE_VISIBILITY
        size_t operator()(int __v) const _NOEXCEPT {return static_cast<size_t>(__v);}
    };
Last time I tried it, libstdc++ was similar.
Too bad. I really hope these libraries will get improved eventually. Otherwise, std::hash is useless aside from std::unordered* containers.

On Jan 31, 2012, at 3:17 PM, Andrey Semashev wrote:
On Tuesday, January 31, 2012 20:06:23 Daniel James wrote:
From libc++:
    template <>
    struct _LIBCPP_VISIBLE hash<int>
        : public unary_function<int, size_t>
    {
        _LIBCPP_INLINE_VISIBILITY
        size_t operator()(int __v) const _NOEXCEPT {return static_cast<size_t>(__v);}
    };
Last time I tried it, libstdc++ was similar.
Too bad. I really hope these libraries will get improved eventually. Otherwise, std::hash is useless aside from std::unordered* containers.
You've alluded to the crux of the issue: Is std::hash<scalar> a general purpose hash function? Or is it the best default hash function for the std::unordered containers implemented in this same library? The standard is silent on this question.

A good general purpose std::hash<scalar> will be more expensive, but will likely work well in any application. A std::hash<scalar> optimized to work with the std::unordered containers will take advantage of knowing the details of those containers, and can be made faster than a general purpose hash function.

As a judgement call, I calculated that I would have *many* more clients who are casual users of the std::unordered containers, defaulting most options such as the hash function, than I would have clients interested in using std::hash<scalar> in their own containers, or in other places a hash function might be useful. So I stand by the current implementation of std::hash<int> in libc++ as that which will give the vast majority of my clients the highest quality possible.

libc++ currently contains both CityHash and murmur2 as well (the former explicitly contributed by Google and properly credited in the CREDITS.TXT file in the libc++ root). And these are chosen for some of the other scalars and in some other situations. The current hash<int> isn't a product of ignorance or laziness on this subject. It is a carefully chosen solution based on the assumption that the main purpose of std::hash<int> is to serve as a good default for the std::unordered containers.

Implications of that assumption: If you're in need of a hash function for some need other than a good match for the std::unordered containers, std::hash may not be the solution you're looking for. The only std:: solutions I'm aware of that might come close to fitting such a need are found in <random>. Non-portable solutions may be found in individual std::lib implementations such as libc++ by searching for things like "cityhash" or "murmur2".

Howard

On Wednesday, February 01, 2012 10:57:18 Howard Hinnant wrote:
As a judgement call, I calculated that I would have *many* more clients who are casual users of the std::unordered containers, defaulting most options such as the hash function, than I would have clients interested in using std::hash<scalar> in their own containers, or in other places a hash function might be useful.
Thank you for your careful answer. As I understand, in order to compensate for the simplicity of std::hash you do additional hashing (bit mixing or modulus) in the container. May I ask, have you considered the other approach - to make the container simpler but add bit mixing to std::hash? Why did you choose not to follow this path?

2012/2/1 Andrey Semashev <andrey.semashev@gmail.com>
On Wednesday, February 01, 2012 10:57:18 Howard Hinnant wrote:
As a judgement call, I calculated that I would have *many* more clients who are casual users of the std::unordered containers, defaulting most options such as the hash function, than I would have clients interested in using std::hash<scalar> in their own containers, or in other places a hash function might be useful.
Thank you for your careful answer. As I understand, in order to compensate for the simplicity of std::hash you do additional hashing (bit mixing or modulus) in the container. May I ask, have you considered the other approach - to make the container simpler but add bit mixing to std::hash? Why did you choose not to follow this path?
std::hash can be specialized for user defined types. If unordered containers were to assume that std::hash does bit mixing, users would have a harder time specializing std::hash, which could lead to performance bugs. Roman Perepelitsa.

On Wednesday, February 01, 2012 18:20:33 Roman Perepelitsa wrote:
std::hash can be specialized for user defined types. If unordered containers were to assume that std::hash does bit mixing, users would have a harder time specializing std::hash, which could lead to performance bugs.
When writing a custom hash function, whether a std::hash specialization or not, one cannot rely on bit mixing in the container because it is not mandated by the standard. So one would write the function with bit mixing anyway.

On Feb 1, 2012, at 12:15 PM, Andrey Semashev wrote:
On Wednesday, February 01, 2012 10:57:18 Howard Hinnant wrote:
As a judgement call, I calculated that I would have *many* more clients who are casual users of the std::unordered containers, defaulting most options such as the hash function, than I would have clients interested in using std::hash<scalar> in their own containers, or in other places a hash function might be useful.
Thank you for your careful answer. As I understand, in order to compensate for the simplicity of std::hash you do additional hashing (bit mixing or modulus) in the container. May I ask, have you considered the other approach - to make the container simpler but add bit mixing to std::hash? Why did you choose not to follow this path?
Yes, I've considered that path. The libc++ containers are constrained to hold only a prime number of buckets. And all hashes are modulo'd to fit within the number of buckets. The use of modulus combined with a prime number of buckets has a long and successful history for hash containers, dating back longer than I've been in this industry.

Another approach I've read about is to use a power of 2 number of buckets and an & operation to bring the hash into the bucket range. If I had chosen this path I would probably also choose to use a stronger bit mixing for all default hash functions. Otherwise, poor collision performance is all but assured.

One of the reasons I prefer the prime solution to the power of 2 solution is I believe it provides better protection against user-written hash functions that happen to be poorly chosen. I have decided to opt against "well, you wrote a poor hash function, you get what you deserve." Had this protection been unreasonably expensive, I would not have chosen this path. However, in my estimation this cost is very, very low. The use of the modulus is more expensive than an &, but not greatly so on modern hardware.

Expensive things to do on older hardware include instructions like 64 bit modulus and floating point divide. When I wrote Fortran 30 years ago you could literally predict the performance of numerical code by just counting the floating point divides. You could completely ignore the additions and multiplications because they were practically free by comparison. On modern hardware expensive things to do include flushing the L1 cache and allocating memory. By comparison, individual instructions (except for memory barriers) are practically free.

The power of 2 solution is often associated with "incremental rehashing". As I recall this has the advantage of spreading a rehash out over many insertions so that no one insertion is very expensive. However the API of the std::unordered containers is very "vector-like": You can predict when rehashes are coming by inspecting things like the number of buckets, the load factor and the maximum load factor. You can even plan when your rehashes are going to happen by pre-allocating a predicted number of needed buckets, and/or manipulating max_load_factor, and then calling rehash explicitly at convenient times. So because of the ability to fine tune when and how rehashes happen, the chief selling point of incremental rehashing (and subsequently power of 2 number of buckets) is somewhat muted, imho.

I am not always right. Indeed I pride myself on continually learning new things. Another way of pronouncing "continually learning" is: I was wrong, I learned something new, and now I think I'm right. So I could be wrong here too, and am about to learn something. I accept that. However I've been studying and shipping hash containers for about a dozen years now, and my current position is based on my customer feedback over that period of time.

Howard
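In sketch form, the two bucket-selection strategies being weighed here; the example bucket counts are made up, and real implementations pick primes from a precomputed table.

    #include <cstddef>

    // Prime number of buckets: every bit of the hash influences the bucket,
    // so a weakly mixed hash function still spreads out reasonably.
    std::size_t bucket_prime(std::size_t hash, std::size_t prime_bucket_count)
    {
        return hash % prime_bucket_count;          // e.g. prime_bucket_count == 786433
    }

    // Power-of-two number of buckets: cheaper (a single &), but only the low
    // bits of the hash matter, so the hash function must already mix well.
    std::size_t bucket_pow2(std::size_t hash, std::size_t pow2_bucket_count)
    {
        return hash & (pow2_bucket_count - 1);     // e.g. pow2_bucket_count == 1 << 20
    }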

[Howard Hinnant]
On modern hardware expensive things to do include flushing the L1 cache and allocating memory. By comparison, individual instructions (except for memory barriers) are practically free.
Integer divisions are surprisingly expensive, relatively speaking. While changing deque, we introduced integer divisions that noticeably harmed performance. We got rid of them (restoring the perf and then some) by introducing a power-of-2 guarantee that let us use bitwise operations instead. Of course, this will only matter in very high performance code, like the fundamental operations performed by a container. STL
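The kind of change STL describes, in sketch form; the block size and the layout here are made up for illustration and are not MSVC's actual deque internals.

    #include <cstddef>

    const std::size_t block_size = 16;   // hypothetical power-of-two block size (log2 == 4)

    // With an arbitrary block size, locating an element needs a divide and a modulus:
    //     block = i / block_size;   offset = i % block_size;
    // With a power-of-two guarantee, both become cheap bit operations:
    std::size_t block_of(std::size_t i)  { return i >> 4; }                 // i / 16
    std::size_t offset_in(std::size_t i) { return i & (block_size - 1); }   // i % 16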

On 2/1/2012 12:59 PM, Stephan T. Lavavej wrote:
Integer divisions are surprisingly expensive, relatively speaking. While changing deque, we introduced integer divisions that noticeably harmed performance. We got rid of them (restoring the perf and then some) by introducing a power-of-2 guarantee that let us use bitwise operations instead.
Of course, this will only matter in very high performance code, like the fundamental operations performed by a container.
STL
Ah, good. Something non-contentious that I can contribute. There is a standard technique used in hashing and some other situations that replaces the divide/mod with a multiply and a shift when reducing a value to an arbitrary range (frequently, in practice, multiplying into a wider-width location, then indexing). For this to work well, however, the higher order bits/digits/oits/hits/... must be well distributed. If this cannot be relied upon, techniques like bit rotates, swapping halves, xor-ing the low bytes onto the high ones, or doing a multiply by a constant ignoring overflow, applied before the reduction, can fix things. The technique is to multiply the original value by the range and just use the overflow. Just think of the value to be hashed as the numerator of a rational with one plus the maximum value as the denominator.

Topher
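In code, the multiply-and-overflow reduction described above, assuming a 32-bit hash whose high bits are already well distributed:

    #include <cstdint>

    // Map a 32-bit hash onto [0, range) without a divide: treat hash / 2^32 as a
    // fraction in [0, 1), multiply by the range, and keep the "overflow" (high half).
    std::uint32_t reduce(std::uint32_t hash, std::uint32_t range)
    {
        return static_cast<std::uint32_t>(
            (static_cast<std::uint64_t>(hash) * range) >> 32);
    }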

On 1 February 2012 15:57, Howard Hinnant <howard.hinnant@gmail.com> wrote:
libc++ currently contains both CityHash and murmur2 as well (the former explicitly contributed by Google and properly credited in the CREDITS.TXT file in the libc++ root). And these are chosen for some of the other scalars and in some other situations.
This is worth a read: http://blog.reverberate.org/2012/01/29/state-of-the-hash-functions-2012/

On 1/31/2012 1:55 PM, Daniel James wrote:
The requirements for the hash function are:
For two different values t1 and t2, the probability that h(t1) and h(t2) compare equal should be very small, approaching 1.0 / numeric_limits<size_t>::max().
There's no requirement about how h(t1) and h(t2) are distributed so it must be the container's responsibility, since the container must work well with any hash function that meets these requirements. How else can you interpret it?
As what it says. It does not say anything about the distribution from which the two different values are drawn, so the implication is that for /any/ two values the probability must approach 1/max-value. Technically, of course, this is impossible -- you can always choose the values from a distribution based on inverting the hash (that this might be extremely costly or even infeasible is irrelevant). A reasonable view, however, is that if the values are chosen using any reasonable method that is not too specific to the particular hash method then the probability of a collision is as stated. Any simple pattern in the input that results in a pattern in the output -- a restriction over the possible range of output values for meaningful subsets of all possible values -- will produce collision probabilities much higher than that.

If we do not read it that way -- if we take "two different values" to mean "two values uniformly selected from all possible input values" -- then the requirement is trivial. For any kind of value that is represented in log-2(numeric_limits<size_t>::max()) bits or less, just use the identity function; for anything larger, just use the first bits-per-size_t-value bits and throw out the rest. That is all it takes to conform to the standard by that reading.

But even if this was not the case and this hash function can be taken to conform to the standard -- there is no explicit requirement for quality anywhere in the standard to the best of my knowledge. For example, a linear asymptotic performance requirement does not /specify/ that there isn't a two minute overhead independent of N for each operation -- avoiding that in an implementation is about /quality/ rather than /conformance/. You can consider it to be an accurate implementation of the standard but you can't consider it a good one.

There are specialized applications where a poorly distributed hash (at least poorly distributed in just the right way) is not only acceptable but required. Sometimes you want to ignore some specific differences so that values that differ only by those characteristics still hash to the same value (obvious example -- case-insensitive string hashes). When hashing is used to locate similar objects the hash function must transform values that differ in small ways -- that are similar -- to hash values close to each other. The classic Gorin spelling correction algorithm hashed each word from a document by its length and first two letters, allowing any word with a single letter insertion, deletion or replacement or a two letter transposition to be easily found, and there are lots of geometric algorithms that reduce dimensionality or range in a desirable way so that "close" points are still close, or at least likely to be. But there are a number of characteristics that are routinely considered the characteristics of a /general purpose/ hash function of at least moderate quality, and the avoidance of this kind of regularity is one of them.

The suggestion that the output could be fixed by adding a "bit mixing function" is not really relevant, and neither is the statement that because the standard does not specify a good quality hash it is the container's responsibility to "fix it". The only way for the container to do that is to rehash the hash value -- which is essentially what the bit-mixing function does (such a function would be a not-terrible hash function for fixed width values for many purposes -- as long as input and output having the same number of 1-bits is not a problem. Note, however, that the specific example that demonstrated the problem produced lots of bits the same, which would still reduce the number of possible values for structured inputs however much the bits were shuffled around). To specify that a hash-based algorithm should be written to be proof against poorly distributed inputs essentially means that the expectation is that the input hash value should be rehashed.

Of course, it's standard for hash tables to have prime-like lengths (generally taken to mean avoiding lengths with small -- especially powers of 2 -- divisors), but that has to do with the quality of the necessary range reduction (a.k.a. value folding). This is generally considered a good enough precaution to allow "raw integer values" to be used directly for many applications, but for less casual use this is not acceptable (values that differ by the table size end up in the same bin, and values that differ by a small amount end up in nearby bins -- which can cause excess collisions for data with some correlational structure).

Let's turn one of the arguments made on its head. Why should someone who finds it convenient to use a reasonable quality hash on their data type pay the overhead of using a container that compensates for the possibility of a poor one? Is it a more reasonable expectation that someone who wishes a different balance of quality vs performance than the default should rewrite the standard hash or rewrite the standard unordered_map?

Sorry for the length -- I had a lot to say, I guess.

Topher Cooper
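One of the special-purpose examples mentioned above, as a small sketch: a case-insensitive string hash that deliberately makes strings differing only in case collide (it would be paired with a matching case-insensitive equality predicate in the container).

    #include <boost/functional/hash.hpp>
    #include <cctype>
    #include <cstddef>
    #include <string>

    // "Hello" and "HELLO" hash identically on purpose.
    struct ci_hash
    {
        std::size_t operator()(const std::string& s) const
        {
            std::size_t seed = 0;
            for (std::size_t i = 0; i < s.size(); ++i)
                boost::hash_combine(seed, std::tolower(static_cast<unsigned char>(s[i])));
            return seed;
        }
    };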

On 1 February 2012 00:44, Topher Cooper <topher@topherc.net> wrote:
values that differ by the table size end up in the same bin
Is that likely though? And any more likely than other causes of collisions?
and values that differ by a small amount end up in nearby bins -- which can cause excess collisions for data with some correlational structure.
The STL hash containers are pretty much required to be implemented using a chained hash table.
Let's turn one of the arguments made on its head. Why should someone who finds it convenient to use a reasonable quality hash on their data type pay the overhead of using a container that compensates for the possibility of a poor one? Is it a more reasonable expectation that someone who wishes a different balance of quality vs performance than the default should rewrite the standard hash or rewrite the standard unordered_map?
There are very good alternative open source implementations out there. You shouldn't need to rewrite anything.
Sorry for the length -- I had a lot to say, I guess.
Why do you think I said I didn't want to get into this debate? I do realise that there's a lot to say, and I don't have any enthusiasm for saying it. I wrote the thing 7 years ago.

IMO a more appropriate forum would be one of the C++ newsgroups, as this is more about how the standard should be implemented, rather than my particular implementation which should really become obsolete over the coming years. Boost.Hash's advantages over standard implementations are its portability and its customisation abilities. The former will hopefully become less relevant, the latter doesn't seem to have been that popular. In other respects the standard implementations should be better. I'm not going to compete with them.

If anyone does want to give it a shot, I'd recommend starting from scratch (probably using existing hash algorithms) or using something like libc++'s implementation as a starting point. Most of the effort that's gone into Boost.Hash was to get it working with older compilers, and the implementation has been wrapped around that. There's also the problem with implicit casts which is something of a flaw in the customisation mechanism, so that could do with a rethink.

On Wed, Feb 1, 2012 at 12:59 PM, Daniel James <dnljms@gmail.com> wrote:
On 1 February 2012 00:44, Topher Cooper <topher@topherc.net> wrote:
IMO a more appropriate forum would be one of the C++ newsgroups, as this is more about how the standard should be implemented, rather than my particular implementation which should really become obsolete over the coming years.
I think, this list is appropriate for the discussion, as it is about Boost.Hash (and Boost.Unordered, as long as it is intended to be used with Boost.Hash). At least, it's how it started.
Boost.Hash's advantages over standard implementations are its portability and its customisation abilities. The former will hopefully become less relevant, the latter doesn't seem to have been that popular.
I don't think Boost.Hash and Boost.Unordered have to die out eventually. One possible direction of further development of these libraries could be an attempt to fix flaws in the STL implementations by providing superior tools.
In other respects the standard implementations should be better. I'm not going to compete with them.
A few posts back you cited the std::hash implementation from libc++. I doubt it can be worse than that.

On 2/1/2012 3:59 AM, Daniel James wrote:
On 1 February 2012 00:44, Topher Cooper<topher@topherc.net> wrote:
values that differ by the table size end up in the same bin
Is that likely though? And any more likely than other causes of collisions?
It is immensely more likely than any of the causes of /systematic/ collisions in a reasonable implementation that avoids destroying expected performance (O(1)) and replacing it with worst case performance (O(N)) for simple regularities in the input data. The expected performance takes into account the occurrence of non-systematic collisions. No algorithm can guarantee that there cannot be systematic collisions (since it must be deterministic) but it is easy to make sure that regularities in the input data would have to be incredibly arbitrary to cause them.

Topher Cooper

On 2/1/2012 3:59 AM, Daniel James wrote:
and values that differ by a small amount end up in nearby bins -- which can cause excess collisions for data with some corelational structure.
The STL hash containers are pretty much required to be implemented using a chained hash table.
I've seen that claim, and I might well be missing something, but I believe that coalesced hashing would also work. Deletion (erasure) algorithms that retain an expectation of a few probes up to very high load factors are complicated and somewhat costly, though (as I remember -- it's been a long time since I worked with this) asymptotically within bounds. I don't know of any shortcuts in resizing beyond rehashing all the entries, but the requirements are consistent with that, and resizing a growing table needs to be done less often for coalesced hashing due to good performance up to much higher load factors than external chaining. The cost is a less trivial implementation (but that is what libraries are for), and a constant higher expected cost of element erasure. The gain is frequently much better memory usage.

In any case, how collisions are handled is irrelevant to my point. If it wasn't clear, the structure that I was describing as possibly "correlational" is structure in the data to be hashed rather than the structure of the container.

Let's take an example. Let's say that I'm mapping usernames to some kind of object, and that usernames consist of a last name plus a number assigned in order when there is a duplicate -- a common practice. The presence of a last name increases the probability that that last name is common, so the presence of "cooper3" makes it more likely that there will be a "cooper4". If the hashing scheme hashes those "adjacent" usernames to adjacent hash-values/bins, then if "james23" happens to collide with "cooper3", "james24" will also collide with "cooper4". Consequently, under these conditions, moderate load factors will result in somewhat worse than the constant performance expected from truly uniform hashes.

For what it is worth, FNV1 hashing suffers from precisely this correlational sensitivity under these circumstances (although the next last-byte value is likely to end up in either the next /or/ the previous bin, it will do so consistently, and this may result in the third in the sequence being more likely to collide with the first), which is why the so-called "slightly better" distribution properties of FNV1a (which corrects this defect) are not as trivial as the developers imply.

Topher
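The difference Topher refers to is just the order of the two operations in the inner loop; 64-bit constants shown, purely for illustration.

    #include <cstddef>
    #include <cstdint>

    static const std::uint64_t fnv_basis = 14695981039346656037ULL;
    static const std::uint64_t fnv_prime = 1099511628211ULL;

    // FNV-1: multiply, then xor. The final byte is only xor'd into the low
    // eight bits of the result, which is the correlation discussed above.
    std::uint64_t fnv1(const unsigned char* p, std::size_t n)
    {
        std::uint64_t h = fnv_basis;
        for (std::size_t i = 0; i < n; ++i) { h *= fnv_prime; h ^= p[i]; }
        return h;
    }

    // FNV-1a: xor, then multiply. The final byte is still multiplied through,
    // so it affects all bits of the result.
    std::uint64_t fnv1a(const unsigned char* p, std::size_t n)
    {
        std::uint64_t h = fnv_basis;
        for (std::size_t i = 0; i < n; ++i) { h ^= p[i]; h *= fnv_prime; }
        return h;
    }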

On 1 February 2012 20:15, Topher Cooper <topher@topherc.net> wrote:
On 2/1/2012 3:59 AM, Daniel James wrote:
The STL hash containers are pretty much required to be implemented using a chained hash table.
Note that I said 'pretty much required', not 'required'.
I've seen that claim, and I might well be missing something, but I believe that coalesced hashing would also work.
Container elements can be uncopyable and unmovable.

On 2/1/2012 4:01 PM, Daniel James wrote:
On 1 February 2012 20:15, Topher Cooper<topher@topherc.net> wrote:
On 2/1/2012 3:59 AM, Daniel James wrote:
The STL hash containers are pretty much required to be implemented using a chained hash table.
Note that I said 'pretty much required', not 'required'.
I've seen that claim, and I might well be missing something, but I believe that coalesced hashing would also work.
Container elements can be uncopyable and unmovable.

Ah, I did miss that, but it just means that the underlying table includes indirection, as open chaining generally does intrinsically (I say generally, because if that were not a requirement it would be possible to implement open chaining with the first entry contained in the table).
Topher Cooper

On 1 February 2012 21:32, Topher Cooper <topher@topherc.net> wrote:
On 2/1/2012 4:01 PM, Daniel James wrote:
On 1 February 2012 20:15, Topher Cooper<topher@topherc.net> wrote:
I've seen that claim, and I might well be missing something, but I believe that coalesced hashing would also work.
Container elements can be uncopyable and unmovable.
Ah, I did miss that, but it just means that the underlying table includes indirection, as open chaining generally does intrinsically (I say generally, because if that were not a requirement it would be possible to implement open chaining with the first entry contained in the table).
Wouldn't that defeat the purpose? If you have to follow an indirection for every key comparison, then you've lost the main advantage of using coalesced hashing.

On 2/2/2012 5:28 PM, Daniel James wrote:
Ah, I did miss that, but it just means that the underlying table includes indirection, as open chaining generally does intrinsically (I say generally, because if that were not a requirement it would be possible to implement open chaining with the first entry contained in the table).
Wouldn't that defeat the purpose? If you have to follow an indirection for every key comparison, then you've lost the main advantage of using coalesced hashing.
I'd have to look at the figures in detail, but I don't think so. You do straightforward coalesced hashing either with the key kept in the table, or perhaps its size_t-sized hash value as a proxy, and pointers to a separately allocated vector for the values pointed at. You still get all the goodness -- essentially better paging, less space, and compared to external chaining with as many entries as bins, about equal failure search probes and slightly better success search probes.

In any case, the issue is whether it conforms to the standard and is a reasonable choice. I'm not recommending it -- and certainly not as the primary implementation. It's possible that a library might provide it as an alternative, chosen either manually or via policies.

Topher Cooper

On 2/1/2012 3:59 AM, Daniel James wrote:
There are very good alternative open source implementations out there. You shouldn't need to rewrite anything.
But if, as you say, the standard implies this trade-off, then conformant implementations will end up with roughly the same trade-off. This is only an option if your justification for a low-quality hash function is incorrect. Hashing functions are easily implemented and fairly easily adjusted; replacing a complete container implementation or adjusting one is much more work. If one has a separate hash function and hash container it makes more sense to depend on the hash function to be responsible for the quality of the hash rather than the container, in addition to it being simpler for the developer using it. Since there are /very good/ implementations available, what is the justification for a poor one in the Boost library?

Besides, it would not occur to most developers that the Boost implementation of standard hash is not up to the quality standards that Boost is known for overall. It never occurred to me to check, and I have the necessary background to have determined that it isn't, had I checked, and to understand the consequences of the fact that it is not. I have some code to replace, and some performance results to review to see if they were affected (probably not, but I need to check to be sure).

Topher

On 1 February 2012 20:43, Topher Cooper <topher@topherc.net> wrote:
On 2/1/2012 3:59 AM, Daniel James wrote:
There are very good alternative open source implementations out there. You shouldn't need to rewrite anything.
But if, as you say, the standard implies this trade-off, then conformant implementations will end up with roughly the same trade-off.
I meant alternative open source hash tables, not necessarily ones that meet the standard's unordered container requirements.

On 2/1/2012 4:02 PM, Daniel James wrote:
On 1 February 2012 20:43, Topher Cooper<topher@topherc.net> wrote:
On 2/1/2012 3:59 AM, Daniel James wrote:
There are very good alternative open source implementations out there. You shouldn't need to rewrite anything.
But if, as you say, the standard implies this trade-off, then conformant implementations will end up with roughly the same trade-off.
I meant alternative open source hash tables, not necessarily ones that meet the standard's unordered container requirements.
So your suggestion is that if they discover a problem that requires a different performance trade-off, that they should be expected to either rewrite their code or write an interface adaptor, instead of being expected to rewrite a few lines of hash function code to meet their needs? Topher

On 1 February 2012 21:39, Topher Cooper <topher@topherc.net> wrote:
So your suggestion is that if they discover a problem that requires a different performance trade-off, that they should be expected to either rewrite their code or write an interface adaptor, instead of being expected to rewrite a few lines of hash function code to meet their needs?
No, my suggestion is that they do whatever floats their boat.

On 2/1/2012 3:59 AM, Daniel James wrote:

Why do you think I said I didn't want to get into this debate?

I'm sympathetic -- I don't have time for this either, but I didn't feel that leaving these points of your justification unanswered was responsible.
I do realise that there's a lot to say, and I don't have any enthusiasm for saying it. I wrote the thing 7 years ago.
I've made lots of much worse errors in the past -- nothing wrong about that. And you are to be commended, like all the Boost library implementors, for having made a substantial contribution. If you don't want to defend the decisions you made, though, just admit to misjudgement or let it pass.
IMO a more appropriate forum would be one of the C++ newsgroups, as this is more about how the standard should be implemented, rather than my particular implementation which should really become obsolete over the coming years. Boost.Hash's advantages over standard implementations are its portability and its customisation abilities.
The quality of the Boost implementation and whether it conforms to the requirements and Boost quality standards belongs in the Boost list. There may also be a place for part of the discussion in more general C++ lists, if the belief that the specs imply low quality hash functions is widespread in the community. Could be, but I have no evidence to that effect.

The point is that there is a case to be made that this implementation should have been obsolete seven years ago at its birth. That this wasn't caught by the community back then is a failure of the whole community -- especially those of us who know something about the subject, even those like me whose community participation is sporadic. We can't do anything about the last seven years, but I think that it is clear that it should already be considered obsolete, and that replacing it should be a group priority.

Unfortunately, my clients would be justified, I'm afraid, in considering the time I've spent writing on this as being somewhat irresponsible to them (deadlines long since unmet on work already paid for). If someone hasn't done something when I come up for air, I'll see about fixing this myself.

As for its customisation (cap)abilities you argued earlier in the same message that a major aspect of its customizability is that it can be replaced with something else.

On 1 February 2012 21:13, Topher Cooper <topher@topherc.net> wrote:
On 2/1/2012 3:59 AM, Daniel James wrote:
I do realise that there's a lot to say, and I don't have any enthusiasm for saying it. I wrote the thing 7 years ago.
I've made lots of much worse errors in the past -- nothing wrong about that.
Funnily enough, that wasn't actually my design decision. Well, I suppose I chose to follow someone else's design. Given the requirements (meeting the draft standard, being highly portable without requiring too much development effort), I do think it was the right decision. And you've provided no evidence to the contrary, which is
And you are to be commended, like all the Boost library implementors, for having made a substantial contribution. If you don't want to defend the decisions you made, though, just admit to misjudgement or let it pass.
I do feel an obligation to respond to emails concerning my libraries. I know I shouldn't but there you go.
The quality of the Boost implementation and whether it conforms to the requirements and Boost quality standards belongs in the Boost list. There may also be a place for part of the discussion in more general C++ lists, if the belief that the specs imply low quality hash functions is widespread in the community. Could be, but I have no evidence to that effect.
I mentioned the implementation of std::hash<int> in both libc++ and libstdc++ earlier.
The point is that there is a case to be made that this implementation should have been obsolete seven years ago at its birth. That this wasn't caught by the community back then is a failure of the whole community -- especially those of us who know something about the subject, even those like me whose community participation is sporadic. We can't do anything about the last seven years, but I think that it is clear that it should already be considered obsolete, and that replacing it should be a group priority.
Blimey. If it wasn't clear, I'm very happy for other people to propose a new hash library. This one was written to meet a need, not to be the bestest hash function ever. The only real problem has been the implicit cast issue.
Unfortunately, my clients would be justified, I'm afraid, in considering the time I've spent writing on this as being somewhat irresponsible to them (deadlines long since unmet on work already paid for). If someone hasn't done something when I come up for air, I'll see about fixing this myself.
I'm sorry I forced you to waste your time.
As for its customisation (cap)abilities you argued earlier in the same message that a major aspect of its customizability is that it can be replaced with something else.
You seem to be arguing against a point of view that no one holds by using my explanation for not holding that point of view.

On 2/2/2012 5:26 PM, Daniel James wrote:
I'm sorry I forced you to waste your time.
If I considered it a waste, I wouldn't have devoted the time -- there was no "force" involved. Things have gotten a bit heated, my fault, I apologize.
As for its customisation (cap)abilities you argued earlier in the same message that a major aspect of its customizability is that it can be replaced with something else.
You seem to be arguing against a point of view that no one holds by using my explanation for not holding that point of view.
My misunderstanding. I argued that the choice required that a user wishing some different trade-offs would need to modify or rewrite a large complex piece of code instead of a small, simple one. Your response was that they could just use a completely different, non-standard conformant, library instead of customizing the code. I'm not sure what you are now saying (really) -- that I was right and that there is no reasonable way for library users to choose different performance trade-offs? I really am not trying to attack, just to understand. Topher Cooper

On 3 February 2012 01:07, Topher Cooper <topher@topherc.net> wrote:
My misunderstanding. I argued that the choice required that a user wishing some different trade-offs would need to modify or rewrite a large complex piece of code instead of a small, simple one. Your response was that they could just use a completely different, non-standard conformant, library instead of customizing the code. I'm not sure what you are now saying (really) -- that I was right and that there is no reasonable way for library users to choose different performance trade-offs?
It isn't that bad. STL-style containers are pretty well established, so containers are often replaceable. For example, Google's hash containers are written to have a similar API to the standard containers, although they add some extra requirements to their elements. I don't think they implement anything from C++11 (yet?). Bizarrely, they even supply local iterators, even though they don't really do what they should, and I don't think anyone would miss them.

To some extent, you need to accept reduced performance when using generic components. They do tend to be less efficient than more specialised ones. The STL and Boost have both reduced that cost, but it is still there. An example of this is allocators: fully supporting C++11 allocators can introduce inefficiencies or prevent optimisations because of the need to support custom pointer classes. A certain amount of template magic can reduce that cost though.

Something I've considered in the past is adding a type trait for hash functions which specifies when power of two buckets can be used, although I'm always wary of introducing extra complexity to the implementation.
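Such a trait might look something like this; the names are hypothetical, nothing like it exists in Boost.Unordered, and the post-mix in the example hash is a toy.

    #include <boost/functional/hash.hpp>
    #include <cstddef>

    // Hypothetical opt-in trait: a hash function that already mixes its bits
    // well can declare that masking to a power-of-two bucket count is safe.
    template <class Hash>
    struct allows_power_of_two_buckets { static const bool value = false; };

    // A hash that adds its own mixing step could then specialise the trait.
    struct mixed_double_hash
    {
        std::size_t operator()(double d) const
        {
            std::size_t h = boost::hash<double>()(d);
            h ^= h >> 16; h *= 0x45d9f3b; h ^= h >> 16;   // toy post-mix, illustrative only
            return h;
        }
    };

    template <>
    struct allows_power_of_two_buckets<mixed_double_hash> { static const bool value = true; };

    // A container could mask (hash & (buckets - 1)) when the trait is true and
    // fall back to a prime modulus otherwise.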

on Tue Jan 31 2012, Daniel James <dnljms-AT-gmail.com> wrote:
On 31 January 2012 17:57, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On Tuesday, January 31, 2012 18:33:25 Thomas Klimpel wrote:
but my understanding is that it's the hash function's job.
What do you mean by "my understanding"? Your understanding of the standard, or your understanding of the best solution, or your understanding of that conversation? You also advocated your "understanding" during that conversation.
The standard doesn't favor or preclude either approach, so either way the implementation would be conforming. So, you might say it's my understanding of both the standard and the "best" solution. I realize that my opponent had a different opinion on this and I didn't convince him.
The requirements for the hash function are:
For two different values t1 and t2, the probability that h(t1) and h(t2) compare equal should be very small, approaching 1.0 / numeric_limits<size_t>::max().
There's no requirement about how h(t1) and h(t2) are distributed so it must be the container's responsibility, since the container must work well with any hash function that meets these requirements. How else can you interpret it?
You can interpret it as a QOI issue, and we should aim to provide the highest QOI. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 1 February 2012 15:11, Dave Abrahams <dave@boostpro.com> wrote:
on Tue Jan 31 2012, Daniel James <dnljms-AT-gmail.com> wrote:
On 31 January 2012 17:57, Andrey Semashev <andrey.semashev@gmail.com> wrote:
The standard doesn't favor or preclude either approach, so either way the implementation would be conforming. So, you might say it's my understanding of both the standard and the "best" solution. I realize that my opponent had a different opinion on this and I didn't convince him.
The requirements for the hash function are:
For two different values t1 and t2, the probability that h(t1) and h(t2) compare equal should be very small, approaching 1.0 / numeric_limits<size_t>::max().
There's no requirement about how h(t1) and h(t2) are distributed so it must be the container's responsibility, since the container must work well with any hash function that meets these requirements. How else can you interpret it?
You can interpret it as a QOI issue, and we should aim to provide the highest QOI.
I think you might have misunderstood my point; the question here is whether the containers should work well with hash functions (i.e. other than std::hash and boost::hash) that meet the standard's requirements but don't have a good distribution, even though that makes the container a bit slower (this came up on the other thread because the hit from doing a modulus is much higher on 64-bit machines). Andrey thinks they shouldn't, I think they should.
participants (11)

- Andrey Semashev
- Christopher Jefferson
- Daniel James
- Dave Abrahams
- Howard Hinnant
- Kai Schroeder
- Olaf van der Spek
- Roman Perepelitsa
- Stephan T. Lavavej
- Thomas Klimpel
- Topher Cooper