boost::hash and string conflicts

Hi, We have been using boost hash to hash filenames, but have found with some of our user data that certain strings will produce the same hash code ( "0012g6" and "0012fu" for example). Is there a recommended way to predict or resolve these sorts of conflicts? Thanks in advance!

On Thu, Apr 7, 2011 at 10:08, Erik Scorelle
We have been using boost hash to hash filenames, but have found with some of our user data that certain strings will produce the same hash code ( "0012g6" and "0012fu" for example). Is there a recommended way to predict or resolve these sorts of conflicts?
Having some collisions is the expected behaviour of a (non-cryptographic) hash function. With a birthday search you can easily find thousands of examples. I'm not sure what you mean by "resolve". Normally, code using hashes expects collisions, and uses a full equality operator to resolve them. If you cannot accept any collisions, you could use perfect hashing (libraries like http://cmph.sourceforge.net/ or http://www.gnu.org/software/gperf/gperf.html) or use a cryptographic hash (I have a usable, but WIP library at http://svn.boost.org/svn/boost/sandbox/hash/). ~ Scott

On Thu, Apr 7, 2011 at 7:56 PM, Scott McMurray
We have been using boost hash to hash filenames, but have found with some of our user data that certain strings will produce the same hash code ( "0012g6" and "0012fu" for example). Is there a recommended way to
On Thu, Apr 7, 2011 at 10:08, Erik Scorelle
wrote: predict or resolve these sorts of conflicts?
Having some collisions is the expected behaviour of a (non-cryptographic) hash function. With a birthday search you can easily find thousands of examples.
I'm not sure what you mean by "resolve". Normally, code using hashes expects collisions, and uses a full equality operator to resolve them.
If you cannot accept any collisions, you could use perfect hashing (libraries like http://cmph.sourceforge.net/ or http://www.gnu.org/software/gperf/gperf.html) or use a cryptographic hash (I have a usable, but WIP library at http://svn.boost.org/svn/boost/sandbox/hash/).
~ Scott
Scott, AFAIK perfect hashing works only with the known set of keys, otherwise it is not possible. Just to make a note: crypto-hashes are much slower as normal ones and use much more memory. Erik, how many file names are you going to hash? There was an article on Dr.Dobbs which compared unordered_map (hash table) implementation with std::map. And it turned out that until some certain amount of items it makes no sence to use a hash table, because a normal map is much faster. Here is the link: http://drdobbs.com/cpp/198800559?pgno=1 With Kind Regards, Ovanes

Thanks for the quick responses!
The solution to our particular problem ended up being very simple: just a
few tweaks to the boost::unordered_map functors. I just wasn't quite
thinking about it clearly.
Also, that Dr. Dobbs article was a helpful read.
Cheers
On Thu, Apr 7, 2011 at 2:08 PM, Ovanes Markarian
On Thu, Apr 7, 2011 at 7:56 PM, Scott McMurray
wrote: We have been using boost hash to hash filenames, but have found with some of our user data that certain strings will produce the same hash code ( "0012g6" and "0012fu" for example). Is there a recommended way to
On Thu, Apr 7, 2011 at 10:08, Erik Scorelle
wrote: predict or resolve these sorts of conflicts?
Having some collisions is the expected behaviour of a (non-cryptographic) hash function. With a birthday search you can easily find thousands of examples.
I'm not sure what you mean by "resolve". Normally, code using hashes expects collisions, and uses a full equality operator to resolve them.
If you cannot accept any collisions, you could use perfect hashing (libraries like http://cmph.sourceforge.net/ or http://www.gnu.org/software/gperf/gperf.html) or use a cryptographic hash (I have a usable, but WIP library at http://svn.boost.org/svn/boost/sandbox/hash/).
~ Scott
Scott,
AFAIK perfect hashing works only with the known set of keys, otherwise it is not possible. Just to make a note: crypto-hashes are much slower as normal ones and use much more memory. Erik, how many file names are you going to hash? There was an article on Dr.Dobbs which compared unordered_map (hash table) implementation with std::map. And it turned out that until some certain amount of items it makes no sence to use a hash table, because a normal map is much faster.
Here is the link: http://drdobbs.com/cpp/198800559?pgno=1
With Kind Regards, Ovanes
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

On 7 April 2011 18:56, Scott McMurray
Having some collisions is the expected behaviour of a (non-cryptographic) hash function. With a birthday search you can easily find thousands of examples.
True, but there are better hash functions than boost::hash. Murmer hash, perhaps: http://code.google.com/p/smhasher/

On Thu, Apr 7, 2011 at 7:08 PM, Erik Scorelle
Hi,
We have been using boost hash to hash filenames, but have found with some of our user data that certain strings will produce the same hash code ( "0012g6" and "0012fu" for example). Is there a recommended way to predict or resolve these sorts of conflicts?
Thanks in advance!
Hi! Hash functions do not guarantee uniqueness. The case when the key is not unique called collision. Usually hash tables not only compare the hash key, but also the actual data to be sure that the right record was referenced. The most primitive way to implement a hash-table is to put all records hashing to the same key into a linked list and search linear. Hashing is effecient, if the choosen hash function has a good distribution and results in not so many collisions, which in turn results in amortized constant time for the access. Boost unordered_map is the implementation of a hash table: http://www.boost.org/doc/libs/1_46_0/doc/html/boost/unordered_map.html I suggest reading this article from Wikipedia: http://en.wikipedia.org/wiki/Hash_function And for further reading refer to Cormen et al: Introduction to algorithms. They have the whole chapter dedicated to hashing techniques. With Kind Regards, Ovanes
participants (4)
-
Daniel James
-
Erik Scorelle
-
Ovanes Markarian
-
Scott McMurray