[Gauge for interest] Minimal perfect hash function generator

Hi all,

Recently I've written a minimal perfect hash function generator. It generates very efficient hashing functions for arbitrary (reasonably large) key sets of strictly positive integers, mapping these keys to hash values in the range [0..N), where N is the size of the key set (minimality requirement) and the hash values are unique (perfection requirement). In other words, given a set of N arbitrary positive integers (xi > 0), this generator produces a function that maps the integers to another set of integers between 0 and N-1 inclusive, with no duplicate mappings.

Generally, the developed MPH (minimal perfect hash) generator computes the coefficients for one of the following hash functions:

h(x) = (C / (D*x + E)) mod N
or
h(x) = (C / x) mod N     (D == 1, E == 0)

where the second one is chosen almost all the time (generally C, D and E are integers). Especially the second form requires only a couple of instructions to execute, which makes it a very effective selection alternative for table-oriented algorithms.

The requirement that the input keys be strictly positive isn't a big limitation, because it is always possible to transform key sets containing negative numbers accordingly, or to derive the keys from strings to match.

Sounds really perfect! But there are some drawbacks:

- the keys (obviously) must be known in advance, so a corresponding hash function may be generated.
- some initial computation is required to find the coefficients C, D and E (which is obvious as well :-).

The initial overhead is tolerable, though. Here are some timings on my machine (Pentium M, 1.6 GHz, 1 GByte):

20 keys: 0.005 [s], 200 keys: 0.03 [s], 2000 keys: 0.15 [s], 20000 keys: 4.45 [s]

The time grows slightly worse than linearly with the number of keys (it actually has a linear and an exponential component, but the exponential component kicks in for larger key sets only). The memory requirement for the MPH generator grows linearly with the number of keys: roughly N/3 * sizeof(int), where N is the number of keys.

This hash function generator is usable for a wide range of applications where efficient data lookup is a major goal and where sparse data sets have to be treated efficiently.

Is there anybody interested in such a thing?

Regards Hartmut
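[Editorial sketch, not part of Hartmut's post: the following shows the shape of the generated hash function described above, assuming integer division (which the "couple of instructions" remark suggests). The coefficients were found by hand for the toy key set {3, 7, 11, 42} and are not output of the actual generator.]

#include <iostream>

const unsigned N = 4;   // size of the (toy) key set
const unsigned C = 47;  // hand-picked so that h is minimal and perfect for {3, 7, 11, 42}
const unsigned D = 1;   // D == 1, E == 0 selects the simpler second form
const unsigned E = 0;

// General form:  h(x) = (C / (D*x + E)) mod N   (integer division)
// Simple form:   h(x) = (C / x) mod N
inline unsigned mph(unsigned x)
{
    return (C / (D * x + E)) % N;
}

int main()
{
    const unsigned keys[N] = { 3, 7, 11, 42 };
    for (unsigned i = 0; i < N; ++i)
        std::cout << "h(" << keys[i] << ") = " << mph(keys[i]) << "\n";
    // prints 3, 2, 0, 1 -- each key maps to a distinct value in [0..N)
    return 0;
}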

On Jan 17, 2005, at 8:34 AM, Hartmut Kaiser wrote:
recently I've written a minimal perfect hash function generator, which generates very efficient hashing functions for arbitrary (reasonably large) key sets containing strictly positive integers to achieve a minimal and perfect mapping of these keys to hash values in the range [0..N), where N is the size of the key set (minimality requirement) and the hash values are unique (perfection requirement). [...]
The initial overhead is tolerable, though. Here are some timings on my machine (Pentium M, 1.6 GHz, 1 GByte):
20 keys: 0.005 [s], 200 keys: 0.03 [s], 2000 keys: 0.15 [s], 20000 keys: 4.45 [s]
The time grows slightly worse than linearly with the number of keys (it actually has a linear and an exponential component, but the exponential component kicks in for larger key sets only). The memory requirement for the MPH generator grows linearly with the number of keys: roughly N/3 * sizeof(int), where N is the number of keys.
I'm very interested in whether this scales well for a larger number of keys, on the order of 10^6 - 10^8 keys. I guess that for those sizes the scaling will already be exponential. Can you tell us what the maximum number of keys is that can be efficiently treated?
Matthias

On Mon, 17 Jan 2005 09:37:34 +0100, Matthias Troyer <troyer@itp.phys.ethz.ch> wrote:
I'm very interested in whether this scales well for a larger number of keys, on the order of 10^6 - 10^8 keys. I guess that for those sizes the scaling will already be exponential. Can you tell us what the maximum number of keys is that can be efficiently treated?
Matthias
Howdy,

Perfect hashing and statistically reasonable hashing are quite different but practically the same ;-)

For 2^128 keys you might like to try something these papers allude to:

R. Uzgalis and M. Tong. Hashing Myths. Technical Report 97, Department of Computer Science, University of Auckland, July 1994.

John Hamer. Hashing Revisited. Proceedings of the 7th Annual Conference on Innovation and Technology in Computer Science Education (ITiCSE 2002), Aarhus, Denmark, pp. 80-83, 2002. ISBN 1-58113-499-1.

I feel for R. Uzgalis though, as I think of these two papers as Hamer and Tong ;-)

Hope this helps,

matt.

______________________________

PS: A bit of bad code to show how it works is below; feel free to use it to start something decent if you like. Not sure if the code works or if it is a w.i.p., as it is from an odd set of files I have on this machine. Looks about right though.

/*-------------------------------------------------------------------
    created:    2004-4-16 15:58
    filename:   hash.hpp
    author:     Matt Hurd
    purpose:
---------------------------------------------------------------------*/
#ifndef hash_hpp_2004416
#define hash_hpp_2004416

#include <boost/random.hpp>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

namespace net {

// random permutation of 0..255, used as a byte substitution table
static unsigned char hash_byte_transform_1[256] = {
    129, 186, 126, 139, 50, 173, 232, 9, 144, 73, 174, 153, 203, 110, 206, 254,
    150, 83, 108, 109, 249, 55, 231, 62, 115, 191, 101, 105, 252, 237, 49, 182,
    64, 37, 13, 46, 152, 239, 42, 18, 247, 195, 86, 98, 102, 215, 227, 200,
    65, 246, 80, 118, 250, 207, 214, 210, 149, 8, 241, 85, 243, 107, 199, 188,
    202, 59, 162, 25, 103, 204, 43, 168, 221, 197, 53, 253, 234, 19, 20, 15,
    242, 7, 56, 190, 233, 14, 79, 4, 163, 29, 170, 194, 184, 72, 240, 16,
    45, 176, 31, 196, 238, 235, 140, 217, 93, 39, 143, 164, 40, 78, 99, 17,
    212, 216, 166, 171, 167, 142, 224, 97, 169, 75, 48, 159, 180, 141, 209, 104,
    28, 116, 44, 255, 47, 156, 77, 58, 91, 181, 245, 87, 0, 160, 133, 111,
    26, 6, 230, 95, 70, 161, 137, 89, 68, 172, 244, 90, 1, 223, 81, 33,
    67, 30, 132, 82, 236, 100, 138, 36, 183, 251, 165, 179, 60, 185, 66, 123,
    61, 218, 125, 189, 3, 178, 92, 35, 34, 130, 69, 198, 177, 32, 63, 208,
    21, 94, 193, 134, 71, 10, 154, 96, 128, 157, 54, 187, 120, 192, 145, 41,
    11, 225, 57, 52, 131, 23, 84, 222, 74, 151, 127, 27, 5, 219, 228, 201,
    113, 76, 38, 136, 2, 88, 146, 106, 22, 147, 213, 205, 114, 226, 229, 24,
    175, 122, 124, 135, 121, 158, 220, 12, 211, 117, 112, 51, 148, 248, 119, 155
};

class hash {
public:
    static unsigned char transform(unsigned char val) {
        return hash_byte_transform_1[val];
    }

    // Hash an arbitrary (POD) type by viewing its object representation
    // byte by byte, feeding each byte through the substitution table and
    // rotating the accumulator left by 8 bits between bytes.
    template <typename T>
    static std::size_t transform(const T& val_a) {
        union unholy {
            T val_;
            unsigned char byte_[sizeof(T)];
        };
        const unholy* t = reinterpret_cast<const unholy*>(&val_a);
        std::size_t result = 123456789;
        result ^= hash_byte_transform_1[t->byte_[0]];
        for (std::size_t i = 1; i < sizeof(T); ++i) {
            result = (result << 8) | (result >> (8 * (sizeof(std::size_t) - 1)));
            result ^= hash_byte_transform_1[t->byte_[i]];
        }
        return result;
    }

    // Overload for strings (the original in-class explicit specialisation
    // isn't valid standard C++); same scheme, one byte per character.
    static std::size_t transform(const std::string& val) {
        std::size_t result = 123456789;
        if (!val.empty()) {
            std::string::const_iterator i = val.begin();
            std::string::const_iterator i_end = val.end();
            result ^= hash_byte_transform_1[static_cast<unsigned char>(*i)];
            for (++i; i != i_end; ++i) {
                result = (result << 8) | (result >> (8 * (sizeof(std::size_t) - 1)));
                result ^= hash_byte_transform_1[static_cast<unsigned char>(*i)];
            }
        }
        return result;
    }

    // Print a freshly shuffled permutation of 0..255 for pasting in as a
    // replacement hash_byte_transform_1 table.
    static void prepare_random_list() {
        boost::mt19937 rng;
        rng.seed(42); // the base to the meaning of life is chance
        boost::uniform_int<> byte_range(0, 255);
        boost::variate_generator<boost::mt19937, boost::uniform_int<> > die(rng, byte_range);

        std::vector<int> v;
        v.resize(256);
        // initialise
        for (std::size_t i = 0; i < 256; ++i) { v[i] = static_cast<int>(i); }
        // shuffle
        for (std::size_t i = 0; i < 256; ++i) {
            std::swap(v[i], v[255 - (die() % (256 - i))]);
        }
        // output results for pasting
        for (std::size_t i = 0; i < 256; ++i) { std::cout << v[i] << ", "; }
        std::cout << "\n";
    }
};

} // namespace net

#endif
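[Editorial sketch, not part of Matt's post: a short usage example for the class above; the table size and keys below are arbitrary.]

#include <iostream>
#include <string>
#include "hash.hpp"   // the header above

int main()
{
    const std::size_t bucket_count = 1024;   // made-up table size

    // Hash a string and a plain integer into [0, bucket_count).
    std::size_t h1 = net::hash::transform(std::string("boost")) % bucket_count;
    std::size_t h2 = net::hash::transform(42) % bucket_count;

    std::cout << "string bucket:  " << h1 << "\n"
              << "integer bucket: " << h2 << "\n";

    // Regenerate the byte permutation table (prints 256 values for pasting):
    // net::hash::prepare_random_list();
    return 0;
}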

At 04:51 AM 1/17/2005, Matt Hurd wrote:
Perfect hashing and statistically reasonable hashing are quite different but practically the same ;-)
Isn't that stretching it a bit? Collision handling code can be eliminated if you know the hashing is perfect, for example. Although I do understand your point that for many practical applications, perfect hashing and statistically reasonable hashing would yield very similar performance. Seems like both could have a place in Boost. Statistically reasonable hashing of particular sets of strings has been a need that I've often run into. --Beman

On Jan 17, 2005, at 5:16 PM, Beman Dawes wrote:
At 04:51 AM 1/17/2005, Matt Hurd wrote:
Perfect hashing and statistically reasonable hashing are quite different but practically the same ;-)
Isn't that stretching it a bit? Collision handling code can be eliminated if you know the hashing is perfect, for example. Although I do understand your point that for many practical applications, perfect hashing and statistically reasonable hashing would yield very similar performance.
Seems like both could have a place in Boost.
Statistically reasonable hashing of particular sets of strings has been a need that I've often run into.
To me, perfect hashing has the big advantage that it vectorizes on a vector supercomputer, in which case it is not at all comparable to statistically reasonable hashing, which will always be dominated by worst-case performance on vector machines.
Matthias

Matthias Troyer wrote:
The initial overhead is tolerable, though. Here are some timings on my machine (Pentium M, 1.6 GHz, 1 GByte):
20 keys: 0.005 [s], 200 keys: 0.03 [s], 2000 keys: 0.15 [s], 20000 keys: 4.45 [s]
The time grows slightly worse than linearly with the number of keys (it actually has a linear and an exponential component, but the exponential component kicks in for larger key sets only). The memory requirement for the MPH generator grows linearly with the number of keys: roughly N/3 * sizeof(int), where N is the number of keys.
I'm very interested in whether this scales well for a larger number of keys, on the order of 10^6 - 10^8 keys. I guess that for those sizes the scaling will already be exponential. Can you tell us what the maximum number of keys is that can be efficiently treated?
Honestly, I've never tried it for such a large set of keys, but I expect the times to be significantly larger. For a random key set of 200,000 keys I got 88.3 [s] on my machine... I hadn't expected anybody to need such large hash tables yet, but I'm willing to think about how to optimise it further.

OTOH, I think the time to generate an MPH for larger key sets could be improved significantly if you're ready to accept a more complex hashing function (used at hashing time) consisting of several chained calculations. But currently I'm not able to give any prediction wrt this. One of the main goals for this implementation was to get a very simple and efficient hashing function.

I've tried to get an account on the new Boost vault, but still waiting for the activation mail... As soon as I get this I'll upload the files, so you'll have the possibility to play around with it.

Regards Hartmut

Hartmut Kaiser wrote:
I've tried to get an account on the new Boost vault, but still waiting for the activation mail... As soon as I get this I'll upload the files, so you'll have the possibility to play around with it.
I just activated your account. I'll add a note about SF disabling email.
--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com - 102708583/icq

Rene Rivera wrote:
Hartmut Kaiser wrote:
I've tried to get an account on the new Boost vault, but still waiting for the activation mail... As soon as I get this I'll upload the files, so you'll have the possibility to play around with it.
I just activated your account. I'll add a note about SF disabling email.
Thanks a lot! Regards Hartmut

Hartmut Kaiser wrote:
Rene Rivera wrote:
Hartmut Kaiser wrote:
I've tried to get an account on the new Boost vault, but still waiting for the activation mail... As soon as I get this I'll upload the files, so you'll have the possibility to play around with it.
I just activated your account. I'll add a note about SF disabling email.
Thanks a lot!
You are welcome. And I got a workaround for SF disabling email implemented, so email registration and the digests for the vault should be back in working order now.
--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com - 102708583/icq

Rene Rivera wrote:
Hartmut Kaiser wrote:
I've tried to get an account on the new Boost vault, but still waiting for the activation mail... As soon as I get this I'll upload the files, so you'll have the possibility to play around with it.
I just activated your account. I'll add a note about SF disabling email.
Thanks a lot!
I've uploaded the MPH generator to the boost-sandbox vault: http://tinyurl.com/3knya. Regards Hartmut

Hello Hartmut, This looks useful. I have several questions:
The requirement that the input keys be strictly positive isn't a big limitation, because it is always possible to transform key sets containing negative numbers accordingly, ...
? By addition of a constant positive offset?
or to derive the keys from strings to match.
? You lost me here. Would you explain this point in a little more detail please.
The memory requirement for the MPH generator is raising linearily with the number of keys: roughly N/3 * sizeof(int), where N is the number of keys.
? Is it possible to generalize to keys outside of unsigned int's range with linear impact on generator performance? Specifically, I'm looking for an efficient hash function for an associative hash table keyed by 16-byte UUIDs, where I know the set of UUID keys a priori, hash lookup time must be constant (primary consideration), and hash generation time should be linear (secondary consideration).
- Regards Chris
"Hartmut Kaiser" <hartmutkaiser@t-online.de> wrote in message news:1CqRR7-1vg7iS0@afwd00.sul.t-online.com...
Hi all,
recently I've written a minimal perfect hash function generator, which generates very efficient hashing functions for arbitrary (reasonably large) key sets containing strictly positive integers to achieve a minimal and perfect mapping of these keys to hash values in the range [0..N), where N is the size of the key set (minimality requirement) and the hash values are unique (perfection requirement).

Christopher D. Russell wrote:
The requirement that the input keys be strictly positive isn't a big limitation, because it is always possible to transform key sets containing negative numbers accordingly, ...
? By addition of a constant positive offset?
Sure.
or to derive the keys from strings to match.
? You lost me here. Would you explain this point in a little more detail please.
You're certainly right that there is no generic and simple way to map strings to integers besides reinterpreting the bit pattern of the string as a (probably long) integer. What I meant is that sometimes you can use a simple mapping, for instance adding up all the character values of the string, provided this produces no collisions for your input key set.
The memory requirement for the MPH generator is raising linearily with the number of keys: roughly N/3 * sizeof(int), where N is the number of keys.
? Is it possible to generalize to keys outside of unsigned int's range w/linear impact on generator performance? Specifically, I'm looking for an efficient hash function for an associative hash table keyed by 16-byte UUID's where I know the set of UUID keys a-priori, hash lookup time is constant (primary consideration), hash generation time is linear (secondary consideration).
This problem is very similar to the string mapping problem you've mentioned above. The algorithm could in principle be applied to integers of arbitrary precision (given you have a plugin implementation for such a thing, which could be used in place of the builtin data type), but I'm not able to make any timing predictions for this. The only thing that seems obvious is that the general dependency on the key set size is still valid. Regards Hartmut
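[Editorial sketch, not part of the thread's code: a small illustration of the two key-derivation ideas discussed above, turning signed integers and strings into strictly positive integer keys. The offset and the character-summing mapping follow the suggestions in this exchange; collisions within a concrete key set still have to be checked when the set is built.]

#include <cassert>
#include <climits>
#include <string>

// Derive a strictly positive key from a possibly negative integer by adding
// a constant positive offset (chosen so that INT_MIN maps to 1).
inline unsigned long long key_from_int(int x)
{
    return static_cast<unsigned long long>(static_cast<long long>(x) - INT_MIN + 1);
}

// Derive an integer key from a string by summing its character values, as
// suggested above; this only works if the sums are unique for the concrete
// input key set, which must be verified up front.
inline unsigned long long key_from_string(const std::string& s)
{
    unsigned long long sum = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        sum += static_cast<unsigned char>(s[i]);
    return sum + 1;   // + 1 keeps the empty string strictly positive
}

int main()
{
    assert(key_from_int(-5) > 0);
    assert(key_from_string("boost") != key_from_string("spirit")); // no collision for this tiny set
    return 0;
}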

On Mon, 17 Jan 2005 13:19:34 +0100, Hartmut Kaiser <hartmutkaiser@t-online.de> wrote:
Christopher D. Russell wrote:
The requirement, that the input keys should be strictly positive isn't a big limitation, because it is always possible to accordingly transform key sets containing possibly negative numbers, ...
? By addition of a constant positive offset?
Sure.
or to derive the keys from strings to match.
? You lost me here. Would you explain this point in a little more detail please.
You're certainly right that there is no generic and simple way to map strings to integers besides reinterpreting the bit pattern of the string as a (probably long) integer. What I meant is that sometimes you can use a simple mapping, for instance adding up all the character values of the string, provided this produces no collisions for your input key set.
The code I posted as part of this thread does this in a statistically optimal way, it may be argued, for most types, including strings, and is quite efficient. It works for contiguous types as well, more or less, including doubles and such things as UUIDs. Where it differs is that the MPH generator generates a wonderful collision-free set for a fixed universe, which leads to a nice mapping from a code to an index suitable for contiguous storage or enumeration. I will be interested to have a look at the MPH and see how it can adapt to arbitrarily long keys, especially strings, fixed and variable length. I can also imagine an "optimal" implementation touching the minimum number of bytes and doing the minimum number of operations to derive an optimal hash for a given set of strings... I'd be able to thwack such a clever thing into many systems I have running and gain very tangible benefits if such a thing were possible. $0.02 matt.
participants (6)
- Beman Dawes
- Christopher D. Russell
- Hartmut Kaiser
- Matt Hurd
- Matthias Troyer
- Rene Rivera