Idea for O(1) hash-table

Hi All, I hope that you will find the following idea interesting. I have been using hash-table with, AFAICT, unique implementation and I thought that it might be handy for boost. The hash table is O(1) *non-amortized* lookup via extremely balanced buckets (for example, max 6 entries per bucket). The implementation itself is absurdly simple ;-) The hash-table is basically variation of linear addressing using double-hashing hash-table with special re-insert strategy. Algorithm for linear addressing (easy to adjust for chained hashing): Consider a two dimensional N*K array – N buckets with up to K elements per bucket. Assume that the zero-key is marking an empty entry. Lookup(key): for (hash = key, I = 0 ; I < K ;++i) { hash = HASH(hash); // double-hashing if (A[hash % N][i] == key) return found; } return not-found; Insert(key): for (hash = key, I = 0 ; I < K ;++i) { hash = HASH(hash); // double-hashing if (A[hash % N][i] == 0) { A[hash % N][i] = key; return true; } } return false; The above insert is basically a bounded double-hashing and as-is it has a low utilization (a bit similar to bloom filter). For example, for 100K buckets and K=6 - insert will fail when the table has less than 50% utilization (i.e. fail to insert though most entries are empty). The trick is to perform “re-insert”: Pick some random element, key1, from failed insert phase, replace it with the new key, and (re) Insert(key1). Retrying up to, say, 12 rounds of re-insert improves the utilization, of N=100K/K=6, to ~97%! Table resize is done (if needed) when such repetitive re-insert fails. Altogether this implementation trade-off insert for fast lookup which is very appealing to caches (i.e. insert done in the expensive path - only after a cache miss). Just for comparison, hash-table using naïve liner probing (A[HASH(key)%N][i]) has only ~80% utilization even when using K=128 (i.e. deep buckets)! Such unbalanced behavior is also typical classical hash-tables with link-lists (chained-hashing) since statistically some buckets will outgrow the average. The performance of the classic double-hashing algorithm itself is quite poor when the table is highly utilize since many steps are required to find empty entries. Though this hash-table is general purposed it has some interesting properties: 1) O(1) non-amortized with very low chance for collisions (each key has its own unique “virtual” bucket). 2) Can be implemented on continuous memory without pointers and with excellent utilization 3) Allow wait free implementation of caches (i.e. no lock or even interlocked needed). 4) Allow caches on shared memory – corruption safe/hazard algorithm (assuming at least 64 bit keys). 5) Allow persistent using direct memory to file mapping (i.e. disk representation is same as in-memory one). 6) Average 50% LRU eviction for caches (i.e. on average entry half from the oldest is evicted) BTW – there is little chance for patenting this algorithm since I already published similar idea in the public domain ;-) Thanks, Rani

Hi,
May be you can add in boost.hash. it will be a good add.
On Mar 16, 2013 2:42 PM, "Rani Sharoni"
Hi All,
I hope that you will find the following idea interesting.
I have been using hash-table with, AFAICT, unique implementation and I thought that it might be handy for boost. The hash table is O(1) *non-amortized* lookup via extremely balanced buckets (for example, max 6 entries per bucket).
The implementation itself is absurdly simple ;-) The hash-table is basically variation of linear addressing using double-hashing hash-table with special re-insert strategy.
Algorithm for linear addressing (easy to adjust for chained hashing): Consider a two dimensional N*K array – N buckets with up to K elements per bucket. Assume that the zero-key is marking an empty entry.
Lookup(key): for (hash = key, I = 0 ; I < K ;++i) { hash = HASH(hash); // double-hashing if (A[hash % N][i] == key) return found; }
return not-found;
Insert(key): for (hash = key, I = 0 ; I < K ;++i) { hash = HASH(hash); // double-hashing if (A[hash % N][i] == 0) { A[hash % N][i] = key; return true; } }
return false;
The above insert is basically a bounded double-hashing and as-is it has a low utilization (a bit similar to bloom filter). For example, for 100K buckets and K=6 - insert will fail when the table has less than 50% utilization (i.e. fail to insert though most entries are empty).
The trick is to perform “re-insert”: Pick some random element, key1, from failed insert phase, replace it with the new key, and (re) Insert(key1). Retrying up to, say, 12 rounds of re-insert improves the utilization, of N=100K/K=6, to ~97%! Table resize is done (if needed) when such repetitive re-insert fails.
Altogether this implementation trade-off insert for fast lookup which is very appealing to caches (i.e. insert done in the expensive path - only after a cache miss).
Just for comparison, hash-table using naïve liner probing (A[HASH(key)%N][i]) has only ~80% utilization even when using K=128 (i.e. deep buckets)! Such unbalanced behavior is also typical classical hash-tables with link-lists (chained-hashing) since statistically some buckets will outgrow the average. The performance of the classic double-hashing algorithm itself is quite poor when the table is highly utilize since many steps are required to find empty entries.
Though this hash-table is general purposed it has some interesting properties: 1) O(1) non-amortized with very low chance for collisions (each key has its own unique “virtual” bucket). 2) Can be implemented on continuous memory without pointers and with excellent utilization 3) Allow wait free implementation of caches (i.e. no lock or even interlocked needed). 4) Allow caches on shared memory – corruption safe/hazard algorithm (assuming at least 64 bit keys). 5) Allow persistent using direct memory to file mapping (i.e. disk representation is same as in-memory one). 6) Average 50% LRU eviction for caches (i.e. on average entry half from the oldest is evicted)
BTW – there is little chance for patenting this algorithm since I already published similar idea in the public domain ;-)
Thanks, Rani
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On Sat, Mar 16, 2013 at 2:35 PM, Shakti Misra
Hi, May be you can add in boost.hash. it will be a good add. On Mar 16, 2013 2:42 PM, "Rani Sharoni"
wrote: Hi All,
I hope that you will find the following idea interesting. [...]
Lookup(key): for (hash = key, I = 0 ; I < K ;++i) { hash = HASH(hash); // double-hashing if (A[hash % N][i] == key) return found; }
return not-found;
The trick is to perform “re-insert”: Pick some random element, key1, from failed insert phase, replace it with the new key, and (re) Insert(key1).
Though this hash-table is general purposed it has some interesting properties: 1) O(1) non-amortized with very low chance for collisions (each key has its own unique “virtual” bucket). 2) Can be implemented on continuous memory without pointers and with excellent utilization 3) Allow wait free implementation of caches (i.e. no lock or even interlocked needed). 4) Allow caches on shared memory – corruption safe/hazard algorithm (assuming at least 64 bit keys). 5) Allow persistent using direct memory to file mapping (i.e. disk representation is same as in-memory one). 6) Average 50% LRU eviction for caches (i.e. on average entry half from the oldest is evicted)
BTW – there is little chance for patenting this algorithm since I already published similar idea in the public domain ;-)
After little wiki digging I saw that my idea is already well established and known as "Cuckoo Hashing" (2001): http://en.wikipedia.org/wiki/Cuckoo_hashing http://www.ru.is/faculty/ulfar/CuckooHash.pdf Sorry for the noise. IMHO, you should consider having such hashing/caching scheme in boost...

On 18:10 Sat 16 Mar , Rani Sharoni wrote:
After little wiki digging I saw that my idea is already well established and known as "Cuckoo Hashing" (2001): http://en.wikipedia.org/wiki/Cuckoo_hashing http://www.ru.is/faculty/ulfar/CuckooHash.pdf
Sorry for the noise. IMHO, you should consider having such hashing/caching scheme in boost...
It's a smart idea nonetheless. We often use such hashing schemes for searching game trees. Having to implement the containers manually is really annoying. :-) Best -Andreas -- ========================================================== Andreas Schäfer HPC and Grid Computing Chair of Computer Science 3 Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany +49 9131 85-27910 PGP/GPG key via keyserver http://www.libgeodecomp.org ========================================================== (\___/) (+'.'+) (")_(") This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!

Have not looked at anything, If if it is not there,
Can I give a shot for the library so that it can be part of boost.
On Sat, Mar 16, 2013 at 9:55 PM, Andreas Schäfer
On 18:10 Sat 16 Mar , Rani Sharoni wrote:
After little wiki digging I saw that my idea is already well established and known as "Cuckoo Hashing" (2001): http://en.wikipedia.org/wiki/Cuckoo_hashing http://www.ru.is/faculty/ulfar/CuckooHash.pdf
Sorry for the noise. IMHO, you should consider having such hashing/caching scheme in boost...
It's a smart idea nonetheless. We often use such hashing schemes for searching game trees. Having to implement the containers manually is really annoying. :-)
Best -Andreas
-- ========================================================== Andreas Schäfer HPC and Grid Computing Chair of Computer Science 3 Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany +49 9131 85-27910 PGP/GPG key via keyserver http://www.libgeodecomp.org ==========================================================
(\___/) (+'.'+) (")_(") This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Warm Regards --Dev OpenPegasus Developer "It's Always better to try and fail instead of not doing/trying anything"
participants (4)
-
Andreas Schäfer
-
Devchandra L Meetei
-
Rani Sharoni
-
Shakti Misra