[Tokenmap] A (perfect) hash container library chcked-in to sandbox (RFC)

http://svn.boost.org/svn/boost/sandbox/tokenmap/libs/tokenmap/doc/html/index... A tokenmap is a data structure that uniquely maps pseudo-random generated keys with elements of the collection. Boost.Tokenmap is a (perfect) hash container library for C++. An important distinction between tokenmap and other dictionary-like containers, such as std::map or Boost.Unordered, is that tokenmap generates keys internally (referred to as "tokens") rather than relying on user to provide them. Specifically, when a new element is inserted into the tokenmap, an apparently random key is returned back to the caller which uniquely maps to the stored element. The returned key can later be used in a very efficient lookup. Although the code and documentation are far from complete, the implementation in the repo is functional and hopefully serves as a basis for further work. Slawomir

On 21.02.2010 23:59, Slawomir Lisznianski wrote:
http://svn.boost.org/svn/boost/sandbox/tokenmap/libs/tokenmap/doc/html/index...
A tokenmap is a data structure that uniquely maps pseudo-random generated keys with elements of the collection.
Boost.Tokenmap is a (perfect) hash container library for C++. An important distinction between tokenmap and other dictionary-like containers, such as std::map or Boost.Unordered, is that tokenmap generates keys internally (referred to as "tokens") rather than relying on user to provide them. Specifically, when a new element is inserted into the tokenmap, an apparently random key is returned back to the caller which uniquely maps to the stored element. The returned key can later be used in a very efficient lookup.
Although the code and documentation are far from complete, the implementation in the repo is functional and hopefully serves as a basis for further work.
How is it different from having an iterator or a pointer to the inserted element?

On Sun, Feb 21, 2010 at 4:44 PM, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On 21.02.2010 23:59, Slawomir Lisznianski wrote:
http://svn.boost.org/svn/boost/sandbox/tokenmap/libs/tokenmap/doc/html/index...
A tokenmap is a data structure that uniquely maps pseudo-random generated keys with elements of the collection.
Boost.Tokenmap is a (perfect) hash container library for C++. An important distinction between tokenmap and other dictionary-like containers, such as std::map or Boost.Unordered, is that tokenmap generates keys internally (referred to as "tokens") rather than relying on user to provide them. Specifically, when a new element is inserted into the tokenmap, an apparently random key is returned back to the caller which uniquely maps to the stored element. The returned key can later be used in a very efficient lookup.
Although the code and documentation are far from complete, the implementation in the repo is functional and hopefully serves as a basis for further work.
How is it different from having an iterator or a pointer to the inserted element?
Because those can get invalidated if the map resizes I would wager...

OvermindDL1 escribió:
On Sun, Feb 21, 2010 at 4:44 PM, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On 21.02.2010 23:59, Slawomir Lisznianski wrote:
http://svn.boost.org/svn/boost/sandbox/tokenmap/libs/tokenmap/doc/html/index...
[...]
Specifically, when a new element is inserted into the tokenmap, an apparently random key is returned back to the caller which uniquely maps to the stored element. The returned key can later be used in a very efficient lookup.
[...]
How is it different from having an iterator or a pointer to the inserted element?
Because those can get invalidated if the map resizes I would wager...
Given the very limited interface of tokenmap, you can easily implement in on top of a std::list, whose iterators do not get invalidated. I think Andrey's question is a critical one and has not yet been addressed. Joaquín M López Muñoz Telefónica, Investigación y Desarrollo

On 22 February 2010 15:59, <joaquin@tid.es> wrote:
Given the very limited interface of tokenmap, you can easily implement in on top of a std::list, whose iterators do not get invalidated. I think Andrey's question is a critical one and has not yet been addressed.
I believe Tokenmap's tokens will remain valid if the container is copied. Or even serialized. Daniel

Given the very limited interface of tokenmap, you can easily implement in on top of a std::list, whose iterators do not get invalidated. I think Andrey's question is a critical one and has not yet been addressed.
Sharing implementation details such as iterators or pointers with external agents poses safety problems. The tokenmap introduces a lightweight level of indirection which hides implementation details. It also decouples the lifecycles of objects (the token can outlive the pointed to object, say stored in some context, and code can safely detect that). You can also think of tokenmap as provider of a "handle", an idiom popular in C.

On Mon, Feb 22, 2010 at 6:03 PM, Slawomir Lisznianski <slisznia@paramay.com> wrote:
Given the very limited interface of tokenmap, you can easily implement in on top of a std::list, whose iterators do not get invalidated. I think Andrey's question is a critical one and has not yet been addressed.
Sharing implementation details such as iterators or pointers with external agents poses safety problems. The tokenmap introduces a lightweight level of indirection which hides implementation details. It also decouples the lifecycles of objects (the token can outlive the pointed to object, say stored in some context, and code can safely detect that).
It is also very useful as a simple and lightweight replacement for string interning: sometimes in text processing you need to represent unique words or phrases with a short numeric id, for example for fast comparison or simple compression, both in memory and on disk. In particular I need the following: - test if a value is already in the container and find the existing key in that case (i.e. it is not a multimap). It is not immediately clear from the documentation whether this is the case. In practice I need a way to eficiently convert a value to a key. - key size selectable at compile time (for example uint16_t if one needs very short keys). It seems that this is already possibile. I do not necessarily need random keys (in fact a simple decreasing counter would be enough), but it doesn't hurt. I see that pop returns by pointer, implying that there is an extra level of indirection to store the value. Maybe having a sepearate token pointer map (similar to boost pointer containers) would be better. (or even better support some form of unique_ptr). All in all, I think it would be a nice and simple addition to boost. -- gpd

- key size selectable at compile time (for example uint16_t if one needs very short keys). It seems that this is already possibile.
correct
level of indirection to store the value. Maybe having a sepearate token pointer map (similar to boost pointer containers) would be better. (or even better support some form of unique_ptr).
the plan is to introduce ptr_tokenmap to replace existing tokenmap and have a tokenmap with by-value semantics. naturally, all the STL-pieces like iterators are coming shortly too.

It is also very useful as a simple and lightweight replacement for string interning: sometimes in text processing you need to represent unique words or phrases with a short numeric id, for example for fast comparison or simple compression, both in memory and on disk.
I don't think that a `tokenmap` will be suitable for string interning because a `tokenmap` essentially needs to choose the generated hashes (tokens). With string interning, the hash of the string should serve as the token. You probably want an unordered map.
In practice I need a way to efficiently convert a value to a key.
`tokenmap`s do not allow this.

On 02/22/2010 03:51 AM, OvermindDL1 wrote:
On Sun, Feb 21, 2010 at 4:44 PM, Andrey Semashev <andrey.semashev@gmail.com> wrote:
How is it different from having an iterator or a pointer to the inserted element?
Because those can get invalidated if the map resizes I would wager...
No, they don't. IIRC, both are stable during the insertion/erasure. In case of unordered containers, rehashing invalidates iterators, but not pointers.

From the rationale page of the documentation, it appears that one of the intended applications of a tokenmap is in the implementation of state management on top of an otherwise stateless protocol (such as HTTP). With HTTP, a common trick of tracking state (whether a user is logged in, their username, etc.) is to set a cookie that contains a session id. The client stores the session id and the server serializes all state data, saving it on the server side, and associates this data with that id. For each subsequent HTTP request by the client, the client sends the session id to the server, which then finds the serialized data. The server can not trust anything that the client sends, so the server needs a way to verify the id. At a minimum, an associative container of some sort is needed that maps the session id (token) to the serialized data. To safely use an iterator or a pointer to the inserted element, the server would need a separate map from tokens to iterators. The tokenmap solution combines the two maps that would be involved (token to session data and token to iterator into the other map, with the token-to-iterator map being extremely fast in performing lookups).
How is it different from having an iterator or a pointer to the inserted element?

On 02/22/2010 03:51 AM, Daniel Trebbien wrote:
From the rationale page of the documentation, it appears that one of the intended applications of a tokenmap is in the implementation of state management on top of an otherwise stateless protocol (such as HTTP).
With HTTP, a common trick of tracking state (whether a user is logged in, their username, etc.) is to set a cookie that contains a session id. The client stores the session id and the server serializes all state data, saving it on the server side, and associates this data with that id. For each subsequent HTTP request by the client, the client sends the session id to the server, which then finds the serialized data.
The server can not trust anything that the client sends, so the server needs a way to verify the id. At a minimum, an associative container of some sort is needed that maps the session id (token) to the serialized data. To safely use an iterator or a pointer to the inserted element, the server would need a separate map from tokens to iterators. The tokenmap solution combines the two maps that would be involved (token to session data and token to iterator into the other map, with the token-to-iterator map being extremely fast in performing lookups).
In that case I would prefer the token generation mechanism to be separated from the container structure. This would enable users to use tokens in different containers, from STL to Boost.MultiIndex and Boost.Intrusive. The proposed tokenmap would be equivalent to the standard unordered containers with the tokened values then.

On 23 February 2010 09:02, Andrey Semashev <andrey.semashev@gmail.com> wrote:
In that case I would prefer the token generation mechanism to be separated from the container structure. This would enable users to use tokens in different containers, from STL to Boost.MultiIndex and Boost.Intrusive. The proposed tokenmap would be equivalent to the standard unordered containers with the tokened values then.
I don't think you could do that without special support from the container. For MultiIndex you could implement a new index type. It could possibly reuse this implementation if it had some sort of ownership policy but MultiIndex normally implements that kind of thing itself. I can't see any benefit from Intrusive in this case. There are no collisions so there's no need to do anything like chaining the elements. Daniel.

On 02/23/2010 03:07 PM, Daniel James wrote:
On 23 February 2010 09:02, Andrey Semashev<andrey.semashev@gmail.com> wrote:
In that case I would prefer the token generation mechanism to be separated from the container structure. This would enable users to use tokens in different containers, from STL to Boost.MultiIndex and Boost.Intrusive. The proposed tokenmap would be equivalent to the standard unordered containers with the tokened values then.
I don't think you could do that without special support from the container.
Why?
For MultiIndex you could implement a new index type. It could possibly reuse this implementation if it had some sort of ownership policy but MultiIndex normally implements that kind of thing itself.
I can't see any benefit from Intrusive in this case. There are no collisions so there's no need to do anything like chaining the elements.
Session information in the suggested use case may not be suitable for storing in a regular container (say, for the lack of copyability). Also, I might want different access types to the collection of sessions, token-based lookup being one of them. Boost.Intrusive is very efficient in cases like these, and allowing to use tokens in intrusive hash containers makes perfect sense to me.

On 23 February 2010 13:38, Andrey Semashev <andrey.semashev@gmail.com> wrote:
On 02/23/2010 03:07 PM, Daniel James wrote:
I don't think you could do that without special support from the container.
Why?
The token generation is tied to the data structure, so that a generated tokens will hash to a free location and, if the container resizes, all the tokens will still hash to unique locations.
Session information in the suggested use case may not be suitable for storing in a regular container (say, for the lack of copyability).
There's going to be a pointer based container.
Also, I might want different access types to the collection of sessions, token-based lookup being one of them.
An ownership policy would allow this.
Boost.Intrusive is very efficient in cases like these, and allowing to use tokens in intrusive hash containers makes perfect sense to me.
Intrusive hash containers have slower lookup and require more memory since they have to deal with collisions. Daniel

I think that there is a problem with the `find` and `pop` members of `tokenmap`: in `find_free_slot`, which is used by the `insert` member, you apply collision resolution to find an unused element of `store_` where the new element can be inserted, but the `find` and `pop` members do not employ a loop when looking up the given key (to account for the possibility that they key resulted in a collision when it was inserted). It could be, for example, that the element of the given key was placed at `ix + 3` by `find_free_slot`, in which case `find` and `pop` would fail to find the element. Also, there are some improvements that can be made: supporting an allocator type and getting rid of the floating-point arithmetic around where you check whether a doubling of the capacity can succeed (lines 175-176; you can use the `max_size` member of `std::vector` and halve the result, rather than double `store_.capacity()`). By the way, you might be interested in taking a look at the JDK implementation of Java's `java.util.HashMap` class because the bucketing strategy, and the use of an underlying array for storage, is very similar.

Daniel, your client-server example nailed the purpose of tokenmap quite nicely.
new element can be inserted, but the `find` and `pop` members do not employ a loop when looking up the given key (to account for the possibility that they key resulted in a collision when it was inserted).
Once the key is generated, it's "perfect" and collisions in find or pop are not possible.
Also, there are some improvements that can be made: supporting an allocator type
OK
and getting rid of the floating-point arithmetic around where you check whether a doubling of the capacity can succeed (lines 175-176; you can use the `max_size` member of `std::vector` and halve the result, rather than double `store_.capacity()`).
I will look into it. Thanks.

Once the key is generated, it's "perfect" and collisions in find or pop are not possible.
I spent some time trying to understand the secret of this property, and came up with a description of the problem and solution which I thought was helpful. A `tokenmap` is essentially a dynamic array-backed hash map (from tokens [32-bit unsigned integers, WLOG] to `mapped_type` objects) with custom find, rehash, and insert members. Calling the array `store` and its size `store_size`, the find member returns the `mapped_type` object `store[key % store_size]`. The rehash member doubles the size of `store`, and for each element in the old store, the element is placed at `new_store[key % 2*old_store_size]`. Finally, the insert member basically finds an unused element of `store` where the given `mapped_type` object can be placed, and returns some token value (key) such that the token mod `store_size` is the index where the `mapped_type` object was placed. I denote the set of integers by Z and the ring of integers mod s by Z_s. Elements of Z_s are denoted with so-called "class notation": [x]_s is the set of integers that have the same remainder as x when divided by s; in other words, for x an integer, [x]_s is the set of all integers that are congruent mod s. The size of `store` begins with s_0, the initial size, and proceeds to s_1 after the first rehash. Generally, s_k is the size of the store after k rehashes. In order for the token-generating strategy to be "perfect", then the key property of generated tokens x and y is: [x]_{s_0} != [y]_{s_0} -> [x]_{s_k} != [y]_{s_k} (If x and y mod s_0 have different indices in the store of length s_0, then they must have different indices in the store of length s_k.) and: [x]_{s_k} != [y]_{s_k} -> [x]_{s_{k + 1}} != [y]_{s_{k + 1}} Some choices of the sequence {s_k} will not work. For example, suppose s_0 was 2 and s_1 was 3. Then, tokens 0x25b786ef and 0x0010b86c can be generated while the store size is 2 (because mod 2, they are 1 and 0, respectively), but once the store size becomes 3, then they hash to the same location (they are both 2 mod 3). In `tokenmap`, the sequence of sizes is 2^k*s_0. Generally, because the size of the store is always a multiple of the initial size, then the "perfect" property is maintained; I claim that [x]_s != [y]_s -> [x]_{c*s} != [y]_{c*s} for any natural number c. In order to prove the claim, I consider the logically-equivalent contrapositive: [x]_{c*s} == [y]_{c*s} -> [x]_s == [y]_s: [x]_{c*s} == [y]_{c*s} -> c*s | (x - y) c*s | (x - y) -> s | (x - y) -> [x]_s == [y]_s qed. So, in conclusion: 1. In order to maintain the "perfect" property, the size of the store only needs to be a multiple of the initial size. 2. Special support from the container is needed so that the size of the store is always a multiple of the initial size and rehashing is performed correctly.

Daniel Trebbien wrote:
I spent some time trying to understand the secret of this property, and came up with a description of the problem and solution which I thought was helpful.
Although I disagree with your conclusion that the container as-is doesn't guarantee perfect hashing, your posting was very insightful. Please read on why I think so. 1. In order to maintain the "perfect" property, the size of the store
only needs to be a multiple of the initial size.
Correct. The container grows exponentially, that is: a) SIZE_new = SIZE_prev * 2 where the initial size is a multiple of 2. Specifically, the constructor establishes the initial size as: double target_cap = (1 << key_type(ceil(log(capacity_hint)/log(2)))); Subsequently, the container grows exactly following the formula (a): double target_cap = store_.capacity() << 1; after it fails to obtain a free slot or the load_factor tells it so. 2. Special support from the container is needed so that the size of
the store is always a multiple of the initial size and rehashing is performed correctly.
I agree. The container does not allow users to influence its size past construction, ensuring the invariant of perfect hashing stays. Given the conditions 1 and 2 from your posting are met by the container's present implementation, I believe there is no problem. -sl

Although I disagree with your conclusion that the container as-is doesn't guarantee perfect hashing, your posting was very insightful. Please read on why I think so.
I don't quite understand. I was hoping to explain that I think your implementation does guarantee perfect hashing. Maybe I should not have used the word "problem" (it wasn't a good use of the word). By "problem" I meant a general problem, like a "math problem", but not a problem with your code. With full context, the "problem" was the problem of writing a hash map that implements perfect hashing.
Given the conditions 1 and 2 from your posting are met by the container's present implementation, I believe there is no problem.
Agreed. I, too, believe that there is no problem with your implementation.

On Tue, Mar 2, 2010 at 9:44 PM, Daniel Trebbien <dtrebbien@gmail.com> wrote:
I was hoping to explain that I think your implementation does guarantee perfect hashing.
:) If OK, I suggest to include your explanation of the "problem" in the tokenmap documentation. Let me know if you prefer to format/edit it beforehand. The text would live in this file: http://svn.boost.org/svn/boost/sandbox/tokenmap/libs/tokenmap/doc/tokenmap.q...

If OK, I suggest to include your explanation of the "problem" in the tokenmap documentation. Let me know if you prefer to format/edit it beforehand. The text would live in this file:
http://svn.boost.org/svn/boost/sandbox/tokenmap/libs/tokenmap/doc/tokenmap.q... Yes. Seems like a great idea. I would like to rewrite it, possibly expand some parts, and format it a little, but I haven't worked with QBK files before so it might take me some time. I'll send an e-mail to your paramay account with the new text when I am finished.

Interesting. As I was doing some more research on token maps, I discovered a template in the ACE library <http://www.cs.wustl.edu/~schmidt/ACE.html>which seems to have the same purpose as tokenmap: ACE_Active_Map_Manager<http://www.dre.vanderbilt.edu/Doxygen/Stable/ace/a00006.html> . Do you know of any other libraries (C++ or otherwise) that provide functionality that is similar to tokenmap?

Daniel Trebbien wrote:
Interesting. As I was doing some more research on token maps, I discovered a template in the ACE library
From reading the doc, the functionality is very similar.
Do you know of any other libraries (C++ or otherwise) that provide functionality that is similar to tokenmap?
No, I don't. I'm glad you found the ACE one though. It surely will be added to the "similar implementations" section. I briefly went over the ace/Active_Map_Manager code and saw they keep track of occupied/free lists. Their implementation is quite a reading...

I found another implementation in http://www.cognitionresearch.org.uk/source_code/sp61_tt.zip by Tichomir Tenev. The archive contains a header, `idmap.hpp`, that defines the `idmap` template: " // Implements a map between integer IDs and values of type T // (specified by the template argument). Given its ID, a value // can be accessed in O(1). // // Insertions into the map also occur in O(1). // // IDs are assigned by the map and cannot be specified by the // client. Prior to adding an element into the map, the client // can call next_avail_id() to get the next ID that will be used " It's different and quite interesting to look through.

On 21 February 2010 20:59, Slawomir Lisznianski <slisznia@paramay.com> wrote:
http://svn.boost.org/svn/boost/sandbox/tokenmap/libs/tokenmap/doc/html/index...
A tokenmap is a data structure that uniquely maps pseudo-random generated keys with elements of the collection.
I think it looks useful. I have some comments. It looks like the code for finding a free slot could get slow as the index fills up. I suppose a small load factor would help, but a better alternative might be to use the pointers in empty slots to maintain a linked list of free slots. There would be an additional cost when creating and resizing the container, but I think inserting should be faster and it would avoid the need for a load factor. It would also be useful to be able choose the random number generator, probably using Boost.Random, as your generator doesn't look very strong. It might also be useful to able to obfuscate the key by some means, so that the index isn't exposed. A single customization point might be useful; say, a stateful object with two functions, one which takes an index and capacity and returns a token, and one which takes a token and capacity and returns an index (with a suitable guarantee that the index would be unique for larger capacities). That might overcomplicate things though, since obfuscation could be done outside of the container. Daniel

Daniel James wrote:
index fills up. I suppose a small load factor would help, but a better alternative might be to use the pointers in empty slots to maintain
sounds really good. fyi, in early tests under various load factors and usage scenarios, the insertion times don't appear to be affected much at all, as long as the 0.5 < load_factor < 0.7 condition is met (less than 0.5 doesn't help much, only wastes space).
creating and resizing the container, but I think inserting should be faster and it would avoid the need for a load factor.
agree
It would also be useful to be able choose the random number generator,
absolutely. consider the one there for "demo" purposes, plus it allowed me to come up with a proof of concept that has no external dependencies. as a side note, the final generator should adopt to the token-type template param, as different key sizes ask for different (in complexity) hashing.
probably using Boost.Random, as your generator doesn't look very strong.
it shows decent distribution. btw, it's based off of research done by Thomas Wang, http://www.concentric.net/~Ttwang/tech/inthash.htm I agree that in the end, one should be able to plug an external generator which is adaptive and more robust.
It might also be useful to able to obfuscate the key by some means, so that the index isn't exposed.
as a side note, if you let the container "grow", the index and random parts are indistinguishable. unfortunately, there is some cost to doing that, as opposed to starting with a sufficiently large container (using capacity hint) at which point the index part becomes fairly obvious.
that the index would be unique for larger capacities). That might overcomplicate things though, since obfuscation could be done outside of the container.
that's how I'd typically solve it too, by using external obfuscation strategy of sort. Thanks.

AMDG Slawomir Lisznianski wrote:
Daniel James wrote:
probably using Boost.Random, as your generator doesn't look very strong.
it shows decent distribution. btw, it's based off of research done by Thomas Wang, http://www.concentric.net/~Ttwang/tech/inthash.htm
That page does say: There has been queries whether these mix functions can be used for pseudo random purposes. Although the out does look random to the naked eye, the official recommendation is to use a real pseudo random number generator instead, such as the Mercenne Twister. The hash functions listed in this article were only tested for hashing purpose. No tests of randomness were performed. In Christ, Steven Watanabe
participants (9)
-
Andrey Semashev
-
Daniel James
-
Daniel Trebbien
-
Giovanni Piero Deretta
-
joaquin@tid.es
-
OvermindDL1
-
Slawomir Lisznianski
-
Slawomir Lisznianski
-
Steven Watanabe