[multi_index] User-defined indices

Hello, Reading the Boost.MultiIndex docs I found that support for user-defined index types is planned. Quite occasionally recently I had the need of such functionality and finally developed my own container class for that. Are there any more concrete plans on this improvement of Boost.MultiIndex (a release version or a date)? I guess, the major part of it will be documenting the interface between indices and container itself? -- Best regards, Andrey mailto:andysem@mail.ru

----- Mensaje original ----- De: Andrey Semashev <andysem@mail.ru> Fecha: Viernes, Junio 8, 2007 6:04 pm Asunto: [boost] [multi_index] User-defined indices Para: boost@lists.boost.org
Hello,
Reading the Boost.MultiIndex docs I found that support for user-defined index types is planned. Quite occasionally recently I had the need of such functionality and finally developed my own container class for that. Are there any more concrete plans on this improvement of Boost.MultiIndex (a release version or a date)? I guess, the major part of it will be documenting the interface between indices and container itself?
Hello Andrey, That feature is there on the future work section, but for the moment being I don't really know when/if is going to be provided. The main problem with documenting the internal interface of indices is that I'd be bound to freeze it, and as it happens I've had ocassionally the need to modify the interface or augment it to support new funcionalities or optimize some aspects. I'm curious about those special indices you've been in need of. If they are of general interest and you want to contribute the code maybe they can be adapted and included into the lib. Also, maybe if you explain their purpose we can come up with a way to have the functionality with the current public interface of Boost.MultiIndex. Best regards, Joaquín M López Muñoz Telefónica, Investigación y Desarrollo

Hello JOAQUIN, Friday, June 8, 2007, 8:31:55 PM, you wrote:
Hello Andrey,
That feature is there on the future work section, but for the moment being I don't really know when/if is going to be provided. The main problem with documenting the internal interface of indices is that I'd be bound to freeze it, and as it happens I've had ocassionally the need to modify the interface or augment it to support new funcionalities or optimize some aspects.
It's a pity to hear that. Although I share your worries about interface modification problems, I believe that as all major STL-style indices are covered in the lib, the main direction of its evolution is to allow users to extend it. You won't be able to provide indices for every real life case, though you will cover quite a number of them. I'm not familiar with the implementation, but I think it should be possible to provide at least basic user-level interface without constraining you much to do some magic behind the scene in your out-of-box indices. I had some experience in developing similar containers (a bit more case-specific, though) and it almost always comes down to a std::list of nodes and whatever-you-need containers that hold iterators to the list of nodes and represent indices. If your design is similar to that, I don't see many problems if a user sees list-like interface of the main container and some helper facilities to implement index iterators and projecting.
I'm curious about those special indices you've been in need of. If they are of general interest and you want to contribute the code maybe they can be adapted and included into the lib. Also, maybe if you explain their purpose we can come up with a way to have the functionality with the current public interface of Boost.MultiIndex.
Well, I'm not sure it's generic enough to be contributed. The container I developed holds elements (not necessarily std::pairs). Each element contains a string member which is a wildcard as a key. The wildcard may contain fixed characters and placeholders '?' (any single char) and '*' (any number of any chars). Overlapping wildcards (two wildcards are overlapping if they are different, but there is a string which they both match to) are allowed to coexist in the container, but not the same ones. A user, having a complete string, is able to find an element whose wildcard matches most closely (i.e. is the most detailed among all wildcards in the container that match the string). Additionally, he is able to find all elements whose wildcards match the string. On top of that, it should be possible to operate on the container elements by having their unique integer identifiers (which essentially means another index). The major requirement was lookup performance, even in a reasonable expense of storage overhead and other operations speed. A typical sizes would be tens to hundreds of thousands of elements. I failed to make use of B.MI ordered or hashed indices because of the constraints they apply to the CompatibleKey, CompatibleCompare, CompatibleHash and CompatiblePred objects I would have to provide. My technique was based on a number of hash tables and a little knowledge of the wildcards nature. :) Do you consider this kind of index could be generalized more to be included in B.MI (some kind of "like_index" or "matching_index")? -- Best regards, Andrey mailto:andysem@mail.ru

Well, I'm not sure it's generic enough to be contributed. The container I developed holds elements (not necessarily std::pairs). Each element contains a string member which is a wildcard as a key. The wildcard may contain fixed characters and placeholders '?' (any single char) and '*' (any number of any chars). Overlapping wildcards (two wildcards are overlapping if they are different, but there is a string which they both match to) are allowed to coexist in the container, but not the same ones. A user, having a complete string, is able to find an element whose wildcard matches most closely (i.e. is the most detailed among all wildcards in the container that match the string). Additionally, he is able to find all elements whose wildcards match the string. On top of that, it should be possible to operate on the container elements by having their unique integer identifiers (which essentially means another index).
The major requirement was lookup performance, even in a reasonable expense of storage overhead and other operations speed. A typical sizes would be tens to hundreds of thousands of elements.
I failed to make use of B.MI ordered or hashed indices because of the constraints they apply to the CompatibleKey, CompatibleCompare, CompatibleHash and CompatiblePred objects I would have to provide.
If given two strings s1 and s2 with or without wildcards you can define an order between them, something like: ? * ?? ?* . . a a? a* . . You can use ordered_index if your application can work with O(log(n)) lookup. With respect to the structure layout you are using, a bimap can be used: bimap< set_of<string, WildcardOrder>, unordered_set_of<int> > Best regards Matias

Hello Matias, Sunday, June 10, 2007, 1:04:47 AM, you wrote:
I failed to make use of B.MI ordered or hashed indices because of the constraints they apply to the CompatibleKey, CompatibleCompare, CompatibleHash and CompatiblePred objects I would have to provide.
If given two strings s1 and s2 with or without wildcards you can define an order between them, something like:
? * ?? ?* .. .. a a? a* .. ..
You can use ordered_index if your application can work with O(log(n)) lookup.
I don't see how could I do that since for ordering the elements in the container I need strict weak ordering, and for element lookup by a full string I need a more relaxed ordering. For example: ax a? a* Now, if I search by "ab" string, there are two wildcards that fit the string (and in terms of ordered_index, there are two elements whose keys are equivalent to the string). And this is not allowed by the requirements in the docs. BTW, the second element should have been found since "a?" describes "ab" more precisely than "a*".
With respect to the structure layout you are using, a bimap can be used:
bimap< set_of<string, WildcardOrder>, unordered_set_of<int> >
No, bimap doesn't seem to suit me since the container holds not only the wildcard and int. -- Best regards, Andrey mailto:andysem@mail.ru

On 6/10/07, Andrey Semashev <andysem@mail.ru> wrote:
Hello Matias,
Sunday, June 10, 2007, 1:04:47 AM, you wrote:
I failed to make use of B.MI ordered or hashed indices because of the constraints they apply to the CompatibleKey, CompatibleCompare, CompatibleHash and CompatiblePred objects I would have to provide.
If given two strings s1 and s2 with or without wildcards you can define an order between them, something like:
? * ?? ?* .. .. a a? a* .. ..
You can use ordered_index if your application can work with O(log(n)) lookup.
I don't see how could I do that since for ordering the elements in the container I need strict weak ordering, and for element lookup by a full string I need a more relaxed ordering. For example:
ax a? a*
It was a long shot, I was more interested in you trying out bimap :) Anyway as you have put it, it seem that the lexicographical approach is not bothering you to much, you may want to try orered_non_unique that allows you to insert multiple keys.
With respect to the structure layout you are using, a bimap can be used:
bimap< set_of<string, WildcardOrder>, unordered_set_of<int> >
No, bimap doesn't seem to suit me since the container holds not only the wildcard and int.
You can use it anyway :) The last version of the lib includes a new feature called: relations information. If you do not need indexes over that information you can include it in bimap: bimap< set_of<string, WildcardOrder>, unordered_set_of<int>, with_info< string > > And then: bm.left.find("ab*c")->info = "info"; // or... bm.left.info_at("ab*c") = "info"; bm.right.find(3)->info = "info"; // or... bm.right.info_at(3) = "info"; Best regards Matias

Hello Matias, Sunday, June 10, 2007, 3:33:03 PM, you wrote:
On 6/10/07, Andrey Semashev <andysem@mail.ru> wrote:
Hello Matias,
You can use ordered_index if your application can work with O(log(n)) lookup.
I don't see how could I do that since for ordering the elements in the container I need strict weak ordering, and for element lookup by a full string I need a more relaxed ordering. For example:
ax a? a*
It was a long shot, I was more interested in you trying out bimap :) Anyway as you have put it, it seem that the lexicographical approach is not bothering you to much, you may want to try orered_non_unique that allows you to insert multiple keys.
That allows the same wildcards to coexist in the container which is not what I need either.
No, bimap doesn't seem to suit me since the container holds not only the wildcard and int.
You can use it anyway :) The last version of the lib includes a new feature called: relations information. If you do not need indexes over that information you can include it in bimap:
bimap< set_of<string, WildcardOrder>, unordered_set_of<int>, with_info< string > >
And then:
bm.left.find("ab*c")->info = "info"; // or... bm.left.info_at("ab*c") = "info"; bm.right.find(3)->>info = "info"; // or... bm.right.info_at(3) = "info";
I guess, if I need to put more than one value as info, I have to make a struct of them? struct MyInfo { string Data1; int Data2; bool Data3; }; bimap< set_of<string, WildcardOrder>, unordered_set_of<int>, with_info< MyInfo > > What is the value_type of the container then? What are the names of the two keys and info in it? As I'm not hoping it is more or less relevant to their real semantic, I don't think it's the best choice for me, because this value_type is exposed to other components of the application. Obscure names of value_type members an dependency on bimap are very undesirable. -- Best regards, Andrey mailto:andysem@mail.ru

Hello JOAQUIN,
Friday, June 8, 2007, 8:31:55 PM, you wrote:
Hello Andrey,
That feature is there on the future work section, but for the moment being I don't really know when/if is going to be provided. The main problem with documenting the internal interface of indices is that I'd be bound to freeze it, and as it happens I've had ocassionally the need to modify the interface or augment it to support new funcionalities or optimize some aspects.
It's a pity to hear that. Although I share your worries about interface modification problems, I believe that as all major STL-
----- Mensaje original ----- De: Andrey Semashev <andysem@mail.ru> Fecha: Sábado, Junio 9, 2007 10:22 pm Asunto: Re: [boost] [multi_index] User-defined indices Para: "\"JOAQUIN LOPEZ MU?Z\"" <boost@lists.boost.org> style
indices are covered in the lib, the main direction of its evolution is to allow users to extend it. You won't be able to provide indices for every real life case, though you will cover quite a number of them.
I'm not familiar with the implementation, but I think it should be possible to provide at least basic user-level interface without constraining you much to do some magic behind the scene in your out-of-box indices. I had some experience in developing similar containers (a bit more case-specific, though) and it almost always comes down to a std::list of nodes and whatever-you-need containers that hold iterators to the list of nodes and represent indices. If your design is similar to that, I don't see many problems if a user sees list-like interface of the main container and some helper facilities to implement index iterators and projecting.
It is not exactly like that, if I'm understanding your description correctly. Indices do not hold iterators to some base container, which would be the simplest implementation technique, but they're rather much more tightly coupled to each other internally. For instance, a multi_index_container consisting of N indices holds only one node per element --this node is constructed by metaprogrammatically aggregating in one single struct the N headers for each index. The savings achieved by this technique are calculated at http://boost.org/libs/multi_index/doc/performance.html The inner workings of the indices are also constrained by some internal APIs that must be complied with so that multi_index_containers can assemble and use the indices.
I'm curious about those special indices you've been in need of. If they are of general interest and you want to contribute the code maybe they can be adapted and included into the lib. Also, maybe if you explain their purpose we can come up with a way to have the functionality with the current public interface of Boost.MultiIndex.
Well, I'm not sure it's generic enough to be contributed. The container I developed holds elements (not necessarily std::pairs). Each element contains a string member which is a wildcard as a key. The wildcard may contain fixed characters and placeholders '?' (any single char) and '*' (any number of any chars). Overlapping wildcards (two wildcards are overlapping if they are different, but there is a string which they both match to) are allowed to coexist in the container, but not the same ones.
Do '*'s always happen at the end of the string or are they allowed to be used in the middle of a expression, for instance like in "ba*ing"?
A user, having a complete string, is able to find an element whose wildcard matches most closely (i.e. is the most detailed among all wildcards in the container that match the string).
In O(log n) time?
Additionally, he is able to find all elements whose wildcards match the string. On top of that, it should be possible to operate on the container elements by having their unique integer identifiers (which essentially means another index).
The major requirement was lookup performance, even in a reasonable expense of storage overhead and other operations speed. A typical sizes would be tens to hundreds of thousands of elements.
I failed to make use of B.MI ordered or hashed indices because of the constraints they apply to the CompatibleKey, CompatibleCompare, CompatibleHash and CompatiblePred objects I would have to provide.
My technique was based on a number of hash tables and a little knowledge of the wildcards nature. :) Do you consider this kind of index could be generalized more to be included in B.MI (some kind of "like_index" or "matching_index")?
Could be. The scenario looks very attractive, and rather intriguing too. Could I learn more about your implementation techniques? This does not seem an algoritmhically trivial problem at first sight. Joaquín M López Muñoz Telefónica, Investigación y Desarrollo

Hello JOAQUIN, Monday, June 11, 2007, 1:46:53 AM, you wrote:
Hello JOAQUIN,
Friday, June 8, 2007, 8:31:55 PM, you wrote:
I'm not familiar with the implementation, but I think it should be possible to provide at least basic user-level interface without constraining you much to do some magic behind the scene in your out-of-box indices. I had some experience in developing similar containers (a bit more case-specific, though) and it almost always comes down to a std::list of nodes and whatever-you-need containers that hold iterators to the list of nodes and represent indices. If your design is similar to that, I don't see many problems if a user sees list-like interface of the main container and some helper facilities to implement index iterators and projecting.
It is not exactly like that, if I'm understanding your description correctly. Indices do not hold iterators to some base container, which would be the simplest implementation technique, but they're rather much more tightly coupled to each other internally. For instance, a multi_index_container consisting of N indices holds only one node per element --this node is constructed by metaprogrammatically aggregating in one single struct the N headers for each index. The savings achieved by this technique are calculated at
The inner workings of the indices are also constrained by some internal APIs that must be complied with so that multi_index_containers can assemble and use the indices.
Ok, this makes things more complicated but still possible. I see that the node structure becomes dependant on the index nature and there is no longer "ultimate" begin or end of the container which makes it difficult to clear or destruct it. However, the implementation does not restrict indices from having their internal data that helps them to implement their logic. In fact, I suspect that random_access and hashed indices use this ability. Therefore, if there is: - a way to tell the index the complete node type and means to extract the container value and index's header form it - a way to tell multi_index_container the type of index's header and iterators - a way to notify the index about modification events (such as insert and erase) performed through other indices (at least on per-node basis) then there is a quite usable basic interface to implement user's indices.
Well, I'm not sure it's generic enough to be contributed. The container I developed holds elements (not necessarily std::pairs). Each element contains a string member which is a wildcard as a key. The wildcard may contain fixed characters and placeholders '?' (any single char) and '*' (any number of any chars). Overlapping wildcards (two wildcards are overlapping if they are different, but there is a string which they both match to) are allowed to coexist in the container, but not the same ones.
Do '*'s always happen at the end of the string or are they allowed to be used in the middle of a expression, for instance like in "ba*ing"?
This was an acceptable restriction in my case. But if we plan to make it more general, we can't rely neither on this nor on the fact we're matching strings in the first place. For example, it should be possible to customize such "like_index" to perform element lookups by int, which is the closest to the int keys in the container.
A user, having a complete string, is able to find an element whose wildcard matches most closely (i.e. is the most detailed among all wildcards in the container that match the string).
In O(log n) time?
I'd rather prefer in pseudo-constant time which I get with hash tables. -- Best regards, Andrey mailto:andysem@mail.ru

Andrey Semashev <andysem <at> mail.ru> writes:
Hello JOAQUIN,
Monday, June 11, 2007, 1:46:53 AM, you wrote:
[...]
It is not exactly like that, if I'm understanding your description correctly. Indices do not hold iterators to some base container, which would be the simplest implementation technique, but they're rather much more tightly coupled to each other internally. For instance, a multi_index_container consisting of N indices holds only one node per element --this node is constructed by metaprogrammatically aggregating in one single struct the N headers for each index. The savings achieved by this technique are calculated at
The inner workings of the indices are also constrained by some internal APIs that must be complied with so that multi_index_containers can assemble and use the indices.
Ok, this makes things more complicated but still possible. I see that the node structure becomes dependant on the index nature and there is no longer "ultimate" begin or end of the container which makes it difficult to clear or destruct it.
However, the implementation does not restrict indices from having their internal data that helps them to implement their logic. In fact, I suspect that random_access and hashed indices use this ability.
Correct, these indices allocate their own supplemental structures.
Therefore, if there is:
- a way to tell the index the complete node type and means to extract the container value and index's header form it - a way to tell multi_index_container the type of index's header and iterators - a way to notify the index about modification events (such as insert and erase) performed through other indices (at least on per-node basis)
then there is a quite usable basic interface to implement user's indices.
The interface is not a passive one like you seem to be describing, but an active one in which some operations are performed in an orchestrated manner by all the indices. Allow me to give a very incomplete description of how indices work internally. Rouhgly speaking, an index has the following structure: template<typename Super> class index_implementation { public: // public interface typedef typename Super::value_type value_type; typedef ... iterator; ... iterator begin(); ... iterator find(...)const; ... protected: // backbone typedef ... node_type; ... index_implementation(...); void copy_(...); node_type* insert_(...); void erase_(..); void swap_(...); bool replace_(...); ... }; where Super contains a linear hierarchy of the preceding indices and index_implementation is supposed to derive from this to add itself to the chain. The important part is the backbone protected interface by which all the important "primitive" operations of the multi_index_container are implemented: for instance, index_implementation::insert_(...) must do its part of the insertion operation and then call Super::insert_(...) so that the remaining indices do the same, etc. The public member function insert() then resolves to a mere call to this primitive operation. As you see, the degree of coupling is very high --this has been done so in order to provide maximal efficienfy time- and memory-wise. Joaquín M López Muñoz Telefónica, Investigación y Desarrollo

Hello Joaquin, Tuesday, June 12, 2007, 1:00:41 AM, you wrote:
Andrey Semashev <andysem <at> mail.ru> writes:
Hello JOAQUIN,
Monday, June 11, 2007, 1:46:53 AM, you wrote:
[snip]
Therefore, if there is:
- a way to tell the index the complete node type and means to extract the container value and index's header form it - a way to tell multi_index_container the type of index's header and iterators - a way to notify the index about modification events (such as insert and erase) performed through other indices (at least on per-node basis)
then there is a quite usable basic interface to implement user's indices.
The interface is not a passive one like you seem to be describing, but an active one in which some operations are performed in an orchestrated manner by all the indices. Allow me to give a very incomplete description of how indices work internally. Rouhgly speaking, an index has the following structure:
[snip]
where Super contains a linear hierarchy of the preceding indices and index_implementation is supposed to derive from this to add itself to the chain. The important part is the backbone protected interface by which all the important "primitive" operations of the multi_index_container are implemented: for instance, index_implementation::insert_(...) must do its part of the insertion operation and then call Super::insert_(...) so that the remaining indices do the same, etc. The public member function insert() then resolves to a mere call to this primitive operation. As you see, the degree of coupling is very high --this has been done so in order to provide maximal efficienfy time- and memory-wise.
I'm not sure how you perform insertions then. You have to check if an element is allowed to be inserted before the insertion. That means any index in the hierarchy may reject it and all other indices should remain intact. Anyway, you've managed to implement it some way. Then you just need to document the minimum set of these protected members and, additionally, final_*() members which begin each operation. But there may be ways of decoupling indices. For example, you may implement these final_* functions so they call each index's protected function separately. There would be no need to call base class recursively in each index and I don't think it's going to make a performance impact. This is essentially what I was writing above. In addition this might help to encapsulate exception safety burden in these final_* functions instead of distributing it in indices. -- Best regards, Andrey mailto:andysem@mail.ru

Andrey Semashev ha escrito: [...]
But there may be ways of decoupling indices. For example, you may implement these final_* functions so they call each index's protected function separately. There would be no need to call base class recursively in each index and I don't think it's going to make a performance impact. This is essentially what I was writing above. In addition this might help to encapsulate exception safety burden in these final_* functions instead of distributing it in indices.
I thought about this more decoupled approach when first designing the lib, but it presents serious problems. Let me elaborate: suppose we implement final_insert_() like this (pseudocode follows): node_type* final_insert(v)_ { node_type* x=allocate_node(); for(int i=0;i<N;++i){ if(!index<i>::insert_(v,x)){ for(int j=i;j--;)index<j>::erase_(x); deallocate_node(x); return 0; } } x->value()=v; return x; } This is suboptimal because when a given index bans an insertion we have to undo all the previous work by calling erase_() on the preceding indices. We can try to remedy this by writing the code as follows: node_type* final_insert(v)_ { node_type* x=allocate_node(); for(int i=0;i<N;++i){ if(!index<i>::can_insert_(v)){ deallocate_node(x); return 0; } } for(int i=0;i<N;++i){ index<i>::insert_(v,x)){ } x->value()=v; return x; } But then another problem pops up: the code in can_insert_() performs some calculations that must be repeated when doing the insertion proper. It'd be so much better if we can somehow store this info as some sort of local data that can later be used by insert_()... and the simplest way to do this is to let the insert_() functions the responsibility to cascade the call by themselves: node_type* final_insert(v)_ { node_type* x=allocate_node(); if(!index<0>::insert_(v,x)){ deallocate_node(x); return 0; } x->value()=v; return x; } node_type* insert(v,x)_ { link_info lnk; if(!calculate_link_info(lnk,v)){ // collision return 0; } if(!super::insert_(v,x)_){ // remaining indices return 0; } link(x,lnk); // do the actual linking; return x; // ok } I don't know if the explanation is clear enough... Basically, there might be conceptually cleaner ways to implement the backbone functionality of a multi_index_container, but I doubt they can compare to the current one in terms of efficiency. The drawback is that writing an index becomes an admittedly arduous task. Joaquín M López Muñoz Telefónica, Investigación y Desarrollo

Hello Joaquín, Tuesday, June 12, 2007, 5:33:41 PM, you wrote:
Andrey Semashev ha escrito: [...]
But there may be ways of decoupling indices. For example, you may implement these final_* functions so they call each index's protected function separately. There would be no need to call base class recursively in each index and I don't think it's going to make a performance impact. This is essentially what I was writing above. In addition this might help to encapsulate exception safety burden in these final_* functions instead of distributing it in indices.
I thought about this more decoupled approach when first designing the lib, but it presents serious problems. Let me elaborate: suppose we implement final_insert_() like this (pseudocode follows):
[snip]
This is suboptimal because when a given index bans an insertion we have to undo all the previous work by calling erase_() on the preceding indices. We can try to remedy this by writing the code as follows:
[snip code with can_insert_]
But then another problem pops up: the code in can_insert_() performs some calculations that must be repeated when doing the insertion proper. It'd be so much better if we can somehow store this info as some sort of local data that can later be used by insert_()...
Yes, I'm aware of this problem. I solved it in my implementations by introducing special members of the index class. They are not seen to user and only participate as a temporary storage in these operations. In fact, insertions seems to be the only operation where the problem arises, and in most cases the storage needed is quite small (either an iterator hint or a hash result), so it doesn't make the container significantly larger. I understand that this solution looks awkward, but that's just an optimization, after all. And it allows to simplify indices, which I consider as a great advantage.
and the simplest way to do this is to let the insert_() functions the responsibility to cascade the call by themselves:
[snip]
I don't know if the explanation is clear enough... Basically, there might be conceptually cleaner ways to implement the backbone functionality of a multi_index_container, but I doubt they can compare to the current one in terms of efficiency. The drawback is that writing an index becomes an admittedly arduous task.
Well, that's your call. I can only hope then that your current interface will be documented some day. -- Best regards, Andrey mailto:andysem@mail.ru

Hello JOAQUIN, Monday, June 11, 2007, 1:46:53 AM, you wrote:
My technique was based on a number of hash tables and a little knowledge of the wildcards nature. :) Do you consider this kind of index could be generalized more to be included in B.MI (some kind of "like_index" or "matching_index")?
Could be. The scenario looks very attractive, and rather intriguing too. Could I learn more about your implementation techniques? This does not seem an algoritmhically trivial problem at first sight.
Like I said, I used the fact that only the end of the string may be masked. This allowed me to calculate hash over the fixed chars in the beginning of the wildcards and guarantee that these hash results will be equal to the corresponding partial hash results of the complete strings, calculated for the same number of the first chars of the string. For example: abc* abc??* abc??? will have the same hash, and it will be equal to hash of "abcde", calculated for the first 3 chars. Since I have all wildcards passing through insert methods, I can determine the maximum number of the leading fixed chars in the wildcards. This allows me an ability to calculate only needed partial hash results of the full strings, as follows, for "abcde": h0 - for empty string "" h1 - for "a" h2 - for "ab" h3 - for "abc" Other hash results (for "abcd" and "abcde") are not needed and this not calculated. Obviously, these three wildcards above would reside in the same bucket in this approach, so I had to order them within the bucket by their priority: abc??? abc??* abc* That guarantees that lookup will yield a more detailed wildcard. The downside is that the bucket (if the number of buckets is not very big) may contain many elements which will decrease lookup speed. That's why I went further and separated hash tables (i.e. lists of buckets) by the number of the leading fixed chars. This made these wildcards: abc??? de* f???? reside in different hash tables even if their hash results are equal. The lookup algorithm remains the same except that it repeats for each number of leading fixed chars decreasingly. The worst lookup case would be N hash table lookups, where N is the maximum number of leading fixed chars (which will most likely be 3 to 5 in my case). Considering that the bucket will likely contain no more than one element, there will commonly be no more than 5 matching attempts per a lookup. Not bad if there are 100 000 elements in the container. However, this may look like a questionable optimization in general, and I have yet to test whether it does any good. But at least it simplified implementation as I didn't have to complicate things with dynamic rehashing heuristic. As you may see now, my solution is tightly coupled with the specific problem I faced. B.MI index approach should be way more general, if such an index is to be developed. -- Best regards, Andrey mailto:andysem@mail.ru
participants (5)
-
"JOAQUIN LOPEZ MU?Z"
-
Andrey Semashev
-
Joaquin M Lopez Munoz
-
Joaquín Mª López Muñoz
-
Matias Capeletto