
Hello JOAQUIN, Monday, June 11, 2007, 1:46:53 AM, you wrote:
My technique was based on a number of hash tables and a little knowledge of the nature of the wildcards. :) Do you consider this kind of index could be generalized more to be included in B.MI (some kind of "like_index" or "matching_index")?
Could be. The scenario looks very attractive, and rather intriguing too. Could I learn more about your implementation techniques? This does not seem an algorithmically trivial problem at first sight.
Like I said, I used the fact that only the end of the string may be masked. This allowed me to compute a hash over the fixed chars at the beginning of a wildcard and guarantee that this hash equals the corresponding partial hash of a complete string, computed over the same number of leading chars. For example:

abc*
abc??*
abc???

will all have the same hash, and it will be equal to the hash of "abcde" computed over the first 3 chars.

Since all wildcards pass through the insert methods, I can determine the maximum number of leading fixed chars among them. This lets me compute only the needed partial hashes of the full strings; for "abcde":

h0 - for the empty string ""
h1 - for "a"
h2 - for "ab"
h3 - for "abc"

The other hashes (for "abcd" and "abcde") are not needed and thus not computed.

Obviously, the three wildcards above would reside in the same bucket in this approach, so I had to order them within the bucket by priority:

abc???
abc??*
abc*

That guarantees that a lookup yields the most specific matching wildcard. The downside is that a bucket (if the number of buckets is not very big) may contain many elements, which decreases lookup speed. That's why I went further and separated the hash tables (i.e. the lists of buckets) by the number of leading fixed chars. This makes wildcards like

abc???
de*
f????

reside in different hash tables even if their hashes are equal. The lookup algorithm stays the same, except that it is repeated for each number of leading fixed chars, in decreasing order. The worst case is N hash table lookups, where N is the maximum number of leading fixed chars (most likely 3 to 5 in my case). Since a bucket will typically contain no more than one element, there will commonly be no more than 5 matching attempts per lookup. Not bad when there are 100 000 elements in the container.
However, this may look like a questionable optimization in general, and I have yet to test whether it does any good. But at least it simplified the implementation, as I didn't have to complicate things with a dynamic rehashing heuristic.

As you can see, my solution is tightly coupled to the specific problem I faced. A B.MI index would have to be far more general, if such an index is to be developed.

--
Best regards,
Andrey
mailto:andysem@mail.ru