Re: [boost] [Review:Algorithms] Comments about search algorithms

Phil Endecott wrote:
- A user faced with these three (or four) algorithms will need to choose one. The docs should offer some guidance, e.g. benchmarks or rules-of-thumb, about which to use.
In particular, I've not yet found a case where KMP is the best choice. BM seems better in all the cases that I've tried and has the same or better complexity bounds. (Well, it might perform slightly better when the pattern length is 1 but in that case, std::find could be used.) Does anyone know of any cases where KMP is likely to be better than BM? (If we can't find any, is it worth including it?)
I will try to give a complete answer... The most notable (and common) techniques for improving the sequence searching performance are the bad-character shift (occurrence shift) and the good-suffix shift (matching shift). The Boyer-Moore algorithm implements both these techniques, while the Boyer-Moore-Horspool only implements the bad-character shift and the KMP is merely based on the good-suffix shift. (Although the Boost.algorithm documentation only states that Horspool "is a refinement of the Boyer-Moore algorithm that trades space for time", this is by no means the most notable difference between these two algorithms.) In general, bad-character shift performs very good when the search-pattern is relatively small and the alphabet relatively large. On the other hand, good-suffix shift helps more when the pattern is relatively large and the alphabet small. In the most common cases of text searching, the patterns are usually small and the alphabets large, (> 255) hence the KMP (which is only using good-suffix shift) will almost never outperform BM in usual text-searching situations. (Note that KMP has not been used in the original std::search implementation of the STL, because even the straightforward (brute-force) algorithm has been considered to be superior in the average case!) The only practical case that I could think of, where the KMP could possibly outperform the BM are the DNA searches, where the patterns can be very long and the alphabet has only four characters. And even then, I would recommend careful performance testing, in order to ensure that KMP can be significantly faster than BM. However, in my opinion the most interesting and useful property of the KMP algorithm is not its performance, but its ability to help in building very generic sequence searching algorithms. In fact KMP has helped me to build an efficient generic sequence searching algorithm, which actually works with ranges accessed by single-pass (input) iterators. (See http://www.codeproject.com/KB/stl/single_pass_search.aspx and also in boost-vault/Algorithms) Needless to say that I would be more than happy to submit this work to boost, whenever there is an interest. Best regards Jim Xochellis Homepage: https://sites.google.com/site/xochellis/

dpxguard-boost@yahoo.gr wrote:
Phil Endecott wrote:
- A user faced with these three (or four) algorithms will need to? choose one. ?The docs should offer some guidance, e.g. benchmarks or? rules-of-thumb, about which to use.
In particular, I've not yet found a case where KMP is the best choice. ? BM seems better in all the cases that I've tried and has the same or? better complexity bounds. ?(Well, it might perform slightly better when? the pattern length is 1 but in that case, std::find could be used.) ? Does anyone know of any cases where KMP is likely to be better than? BM? ?(If we can't find any, is it worth including it?)
I will try to give a complete answer...
The most notable (and common) techniques for improving the sequence searching performance are the bad-character shift (occurrence shift) and the good-suffix shift (matching shift). The Boyer-Moore algorithm implements both these techniques, while the Boyer-Moore-Horspool only implements the bad-character shift and the KMP is merely based on the good-suffix shift. (Although the Boost.algorithm documentation only states that Horspool "is a refinement of the Boyer-Moore algorithm that trades space for time", this is by no means the most notable difference between these two algorithms.)
In general, bad-character shift performs very good when the search-pattern is relatively small and the alphabet relatively large. On the other hand, good-suffix shift helps more when the pattern is relatively large and the alphabet small. In the most common cases of text searching, the patterns are usually small and the alphabets large, (> 255) hence the KMP (which is only using good-suffix shift) will almost never outperform BM in usual text-searching situations. (Note that KMP has not been used in the original std::search implementation of the STL, because even the straightforward (brute-force) algorithm has been considered to be superior in the average case!)?
The only practical case that I could think of, where the KMP could possibly outperform the BM are the DNA searches, where the patterns can be very long and the alphabet has only four characters. And even then, I would recommend careful performance testing, in order to ensure that KMP can be significantly faster than BM.
Thanks Jim. Marshall, could you distil something from that for the docs? How efficiently could the current algorithms be used with DNA data? Does anyone have any experience with a packed base-4 representation?
However, in my opinion the most interesting and useful property of the KMP algorithm is not its performance, but its ability to help in building very generic sequence searching algorithms. In fact KMP has helped me to build an efficient generic sequence searching algorithm, which actually works with ranges accessed by single-pass (input) iterators. (See http://www.codeproject.com/KB/stl/single_pass_search.aspx and also in boost-vault/Algorithms) Needless to say that I would be more than happy to submit this work to boost, whenever there is an interest.
I am reminded that Spirit has some sort of adaptor that allows an input iterator to be used with a grammar that requires backtracking by storing recent inputs in a buffer. Presumably something like this could be used here. How much does your algorithm gain over a combination of such a buffer and one of Marshall's algorithms? Regards, Phil.

Oh, and something important which I forgot to mention: Bad-character shifting requires the presence of random-access iterators in order to work efficiently. Consequently, since Boyer-Moore and Boyer-Moore-Horspool are using bad-character shifting, they are only efficient for random-access iterators. Hence KMP will be still useful, even if it is much slower in the most common cases. And a question for Marshall: I have noticed that inside the search.hpp file you have a requirement for random-access iterators regarding the Knuth-Pratt-Morris searching algorithm as well. Is this for both the pattern and the corpus? Phil Endecott wrote:
How efficiently could the current algorithms be used with DNA data? Does anyone have any experience with a packed base-4 representation?
May I propose an easy way: Why not creating random sequences, using only four char values (ie "abcd") This will easily give you a very good idea about the KMP performance. For a more elaborate aproach, you can always use the tests of Musser & Nishanov: http://www.cs.rpi.edu/~musser/gp/gensearch/index.html
Phil Endecott wrote:
However, in my opinion the most interesting and useful property of the KMP algorithm is not its performance, but its ability to help in building very generic sequence searching algorithms. In fact KMP has helped me to build an efficient generic sequence searching algorithm, which actually works with ranges accessed by single-pass (input) iterators. (See http://www.codeproject.com/KB/stl/single_pass_search.aspx and also in boost-vault/Algorithms) Needless to say that I would be more than happy to submit this work to boost, whenever there is an interest.
I am reminded that Spirit has some sort of adaptor that allows an input iterator to be used with a grammar that requires backtracking by storing recent inputs in a buffer. Presumably something like this could be used here. How much does your algorithm gain over a combination of such a buffer and one of Marshall's algorithms?
Very good idea! I will try to compare my algorithm against this adaptor. A very interesting benchmark for me thank you Phil. Best regards Jim Xochellis Homepage: https://sites.google.com/site/xochellis/

On Sep 28, 2011, at 10:05 AM, dpxguard-boost@yahoo.gr wrote:
Oh, and something important which I forgot to mention: Bad-character shifting requires the presence of random-access iterators in order to work efficiently. Consequently, since Boyer-Moore and Boyer-Moore-Horspool are using bad-character shifting, they are only efficient for random-access iterators. Hence KMP will be still useful, even if it is much slower in the most common cases.
And a question for Marshall: I have noticed that inside the search.hpp file you have a requirement for random-access iterators regarding the Knuth-Pratt-Morris searching algorithm as well. Is this for both the pattern and the corpus?
Yes, and the here's the reason for that. There are basically two steps to the KMP algorithm (and frankly, the other two as well), and those two steps are coded in the do_search method. Given a potential match point (i.e position in the pattern and the corpus): 1) test and advance until you find a mismatch or come to the end (and report success) 2) Find the next potential match point. Step #1 can be done with ForwardIterators (or Bidirectional, in the case of Boyer-Moore), but in the second step, you have to be able to advance to arbitrary location in both the pattern and the corpus: // Figure out where to start searching again 544: match_start += idx - skip_ [ idx ]; 545: idx = skip_ [ idx ] >= 0 ? skip_ [ idx ] : 0; In this case, idx is the position in the pattern. Hm.. if the skip table stored iterators instead of indexes, you could just use ForwardIterators..... -- Marshall Marshall Clow Idio Software <mailto:mclow.lists@gmail.com> A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

And a question for Marshall: I have noticed that inside the search.hpp file you have a requirement for random-access iterators regarding the Knuth-Pratt-Morris searching algorithm as well. Is this for both the pattern and the corpus?
Yes, and the here's the reason for that.
There are basically two steps to the KMP algorithm (and frankly, the other two as well), and those two steps are coded in the do_search method.
Given a potential match point (i.e position in the pattern and the corpus): 1) test and advance until you find a mismatch or come to the end (and report success) 2) Find the next potential match point.
Step #1 can be done with ForwardIterators (or Bidirectional, in the case of Boyer-Moore), but in the second step, you have to be able to advance to arbitrary location in both the pattern and the corpus:
// Figure out where to start searching again 544: match_start += idx - skip_ [ idx ]; 545: idx = skip_ [ idx ] >= 0 ? skip_ [ idx ] : 0;
In this case, idx is the position in the pattern.
Hm.. if the skip table stored iterators instead of indexes, you could just use ForwardIterators.....
Using iterators instead of indexes is not a bad idea, but even with indexes, what is the reason for needing random-access iterators for the corpus? Typically in good-suffix shift we are shifting the pattern and not the corpus! I will try to study your code asap in order to become more specific. I ensure you that requiring no more than forward iterators for both the pattern and the corpus is both feasible and important. And it is particularly important, since in the random-access iterators case BM will almost always be faster, hence the usefulness of a KMP implementation, which requires random-access iterators, is rather limited. Best regards Jim Xochellis Homepage: https://sites.google.com/site/xochellis/

On Sep 28, 2011, at 1:45 PM, dpxguard-boost@yahoo.gr wrote:
And a question for Marshall: I have noticed that inside the search.hpp file you have a requirement for random-access iterators regarding the Knuth-Pratt-Morris searching algorithm as well. Is this for both the pattern and the corpus?
Yes, and the here's the reason for that.
There are basically two steps to the KMP algorithm (and frankly, the other two as well), and those two steps are coded in the do_search method.
Given a potential match point (i.e position in the pattern and the corpus): 1) test and advance until you find a mismatch or come to the end (and report success) 2) Find the next potential match point.
Step #1 can be done with ForwardIterators (or Bidirectional, in the case of Boyer-Moore), but in the second step, you have to be able to advance to arbitrary location in both the pattern and the corpus:
// Figure out where to start searching again 544: match_start += idx - skip_ [ idx ]; 545: idx = skip_ [ idx ] >= 0 ? skip_ [ idx ] : 0;
In this case, idx is the position in the pattern.
Hm.. if the skip table stored iterators instead of indexes, you could just use ForwardIterators.....
Using iterators instead of indexes is not a bad idea, but even with indexes, what is the reason for needing random-access iterators for the corpus? Typically in good-suffix shift we are shifting the pattern and not the corpus! I will try to study your code asap in order to become more specific.
Unless I'm sadly mistaken (which has been known to happen), we are shifting both in the pattern and the corpus. In that code, "match_start" is the place where the match might be, and "idx" is where we want to start comparing. If idx == 0, then we start comparing from the beginning of the pattern, but otherwise, from somewhere in the middle.
I ensure you that requiring no more than forward iterators for both the pattern and the corpus is both feasible and important. And it is particularly important, since in the random-access iterators case BM will almost always be faster, hence the usefulness of a KMP implementation, which requires random-access iterators, is rather limited.
Understood. -- Marshall Marshall Clow Idio Software <mailto:mclow.lists@gmail.com> A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

Yes, and the here's the reason for that.
There are basically two steps to the KMP algorithm (and frankly, the other two as well), and those two steps are coded in the do_search method.
Given a potential match point (i.e position in the pattern and the corpus): 1) test and advance until you find a mismatch or come to the end (and report success) 2) Find the next potential match point.
Step #1 can be done with ForwardIterators (or Bidirectional, in the case of Boyer-Moore), but in the second step, you have to be able to advance to arbitrary location in both the pattern and the corpus:
// Figure out where to start searching again 544: match_start += idx - skip_ [ idx ]; 545: idx = skip_ [ idx ] >= 0 ? skip_ [ idx ] : 0;
In this case, idx is the position in the pattern.
Hm.. if the skip table stored iterators instead of indexes, you could just use ForwardIterators.....
Using iterators instead of indexes is not a bad idea, but even with indexes, what is the reason for needing random-access iterators for the corpus? Typically in good-suffix shift we are shifting the pattern and not the corpus! I will try to study your code asap in order to become more specific.
I ensure you that requiring no more than forward iterators for both the pattern and the corpus is both feasible and important. And it is particularly important, since in the random-access iterators case BM will almost always be faster, hence the usefulness of a KMP implementation, which requires random-access iterators, is rather limited.
Furthermore a generic KMP implementation, which requires no more than forward iterators, can be used as drop-in replacement for the std::search of the STL. Just think about the added value... Best regards Jim Xochellis Homepage: https://sites.google.com/site/xochellis/

Marshall Clow wrote:
Unless I'm sadly mistaken (which has been known to happen), we are shifting both in the pattern and the corpus. In that code, "match_start" is the place where the match might be, and "idx" is where we want to start comparing. If idx == 0, then we start comparing from the beginning of the pattern, but otherwise, from somewhere in the middle.
I recommend you to study carefully this info: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080 And it will also help to press the "visualization" button and experiment with the applet. Furthermore I have included bellow a typical C/C++ KMP implementation, which is a slightly modified version of the example presented in the above site. You will clearly see in this sample that the corpus is never shifted. Best regards Jim Xochellis Homepage: https://sites.google.com/site/xochellis/ //************************************* #include <cstdio> #include <cstring> #define XSIZE 256 void KMP(const char *pPattern, int patternSize , const char *pCorpus, int corpusSize); //------------------------------------------ int main(int argc, char* argv[]) { const char *pCorpus= "Knu of Knuth!"; const char *pPattern= "Knuth"; KMP(pPattern, strlen(pPattern), pCorpus, strlen(pCorpus)); getchar(); return 0; } //------------------------------------------ void preKmp(const char *pPattern, int patternSize , int kmpNext[]) { int i, j; i = 0; j = kmpNext[0] = -1; while (i < patternSize ) { while (j > -1 && pPattern[i] != pPattern[j]) j = kmpNext[j]; i++; j++; if (pPattern[i] == pPattern[j]) kmpNext[i] = kmpNext[j]; else kmpNext[i] = j; } } void KMP(const char *pPattern, int patternSize , const char *pCorpus, int corpusSize) { int patternIdx, corpusIdx, kmpNext[XSIZE]; /* Preprocessing */ preKmp(pPattern, patternSize , kmpNext); /* Searching */ patternIdx = corpusIdx = 0; while (corpusIdx < corpusSize) { while (patternIdx > -1 && pPattern[patternIdx] != pCorpus[corpusIdx]) patternIdx = kmpNext[patternIdx]; //<--- Shifting the pattern on mismatch printf("patternIdx: %d -> corpusIdx: %d\n", patternIdx, corpusIdx); patternIdx++; corpusIdx++; //<--- corpus is always increased by 1 if (patternIdx >= patternSize ) { printf("*** Success at: %u\n", corpusIdx - patternIdx); return; //sucess } } puts("*** failure"); }
participants (3)
-
dpxguard-boost@yahoo.gr
-
Marshall Clow
-
Phil Endecott