[Review:Algorithms] Comments about search algorithms

Dear All, Here are some observations about the proposed search algorithms: - I was confused as to whether the caller must keep the pattern pointed to by the iterators passed to the constructor after the constructor returns. My guess was that once the internal tables were built the pattern would not be needed any more - but that turns out to be wrong. At the least, the docs need to explain this. - The three algorithms have very similar interfaces. The docs could perhaps define a "searcher" concept and explicitly declare these algorithms' classes to be models of that concept. - std::search is similar to the three algorithms' free function versions (congratulations on getting the arguments in the same order!). For completeness, how about adding a searcher class that uses std::search? - A user faced with these three (or four) algorithms will need to choose one. The docs should offer some guidance, e.g. benchmarks or rules-of-thumb, about which to use. - It would sometimes be useful to have a variant of the constructor that takes a const char* pattern. - The tables store ints. For patterns shorter than 256 characters I think that uint8_t would be sufficient, and would reduce the size of a table from 1024 or 2048 bytes to only 256 bytes. Similarly uint16_t or uint32_t could be used for larger patterns. It is not obvious how to implement this, though; perhaps an optional max_pattern_length template parameter, making it the user's responsibility to ensure the supplied pattern is not too long. - The docs describe some methods that are private to the implementation; why? Regards, Phil.

On Sep 24, 2011, at 11:22 AM, Phil Endecott wrote:
Dear All,
Here are some observations about the proposed search algorithms:
- I was confused as to whether the caller must keep the pattern pointed to by the iterators passed to the constructor after the constructor returns. My guess was that once the internal tables were built the pattern would not be needed any more - but that turns out to be wrong. At the least, the docs need to explain this.
OK.
- The three algorithms have very similar interfaces. The docs could perhaps define a "searcher" concept and explicitly declare these algorithms' classes to be models of that concept.
I'll see what i can do.
- std::search is similar to the three algorithms' free function versions (congratulations on getting the arguments in the same order!). For completeness, how about adding a searcher class that uses std::search?
They weren't at first ;-)
- A user faced with these three (or four) algorithms will need to choose one. The docs should offer some guidance, e.g. benchmarks or rules-of-thumb, about which to use.
OK.
- It would sometimes be useful to have a variant of the constructor that takes a const char* pattern.
You can do that today with ( str, str + strlen (str), but I'll think about how to do it better. Maybe just a specialization for char * (and wchar?) The trick, I think, will be to avoid having a general pointer-based version that insists on null termination.
- The tables store ints. For patterns shorter than 256 characters I think that uint8_t would be sufficient, and would reduce the size of a table from 1024 or 2048 bytes to only 256 bytes. Similarly uint16_t or uint32_t could be used for larger patterns. It is not obvious how to implement this, though; perhaps an optional max_pattern_length template parameter, making it the user's responsibility to ensure the supplied pattern is not too long.
- The docs describe some methods that are private to the implementation; why?
Are you referring to the traits class? If so, that section is there because the implementation is a customization point, rather than a private implementation detail. The current implementation has two options (array and sparse table), and it chooses which one based on the input types - the array is faster, but is really only suitable for character-sized types (which is probably the most common case). The unordered_map is slower, but much more memory efficient for large alphabets. I guess I didn't make this clear in the documentation. I will see if I can make it better. All this is now under: https://github.com/mclow/Boost.Algorithm/issues/8 (for code issues) and https://github.com/mclow/Boost.Algorithm/issues/9 (for documentation issues) -- Marshall Marshall Clow Idio Software <mailto:mclow.lists@gmail.com> A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

Marshall Clow wrote:
- The docs describe some methods that are private to the implementation; why?
Are you referring to the traits class?
No, I'm referring to e.g. // private member functions void build_bm_tables(); template<typename corpusIter> corpusIter do_search(corpusIter, corpusIter, std::size_t) const; template<typename Iter, typename Container> void compute_bm_prefix(Iter, Iter, Container &); void create_suffix_table(patIter, patIter) on the "Class template boyer_moore" reference page. Regards, Phil.

On Sep 25, 2011, at 4:27 AM, Phil Endecott wrote:
Marshall Clow wrote:
- The docs describe some methods that are private to the implementation; why?
Are you referring to the traits class?
No, I'm referring to e.g.
// private member functions void build_bm_tables(); template<typename corpusIter> corpusIter do_search(corpusIter, corpusIter, std::size_t) const; template<typename Iter, typename Container> void compute_bm_prefix(Iter, Iter, Container &); void create_suffix_table(patIter, patIter)
on the "Class template boyer_moore" reference page.
Ok, thanks. [ Those are generated through doxygen; I'll make sure that they get nixed ] -- Marshall Marshall Clow Idio Software <mailto:mclow.lists@gmail.com> A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

Phil Endecott wrote:
- A user faced with these three (or four) algorithms will need to choose one. The docs should offer some guidance, e.g. benchmarks or rules-of-thumb, about which to use.
In particular, I've not yet found a case where KMP is the best choice. BM seems better in all the cases that I've tried and has the same or better complexity bounds. (Well, it might perform slightly better when the pattern length is 1 but in that case, std::find could be used.) Does anyone know of any cases where KMP is likely to be better than BM? (If we can't find any, is it worth including it?) FYI a benchmark program that I've been using is included below. Regards, Phil. #include <iostream> #include <string> #include <cstdlib> #include <algorithm> #include "boost/algorithm/search.hpp" using namespace std; template <typename patIter> class stdsearcher { patIter pat_first, pat_last; public: stdsearcher(patIter first, patIter last): pat_first(first), pat_last(last) {} template <typename corpusIter> corpusIter operator()(corpusIter first, corpusIter last) const { return std::search(first, last, pat_first, pat_last); } }; template <typename searcher> int do_search(const string& pattern, const string& corpus) { searcher s (pattern.begin(), pattern.end()); string::const_iterator i = corpus.begin(); int n = 0; while (1) { i = s(i,corpus.end()); if (i == corpus.end()) break; ++n; ++i; } return n; } static void usage() { cerr << "Usage: search --stdsearch|--bm|--bmh|--kmp <pattern>\n"; exit(1); } int main(int argc, char* argv[]) { if (argc!=3) { usage(); } string algo = argv[1]; string pattern = argv[2]; bool use_stdsearch = false, use_bm = false, use_bmh = false, use_kmp = false; if (algo=="--stdsearch") { use_stdsearch = true; } else if (algo=="--bm") { use_bm = true; } else if (algo=="--bmh") { use_bmh = true; } else if (algo=="--kmp") { use_kmp = true; } else { usage(); } string corpus; while (cin.good()) { string l; getline(cin,l); corpus.append(l); } int matches = 0; for (int round = 0; round<100; ++round) { if (use_stdsearch) { matches = do_search< stdsearcher<string::const_iterator> >(pattern,corpus); } if (use_bm) { matches = do_search< boost::algorithm::boyer_moore<string::const_iterator> >(pattern,corpus); } if (use_bmh) { matches = do_search< boost::algorithm::boyer_moore_horspool<string::const_iterator> >(pattern,corpus); } if (use_kmp) { matches = do_search< boost::algorithm::knuth_morris_pratt<string::const_iterator> >(pattern,corpus); } } cout << matches << " matches found\n"; exit(0); }

Phil Endecott wrote:
FYI a benchmark program that I've been using is included below.
[snip]
for (int round = 0; round<100; ++round) { if (use_stdsearch) { matches = do_search< stdsearcher<string::const_iterator> >(pattern,corpus); } if (use_bm) { [snip]
The conditionals in the loop might affect measurements. Why not multiple loops, each within a conditional? typedef string::const_iterator it_type; if (use_stdsearch) { for (int found(0); round < 100; ++round) { matches = do_search<stdsearched<it_type> >( pattern, corpus); } } else if (use_bm) { . . . Furthermore, you're paying the cost to create and destroy the searchers each iteration. I think some of the searches have higher construction overhead that pays off in faster searches thereafter. _____ Rob Stewart robert.stewart@sig.com Software Engineer using std::disclaimer; Dev Tools & Components Susquehanna International Group, LLP http://www.sig.com ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.
participants (3)
-
Marshall Clow
-
Phil Endecott
-
Stewart, Robert