[GSoC] Some new ideas for this year's GSoC

Hello, I am Stefan and this is my first year at GSoC, as it's the first time I'm eligible to participate as a student. Since I have been using C++ and the Boost libraries for quite some time, I would be thrilled to be able to code for Boost. I have quite a few ideas, and I was wondering which ones you consider most beneficial, so I await your feedback. Also, please let me know if any of these ideas are already implemented in any of the libraries. Here we go:

* Selection algorithm (find the kth smallest element in a container, O(n) worst case, very easy to implement).

* Optimize the Boost String Algorithm finders to use an efficient substring matching algorithm as opposed to the naive algorithm. Most likely because Boost has implemented the finders for strings as generic algorithms, it hasn't used efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find the nth matching substring position using some of these algorithms. Difficulty: easy-medium

* Implement a string algorithm for generating the suffix array. Once precomputed for a given string, the suffix array allows very efficient substring searches. This is useful when the string doesn't change often but there are plenty of substrings to search against.

* Yet another efficient associative data structure: skip lists?

* Space partitioning data structures? kd-trees? quadtrees? octrees? collision detection? closest neighbors?

* A parser capable of evaluating mathematical expressions which contain operands, operators, function calls, variable names and constant names (also considering operator precedence), given:
  1. a list of valid operators described by: a) number of operands (unary/binary/ternary?); b) allowed operand types; c) the precedence (priority) of the operator;
  2. a list of existing functions described by: a) function name; b) number of parameters; c) type of each parameter and return type; d) a callback (maybe just automatically deduce the types from the callback's type if possible);
  3. a list of constants;
  4. a list of allowed variable names and types.

  Here's a pseudo-C++ sample of how it could look:

      namespace m = boost::math_expressions_parser;
      m::parser p;
      p.add_operators()
          ('+', 1, ret<int32_t>(_1+_2))  // binary +, precedence 1
          ('*', 2, ret<int32_t>(_1*_2))  // binary *, precedence 2
          ('+', 3, ret<int32_t>(_1))     // unary +, precedence 3
          ('-', 3, ret<int32_t>(-_1));   // unary -, precedence 3
      p.add_functions()
          ("sin", ret<int32_t>(sin(_1)));
      p.add_constants()
          ("PI", 3);
      p.add_allowed_variables()
          (allowed_var<int32_t>("x"));

      m::expression E("1 + 3*sin(PI*x)", p); // parses expression E against parser p
      if (!E.valid()) { /* ... */ }
      else {
          std::cout << E.evaluate( ("x", 3) ); // evaluate this expression letting x=3
          E.let() ("x", 4);
          std::cout << E.evaluate();
      }

  Maybe add an optimization so that only the parts that depend on a changed variable are recalculated, e.g. for the expression f(x)+y, f(x) is recomputed only when x changes. This may cause problems for functions that do not always yield the same result for the same input, so perhaps make this optimization opt-in, applied only when the user marks the function/operator as "volatile" or something. Difficulty: hard-very hard

* Someone suggested on this ML earlier that a crypto suite would be useful for Boost. Maybe, but it would take a lot of time and dedication to implement. I believe that we should at least add some cryptographic hash functions (such as MD5, SHA-1, SHA-256 etc.). No one wants to install Crypto++ or something similar just to be able to compute the SHA-1 sum of a file; computing the hash of a string is a common need. Difficulty: very easy if you copy an implementation and just create a Boost-ish interface

* In-place radix sort? Radix sort is a very efficient algorithm which performs better than std::sort (in my implementation, and also asymptotically better) for some particular types such as uint8_t, uint16_t, uint32_t, unsigned char, unsigned char[2], unsigned char[4] etc. Radix sorting takes linear time but, unfortunately, linear extra memory. It is still very useful for sorting very large amounts of numbers, though (or genetic codes and maybe some other data).

I am able to implement some of these and I'm waiting for your opinion on which would be most useful as part of Boost. Let me know if you believe that none would be of use.

Yours sincerely, Stefan

Actually the difficulty is "easy". The basic precedence parser could be whipped out in an hour or so. Another day or a bit more to do the interfacing, define the expression data structures, and do the constant folding.

Operators also need associativity, or, more expressively, two precedences -- a left and a right precedence.

You don't really need to specify "allowed variables" any more than you need to specify allowed literals -- if the parser sees a variable, it can use it.

Personally I would include templates rather than just an operator string and then depend on the arity to determine the type. Then you can handle ternary and postfix operators.

Topher

On Apr 6, 2010, at 8:01 AM, Stefan wrote:
* A parser capable of evaluating mathematical expressions which contains operands,operators,function calls,variable names and constant names (also considering priority of operators): 1. a list of valid operators described by: a) number of operands (unary/binary/ternary?); b) allowed operand types; c) the precedence (priority of the operator) 2. a list of existent functions described by: a) function name; b) number of parameters; c) type of each parameter and return type; d) a callback (maybe just automatically deduce type based on the callback's type if possible) 3. a list of constants 4. a list of allowed variable names and types
Here's a pseudo-C++ code sample on how it should look like: namespace m = boost::math_expressions_parser; m::parser p; p.add_operators() ('+', 1, ret<int32_t>(_1+_2)) // binary +, precedence 1 ('*', 2, ret<int32_t>(_1*_2)) // binary -, precedence 2 ('+', 3, ret<int32_t>(_1)) // unary +, precedence 3 ('-', 3, ret<int32_t>(-_1)); // unary - (also a funny face), precedence 3 p.add_functions() ("sin", ret<int32_t>(sin(_1))); p.add_constants() ("PI", 3); p.add_allowed_variables() (allowed_var<int32_t>("x"));
m::expression E("1 + 3*sin(PI*x)", p); // evaluates mathematical expression E against parser p if (!E.valid()) { /* ... */ } else { std::cout << E.evaluate( ("x", 3) ); // evaluate this expression by letting x=3 E.let() ("x",4); std::cout << E.evaluate(); } //Maybe an optimization so only the parts that depend on the changed variable will be recalculated every time?. E.g.: for expression f(x)+y, f(x) is recomputed only when x is changed. This may cause problems for functions that do not always yield the same results for the same input. Maybe make this optimization optional, only if the user marks the function/operator as "volatile" or something Difficulty: hard-very hard

Topher Cooper wrote:
Actually the difficulty is "easy". The basic precedence parser could be whipped out in an hour or so. Another day or a bit more to do the interfacing, define the expression data structures, and do the constant folding.
Operators also need associativity, or, more expressively, two precedences -- a left and a right precedence.
You don't really need to specify "allowed variables" any more than you need to specify allowed literals -- if the parser sees a variable it can use it.
Personally I would include templates rather than just an operator string and then depend on the arity to determine the type. Then you can handle ternary and postfix operators.
Topher
Isn't it something that can be done in a few days using Spirit anyway?
--
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35

Topher Cooper wrote:
Actually the difficulty is "easy". The basic precedence parser could be whipped out in an hour or so. Another day or a bit more to do the interfacing, define the expression data structures, and do the constant folding.
Operators also need associativity, or, more expressively, two precedences -- a left and a right precedence.
You don't really need to specify "allowed variables" any more than you need to specify allowed literals -- if the parser sees a variable it can use it.
Personally I would include templates rather than just an operator string and then depend on the arity to determine the type. Then you can handle ternary and postfix operators.
Topher
On Apr 6, 2010, at 8:01 AM, Stefan wrote:
* A parser capable of evaluating mathematical expressions which contains operands,operators,function calls,variable names and constant names (also considering priority of operators): 1. a list of valid operators described by: a) number of operands (unary/binary/ternary?); b) allowed operand types; c) the precedence (priority of the operator) 2. a list of existent functions described by: a) function name; b) number of parameters; c) type of each parameter and return type; d) a callback (maybe just automatically deduce type based on the callback's type if possible) 3. a list of constants 4. a list of allowed variable names and types
Here's a pseudo-C++ code sample on how it should look like: namespace m = boost::math_expressions_parser; m::parser p; p.add_operators() ('+', 1, ret<int32_t>(_1+_2)) // binary +, precedence 1 ('*', 2, ret<int32_t>(_1*_2)) // binary -, precedence 2 ('+', 3, ret<int32_t>(_1)) // unary +, precedence 3 ('-', 3, ret<int32_t>(-_1)); // unary - (also a funny face), precedence 3 p.add_functions() ("sin", ret<int32_t>(sin(_1))); p.add_constants() ("PI", 3); p.add_allowed_variables() (allowed_var<int32_t>("x"));
m::expression E("1 + 3*sin(PI*x)", p); // evaluates mathematical expression E against parser p if (!E.valid()) { /* ... */ } else { std::cout << E.evaluate( ("x", 3) ); // evaluate this expression by letting x=3 E.let() ("x",4); std::cout << E.evaluate(); } //Maybe an optimization so only the parts that depend on the changed variable will be recalculated every time?. E.g.: for expression f(x)+y, f(x) is recomputed only when x is changed. This may cause problems for functions that do not always yield the same results for the same input. Maybe make this optimization optional, only if the user marks the function/operator as "volatile" or something Difficulty: hard-very hard
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Yes, I apologize. The "hard-very hard" wasn't meant to be there (initially I only wanted to use the very-easy and easy difficulty tags). This is quite trivial to implement, as are all the other ideas. I agree that these alone aren't sufficient for a proposal, as they would all be finished in a relatively small amount of time, but I was thinking of implementing some of them in addition to one of Boost's own ideas. Also, you make good points on associativity and on preferring templates, appreciated. Stefan

Stefan wrote:
* Optimize Boost String Algorithm finders to use an efficient substring matching algorithm as opposed to the naive algorithm. Most likely because of the fact that boost has implemented the finders for strings as generic algorithms, they haven't used efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find nth matching substring position using some of these algorithms. Difficulty: easy-medium
* Implement a string algorithm for generating the suffix array. The suffix array, after being precomputed on a certain string, allows for very efficient substring searches in the future. This is useful when the string doesn't change too often but there are plenty of substring to search against.
Why are those two points separate? They seem to be the same thing: providing alternative and potentially better string searching algorithms. That seems fairly easy to get right and to get included into Boost.
* Space partitioning data structures? kd-trees? quadtrees? octtrees? collision detectors? closest neighbors?
There have already been some projects on that in past years, but I don't think they were very successful; you might want to take a look at them. Integrating with Boost.Geometry would definitely be good (it already provides collision detection and closest-neighbour queries, I believe).
* In-place radix sort? Radix sort is a very efficient algorithm which performs better than std::sort (my implementation) (also asymptotically better) for some particular types such as: uint8_t, uint16_t, uint32_t, unsigned char, unsigned char[2], unsigned char[4] etc. Radix sorting takes linear time, but unfortunately, linear memory. It is very useful for sorting very large amounts of numbers tho (or genetic codes and maybe some other stuff).
I think there is a sorting library in the review queue that provides that kind of thing.

Stefan wrote:
* Optimize Boost String Algorithm finders to use an efficient substring matching algorithm as opposed to the naive algorithm. Most likely because of the fact that boost has implemented the finders for strings as generic algorithms, they haven't used efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find nth matching substring position using some of these algorithms. Difficulty: easy-medium
* Implement a string algorithm for generating the suffix array. The suffix array, after being precomputed on a certain string, allows for very efficient substring searches in the future. This is useful when the string doesn't change too often but there are plenty of substring to search against.
Hi Stefan,

Have you seen this thread, from a couple of years ago?
http://thread.gmane.org/gmane.comp.lib.boost.devel/171098

I don't know what has changed in string_algo since then, but I think there is probably plenty of opportunity for easy optimisation.

Do you have a motivating application for this? A lot of this stuff is rather data-dependent, so having a particular application that can provide a benchmark is a good idea.

I guess that a suffix array library would be reasonable for a GSoC project, if you're good! You might also like to consider variants that index at e.g. word boundaries; I have used something like that and it works well for some kinds of search.

Regards, Phil.

Phil Endecott wrote:
Stefan wrote:
* Optimize Boost String Algorithm finders to use an efficient substring matching algorithm as opposed to the naive algorithm. Most likely because of the fact that boost has implemented the finders for strings as generic algorithms, they haven't used efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find nth matching substring position using some of these algorithms. Difficulty: easy-medium
* Implement a string algorithm for generating the suffix array. The suffix array, after being precomputed on a certain string, allows for very efficient substring searches in the future. This is useful when the string doesn't change too often but there are plenty of substring to search against.
Hi Stefan,
Have you seen this thread, from a couple of years ago? :
http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
I don't know what has changed in string_algo since then, but I think there is probably plenty of opportunity for easy optimisation.
Do you have a motivating application for this? A lot of this stuff is rather data-dependant, so having a particular application that can provide a benchmark is a good idea. I guess that a suffix array library would be reasonable for a GSoC project, if you're good! You might also like to consider variants that index at e.g. word boundaries; I have used something like that and it works well for some kinds of search.
Regards, Phil.
I don't have an application proving it's inefficient yet, but an O(nm) algorithm would almost always perform worse than an O(n+m) one. I've just looked at the posted thread, but the is_any_of algorithm only looks for a character in a string, not for a substring in a string. Let me know if you would like a benchmark comparing boost::algorithm::find_*() with implementations of KMP, Boyer-Moore or other algorithms.

On Apr 7, 2010, at 11:58 AM, Stefan wrote:
Phil Endecott wrote:
Stefan wrote:
* Optimize Boost String Algorithm finders to use an efficient substring matching algorithm as opposed to the naive algorithm. Most likely because of the fact that boost has implemented the finders for strings as generic algorithms, they haven't used efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find nth matching substring position using some of these algorithms. Difficulty: easy-medium
* Implement a string algorithm for generating the suffix array. The suffix array, after being precomputed on a certain string, allows for very efficient substring searches in the future. This is useful when the string doesn't change too often but there are plenty of substring to search against. Hi Stefan, Have you seen this thread, from a couple of years ago? : http://thread.gmane.org/gmane.comp.lib.boost.devel/171098 I don't know what has changed in string_algo since then, but I think there is probably plenty of opportunity for easy optimisation. Do you have a motivating application for this? A lot of this stuff is rather data-dependant, so having a particular application that can provide a benchmark is a good idea. I guess that a suffix array library would be reasonable for a GSoC project, if you're good! You might also like to consider variants that index at e.g. word boundaries; I have used something like that and it works well for some kinds of search. Regards, Phil.
I, for one, would like to see a lot more algorithms in Boost. I have implemented B-M and B-M-H, but haven't polished them enough for submission. [ Sometimes, if you squint correctly, I think I can make B-M-H (at least) work on bidirectional iterators as well as random-access ones. ] -- Marshall

Stefan wrote:
* Optimize Boost String Algorithm finders to use an efficient substring matching algorithm as opposed to the naive algorithm. Most likely because of the fact that boost has implemented the finders for strings as generic algorithms, they haven't used efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find nth matching substring position using some of these algorithms.
I don't have an application proving it's inefficient yet, but a O(nm) algorithm would almost always perform worse than a O(n+m) one.
So n = size of input and m = size of search pattern. If m is typically small, e.g. 2, then O(nm) could easily beat O(n+m) if the constant factors are in its favour. Even if m is large, there will often be a dominant case where a potential match can be rejected after looking at just the first element (i.e. if I'm looking for xyz in abcdefgxyzlmnop, then a "worst-case O(nm)" algorithm will actually take O(n)).

I think that having a test data set in mind in advance would make this much more concrete to reason about. Does anyone have any suggestions?
I've just looked at the posted thread, but the algorithm is_any_of only looks for a character in a string, not for a substring in a string.
I guess my point was that even the simpler problem of finding a single character has lots of potential for interesting work. Regards, Phil.

Phil Endecott wrote:
Stefan wrote:
* Optimize Boost String Algorithm finders to use an efficient substring matching algorithm as opposed to the naive algorithm. Most likely because of the fact that boost has implemented the finders for strings as generic algorithms, they haven't used efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find nth matching substring position using some of these algorithms.
I don't have an application proving it's inefficient yet, but a O(nm) algorithm would almost always perform worse than a O(n+m) one.
So n = size of input and m = size of search pattern.
If m is typically small, e.g. 2, then O(n+m) could easily beat O(nm) if the constant factors are in its favour.
Even if m is large, then there will often be a dominant case where the potential match can be rejected after looking at just the first element (i.e. if I'm looking for xyz in abcdefgxyzlmnop, then a "worse case O(nm)" algorithm will actually take O(n)).
I think that having a test data set in mind in advance would make this much more concrete to reason about. Does anyone have any suggestions?
The application where this becomes important is DNA sequencing. If you have a very small alphabet and a large substring to match in a very large string, you can get poor performance from the naïve algorithm *if* there is a lot of repetition of the beginning of your substring in the string you are searching. This can happen frequently when working with genetic data, but otherwise it is of very little practical concern.

I'm not sure Boost string algorithms is the place genetic researchers would turn to for their DNA crunching. There is no doubt a host of academic and commercial libraries and applications available for this purpose, including multi-threaded and massively distributed search. Regards, Luke

Phil Endecott wrote:
I think that having a test data set in mind in advance would make this much more concrete to reason about. Does anyone have any suggestions?
Some people regard using an algorithm with a really bad complexity as some sort of bug, no? So the question would be how this "bug" can be exposed, in case it is really a bug. One suggestion in this direction might be a denial of service attack. But in my own experience, such "bugs" tend to cause damage in more subtle and unexpected ways. Perhaps somebody can answer me the opposite question: What is the advantage of having subtle bugs in your algorithms that only wait to bite you when you expect it the least? Regards, Thomas

Thomas Klimpel wrote:
Phil Endecott wrote:
I think that having a test data set in mind in advance would make this much more concrete to reason about. Does anyone have any suggestions?
Some people regard using an algorithm with a really bad complexity as some sort of bug, no?
Yes, indeed!
So the question would be how this "bug" can be exposed, in case it is really a bug. One suggestion in this direction might be a denial of service attack. But in my own experience, such "bugs" tend to cause damage in more subtle and unexpected ways. Perhaps somebody can answer me the opposite question: What is the advantage of having subtle bugs in your algorithms that only wait to bite you when you expect it the least?
Well, the not-100%-serious answer would be "because they might be faster the rest of the time".

I think these algorithms would be useful contributions to Boost. But because their complexity benefits may not be seen for common usage patterns (e.g. short search patterns), I think that studying their actual performance with some realistic test data (and not synthetic pseudo-random patterns!) should be part of the project.

I also wonder about other variations, e.g. avoiding recomputing the table in the KMP or Boyer-Moore algorithms for repeated searches with the same pattern:

    string text = content_of_file("long_text.txt");
    kmp_pattern pat("search for this");  // compute the table here
    string::const_iterator i = text.begin();
    while (i != text.end()) {
        i = kmp_search(i, text.end(), pat);
        ...
    }

If N = length of text, M = number of matches and P = length of pattern, I think this reduces the complexity from O(N+MP) to O(N+P). Do this with TMP and a compile-time pattern, and you've got something Xpressive-like.

But in my experience, when I have had to do fast search it has always been repeated search for different patterns, and so I've used some sort of an index. Stefan mentioned suffix arrays and I think that would be an even more valuable contribution.

Regards, Phil.

Phil Endecott wrote:
Stefan wrote:
* Optimize Boost String Algorithm finders to use an efficient substring matching algorithm as opposed to the naive algorithm. Most likely because of the fact that boost has implemented the finders for strings as generic algorithms, they haven't used efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find nth matching substring position using some of these algorithms. Difficulty: easy-medium
* Implement a string algorithm for generating the suffix array. The suffix array, after being precomputed on a certain string, allows for very efficient substring searches in the future. This is useful when the string doesn't change too often but there are plenty of substring to search against.
Hi Stefan,
Have you seen this thread, from a couple of years ago? :
http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
I don't know what has changed in string_algo since then, but I think there is probably plenty of opportunity for easy optimisation.
Do you have a motivating application for this? A lot of this stuff is rather data-dependant, so having a particular application that can provide a benchmark is a good idea. I guess that a suffix array library would be reasonable for a GSoC project, if you're good! You might also like to consider variants that index at e.g. word boundaries; I have used something like that and it works well for some kinds of search.
Regards, Phil.
I've just implemented KMP and benchmarked it against boost::algorithm::find_first using a vector of 500 strings and 500 substrings, taking all possible pairs (500*500 combinations). 250 of the strings were generated randomly, whereas the other 250 were generated by concatenating parts of some of the generated substrings.

It took KMP around 16 seconds and Boost around 54 seconds. This is still not very bad for Boost's implementation, since the data is random, which is quite close to the best case. If, however, the substrings were chosen so that many of them partly match the string, but not fully, then the worst case would be triggered. It is obvious why an algorithm in O(nm) having its worst case triggered for relatively big values of n and m (I used m=1..100, n=1..1000) would be a disaster.

P.S. I believe Rabin-Karp or suffix arrays would have performed better on random data than my KMP did.
P.P.S. I'll share the benchmarking code tomorrow if you're interested.

Phil Endecott wrote:
Stefan wrote:
* Optimize Boost String Algorithm finders to use an efficient substring matching algorithm as opposed to the naive algorithm. Most likely because of the fact that boost has implemented the finders for strings as generic algorithms, they haven't used efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find nth matching substring position using some of these algorithms. Difficulty: easy-medium
* Implement a string algorithm for generating the suffix array. The suffix array, after being precomputed on a certain string, allows for very efficient substring searches in the future. This is useful when the string doesn't change too often but there are plenty of substring to search against.
Hi Stefan,
Have you seen this thread, from a couple of years ago? :
http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
I don't know what has changed in string_algo since then, but I think there is probably plenty of opportunity for easy optimisation.
Do you have a motivating application for this? A lot of this stuff is rather data-dependant, so having a particular application that can provide a benchmark is a good idea. I guess that a suffix array library would be reasonable for a GSoC project, if you're good! You might also like to consider variants that index at e.g. word boundaries; I have used something like that and it works well for some kinds of search.
Regards, Phil.
I've decided to implement Rabin-Karp today, and after benchmarking I realized that on my pseudo-randomly generated data it performs worse than Boost. My Rabin-Karp implementation will surely need some profiling before it can be used in benchmarking, and also some better data sets than what I came up with. Here it is:

    /** Rabin-Karp
     *
     * \author mstefanro@gmail.com
     */
    #include <boost/cstdint.hpp>
    #include <boost/utility.hpp>
    #include <boost/tuple/tuple.hpp>
    #include <algorithm>
    #include <cassert>
    #include <functional>
    #include <utility>
    #include <vector>

    // Note: I've wasted a few hours debugging because I initially hadn't
    // used uintmax_t for logpow().

    //! Binary exponentiation in O(log exp). This should also be added to
    //! Boost, unless it already exists and I'm not aware of it.
    //! Very quickly computes (base^exp)%mod
    template <class Uint>
    Uint logpow (Uint base, Uint exp, Uint mod)
    {
        boost::uintmax_t ret = 1, exponential_base = base;
        for (; exp; exp >>= 1) {
            if (exp & 1)
                ret = (ret * exponential_base) % mod;
            exponential_base = (exponential_base * exponential_base) % mod;
        }
        return static_cast<Uint>(ret);
    }

    //! @TODO: add support for Container=char[],char*,wchar_t[],wchar_t*
    //! @TODO: accept a generic matches container rather than just
    //!        std::vector in find()
    //! @TODO: choose smaller primes (so that 2*prime fits in uint32)
    template <
        class Container,
        bool Heuristic = false,
        class Comparator = std::equal_to<typename Container::value_type>,
        class HashType = boost::uint32_t,
        HashType FirstBase    = 257u,
        HashType SecondBase   = 277u,
        HashType FirstModulo  = 7624679u,
        HashType SecondModulo = 7116647u
    >
    class rabin_karp {
    public:
        rabin_karp (Container const &substring)
            : substring_(substring), comparator(),
              first_hash(0u), second_hash(0u)
        {
            if (!substring.empty()) {
                boost::tie(first_hash, second_hash) =
                    compute_hashes(substring.begin(), substring.end());
                first_additive_inverse  = FirstModulo  - ::logpow(FirstBase,
                    static_cast<HashType>(substring.size()-1), FirstModulo);
                second_additive_inverse = SecondModulo - ::logpow(SecondBase,
                    static_cast<HashType>(substring.size()-1), SecondModulo);
            }
        }

        template <class Container2>
        bool find (Container2 const &string,
                   std::vector<typename Container2::const_iterator> &matches,
                   unsigned int stopAt = 0)
        {
            HashType current_first_hash, current_second_hash;
            bool found = false;
            matches.clear();
            if (string.size() < substring_.size() || substring_.empty())
                return false;
            boost::tie(current_first_hash, current_second_hash) =
                compute_hashes(string.begin(),
                               string.begin() + substring_.size());
            typename Container2::const_iterator iter =
                string.begin() + substring_.size() - 1;
            do {
                ++iter;
                if (first_hash == current_first_hash &&
                    second_hash == current_second_hash &&
                    test_for_match<Heuristic>(iter - substring_.size(), iter))
                {
                    // Got a match
                    matches.push_back(iter - substring_.size());
                    found = true;
                    if (stopAt) {
                        --stopAt;
                        if (!stopAt) break;
                    }
                }
                if (iter == string.end()) break;
                // We've got a new character, so update the hashes:
                // remove *(iter - substring_.size()), add *iter
                HashType remove = *(iter - substring_.size()), add = *iter;
                current_first_hash = ((current_first_hash +
                    (remove * first_additive_inverse) % FirstModulo)
                    * FirstBase) % FirstModulo;
                current_first_hash = (current_first_hash + add) % FirstModulo;
                current_second_hash = ((current_second_hash +
                    (remove * second_additive_inverse) % SecondModulo)
                    * SecondBase) % SecondModulo;
                current_second_hash = (current_second_hash + add) % SecondModulo;
            } while (true);
            return found;
        }

    private:
        Container  substring_;
        Comparator comparator;
        HashType   first_hash, second_hash;
        HashType   first_additive_inverse, second_additive_inverse;

        template <class Iterator>
        std::pair<HashType, HashType> compute_hashes (Iterator begin,
                                                      Iterator end)
        {
            HashType first = 0u, second = 0u;
            for (Iterator iter = begin; iter != end; ++iter) {
                first  = (first  * FirstBase  + *iter) % FirstModulo;
                second = (second * SecondBase + *iter) % SecondModulo;
            }
            return std::make_pair(first, second);
        }

        //! Check if [begin,end) == [substring_.begin(), substring_.end())
        template <bool Heuristic_, class Iterator>
        typename boost::enable_if_c<Heuristic_ == false, bool>::type
        test_for_match (Iterator begin, Iterator end)
        {
            Iterator iter2 = begin;
            assert(static_cast<std::size_t>(end - begin) == substring_.size());
            for (typename Container::const_iterator iter = substring_.begin();
                 iter != substring_.end(); ++iter, ++iter2)
            {
                if (!comparator(*iter, *iter2)) return false;
            }
            return true;
        }

        //! Heuristic specialization: omit testing character by character.
        //! One should hope for no collisions on both hash functions at once
        //! (highly unlikely). If it's seen happening, add a third hash
        //! function?
        template <bool Heuristic_, class Iterator>
        typename boost::enable_if_c<Heuristic_ == true, bool>::type
        test_for_match (Iterator, Iterator)
        {
            return true;
        }
    };

There's both a heuristic and a deterministic algorithm, but the heuristic one has never failed me so far (tested on around 20 million inputs, it has always yielded the correct results). Typical usage is:

    std::vector<std::string::const_iterator> matches;
    rabin_karp<std::string> rk("my substring");
    if (rk.find(std::string("my string which contains my substring"),
                matches)) {
        /* ... */
    }

Or, for the heuristic version:

    rabin_karp<std::string, true> rk("my substring");

I doubt Rabin-Karp normally performs so badly, therefore my implementation definitely needs improvement.

Stefan wrote:
Phil Endecott wrote:
Stefan wrote:
* Optimize the Boost String Algorithm finders to use an efficient substring matching algorithm instead of the naive one. Most likely because Boost implements the finders for strings as generic algorithms, they don't use efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find the nth matching substring position using some of these algorithms. Difficulty: easy-medium
* Implement a string algorithm for generating the suffix array. The suffix array, once precomputed on a certain string, allows for very efficient substring searches afterwards. This is useful when the string doesn't change too often but there are plenty of substrings to search for.
Hi Stefan,
Have you seen this thread, from a couple of years ago? :
http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
I don't know what has changed in string_algo since then, but I think there is probably plenty of opportunity for easy optimisation.
Do you have a motivating application for this? A lot of this stuff is rather data-dependent, so having a particular application that can provide a benchmark is a good idea. I guess that a suffix array library would be reasonable for a GSoC project, if you're good! You might also like to consider variants that index at e.g. word boundaries; I have used something like that and it works well for some kinds of search.
Regards, Phil.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Sorry for double-posting, no idea why it happened. Also, one can ignore the comments "//! @TODO: choose smaller primes (so that 2*prime fits in uint32)" and "//! Heuristic specialization, omit testing character by character." as they are incorrect. The former has been solved, whereas the latter is not a specialization, although it was meant to be at one point.

Stefan wrote:
Hello, I am Stefan and this is my first year at GSoC, as it's the first time I'm eligible to participate as a student. Since I have been using C++ and the Boost libraries for quite some time, I would be thrilled to be able to code for Boost. I have quite a few ideas, and I was wondering which ones you consider would benefit Boost the most, so I'm awaiting your feedback. Also, please let me know if any of these ideas are already implemented in any of the libraries. Here we go:
* Selection algorithm (find the kth smallest element in a container, O(n) worst-case, very easy to implement)
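For comparison, the standard library's std::nth_element already solves this with expected (but not guaranteed worst-case) linear time; a worst-case O(n) version would use median-of-medians. A minimal sketch of the intended interface, using std::nth_element as a stand-in (the function name is hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Return the k-th smallest element (0-based) of v. Sketch only:
// std::nth_element is expected O(n); the proposed algorithm would use
// median-of-medians to guarantee O(n) in the worst case.
int kth_smallest(std::vector<int> v, std::size_t k)
{
    assert(k < v.size());
    std::nth_element(v.begin(), v.begin() + k, v.end());
    return v[k];
}
```

Taking the vector by value keeps the caller's data unmodified; a Boost-style version would likely take a random access iterator range instead.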
* Optimize the Boost String Algorithm finders to use an efficient substring matching algorithm instead of the naive one. Most likely because Boost implements the finders for strings as generic algorithms, they don't use efficient algorithms such as KMP, Boyer-Moore, Rabin-Karp etc. \see boost/algorithm/string/detail/finder.hpp The idea is to implement efficient algorithms such as the ones mentioned above to replace the current naive implementation. Note: it may be difficult to find the nth matching substring position using some of these algorithms. Difficulty: easy-medium
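To give a feel for one of the candidates, here is a standalone sketch of Knuth-Morris-Pratt (my own illustration, not Boost code): the failure function lets the search avoid re-examining text characters, giving O(|text| + |pattern|) instead of the naive O(|text| * |pattern|).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the starting positions of all occurrences of pat in text,
// using the Knuth-Morris-Pratt failure function.
std::vector<std::size_t> kmp_find_all(const std::string& text,
                                      const std::string& pat)
{
    std::vector<std::size_t> matches;
    if (pat.empty() || text.size() < pat.size()) return matches;

    // failure[i] = length of the longest proper prefix of pat[0..i]
    // that is also a suffix of it.
    std::vector<std::size_t> failure(pat.size(), 0);
    for (std::size_t i = 1, k = 0; i < pat.size(); ++i) {
        while (k > 0 && pat[i] != pat[k]) k = failure[k - 1];
        if (pat[i] == pat[k]) ++k;
        failure[i] = k;
    }

    // Scan the text, never moving the text pointer backwards.
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pat[k]) k = failure[k - 1];
        if (text[i] == pat[k]) ++k;
        if (k == pat.size()) {
            matches.push_back(i + 1 - pat.size());
            k = failure[k - 1];  // continue, allowing overlapping matches
        }
    }
    return matches;
}
```

Since this finds every match left to right, the "nth match" concern above reduces to stopping after n matches are collected.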
* Implement a string algorithm for generating the suffix array. The suffix array, once precomputed on a certain string, allows for very efficient substring searches afterwards. This is useful when the string doesn't change too often but there are plenty of substrings to search for.
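As an illustration of what such a generator might compute, here is a sketch (my own, C++11, not the proposed library interface) of the classic O(n log^2 n) doubling construction; production code would likely use a linear-time algorithm such as DC3/skew:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Build the suffix array of s: sa[i] is the starting index of the
// i-th smallest suffix. Doubling approach: sort suffixes by their
// first 2^k characters, reusing ranks from the previous round.
std::vector<int> build_suffix_array(const std::string& s)
{
    int n = static_cast<int>(s.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    for (int i = 0; i < n; ++i) { sa[i] = i; rank[i] = s[i]; }
    if (n <= 1) return sa;
    for (int k = 1; ; k *= 2) {
        // Compare suffixes a and b by their length-2k prefixes.
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + k < n ? rank[a + k] : -1;
            int rb = b + k < n ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);
        // Re-rank: equal prefixes get equal ranks.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: done
    }
    return sa;
}
```

Once built, any substring query is a pair of binary searches over sa, i.e. O(m log n) per query for a pattern of length m.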
* Yet another efficient associative data structure, skip lists?
* Space partitioning data structures? kd-trees? quadtrees? octrees? collision detectors? nearest neighbors?
* A parser capable of evaluating mathematical expressions which contain operands, operators, function calls, variable names and constant names (also considering operator precedence). It would be configured with:
1. a list of valid operators, described by: a) the number of operands (unary/binary/ternary?); b) the allowed operand types; c) the precedence (priority) of the operator
2. a list of existing functions, described by: a) the function name; b) the number of parameters; c) the type of each parameter and the return type; d) a callback (maybe just automatically deduce the types from the callback's type, if possible)
3. a list of constants
4. a list of allowed variable names and types
Here's a pseudo-C++ code sample of how it could look:

namespace m = boost::math_expressions_parser;
m::parser p;
p.add_operators()
    ('+', 1, ret<int32_t>(_1+_2))  // binary +, precedence 1
    ('*', 2, ret<int32_t>(_1*_2))  // binary *, precedence 2
    ('+', 3, ret<int32_t>(_1))     // unary +, precedence 3
    ('-', 3, ret<int32_t>(-_1));   // unary - (also a funny face), precedence 3
p.add_functions()
    ("sin", ret<int32_t>(sin(_1)));
p.add_constants()
    ("PI", 3);
p.add_allowed_variables()
    (allowed_var<int32_t>("x"));
m::expression E("1 + 3*sin(PI*x)", p); // parses the expression E using parser p
if (!E.valid()) { /* ... */ }
else {
    std::cout << E.evaluate( ("x", 3) ); // evaluate this expression by letting x=3
    E.let() ("x", 4);
    std::cout << E.evaluate();
}

Maybe add an optimization so that only the parts depending on a changed variable are recalculated each time? E.g. for the expression f(x)+y, f(x) is recomputed only when x changes. This may cause problems for functions that do not always yield the same results for the same input, so maybe make the optimization opt-in, enabled only when the user marks the function/operator as "volatile" or something. Difficulty: hard-very hard
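To show the core machinery such a parser needs under the configurable surface, here is a hedged sketch of a minimal recursive-descent evaluator for +, -, *, / with parentheses, unary minus, and named variables. All names here (tiny_eval etc.) are hypothetical; the proposed Boost interface above would add user-registered operators, functions, and precedence tables on top of this kind of engine.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Grammar:  expr := term (('+'|'-') term)*
//           term := factor (('*'|'/') factor)*
//           factor := number | name | '(' expr ')' | '-' factor
class tiny_eval {
public:
    tiny_eval(const std::string& expr,
              const std::map<std::string, double>& vars)
        : s_(expr), vars_(vars), pos_(0) {}

    double run() {
        double v = expr();
        skip();
        if (pos_ != s_.size()) throw std::runtime_error("trailing input");
        return v;
    }

private:
    std::string s_;
    std::map<std::string, double> vars_;
    std::size_t pos_;

    void skip() {
        while (pos_ < s_.size() && std::isspace((unsigned char)s_[pos_])) ++pos_;
    }
    bool eat(char c) {
        skip();
        if (pos_ < s_.size() && s_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    double expr() {
        double v = term();
        for (;;) {
            if (eat('+')) v += term();
            else if (eat('-')) v -= term();
            else return v;
        }
    }
    double term() {
        double v = factor();
        for (;;) {
            if (eat('*')) v *= factor();
            else if (eat('/')) v /= factor();
            else return v;
        }
    }
    double factor() {
        skip();
        if (eat('-')) return -factor();                 // unary minus
        if (eat('(')) {                                 // parenthesized subexpression
            double v = expr();
            if (!eat(')')) throw std::runtime_error("')' expected");
            return v;
        }
        if (pos_ < s_.size() && std::isdigit((unsigned char)s_[pos_])) {
            std::size_t end;                            // numeric literal
            double v = std::stod(s_.substr(pos_), &end);
            pos_ += end;
            return v;
        }
        std::string name;                               // variable lookup
        while (pos_ < s_.size() && std::isalpha((unsigned char)s_[pos_]))
            name += s_[pos_++];
        if (name.empty()) throw std::runtime_error("parse error");
        return vars_.at(name);
    }
};
```

Here precedence is hard-coded in the grammar (term binds tighter than expr); the configurable-operator design above essentially replaces these fixed levels with a table-driven precedence-climbing loop.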
* Someone suggested on this ML earlier that a crypto suite would be useful for Boost. Maybe, but it would take a lot of time and dedication to implement. I believe that we should at least add some cryptographic hash functions (such as MD5, SHA-1, SHA-256 etc.). No one wants to install Crypto++ or something similar just to be able to compute the SHA-1 sum of a file, and computing a hash of a string is a common necessity. Difficulty: very easy if you copy an existing implementation and just create a boost-ish interface
* In-place radix sort? Radix sort is a very efficient algorithm which (in my implementation) performs better than std::sort (and is asymptotically better too) for some particular types such as uint8_t, uint16_t, uint32_t, unsigned char, unsigned char[2], unsigned char[4] etc. Radix sorting takes linear time but, unfortunately, linear extra memory. It is still very useful for sorting very large amounts of numbers, though (or genetic codes and maybe some other stuff).
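For reference, here is a sketch of the simpler out-of-place variant (the bullet above proposes an in-place version, which is trickier): an LSD radix sort for 32-bit unsigned integers, one byte per pass, so four counting-sort passes in total, O(n) time and O(n) extra memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD (least-significant-digit-first) radix sort for uint32_t.
// Each of the 4 passes is a stable counting sort on one byte.
void radix_sort(std::vector<std::uint32_t>& v)
{
    std::vector<std::uint32_t> buf(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[257] = {0};
        // Histogram of the current byte, offset by one so that the
        // prefix sums below directly give starting positions.
        for (std::size_t i = 0; i < v.size(); ++i)
            ++count[((v[i] >> shift) & 0xFF) + 1];
        for (int b = 0; b < 256; ++b)
            count[b + 1] += count[b];
        // Stable scatter into the buffer, then swap roles.
        for (std::size_t i = 0; i < v.size(); ++i)
            buf[count[(v[i] >> shift) & 0xFF]++] = v[i];
        v.swap(buf);
    }
    // 4 swaps: the sorted data ends up back in v.
}
```

Stability of each pass is what makes the digit-by-digit approach correct: ties on the current byte preserve the order established by the previous (less significant) passes.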
I am able to implement some of these, and I'm waiting for your opinion on which would be the most useful additions to Boost. Let me know if you believe none of them would be of any use.
Yours sincerely, Stefan
Here are my ideas before submitting a proposal:

* Selection algorithm (quick implementation, useful for efficiently finding the kth smallest number in a container offering random access iterators).

* In the String Algorithms library, add optimal algorithms for searching a substring in a string: Rabin-Karp, Knuth-Morris-Pratt, Boyer-Moore, possibly Boyer-Moore-Horspool and suffix arrays. As has been previously discussed on the mailing lists, these algorithms can offer significant improvements over the current substring searching implementation in boost::string::find_*(). Some of these algorithms perform better when the string changes less often, others when the substring changes less often, and their efficiency is quite dependent on the input data. I believe it would be a good idea to benchmark these algorithms on various types of input data: random strings and substrings, strings and substrings taken from the English language, substrings that occur often in the strings, substrings that occur only partly in the strings, very long substrings, very short substrings, searches over very small alphabets (such as DNA), etc. It would also be a good idea to add such benchmarks to the documentation pages, so that users of these algorithms know which one is the most suitable given what they know about their input.

* Extend the Python bindings of Boost.Graph or write the topology generators (whichever you consider more important or of a higher priority). Personally, I would prefer doing the Python bindings, because Python is one of my favorite programming languages and I've never done something similar before. This would be beneficial for me, as I'm sure I can do it and I will also have the chance to learn something new (i.e. improve my Python and my Boost.Graph skills).

Can you please review my list of ideas and let me know what you think?

Regards, Stefan
participants (8)
- joel falcou
- Marshall Clow
- Mathias Gaunard
- Phil Endecott
- Simonson, Lucanus J
- Stefan
- Thomas Klimpel
- Topher Cooper