
Eric Niebler wrote:
Aries Tao wrote:
hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: <snip>
After I figured out what John's code was doing, it was a pretty simple matter to implement a similar optimization in xpressive, which I have just committed to trunk. It should dramatically speed up repeated searches for patterns that begin with a simple repeat, such as "[a-z]+". When used together with an independent subexpression to eliminate unnecessary backtracking, the performance is really quite good. I'm attaching a little benchmark program which uses your pattern to search a 1Gb buffer. Here are the timings after my fix: Boost.regex: 6.859 Boost.xpressive (dynamic): 3.02 Boost.xpressive (static): 2.912 With boost.regex, I couldn't use an independent subexpression because it disables the optimization. That accounts for the difference. Thanks for bringing this issue to my attention. -- Eric Niebler Boost Consulting www.boost-consulting.com /////////////////////////////////////////////////////////////////////////////// // main.hpp // // Copyright 2007 Eric Niebler. Distributed under the Boost // Software License, Version 1.0. (See accompanying file // LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) #include <cstring> #include <iostream> #include <boost/timer.hpp> #include <boost/regex.hpp> #include <boost/xpressive/xpressive.hpp> int main() { std::size_t const Mb = 1000000000; char *begin = new char[Mb]; char *end = begin + Mb; std::memset(begin, 0x6f, Mb); try { using namespace boost; cregex_iterator e; char const *pattern = "([a-z#~_\\.!\\#$%\\^&\\*\\(\\)\\-]+@[a-z#_\\-]+\\.[a-z#_\\-\\.]+)"; regex token(pattern, (regex_constants::syntax_option_type)(regex_constants::ECMAScript | regex_constants::optimize)); boost::timer tim; tim.restart(); cregex_iterator cur(begin, end, token); for(; cur != e; ++cur) ; double elapsed = tim.elapsed(); std::cout << "Boost.regex: " << elapsed << '\n'; } catch(std::exception const &e) { std::cout << "boost.regex error: " << e.what() << std::endl; } try { using namespace boost::xpressive; cregex_iterator e; // test a dynamic regex char const *pattern = "((?>[a-z#~_\\.!\\#$%\\^&\\*\\(\\)\\-]+)@[a-z#_\\-]+\\.[a-z#_\\-\\.]+)"; cregex token = cregex::compile(pattern, regex_constants::ECMAScript | regex_constants::optimize); boost::timer tim; tim.restart(); cregex_iterator cur(begin, end, token); for(; cur != e; ++cur) ; double elapsed = tim.elapsed(); std::cout << "Boost.xpressive (dynamic): " << elapsed << '\n'; // test a static regex token = (s1= keep(+set[range('a','z') | '#' | '~' | '_' | '.' | '!' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-']) >> '@' >> +set[range('a','z') | '#' | '_' | '-'] >> '.' >> +set[range('a','z') | '#' | '_' | '-' | '.'] ); tim.restart(); cregex_iterator cur2(begin, end, token); for(; cur2 != e; ++cur2) ; elapsed = tim.elapsed(); std::cout << "Boost.xpressive (static): " << elapsed << '\n'; } catch(std::exception const &e) { std::cout << "boost.xpressive error: " << e.what() << std::endl; } return 0; }