
Aries Tao wrote:
hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: <snip>
I've done some investigation, and I've discovered a couple of things... - The Boost.Regex algorithms that do not accept a match_results struct automatically set the match_any flag. I don't see anything about that in TR1. Compliance issue? Or is this covered by the as-if rule? - The match_any flag has a dramatic adverse effect on the performance of some regular expressions because of the following lines in perl_matcher_non_recursive.hpp: line 725: // // start by working out how much we can skip: // bool greedy = (rep->greedy) && (!(m_match_flags & regex_constants::match_any) || m_independent); line 753: if(greedy) { if((rep->leading) && (count < rep->max)) restart = position; This last line causes Boost.Regex to avoid exponential behavior. I don't yet understand how this optimization works, or why match_any disables it. John? - Boost.Regex's regex_iterator invokes regex_search with a match_results, so match_any is not set and the optimization kicks in. If you pass the match_any flag to the regex_iterator constructor, the constructor takes exponential time and throws on memory exhaustion. - Boost.Xpressive lacks this optimization, and so always takes exponential time, but does not exhaust memory and does not throw. I'm attaching some code that demonstrates this strange situation. John, could you comment on match_any and the re_repeat::leading optimization, and why the regex algorithms are setting match_any? Aries, your pattern presents a challenge to any backtracking regex engine. Clever optimizations aside, there are some ways you can improve the situation. In particular, think about wrapping your repeats in independent sub-expressions like (?>[...]+). Unfortunately in this case, that doesn't completely solve the problem because regex_search will attempt a brute-force match at each of the 1 million positions within the string. And, because of your pattern and the text against which you're matching, each match will walk from the current position to the end of the string. That's just going to be very, very slow. -- Eric Niebler Boost Consulting www.boost-consulting.com /////////////////////////////////////////////////////////////////////////////// // main.hpp // // Copyright 2007 Eric Niebler. Distributed under the Boost // Software License, Version 1.0. (See accompanying file // LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) #include <cstring> #include <iostream> //#include <boost/regex.hpp> #include <boost/xpressive/xpressive.hpp> int main() { std::size_t const Mb = 1048576; // 1Mb char *begin = new char[Mb]; char *end = begin + Mb; std::memset(begin, 0x6f, 1048576); char const *pattern = "((?>[a-z#~_\\.!\\#$%\\^&\\*\\(\\)\\-]+)@(?>[a-z#_\\-]+)\\.(?>[a-z#_\\-\\.]+))"; //try //{ // using namespace boost; // regex token(pattern); // // fast, doesn't throw: // cregex_iterator cur(begin, end, token); // // slow, throws on memory exhaustion: // regex_search(begin, end, token); //} //catch(std::exception const &e) //{ // std::cout << "boost.regex error: " << e.what() << std::endl; //} try { using namespace boost::xpressive; cregex token = cregex::compile(pattern, regex_constants::optimize); // fast, doesn't throw: cregex_iterator cur(begin, end, token); // slow, throws on memory exhaustion: //regex_search(begin, end, token); } catch(std::exception const &e) { std::cout << "boost.xpressive error: " << e.what() << std::endl; } return 0; }