Re: [Boost-users] boost.xpressive is inefficient while seaching in raw bytes

24 Jan 2008

      Aries Tao wrote:
...
hi everybody,I use boost.xpressive to search email address in a binary 
file which size is 10*1024*1024 bytes. every bytes is 0x6f in that 
file.boost.xpressive is inefficient. anyone can help me? thanks!
the code is below:
<snip>
I've done some investigation, and I've discovered a couple of things...

- The Boost.Regex algorithms that do not accept a match_results struct 
automatically set the match_any flag. I don't see anything about that in 
TR1. Compliance issue? Or is this covered by the as-if rule?

- The match_any flag has a dramatic adverse effect on the performance of 
some regular expressions because of the following lines in 
perl_matcher_non_recursive.hpp:

line 725:
    //
    // start by working out how much we can skip:
    //
    bool greedy = (rep->greedy) && (!(m_match_flags & 
regex_constants::match_any) || m_independent);

line 753:
    if(greedy)
    {
       if((rep->leading) && (count < rep->max))
          restart = position;

This last line causes Boost.Regex to avoid exponential behavior. I don't 
yet understand how this optimization works, or why match_any disables 
it. John?

- Boost.Regex's regex_iterator invokes regex_search with a 
match_results, so match_any is not set and the optimization kicks in. If 
you pass the match_any flag to the regex_iterator constructor, the 
constructor takes exponential time and throws on memory exhaustion.

- Boost.Xpressive lacks this optimization, and so always takes 
exponential time, but does not exhaust memory and does not throw.

I'm attaching some code that demonstrates this strange situation. John, 
could you comment on match_any and the re_repeat::leading optimization, 
and why the regex algorithms are setting match_any?

Aries, your pattern presents a challenge to any backtracking regex 
engine. Clever optimizations aside, there are some ways you can improve 
the situation. In particular, think about wrapping your repeats in 
independent sub-expressions like (?>[...]+). Unfortunately in this case, 
that doesn't completely solve the problem because regex_search will 
attempt a brute-force match at each of the 1 million positions within 
the string. And, because of your pattern and the text against which 
you're matching, each match will walk from the current position to the 
end of the string. That's just going to be very, very slow.

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com

///////////////////////////////////////////////////////////////////////////////
// main.hpp
//
//  Copyright 2007 Eric Niebler. Distributed under the Boost
//  Software License, Version 1.0. (See accompanying file
//  LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#include <cstring>
#include <iostream>
//#include <boost/regex.hpp>
#include <boost/xpressive/xpressive.hpp>

int main()
{
    std::size_t const Mb = 1048576; // 1Mb
    char *begin = new char[Mb];
    char *end   = begin + Mb;
    std::memset(begin, 0x6f, 1048576);
    char const *pattern = "((?>[a-z#~_\\.!\\#$%\\^&\\*\\(\\)\\-]+)@(?>[a-z#_\\-]+)\\.(?>[a-z#_\\-\\.]+))";

    //try
    //{
    //    using namespace boost;
    //    regex token(pattern);
    //    // fast, doesn't throw:
    //    cregex_iterator cur(begin, end, token);
    //    // slow, throws on memory exhaustion:
    //    regex_search(begin, end, token);
    //}
    //catch(std::exception const &e)
    //{
    //    std::cout << "boost.regex error: " << e.what() << std::endl;
    //}

    try
    {
        using namespace boost::xpressive;
        cregex token = cregex::compile(pattern, regex_constants::optimize);
        // fast, doesn't throw:
        cregex_iterator cur(begin, end, token);
        // slow, throws on memory exhaustion:
        //regex_search(begin, end, token);
    }
    catch(std::exception const &e)
    {
        std::cout << "boost.xpressive error: " << e.what() << std::endl;
    }

    return 0;
}