boost.xpressive is inefficient while seaching in raw bytes

older
Generic indexing by "const char *"

Aries Tao

20 Jan 2008 20 Jan '08

3:03 p.m.

hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: //read file CFile file(_T("d:\\test.dat"),CFile::modeRead ); byte* pBuffer = new byte[file.GetLength()]; file.Read(pBuffer,file.GetLength()); std::string str((char*)pBuffer,file.GetLength()); sregex token = sregex::compile("([a-z#~_\\.!\\#$%\\^&\\*\$\$\\-]+@[a-z#_\\-]+\\.[a-z#_\\-\\.]+)"); sregex_iterator cur( str.begin(), str.end(), token ); sregex_iterator end; for( ; cur != end; ++cur ) { smatch const &what = *cur; std::string strResult = what[0]; int start = what[0].first - str.begin(); } _________________________________________________________________ Helping your favorite cause is as easy as instant messaging. You IM, we give. http://im.live.com/Messenger/IM/Home/?source=text_hotmail_join

Attachments:

attachment.html (text/html — 1.6 KB)

Show replies by date

Eric Niebler

20 Jan 20 Jan

5:24 p.m.

New subject: boost.xpressive is inefficient while seaching in raw bytes

Aries Tao wrote:

...

hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below:

How inefficient? Is perl or boost.regex significantly faster? -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler

24 Jan 24 Jan

7:57 p.m.

New subject: boost.xpressive is inefficient while seaching in raw bytes

Aries Tao wrote:

...

hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: <snip>

I've done some investigation, and I've discovered a couple of things... - The Boost.Regex algorithms that do not accept a match_results struct automatically set the match_any flag. I don't see anything about that in TR1. Compliance issue? Or is this covered by the as-if rule? - The match_any flag has a dramatic adverse effect on the performance of some regular expressions because of the following lines in perl_matcher_non_recursive.hpp: line 725: // // start by working out how much we can skip: // bool greedy = (rep->greedy) && (!(m_match_flags & regex_constants::match_any) || m_independent); line 753: if(greedy) { if((rep->leading) && (count < rep->max)) restart = position; This last line causes Boost.Regex to avoid exponential behavior. I don't yet understand how this optimization works, or why match_any disables it. John? - Boost.Regex's regex_iterator invokes regex_search with a match_results, so match_any is not set and the optimization kicks in. If you pass the match_any flag to the regex_iterator constructor, the constructor takes exponential time and throws on memory exhaustion. - Boost.Xpressive lacks this optimization, and so always takes exponential time, but does not exhaust memory and does not throw. I'm attaching some code that demonstrates this strange situation. John, could you comment on match_any and the re_repeat::leading optimization, and why the regex algorithms are setting match_any? Aries, your pattern presents a challenge to any backtracking regex engine. Clever optimizations aside, there are some ways you can improve the situation. In particular, think about wrapping your repeats in independent sub-expressions like (?>[...]+). Unfortunately in this case, that doesn't completely solve the problem because regex_search will attempt a brute-force match at each of the 1 million positions within the string. And, because of your pattern and the text against which you're matching, each match will walk from the current position to the end of the string. That's just going to be very, very slow. -- Eric Niebler Boost Consulting www.boost-consulting.com /////////////////////////////////////////////////////////////////////////////// // main.hpp // // Copyright 2007 Eric Niebler. Distributed under the Boost // Software License, Version 1.0. (See accompanying file // LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) #include <cstring> #include <iostream> //#include <boost/regex.hpp> #include <boost/xpressive/xpressive.hpp> int main() { std::size_t const Mb = 1048576; // 1Mb char *begin = new char[Mb]; char *end = begin + Mb; std::memset(begin, 0x6f, 1048576); char const *pattern = "((?>[a-z#~_\\.!\\#$%\\^&\\*\$\$\\-]+)@(?>[a-z#_\\-]+)\\.(?>[a-z#_\\-\\.]+))"; //try //{ // using namespace boost; // regex token(pattern); // // fast, doesn't throw: // cregex_iterator cur(begin, end, token); // // slow, throws on memory exhaustion: // regex_search(begin, end, token); //} //catch(std::exception const &e) //{ // std::cout << "boost.regex error: " << e.what() << std::endl; //} try { using namespace boost::xpressive; cregex token = cregex::compile(pattern, regex_constants::optimize); // fast, doesn't throw: cregex_iterator cur(begin, end, token); // slow, throws on memory exhaustion: //regex_search(begin, end, token); } catch(std::exception const &e) { std::cout << "boost.xpressive error: " << e.what() << std::endl; } return 0; }

Eric Niebler

9:04 p.m.

New subject: boost.xpressive is inefficient while seaching in raw bytes

Eric Niebler wrote:

...

Aries Tao wrote:

...
hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: <snip>

I've done some investigation, and I've discovered a couple of things...

Correct file attached now... -- Eric Niebler Boost Consulting www.boost-consulting.com /////////////////////////////////////////////////////////////////////////////// // main.hpp // // Copyright 2007 Eric Niebler. Distributed under the Boost // Software License, Version 1.0. (See accompanying file // LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) #include <cstring> #include <iostream> #include <boost/regex.hpp> int main() { std::size_t const Mb = 1048576; // 1Mb char *begin = new char[Mb]; char *end = begin + Mb; std::memset(begin, 0x6f, 1048576); char const *pattern = "([a-z#~_\\.!\\#$%\\^&\\*\$\$\\-]+@[a-z#_\\-]+\\.[a-z#_\\-\\.]+)"; try { using namespace boost; regex token(pattern); // fast, doesn't throw: cregex_iterator cur(begin, end, token); // slow, throws on memory exhaustion: regex_search(begin, end, token); } catch(std::exception const &e) { std::cout << "boost.regex error: " << e.what() << std::endl; } return 0; }

Eric Niebler

26 Jan 26 Jan

7:53 p.m.

New subject: boost.xpressive is inefficient while seaching in raw bytes

Eric Niebler wrote:

...

Aries Tao wrote:

...
hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: <snip>

After I figured out what John's code was doing, it was a pretty simple matter to implement a similar optimization in xpressive, which I have just committed to trunk. It should dramatically speed up repeated searches for patterns that begin with a simple repeat, such as "[a-z]+". When used together with an independent subexpression to eliminate unnecessary backtracking, the performance is really quite good. I'm attaching a little benchmark program which uses your pattern to search a 1Gb buffer. Here are the timings after my fix: Boost.regex: 6.859 Boost.xpressive (dynamic): 3.02 Boost.xpressive (static): 2.912 With boost.regex, I couldn't use an independent subexpression because it disables the optimization. That accounts for the difference. Thanks for bringing this issue to my attention. -- Eric Niebler Boost Consulting www.boost-consulting.com /////////////////////////////////////////////////////////////////////////////// // main.hpp // // Copyright 2007 Eric Niebler. Distributed under the Boost // Software License, Version 1.0. (See accompanying file // LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) #include <cstring> #include <iostream> #include <boost/timer.hpp> #include <boost/regex.hpp> #include <boost/xpressive/xpressive.hpp> int main() { std::size_t const Mb = 1000000000; char *begin = new char[Mb]; char *end = begin + Mb; std::memset(begin, 0x6f, Mb); try { using namespace boost; cregex_iterator e; char const *pattern = "([a-z#~_\\.!\\#$%\\^&\\*\$\$\\-]+@[a-z#_\\-]+\\.[a-z#_\\-\\.]+)"; regex token(pattern, (regex_constants::syntax_option_type)(regex_constants::ECMAScript | regex_constants::optimize)); boost::timer tim; tim.restart(); cregex_iterator cur(begin, end, token); for(; cur != e; ++cur) ; double elapsed = tim.elapsed(); std::cout << "Boost.regex: " << elapsed << '\n'; } catch(std::exception const &e) { std::cout << "boost.regex error: " << e.what() << std::endl; } try { using namespace boost::xpressive; cregex_iterator e; // test a dynamic regex char const *pattern = "((?>[a-z#~_\\.!\\#$%\\^&\\*\$\$\\-]+)@[a-z#_\\-]+\\.[a-z#_\\-\\.]+)"; cregex token = cregex::compile(pattern, regex_constants::ECMAScript | regex_constants::optimize); boost::timer tim; tim.restart(); cregex_iterator cur(begin, end, token); for(; cur != e; ++cur) ; double elapsed = tim.elapsed(); std::cout << "Boost.xpressive (dynamic): " << elapsed << '\n'; // test a static regex token = (s1= keep(+set[range('a','z') | '#' | '~' | '_' | '.' | '!' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-']) >> '@' >> +set[range('a','z') | '#' | '_' | '-'] >> '.' >> +set[range('a','z') | '#' | '_' | '-' | '.'] ); tim.restart(); cregex_iterator cur2(begin, end, token); for(; cur2 != e; ++cur2) ; elapsed = tim.elapsed(); std::cout << "Boost.xpressive (static): " << elapsed << '\n'; } catch(std::exception const &e) { std::cout << "boost.xpressive error: " << e.what() << std::endl; } return 0; }

John Maddock

27 Jan 27 Jan

6:46 p.m.

New subject: boost.xpressive is inefficient while seaching inraw bytes

Eric Niebler wrote:

...

...
Boost.regex: 6.859 Boost.xpressive (dynamic): 3.02 Boost.xpressive (static): 2.912

With boost.regex, I couldn't use an independent subexpression because it disables the optimization. That accounts for the difference.

I've just extended the optimization to more cases in Boost.Regex (SVN Trunk), so the two libraries should be back on a par now performance wise (I hope!) Cheers, John.

Eric Niebler

28 Jan 28 Jan

8:06 p.m.

New subject: boost.xpressive is inefficient while seaching inraw bytes

John Maddock wrote:

...

Eric Niebler wrote:

...
...
Boost.regex: 6.859 Boost.xpressive (dynamic): 3.02 Boost.xpressive (static): 2.912

With boost.regex, I couldn't use an independent subexpression because it disables the optimization. That accounts for the difference.

I've just extended the optimization to more cases in Boost.Regex (SVN Trunk), so the two libraries should be back on a par now performance wise (I hope!)

Confirmed, nice work. <tips hat> -- Eric Niebler Boost Consulting www.boost-consulting.com

Robert Ramey

29 Jan 29 Jan

2:15 a.m.

New subject: boost.xpressive is inefficient while seachinginraw bytes

I have sort of a dumb question about this. What would be the motivation for using Xpressive if its not faster than regex? Putting it another way, I would have thought that Xpressive would be faster with the cost of more code being instantiated every time it's invoked. What else is going on here? Another question: How would using expressive be different than using spirit with a regular expression grammar? I have no purpose for asking other than idle curiosity. Robert Ramey Eric Niebler wrote:

...

John Maddock wrote:

...
Eric Niebler wrote:

...
...
Boost.regex: 6.859 Boost.xpressive (dynamic): 3.02 Boost.xpressive (static): 2.912

With boost.regex, I couldn't use an independent subexpression because it disables the optimization. That accounts for the difference.

I've just extended the optimization to more cases in Boost.Regex (SVN Trunk), so the two libraries should be back on a par now performance wise (I hope!)

Confirmed, nice work. <tips hat>

Eric Niebler

2:44 a.m.

New subject: boost.xpressive is inefficient while seachinginraw bytes

Robert Ramey wrote:

...

I have sort of a dumb question about this. What would be the motivation for using Xpressive if its not faster than regex? Putting it another way, I would have thought that Xpressive would be faster with the cost of more code being instantiated every time it's invoked. What else is going on here? Another question: How would using expressive be different than using spirit with a regular expression grammar?

I have no purpose for asking other than idle curiosity.

When there is an optimization opportunity that reduces the search problem to "find the end of this buffer", as in this case, then its true that boost.regex can do it as fast as anybody. That shouldn't be surprising. In other cases, you can find examples of patterns and search strings which make either boost.regex or xpressive appear faster, but not algorithmically so. That's because they both solve the same problem in roughly the same way, algorithmically speaking. In my tests, I find xpressive to better handle shorter matches (search string < 1K) and boost.regex to better handle longer matches. YMMV. The key advantage of xpressive's static syntax is that it makes it simple to refer to data and code elsewhere in your program. In xpressive's docs, see the sections "Grammars and Nested Matches", "Semantic Actions and User-Defined Assertions" and "Symbol Tables and Attributes" for examples of things that boost.regex cannot do. Xpressive differs from Spirit in two major respects: 1) In xpressive, a regex can be specified as a string at runtime. In Spirit, a grammar rule cannot be specified that way. Only with xpressive can an entire grammar be entirely specified in a config file read at program start-up, for example. 2) Xpressive regexes have exhaustive backtracking semantics, even when they are assembled into grammars. Spirit grammars do not do exhaustive backtracking. That can make it harder to author some types of patterns. Hope that helps, -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler

30 Jan 30 Jan

2:31 a.m.

New subject: boost.xpressive is inefficient while seachinginraw bytes

Eric Niebler wrote:

...

Robert Ramey wrote:

...
I have sort of a dumb question about this. What would be the motivation for using Xpressive if its not faster than regex? Putting it another way, I would have thought that Xpressive would be faster with the cost of more code being instantiated every time it's invoked. <snip>

I've already largely answered this in a previous message, but I wanted to add that depending on how you use xpressive, you may end up with *less* code instantiated than with boost.regex. That's because with static regexes, you only pay for the features you use, but with boost.regex, you have pay for everything up front. On the other hand, if you have lots of complicated static regexes, the code can get big. Finally, there actually is a measurable performance difference between boost.regex and xpressive in this case, even after John's tweaks: Boost.regex: 3.411 Boost.xpressive (dynamic): 3.015 Boost.xpressive (static): 2.949 So as you initially thought, static xpressive is marginally faster in this case. On my hardware. In this phase of the moon. Until John one-ups me again. ;-) -- Eric Niebler Boost Consulting www.boost-consulting.com

6395

Age (days ago)

6405

Last active (days ago)

List overview

Download

9 comments

4 participants

participants (4)

Aries Tao
Eric Niebler
John Maddock
Robert Ramey