boost.xpressive is inefficient while seaching in raw bytes
hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: //read file CFile file(_T("d:\\test.dat"),CFile::modeRead ); byte* pBuffer = new byte[file.GetLength()]; file.Read(pBuffer,file.GetLength()); std::string str((char*)pBuffer,file.GetLength()); sregex token = sregex::compile("([a-z#~_\\.!\\#$%\\^&\\*\\(\\)\\-]+@[a-z#_\\-]+\\.[a-z#_\\-\\.]+)"); sregex_iterator cur( str.begin(), str.end(), token ); sregex_iterator end; for( ; cur != end; ++cur ) { smatch const &what = *cur; std::string strResult = what[0]; int start = what[0].first - str.begin(); } _________________________________________________________________ Helping your favorite cause is as easy as instant messaging. You IM, we give. http://im.live.com/Messenger/IM/Home/?source=text_hotmail_join
Aries Tao wrote:
hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below:
How inefficient? Is perl or boost.regex significantly faster? -- Eric Niebler Boost Consulting www.boost-consulting.com
Aries Tao wrote:
hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: <snip>
I've done some investigation, and I've discovered a couple of things...
- The Boost.Regex algorithms that do not accept a match_results struct
automatically set the match_any flag. I don't see anything about that in
TR1. Compliance issue? Or is this covered by the as-if rule?
- The match_any flag has a dramatic adverse effect on the performance of
some regular expressions because of the following lines in
perl_matcher_non_recursive.hpp:
line 725:
//
// start by working out how much we can skip:
//
bool greedy = (rep->greedy) && (!(m_match_flags &
regex_constants::match_any) || m_independent);
line 753:
if(greedy)
{
if((rep->leading) && (count < rep->max))
restart = position;
This last line causes Boost.Regex to avoid exponential behavior. I don't
yet understand how this optimization works, or why match_any disables
it. John?
- Boost.Regex's regex_iterator invokes regex_search with a
match_results, so match_any is not set and the optimization kicks in. If
you pass the match_any flag to the regex_iterator constructor, the
constructor takes exponential time and throws on memory exhaustion.
- Boost.Xpressive lacks this optimization, and so always takes
exponential time, but does not exhaust memory and does not throw.
I'm attaching some code that demonstrates this strange situation. John,
could you comment on match_any and the re_repeat::leading optimization,
and why the regex algorithms are setting match_any?
Aries, your pattern presents a challenge to any backtracking regex
engine. Clever optimizations aside, there are some ways you can improve
the situation. In particular, think about wrapping your repeats in
independent sub-expressions like (?>[...]+). Unfortunately in this case,
that doesn't completely solve the problem because regex_search will
attempt a brute-force match at each of the 1 million positions within
the string. And, because of your pattern and the text against which
you're matching, each match will walk from the current position to the
end of the string. That's just going to be very, very slow.
--
Eric Niebler
Boost Consulting
www.boost-consulting.com
///////////////////////////////////////////////////////////////////////////////
// main.hpp
//
// Copyright 2007 Eric Niebler. Distributed under the Boost
// Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#include <cstring>
#include <iostream>
//#include
Eric Niebler wrote:
Aries Tao wrote:
hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: <snip>
I've done some investigation, and I've discovered a couple of things...
Correct file attached now...
--
Eric Niebler
Boost Consulting
www.boost-consulting.com
///////////////////////////////////////////////////////////////////////////////
// main.hpp
//
// Copyright 2007 Eric Niebler. Distributed under the Boost
// Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#include <cstring>
#include <iostream>
#include
Eric Niebler wrote:
Aries Tao wrote:
hi everybody,I use boost.xpressive to search email address in a binary file which size is 10*1024*1024 bytes. every bytes is 0x6f in that file.boost.xpressive is inefficient. anyone can help me? thanks! the code is below: <snip>
After I figured out what John's code was doing, it was a pretty simple
matter to implement a similar optimization in xpressive, which I have
just committed to trunk. It should dramatically speed up repeated
searches for patterns that begin with a simple repeat, such as "[a-z]+".
When used together with an independent subexpression to eliminate
unnecessary backtracking, the performance is really quite good.
I'm attaching a little benchmark program which uses your pattern to
search a 1Gb buffer. Here are the timings after my fix:
Boost.regex: 6.859
Boost.xpressive (dynamic): 3.02
Boost.xpressive (static): 2.912
With boost.regex, I couldn't use an independent subexpression because it
disables the optimization. That accounts for the difference.
Thanks for bringing this issue to my attention.
--
Eric Niebler
Boost Consulting
www.boost-consulting.com
///////////////////////////////////////////////////////////////////////////////
// main.hpp
//
// Copyright 2007 Eric Niebler. Distributed under the Boost
// Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#include <cstring>
#include <iostream>
#include
Eric Niebler wrote:
Boost.regex: 6.859 Boost.xpressive (dynamic): 3.02 Boost.xpressive (static): 2.912
With boost.regex, I couldn't use an independent subexpression because it disables the optimization. That accounts for the difference.
I've just extended the optimization to more cases in Boost.Regex (SVN Trunk), so the two libraries should be back on a par now performance wise (I hope!) Cheers, John.
John Maddock wrote:
Eric Niebler wrote:
Boost.regex: 6.859 Boost.xpressive (dynamic): 3.02 Boost.xpressive (static): 2.912
With boost.regex, I couldn't use an independent subexpression because it disables the optimization. That accounts for the difference.
I've just extended the optimization to more cases in Boost.Regex (SVN Trunk), so the two libraries should be back on a par now performance wise (I hope!)
Confirmed, nice work. <tips hat> -- Eric Niebler Boost Consulting www.boost-consulting.com
I have sort of a dumb question about this. What would be the motivation for using Xpressive if its not faster than regex? Putting it another way, I would have thought that Xpressive would be faster with the cost of more code being instantiated every time it's invoked. What else is going on here? Another question: How would using expressive be different than using spirit with a regular expression grammar? I have no purpose for asking other than idle curiosity. Robert Ramey Eric Niebler wrote:
John Maddock wrote:
Eric Niebler wrote:
Boost.regex: 6.859 Boost.xpressive (dynamic): 3.02 Boost.xpressive (static): 2.912
With boost.regex, I couldn't use an independent subexpression because it disables the optimization. That accounts for the difference.
I've just extended the optimization to more cases in Boost.Regex (SVN Trunk), so the two libraries should be back on a par now performance wise (I hope!)
Confirmed, nice work. <tips hat>
Robert Ramey wrote:
I have sort of a dumb question about this. What would be the motivation for using Xpressive if its not faster than regex? Putting it another way, I would have thought that Xpressive would be faster with the cost of more code being instantiated every time it's invoked. What else is going on here? Another question: How would using expressive be different than using spirit with a regular expression grammar?
I have no purpose for asking other than idle curiosity.
When there is an optimization opportunity that reduces the search problem to "find the end of this buffer", as in this case, then its true that boost.regex can do it as fast as anybody. That shouldn't be surprising. In other cases, you can find examples of patterns and search strings which make either boost.regex or xpressive appear faster, but not algorithmically so. That's because they both solve the same problem in roughly the same way, algorithmically speaking. In my tests, I find xpressive to better handle shorter matches (search string < 1K) and boost.regex to better handle longer matches. YMMV. The key advantage of xpressive's static syntax is that it makes it simple to refer to data and code elsewhere in your program. In xpressive's docs, see the sections "Grammars and Nested Matches", "Semantic Actions and User-Defined Assertions" and "Symbol Tables and Attributes" for examples of things that boost.regex cannot do. Xpressive differs from Spirit in two major respects: 1) In xpressive, a regex can be specified as a string at runtime. In Spirit, a grammar rule cannot be specified that way. Only with xpressive can an entire grammar be entirely specified in a config file read at program start-up, for example. 2) Xpressive regexes have exhaustive backtracking semantics, even when they are assembled into grammars. Spirit grammars do not do exhaustive backtracking. That can make it harder to author some types of patterns. Hope that helps, -- Eric Niebler Boost Consulting www.boost-consulting.com
Eric Niebler wrote:
Robert Ramey wrote:
I have sort of a dumb question about this. What would be the motivation for using Xpressive if its not faster than regex? Putting it another way, I would have thought that Xpressive would be faster with the cost of more code being instantiated every time it's invoked. <snip>
I've already largely answered this in a previous message, but I wanted to add that depending on how you use xpressive, you may end up with *less* code instantiated than with boost.regex. That's because with static regexes, you only pay for the features you use, but with boost.regex, you have pay for everything up front. On the other hand, if you have lots of complicated static regexes, the code can get big. Finally, there actually is a measurable performance difference between boost.regex and xpressive in this case, even after John's tweaks: Boost.regex: 3.411 Boost.xpressive (dynamic): 3.015 Boost.xpressive (static): 2.949 So as you initially thought, static xpressive is marginally faster in this case. On my hardware. In this phase of the moon. Until John one-ups me again. ;-) -- Eric Niebler Boost Consulting www.boost-consulting.com
participants (4)
-
Aries Tao
-
Eric Niebler
-
John Maddock
-
Robert Ramey