Re: [boost] [soc 07] SoC 2007 Projects

12 Apr 2007

      Jeff Garland wrote:
...
Hi All -
I'm happy to announce that Boost will be mentoring 9 Google SoC projects this 
year.  They are:
<snip>
...
Application to extend Boost regex with recursive matching, named 
sub-expressions, and automata-based matching
      			Hugh Wimberly 	John Maddock
<snip>

A DFA or Thompson NFA regex backend would be very nice, but I'm a bit 
dubious of the claim that it can be faster than the backtracking engine 
in any but the most pathological cases. I've toyed with various 
approaches, and communicated with Russ Cox after his excellent article 
on the topic: http://swtch.com/~rsc/regexp/regexp1.html. The trouble is 
that getting ECMAScript semantics (captures, leftmost biased) saddles 
these approaches with bookkeeping that makes the Thompson NFA 
considerably slower in the common case than the 
dumber-than-a-bag-of-hammers backtracking NFA. I ported Russ' Thompson 
NFA-with-captures-and-leftmost-bias to C++/Boost, wrote a Spirit-based 
regex parser, separated the mutable state for thread-safety and did an 
apples-to-apples shootout against xpressive's backtracking NFA 
implementation. The result: for common regexes, xpressive was about 30x 
faster. But it's also true that pathological regexes exist that expose 
the weaknesses of the backtracking approach. <shrug>

I would be happy to make the Thompson NFA engine publicly available. I'm 
sure it can be improved, but I tend to doubt a 30x speedup is possible.

My advise: implement a simple DFA and use it only for patterns without 
captures (regex_constants::nosubs) or fancy assertions (eg. look-ahead). 
  If you need captures and leftmost-biased semantics, just use the 
backtracking NFA.

Oh, and make the DFA so that I can use it in xpressive, too! :-)
...
And, if you aren't coming to BoostCon, well, you should ;-)
http://www.boostcon.com/registration
You bet! See you there!

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com