
Jeff Garland wrote:
Hi All -
I'm happy to announce that Boost will be mentoring 9 Google SoC projects this year. They are: <snip>
Application to extend Boost regex with recursive matching, named sub-expressions, and automata-based matching Hugh Wimberly John Maddock
<snip> A DFA or Thompson NFA regex backend would be very nice, but I'm a bit dubious of the claim that it can be faster than the backtracking engine in any but the most pathological cases. I've toyed with various approaches, and communicated with Russ Cox after his excellent article on the topic: http://swtch.com/~rsc/regexp/regexp1.html. The trouble is that getting ECMAScript semantics (captures, leftmost biased) saddles these approaches with bookkeeping that makes the Thompson NFA considerably slower in the common case than the dumber-than-a-bag-of-hammers backtracking NFA. I ported Russ' Thompson NFA-with-captures-and-leftmost-bias to C++/Boost, wrote a Spirit-based regex parser, separated the mutable state for thread-safety and did an apples-to-apples shootout against xpressive's backtracking NFA implementation. The result: for common regexes, xpressive was about 30x faster. But it's also true that pathological regexes exist that expose the weaknesses of the backtracking approach. <shrug> I would be happy to make the Thompson NFA engine publicly available. I'm sure it can be improved, but I tend to doubt a 30x speedup is possible. My advise: implement a simple DFA and use it only for patterns without captures (regex_constants::nosubs) or fancy assertions (eg. look-ahead). If you need captures and leftmost-biased semantics, just use the backtracking NFA. Oh, and make the DFA so that I can use it in xpressive, too! :-)
And, if you aren't coming to BoostCon, well, you should ;-)
You bet! See you there! -- Eric Niebler Boost Consulting www.boost-consulting.com