
Some time ago Eric Niebler wrote:
You commented on the algorithmic complexity, but have you benchmarked against Boost.regex? I'd be very curious to see how your back-end stacks up performance-wise.
I have done some benchmarking of ANFA-regex versus Boost.Regex, and it revealed the following (quite expected) trends. For very simple match problems ANFA-regex is by about an order of magnitude slower. Moreover, the slowdown noticeably increases with the number of capturing parentheses. For real-world-like match problems, like finding paragraphs of "libs/libraries.htm" by matching it against ".*?(<p>.*?</p>)", the slowdown drops to about 1.5x--2x. However, in this case the "regex_search" function of Boost.Regex performs much faster than "regex_match" (thanks to Knuth-Morris-Pratt algorithm, AFAICS). For complex / pathological problems ANFA-regex performs much faster (any number of orders of magnitude, depending on the problem) than Boost.Regex---the linear complexity pays off. Both Boost.Regex and ANFA-regex have linear complexity for simple regexes. Highly optimized nature of Boost.Regex allows its complexity to have much lower time constant than that of ANFA-regex, resulting in much better performance. However, for more involved regexes Boost.Regex gets nonlinear complexity, with respective consequences. The most obvious way of speeding up is to optimize the actual code (or help the compiler to optimize it). The problem is, the ANFA simulation algorithm is quite simple and straightforward, so there's not much room for improvement. On the other hand, there's a whole new story in optimizing the algorithm itself. Two main slowdown factors are (1) excessive tag copying and (2) nondeterminism. I can name two ways to reduce (1). The factor (2) is all about ANFA to ADFA conversion. These optimizations are subject to additional research. And, of course, there are additional speed reserves in KMP for search problems. (A side note. An ANFA version used for benchmarking is not the version I uploaded into yahoo groups recently---I actually managed to double its speed, or something like that.) -- Best regards, Vladimir Pozdyayev.

Apologies for being way behind with your messages - I haven't had a chance to examine your code yet.
Two main slowdown factors are (1) excessive tag copying and (2) nondeterminism. I can name two ways to reduce (1). The factor (2) is all about ANFA to ADFA conversion.
Copying, is always a real killer, unless you can elliminate memory allocations. More than anything else these will soak up cycles, especially on multi-processor systems where the locking can really kill performance (imagine multiple threads all calling new/delete constantly).
These optimizations are subject to additional research.
And, of course, there are additional speed reserves in KMP for search problems.
Actually, I think I may remove that code as a dis-optimisation, a simple scan for first character is probably faster in the *average* case, and of course if you can do a Boyer-search that's even better. Like you, more research is needed! John.

John Maddock wrote:
And, of course, there are additional speed reserves in KMP for search problems.
Actually, I think I may remove that code as a dis-optimisation, a simple scan for first character is probably faster in the *average* case, and of course if you can do a Boyer-search that's even better.
Like you, more research is needed!
I have experimented a bit in this area, and you're right. I've had best results with a three-pronged approach to searches: (1) Boyer-Moore is a *huge* win if you can do it. xpressive and GRETA both use BM to great effect when the pattern begins with a string literal. (2) As a fall-back, use the first character if you can. This handles patterns like (a...|b.|c..), where I will scan forward for one of {abc}. (3) Fall back on brute force search at every position. -- Eric Niebler Boost Consulting www.boost-consulting.com
participants (3)
-
Eric Niebler
-
John Maddock
-
Vladimir Pozdyayev