anfa-regex performance

12 Jan 2005

      Some time ago Eric Niebler wrote:
...
You commented on the algorithmic complexity, but have you
benchmarked against Boost.regex? I'd be very curious to see how your
back-end stacks up performance-wise.
I have done some benchmarking of ANFA-regex versus Boost.Regex, and
it revealed the following (quite expected) trends.

    For very simple match problems ANFA-regex is by about an order of
    magnitude slower. Moreover, the slowdown noticeably increases
    with the number of capturing parentheses.

    For real-world-like match problems, like finding paragraphs of
    "libs/libraries.htm" by matching it against ".*?(<p>.*?</p>)",
    the slowdown drops to about 1.5x--2x. However, in this case the
    "regex_search" function of Boost.Regex performs much faster than
    "regex_match" (thanks to Knuth-Morris-Pratt algorithm, AFAICS).

    For complex / pathological problems ANFA-regex performs much
    faster (any number of orders of magnitude, depending on the
    problem) than Boost.Regex---the linear complexity pays off.

Both Boost.Regex and ANFA-regex have linear complexity for simple
regexes. Highly optimized nature of Boost.Regex allows its complexity
to have much lower time constant than that of ANFA-regex, resulting
in much better performance. However, for more involved regexes
Boost.Regex gets nonlinear complexity, with respective consequences.

The most obvious way of speeding up is to optimize the actual code
(or help the compiler to optimize it). The problem is, the ANFA
simulation algorithm is quite simple and straightforward, so there's
not much room for improvement. On the other hand, there's a whole new
story in optimizing the algorithm itself.

Two main slowdown factors are (1) excessive tag copying and (2)
nondeterminism. I can name two ways to reduce (1). The factor (2) is
all about ANFA to ADFA conversion.

These optimizations are subject to additional research.

And, of course, there are additional speed reserves in KMP for search
problems.

(A side note. An ANFA version used for benchmarking is not the version
I uploaded into yahoo groups recently---I actually managed to double
its speed, or something like that.)

-- 
Best regards,
Vladimir Pozdyayev.

Vladimir Pozdyayev

John Maddock

Eric Niebler

tags

participants (3)