Re: [boost] Regexp performance guarantees

23 Jul 2021

      Phil Endecott,

Match time linear to the length of the string works fine if the regex
doesn't have alternations but breaks down completely if they are
present.

Imagine you have the regex
"auto|bool|const|continue|...|([A-Z_][A-Za-z0-9_]*)" set up to match
any C token (the ... in this case are missing keywords, integers, etc)
but your input is "continuation". This regex will try for each
alternate, get almost all the way with "continue" but then have to
backtrack and try the other alternates.  It's only when you get to the
very last that the string will match. And it would be possible to
contrive even worse examples (regex match over a dictionary for
example)

John Maddock was describing regex in general as not being able to
specify the complexity rather than your specific example.

On 7/23/21, Phil Endecott via Boost <boost@lists.boost.org> wrote:
...
Thanks all for your replies.
John Maddock wrote:
...
Complexity for regular expression is really really hard to specify
I disagree. Pretty much by definition, Regular Expressions can
be matched in time linear in the length of the string and that's
what I'd expect a std::c++ spec to require, just as it requires
sorting to be N log N etc. etc. The fact that Perl >bleurgh< chose
to provide what I'd call "irregular expressions" 25 years ago
doesn't mean that we had to copy that. I reject the idea that we
have to support back-references in patterns because "people expect
that" - I've never used them.
Dominique Devienne wrote:
...
Have you tried Russ Cox's RE2 from https://github.com/google/re2
Thanks for the pointer. He has a series of three really good
documents explaining the history of regular expression implementations
and how we've slipped into this hole where the most common
implementations have poor complexity guarantees. See:
https://swtch.com/~rsc/regexp/
This also prompted me to look at CTRE, which compiles regular
expressions from strings at compile time and even works pre-C++20.
See: https://github.com/hanickadot/compile-time-regular-expressions
Gavin Lambert wrote:
...
In pretty much all regexp languages, if you want to match '-'
inside a character set then you must specify it as the first
character
I was looking at the cppreference docs here:
https://en.cppreference.com/w/cpp/regex/ecmascript
The grammar they give includes:
CharacterClass ::
[ [ lookahead ∉ {^}] ClassRanges ]
    [ ^ ClassRanges ]
ClassRanges ::
[empty]
    NonemptyClassRanges
NonemptyClassRanges ::
ClassAtom
    ClassAtom NonemptyClassRangesNoDash
    ClassAtom - ClassAtom ClassRanges
ClassAtom ::
-
    ClassAtomNoDash
    ClassAtomExClass(C++ only)
    ClassAtomCollatingElement(C++ only)
    ClassAtomEquivalence(C++ only)
ClassAtomNoDash ::
SourceCharacter but not one of \ or ] or -
    \ ClassEscape
I believe that allows [A-Z0-9-_/], doesn't it?
Anyway, all this prompted me to do some more investigation and
some benchmarking. The libraries that I have tried are libstdc++
(as supplied with g++ 8.3, so rather old), Boost.Regex,
Boost.Xpressive (with run-time expression strings, not the
Spirit-like compile time mode) (both Boost version 1.75),
RE2, and CTRE.
What I'm trying to do is to sanitise the input to an internet-
exposed process, to reject malicious input'); drop table users;
As an example I'll look at input that is supposed to be base-64
encoded and no more than a couple of kilobytes long.
Typical-case performance doesn't matter much as this runs once
per process invocation (and hence also caching the compiled
regex doesn't help), but I do want to be sure that it doesn't
have bad worst-case complexity in the face of pathological
input. So my first test is a quick check with a regular
expression that should might trigger worst-case behaviour in
a non-linear implementation:
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa
matched against: aaaaaaaaaaaaaaaaaaaa
The execution times were:
CTRE: 1.1 us
RE2: 148 us
Xpressive: 27286 us
Boost.Regex: 31 us
libstdc++: 88632 us
Based on that, Xpressive and libstdc++ can be rejected immediately.
(Of course this doesn't prove that the others use exclusively-linear
algorithms; they may have heuristics that handle that case or just
got lucky; this is why I believe there should be complexity guarantees.)
Here are the patterns that I have benchmarked for my base64 test:
1. [A-Za-z0-9/+=]{0,8192}
2. [A-Za-z0-9/+=]*
3.
(?:[A-Za-z0-9/+]{4}){0,2048}(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
4. (?:[A-Za-z0-9/+]{4})*(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
Recall that base64 has chunks of four printable characters
with the final chunk using = to pad. Variants 3 & 4 strictly
check the padding. Variants 1 and 3 check for excessive length
while 2 & 4 require a separate check to do that.
Note that I'm using the "non capturing" syntax (?: ) rather
than ( ) because I only need the boolean match result.
First a note on compatibility. I noted before that expressions
like [A-Za-z0-9-_/] were accepted by some libraries but not others.
I found two other issues: only libstdc++ would accept [A-Z]{4}*,
while the others all required ([A-Z{4})*. Then RE2 rejected the
{0,8192} and {0,2048} repeats - it limits them to some smaller
value.
A note on compile times (g++ 8.3 -O3): there was a substantial
variation, with RE2 and CTRE being the fastest, Boost.Regex and
libstdc++ in the middle, and Boost.Xpressive slowest. The
difference from fastest to slowest was about 10X. It was interesting
that the "Compile Time Regular Expression" library CTRE was one of
the fastest to compile!
Regarding run-time performance, testing with about 3 kbytes of
input data: CTRE was fastest. RE2 was second in the two expressions
that it did not reject. Boost.Regex was last.
My conclusion is that CTRE is the best choice, and I would recommend
it unless (a) you need to specify the regular expression at runtime,
or (b) you need some of the "irregular" Perl extension syntax.
I hope that is of interest.
Regards, Phil.
P.S. I am subscribed to the list digest, which used to flush whatever
had been posted at least once per day; now it doesn't seem to send
anything until it has reached its threshold. Do others see this or is
it just me? Can it be fixed? I would have replied earlier, and
separately to the other replies, if I had received a digest at some
point.
_______________________________________________
Unsubscribe & other changes:
http://lists.boost.org/mailman/listinfo.cgi/boost
-- 
Soronel Haetir
soronel.haetir@gmail.com