Phil Endecott,
Matching in time linear in the length of the string works fine if the
regex doesn't have alternations, but breaks down completely if they are
present.
Imagine you have the regex
"auto|bool|const|continue|...|([A-Za-z_][A-Za-z0-9_]*)" set up to match
any C token (the ... stands for the remaining keywords, integers, etc.)
but your input is "continuation". This regex will try each alternative
in turn, get almost all the way through "continue", but then have to
backtrack and try the other alternatives. It's only when you get to the
very last one that the string will match. And it would be possible to
contrive even worse examples (a regex match over a dictionary, for
example).
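To make the backtracking concrete, here is a minimal C++ sketch using std::regex with a shortened keyword list. The name is_c_token is hypothetical, only the four keywords named above are listed, and the identifier alternative is assumed to accept a lowercase first letter so that "continuation" can match at all.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole input is one of a few keywords
// or an identifier. A real token regex would list many more alternatives.
bool is_c_token(const std::string& input) {
    // ECMAScript alternation is ordered, so the identifier rule is tried last.
    static const std::regex token(
        "auto|bool|const|continue|([A-Za-z_][A-Za-z0-9_]*)");
    // regex_match succeeds only if the entire input matches.
    return std::regex_match(input, token);
}
```

Calling is_c_token("continuation") returns true, but in a backtracking implementation only after "const" and "continue" have each been tried and abandoned part-way through.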
John Maddock was saying that the complexity can't be specified for regex
in general, rather than talking about your specific example.
On 7/23/21, Phil Endecott via Boost wrote:
Thanks all for your replies.
John Maddock wrote:
Complexity for regular expression is really really hard to specify
I disagree. Pretty much by definition, Regular Expressions can be matched in time linear in the length of the string and that's what I'd expect a std::c++ spec to require, just as it requires sorting to be N log N etc. etc. The fact that Perl >bleurgh< chose to provide what I'd call "irregular expressions" 25 years ago doesn't mean that we had to copy that. I reject the idea that we have to support back-references in patterns because "people expect that" - I've never used them.
Dominique Devienne wrote:
Have you tried Russ Cox's RE2 from https://github.com/google/re2
Thanks for the pointer. He has a series of three really good documents explaining the history of regular expression implementations and how we've slipped into this hole where the most common implementations have poor complexity guarantees. See:
https://swtch.com/~rsc/regexp/
This also prompted me to look at CTRE, which compiles regular expressions from strings at compile time and even works pre-C++20. See: https://github.com/hanickadot/compile-time-regular-expressions
Gavin Lambert wrote:
In pretty much all regexp languages, if you want to match '-' inside a character set then you must specify it as the first character
I was looking at the cppreference docs here:
https://en.cppreference.com/w/cpp/regex/ecmascript
The grammar they give includes:
CharacterClass ::
    [ [ lookahead ∉ {^}] ClassRanges ]
    [ ^ ClassRanges ]
ClassRanges ::
    [empty]
    NonemptyClassRanges
NonemptyClassRanges ::
    ClassAtom
    ClassAtom NonemptyClassRangesNoDash
    ClassAtom - ClassAtom ClassRanges
ClassAtom ::
    -
    ClassAtomNoDash
    ClassAtomExClass(C++ only)
    ClassAtomCollatingElement(C++ only)
    ClassAtomEquivalence(C++ only)
ClassAtomNoDash ::
    SourceCharacter but not one of \ or ] or -
    \ ClassEscape
I believe that allows [A-Z0-9-_/], doesn't it?
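One way to see what a particular implementation does with such a class is simply to try compiling it. The sketch below does that for std::regex; the name pattern_compiles is hypothetical, and the result for [A-Z0-9-_/] is deliberately not stated here because it depends on the library in use.

```cpp
#include <regex>
#include <string>

// Hypothetical probe: does this standard library accept the pattern?
bool pattern_compiles(const std::string& pattern) {
    try {
        std::regex re(pattern);  // ECMAScript grammar by default
        return true;
    } catch (const std::regex_error&) {
        return false;  // the library rejected the pattern
    }
}
```

Running pattern_compiles("[A-Z0-9-_/]") next to pattern_compiles("[-A-Z0-9_/]") shows whether the position of the dash matters for the library at hand.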
Anyway, all this prompted me to do some more investigation and some benchmarking. The libraries that I have tried are libstdc++ (as supplied with g++ 8.3, so rather old), Boost.Regex, Boost.Xpressive (with run-time expression strings, not the Spirit-like compile time mode) (both Boost version 1.75), RE2, and CTRE.
What I'm trying to do is to sanitise the input to an internet-exposed process, to reject malicious input'); drop table users; As an example I'll look at input that is supposed to be base-64 encoded and no more than a couple of kilobytes long.
Typical-case performance doesn't matter much as this runs once per process invocation (and hence caching the compiled regex doesn't help either), but I do want to be sure that it doesn't have bad worst-case complexity in the face of pathological input. So my first test is a quick check with a regular expression that might trigger worst-case behaviour in a non-linear implementation:
a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa
matched against: aaaaaaaaaaaaaaaaaaaa
The execution times were:
CTRE:        1.1 us
RE2:         148 us
Xpressive:   27286 us
Boost.Regex: 31 us
libstdc++:   88632 us
Based on that, Xpressive and libstdc++ can be rejected immediately. (Of course this doesn't prove that the others use exclusively-linear algorithms; they may have heuristics that handle that case or just got lucky; this is why I believe there should be complexity guarantees.)
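For anyone wanting to repeat this kind of check, the following is a minimal sketch for std::regex only. The names pathological_pattern, TimedMatch and time_pathological_match are hypothetical, and timing just the match (not regex construction) is an assumption about what the figures above measure.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Builds "a?" repeated n times followed by "a" repeated n times.
std::string pathological_pattern(std::size_t n) {
    std::string p;
    for (std::size_t i = 0; i < n; ++i) p += "a?";
    p.append(n, 'a');
    return p;
}

struct TimedMatch {
    bool matched;
    std::chrono::microseconds::rep microseconds;
};

// Matches a string of n 'a' characters against the pattern above and
// times only the match itself, not the construction of the regex.
TimedMatch time_pathological_match(std::size_t n) {
    const std::regex re(pathological_pattern(n));
    const std::string subject(n, 'a');
    const auto start = std::chrono::steady_clock::now();
    const bool matched = std::regex_match(subject, re);
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    return {matched, elapsed.count()};
}
```

time_pathological_match(20) corresponds to the pattern and input shown above. Increasing n by one or two at a time shows how quickly the time grows in a given implementation.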
Here are the patterns that I have benchmarked for my base64 test:
1. [A-Za-z0-9/+=]{0,8192}
2. [A-Za-z0-9/+=]*
3. (?:[A-Za-z0-9/+]{4}){0,2048}(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
4. (?:[A-Za-z0-9/+]{4})*(?:|(?:[A-Za-z0-9/+]{2}==)|(?:[A-Za-z0-9/+]{3}=))
Recall that base64 has chunks of four printable characters with the final chunk using = to pad. Variants 3 & 4 strictly check the padding. Variants 1 and 3 check for excessive length while 2 & 4 require a separate check to do that.
Note that I'm using the "non capturing" syntax (?: ) rather than ( ) because I only need the boolean match result.
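As a cross-check on what variant 3 is meant to accept, the same rules can be written out by hand in a single linear pass. This is a sketch without any regex library, not one of the benchmarked implementations; the names is_base64_char and is_strict_base64 and the max_full_chunks parameter are hypothetical.

```cpp
#include <cstddef>
#include <string>

// True for the characters in the class [A-Za-z0-9/+].
bool is_base64_char(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '/' || c == '+';
}

// Hand-written counterpart of variant 3: up to max_full_chunks unpadded
// four-character chunks, optionally followed by one chunk padded with
// "=" or "==".
bool is_strict_base64(const std::string& s,
                      std::size_t max_full_chunks = 2048) {
    if (s.size() % 4 != 0) return false;  // whole chunks only
    // Padding is "=" or "==" at the very end; the empty string has none.
    std::size_t pad = 0;
    if (!s.empty() && s.back() == '=') {
        pad = (s[s.size() - 2] == '=') ? 2 : 1;
    }
    const std::size_t full_chunks = s.size() / 4 - (pad != 0 ? 1 : 0);
    if (full_chunks > max_full_chunks) return false;  // length limit
    // Everything before the padding must come from the base64 alphabet.
    for (std::size_t i = 0; i + pad < s.size(); ++i) {
        if (!is_base64_char(s[i])) return false;
    }
    return true;
}
```

For example, "QUJD", "QUI=" and "QQ==" are accepted, while "Q===", "QUJ" and "QU=D" are rejected.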
First a note on compatibility. I noted before that expressions like [A-Za-z0-9-_/] were accepted by some libraries but not others. I found two other issues: only libstdc++ would accept [A-Z]{4}*, while the others all required ([A-Z]{4})*. Then RE2 rejected the {0,8192} and {0,2048} repeats - it limits them to some smaller value.
A note on compile times (g++ 8.3 -O3): there was a substantial variation, with RE2 and CTRE being the fastest, Boost.Regex and libstdc++ in the middle, and Boost.Xpressive slowest. The difference from fastest to slowest was about 10X. It was interesting that the "Compile Time Regular Expression" library CTRE was one of the fastest to compile!
Regarding run-time performance, testing with about 3 kbytes of input data: CTRE was fastest. RE2 was second in the two expressions that it did not reject. Boost.Regex was last.
My conclusion is that CTRE is the best choice, and I would recommend it unless (a) you need to specify the regular expression at runtime, or (b) you need some of the "irregular" Perl extension syntax.
I hope that is of interest.
Regards, Phil.
P.S. I am subscribed to the list digest, which used to flush whatever had been posted at least once per day; now it doesn't seem to send anything until it has reached its threshold. Do others see this or is it just me? Can it be fixed? I would have replied earlier, and separately to the other replies, if I had received a digest at some point.
-- Soronel Haetir soronel.haetir@gmail.com