On 21/07/2021 16:19, Phil Endecott via Boost wrote:
Dear Experts,
I recently wanted to add some input-sanitising code to a C++ web application, and as I had some existing Javascript code I decided to port its regex-based input checking to the C++ code. This didn't go as well as I had expected!
Here's an example of the sort of thing I have:
if (!regexp_match(s, regex("[A-Za-z0-9-_/]{1,8192}"))) throw MalformedInput();
I would definitely cache the regex object in a case like that - constructing and analyzing the expression is quite a significant task.
The Javascript equivalent seems to be fine. In C++, with libstdc++'s regex implementation, it seems to take about 2 seconds to run. Boost.Regex didn't like the '-' after the '9' but when I fixed that the execution time became negligible. Boost.Xpressive seems also to be fast.
On investigation it seems that this is due to there being two different algorithms, one of which (DFA) is linear in the string size but exponential in the pattern size (which in this case seems to mean exponential in the max repeat value of 8192), and the other (backtracking) is linear in the pattern size but exponential in the string size.
It seems that the standard deliberately doesn't specify complexity guarantees for std::regex, leaving the choice of algorithm to the implementation. I find this a bit surprising. It's true that both algorithms have their strengths, but not knowing which you are going to get is the worst of both worlds - you would have to assume that it will be exponential in both pattern and string sizes in portable code! (Imagine a container that doesn't disclose whether it has vector or list complexity!)
Complexity for regular expression is really really hard to specify - in particular matching Perl style regular expressions are known to NP-complete in the worst case, but fortunately the worst case is rarely encountered, and in the most normal usage, Perl regexes are reasonably performant.
So I'm wondering if it would be useful for regex libraries to somehow be explicit about which algorithm they are and what the complexity is.
Compare, for example, with the text search algorithms (Boyer-Moore etc) which indicate the algorithm in their names. Could we have regex_dfa / regex_backtracking, or something?
Now having said that, I think that what I really need for my application is something that is worst-case linear in both the pattern size and the input string. I'm not sure if the apparently-fast Boost.Regex algorithm has a worst-case input that is slow. ("Regular Expression Denial Of Service").
What Boost.Regex has that other libraries do not (so far as I know), is a cap on the maximum number of machine states which may be visited while attempting a match. This is quite deliberately to prevent DOS attacks, and was built into the library from the start. The limit is set at approximately O(N^2) before a runtime error is triggered. HTH, John.
It's clear that my example can be written without a regex as a simple linear loop, and maybe some regex libraries have heuristics that do that.
Thoughts anyone?
Regards, Phil.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus