
We have encountered a problem in the regex++ package, which reports
having exhausted memory after examining a very short string with a
regular expression of only modest complexity. I realize that the
documentation for the package does not specify how much memory usage is
too much, but since the same combination of regular expression and test
string works without problems with a number of other programming
toolsets (e.g., Python, Perl, C#) I'm guessing that the maintainers of
the package would be interested in tracking down the problem (I would if
it were my software). Here's a repro case which boils the problem down
to the tiniest example:
#include

Bob Kline wrote:
#include
int main() { boost::wregex e(L"^[^\\s]( ?([^\\s]+ ?)*[^\\s])?$"); boost::wcmatch m; boost::regex_match(L"codeine phosphate ", m, e); return 0; }
[...]
I'm pretty sure that the expression is boiled down to the least ambiguous form (without changing the semantics). In plain English, it's looking for strings that have no leading or trailing whitespace, and for which any internal whitespace runs are comprised solely of a single blank character.
I've no idea which part of the regexp causes the library to give up, but isn't there a much simpler reformulation: ^[^\s]+( [^\s]+)*$ that achieves the same result? In addition to running at all, it should also run much faster. :-) Unless I've missed something, of course.

Bob Kline wrote:
We have encountered a problem in the regex++ package, which reports having exhausted memory after examining a very short string with a regular expression of only modest complexity. I realize that the documentation for the package does not specify how much memory usage is too much, but since the same combination of regular expression and test string works without problems with a number of other programming toolsets (e.g., Python, Perl, C#) I'm guessing that the maintainers of the package would be interested in tracking down the problem (I would if it were my software). Here's a repro case which boils the problem down to the tiniest example:
This is probably not what you want to hear, but the behaviour is by design: Perl-regexes are in the worst case NP-Hard problems to solve, so there is a built in state-count that tracks how many states in the machine are visited and throws an exception if the number of states visited looks too large compared to the size of the string. The aim is to weed out "bad" regexes that may cause problems with certain texts and not others before they can do too much damage. Obviously this is a heuristic that sometimes throws too early or too late. When this issue occurs it's almost always a problem with the expression being ambiguous: each time a repeat or other decision has to be made then either alternative can be taken, so if no match is found the regex state machine has to thrash around trying ever possible combination. Is there any reason why you can't simplify your expression to just "^[^\\s]+( ([^\\s]+)*)?$" which does the same thing much more efficiently? HTH, John.

-----Original Message----- From: boost-users-bounces@lists.boost.org on behalf of John Maddock Sent: Tue 2/6/2007 7:31 AM To: boost-users@lists.boost.org Subject: Re: [Boost-users] Regex failure Is there any reason why you can't simplify your expression to just "^[^\\s]+( ([^\\s]+)*)?$" which does the same thing much more efficiently? [Nat] <offtopic> Is there any research work about programmatically optimizing a given regex to an equivalent, but more efficient, one? </offtopic>

Nat Goodspeed wrote:
[Nat] <offtopic> Is there any research work about programmatically optimizing a given regex to an equivalent, but more efficient, one? </offtopic>
That would certainly be useful. Last time I did a literature search in this area (a few years ago now I admit) there was a lot of work on *simplified* regular expressions: there was some overlap with search engine technology which made this a hot topic for a while. Likewise anything that can be reduced to a DFA ends up being automatically simplified, but I found very little for Perl-style expressions. John.

On 2/6/2007 7:31 AM, John Maddock wrote:
Is there any reason why you can't simplify your expression to just "^[^\\s]+( ([^\\s]+)*)?$" which does the same thing much more efficiently?
No reason at all. I have no idea why we didn't see that when were looking at this originally, unless it was just way too late in the afternoon that day. :-) Thanks, John (and Peter, who also suggested the same simplification), and sorry for the wasted bandwidth. Bob
participants (4)
-
Bob Kline
-
John Maddock
-
Nat Goodspeed
-
Peter Dimov