[Regex] "Truly" incremental search?

Hello,

The current "incremental search" code needs to resume the search where a possible match starts, and for certain expressions the worst-case time complexity (i.e. when input bytes arrive one at a time) becomes O(n^2), where n is the number of characters that match the expression:

Regex: omg.*wtf

Pass 1 input: o
Pass 1 input to matching: o
Pass 1 verdict: Partial, resume at offset 0.

Pass 2 input: m
Pass 2 input to matching: om
Pass 2 verdict: Partial, resume at offset 0.

Pass 3 input: g
Pass 3 input to matching: omg
Pass 3 verdict: Partial, resume at offset 0.

...

Pass n-1 input: t
Pass n-1 input to matching: omg...wt (length n-1)
Pass n-1 verdict: Partial, resume at offset 0.

Pass n input: f
Pass n input to matching: omg...wtf (length n)
Pass n verdict: Match found, starting at offset 0.

Total number of bytes matched = n*(n+1)/2, i.e. O(n^2).

I am looking for a way to feed the input bytes only once, in order to reduce the time complexity to approximately O(n). Of course I will lose the ability to recover submatches, but the ability is unnecessary for this particular application that I am considering. I cannot break the regex into two parts ("omg" and "wtf") either because it is specified as an input to the program.

Eugene
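For illustration, here is a minimal sketch (not code from the thread; the sample input and the buffer handling are assumptions) of the byte-at-a-time restart behaviour described above, using Boost.Regex's partial-match flag. The re-scan from the start of the possible match on every new byte is what produces the O(n^2) total work:

    #include <boost/regex.hpp>
    #include <iostream>
    #include <string>

    int main()
    {
        boost::regex re("omg.*wtf");
        std::string buffered;                    // bytes we have to keep around
        const std::string incoming = "omg..filler..wtf";

        for (char c : incoming)                  // bytes arrive one at a time
        {
            buffered += c;
            boost::smatch what;
            // match_partial: report a hit even if the match could only be
            // completed by characters that have not arrived yet.
            bool hit = boost::regex_search(buffered.cbegin(), buffered.cend(),
                                           what, re,
                                           boost::match_default | boost::match_partial);
            if (!hit)
            {
                buffered.clear();                // no possible match in flight
            }
            else if (what[0].matched)
            {
                std::cout << "full match: " << what[0] << "\n";
                break;
            }
            else
            {
                // Partial match: keep everything from the possible start
                // onward and re-scan it when the next byte arrives.  This
                // repeated re-scanning is the O(n^2) behaviour.
                buffered.erase(0, what[0].first - buffered.cbegin());
            }
        }
    }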

Eugene M. Kim wrote:
I am looking for a way to feed the input bytes only once, in order to reduce the time complexity to approximately O(n). Of course I will lose the ability to recover submatches, but the ability is unnecessary for this particular application that I am considering. I cannot break the regex into two parts ("omg" and "wtf") either because it is specified as an input to the program.
Hi,

It may sound silly, but what if you pre-process the input, splitting it by ".*", and do an AND match for all sub-strings?

input = "omg.*wtf" => input[] = "omg", "wtf";
return (regexp_match(input[0]) && regexp_match(input[1]));

cheers,
--renato

PS: of course the code above is pseudo-code and of course you'll take as many input parameters as available and not hard-coded like this.
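A rough sketch of that idea (the helper names are made up for illustration; it treats the pieces as independent patterns and ignores their ordering, which is part of what makes it only an approximation):

    #include <boost/regex.hpp>
    #include <string>
    #include <vector>

    // Split a pattern on each literal occurrence of ".*".
    std::vector<std::string> split_on_dot_star(const std::string& pattern)
    {
        std::vector<std::string> pieces;
        std::string::size_type pos = 0, next;
        while ((next = pattern.find(".*", pos)) != std::string::npos)
        {
            pieces.push_back(pattern.substr(pos, next - pos));
            pos = next + 2;
        }
        pieces.push_back(pattern.substr(pos));
        return pieces;
    }

    // AND together a regex_search for every piece.
    bool matches_all_pieces(const std::string& text, const std::string& pattern)
    {
        for (const std::string& piece : split_on_dot_star(pattern))
            if (!boost::regex_search(text, boost::regex(piece)))
                return false;
        return true;
    }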

Renato golin wrote:
Eugene M. Kim wrote:
I am looking for a way to feed the input bytes only once, in order to reduce the time complexity to approximately O(n). Of course I will lose the ability to recover submatches, but the ability is unnecessary for this particular application that I am considering. I cannot break the regex into two parts ("omg" and "wtf") either because it is specified as an input to the program.
Hi,
It may sound silly but what if you pre-process the input splitting it by ".*" and do an AND match for all sub-strings?
input = "omg.*wtf" => input[] = "omg", "wtf"; return (regexp_match(input[0]) && regexp_match(input[1]));
cheers, --renato
PS: of course the code above is pseudo-code and of course you'll take as many input parameters as available and not hard-coded like this.

That would work for simple cases where the two sub-patterns are separated by ".*", but fail for other non-trivial expressions such as "omg(bbq|cakes)*wtf", which the program must also be able to accept. ;_;
Eugene

Eugene M. Kim wrote:
PS: of course the code above is pseudo-code and of course you'll take as many input parameters as available and not hard-coded like this.

That would work for simple cases where the two sub-patterns are separated by ".*", but fail for other non-trivial expressions such as "omg(bbq|cakes)*wtf", which the program must also be able to accept. ;_;
Yeah, if you're doing full-text search it'd be fine, but I know what you mean.

When I started with Perl I found out that regular expressions could do virtually everything... after a while I realised that they were severely limited and that sometimes two (or more) regular expressions were much faster than one complex one. There is no magic in regular expressions, and people like Larry Wall worked very hard to optimize them for the generic case, but if your case is different you may find a better way of doing it. The problem is, as John rightly said, you'll lose generality. So, at the end of the day, you'll still have to pre-process your regex to see which case it fits best, and even that has limited quality (i.e. you can't always tell the best case for any given string).

Both regular expressions you mentioned, "omg.*wtf" and "omg(bbq|cakes)*wtf", are fairly simple, and when run over a reasonable amount of data they perform really well against other approaches (like specific parsers and state machines). I'd say your problem is external: you shouldn't need (or shouldn't allow the user to) supply such complex needles or such a big haystack, and you should cap execution time so that exceptions like this don't bring your service down.

Hope that helps,
--renato

That would work for simple cases where the two sub-patterns are separated by ".*", but fail for other non-trivial expressions such as "omg(bbq|cakes)*wtf", which the program must also be able to accept. ;_;
The technique is obviously extendable. I'm doing a similar thing where I have Boost.Regex and GRETA available as program options, but also my own code specialized for certain odd searches. I generally run thousands of regexes against 10-100 strings of 10-500k characters each.

To make the speed reasonable, I can compile the regex into a simple intermediate form and record which features I will need during searching: if you find a '(' or '|' then you just need to take a slightly different strategy than for "omg.*wtf". The overhead of parsing a small regex is obviously lower than making a pointless check everywhere along the samples.

It may also be worth indexing the string (make something similar to a database index or table of contents) if you have a lot of regexes with single-character matches, or identifying the most unique part of the regex and looking for that first. If every regex has a four-character exact-match requirement, ...ABCD..., there is little point checking for the complicated stuff when it isn't near the "ABCD" part.

Mike Marchywka
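The "identify the most unique part of the regex and look for that first" idea could be sketched roughly like this (the window size and helper name are assumptions, and a match straddling a window edge would be missed, so this is only an illustration of the pre-filtering strategy):

    #include <boost/regex.hpp>
    #include <algorithm>
    #include <string>

    // Find cheap occurrences of a literal the pattern is known to require,
    // and only run the full regex over a bounded window around each one.
    bool prefiltered_search(const std::string& haystack,
                            const std::string& required_literal,
                            const boost::regex& full_pattern,
                            std::ptrdiff_t window = 4096)
    {
        auto it = haystack.begin();
        while ((it = std::search(it, haystack.end(),
                                 required_literal.begin(),
                                 required_literal.end())) != haystack.end())
        {
            auto first = (it - haystack.begin() > window) ? it - window
                                                          : haystack.begin();
            auto last  = (haystack.end() - it > window) ? it + window
                                                        : haystack.end();
            if (boost::regex_search(first, last, full_pattern))
                return true;
            ++it;    // keep looking for later occurrences of the literal
        }
        return false;
    }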

The current "incremental search" code needs to resume search where the a possible match starts, and for certain expressions the worst-case time (i.e. input bytes come in one at a time) complexity becomes O(n^2) where n is the number of characters that match the expression:
It's much worse than that: Perl-style regular expressions are NP-hard to match in general, and the worst-case complexity is probably nearer to O(N!).
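To make that concrete (the pattern and input here are made up for illustration, not taken from the thread): a pattern with nested quantifiers can force a backtracking matcher to try an exponential number of ways to carve up the input before it can report failure. Boost.Regex may cut such a runaway search short by throwing an exception rather than running it to completion:

    #include <boost/regex.hpp>
    #include <exception>
    #include <iostream>
    #include <string>

    int main()
    {
        // Nested quantifiers: every way of grouping the run of 'a's into
        // one-or-more chunks is retried before the overall match can fail.
        boost::regex re("(a+)+x");
        std::string text(30, 'a');               // no 'x', so no match exists

        try
        {
            std::cout << std::boolalpha
                      << boost::regex_search(text, re) << "\n";
        }
        catch (const std::exception& e)
        {
            // Boost.Regex may abort an overly complex match with an
            // exception instead of letting it grind on.
            std::cout << "gave up: " << e.what() << "\n";
        }
    }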
I am looking for a way to feed the input bytes only once, in order to reduce the time complexity to approximately O(n). Of course I will lose the ability to recover submatches, but the ability is unnecessary for this particular application that I am considering. I cannot break the regex into two parts ("omg" and "wtf") either because it is specified as an input to the program.
That's not supported by Boost.Regex or, as far as I know, Boost.Xpressive: Perl-style regex matchers need to be able to backtrack over the input in order to do their stuff. In order to get O(N) behaviour you would need a DFA-based regex engine, but would then lose:

* Marked subexpressions.
* Backreferences.
* All the Perl extensions: non-greedy repeats, the (? constructs, etc.

Can you not cache blocks of input rather than reading one character at a time (there's an example for searching large files that does that)? Of course, even then it is in principle possible to devise a regex that requires backtracking over the *whole* of the text before it can determine whether there is a match or not :-(

Sorry for not helping much!

John.
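For reference, the block-buffered variant might look something like this sketch (reading whole lines here stands in for reading large blocks; the buffer handling is an assumption). It amortises the cost per call but does not change the worst case, since a partial match can still force the unfinished tail to be kept and re-scanned:

    #include <boost/regex.hpp>
    #include <iostream>
    #include <string>

    int main()
    {
        boost::regex re("omg.*wtf");
        std::string buffer;
        std::string block;

        while (std::getline(std::cin, block))    // stand-in for reading big blocks
        {
            buffer += block;
            boost::smatch what;
            if (!boost::regex_search(buffer.cbegin(), buffer.cend(), what, re,
                                     boost::match_default | boost::match_partial))
            {
                buffer.clear();                  // nothing in progress
            }
            else if (what[0].matched)
            {
                std::cout << "match found\n";
                return 0;
            }
            else
            {
                // Partial match: carry only the unfinished tail into the
                // next block instead of the whole input seen so far.
                buffer.erase(0, what[0].first - buffer.cbegin());
            }
        }
        std::cout << "no match\n";
    }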

On Tue, Feb 12, 2008 at 08:57:05AM -0000, John Maddock wrote:
It's much worse than that: Perl-style regular expressions are NP-hard to match in general, and the worst-case complexity is probably nearer to O(N!).
...
That's not supported by Boost.Regex or, as far as I know, Boost.Xpressive: Perl-style regex matchers need to be able to backtrack over the input in order to do their stuff. In order to get O(N) behaviour you would need a DFA-based regex engine, but would then lose:

* Marked subexpressions.
* Backreferences.
* All the Perl extensions: non-greedy repeats, the (? constructs, etc.

Can you not cache blocks of input rather than reading one character at a time (there's an example for searching large files that does that)? Of course, even then it is in principle possible to devise a regex that requires backtracking over the *whole* of the text before it can determine whether there is a match or not :-(
Sorry for not helping much!
Not at all, because what you said is actually good to know... Since all I need to support is POSIX ERE and not Perl-style REs, I should probably just use pcre_dfa_exec() from the PCRE library instead; its documentation specifically states that the algorithm does not backtrack. (Funny, I have to use the PCRE library in order /not/ to use the PCRE syntax. :-p)

Thank you,
Eugene
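For completeness, a hedged sketch of how that restartable DFA matching is described in the original PCRE ("PCRE1") API documentation: pcre_dfa_exec() can report a partial match and then be called again with only the new bytes, the PCRE_DFA_RESTART option, and the same workspace, so no byte has to be examined twice. The chunking and buffer sizes here are illustrative, not prescriptive:

    #include <pcre.h>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        const char* err = 0;
        int erroffset = 0;
        pcre* re = pcre_compile("omg.*wtf", 0, &err, &erroffset, 0);
        if (!re)
        {
            std::fprintf(stderr, "compile failed: %s\n", err);
            return 1;
        }

        int workspace[100];                      // keeps DFA state between calls
        int ovector[2];
        const char* chunks[] = { "omg..", "..filler..", "..wtf" };
        int options = PCRE_PARTIAL;              // first call: allow a partial match

        for (int i = 0; i < 3; ++i)
        {
            int rc = pcre_dfa_exec(re, 0, chunks[i],
                                   static_cast<int>(std::strlen(chunks[i])), 0,
                                   options, ovector, 2, workspace, 100);
            if (rc >= 0)
            {
                std::puts("match completed");
                break;
            }
            if (rc == PCRE_ERROR_PARTIAL)
            {
                // Restart with only the new data; the match state lives in
                // the workspace, so earlier bytes are never re-scanned.
                options = PCRE_PARTIAL | PCRE_DFA_RESTART;
                continue;
            }
            std::puts("no match possible");
            break;
        }
        pcre_free(re);
    }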
participants (4)
-
Eugene M. Kim
-
John Maddock
-
Mike Marchywka
-
Renato Golin