[regex] why partial match / early break ?
Hi, why is this regex

    const string sRe =
        "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9])\\.)+"
        "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))";

not matching this string wholly?

    "a1a.a2a.a3a.a4aaaa"

It rather matches only this part:

    "a1a.a2a.a3a.a4"

What's the problem here? (I know I can append a delimiter to solve the problem, but I think the regex should have matched it wholly, shouldn't it?)
I didn't read your code, but search for Greedy vs Nongreedy matching and
you'll get an answer.
Regards,
Júlio.
2011/11/1 U.Mutlu
Hi, why is this regex const string sRe = "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9])\\.)+" "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))";
not matching this string wholly? "a1a.a2a.a3a.a4aaaa"
It rather matches only this part: "a1a.a2a.a3a.a4"
What's the problem here? (I know I can append a delimiter to solve the problem, but I think the regex should have matched it wholly, shouldn't it?)
Júlio Hoffimann wrote, On 2011-11-01 15:59:
I didn't read your code, but search for Greedy vs Nongreedy matching and you'll get an answer.
I'm using Perl mode. Greedy matching is, IMHO, generally the default in regexes. One can disable it with '?', but that wasn't used in the example, so I wonder why it stops early.
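To illustrate the greedy/non-greedy distinction mentioned above, here is a minimal sketch. It assumes std::regex with its default ECMAScript grammar as a stand-in for Boost.Regex in Perl mode (the thread itself uses Boost), and the helper name greedy_demo_match is made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text of the first match found by an unanchored
// search, or an empty string when nothing matches.
// Assumption: std::regex (ECMAScript grammar) stands in for Boost.Regex.
std::string greedy_demo_match(const std::string& pattern,
                              const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With this helper, greedy_demo_match("a+", "aaaa") should return the whole run "aaaa", while the non-greedy greedy_demo_match("a+?", "aaaa") should return a single "a". Greediness is a property of each quantifier, so it does not by itself decide which branch of an alternation is tried first.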
U.Mutlu wrote, On 2011-11-01 15:37:
Hi, why is this regex const string sRe = "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9])\\.)+" "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))";
Changing the 2nd part as follows has solved the problem:

    "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]+))";
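The difference between the original and the modified pattern can be checked with a small sketch. It assumes std::regex (default ECMAScript grammar) as a stand-in for Boost.Regex in Perl mode; both are backtracking matchers, but the names thread_first_match, kPrefix, kOriginal and kModified are invented here and are not part of the original post.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text of the first match found by an unanchored
// search, or an empty string when nothing matches.
// Assumption: std::regex (ECMAScript grammar) stands in for Boost.Regex.
std::string thread_first_match(const std::string& pattern,
                               const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

// The repeated "element." prefix shared by both patterns in the thread.
const std::string kPrefix =
    "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9])\\.)+";

// Original last element: exactly one trailing [a-zA-Z0-9].
const std::string kOriginal =
    kPrefix + "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))";

// Modified last element: one or more trailing [a-zA-Z0-9].
const std::string kModified =
    kPrefix + "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]+))";
```

Under these assumptions, thread_first_match(kOriginal, "a1a.a2a.a3a.a4aaaa") should stop at "a1a.a2a.a3a.a4", whereas kModified should consume the whole string, because the greedy trailing "+" keeps taking letters and digits after the "4".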
"U.Mutlu"
why is this regex const string sRe = "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9])\\.)+" "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))";
not matching this string wholly? "a1a.a2a.a3a.a4aaaa"
It rather matches only this part: "a1a.a2a.a3a.a4"
What's the problem here? (I know I can append a delimiter to solve the problem, but I think the regex should have matched it wholly, shouldn't it?)
I think you're getting caught by "first match, not longest match". Put differently, the regex engine isn't backtracking where you think it is. Looking at the last segment of your target string, the match goes like so:

    ((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))
       ^
    Ok, "a" matches this charset.

    ((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))
                                         ^
    But "4" doesn't match ([a-zA-Z]), so we call the quantifier done.

    ((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))
                                          ^
    The "4" does match here, and that completes the match.

As you said, you can put ^ and $ if you want to force the match to the entire string.

I'm also curious if you're capturing those substrings for some purpose? If not, you can simplify it quite a bit (and probably make it faster, since regular parens require the RE engine to do a fair bit of extra work):

    ([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+

Becomes:

    [a-zA-Z][a-zA-Z0-9\\-]*

Also, note that "-" should be treated as a regular character if it's at the beginning or end of a character class, so we can drop those backslashes:

    [a-zA-Z][a-zA-Z0-9-]*

And I'd even consider using the preprocessor to make it a bit more readable:

    #define LETTER "a-zA-Z"
    #define DIGIT  "0-9"
    #define DASH   "-"
    #define ELEMENT \
        "[" LETTER "]" "[" LETTER DIGIT DASH "]*" "[" LETTER DIGIT "]"

    const string re( "(?:" ELEMENT "\\." ")+" ELEMENT );

    #undef ELEMENT
    #undef DASH
    #undef DIGIT
    #undef LETTER

(I suppose that using actual C++ strings, with a compiler that can do string expression evaluation at compile time, would be just as good if not preferable...)

This also makes clear the (possibly surprising) issue that you won't match single-character elements; I don't know if that's intentional or not. If you want to say "an element starts with a letter followed by any number of letters, digits, or dashes, but must end with a letter or digit", then you might prefer:

    #define ELEMENT \
        "[" LETTER "]" "[" LETTER DIGIT DASH "]*" "(?
Anthony Foiani wrote, On 2011-11-01 19:10:
"U.Mutlu"
writes: why is this regex const string sRe = "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9])\\.)+" "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))";
not matching this string wholly? "a1a.a2a.a3a.a4aaaa"
It rather matches only this part: "a1a.a2a.a3a.a4"
I think you're getting caught by "first match, not longest match". Put differently, the regex engine isn't backtracking where you think it is.
Thanks for the detailed info, it helped me a lot.
I'm also curious if you're capturing those substrings for some purpose?
Actually, it was my first attempt at writing a "regex for hostnames", because the regex I had found on the web had some bugs. I know there are RFCs for hostnames, and that nowadays hostname elements can also start with a digit, but for me a simple (i.e. the old) solution suffices.
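For the simple "old style" hostname check described here, the non-capturing simplification suggested earlier in the thread could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses std::regex (ECMAScript grammar) rather than Boost.Regex, builds the pattern with ordinary std::string concatenation instead of the preprocessor, and the function name looks_like_hostname is hypothetical. It keeps the thread's element rule, so single-character elements are rejected, and it makes no attempt to follow the hostname RFCs.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not from the thread. Returns true when `name` is
// two or more dot-separated elements, each starting with a letter, ending
// with a letter or digit, and containing only letters, digits, or dashes
// in between (so every element is at least two characters long).
// Assumption: std::regex (ECMAScript grammar) stands in for Boost.Regex.
bool looks_like_hostname(const std::string& name) {
    static const std::string element = "[a-zA-Z][a-zA-Z0-9-]*[a-zA-Z0-9]";
    static const std::regex re("(?:" + element + "\\.)+" + element);
    // regex_match succeeds only if the whole string is consumed, which has
    // the same effect as anchoring the pattern with ^ and $.
    return std::regex_match(name, re);
}
```

Because regex_match requires the entire input to match, the engine is forced to backtrack until the last element covers "a4aaaa", so looks_like_hostname("a1a.a2a.a3a.a4aaaa") should return true.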
why is this regex const string sRe = "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9])\\.)+" "((([a-zA-Z]|([a-zA-Z][a-zA-Z0-9\\-]))+[a-zA-Z0-9]))";
not matching this string wholly? "a1a.a2a.a3a.a4aaaa"
It rather matches only this part: "a1a.a2a.a3a.a4"
What's the problem here? (I know I can append a delimiter to solve the problem, but I think the regex should have matched it wholly, shouldn't it?)
No, in Perl mode, early alternatives are preferred to later ones, so given:

    a|aa

against:

    aaaa

will only match the first "a". This is what's happening in your expression.

HTH, John.
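The a|aa case above can be reproduced with a short sketch. As an assumption, it uses std::regex with the default ECMAScript grammar instead of Boost.Regex in Perl mode; ECMAScript alternation likewise tries the left alternative first. The helper names alt_search and alt_full_match are invented for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text of the first match found by an unanchored
// search, or an empty string when nothing matches.
// Assumption: std::regex (ECMAScript grammar) stands in for Boost.Regex.
std::string alt_search(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

// Hypothetical helper: true when the pattern matches the entire text.
bool alt_full_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

Here alt_search("a|aa", "aaaa") should yield "a", while swapping the alternatives, alt_search("aa|a", "aaaa"), should yield "aa". When the whole input must match, as in alt_full_match("a|aa", "aa"), the engine backtracks into the second alternative and the call should return true.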
participants (4)
- Anthony Foiani
- John Maddock
- Júlio Hoffimann
- U.Mutlu