[regex] difficulties with the "Leftmost Longest" Rule
I have a difficulty to predict, which part of a regular expression will match. Example: I have a regular expression for a general HTML tag: <[^>]*> combined with an expression for the body tag:
]*)> to: (<[^>]*>)|(]*)>) This expression matches the text: <body bgcolor="white"> As both alternatives can match the input with the same length, I expected, that the repeated fouth part of the "Leftmost Longest" Rule would determine, which alternatve is chosen: 4. Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only on(e) such match possible then return it. // note the missing 'e' As the tag-expression has no sub-expression at all, the body-expression should win. Its sub-expression could match, but doesn't. It seems to me, that the sequence of the alternatives determines the match. Now I guess, that I misinterpreted 4.: its not a means to predict the matching alternative but only to find the one that matched accidentally? My software constructs lexers from elementary expressions automatically. So it's important for me to direct and predict the expected matching alternative. Are there any other rules? Does the sequence of the alternatives determine the match unmistakably? With Kind Regards, Detlef Meyer-EltzDetlef Meyer-Eltz wrote:
I have a difficulty to predict, which part of a regular expression will match.
Example: I have a regular expression for a general HTML tag: <[^>]*> combined with an expression for the body tag:
]*)>to: (<[^>]*>)|(
]*)>)This expression matches the text: <body bgcolor="white">
As both alternatives can match the input with the same length, I expected, that the repeated fouth part of the "Leftmost Longest" Rule would determine, which alternatve is chosen:
4. Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only on(e) such match possible then return it.
// note the missing 'e'
As the tag-expression has no sub-expression at all, the body-expression should win. Its sub-expression could match, but doesn't. It seems to me, that the sequence of the alternatives determines the match.
Now I guess, that I misinterpreted 4.: its not a means to predict the matching alternative but only to find the one that matched accidentally? My software constructs lexers from elementary expressions automatically. So it's important for me to direct and predict the expected matching alternative. Are there any other rules? Does the sequence of the alternatives determine the match unmistakably?
Which Boost.Regex version are you using, and how are you compiling the expression? Recent versions default to the Perl matching rules: *which do not use the leftmost longest rule*. They match based on a "first match found" rule, so if the first alternative leads to a match then subsequent alternatives are never examined. If you really want leftmost-longest semantics, then compile the expression as a POSIX extended regex, but of course then you loose the ability to use Perl-like regex extensions. HTH, John. PS, your analysis of the leftmost-longest rule looks correct however.
Thank you for your quick reply. It's just because of the leftmost-longest rule, that I'm indeed using POSIX extended style (from boost 1.33.1 with Borland CBuilder 6). To assert, that I didn't do anything wrong im my software, I now have made an extra commandline program: /////////////////////////////////////////// #pragma hdrstop #include <iostream> #include <string> #include "boost\regex.hpp" using namespace std; using namespace boost; //--------------------------------------------------------------------------- int main(int argc, char* argv[]) { // my usual option, but makes no difference here, I think regex_constants::syntax_option_type syntax_flags = regex_constants::extended & ~regex_constants::no_escape_in_lists; regex expression1("(<[^>]*>)|(
]*)>)", regex_constants::extended); regex expression2("(]*)>)|(<[^>]*>)", regex_constants::extended); string testtext = ""; std::string::const_iterator start, end; start = testtext.begin(); end = testtext.end(); match_resultsstd::string::const_iterator what; match_flag_type flags = match_default; cout << "testing: " << expression1 << endl; if(regex_search(start, end, what, expression1, flags)) { int iCount = 0; for(unsigned int i = 0 ; i < what.size() ; ++i ) cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl; } cout << endl; cout << "testing: " << expression2 << endl; if(regex_search(start, end, what, expression2, flags)) { int iCount = 0; for(unsigned int i = 0 ; i < what.size() ; ++i ) cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl; } return 0; } ///////////////////////////////////////////////////////// That's the result: testing: (<[^>]*>)|(]*)>) sub-expression0: "<body bgcolor="white">" // (<[^>]*>)|(]*)>) sub-expression1: "<body bgcolor="white">" // <[^>]*> sub-expression2: "" // ]*)> sub-expression3: "" // [^>]* testing: (]*)>)|(<[^>]*>) sub-expression0: "<body bgcolor="white">" // (]*)>)|(<[^>]*>) sub-expression1: "<body bgcolor="white">" // ]*)> sub-expression2: " bgcolor="white"" // [^>]* sub-expression3: "" // <[^>]*> I hoped, for both arrangements of the alternatives, the matching parts would be the same. I would be very happy, if I had made a mistake ;-) Best regards Detlef Meyer-Eltz -- am Montag, 18. Dezember 2006 um 18:31 schrieben Sie:Detlef Meyer-Eltz wrote:
I have a difficulty to predict, which part of a regular expression will match.
Example: I have a regular expression for a general HTML tag: <[^>]*> combined with an expression for the body tag:
]*)>to: (<[^>]*>)|(
]*)>)This expression matches the text: <body bgcolor="white">
As both alternatives can match the input with the same length, I expected, that the repeated fouth part of the "Leftmost Longest" Rule would determine, which alternatve is chosen:
4. Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only on(e) such match possible then return it.
// note the missing 'e'
As the tag-expression has no sub-expression at all, the body-expression should win. Its sub-expression could match, but doesn't. It seems to me, that the sequence of the alternatives determines the match.
Now I guess, that I misinterpreted 4.: its not a means to predict the matching alternative but only to find the one that matched accidentally? My software constructs lexers from elementary expressions automatically. So it's important for me to direct and predict the expected matching alternative. Are there any other rules? Does the sequence of the alternatives determine the match unmistakably?
Which Boost.Regex version are you using, and how are you compiling the expression?
Recent versions default to the Perl matching rules: *which do not use the leftmost longest rule*. They match based on a "first match found" rule, so if the first alternative leads to a match then subsequent alternatives are never examined.
If you really want leftmost-longest semantics, then compile the expression as a POSIX extended regex, but of course then you loose the ability to use Perl-like regex extensions.
HTH, John.
PS, your analysis of the leftmost-longest rule looks correct however.
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Detlef Meyer-Eltz wrote:
Thank you for your quick reply. It's just because of the leftmost-longest rule, that I'm indeed using POSIX extended style (from boost 1.33.1 with Borland CBuilder 6). To assert, that I didn't do anything wrong im my software, I now have made an extra commandline program:
I see the same output, here's what's happening with the leftmost longest rule: 1 Find the leftmost match, if there is only one match possible at this location then return it. There are two matches at position 0, carry on... 2) Find the longest of the possible matches, along with any ties. If there is only one such possible match then return it. Both are equal length carry on.... 3) If there are no marked sub-expressions, then all the remaining alternatives are indistinguishable; return the first of these found. Doesn't apply, carry on... 4) Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only on such match possible then return it. There is only one match with $1 matched: the first alternative. So we return it. We don't get to these: 5) Find the match which has the longest match for the first sub-expression, along with any ties. If there is only one such match then return it. 6) Repeat steps 3 and 4 for each additional marked sub-expression. If there is still more than one possible match remaining, then they are indistinguishable; return the first one found. If you change the expression to "<[^>]*>|
]*)>" The the presence of a marked sub-expression on the second alternative breaks the tie, and the second alternative is returned. I know this wasn't what you were looking for, but I hope this explanation helps! John.I know this wasn't what you were looking for, but I hope this explanation helps!
Thank you, the truth always helps. A hint for you: there is another writing mistake: "that" instead of "than" in "Often there is more that one way of matching". And the boost::regex::egrep option doesn't work. No match is found with: regex expression1("<[^>]*>\n
]*)>", boost::regex::egrep); I guess, even a working egrep option would not change anything for me. Best regards Detlef.Detlef Meyer-Eltz wrote:
I know this wasn't what you were looking for, but I hope this explanation helps!
Thank you, the truth always helps.
A hint for you: there is another writing mistake: "that" instead of "than" in "Often there is more that one way of matching".
That'll be fixed in cvs in a moment: thanks for pointing that out.
And the boost::regex::egrep option doesn't work. No match is found with:
regex expression1("<[^>]*>\n
]*)>", boost::regex::egrep);
Yikes, that's a bad one to slip through, I'm testing (the fairly trivial) fix now.
I guess, even a working egrep option would not change anything for me.
Nope, you need to either: Use the perl-compatible expressions and place the alternatives in the order you want them searched, or Use POSIX expressions, and either put brackets around all the alternatives *and* put them in the order you want. Or don't put braces around those of lower priority, and do put them around those of higher priority. HTH, John.
I didn't want to steal your time, so I didn't tell you, what really is my problem. You were too kind that I can resist any more. I've built a parser generator IDE where the user can define single tokens as regular expressions. These expressions are combined into a bigger expression automatically, which then is used as a scanner. For this, it is necessary, that the token, which matched can be calculated from the scanner sub-expressions which matched. Further the sub-expressions of the matching token should be accessible easily. So I put every alternative token into parenthesis. Example: Integer ::= \d+ Real ::= (\d+\.\d*|\.\d+)([eE][+-]*\d+)? Identifier ::= \w+ Scanner :: (\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+) or, to get the token at the actual location of the input: Scanner :: \A((\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+)) Parallel to the construction of the scanner expression a vector (m_vSymbols) is filled with the numbers of (and symbols to) the subexpressions, which are representing the tokens (2,3,6). (These numbers are calculated by means of mark_count()). You now get the matching token by for(t = m_vSymbols.begin(); t != tEnd; ++t) if(xMatch[t->first].matched) return *t; The sub-expressions of the matching token can be accessed similar as in an isolated regular expression by adding the offset of the whole sub-expression: Token.str(i) == Scanner.str(i + offset). This goes behind the scene and worked fine for a long time. Now you can imagine, that it is a shock for me, to discover, that I misinterpreted the leftmost longest rule in the manner I liked. I didn't stumble over this error, because the matching of two alternatives with the same length seems to be a rare case.
Nope, you need to either:
Use the perl-compatible expressions and place the alternatives in the order you want them searched,
POSIX regular expressions are preferable for a parser, because the leftmost longest rule helps to solve conflicts. For example a label can be distinguished from an ordinary identifier easily: \w+|\w+\s*: Perhaps I will provide the option in the future to use Perl regexs.
or
Use POSIX expressions, and either put brackets around all the alternatives *and* put them in the order you want. Or don't put braces around those of lower priority, and do put them around those of higher priority.
Yes, the first suggestion seems feasible: I simply could arrange the
tokens in the reversed order of their mark_count. This should have
exactly the intended result. If I am right, this technique could be
used for an extension of the regex library. Something like:
template
Detlef Meyer-Eltz wrote:
Now you can imagine, that it is a shock for me, to discover, that I misinterpreted the leftmost longest rule in the manner I liked. I didn't stumble over this error, because the matching of two alternatives with the same length seems to be a rare case.
Nod.
void add_symbol(const charT* p, symbol_type s);
I don't know, whether there is a chance to write code for such an addition to a lexregex which is already compiled. Otherwise such a lexregex had to be compiled in an extra step before use. In this form I could make it on top of the existing regex class.
I think you would have to recompile the whole regex in order to add an arbitrary extension to the expression.
In this context there are two other points I'm interested in:
In my parser generator there is a preference for literal tokens already (they aren't treated as regular expressions but by a ternary tree), and I have a vague idea, that generally a token should be preferred the more, the more literally it is. In your documentation you mention some experimental non-member comparison operators. What is the idea behind these comparisions? Could they be used, to define preferences?
The comparison operators aren't used anywhere to determine matching: they can be used by the user to compare the result to a specific string for example.
I guess, that testing one token after the other would be much more expensive, than testing them together. All the more as there is a special feature of my parser generator not only to test for tokens at the actual location in the input as to look for the next location, where one of several tokens occur. Can you tell me something about these differences of costs?
It's likely to be more expensive yes: "impossible" branches in the state machine get eliminated quite quickly in the regex internals during matching, where as the "one expression at a time" approach necessarily tests all the candidates. The difference would depend very much upon the particular expressions though. HTH, John.
participants (2)
-
Detlef Meyer-Eltz
-
John Maddock