
I didn't want to steal your time, so I didn't tell you, what really is my problem. You were too kind that I can resist any more. I've built a parser generator IDE where the user can define single tokens as regular expressions. These expressions are combined into a bigger expression automatically, which then is used as a scanner. For this, it is necessary, that the token, which matched can be calculated from the scanner sub-expressions which matched. Further the sub-expressions of the matching token should be accessible easily. So I put every alternative token into parenthesis. Example: Integer ::= \d+ Real ::= (\d+\.\d*|\.\d+)([eE][+-]*\d+)? Identifier ::= \w+ Scanner :: (\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+) or, to get the token at the actual location of the input: Scanner :: \A((\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+)) Parallel to the construction of the scanner expression a vector (m_vSymbols) is filled with the numbers of (and symbols to) the subexpressions, which are representing the tokens (2,3,6). (These numbers are calculated by means of mark_count()). You now get the matching token by for(t = m_vSymbols.begin(); t != tEnd; ++t) if(xMatch[t->first].matched) return *t; The sub-expressions of the matching token can be accessed similar as in an isolated regular expression by adding the offset of the whole sub-expression: Token.str(i) == Scanner.str(i + offset). This goes behind the scene and worked fine for a long time. Now you can imagine, that it is a shock for me, to discover, that I misinterpreted the leftmost longest rule in the manner I liked. I didn't stumble over this error, because the matching of two alternatives with the same length seems to be a rare case.
Nope, you need to either:
Use the perl-compatible expressions and place the alternatives in the order you want them searched,
POSIX regular expressions are preferable for a parser, because the leftmost longest rule helps to solve conflicts. For example a label can be distinguished from an ordinary identifier easily: \w+|\w+\s*: Perhaps I will provide the option in the future to use Perl regexs.
or
Use POSIX expressions, and either put brackets around all the alternatives *and* put them in the order you want. Or don't put braces around those of lower priority, and do put them around those of higher priority.
Yes, the first suggestion seems feasible: I simply could arrange the tokens in the reversed order of their mark_count. This should have exactly the intended result. If I am right, this technique could be used for an extension of the regex library. Something like: template <class symbol_type, class charT, class traits = regex_traits<charT> > class lexregex { ... void add_symbol(const charT* p, symbol_type s); I don't know, whether there is a chance to write code for such an addition to a lexregex which is already compiled. Otherwise such a lexregex had to be compiled in an extra step before use. In this form I could make it on top of the existing regex class. An according lexmatch_results is needed too, with string_type str(int sub = 0) const; // returning the subexpressions of the matching token symbol_type symbol() const; In this context there are two other points I'm interested in: In my parser generator there is a preference for literal tokens already (they aren't treated as regular expressions but by a ternary tree), and I have a vague idea, that generally a token should be preferred the more, the more literally it is. In your documentation you mention some experimental non-member comparison operators. What is the idea behind these comparisions? Could they be used, to define preferences? I guess, that testing one token after the other would be much more expensive, than testing them together. All the more as there is a special feature of my parser generator not only to test for tokens at the actual location in the input as to look for the next location, where one of several tokens occur. Can you tell me something about these differences of costs? I am very interested to know your opinion about my ideas. (It isn't urgent.) Best Regards Detlef