I didn't want to take up your time, so I didn't tell you what my real problem is. But you have been so kind that I can't resist any longer.

I've built a parser generator IDE in which the user defines individual tokens as regular expressions. These expressions are automatically combined into one larger expression, which is then used as the scanner. For this to work, it must be possible to determine which token matched from which of the scanner's sub-expressions matched. In addition, the sub-expressions of the matching token should be easy to access. So I put every alternative token in parentheses. Example:

    Integer    ::= \d+
    Real       ::= (\d+\.\d*|\.\d+)([eE][+-]*\d+)?
    Identifier ::= \w+

    Scanner    ::= (\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+)

or, to get the token at the current position in the input:

    Scanner    ::= \A((\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+))

While the scanner expression is constructed, a vector (m_vSymbols) is filled with the numbers of the sub-expressions that represent the tokens (2, 3, 6), together with the corresponding symbols. (These numbers are calculated by means of mark_count().) The matching token is then found by

    for(t = m_vSymbols.begin(); t != tEnd; ++t)
        if(xMatch[t->first].matched)
            return *t;

The sub-expressions of the matching token can be accessed much as in an isolated regular expression, by adding the offset of the token's enclosing sub-expression: Token.str(i) == Scanner.str(i + offset).

All of this happens behind the scenes and has worked fine for a long time. So you can imagine what a shock it was to discover that I had interpreted the leftmost-longest rule the way I wanted it to be. I never stumbled over this error because two alternatives matching with the same length seems to be a rare case.
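To make the bookkeeping described above concrete, here is a minimal sketch. It is not the IDE's actual code: it uses std::regex rather than Boost.Regex, and the names Scanner, Token, build_scanner and next_token are hypothetical. Because the default ECMAScript grammar of std::regex has no \A, the sketch anchors the match with the match_continuous flag instead, but it keeps the outer group so that the token groups come out as 2, 3 and 6, as in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the scanner plus its m_vSymbols vector.
struct Scanner {
    std::regex expr;
    // (group number in the combined expression, token name)
    std::vector<std::pair<std::size_t, std::string>> symbols;
};

// Hypothetical result type: the token name, its text, and the offset to add
// to a token-local sub-expression index to get the scanner's group index.
struct Token {
    std::string name;
    std::string text;
    std::size_t offset;
};

// Combine (name, pattern) pairs into one expression "((p1)|(p2)|...)".
Scanner build_scanner(
    const std::vector<std::pair<std::string, std::string>>& tokens) {
    Scanner s;
    std::string combined = "(";   // outer group is group 1
    std::size_t next_group = 2;   // so the first token's group is group 2
    for (const auto& tok : tokens) {
        if (next_group != 2) combined += '|';
        combined += '(' + tok.second + ')';
        s.symbols.emplace_back(next_group, tok.first);
        // Skip the token's own wrapper group and every group inside it.
        next_group += 1 + std::regex(tok.second).mark_count();
    }
    combined += ')';
    s.expr = std::regex(combined);
    return s;
}

// Match one token at the start of `input`; `m` keeps the raw match so the
// caller can read sub-expressions as m[i + token.offset].
bool next_token(const Scanner& s, const std::string& input,
                Token& out, std::smatch& m) {
    if (!std::regex_search(input, m, s.expr,
                           std::regex_constants::match_continuous))
        return false;
    for (const auto& sym : s.symbols) {
        if (m[sym.first].matched) {
            out.name = sym.second;
            out.text = m[sym.first].str();
            out.offset = sym.first;
            return true;
        }
    }
    return false;
}
```

With the three token definitions from the example, matching ".5e3" should report a Real token with offset 3, so the token's first sub-expression is found at m[1 + 3] and its second at m[2 + 3]. Note that std::regex in ECMAScript mode tries alternatives in order rather than applying the leftmost-longest rule, so the sketch only illustrates the index arithmetic, not the POSIX matching behaviour.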
Nope, you need to either:
Use the Perl-compatible expressions and place the alternatives in the order you want them searched,
POSIX regular expressions are preferable for a parser because the leftmost-longest rule helps to resolve conflicts. For example, a label can easily be distinguished from an ordinary identifier:

    \w+|\w+\s*:

Perhaps I will provide an option to use Perl regexes in the future.
or
Use POSIX expressions, and either put brackets around all the alternatives *and* put them in the order you want, or leave the brackets off those of lower priority and put them around those of higher priority.
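The effect of ordering under Perl-style matching can be sketched as follows. This is only an illustration: it uses the ECMAScript grammar of std::regex as a stand-in for Perl-compatible expressions, since both try alternatives from left to right. The names match_prefix, kIntegerFirst and kRealFirst are hypothetical.

```cpp
#include <regex>
#include <string>

// The scanner from the example, with Integer listed before Real.
const char* const kIntegerFirst =
    R"((\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+))";

// The same tokens, with Real moved in front of Integer.
const char* const kRealFirst =
    R"(((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\d+)|(\w+))";

// Return the text matched at the very start of `input`, or "" if the
// pattern does not match there.
std::string match_prefix(const std::string& pattern,
                         const std::string& input) {
    std::regex re(pattern);  // default grammar: ECMAScript, ordered alternation
    std::smatch m;
    if (std::regex_search(input, m, re,
                          std::regex_constants::match_continuous))
        return m.str();
    return "";
}
```

With Integer listed first, the input "1.5" is cut off after "1", because the first alternative already succeeds. With Real listed first, the whole "1.5" is matched, while "42" still falls through to Integer.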
Yes, the first suggestion seems feasible: I could simply arrange the
tokens in reverse order of their mark_count. This should have
exactly the intended result. If I am right, this technique could be
used for an extension of the regex library. Something like:
template