Re: [Boost-users] [regex] difficulties with the "LeftmostLongest"Rule

21 Dec 2006

I didn't want to take up your time, so I didn't tell you what my real
problem is. You have been so kind that I can't hold back any longer.

I've built a parser generator IDE in which the user can define single
tokens as regular expressions. These expressions are automatically
combined into a bigger expression, which is then used as a scanner. For
this to work, it must be possible to determine which token matched from
the scanner sub-expressions that matched. In addition, the
sub-expressions of the matching token should be easily accessible. So
I put every alternative token into parentheses.

Example:

Integer    ::= \d+
Real       ::= (\d+\.\d*|\.\d+)([eE][+-]*\d+)?
Identifier ::= \w+

Scanner    ::= (\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+)

or, to get the token at the current location of the input:

Scanner    ::= \A((\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+))

In parallel with the construction of the scanner expression, a vector
(m_vSymbols) is filled with the numbers of the sub-expressions that
represent the tokens (2, 3, 6), together with their symbols. (These
numbers are calculated by means of mark_count().) You then get the
matching token by

for(t = m_vSymbols.begin(); t != tEnd; ++t)
  if(xMatch[t->first].matched)
    return *t;

The sub-expressions of the matching token can be accessed much as in
an isolated regular expression, by adding the offset of the token's
enclosing sub-expression: Token.str(i) == Scanner.str(i + offset). This
happens behind the scenes and has worked fine for a long time.
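
The bookkeeping described above can be sketched roughly as follows. This
is only an illustration, not the code of my IDE: std::regex (ECMAScript
syntax) stands in for Boost.Regex so that the example is self-contained,
match_continuous replaces the leading \A, and the names TokenDef, Symbol,
Scanner, build_scanner, match_token and token_str are made up for the
example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical types; std::regex is an assumption standing in for Boost.Regex.
struct TokenDef { std::string name; std::string pattern; };
struct Symbol   { std::size_t group; std::string name; };

struct Scanner {
    std::regex re;
    std::vector<Symbol> symbols;   // plays the role of m_vSymbols
};

// Combine the token patterns into ((t1)|(t2)|...) and record, for every
// token, the number of the sub-expression that encloses it.
Scanner build_scanner(const std::vector<TokenDef>& defs) {
    Scanner s;
    std::string combined = "(";
    std::size_t group = 2;         // group 1 is the enclosing parenthesis
    for (std::size_t i = 0; i < defs.size(); ++i) {
        if (i != 0) combined += "|";
        combined += "(" + defs[i].pattern + ")";
        s.symbols.push_back({group, defs[i].name});
        // Skip the token's own group plus the groups nested inside it.
        group += 1 + std::regex(defs[i].pattern).mark_count();
    }
    combined += ")";
    s.re.assign(combined);
    return s;
}

// Return the symbol whose group matched at the start of input, or nullptr.
const Symbol* match_token(const Scanner& s, const std::string& input,
                          std::smatch& m) {
    if (!std::regex_search(input, m, s.re,
                           std::regex_constants::match_continuous))
        return nullptr;
    for (const Symbol& sym : s.symbols)
        if (m[sym.group].matched) return &sym;
    return nullptr;
}

// Sub-expression i of the matched token: Scanner.str(i + offset).
std::string token_str(const std::smatch& m, const Symbol& sym, std::size_t i) {
    return m.str(sym.group + i);
}
```

For the three tokens of the example, the recorded group numbers come out
as 2, 3 and 6. Note that std::regex in ECMAScript mode takes the first
alternative that matches, so this sketch shows only the bookkeeping, not
the leftmost-longest behaviour.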

You can imagine what a shock it was for me to discover that I had
interpreted the leftmost-longest rule the way I wanted it to be. I
hadn't stumbled over this error before, because two alternatives
matching with the same length seems to be a rare case.
...
> Nope, you need to either:
...
> Use the perl-compatible expressions and place the alternatives in
> the order you want them searched,
POSIX regular expressions are preferable for a parser, because the
leftmost-longest rule helps to resolve conflicts. For example, a label
can easily be distinguished from an ordinary identifier: \w+|\w+\s*:
Perhaps I will provide an option to use Perl regexes in the future.
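
To make the label example concrete, here is a small sketch. It again
uses std::regex in ECMAScript mode as a stand-in (an assumption, not
Boost.Regex), where alternation is leftmost-first as in Perl. The longest
rule is only emulated by trying each alternative separately and keeping
the longest match; match_length and longest_alternative are hypothetical
helper names.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Length of the match of `pattern` at the start of `input`, or -1 if
// there is none. ECMAScript alternation stops at the first alternative
// that matches.
long match_length(const std::string& pattern, const std::string& input) {
    std::smatch m;
    std::regex re(pattern);
    if (!std::regex_search(input, m, re,
                           std::regex_constants::match_continuous))
        return -1;
    return static_cast<long>(m.length(0));
}

// Emulated longest rule: try every alternative on its own and keep the
// longest match; on equal length the earlier alternative wins.
// Returns the index of the winning alternative, or -1 if none matches.
int longest_alternative(const std::vector<std::string>& alts,
                        const std::string& input) {
    int best = -1;
    long best_len = -1;
    for (std::size_t i = 0; i < alts.size(); ++i) {
        long len = match_length(alts[i], input);
        if (len > best_len) {
            best = static_cast<int>(i);
            best_len = len;
        }
    }
    return best;
}
```

On the input "start: x" the combined Perl-style expression \w+|\w+\s*:
matches only "start", while the emulated longest rule selects the second
alternative, i.e. the label.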
...
> or
...
> Use POSIX expressions, and either put brackets around all the
> alternatives *and* put them in the order you want. Or don't put
> braces around those of lower priority, and do put them around those
> of higher priority.
Yes, the first suggestion seems feasible: I could simply arrange the
tokens in reverse order of their mark_count. This should have
exactly the intended result. If I am right, this technique could be
used for an extension of the regex library, something like:

template <class symbol_type, class charT, class traits = regex_traits<charT> >
class lexregex {
...
void add_symbol(const charT* p, symbol_type s);

I don't know whether it is possible to write code that adds a symbol
to a lexregex which is already compiled. If not, such a lexregex
would have to be compiled in an extra step before use. In that form
I could build it on top of the existing regex class.

A corresponding lexmatch_results would be needed too, with

string_type str(int sub = 0) const;  // sub-expressions of the matching token
symbol_type symbol() const;

In this context there are two other points I'm interested in:

In my parser generator there is already a preference for literal
tokens (they aren't treated as regular expressions but handled by a
ternary tree), and I have a vague idea that, in general, the more
literal a token is, the more it should be preferred. In your
documentation you mention some experimental non-member comparison
operators. What is the idea behind these comparisons? Could they be
used to define preferences?

I guess that testing one token after the other would be much more
expensive than testing them together, all the more so because a
special feature of my parser generator is that it can not only test
for tokens at the current location in the input but also look for the
next location where one of several tokens occurs. Can you tell me
something about these differences in cost?
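
To make clear which two approaches I am comparing, here is a rough
sketch of both. It measures nothing and says nothing about which one is
faster. std::regex is again only a stand-in for Boost.Regex, the names
Hit, find_next_combined and find_next_separate are hypothetical, and for
simplicity the token patterns are assumed to contain no capture groups
of their own.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct Hit { std::ptrdiff_t pos; int token; };  // token == -1: nothing found

// One search with all tokens combined as (t0)|(t1)|...
Hit find_next_combined(const std::vector<std::string>& tokens,
                       const std::string& input) {
    std::string combined;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i != 0) combined += "|";
        combined += "(" + tokens[i] + ")";
    }
    std::regex re(combined);
    std::smatch m;
    if (!std::regex_search(input, m, re)) return {-1, -1};
    for (std::size_t i = 0; i < tokens.size(); ++i)
        if (m[i + 1].matched) return {m.position(0), static_cast<int>(i)};
    return {-1, -1};
}

// One search per token, keeping the earliest hit; on equal positions
// the token that was tried first is kept.
Hit find_next_separate(const std::vector<std::string>& tokens,
                       const std::string& input) {
    Hit best{-1, -1};
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        std::regex re(tokens[i]);
        std::smatch m;
        if (!std::regex_search(input, m, re)) continue;
        if (best.token == -1 || m.position(0) < best.pos)
            best = {m.position(0), static_cast<int>(i)};
    }
    return best;
}
```

The first function scans the input once, while the second scans it once
per token, which is the reason for my guess about the costs.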

I am very interested to know your opinion about my ideas.
(It isn't urgent.)

Best Regards

Detlef

Detlef Meyer-Eltz