
Since I get the mailing list in a digest, it looks like I have a lot of questions to answer. Sorry for the long(er) post, but I'm answering them all in one go.

Darren Cook wrote:
There seems to be some interest currently in alternatives/improvements to Spirit, so I thought I'd mention Hapy: http://hapy.sourceforge.net/
I'll have a look at Hapy and see whether we could interface to that as well.

Joel de Guzman wrote:
Some more things to note:
* why use lexeme_d on the lexer stage?
* the comment rule can be rewritten as: '#' >> *(anychar_p - eol_p) >> eol_p;
* stringLiteral can be rewritten similarly.
* word can be rewritten using chsets.
I used lexeme_d because the VRML grammar uses whitespace to separate ints and floats. Without lexeme_d, the skip parser also runs between the individual digits, so two adjacent ints could merge into one. The other optimisations are all valid, but unfortunately make no difference to the stopwatch: comments, stringLiterals and words don't occur very often in a VRML file - most of the time is spent parsing floats and ints.
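To illustrate the merging problem, here is a cut-down sketch (not the real VRML grammar - just classic Spirit's free parse() with a function-pointer action):

    #include <boost/spirit/core.hpp>
    #include <string>
    #include <vector>
    using namespace boost::spirit;

    std::vector<std::string> tokens;
    void collect(char const* first, char const* last)
    {
        tokens.push_back(std::string(first, last));
    }

    int main()
    {
        // lexeme_d suspends the skipper inside the directive, so
        // each run of digits becomes its own token: {"12", "34"}.
        parse("12 34", *lexeme_d[+digit_p][&collect], space_p);

        // Without lexeme_d the skipper also runs between digits,
        // so the whole input is swallowed as a single match.
        tokens.clear();
        parse("12 34", *(+digit_p)[&collect], space_p);
        return 0;
    }

Joel de Guzman wrote: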
Sure, no problem. But watch out, Spirit-2 will have a speed improvement in the 10X-20X range over classic Spirit ;-) I already have some test cases, and this does not touch on predictive parsing and lexing.
Joel, I would be really keen to try running my VRML test case through Spirit-2. Given that I've now got the Spirit-1 parser running at about 6-7 seconds, if you get a 10x speed improvement on this file, then it could make the case for separate lexing much harder to support! I may also need a bigger file to do test cases on - millisecond test timings don't make for accuracy in my view :-)

Joel de Guzman wrote:
Anyway, as I said, I'm very interested in your lexer. Would you be interested in a collaboration?
Yes, we would be very interested in a collaboration.

Hartmut Kaiser wrote:
How do you specify production rules? Your above sample specifies how to use the recognition part only ;-) The numbers are the token IDs, I assume?
The production rules are wrapped up in the different (and extensible) concept of a token. token_float, for example, will match a float and then store the matched number as a float inside the token. The normal token stores nothing, since for something like a keyword match you don't need the overhead of keeping the original text; token_string, on the other hand, stores the string that was matched. I am contemplating an interface that lets you use a functor to perform some basic string manipulation before the token is stored - dropping the enclosing quotes, for example - although this is clearly similar to functionality already available in Spirit.

Yes, the numbers are token IDs. I have contemplated using const char * template parameters as well, but at present I haven't done that because of the added difficulty of ensuring that the names have external linkage. Perhaps in the final version we can support both.
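Roughly, the idea looks like this (a simplified sketch rather than our actual declarations - the class and member names here are made up for illustration):

    #include <cstdlib>
    #include <string>

    struct token_base
    {
        explicit token_base(int id) : id_(id) {}
        virtual ~token_base() {}
        int id() const { return id_; }
    private:
        int id_;
    };

    // A keyword match needs no payload beyond its ID.
    template <int ID>
    struct token : token_base
    {
        token() : token_base(ID) {}
    };

    // token_float converts the matched text once, at lex time,
    // and stores the value rather than the original characters.
    template <int ID>
    struct token_float : token_base
    {
        token_float(char const* first, char const* last)
            : token_base(ID),
              value_(static_cast<float>(
                  std::atof(std::string(first, last).c_str()))) {}
        float value() const { return value_; }
    private:
        float value_;
    };

    // token_string keeps the matched characters themselves.
    template <int ID>
    struct token_string : token_base
    {
        token_string(char const* first, char const* last)
            : token_base(ID), text_(first, last) {}
        std::string const& text() const { return text_; }
    private:
        std::string text_;
    };

Hartmut Kaiser also wrote: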
Is your token type a predefined one or is it possible to use my own token types instead? I'm asking because Wave in principle allows the use of any token type for its work, but it requires the tokens to expose a certain (very small) interface.
In the system as it stands, all token types must inherit from a base version. This base version provides lots of useful virtual functions that derived token types can work with, and the iterator interface returns the token_base from its de-reference function. You can define new token types, but accessing the specific interface on the derived token means doing one of the following (options 3 and 4 are sketched below):

1) Ensuring the interface is exposed in the token_base.
2) Using polymorphic despatch to get at the correct type, through a visitor for example.
3) If only a single token type is used in the lexer, using a static_cast.
4) Using a dynamic_cast.

You can also define new rule classes - at present we have written a rule class for conventional tokens, and another for tokens that have a dynamic (run-time) ID rather than the static (compile-time) ID in the template parameter.
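Using the token classes from the sketch above, options 3 and 4 would look something like this (again illustrative rather than our real interface):

    #include <cassert>

    enum { ID_FLOAT = 1 };

    // (3) When the ID pins down the concrete type, a static_cast
    //     is safe and costs nothing at run time.
    float value_of(token_base const& t)
    {
        assert(t.id() == ID_FLOAT);
        return static_cast<token_float<ID_FLOAT> const&>(t).value();
    }

    // (4) When it does not, pay for a dynamic_cast and check.
    bool try_value(token_base const& t, float& out)
    {
        if (token_float<ID_FLOAT> const* f =
                dynamic_cast<token_float<ID_FLOAT> const*>(&t))
        {
            out = f->value();
            return true;
        }
        return false;
    }

Dave Handley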