[boost] Interest in a fast lexical analyser compatible

28 Dec 2004

      Hartmut Kaiser wrote:
...
I'd be willing to write the interfacing stub to plug your library into 
Wave.
Thanks - I think this should be relatively easy because we currently use a 
forward iterator across the tokens in order to generate suitable output for 
Spirit.  One requirement for the Boost community to accept this library 
will probably be meeting the Standard requirements for forward iterators, 
and I note that this is one of your key requirements in Wave.
...
The two different lexers I was using in the Wave library were a re2c based
lexer (static switch based lexer, extremely fast and compact) and a SLEX
based lexer (runtime generated DFA). I haven't done serious speed
measurements, but the numbers I've got so far showed similar timings for
both with a slight advantage for re2c (as expected). I'd expect your library
to be very similar in speed as well. But just out of curiosity I'm very
interested in seeing the static DFA generation version :-)
Can these lexers be effectively used with any Spirit grammar?  I'll download 
Wave over the next few days and have a look at how much overlap there is 
between our library and the lexers within Wave.

My plan for the static DFA version (which we haven't fully discussed 
internally yet, so it may change significantly) is to use the memento 
pattern to capture the internal state of the DFA, along with the production 
rules.  These could then be saved or loaded, or serialised to a C++ file, in 
much the same way as flex.  This would make the lextl classes both an inline 
and an offline tool.  In principle this gives some very desirable 
advantages:

1)    The API remains the same in both compile-time and run-time versions.
2)    You can very easily swap between run-time and compile-time versions - 
for example during development you may use run-time creation to speed up the 
process of developing a complex grammar, then switch to compile-time for 
your production release.
3)    You have a number of options for compile-time versions.  The grammar 
can be in a configuration file, or compiled into the program directly.
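
To make the memento idea a bit more concrete, here is a minimal sketch of 
how a DFA's tables might be captured, written out and restored.  This is 
not lextl code: dfa, dfa_memento, write_memento, read_memento and to_text 
are all names invented for illustration, and the plain-text format is just 
an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical memento: a plain snapshot of a DFA's tables.
// None of these names come from lextl; they are illustration only.
struct dfa_memento {
    std::size_t num_states;
    std::size_t num_symbols;
    std::vector<int> transitions;  // row-major; -1 means "no transition"
    std::vector<int> accept;       // token id per state; 0 means "not accepting"
};

class dfa {
public:
    dfa(std::size_t states, std::size_t symbols)
        : states_(states), symbols_(symbols),
          table_(states * symbols, -1), accept_(states, 0) {}

    void set_transition(std::size_t from, std::size_t symbol, int to) {
        assert(from < states_ && symbol < symbols_);
        table_[from * symbols_ + symbol] = to;
    }
    void set_accept(std::size_t state, int token_id) {
        assert(state < states_);
        accept_[state] = token_id;
    }
    int next(std::size_t from, std::size_t symbol) const {
        assert(from < states_ && symbol < symbols_);
        return table_[from * symbols_ + symbol];
    }
    int accept(std::size_t state) const {
        assert(state < states_);
        return accept_[state];
    }

    // Capture the internal state as a memento.
    dfa_memento save() const {
        dfa_memento m = {states_, symbols_, table_, accept_};
        return m;
    }

    // Rebuild a DFA from a previously captured memento.
    static dfa restore(const dfa_memento& m) {
        assert(m.transitions.size() == m.num_states * m.num_symbols);
        assert(m.accept.size() == m.num_states);
        dfa d(m.num_states, m.num_symbols);
        d.table_ = m.transitions;
        d.accept_ = m.accept;
        return d;
    }

private:
    std::size_t states_, symbols_;
    std::vector<int> table_;
    std::vector<int> accept_;
};

// Write the memento as whitespace-separated text (an assumed format).
void write_memento(std::ostream& os, const dfa_memento& m) {
    os << m.num_states << ' ' << m.num_symbols << '\n';
    for (std::size_t i = 0; i < m.transitions.size(); ++i) os << m.transitions[i] << ' ';
    os << '\n';
    for (std::size_t i = 0; i < m.accept.size(); ++i) os << m.accept[i] << ' ';
    os << '\n';
}

// Read back a memento produced by write_memento.
dfa_memento read_memento(std::istream& is) {
    dfa_memento m = dfa_memento();
    is >> m.num_states >> m.num_symbols;
    m.transitions.resize(m.num_states * m.num_symbols);
    for (std::size_t i = 0; i < m.transitions.size(); ++i) is >> m.transitions[i];
    m.accept.resize(m.num_states);
    for (std::size_t i = 0; i < m.accept.size(); ++i) is >> m.accept[i];
    return m;
}

// Convenience wrapper: the same text as a string.
std::string to_text(const dfa_memento& m) {
    std::ostringstream os;
    write_memento(os, m);
    return os.str();
}
```

The point of the sketch is that code using dfa::next does not care whether 
the tables were built at run time or restored from a saved memento, which 
is what would let the API stay the same in both modes.  Emitting the same 
data as C++ array initialisers instead of text would give the "serialised 
to a C++ file" variant.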
...
at least you could have used the symbol parser ... <snip>
I've changed the Spirit grammar to use the symbol parser, and dropped the 
time from 40-50 seconds to about 6-7 seconds.  Thanks for the tip.
...
Is there any documentation available?
No, I'm sorry, we don't have any documentation at present, because we are 
still only a little beyond a prototyping stage.  We still have to refine the 
API a little, and complete some optimisation work before we publish an early 
Alpha.  As I said in my original post, I'm trying to gauge interest at the 
moment, and I'm pleased to see that we are generating a little interest.

As a flavour, the current API would result in some code looking like:

syntax<> mySyntax;
mySyntax.add_rule( rule<token<1> >( "Group" ) );
mySyntax.add_rule( rule<token<2> >( "Separator" ) );
mySyntax.add_rule( rule<token<3> >( "Switch" ) );
// Lots more symbols for the rest of VRML...
mySyntax.add_rule( rule<token_float<69> >( 
"([-+]?((([0-9]+[.]?)|([0-9]*[.][0-9]+))([eE][-+]?[0-9]+)?))" ) );
mySyntax.add_rule( rule<token_string<70> >( "#[^\\n\\r]*[\\n\\r]" ) );
mySyntax.add_rule( rule<token_string<71> >( "\"[^\"\\n\\r]*[\"\\n\\r]" ) );
mySyntax.add_rule( rule<token_string<72> >( "[a-zA-Z_][a-zA-Z0-9_]*" ) );
mySyntax.add_rule( rule<token<73> >( "[ \\t\\n\\r]+" ) );
lexer<> myLexer( mySyntax );
std::ifstream fsp( "test.vrml" );
myLexer.set_source( fsp ); // Also it will have a char iterator interface
lexer<>::token_iterator iter = myLexer.begin();
for ( ; iter != myLexer.end(); ++iter )
    // Do something with the tokens...
    ;

Essentially you create a syntax to which you add rules.  Each rule names 
the token type it will create and the regular expression to be matched.  
Special token types exist that will automatically generate a float, int or 
string from the matched text.  You then construct a lexer from the syntax, 
set a suitable source input, and iterate over the tokens.  The tokens are 
polymorphic, so you can easily access the type and data.  They are also 
statically typed, so you can use types to match the tokens, or dispatch 
tokens to functions through a visitor framework.  You can use dynamically 
typed tokens if you want, although you lose much of the benefit of strong 
typing.  I especially like the idea of using visitors to distinguish between 
tokens.  In my example of a VRML file, the long float lists can be handled 
with a visitor that contains visit functions for a float token, a comma 
token (note that in vector lists in VRML the elements of a 3D vector are 
separated by whitespace, and the 3D vectors themselves are separated by 
commas) and a list termination token (e.g. a "]").  The visitor would then 
automatically store the floats in a suitable data structure, such as a list 
of 3D vectors.
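
As a rough illustration of the visitor idea, the sketch below collects 
float tokens into a list of 3D vectors.  It does not use the real token 
classes: token, float_token, comma_token, list_end_token, token_visitor, 
vec3 and vec3_list_builder are all hypothetical names, and the handling of 
a final vector with no trailing comma is my own assumption.

```cpp
#include <cassert>
#include <vector>

struct float_token;
struct comma_token;
struct list_end_token;

// Hypothetical visitor interface: one visit function per token type.
struct token_visitor {
    virtual ~token_visitor() {}
    virtual void visit(const float_token&) = 0;
    virtual void visit(const comma_token&) = 0;
    virtual void visit(const list_end_token&) = 0;
};

// Hypothetical polymorphic token base class.
struct token {
    virtual ~token() {}
    virtual void accept(token_visitor& v) const = 0;
};

struct float_token : token {
    explicit float_token(float v) : value(v) {}
    void accept(token_visitor& v) const { v.visit(*this); }
    float value;
};

struct comma_token : token {
    void accept(token_visitor& v) const { v.visit(*this); }
};

struct list_end_token : token {
    void accept(token_visitor& v) const { v.visit(*this); }
};

struct vec3 { float x, y, z; };

// Collects floats three at a time into a list of 3D vectors.
class vec3_list_builder : public token_visitor {
public:
    vec3_list_builder() : done_(false) {}

    void visit(const float_token& t) {
        assert(!done_ && pending_.size() < 3);
        pending_.push_back(t.value);
    }
    // A comma closes the current vector.
    void visit(const comma_token&) { flush(); }
    // The list terminator closes a final vector that had no trailing comma.
    void visit(const list_end_token&) {
        if (!pending_.empty()) flush();
        done_ = true;
    }

    const std::vector<vec3>& result() const { return result_; }
    bool done() const { return done_; }

private:
    void flush() {
        assert(pending_.size() == 3);
        vec3 v = {pending_[0], pending_[1], pending_[2]};
        result_.push_back(v);
        pending_.clear();
    }

    std::vector<float> pending_;
    std::vector<vec3> result_;
    bool done_;
};
```

With something like this, the loop over the token iterator would reduce to 
calling accept on each token with the builder, and the parser proper never 
needs to look at the individual floats.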

Dave

Dave Handley