Interest in a fast lexical analyser compatible with Spirit

Hartmut Kaiser wrote:
I'd be willing to write the interfacing stub to plug your library into Wave.
Thanks - I think this should be relatively easy, because we currently use a forward iterator across the tokens in order to generate suitable output for Spirit. One of the requirements for the Boost community to accept this library will almost certainly be meeting the Standard requirements for forward iterators - and I note that this is one of your key requirements in Wave.
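(As an aside for readers: the shape under discussion is roughly the following - a minimal sketch of a Standard forward iterator over tokens, with purely illustrative names, not the library's actual API:)

    #include <cstddef>
    #include <iterator>

    struct token;  // hypothetical token type

    // Minimal shape of a Standard forward iterator over tokens.
    // Beyond these members, a forward iterator must also be default-
    // constructible and allow multiple passes over the same sequence.
    class token_iterator
    {
    public:
        typedef std::forward_iterator_tag iterator_category;
        typedef token                     value_type;
        typedef std::ptrdiff_t            difference_type;
        typedef token const*              pointer;
        typedef token const&              reference;

        token_iterator();                     // "end" iterator
        reference operator*() const;          // current token
        pointer operator->() const;
        token_iterator& operator++();         // advance to next token
        token_iterator operator++(int);
        bool operator==(token_iterator const& rhs) const;
        bool operator!=(token_iterator const& rhs) const;
    };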
The two different lexers I was using in the Wave library were a re2c based lexer (a static switch based lexer, extremely fast and compact) and a SLEX based lexer (a runtime generated DFA). I haven't done serious speed measurements, but the numbers I've got so far showed similar timings for both, with a slight advantage for re2c (as expected). I'd expect your library to be very similar in speed as well. But just out of curiosity I'm very interested in seeing the static DFA generation version :-)
Can these lexers be effectively used with any Spirit grammar? I'll download Wave over the next few days and have a look at how much overlap there is between our library and the lexers within Wave.

My plan for the static DFA version (which we haven't fully discussed in our internal design discussions yet, so it may change significantly) is to use the memento pattern to capture the internal state of the DFA, along with the production rules. These could then be saved, loaded, or serialised to a C++ file, in a rather similar way to flex (a rough sketch of the idea follows the list below). This would mean that the lextl classes would be both an inline and an offline tool. In principle this gives some very desirable advantages:

1) The API remains the same in both compile-time and run-time versions.
2) You can very easily swap between run-time and compile-time versions - for example, during development you may use run-time creation to speed up the process of developing a complex grammar, then switch to compile-time for your production release.
3) You have a number of options for compile-time versions. The grammar can be in a configuration file, or compiled into the program directly.
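(To make the memento idea concrete, here is a rough sketch - entirely my own illustration with hypothetical names, not the planned lextl API:)

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Hypothetical DFA whose internal state can be externalised as a
    // memento: save it, reload it, or emit it as C++ source offline.
    class dfa
    {
        std::size_t num_states_;
        std::vector<std::size_t> table_;  // flattened transition table
    public:
        // Capture the internal state to a stream.
        void save(std::ostream& os) const
        {
            os << num_states_ << ' ' << table_.size();
            for (std::size_t i = 0; i < table_.size(); ++i)
                os << ' ' << table_[i];
        }

        // Restore a previously captured state, skipping the expensive
        // regex-to-DFA construction entirely.
        void load(std::istream& is)
        {
            std::size_t size = 0;
            is >> num_states_ >> size;
            table_.resize(size);
            for (std::size_t i = 0; i < size; ++i)
                is >> table_[i];
        }

        // A third back end could write the same data out as static
        // const arrays in a .cpp file - the flex-like offline mode.
    };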
at least you could have used the symbol parser ... <snip>
I've changed the Spirit grammar to use the symbol parser, and dropped the time from 40-50 seconds to about 6-7 seconds. Thanks for the tip.
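(For anyone following along, the change was of roughly this shape. Classic Spirit's symbols<> matches all the keywords in a single pass over the input rather than trying a long chain of str_p alternatives - this is a from-memory sketch, not the exact grammar:)

    #include <boost/spirit/symbols.hpp>

    // One symbol table replaces dozens of str_p("...") alternatives;
    // lookup cost no longer grows with the number of keywords tried.
    boost::spirit::symbols<int> vrml_keywords;

    void init_keywords()
    {
        vrml_keywords.add
            ("Group",     1)
            ("Separator", 2)
            ("Switch",    3)
            // ... the rest of the VRML keywords
            ;
    }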
Is there any documentation available?
No, I'm sorry, we don't have any documentation at present, because we are still only a little beyond the prototyping stage. We still have to refine the API a little, and complete some optimisation work, before we publish an early alpha. As I said in my original post, I'm trying to gauge interest at the moment, and I'm pleased to see that we are generating a little interest.

As a flavour, the current API would result in some code looking like:

    syntax<> mySyntax;
    mySyntax.add_rule( rule<token<1> >( "Group" ) );
    mySyntax.add_rule( rule<token<2> >( "Separator" ) );
    mySyntax.add_rule( rule<token<3> >( "Switch" ) );
    // Lots more symbols for the rest of VRML...
    mySyntax.add_rule( rule<token_float<69> >(
        "([-+]?((([0-9]+[.]?)|([0-9]*[.][0-9]+))([eE][-+]?[0-9]+)?))" ) );
    mySyntax.add_rule( rule<token_string<70> >( "#[^\\n\\r]*[\\n\\r]" ) );
    mySyntax.add_rule( rule<token_string<71> >( "\"[^\"\\n\\r]*[\"\\n\\r]" ) );
    mySyntax.add_rule( rule<token_string<72> >( "[a-zA-Z_][a-zA-Z0-9_]*" ) );
    mySyntax.add_rule( rule<token<73> >( "[ \\t\\n\\r]+" ) );

    lexer<> myLexer( mySyntax );
    std::ifstream fsp( "test.vrml" );
    myLexer.set_source( fsp );  // Also it will have a char iterator interface

    lexer<>::token_iterator iter = myLexer.begin();
    for ( ; iter != myLexer.end(); ++iter )
        ;  // Do something with the tokens...

Essentially you create a syntax to which you add rules. Each rule specifies the token that will be created by the rule, and a regular expression to be matched. Special versions of tokens exist that will automatically generate a float, int or string from the matched expression. You then construct a lexer from the syntax, set a suitable source input, and iterate over the tokens. The tokens are polymorphic, so you can easily access the type and data; they are statically typed, so you can use types to match the tokens, or you can dispatch tokens to functions through a visitor framework. You can also use dynamically typed tokens if you want, although you lose much of the benefit of strong typing.

I especially like the idea of using visitors to distinguish between tokens. This means that in my example of a VRML file, the long float lists can be handled with a visitor that contains visit functions for a float token, a comma token (note that in vector lists in VRML the elements in a 3D vector are separated by whitespace, and the 3D vectors themselves are separated by commas) and a list termination token (e.g. a "]"). The visitor would then automatically store the floats in a suitable data structure - like a list of 3D vectors.

Dave
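(A sketch of what such a visitor might look like, purely as illustration - token_visitor, float_token and friends are hypothetical names, not the actual API:)

    #include <vector>

    struct float_token    { float value; };
    struct comma_token    {};
    struct list_end_token {};

    // Hypothetical visitor interface: one visit() per token type the
    // client cares about; unhandled token types default to no-ops.
    struct token_visitor
    {
        virtual ~token_visitor() {}
        virtual void visit(float_token const&)    {}
        virtual void visit(comma_token const&)    {}
        virtual void visit(list_end_token const&) {}
    };

    struct vec3 { float x, y, z; };

    // Builds a list of 3D vectors from a VRML-style "x y z, x y z, ..."
    // stream: floats accumulate, a comma or "]" flushes one vector.
    class vec3_list_builder : public token_visitor
    {
        std::vector<vec3>  vecs_;
        std::vector<float> pending_;
    public:
        void visit(float_token const& t)  { pending_.push_back(t.value); }
        void visit(comma_token const&)    { flush(); }
        void visit(list_end_token const&) { flush(); }
        std::vector<vec3> const& result() const { return vecs_; }
    private:
        void flush()
        {
            if (pending_.size() == 3)
            {
                vec3 v = { pending_[0], pending_[1], pending_[2] };
                vecs_.push_back(v);
            }
            pending_.clear();
        }
    };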

Dave Handley wrote:
I've changed the Spirit grammar to use the symbol parser, and dropped the time from 40-50 seconds to about 6-7 seconds. Thanks for the tip.
There might be some more (marginal) improvements to the grammar. See my other post. Anyway, you might be interested to know that informal tests of a toy Spirit-2 sample give as much as a 10X improvement over classic Spirit. There was even a case where a Spirit-2 scheme gave a 20X speedup over classic Spirit. Anyway, as I said, I'm very interested in your lexer. Would you be interested in a collaboration?

<< pardon me if this is a duplicate. I'm having some trouble posting to gmane >>

Regards,
--
Joel de Guzman
http://www.boost-consulting.com
http://spirit.sf.net

Dave Handley wrote:
I'd be willing to write the interfacing stub to plug your library into Wave.
Thanks - I think this should be relatively easy, because we currently use a forward iterator across the tokens in order to generate suitable output for Spirit. One of the requirements for the Boost community to accept this library will almost certainly be meeting the Standard requirements for forward iterators - and I note that this is one of your key requirements in Wave.
The lexer in Wave feeds a Spirit grammar (which recognises the preprocessor statements), so a forward iterator is sufficient.
The two different lexers I was using in the Wave library were a re2c based lexer (a static switch based lexer, extremely fast and compact) and a SLEX based lexer (a runtime generated DFA). I haven't done serious speed measurements, but the numbers I've got so far showed similar timings for both, with a slight advantage for re2c (as expected). I'd expect your library to be very similar in speed as well. But just out of curiosity I'm very interested in seeing the static DFA generation version :-)
Can these lexers be effectively used with any Spirit grammar? I'll download Wave over the next few days and have a look at how much overlap there is between our library and the lexers within Wave.
Both essentially expose a functor-based (lex-like) interface, which when called returns the next token. I was using the multi_pass iterator to wrap these into a usable iterator interface. So yes, both are usable with Spirit grammars out of the box.
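(For readers who haven't seen this technique: classic Spirit's multi_pass can turn such a functor into a forward iterator. From memory, the wiring looks roughly like this - token_functor is a stand-in for the re2c/SLEX functors, not their real names, and the tokens are just chars for brevity:)

    #include <boost/spirit/iterator/multi_pass.hpp>

    using namespace boost::spirit;

    // Stand-in lexer functor: each call returns the next token,
    // with a static eof sentinel marking end of input.
    class token_functor
    {
    public:
        typedef char result_type;
        token_functor() : c_('a') {}
        result_type operator()()
        {
            return (c_ > 'z') ? eof : c_++;
        }
        static result_type eof;
    private:
        char c_;
    };
    char token_functor::eof = '\0';

    // multi_pass buffers the returned tokens, upgrading the
    // single-pass functor into a forward iterator for Spirit.
    typedef multi_pass<
        token_functor,
        multi_pass_policies::functor_input,
        multi_pass_policies::first_owner,
        multi_pass_policies::no_check,
        multi_pass_policies::std_deque
    > iterator_type;

    iterator_type first = iterator_type(token_functor());
    iterator_type last;  // default-constructed == end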
My plan for the static DFA version (which we haven't fully discussed in our internal design discussions yet, so it may change significantly) is to use the memento pattern to capture the internal state of the DFA, along with the production rules. These could then be saved, loaded, or serialised to a C++ file, in a rather similar way to flex.
The SLEX lexer DFA tables can be saved to a file as well. But since SLEX doesn't support production rules (it's a recognition engine only), obviously no production rules can be saved ;-)
This would mean that the lextl classes would be both an inline and an offline tool. In principle this gives some very desirable advantages:
1) The API remains the same in both compile-time and run-time versions.
2) You can very easily swap between run-time and compile-time versions - for example, during development you may use run-time creation to speed up the process of developing a complex grammar, then switch to compile-time for your production release.
3) You have a number of options for compile-time versions. The grammar can be in a configuration file, or compiled into the program directly.
Agreed!
As a flavour, the current API would result in some code looking like:
    syntax<> mySyntax;
    mySyntax.add_rule( rule<token<1> >( "Group" ) );
    mySyntax.add_rule( rule<token<2> >( "Separator" ) );
    mySyntax.add_rule( rule<token<3> >( "Switch" ) );
    // Lots more symbols for the rest of VRML...
    mySyntax.add_rule( rule<token_float<69> >(
        "([-+]?((([0-9]+[.]?)|([0-9]*[.][0-9]+))([eE][-+]?[0-9]+)?))" ) );
    mySyntax.add_rule( rule<token_string<70> >( "#[^\\n\\r]*[\\n\\r]" ) );
    mySyntax.add_rule( rule<token_string<71> >( "\"[^\"\\n\\r]*[\"\\n\\r]" ) );
    mySyntax.add_rule( rule<token_string<72> >( "[a-zA-Z_][a-zA-Z0-9_]*" ) );
    mySyntax.add_rule( rule<token<73> >( "[ \\t\\n\\r]+" ) );

    lexer<> myLexer( mySyntax );
    std::ifstream fsp( "test.vrml" );
    myLexer.set_source( fsp );  // Also it will have a char iterator interface

    lexer<>::token_iterator iter = myLexer.begin();
    for ( ; iter != myLexer.end(); ++iter )
        ;  // Do something with the tokens...
How do you specify production rules? Your sample above shows how to use the recognition part only ;-) The numbers are the token IDs, I assume?
Essentially you create a syntax to which you add rules. Each rule specifies the token that will be created by the rule, and a regular expression to be matched. Special versions of tokens exist that will automatically generate a float, int or string from the matched expression. You then construct a lexer from the syntax, set a suitable source input, and iterate over the tokens. The tokens are polymorphic, so you can easily access the type and data; they are statically typed, so you can use types to match the tokens, or you can dispatch tokens to functions through a visitor framework. You can also use dynamically typed tokens if you want, although you lose much of the benefit of strong typing. I especially like the idea of using visitors to distinguish between tokens. This means that in my example of a VRML file, the long float lists can be handled with a visitor that contains visit functions for a float token, a comma token (note that in vector lists in VRML the elements in a 3D vector are separated by whitespace, and the 3D vectors themselves are separated by commas) and a list termination token (e.g. a "]"). The visitor would then automatically store the floats in a suitable data structure - like a list of 3D vectors.

Is your token type a predefined one, or is it possible to use my own token types instead? I'm asking because Wave in principle allows any token type to be used for its work, but it requires the tokens to expose a certain (very small) interface.
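(The kind of small interface in question is roughly the following - a paraphrase with illustrative names, not a literal copy of Wave's requirements:)

    #include <string>

    // Hypothetical minimal token interface of the kind Wave needs:
    // an identifier for the token kind, the matched text, and the
    // source position it was found at. All names are illustrative.
    typedef unsigned int token_id;

    struct file_position
    {
        std::string filename;
        unsigned int line, column;
    };

    class my_token
    {
    public:
        my_token(token_id id, std::string const& value,
                 file_position const& pos)
          : id_(id), value_(value), pos_(pos) {}

        operator token_id() const { return id_; }                // token kind
        std::string const& get_value() const { return value_; }  // matched text
        file_position const& get_position() const { return pos_; }
    private:
        token_id id_;
        std::string value_;
        file_position pos_;
    };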
Sounds good!

Regards,
Hartmut
participants (3)
- Dave Handley
- Hartmut Kaiser
- Joel de Guzman