Interest in a fast lexical analyser compatible with Spirit

Since I get the mailing list in a digest, it looks like I have a lot of questions to answer. Sorry for the long(er) post, but I'm answering them all in one post: Darren Cook wrote:
There seems some interest currently in alternatives/improvements to Spirit, so I thought I'd mention Hapy: http://hapy.sourceforge.net/
I'll have a look at Hapy and see whether we could interface to that as well. Joel de Guzman wrote:
Some more things to note:
* why use lexeme_d on the lexer stage?
* the comment rule can be rewritten as: '#' >> *(anychar_p - eol_p) >> eol_p;
* stringLiteral can be rewritten similarly.
* word can be rewritten using chsets.
I used lexeme_d because the VRML grammar uses whitespace to separate ints and floats. Without lexeme_d, adjacent ints could merge into one. The other optimisations are all valid, but unfortunately don't make any difference to the stopwatch. This is because comments, stringLiterals and words don't occur very often in a VRML file - most of the time is spent parsing floats and ints. Joel de Guzman wrote:
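The whitespace point can be sketched in plain C++ (a hypothetical standalone illustration, not the actual VRML lexer or Spirit itself): a lexer skips whitespace only *between* tokens, so "12 34" stays two integers, whereas a skip parser that also runs *inside* a number rule would glue them together - which is what lexeme_d suppresses in classic Spirit.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical mini-lexer: whitespace is consumed only at token
// boundaries, never inside the digit-collecting loop, so "12 34"
// yields two tokens rather than one merged "1234".
std::vector<std::string> lex_ints(const std::string& in) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (std::isspace(static_cast<unsigned char>(in[i]))) { ++i; continue; }
        std::string tok;
        while (i < in.size() && std::isdigit(static_cast<unsigned char>(in[i])))
            tok += in[i++];
        if (!tok.empty()) out.push_back(tok);
        else ++i;  // skip any non-digit, non-space character
    }
    return out;
}
```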
Sure, no problem. But watch out, Spirit-2 will have a speed improvement in the 10X-20X range over classic Spirit ;-) I already have some test cases, and this does not touch on predictive parsing and lexing.
Joel, I would be really keen to try running my VRML test case through Spirit-2. Given that I've now got the Spirit-1 parser running at about 6-7 seconds, if you get a 10x speed improvement on this file, then it could make the case for separate lexing much harder to support! I may also need a bigger file to start doing test cases on - millisecond test timings don't make for accuracy in my view :-) Joel de Guzman wrote:
Anyway, as I said, I'm very interested in your lexer. Would you be interested in a collaboration?
Yes, we would be very interested in a collaboration. Hartmut Kaiser wrote:
How do you specify production rules? Your above sample specifies how to use the recognition part only ;-) The numbers are the token IDs, I assume?
The production rules are wrapped up in the different (and extensible) concept of a token. token_float, for example, will match a float, then store the matched number as a float inside the token. The plain token stores nothing, since for something like a keyword match you don't need the overhead of the original text, while token_string stores the string that was matched. I am contemplating an interface that allows you to use a functor to perform some basic string manipulation before the token is stored - for example, dropping the enclosing quotes - although this is clearly similar to functionality already available in Spirit. Yes, the numbers are token IDs - I have contemplated using const char * template parameters as well, but at present I haven't done that because of the added difficulty of ensuring that the names have external linkage. Perhaps in the final version we can support both.
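The token concept described above might look roughly like this (a hypothetical sketch - the names token_base, token_float and token_string follow the post, but the exact interface is an assumption):

```cpp
#include <string>
#include <utility>

// Common base: every token carries an ID; derived tokens add payload.
class token_base {
public:
    explicit token_base(int id) : id_(id) {}
    virtual ~token_base() = default;
    int id() const { return id_; }
private:
    int id_;
};

// A matched float is stored as a float, not as the original text.
class token_float : public token_base {
public:
    token_float(int id, float value) : token_base(id), value_(value) {}
    float value() const { return value_; }
private:
    float value_;
};

// A string literal keeps the matched text.
class token_string : public token_base {
public:
    token_string(int id, std::string text)
        : token_base(id), text_(std::move(text)) {}
    const std::string& text() const { return text_; }
private:
    std::string text_;
};
```

A keyword would use token_base directly, paying only for the ID.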
Is your token type a predefined one, or is it possible to use my own token types instead? I'm asking because Wave in principle allows the use of any token type, but it requires the tokens to expose a certain (very small) interface.
In the system as it stands, all token types must inherit from a base version. This base version provides lots of useful virtual functions that derived token types can work with. The iterator interface then returns the token_base from its dereference function. You can define new token types, but accessing the specific interface on the derived token means doing one of:
1) Ensuring the interface is exposed in the token_base.
2) Using polymorphic despatch to get the correct type, through a visitor for example.
3) If only a single token type is used in the lexer, using a static_cast.
4) Using a dynamic_cast.
You can also define new rule classes - at present we have written a rule class for conventional tokens, and another rule class for tokens that have a dynamic (run-time) ID rather than the static (compile-time) ID in the template parameter.
Dave Handley
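Option 4 above - recovering the derived token from the token_base the iterator returns - could be sketched like this (hypothetical names and interface, assumed for illustration):

```cpp
// Minimal stand-ins for the token hierarchy described in the post.
struct token_base {
    virtual ~token_base() = default;  // polymorphic, so dynamic_cast works
};

struct token_float : token_base {
    explicit token_float(float v) : value(v) {}
    float value;
};

// Returns the stored float if tok really is a token_float,
// otherwise falls back to the supplied default.
float float_value_or(const token_base& tok, float fallback) {
    if (const auto* f = dynamic_cast<const token_float*>(&tok))
        return f->value;
    return fallback;
}
```

When the lexer emits a single token type (option 3), the dynamic_cast could be replaced by a static_cast and the run-time check dropped.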

Dave Handley wrote:
Joel de Guzman wrote:
Some more things to note:
* why use lexeme_d on the lexer stage?
* the comment rule can be rewritten as: '#' >> *(anychar_p - eol_p) >> eol_p;
* stringLiteral can be rewritten similarly.
* word can be rewritten using chsets.
I used lexeme_d because the VRML grammar uses whitespace to separate ints and floats. Without the lexeme_d, ints could merge into one.
I smell something wrong here. Lexers work at the character level and process white space appropriately. The added skip-parser will be another source of slowdown. White-space skipping should be in the lexer, not outside it.
Joel de Guzman wrote:
Sure, no problem. But watch out, Spirit-2 will have a speed improvement in the 10X-20X range over classic Spirit ;-) I already have some test cases, and this does not touch on predictive parsing and lexing.
Joel, I would be really keen to try running my VRML test case through Spirit-2. Given that I've now got the Spirit-1 parser running at about 6-7 seconds, if you get a 10x speed improvement on this file, then it could make the case for separate lexing much harder to support! I may also need a bigger file to start doing test cases on - millisecond test timings don't make for accuracy in my view :-)
See the smiley ;-) ... I was just teasing :-) I do have some rather informal test cases though, on a toy parser. Seriously, it's rather early to say anything really conclusive. I'm not yet sure how it would turn out in real-world cases outside the toy room. I do intend to do lots of benchmarks and optimisation; that's an area that's largely left for later. The initial thrust was not on speed, but on the interface and usability. We've certainly learned a lot since Spirit debuted. Now it's time to rock and roll!
Joel de Guzman wrote:
Anyway, as I said, I'm very interested in your lexer. Would you be interested in a collaboration?
Yes, we would be very interested in a collaboration.
Very cool! I'll keep in touch as things progress in the Spirit-2 world. Lexing + predictive parsing: that's where I'm heading. Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Dave Handley wrote:
How do you specify production rules? Your above sample specifies how to use the recognition part only ;-) The numbers are the token IDs, I assume?
The production rules are wrapped up in the different (and extensible) concept of a token. token_float, for example, will match a float, then store the matched number as a float inside the token. The plain token stores nothing, since for something like a keyword match you don't need the overhead of the original text, while token_string stores the string that was matched. I am contemplating an interface that allows you to use a functor to perform some basic string manipulation before the token is stored - for example, dropping the enclosing quotes - although this is clearly similar to functionality already available in Spirit.
Ok, got it. Wave as it is today uses one single token type only. I was thinking from the beginning of using different token types to support tools built on top of the preprocessor (for storing symbol table entries and such), but haven't got around to implementing this yet. So this is a second point which makes it very interesting for me to collaborate with your team!
Yes the numbers are token IDs - I have contemplated using const char * template parameters as well, but at present I haven't done that because of the added difficulty of ensuring that your names have external linkage. Perhaps in the final version we can support both.
Token IDs should be sufficient in most cases, IMHO.
Is your token type a predefined one, or is it possible to use my own token types instead? I'm asking because Wave in principle allows the use of any token type, but it requires the tokens to expose a certain (very small) interface.
In the system as it stands, all token types must inherit from a base version. This base version provides lots of useful virtual functions that derived token types can work with. The iterator interface then returns the token_base from its dereference function. You can define new token types, but accessing the specific interface on the derived token means doing one of:
1) Ensuring the interface is exposed in the token_base.
2) Using polymorphic despatch to get the correct type, through a visitor for example.
3) If only a single token type is used in the lexer then use a static_cast.
4) Using a dynamic_cast.
Since tokens have to be copied around in a typesafe manner, I'd have expected something similar to boost::variant, but maybe I've misunderstood you... How do you plan to achieve this? Copying through a base class won't work, AFAICT.
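The variant-style alternative Hartmut has in mind might look like this (a sketch using std::variant in place of boost::variant; all type names are illustrative, not taken from either library): tokens are copied by value, so there is no slicing through a base class.

```cpp
#include <string>
#include <variant>

// Each alternative is a plain value type carrying its own payload.
struct keyword_token { int id; };
struct float_token   { int id; float value; };
struct string_token  { int id; std::string text; };

// Copying a token copies whichever alternative it currently holds.
using token = std::variant<keyword_token, float_token, string_token>;

// Uniform access via std::visit; every alternative exposes an id.
int token_id(const token& t) {
    return std::visit([](const auto& alt) { return alt.id; }, t);
}
```

The trade-off versus the inheritance design in the post is that the set of token types is closed at the variant's definition, whereas new classes can be derived from token_base at any time.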
You can also define new rule classes - at present we have written a rule class for conventional tokens, and another rule class for tokens that have a dynamic (run-time) ID rather than the static (compile-time) ID in the template parameter.
This should be sufficient to integrate with Wave. Looking forward to it - please keep me in the loop! Regards Hartmut
participants (3)
- Dave Handley
- Hartmut Kaiser
- Joel de Guzman