
Since I get the mailing list in a digest, it looks like I have a lot of questions to answer. Sorry for the long(er) post, but I'm answering them all in one go.

Darren Cook wrote:
There seems to be some interest currently in alternatives/improvements to Spirit, so I thought I'd mention Hapy: http://hapy.sourceforge.net/
I'll have a look at Hapy and see whether we could interface to that as well.

Joel de Guzman wrote:
Some more things to note:
* why use lexeme_d on the lexer stage?
* the comment rule can be rewritten as: '#' >> *(anychar_p - eol_p) >> eol_p;
* stringLiteral can be rewritten similarly.
* word can be rewritten using chsets.
I used lexeme_d because the VRML grammar uses whitespace to separate ints and floats. Without lexeme_d, the skip parser also runs between the individual digits, so two adjacent ints could merge into one. The other optimisations are all valid, but unfortunately make no difference to the stopwatch: comments, stringLiterals and words don't occur very often in a VRML file - most of the time is spent parsing floats and ints.
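To illustrate the merging problem, here is a cut-down sketch (not the real VRML grammar - just classic Spirit's free parse() with a function-pointer action):

    #include <boost/spirit/core.hpp>
    #include <string>
    #include <vector>
    using namespace boost::spirit;

    std::vector<std::string> tokens;
    void collect(char const* first, char const* last)
    {
        tokens.push_back(std::string(first, last));
    }

    int main()
    {
        // lexeme_d suspends the skipper inside the directive, so
        // each run of digits becomes its own token: {"12", "34"}.
        parse("12 34", *lexeme_d[+digit_p][&collect], space_p);

        // Without lexeme_d the skipper also runs between digits,
        // so the whole input is swallowed as a single match.
        tokens.clear();
        parse("12 34", *(+digit_p)[&collect], space_p);
        return 0;
    }

Joel de Guzman wrote: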
Sure, no problem. But watch out, Spirit-2 will have a speed improvement in the 10X-20X range over classic Spirit ;-) I already have some test cases, and this does not touch on predictive parsing and lexing.
Joel, I would be really keen to try running my VRML test case through Spirit-2. Given that I've now got the Spirit-1 parser running at about 6-7 seconds, if you get a 10x speed improvement on this file, then it could make the case for separate lexing much harder to support! I may also need a bigger file to do test cases on - millisecond test timings don't make for accuracy in my view :-)

Joel de Guzman wrote:
Anyway, as I said, I'm very interested in your lexer. Would you be interested in a collaboration?
Yes, we would be very interested in a collaboration.

Hartmut Kaiser wrote:
How do you specify production rules? Your above sample specifies how to use the recognition part only ;-) The numbers are the token IDs, I assume?
The production rules are wrapped up in the different (and extensible) concept of a token. token_float, for example, will match a float and then store the matched number as a float inside the token. The normal token stores nothing, since for something like a keyword match you don't need the overhead of keeping the original text; token_string, on the other hand, stores the string that was matched. I am contemplating an interface that lets you use a functor to perform some basic string manipulation before the token is stored - dropping the enclosing quotes, for example - although this is clearly similar to functionality already available in Spirit.

Yes, the numbers are token IDs. I have contemplated using const char * template parameters as well, but at present I haven't done that because of the added difficulty of ensuring that the names have external linkage. Perhaps in the final version we can support both.
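Roughly, the idea looks like this (a simplified sketch rather than our actual declarations - the class and member names here are made up for illustration):

    #include <cstdlib>
    #include <string>

    struct token_base
    {
        explicit token_base(int id) : id_(id) {}
        virtual ~token_base() {}
        int id() const { return id_; }
    private:
        int id_;
    };

    // A keyword match needs no payload beyond its ID.
    template <int ID>
    struct token : token_base
    {
        token() : token_base(ID) {}
    };

    // token_float converts the matched text once, at lex time,
    // and stores the value rather than the original characters.
    template <int ID>
    struct token_float : token_base
    {
        token_float(char const* first, char const* last)
            : token_base(ID),
              value_(static_cast<float>(
                  std::atof(std::string(first, last).c_str()))) {}
        float value() const { return value_; }
    private:
        float value_;
    };

    // token_string keeps the matched characters themselves.
    template <int ID>
    struct token_string : token_base
    {
        token_string(char const* first, char const* last)
            : token_base(ID), text_(first, last) {}
        std::string const& text() const { return text_; }
    private:
        std::string text_;
    };

Hartmut Kaiser also wrote: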
Is your token type a predefined one or is it possible to use my own token types instead? I'm asking because Wave in principle allows the use of any token type for its work, but it requires the tokens to expose a certain (very small) interface.
In the system as it stands, all token types must inherit from a base version. This base version provides lots of useful virtual functions that derived token types can work with, and the iterator interface returns the token_base from its de-reference function. You can define new token types, but accessing the specific interface on the derived token means doing one of the following (options 3 and 4 are sketched below):

1) Ensuring the interface is exposed in the token_base.
2) Using polymorphic despatch to get at the correct type, through a visitor for example.
3) If only a single token type is used in the lexer, using a static_cast.
4) Using a dynamic_cast.

You can also define new rule classes - at present we have written a rule class for conventional tokens, and another for tokens that have a dynamic (run-time) ID rather than the static (compile-time) ID in the template parameter.
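Using the token classes from the sketch above, options 3 and 4 would look something like this (again illustrative rather than our real interface):

    #include <cassert>

    enum { ID_FLOAT = 1 };

    // (3) When the ID pins down the concrete type, a static_cast
    //     is safe and costs nothing at run time.
    float value_of(token_base const& t)
    {
        assert(t.id() == ID_FLOAT);
        return static_cast<token_float<ID_FLOAT> const&>(t).value();
    }

    // (4) When it does not, pay for a dynamic_cast and check.
    bool try_value(token_base const& t, float& out)
    {
        if (token_float<ID_FLOAT> const* f =
                dynamic_cast<token_float<ID_FLOAT> const*>(&t))
        {
            out = f->value();
            return true;
        }
        return false;
    }

Dave Handley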