Interest in a fast lexical analyser compatible

Dick Hadsell wrote: <snip>
disappointed, because I was hoping to use Spirit or something like it, to give me some independence from Lex/Yacc's dictatorial control of the input source.
Your project sounds like it would solve the worst of the problems I have in trying to move to Spirit.
The current API we are working with templatises the input so that, in theory, it will work with any character-like input, in much the same way as std::basic_string<>. We are still working on getting the DFA to work generically, rather than just explicitly with char and wchar_t, but I think we should have some success. At present the lexer is strongly typed on this character type, in the same way as std::basic_string<>, but I don't necessarily see that as a problem.
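To make the std::basic_string<> analogy concrete, here is a minimal sketch of a lexer parameterised on its character type. It is only an illustration: basic_lexer, basic_token and token_kind are hypothetical names, the scanner is hand-written rather than DFA-driven, and the character tests assume an ASCII-compatible encoding. None of it is the API under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout: this is not the API discussed in the thread.
enum class token_kind { identifier, number, other };

template <typename CharT>
struct basic_token {
    token_kind kind;
    std::basic_string<CharT> text;
};

// The lexer is typed on CharT in the same way std::basic_string<CharT> is.
template <typename CharT>
class basic_lexer {
public:
    using string_type = std::basic_string<CharT>;

    std::vector<basic_token<CharT>> tokenize(const string_type& in) const {
        std::vector<basic_token<CharT>> out;
        std::size_t i = 0;
        while (i < in.size()) {
            if (is_space(in[i])) { ++i; continue; }
            const std::size_t start = i;
            token_kind kind;
            if (is_digit(in[i])) {
                kind = token_kind::number;
                while (i < in.size() && is_digit(in[i])) ++i;
            } else if (is_ident_start(in[i])) {
                kind = token_kind::identifier;
                while (i < in.size() &&
                       (is_ident_start(in[i]) || is_digit(in[i]))) ++i;
            } else {
                kind = token_kind::other;  // any other single character
                ++i;
            }
            assert(i > start);  // every token consumes at least one character
            out.push_back({kind, in.substr(start, i - start)});
        }
        return out;
    }

private:
    // Assumption: CharT uses an ASCII-compatible encoding for these characters.
    static bool is_space(CharT c) {
        return c == CharT(' ') || c == CharT('\t') ||
               c == CharT('\n') || c == CharT('\r');
    }
    static bool is_digit(CharT c) {
        return c >= CharT('0') && c <= CharT('9');
    }
    static bool is_ident_start(CharT c) {
        return (c >= CharT('a') && c <= CharT('z')) ||
               (c >= CharT('A') && c <= CharT('Z')) || c == CharT('_');
    }
};

// Convenience aliases, mirroring std::string and std::wstring.
using lexer = basic_lexer<char>;
using wlexer = basic_lexer<wchar_t>;
```

With this shape, lexer().tokenize("foo 42 +") and wlexer().tokenize(L"foo 42 +") share one implementation, but a token produced by one cannot be mixed with the other, which is the strong typing mentioned above.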
I broke up the problem into 3 steps. In the first phase the program uses a Spirit grammar to generate a list of tokens with info similar to <snip>
Depending on the type of grammar I think you should easily achieve a 6x or greater performance boost. If the input has long repetitive sections, you could probably further optimise the stage so that the lexer does most of the work - for example if you have long lists of numbers or similar. I'm not sure how well this would work with Spirit until I try it, but it should be possible to switch control part way through a parse to a very quick and efficient parser that just throws the tokens at a visitor until a particular section is finished. I'm sure this could probably be done by writing a new parser type in Spirit. The idea would be to process long lists of numbers or strings or similar, where those lists have a clearly defined end token.

Dave Handley
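The "throw the tokens at a visitor until an end token" idea could look roughly like the sketch below. It is a standalone illustration, not a Spirit parser type: scan_number_list, collect_visitor and on_number are hypothetical names, the list is assumed to hold non-negative decimal integers separated by whitespace or commas, and overflow is not checked.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical visitor: simply collects every number it is handed.
struct collect_visitor {
    std::vector<long> values;
    void on_number(long v) { values.push_back(v); }
};

// Tight loop over a list of non-negative decimal integers starting at `pos`.
// Each number goes straight to the visitor; scanning stops at `end_token`.
// Returns the position just past the end token, or std::string::npos if an
// unexpected character is found or the end token is missing.
// Assumption: values fit in a long (no overflow check in this sketch).
template <typename Visitor>
std::size_t scan_number_list(const std::string& in, std::size_t pos,
                             char end_token, Visitor& visitor) {
    while (pos < in.size()) {
        const char c = in[pos];
        if (c == end_token) return pos + 1;  // section finished
        if (c == ' ' || c == '\t' || c == '\n' || c == ',') { ++pos; continue; }
        if (c < '0' || c > '9') return std::string::npos;  // not a number
        long v = 0;
        while (pos < in.size() && in[pos] >= '0' && in[pos] <= '9') {
            v = v * 10 + (in[pos] - '0');
            ++pos;
        }
        visitor.on_number(v);
    }
    return std::string::npos;  // ran out of input before the end token
}
```

A surrounding parser would call this once it has recognised the start of a list, then resume normal parsing at the returned position.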

I just wrote a quick and dirty comparison between YARD and Spirit and YARD performs roughly 10x faster as a toy C++ tokenizer. I know Joel, I said I wouldn't do any comparisons, but I couldn't resist, what with Dave's claim to be outperforming Spirit by 50x! This increased performance of YARD is due to the fact that YARD generates the parser at compile-time, rather than at run-time. Clearly I am not using an optimized Spirit grammar; I opted instead to implement both grammars in a naive and straightforward manner.

Here is the full Spirit grammar I used:

    single_comment_p = str_p("//") >> *(~ch_p('\n')) >> ~ch_p('\n');
    full_comment_p = str_p("/*") >> anychar_p - str_p("*/");
    comment_p = single_comment_p | full_comment_p;
    ws = +(space_p | comment_p);
    escape_char_p = ch_p('\\') >> anychar_p;
    string_literal_p = ch_p('"') >> *(escape_char_p | ~ch_p('"')) >> ch_p('"');
    char_literal_p = ch_p('\'') >> (escape_char_p | ~ch_p('\'')) >> ch_p('\'');
    ident_p = (alpha_p | ch_p('_')) >> +(alnum_p | ch_p('_'));
    number_p = real_p;
    cpp_token = ws | char_literal_p | string_literal_p | number_p | ident_p[&inc_counter];
    tokens = *(cpp_token | anychar_p);

I would appreciate any suggestions on how to improve the Spirit grammar.

The YARD grammar is far more verbose; here is only a small snippet:

    struct MatchBeginFullComment : public re_and< MatchChar<'/'>, MatchChar<'*'> > { };
    struct MatchEndFullComment : public re_and< MatchChar<'*'>, MatchChar<'/'> > { };
    struct MatchFullComment : public re_and< MatchBeginFullComment, MatchEndFullComment > { };
    struct MatchComment : public re_or< MatchSingleLineComment, MatchFullComment > { };

Anyway, you get the picture: YARD is verbose but quite fast. I will be including the full source in the next YARD release.

Christopher Diggins
http://sourceforge.net/projects/yard-parser
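For readers unfamiliar with the compile-time style, the sketch below shows how a grammar can be expressed purely as types, so that the whole matcher is fixed when the program is compiled. It is an assumption-laden illustration and not YARD's implementation: CharIs, Seq and Until are hypothetical stand-ins, loosely modelled on the MatchChar and re_and names in the snippet above.

```cpp
#include <string>

// Hypothetical combinators for illustration only; not YARD's actual code.

// Matches exactly one character C and advances past it.
template <char C>
struct CharIs {
    static bool match(const char*& p, const char* end) {
        if (p != end && *p == C) { ++p; return true; }
        return false;
    }
};

// Matches A followed by B; restores the position if either part fails.
template <typename A, typename B>
struct Seq {
    static bool match(const char*& p, const char* end) {
        const char* save = p;
        if (A::match(p, end) && B::match(p, end)) return true;
        p = save;
        return false;
    }
};

// Skips characters until Stop matches (Stop itself is consumed).
// Fails, restoring the position, if the input ends first.
template <typename Stop>
struct Until {
    static bool match(const char*& p, const char* end) {
        const char* save = p;
        while (p != end) {
            if (Stop::match(p, end)) return true;
            ++p;
        }
        p = save;
        return false;
    }
};

// The grammar is a set of types; no parser object is built at run time.
struct BeginComment : Seq<CharIs<'/'>, CharIs<'*'>> {};
struct EndComment   : Seq<CharIs<'*'>, CharIs<'/'>> {};
struct FullComment  : Seq<BeginComment, Until<EndComment>> {};

// True if `s` is exactly one C-style comment.
inline bool matches_full_comment(const std::string& s) {
    const char* p = s.data();
    const char* end = p + s.size();
    return FullComment::match(p, end) && p == end;
}
```

Because every rule is a static function on a type, the compiler sees the complete call chain and is free to inline it, which is one plausible reason this style can be fast.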

christopher diggins wrote:
I just wrote a quick and dirty comparison between YARD and Spirit and YARD performs roughly 10x faster as a toy C++ tokenizer. I know Joel, I said I wouldn't do any comparisons, but I couldn't resist, what with Dave's claim to be outperforming Spirit by 50x!
Sure, no problem. But watch out, Spirit-2 will have a speed improvement in the 10X-20X range over classic Spirit ;-) I already have some test cases, and this does not touch on predictive parsing and lexing.

Cheers,
--
Joel de Guzman
http://www.boost-consulting.com
http://spirit.sf.net
participants (3)
- christopher diggins
- Dave Handley
- Joel de Guzman