Interest in a fast lexical analyser compatible with Spirit

A colleague and I are working on a fast lexical analyser designed to work alongside Spirit. We currently have some working prototypes, and as such I thought that I would gauge interest in this library. We plan to produce a DFA based lexical analyser that provides output as a set of iterable polymorphic flyweighted tokens. These could then be provided as input to Spirit instead of character iterators (albeit with the addition of a token_p parser in Spirit).

The objectives of our project are as follows:

1) As fast as lex/flex.
2) Simple to use.
3) Rules to generate tokens can be provided both statically and dynamically. Static definition would be through an offline pre-process stage much like lex.
4) Easy to interface with Spirit.

I understand from Joel's posts below (both on this group and on CodeProject) that he is already planning some work to this end. My understanding of expression template based lexers is that they can outperform DFAs under some circumstances, but in other situations DFAs can provide the best performance. I am therefore recommending this in addition to Joel and Eric's work rather than instead of it.

Finally, as a taster of the work we are doing, I have performed some rather unscientific and arbitrary performance tests on a 20Mb VRML 1.0 file. The performance of flex, Spirit, and our new library are as follows:

flex: < 1 second
new library: about 1 second
Spirit: 40-50 seconds

These results were produced with no semantic actions to slow down any of the 3 solutions. They are also not conclusive, since we are almost an order of magnitude slower on an Athlon 64 bit machine at present.

Dave Handley
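As a rough illustration of the proposal, a token stream handed to a parser in place of character iterators might look something like the sketch below. All of the names here (token, token_id, make_token) are hypothetical and are not the interface of the library being described, and the token_p parser mentioned above does not yet exist in Spirit.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical token identifiers (illustrative only).
    enum token_id { tok_keyword, tok_identifier, tok_number };

    // A lightweight token: an id plus a pointer/length into the original
    // buffer, so the lexeme text itself is never copied.
    struct token
    {
        token_id    id;
        const char* begin;   // first character of the lexeme
        std::size_t length;  // length of the lexeme

        std::string text() const { return std::string(begin, length); }
    };

    token make_token(token_id id, const char* begin, std::size_t length)
    {
        token t = { id, begin, length };
        return t;
    }

    int main()
    {
        // The DFA lexer would fill a container like this; a parser would
        // then walk token iterators instead of character iterators.
        const char input[] = "Separator { x 1.5 }";
        std::vector<token> tokens;
        tokens.push_back(make_token(tok_identifier, input + 0,  9)); // "Separator"
        tokens.push_back(make_token(tok_keyword,    input + 10, 1)); // "{"
        tokens.push_back(make_token(tok_identifier, input + 12, 1)); // "x"
        tokens.push_back(make_token(tok_number,     input + 14, 3)); // "1.5"
        tokens.push_back(make_token(tok_keyword,    input + 18, 1)); // "}"

        for (std::vector<token>::const_iterator it = tokens.begin();
             it != tokens.end(); ++it)
        {
            std::cout << it->id << ": " << it->text() << '\n';
        }
        return 0;
    }

Because each token stores only an id and a pointer/length into the original buffer, tokens stay cheap to copy, which is the point of the flyweighted design.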
Relevant posts:

Joel de Guzman wrote:

I guess if you use string literals you'll have this problem no matter what, though I'm not sure the use of string literals should be a foregone conclusion. If you consider parser generators like YACC, they cleanly separate lexical analysis from parsing and the parser itself is never working with string literals.
Again, true. However, one of the original intents of Spirit is to be usable in micro parsing tasks. When parsing small grammars like, say, a complex number, you do not want to write a separate front-end lexer.
Anyway, I do plan to write an optional but separate lexer sometime.
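The micro-parsing case Joel refers to is the kind of thing Spirit can already do inline, with no separate lexer; the complex-number parser from the Spirit documentation is the canonical example. A minimal sketch, assuming the 2004-era header path (newer Boost releases ship Spirit classic as <boost/spirit/include/classic_core.hpp> in namespace boost::spirit::classic):

    #include <boost/spirit/core.hpp>  // Spirit "classic"; see note above on
                                      // header paths in newer Boost releases
    #include <complex>
    #include <iostream>

    // Tiny semantic action: store the parsed number into a bound double.
    struct set_double
    {
        double& target;
        explicit set_double(double& t) : target(t) {}
        void operator()(double v) const { target = v; }
    };

    // Parse "r", "(r)" or "(r, i)" in place -- no separate lexer involved.
    bool parse_complex(char const* str, std::complex<double>& c)
    {
        using namespace boost::spirit;
        double r = 0.0, i = 0.0;
        bool ok = parse(str,
                (
                    '(' >> real_p[set_double(r)]
                        >> !(',' >> real_p[set_double(i)])
                        >> ')'
                )
                | real_p[set_double(r)],
                space_p).full;
        if (ok)
            c = std::complex<double>(r, i);
        return ok;
    }

    int main()
    {
        std::complex<double> c;
        if (parse_complex("(1.5, -2.5)", c))
            std::cout << c << '\n';   // prints (1.5,-2.5)
        return 0;
    }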
Joel de Guzman wrote:
As for the lexer, I am working with Eric Niebler, author of Greta and xpressive (google "regular expression c++"), with the hopes of developing a powerful expression-template-based lexer front end.

----- Original Message -----
From: "Dave Handley" <dave@dah.me.uk>
To: <boost@lists.boost.org>
Sent: Monday, December 27, 2004 9:18 AM
Subject: [boost] Interest in a fast lexical analyser compatible with Spirit
Finally, as a taster of the work we are doing, I have performed some rather unscientific and arbitrary performance tests on a 20Mb VRML 1.0 file. The performance of flex, Spirit, and our new library are as follows:
flex: < 1 second
new library: about 1 second
Spirit: 40-50 seconds
Would you please post the grammar productions you used for Spirit? Would you consider testing the yard parser as well (http://yard-parser.sf.net)? Are there other advantages of your tool over Flex, other than providing a nice interface to Spirit?

Christopher Diggins
http://www.cdiggins.com
http://www.heron-language.com

Dave Handley wrote:
A colleague and I are working on a fast lexical analyser designed to work alongside Spirit. We currently have some working prototypes, and as such I thought that I would gauge interest in this library. We plan to produce a DFA based lexical analyser that provides output as a set of iterable polymorphic flyweighted tokens. These could then be provided as input to Spirit instead of character iterators (albeit with the addition of a token_p parser in Spirit).
I'm definitely interested to have a look at your library. Besides my general interest in Spirit, I'd like to try it out as an alternative lexing component for Wave. I'm pretty sure this should be interesting for you as well, because two different lexers are already implemented there, which gives a good opportunity to compare them in a real environment.
The objectives of our project are as follows:
1) As fast as lex/flex.
2) Simple to use.
3) Rules to generate tokens can be provided both statically and dynamically. Static definition would be through an offline pre-process stage much like lex.
4) Easy to interface with Spirit.
As for a dynamic DFA based lexer, Wave already uses the Spirit based SLEX, but a static DFA based solution is very interesting to look at. Is the DFA generated at compile time?
Finally, as a taster of the work we are doing, I have performed some rather unscientific and arbitrary performance tests on a 20Mb VRML 1.0 file. The performance of flex, Spirit, and our new library are as follows:
flex: < 1 second
new library: about 1 second
Spirit: 40-50 seconds
We definitely should try the new upcoming Spirit-2 code base as well, since it should be a lot faster than the current version. Is it possible to have a look at your test code as well? This way we could try to make a comparison as soon as the Spirit-2 code base evolves.
These results were produced with no semantic actions to slow down any of the 3 solutions. They are also not conclusive, since we are almost an order of magnitude slower on an Athlon 64 bit machine at present.
Regards,
Hartmut

Dave Handley wrote:
A colleague and I are working on a fast lexical analyser designed to work alongside Spirit. We currently have some working prototypes, and as such I thought that I would gauge interest in this library. We plan to produce a DFA based lexical analyser that provides output as a set of iterable polymorphic flyweighted tokens. These could then be provided as input to Spirit instead of character iterators (albeit with the addition of a token_p parser in Spirit). The objectives of our project are as follows:
1) As fast as lex/flex.
2) Simple to use.
3) Rules to generate tokens can be provided both statically and dynamically. Static definition would be through an offline pre-process stage much like lex.
4) Easy to interface with Spirit.
I am looking for exactly what you propose. I have been trying to replace my Lex/Yacc parser with Spirit but have run into performance as the major problem.

I broke up the problem into 3 steps. In the first phase the program uses a Spirit grammar to generate a list of tokens with info similar to the info generated by Lex. In phase 2 another Spirit grammar generates a parse tree from the tokens, with rules equivalent to the Yacc rules. Phase 3 evaluates the parse tree, with results equivalent to the Yacc actions.

After getting the bugs out, I compared the performance and found that the total time for processing a test case of input scripts took 6 times as long using this scheme. Since some of the time is spent in running the virtual machine that follows the parsing steps, the actual Spirit performance is much worse. I found that much of the slowdown was due to Phase 1, the grammar that scans the text input a la Lex. After a lot of tweaking I was able to bring down the performance hit to a bit over 3 times worse than what I am trying to replace. However, this is still unacceptable.

I am disappointed, because I was hoping to use Spirit, or something like it, to give me some independence from Lex/Yacc's dictatorial control of the input source. Your project sounds like it would solve the worst of the problems I have in trying to move to Spirit.

--
Dick Hadsell
914-259-6320  Fax: 914-259-6499
Reply-to: hadsell@blueskystudios.com
Blue Sky Studios  http://www.blueskystudios.com
44 South Broadway, White Plains, NY 10601
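The three phases Dick describes have roughly the following shape. Everything below is a deliberately trivial, hypothetical skeleton to show the structure of such a pipeline; it is not his actual code, and in his setup phases 1 and 2 are Spirit grammars rather than hand-written loops.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct token { std::string text; };               // phase 1 output (lex-like)
    struct tree  { std::vector<std::string> items; }; // phase 2 output (parse tree)

    // Phase 1: scan the raw text into tokens. In the setup described above
    // this phase is a Spirit grammar, and it is where most of the slowdown was.
    std::vector<token> tokenize(const std::string& input)
    {
        std::vector<token> out;
        std::istringstream in(input);
        std::string word;
        while (in >> word) { token t; t.text = word; out.push_back(t); }
        return out;
    }

    // Phase 2: build a parse tree from the tokens (Yacc-like rules).
    tree build_tree(const std::vector<token>& tokens)
    {
        tree root;
        for (std::size_t i = 0; i < tokens.size(); ++i)
            root.items.push_back(tokens[i].text);
        return root;
    }

    // Phase 3: walk the tree and perform the actions (the "virtual machine").
    void evaluate(const tree& root)
    {
        for (std::size_t i = 0; i < root.items.size(); ++i)
            std::cout << root.items[i] << '\n';
    }

    int main()
    {
        evaluate(build_tree(tokenize("a b c")));
        return 0;
    }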

A colleague and I are working on a fast lexical analyser designed to work alongside Spirit. We currently have some working prototypes, and
There currently seems to be some interest in alternatives/improvements to Spirit, so I thought I'd mention Hapy: http://hapy.sourceforge.net/

I've used it on a couple of projects and found it as easy to use as Spirit but much faster to compile. I don't know about run-time speed comparisons: in all my Spirit/Hapy projects the use of the data takes much more time than the parsing of it.

Darren

Darren Cook wrote:
A colleague and I are working on a fast lexical analyser designed to work alongside Spirit. We currently have some working prototypes, and
There currently seems to be some interest in alternatives/improvements to Spirit, so I thought I'd mention Hapy: http://hapy.sourceforge.net/
"The Hapy library would not exist if Spirit would generate correct parsers by default..." Whoa. Them's fightin' words! -- Dave Abrahams Boost Consulting http://www.boost-consulting.com
participants (6)

- christopher diggins
- Darren Cook
- Dave Handley
- David Abrahams
- Hartmut Kaiser
- Richard Hadsell