Re: Interest in a fast lexical analyser compatible with Spirit

Hartmut Kaiser wrote:
Since tokens have to be copied around in a typesafe manner I'd have expected to have something similar to the boost::variant, but maybe I've got you wrong... How do you plan to achieve this? Copying through a base class won't work AFAICT?
In my experience, copying tokens around is a major source of slow-down in lexers. In the last lexer I wrote, I had to abandon the token class and use a C-style union containing ints, doubles, and std::string pointers. At the time my company restricted how much of Boost I could use, so Boost.Variant wasn't an option. Since then I've put a lot of thought into this problem and come up with one solution. We haven't completed integration testing with Spirit and other libraries using this technique yet (and I can foresee some issues already :-( ).

The solution uses flyweighted tokens. The actual token lives in the rule; during parsing, if the token has an assigned value, that value is copied into the flyweight, and the token is then passed around by reference/pointer. When you come to use the token in the parser you have a number of options:

1) Use it and throw it away as soon as you call the ++ operator on the token_iterator.
2) Get the data out and do something with it - for example, put it in a list/deque, etc.
3) Copy the token and use the copied version.

For the final option, the token supports a virtual clone function that creates copies of the flyweighted version for use elsewhere. One implementation point: the iterators contain a pointer to the token, so that an iterator can be copied, compared for equality, and so on.

There are other potential solutions; however, any of them leaves the problem of how to allocate and deallocate tokens on the heap very fast. I considered Alexandrescu's small-object allocator from the Loki library, and have also briefly looked at Boost.Pool, but I am not comfortable with either of these, because I am convinced they would add too much of a burden to token creation. In my last lexer I saw a speed improvement of one to two orders of magnitude when I removed the token class and replaced it with a union.
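To make the flyweight idea above concrete, here is a minimal sketch. All the names (Token, IntToken, token_iterator, clone) are illustrative assumptions on my part, not the actual interface - the point is only that the flyweight lives in the rule, the iterator holds a non-owning pointer to it, and clone() produces a heap copy for callers who want to keep a token past the next ++:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical sketch of the flyweighted-token scheme described above.
struct Token {
    virtual ~Token() {}
    // clone() makes a heap copy for option 3: keeping the token
    // beyond the next ++ on the token_iterator.
    virtual Token* clone() const = 0;
    virtual int id() const = 0;
};

struct IntToken : Token {
    int id_;
    long value;  // assigned value, copied into the flyweight on a match
    explicit IntToken(int id) : id_(id), value(0) {}
    Token* clone() const { return new IntToken(*this); }
    int id() const { return id_; }
};

// The iterator holds a pointer to the flyweight owned by the rule,
// so copying or comparing iterators is cheap - no token copies.
struct token_iterator {
    Token* tok;  // not owned; the rule owns the flyweight
    Token& operator*() const { return *tok; }
    bool operator==(const token_iterator& rhs) const { return tok == rhs.tok; }
};
```

In this sketch, options 1 and 2 never allocate at all; only option 3 pays for a heap allocation, and only when the caller explicitly asks for a persistent copy via clone().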
Since our first priority on this project is speed, I do not want the framework to slow down the DFA at all. The critical path in execution should definitely be the DFA.

Dave Handley