Re: Interest in a fast lexical analyser compatible with Spirit

Hartmut Kaiser wrote:
Since tokens have to be copied around in a typesafe manner I'd have expected to have something similar to the boost::variant, but maybe I've got you wrong... How do you plan to achieve this? Copying through a base class won't work AFAICT?
In my experience, copying tokens around is a major source of slow-down in lexers. In the last lexer I wrote, I had to abandon the token class and use a C-style union containing ints, doubles, and std::string pointers. At the time my company restricted how much of Boost I could use, so Boost.Variant wasn't an option. Since then I've put a lot of thought into this problem and come up with one solution. We haven't completed integration testing with Spirit and other libraries using this technique yet (and I can foresee some issues already :-( ).

The solution uses flyweighted tokens. The actual token lives in the rule; during parsing, if the token has an assigned value, that value is copied into the flyweight, and the token is then passed around by reference/pointer. When you come to use the token in the parser you have a number of options:

1) Use it and throw it away as soon as you call the ++ operator on the token_iterator.
2) Get the data out and do something with it - for example, put it in a list/deque, etc.
3) Copy the token and use the copied version.

For the final option, the token supports a virtual clone function that creates copies of the flyweighted version for use elsewhere. One implementation point: the iterators contain a pointer to the token, so that an iterator can be copied, compared for equality, and so on.

There are other potential solutions; however, any of them leaves the problem of how to allocate and deallocate tokens on the heap very fast. I considered Alexandrescu's small-object allocator from the Loki library, and have also briefly looked at Boost.Pool, but I am not comfortable with either of these, because I am convinced they would add too much of a burden to token creation. In my last lexer I saw a speed improvement of one to two orders of magnitude when I removed the token class and replaced it with a union.
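To make the flyweight idea above concrete, here is a minimal sketch. All the names (Token, IntToken, token_iterator, clone) are illustrative assumptions on my part, not the actual interface - the point is only that the flyweight lives in the rule, the iterator holds a non-owning pointer to it, and clone() produces a heap copy for callers who want to keep a token past the next ++:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical sketch of the flyweighted-token scheme described above.
struct Token {
    virtual ~Token() {}
    // clone() makes a heap copy for option 3: keeping the token
    // beyond the next ++ on the token_iterator.
    virtual Token* clone() const = 0;
    virtual int id() const = 0;
};

struct IntToken : Token {
    int id_;
    long value;  // assigned value, copied into the flyweight on a match
    explicit IntToken(int id) : id_(id), value(0) {}
    Token* clone() const { return new IntToken(*this); }
    int id() const { return id_; }
};

// The iterator holds a pointer to the flyweight owned by the rule,
// so copying or comparing iterators is cheap - no token copies.
struct token_iterator {
    Token* tok;  // not owned; the rule owns the flyweight
    Token& operator*() const { return *tok; }
    bool operator==(const token_iterator& rhs) const { return tok == rhs.tok; }
};
```

In this sketch, options 1 and 2 never allocate at all; only option 3 pays for a heap allocation, and only when the caller explicitly asks for a persistent copy via clone().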
Since our first priority on this project is speed, I do not want the framework to slow down the DFA at all. The critical path in execution should definitely be the DFA.

Dave Handley