
On 08/06/10 01:34, Eric Niebler wrote:
On 6/7/2010 8:01 PM, Joel de Guzman wrote:
On 6/8/10 12:10 AM, Eric Niebler wrote:
An interesting research project would be to investigate whether xpressive's hybrid static/dynamic NFA design can be applied to DFAs as well, with similar perf wins. This would be a non-trivial undertaking, like completely reimplementing xpressive.
Just to remind you, there's a very good DFA implementation underneath Spirit Lex. The interface is not perfect (Ben has requested some input in that area), but the implementation is very good. Spirit Lex is Proto on top of lexertl (Ben's work). We find it to be one of the fastest lexers, close in performance to the fastest (RE2C), which is a code generator that emits a humongous switch statement.
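For context, a Spirit Lex token definition has roughly this shape (a sketch following the library's word_count examples; the token ids and rule set here are illustrative, not from Spirit's distribution):

  #include <boost/spirit/include/lex_lexertl.hpp>
  #include <iostream>
  #include <string>

  namespace lex = boost::spirit::lex;

  enum token_ids { ID_WORD = 1000, ID_WS };

  // The token definitions are regex strings handed to lexertl,
  // which builds its DFA from them at runtime.
  template <typename Lexer>
  struct word_tokens : lex::lexer<Lexer>
  {
      word_tokens()
      {
          this->self.add
              ("[a-z]+",        ID_WORD)  // lower-case words
              ("[ \\t\\n\\r]+", ID_WS)    // whitespace
          ;
      }
  };

  // Print the id of each token as it is matched; returning true
  // tells tokenize() to keep going.
  struct print_id
  {
      typedef bool result_type;
      template <typename Token>
      bool operator()(Token const& t) const
      {
          std::cout << t.id() << '\n';
          return true;
      }
  };

  int main()
  {
      std::string s("foo bar keyword");
      word_tokens<lex::lexertl::lexer<> > l;
      char const* first = s.c_str();
      char const* last  = first + s.size();
      lex::tokenize(first, last, l, print_id());
  }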
I haven't forgotten. IIRC, lexertl is like dynamic xpressive: it accepts strings as input and builds an FSA. I was speculating about what the lexertl equivalent of static xpressive would look like, and whether it would be even faster. Then static xpressive could use static lexertl for simple regexes, and dynamic xpressive could use the dynamic one.
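For anyone following along, the static/dynamic split in xpressive looks like this (standard xpressive usage along the lines of its documentation; the pattern itself is just an example):

  #include <boost/xpressive/xpressive.hpp>
  #include <iostream>
  #include <string>

  using namespace boost::xpressive;

  int main()
  {
      std::string hello("hello world!");
      smatch what;

      // Dynamic: the pattern is a string parsed into an NFA at
      // runtime -- the analogue of what lexertl does today.
      sregex dyn = sregex::compile("(\\w+) (\\w+)!");

      // Static: the pattern is an expression template, so its
      // structure is known to the compiler.
      sregex stat = (s1 = +_w) >> ' ' >> (s2 = +_w) >> '!';

      // Both recognize the same language:
      if (regex_match(hello, what, dyn) && regex_match(hello, what, stat))
          std::cout << what[1] << ", " << what[2] << '\n';
  }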
I once wrote a compile-time lexer compiler, which seems to be the sort of thing you're suggesting. It worked like this:

  typedef lexer<
      rule<literal<'k','e','y','w','o','r','d'>, construct<KEYWORD> >,
      rule<one_or_more<range<'a','z'> >, construct<ID, mpl::true_> >,
      rule<alternative_literal<' ','\n','\t','\r'>, discard >
    > keyword_lexer;

  typedef keyword_lexer::token_type token;
  typedef keyword_lexer::token_id_map id_map;
  BOOST_MPL_ASSERT_RELATION((mpl::at<id_map, KEYWORD>::type::value), ==, 0);
  BOOST_MPL_ASSERT_RELATION((mpl::at<id_map, ID>::type::value), ==, 1);

  string s("foo bar\t\tkeyword \n\r baz keywordo ");
  keyword_lexer l(s);
  vector<token::ptr> tokens;
  copy(l.begin(), l.end(), back_inserter(tokens));

  BOOST_REQUIRE_EQUAL(tokens.size(), 5);
  BOOST_CHECK_EQUAL(tokens[0]->id(), l.get_id<ID>());
  BOOST_CHECK_EQUAL(tokens[2]->id(), l.get_id<KEYWORD>());
  BOOST_CHECK_EQUAL(tokens[4]->id(), l.get_id<ID>());

This defines a lexer recognizing whitespace-separated lower-case words, but treating the word 'keyword' as special. Each rule of the lexer is a regex followed by what to do when that regex is matched. The first rule constructs a KEYWORD token, the second constructs an ID token (the mpl::true_ means the range which matched the regex is passed to the constructor), and the third matches without creating a token.

The library takes this lexer specification, turns it into an NFA, transforms that into a DFA, and encodes that as a transition table in an array, *all at compile time*. On reflection, it probably would have been better (and faster) to use the giant-switch-statement approach rather than an array-based transition table.

Then we give the lexer a string to parse (any forward range of characters would do) and we can access the parsed tokens (in a type-erased fashion) by treating the lexer itself as a range. Obviously, going to all this trouble to do everything at compile time and then offering only a type-erased interface is a bit silly, but there are other ways (you can give the lexer a UnaryFunction which it calls on each token).

The downside with this approach, of course, is the absolutely massive compile-time cost. Compiling the above example with g++ 4.4.3 takes over 6 minutes and over 1.3GiB of memory (and this is after I've put in effort to optimize it). The time I could live with, but the memory requirements are absurd. Any lexer of real-world complexity would be quite impossible to compile. However, it might just be possible to use it for simple regexes rather than full-blown lexers. Anyone think that might be worth investigating?

John Bytheway
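To make the giant-switch technique mentioned above concrete, a minimal hand-rolled sketch of a switch-encoded DFA for the same keyword/identifier language might look as follows. This is illustrative code only, not output from John's library or from RE2C; to stay short it carries a prefix index through the word states, where a generated lexer would unroll 'keyword' into one state per matched character:

  #include <cstddef>
  #include <iostream>
  #include <string>
  #include <utility>
  #include <vector>

  enum token_id { KEYWORD, ID };

  // Switch-encoded DFA: 'keyword' -> KEYWORD, [a-z]+ -> ID,
  // whitespace is discarded. Anything else is skipped (no error
  // handling in this sketch).
  std::vector<std::pair<token_id, std::string> >
  tokenize(std::string const& in)
  {
      static char const kw[] = "keyword";
      std::vector<std::pair<token_id, std::string> > out;
      enum state { START, WORD_K, WORD } s = START;
      std::size_t begin = 0, k = 0;  // token start; matched prefix of kw

      for (std::size_t i = 0; i <= in.size(); ++i)
      {
          char c = i < in.size() ? in[i] : '\0';  // '\0' ends the input
          bool lower = c >= 'a' && c <= 'z';
          switch (s)
          {
          case START:   // between tokens; whitespace is dropped here
              if (lower) { begin = i; k = (c == kw[0]); s = k ? WORD_K : WORD; }
              break;
          case WORD_K:  // the word so far is a prefix of "keyword"
              if (lower) { if (kw[k] == c) ++k; else s = WORD; }
              else {
                  out.push_back(std::make_pair(
                      k == sizeof(kw) - 1 ? KEYWORD : ID,
                      in.substr(begin, i - begin)));
                  s = START;
              }
              break;
          case WORD:    // a word that cannot be "keyword"
              if (!lower) {
                  out.push_back(std::make_pair(ID, in.substr(begin, i - begin)));
                  s = START;
              }
              break;
          }
      }
      return out;
  }

  int main()
  {
      std::vector<std::pair<token_id, std::string> > toks =
          tokenize("foo bar\t\tkeyword \n\r baz keywordo ");
      for (std::size_t i = 0; i != toks.size(); ++i)
          std::cout << (toks[i].first == KEYWORD ? "KEYWORD" : "ID")
                    << ": " << toks[i].second << '\n';  // five tokens
  }

In a table-driven variant, the switch bodies collapse into a two-dimensional array indexed by state and character class, which is essentially what the compile-time library above encodes into a static array.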