
On 08/06/10 01:34, Eric Niebler wrote:
On 6/7/2010 8:01 PM, Joel de Guzman wrote:
On 6/8/10 12:10 AM, Eric Niebler wrote:
An interesting research project would be to investigate whether xpressive's hybrid static/dynamic NFA design can be applied to DFAs as well, with similar perf wins. This would be a non-trivial undertaking, like completely reimplementing xpressive.
Just to remind you, there's a very good DFA implementation underneath Spirit Lex. The interface is not perfect (Ben has requested some input in that area), but the implementation is very good. Spirit Lex is Proto on top of lexertl (Ben's work). We find it to be one of the fastest lexers, close in performance to the fastest (RE2C), which is a code generator that emits a humongous switch statement.
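For context, a Spirit Lex token definition has roughly this shape (a sketch following the library's word_count examples; the token ids and rule set here are illustrative, not from Spirit's distribution):

  #include <boost/spirit/include/lex_lexertl.hpp>
  #include <iostream>
  #include <string>

  namespace lex = boost::spirit::lex;

  enum token_ids { ID_WORD = 1000, ID_WS };

  // The token definitions are regex strings handed to lexertl,
  // which builds its DFA from them at runtime.
  template <typename Lexer>
  struct word_tokens : lex::lexer<Lexer>
  {
      word_tokens()
      {
          this->self.add
              ("[a-z]+",        ID_WORD)  // lower-case words
              ("[ \\t\\n\\r]+", ID_WS)    // whitespace
          ;
      }
  };

  // Print the id of each token as it is matched; returning true
  // tells tokenize() to keep going.
  struct print_id
  {
      typedef bool result_type;
      template <typename Token>
      bool operator()(Token const& t) const
      {
          std::cout << t.id() << '\n';
          return true;
      }
  };

  int main()
  {
      std::string s("foo bar keyword");
      word_tokens<lex::lexertl::lexer<> > l;
      char const* first = s.c_str();
      char const* last  = first + s.size();
      lex::tokenize(first, last, l, print_id());
  }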
I haven't forgotten. IIRC, lexertl is like dynamic xpressive: it accepts strings as input and builds an FSA. I was speculating about what the lexertl equivalent of static xpressive would look like, and whether it would be even faster. Then static xpressive could use static lexertl for simple regexes, and dynamic xpressive could use the dynamic one.
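For anyone following along, the static/dynamic split in xpressive looks like this (standard xpressive usage along the lines of its documentation; the pattern itself is just an example):

  #include <boost/xpressive/xpressive.hpp>
  #include <iostream>
  #include <string>

  using namespace boost::xpressive;

  int main()
  {
      std::string hello("hello world!");
      smatch what;

      // Dynamic: the pattern is a string parsed into an NFA at
      // runtime -- the analogue of what lexertl does today.
      sregex dyn = sregex::compile("(\\w+) (\\w+)!");

      // Static: the pattern is an expression template, so its
      // structure is known to the compiler.
      sregex stat = (s1 = +_w) >> ' ' >> (s2 = +_w) >> '!';

      // Both recognize the same language:
      if (regex_match(hello, what, dyn) && regex_match(hello, what, stat))
          std::cout << what[1] << ", " << what[2] << '\n';
  }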
I once wrote a compile-time lexer compiler, which seems to be the sort of thing you're suggesting. It worked like this:

  typedef lexer<
      rule<literal<'k','e','y','w','o','r','d'>, construct<KEYWORD> >,
      rule<one_or_more<range<'a','z'> >, construct<ID, mpl::true_> >,
      rule<alternative_literal<' ','\n','\t','\r'>, discard >
    > keyword_lexer;

  typedef keyword_lexer::token_type token;
  typedef keyword_lexer::token_id_map id_map;
  BOOST_MPL_ASSERT_RELATION((mpl::at<id_map, KEYWORD>::type::value), ==, 0);
  BOOST_MPL_ASSERT_RELATION((mpl::at<id_map, ID>::type::value), ==, 1);

  string s("foo bar\t\tkeyword \n\r baz keywordo ");
  keyword_lexer l(s);
  vector<token::ptr> tokens;
  copy(l.begin(), l.end(), back_inserter(tokens));

  BOOST_REQUIRE_EQUAL(tokens.size(), 5);
  BOOST_CHECK_EQUAL(tokens[0]->id(), l.get_id<ID>());
  BOOST_CHECK_EQUAL(tokens[2]->id(), l.get_id<KEYWORD>());
  BOOST_CHECK_EQUAL(tokens[4]->id(), l.get_id<ID>());

This defines a lexer recognizing whitespace-separated lower-case words, but treating the word 'keyword' as special. Each rule of the lexer is a regex followed by what to do when that regex is matched. The first rule constructs a KEYWORD token, the second constructs an ID token (the mpl::true_ means the range which matched the regex is passed to the constructor), and the third matches without creating a token.

The library takes this lexer specification, turns it into an NFA, transforms that into a DFA, and encodes that as a transition table in an array, *all at compile time*. On reflection, it probably would have been better (and faster) to use the giant-switch-statement approach rather than an array-based transition table.

Then we give the lexer a string to parse (any forward range of characters would do) and we can access the parsed tokens (in a type-erased fashion) by treating the lexer itself as a range. Obviously, going to all this trouble to do everything at compile time and then offering only a type-erased interface is a bit silly, but there are other ways (you can give the lexer a UnaryFunction which it calls on each token).

The downside with this approach, of course, is the absolutely massive compile-time cost. Compiling the above example with g++ 4.4.3 takes over 6 minutes and over 1.3GiB of memory (and this is after I've put in effort to optimize it). The time I could live with, but the memory requirements are absurd. Any lexer of real-world complexity would be quite impossible to compile. However, it might just be possible to use it for simple regexes rather than full-blown lexers. Anyone think that might be worth investigating?

John Bytheway
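To make the giant-switch technique mentioned above concrete, a minimal hand-rolled sketch of a switch-encoded DFA for the same keyword/identifier language might look as follows. This is illustrative code only, not output from John's library or from RE2C; to stay short it carries a prefix index through the word states, where a generated lexer would unroll 'keyword' into one state per matched character:

  #include <cstddef>
  #include <iostream>
  #include <string>
  #include <utility>
  #include <vector>

  enum token_id { KEYWORD, ID };

  // Switch-encoded DFA: 'keyword' -> KEYWORD, [a-z]+ -> ID,
  // whitespace is discarded. Anything else is skipped (no error
  // handling in this sketch).
  std::vector<std::pair<token_id, std::string> >
  tokenize(std::string const& in)
  {
      static char const kw[] = "keyword";
      std::vector<std::pair<token_id, std::string> > out;
      enum state { START, WORD_K, WORD } s = START;
      std::size_t begin = 0, k = 0;  // token start; matched prefix of kw

      for (std::size_t i = 0; i <= in.size(); ++i)
      {
          char c = i < in.size() ? in[i] : '\0';  // '\0' ends the input
          bool lower = c >= 'a' && c <= 'z';
          switch (s)
          {
          case START:   // between tokens; whitespace is dropped here
              if (lower) { begin = i; k = (c == kw[0]); s = k ? WORD_K : WORD; }
              break;
          case WORD_K:  // the word so far is a prefix of "keyword"
              if (lower) { if (kw[k] == c) ++k; else s = WORD; }
              else {
                  out.push_back(std::make_pair(
                      k == sizeof(kw) - 1 ? KEYWORD : ID,
                      in.substr(begin, i - begin)));
                  s = START;
              }
              break;
          case WORD:    // a word that cannot be "keyword"
              if (!lower) {
                  out.push_back(std::make_pair(ID, in.substr(begin, i - begin)));
                  s = START;
              }
              break;
          }
      }
      return out;
  }

  int main()
  {
      std::vector<std::pair<token_id, std::string> > toks =
          tokenize("foo bar\t\tkeyword \n\r baz keywordo ");
      for (std::size_t i = 0; i != toks.size(); ++i)
          std::cout << (toks[i].first == KEYWORD ? "KEYWORD" : "ID")
                    << ": " << toks[i].second << '\n';  // five tokens
  }

In a table-driven variant, the switch bodies collapse into a two-dimensional array indexed by state and character class, which is essentially what the compile-time library above encodes into a static array.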