[regex, xpressive] interesting(?) perf benchmark

I'm always suspect of benchmarks, but I figured I'd post this anyway.

http://astl.sourceforge.net/bench.7.html

It seems someone wrote a generic automata library and a regex engine on top of it, and compared its performance to boost.regex, xpressive, pcre, and greta. Of course, his library wins handily. (The purpose of benchmarking is to make yourself look good, right? ;-) He doesn't say that (a) his engine is a DFA, and he's comparing to NFAs; and (b) he's only implemented a tiny subset of regex features people would expect; e.g., no captures even, let alone backreferences: things that can be tricky or impossible to implement with DFAs. He also doesn't say what version of Boost he's using (tests run Nov. 2009) or whether he's using static or dynamic xpressive (probably dynamic; static can be ~15% faster).

Anyway, it's not my intention to kick off yet another costly escalation in the regex war. God help me, I don't have the time. I post it here because I haven't come across a comparison of these four libraries by a third party before. Some quick observations:

- I'm pleased that on the balance, xpressive fares well here relative to the other NFAs. In the end, though, not much separates best from worst.
- All the NFA libraries seem to have large overhead that becomes very noticeable when matching short strings. I don't know why.
- I'm gobsmacked that my VERY OLD library greta handily beats pcre, boost.regex /and/ xpressive on short matches. Failing to beat yourself in any perf benchmark is kind of embarrassing.
- I'm tickled that for long matches, both xpressive and boost.regex (NFAs) beat his library (DFA). How'd that happen?
- I find it interesting that both xpressive and boost.regex have large discontinuities in the chart at the bottom; the others don't. Huh.

When more std library vendors start shipping regex libraries, I'll be interested in seeing this perf matrix expanded.

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com

I'm always suspect of benchmarks, but I figured I'd post this anyway.
LOL, nice to see my own benchmark suite used against me!
Anyway, it's not my intention to kick off yet another costly escalation in the regex war. God help me, I don't have the time. I post it here because I haven't come across a comparison of these four libraries by a third party before. Some quick observations:
- I'm pleased that on the balance, xpressive fares well here relative to the other NFAs. In the end, though, not much separates best from worst.
- All the NFA libraries seem to have large overhead that becomes very noticeable when matching short strings. I don't know why.
I spent some time trying, mostly in vain, to hammer this down lower with current Boost.Regex - it turns out that most of it is down to initialisation of state - Perl-style regexes typically have a lot of state to initialise. The thing to remember is that for short matches against short texts, the cost of a match is basically the cost of a function call for DFAs. Or to put it another way, even if the NFAs are 10x slower on these short tests, 10x almost nothing is still almost nothing (though I accept there are certain outlier use cases where this overhead becomes an issue).
- I'm gobsmacked that my VERY OLD library greta handily beats pcre, boost.regex /and/ xpressive on short matches. Failing to beat yourself in any perf benchmark is kind of embarrassing.
Probably more features in xpressive to set up?
- I'm tickled that for long matches, both xpressive and boost.regex (NFAs) beat his library (DFA). How'd that happen?
Maybe reflects the preoccupations of their authors?
- I find it interesting that both xpressive and boost.regex have large discontinuities in the chart at the bottom; the others don't. Huh.
Weird indeed - I can't really imagine why that would happen, unless there's some kind of cache-miss effect going on?

Cheers, John.

Le 07/06/2010 04:58, Eric Niebler wrote:
I'm always suspect of benchmarks, but I figured I'd post this anyway.
http://astl.sourceforge.net/bench.7.html.
It seems someone wrote a generic automata library and a regex engine on top of it, and compared its performance to boost.regex, xpressive, pcre, and greta. Of course, his library wins handily. (The purpose of benchmarking is to make yourself look good, right? ;-) He doesn't say that (a) his engine is a DFA, and he's comparing to NFAs; and (b) he's only implemented a tiny subset of regex features people would expect; e.g., no captures even, let alone backreferences: things that can be tricky or impossible to implement with DFAs. He also doesn't say what version of Boost he's using (tests run Nov. 2009) or whether he's using static or dynamic xpressive (probably dynamic; static can be ~15% faster).
Just an idea: for static xpressive, couldn't you detect at compile-time that the expression is truly regular, and use a DFA in that case?

On 6/7/2010 5:50 AM, Mathias Gaunard wrote:
Le 07/06/2010 04:58, Eric Niebler wrote:
I'm always suspect of benchmarks, but I figured I'd post this anyway. <snip>
Just an idea: for static xpressive, couldn't you detect at compile-time that the expression is truly regular, and use a DFA in that case?
Oh, sure! Why don't you submit a DFA and I'll use it in xpressive? ;-)

After nosing about on the internet a bit more, I found this interesting comparison:

http://shootout.alioth.debian.org/u32/benchmark.php?test=regexdna&lang=all

Here we see every language compared on how well it can perform on a particular regex task. The top performer is <drumroll> Google's JavaScript V8 engine! Wow. C++ is in 5th place. The fastest C++ program submitted to the competition uses static xpressive <pats own back>. I'm not so upset about being beaten by V8. It adaptively improves its native codegen *at runtime*. What really bugs me is that we're skunked by a C library: Tcl. Grrrr.

I've read a bit about Tcl's regex library; it does what Mathias is suggesting: it implements both a DFA and an NFA, analyzes the pattern, and chooses which to use. I've known for a while that this is the way forward, but I just don't have the time for that. (Wasn't there a GSoC project to do that for Boost.Regex?)

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com

Eric Niebler wrote:
On 6/7/2010 5:50 AM, Mathias Gaunard wrote:
Le 07/06/2010 04:58, Eric Niebler wrote:
I'm always suspect of benchmarks, but I figured I'd post this anyway. <snip>
Just an idea: for static xpressive, couldn't you detect at compile-time that the expression is truly regular, and use a DFA in that case?
Oh, sure! Why don't you submit a DFA and I'll use it in xpressive? ;-)
How about a warning "you may not be using the optimal tool for the job" until then? Are the finite state machine libraries in Boost usable for this at all?

On 6/7/2010 9:20 AM, Mathias Gaunard wrote:
Eric Niebler wrote:
On 6/7/2010 5:50 AM, Mathias Gaunard wrote:
Le 07/06/2010 04:58, Eric Niebler wrote:
I'm always suspect of benchmarks, but I figured I'd post this anyway. <snip>
Just an idea: for static xpressive, couldn't you detect at compile-time that the expression is truly regular, and use a DFA in that case?
Oh, sure! Why don't you submit a DFA and I'll use it in xpressive? ;-)
How about a warning "you may not be using the optimal tool for the job" until then?
I'm having a hard time getting excited about that.
Are the finite state machine libraries in Boost usable for this at all?
I don't know. At BoostCon, Christophe presented some interesting results pitting MSM against xpressive for some simple regexing.

An interesting research project would be to investigate whether xpressive's hybrid static/dynamic NFA design can be applied to DFAs as well, with similar perf wins. This would be a non-trivial undertaking, like completely reimplementing xpressive.

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com

On 6/8/10 12:10 AM, Eric Niebler wrote:
On 6/7/2010 9:20 AM, Mathias Gaunard wrote:
Eric Niebler wrote:
On 6/7/2010 5:50 AM, Mathias Gaunard wrote:
Le 07/06/2010 04:58, Eric Niebler wrote:
I'm always suspect of benchmarks, but I figured I'd post this anyway. <snip>
Just an idea: for static xpressive, couldn't you detect at compile-time that the expression is truly regular, and use a DFA in that case?
Oh, sure! Why don't you submit a DFA and I'll use it in xpressive? ;-)
How about a warning "you may not be using the optimal tool for the job" until then?
I'm having a hard time getting excited about that.
Are the finite state machine libraries in Boost usable for this at all?
I don't know. At BoostCon, Christophe presented some interesting results pitting MSM against xpressive for some simple regexing. An interesting research project would be to investigate whether xpressive's hybrid static/dynamic NFA design can be applied to DFAs as well, with similar perf wins. This would be a non-trivial undertaking, like completely reimplementing xpressive.
Just to remind you, there's a very good DFA implementation underneath Spirit Lex. The interface is not perfect (Ben has requested some input in that area), but the implementation is very good. Spirit Lex is Proto on top of lexertl (Ben's work). We find it to be one of the fastest lexers, close in performance to the fastest (RE2C), which is a code generator that emits a humongous switch.

Regards,

--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net

On 6/7/2010 8:01 PM, Joel de Guzman wrote:
On 6/8/10 12:10 AM, Eric Niebler wrote:
An interesting research project would be to investigate whether xpressive's hybrid static/dynamic NFA design can be applied to DFAs as well, with similar perf wins. This would be a non-trivial undertaking, like completely reimplementing xpressive.
Just to remind you, there's a very good DFA implementation underneath Spirit Lex. The interface is not perfect (Ben has requested some input in that area), but the implementation is very good. Spirit Lex is Proto on top of lexertl (Ben's work). We find it to be one of the fastest lexers, close in performance to the fastest (RE2C), which is a code generator that emits a humongous switch.
I haven't forgotten. IIRC, lexertl is like dynamic xpressive: it accepts strings as input and builds an FSA. I was speculating about what the lexertl equivalent of static xpressive would look like and whether it would be even faster. Then static xpressive could use static lexertl for simple regexes, and dynamic xpressive could use the dynamic one.

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com

On 6/8/10 8:34 AM, Eric Niebler wrote:
On 6/7/2010 8:01 PM, Joel de Guzman wrote:
On 6/8/10 12:10 AM, Eric Niebler wrote:
An interesting research project would be to investigate whether xpressive's hybrid static/dynamic NFA design can be applied to DFAs as well, with similar perf wins. This would be a non-trivial undertaking, like completely reimplementing xpressive.
Just to remind you, there's a very good DFA implementation underneath Spirit Lex. The interface is not perfect (Ben has requested some input in that area), but the implementation is very good. Spirit Lex is Proto on top of lexertl (Ben's work). We find it to be one of the fastest lexers, close in performance to the fastest (RE2C), which is a code generator that emits a humongous switch.
I haven't forgotten. IIRC, lexertl is like dynamic xpressive: it accepts strings as input and builds an FSA. I was speculating about what the lexertl equivalent of static xpressive would look like and whether it would be even faster. Then static xpressive could use static lexertl for simple regexes, and dynamic xpressive could use the dynamic one.
That's an awesome idea!

Regards,

--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net

On 08/06/10 01:34, Eric Niebler wrote:
On 6/7/2010 8:01 PM, Joel de Guzman wrote:
On 6/8/10 12:10 AM, Eric Niebler wrote:
An interesting research project would be to investigate whether xpressive's hybrid static/dynamic NFA design can be applied to DFAs as well, with similar perf wins. This would be a non-trivial undertaking, like completely reimplementing xpressive.
Just to remind you, there's a very good DFA implementation underneath Spirit Lex. The interface is not perfect (Ben has requested some input in that area), but the implementation is very good. Spirit Lex is Proto on top of lexertl (Ben's work). We find it to be one of the fastest lexers, close in performance to the fastest (RE2C), which is a code generator that emits a humongous switch.
I haven't forgotten. IIRC, lexertl is like dynamic xpressive: it accepts strings as input and builds an FSA. I was speculating about what the lexertl equivalent of static xpressive would look like and whether it would be even faster. Then static xpressive could use static lexertl for simple regexes, and dynamic xpressive could use the dynamic one.
I once wrote a compile-time lexer compiler, which seems to be the sort of thing you're suggesting. It worked like this:

    typedef lexer<
        rule<literal<'k','e','y','w','o','r','d'>, construct<KEYWORD> >,
        rule<one_or_more<range<'a','z'> >, construct<ID, mpl::true_> >,
        rule<alternative_literal<' ','\n','\t','\r'>, discard >
      > keyword_lexer;

    typedef keyword_lexer::token_type token;
    typedef keyword_lexer::token_id_map id_map;

    BOOST_MPL_ASSERT_RELATION(
        (mpl::at<id_map, KEYWORD>::type::value), ==, 0);
    BOOST_MPL_ASSERT_RELATION((mpl::at<id_map, ID>::type::value), ==, 1);

    string s("foo bar\t\tkeyword \n\r baz keywordo ");
    keyword_lexer l(s);
    vector<token::ptr> tokens;
    copy(l.begin(), l.end(), back_inserter(tokens));
    BOOST_REQUIRE_EQUAL(tokens.size(), 5);
    BOOST_CHECK_EQUAL(tokens[0]->id(), l.get_id<ID>());
    BOOST_CHECK_EQUAL(tokens[2]->id(), l.get_id<KEYWORD>());
    BOOST_CHECK_EQUAL(tokens[4]->id(), l.get_id<ID>());

This defines a lexer recognizing whitespace-separated lower-case words, but treating the word 'keyword' as special. Each rule of the lexer is a regex followed by what to do when that regex is matched. The first rule constructs a KEYWORD token, the second constructs an ID token (and the mpl::true_ means it passes the range which matched the regex to the constructor), the third matches without creating a token.

The library will take this lexer specification and turn it into an NFA, then transform that into a DFA, then encode that as a transition table in an array, *all at compile time*. On reflection, it probably would have been better (and faster) to use the giant-switch-statement approach, rather than an array-based transition table.

Then we give the lexer a string to parse (any forward range of characters would do) and we can access the parsed tokens (in a type-erased fashion) by treating the lexer itself as a range. Obviously going to all this trouble to do everything at compile time and then offering only a type-erased interface is a bit silly, but there are other ways (you can give the lexer a UnaryFunction which it calls on the next token).

The down side with this approach, of course, is the absolutely massive compile-time cost. Compiling the above example with g++ 4.4.3 takes over 6 minutes and over 1.3GiB of memory (and this is after I've put in effort to optimize it). The time I could live with, but the memory requirements are absurd. Any lexer of real-world complexity would be quite impossible to compile.

However, it might just be possible to use it for simple regexes, rather than full-blown lexers. Anyone think that might be worth investigating?

John Bytheway

On 6/8/2010 4:22 AM, John Bytheway wrote:
On 08/06/10 01:34, Eric Niebler wrote:
I haven't forgotten. IIRC, lexertl is like dynamic xpressive: it accepts strings as input and builds an FSA. I was speculating about what the lexertl equivalent of static xpressive would look like and whether it would be even faster. Then static xpressive could use static lexertl for simple regexes, and dynamic xpressive could use the dynamic one.
I once wrote a compile-time lexer compiler, which seems to be the sort of thing you're suggesting. It worked like this:
typedef lexer< rule<literal<'k','e','y','w','o','r','d'>, construct<KEYWORD> >,
mpl::string could help you here.
    rule<one_or_more<range<'a','z'> >, construct<ID, mpl::true_> >,
    rule<alternative_literal<' ','\n','\t','\r'>, discard >
  > keyword_lexer;
<snip>
This defines a lexer recognizing whitespace-separated lower-case words, but treating the word 'keyword' as special. Each rule of the lexer is a regex followed by what to do when that regex is matched. The first rule constructs a KEYWORD token, the second constructs an ID token (and the mpl::true_ means it passes the range which matched the regex to the constructor), the third matches without creating a token.
The library will take this lexer specification and turn it into an NFA, then transform that into a DFA, then encode that as a transition table in an array, *all at compile time*.
Wow! This is something like what I had in mind. <snip>
The down side with this approach, of course, is the absolutely massive compile-time cost. Compiling the above example with g++ 4.4.3 takes over 6 minutes and over 1.3GiB of memory (and this is after I've put in effort to optimize it). The time I could live with, but the memory requirements are absurd. Any lexer of real-world complexity would be quite impossible to compile.
Ouch, that's a serious problem. Maybe there is a more efficient way to get to a DFA. Or maybe the world isn't ready for compile-time DFAs.
However, it might just be possible to use it for simple regexes, rather than full-blown lexers. Anyone think that might be worth investigating?
If the compile times are that bad, I wouldn't be able to use it. I can't use it as-is regardless, because static xpressive doesn't have compile-time access to the string literals; they're still char*.

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com

On 08/06/10 14:13, Eric Niebler wrote:
On 6/8/2010 4:22 AM, John Bytheway wrote:
I once wrote a compile-time lexer compiler, which seems to be the sort of thing you're suggesting. It worked like this:
typedef lexer< rule<literal<'k','e','y','w','o','r','d'>, construct<KEYWORD> >,
mpl::string could help you here.
Indeed. I think it didn't exist when I wrote this code.
The library will take this lexer specification and turn it into an NFA, then transform that into a DFA, then encode that as a transition table in an array, *all at compile time*.
Wow! This is something like what I had in mind.
Well, I'm glad you seem impressed. It wasn't trivial to make this work :).
<snip>
The down side with this approach, of course, is the absolutely massive compile-time cost. Compiling the above example with g++ 4.4.3 takes over 6 minutes and over 1.3GiB of memory (and this is after I've put in effort to optimize it). The time I could live with, but the memory requirements are absurd. Any lexer of real-world complexity would be quite impossible to compile.
Ouch, that's a serious problem. Maybe there is a more efficient way to get to a DFA. Or maybe the world isn't ready for compile-time DFAs.
I suspect there's a decent constant factor that could be squeezed out, but I don't think there are any serious algorithmic improvements. OTOH, metaprogramming algorithmic complexity is confusing at the best of times, so I may be wrong. Once g++ supports C++0x constexpr functions I intend to revisit this; they might well provide substantial improvement. That said, I'll try clang now and see how that fares...
However, it might just be possible to use it for simple regexes, rather than full-blown lexers. Anyone think that might be worth investigating?
If the compile times are that bad, I wouldn't be able to use it. I can't use it as-is regardless, because static xpressive doesn't have compile-time access to the string literals; they're still char*.
Yeah, that is an annoyance. I don't think there's anything to be done about it, though.

John Bytheway

John Bytheway wrote:
The library will take this lexer specification and turn it into an NFA, then transform that into a DFA, then encode that as a transition table in an array, *all at compile time*. On reflection, it probably would have been better (and faster) to use the giant-switch-statement approach, rather than an array-based transition table.
Boost.Switch (from Steven Watanabe, if I remember correctly) probably has limitations of its own.

AMDG John Bytheway wrote:
The library will take this lexer specification and turn it into an NFA, then transform that into a DFA, then encode that as a transition table in an array, *all at compile time*. On reflection, it probably would have been better (and faster) to use the giant-switch-statement approach, rather than an array-based transition table.
In my experience, array lookups tend to be faster than switch statements.

In Christ,
Steven Watanabe

On 08/06/10 15:03, Steven Watanabe wrote:
AMDG
John Bytheway wrote:
The library will take this lexer specification and turn it into an NFA, then transform that into a DFA, then encode that as a transition table in an array, *all at compile time*. On reflection, it probably would have been better (and faster) to use the giant-switch-statement approach, rather than an array-based transition table.
In my experience, array lookups tend to be faster than switch statements.
Shame. It would have made the metaprogramming simpler (I think); the process of turning a 2-dimensional compile-time array into a 2-dimensional runtime array requires copious preprocessor metaprogramming (my first implementation took 20 minutes to preprocess! I improved it substantially, though).

John Bytheway

John Bytheway wrote: <snip>
The down side with this approach, of course, is the absolutely massive compile-time cost. Compiling the above example with g++ 4.4.3 takes over 6 minutes and over 1.3GiB of memory (and this is after I've put in effort to optimize it). The time I could live with, but the memory requirements are absurd. Any lexer of real-world complexity would be quite impossible to compile.
Could you try it with gcc 4.3? 4.4 has a known regression when it comes to compiling meta-C++ code. Just ask Christophe.

Andy.

On 08/06/10 16:16, Andy Venikov wrote:
John Bytheway wrote: <snip>
The down side with this approach, of course, is the absolutely massive compile-time cost. Compiling the above example with g++ 4.4.3 takes over 6 minutes and over 1.3GiB of memory (and this is after I've put in effort to optimize it). The time I could live with, but the memory requirements are absurd. Any lexer of real-world complexity would be quite impossible to compile.
Could you try it with gcc 4.3? 4.4 has a known regression when it comes to compiling meta-c++ code.
Here are some numbers for compiling one of my tests which contains the example I quoted and a couple of other relatively trivial examples. These numbers are reported by /usr/bin/time. The memory usage numbers are not at all accurate, but I think they're proportional to the truth.

g++ 4.4.3:
    Elapsed (wall clock) time (h:mm:ss or m:ss): 7:01.17
    Maximum resident set size (kbytes): 5298256
(as I said before, this really needs about 1.3GiB, not the 5GB reported)

g++ 4.3.3:
    Elapsed (wall clock) time (h:mm:ss or m:ss): 9:38.92
    Maximum resident set size (kbytes): 5390064

So 4.3 is worse on both counts.

clang++ (svn HEAD):
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.59
    Maximum resident set size (kbytes): 5806528

So clang is *much* faster than either g++, but uses even more memory (this number translates as a little over 1.5GiB in real terms).

John Bytheway

It would also be interesting, if possible, to test with g++ 4.5, as there is an improvement on template compilation time.

Just an idea: for static xpressive, couldn't you detect at compile-time that the expression is truly regular, and use a DFA in that case?
Oh, sure! Why don't you submit a DFA and I'll use it in xpressive? ;-)
After nosing about on the internet a bit more, I found this interesting comparison:
http://shootout.alioth.debian.org/u32/benchmark.php?test=regexdna&lang=all
Here we see every language compared on how well it can perform on a particular regex task. The top-performer is <drumroll> Google's JavaScript V8 engine! Wow. C++ is in 5th place. The fastest C++ program submitted to the competition uses static xpressive <pats own back>. I'm not so upset about being beaten by V8. It adaptively improves its native codegen *at runtime*. What really bugs me is that we're skunked by a C library: Tcl. Grrrr. I've read a bit about Tcl's regex library; it does what Mathias is suggesting: implements both a DFA and an NFA, analyzes the pattern and chooses which to use. I've known for a while that this is the way forward, but I just don't have the time for that. (Wasn't there a GSoC project to do that for Boost.Regex?)
My memory fails me... In any case, the regex GSoC project never got off the ground.

Nosing around the entries to the competition, I wonder how much of the performance difference is down to the regex engine, and how much to other tricks the entries use: for example, I notice the top C program uses a thread pool to conduct everything in parallel. Cheating, I say! ;-)

John.

On 6/7/2010 11:17 AM, John Maddock wrote:
http://shootout.alioth.debian.org/u32/benchmark.php?test=regexdna&lang=all
Here we see every language compared on how well it can perform on a particular regex task. <snip> What really bugs me is that we're skunked by a C library: Tcl. Grrrr. I've read a bit about Tcl's regex library; it does what Mathias is suggesting: implements both a DFA and an NFA, analyzes the pattern and chooses which to use. I've known for a while that this is the way forward, but I just don't have the time for that. (Wasn't there a GSoC project to do that for Boost.Regex?)
My memory fails me.... In any case the regex GSOC project never got off the ground.
Bummer.
Nosing around the entries to the competition, I wonder how much of the performance difference is down to the regex engine, and how much to other tricks the entries use: for example I notice the top C program uses a thread pool to conduct everything in parallel. Cheating I say! ;-)
The entry that uses xpressive does something similar. By hook or by crook, I say! But the competition also lets you compare the solutions by lines of code and by memory usage. On both counts, the solution that uses Tcl fares very badly, but that won't matter to most people, I'll wager.

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com

On Jun 7, 8:17 am, John Maddock <boost.re...@virgin.net> wrote:
Just an idea: for static xpressive, couldn't you detect at compile-time that the expression is truly regular, and use a DFA in that case?
Oh, sure! Why don't you submit a DFA and I'll use it in xpressive? ;-)
After nosing about on the internet a bit more, I found this interesting comparison:
http://shootout.alioth.debian.org/u32/benchmark.php?test=regexdna&lan...
Here we see every language compared on how well it can perform on a particular regex task. The top-performer is <drumroll> Google's JavaScript V8 engine! Wow. C++ is in 5th place. The fastest C++ program submitted to the competition uses static xpressive <pats own back>. I'm not so upset about being beaten by V8. It adaptively improves its native codegen *at runtime*. What really bugs me is that we're skunked by a C library: Tcl. Grrrr. I've read a bit about Tcl's regex library; it does what Mathias is suggesting: implements both a DFA and an NFA, analyzes the pattern and chooses which to use. I've known for a while that this is the way forward, but I just don't have the time for that. (Wasn't there a GSoC project to do that for Boost.Regex?)
My memory fails me.... In any case the regex GSOC project never got off the ground.
Nosing around the entries to the competition, I wonder how much of the performance difference is down to the regex engine, and how much to other tricks the entries use: for example I notice the top C program uses a thread pool to conduct everything in parallel. Cheating I say! ;-)
Does "everything in parallel" gain anything when the program is being forced onto one core? (Notice the ≈ CPU Load column). In contrast, when programs are allowed to use quad-core there's a benefit: http://shootout.alioth.debian.org/u32q/performance.php?test=regexdna

On 7 June 2010 05:58, Eric Niebler <eric@boostpro.com> wrote:
I'm always suspect of benchmarks, but I figured I'd post this anyway.
http://astl.sourceforge.net/bench.7.html.
It seems someone wrote a generic automata library and a regex engine on top of it, and compared its performance to boost.regex, xpressive, pcre, and greta. Of course, his library wins handily. (The purpose of benchmarking is to make yourself look good, right? ;-) He doesn't say that (a) his engine is a DFA, and he's comparing to NFAs; and (b) he's only implemented a tiny subset of regex features people would expect; e.g., no captures even, let alone backreferences: things that can be tricky or impossible to implement with DFAs. He also doesn't say what version of Boost he's using (tests run Nov. 2009) or whether he's using static or dynamic xpressive (probably dynamic; static can be ~15% faster).
Anyway, it's not my intention to kick off yet another costly escalation in the regex war. God help me, I don't have the time. I post it here because I haven't come across a comparison of these four libraries by a third party before. Some quick observations:
- I'm pleased that on the balance, xpressive fares well here relative to the other NFAs. In the end, though, not much separates best from worst.
- All the NFA libraries seem to have large overhead that becomes very noticeable when matching short strings. I don't know why.
- I'm gobsmacked that my VERY OLD library greta handily beats pcre, boost.regex /and/ xpressive on short matches. Failing to beat yourself in any perf benchmark is kind of embarrassing.
- I'm tickled that for long matches, both xpressive and boost.regex (NFAs) beat his library (DFA). How'd that happen?
- I find it interesting that both xpressive and boost.regex have large discontinuities in the chart at the bottom; the others don't. Huh.
When more std library vendors start shipping regex libraries, I'll be interested in seeing this perf matrix expanded.
--
Eric Niebler
BoostPro Computing
http://www.boostpro.com
Just as an FYI, there's a pair of relatively new regex libraries from Google: re2, http://code.google.com/p/re2/, and Irregexp, http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html.

Regards,
Jeroen Habraken

On 6/8/2010 6:09 AM, Jeroen Habraken wrote:
Just as an FYI, there's a set of relatively new libraries from Google, re2: http://code.google.com/p/re2/, and Irregexp: http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html.
Irregexp is what is used by JavaScript V8. It takes the #1 spot in the regex shootout.

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com
participants (10)
- Andy Venikov
- Eric Niebler
- Isaac Gouy
- Jeroen Habraken
- Joel de Guzman
- John Bytheway
- John Maddock
- Mathias Gaunard
- Mathieu -
- Steven Watanabe