
The review of Eric Niebler's xpressive library begins now and runs through Sunday, September 18. The library is available here: http://www.boost-consulting.com/vault/index.php?directory=Strings%20-% 20Text%20Processing Here is a short introduction that I shamelessly reassembled from the xpressive documentation. xpressive is an advanced, object-oriented regular expression template library for C++. Regular expressions can be written as strings that are parsed at run-time, or as expression templates that are parsed at compile-time. xpressive brings these two approaches seamlessly together and occupies a unique niche in the world of C++ text processing. With xpressive, you can choose to use it much as you would use Boost.Regex, representing regular expressions as strings. Or you can use it as you would use Spirit, writing your regexes as C+ + expressions, enjoying all the benefits of an embedded language dedicated to text manipulation. When writing your review, you may wish to consider the following questions: * What is your evaluation of the design? * What is your evaluation of the implementation? * What is your evaluation of the documentation? * What is your evaluation of the potential usefulness of the library? * Did you try to use the library? With what compiler? Did you have any problems? * How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? * Are you knowledgeable about the problem domain? And finally, every review should answer this question: * Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion. Please send your reviews either to the Boost list or to me personally. The former is preferred, because it allows the community to consider your review as well. Thanks Thomas Witt Review Manager xpressive Thomas Witt witt@acm.org

Thomas Witt wrote:
The review of Eric Niebler's xpressive library begins now and runs through Sunday, September 18.
The library is available here:
http://www.boost-consulting.com/vault/index.php?directory=Strings%20-% 20Text%20Processing
And the docs are online at: http://boost-sandbox.sf.net/libs/xpressive -- Eric Niebler Boost Consulting www.boost-consulting.com

Hi Eric, Boost already has Regex. What is different about xpressive? regards Andy Little

Cytowanie Andy Little <andy@servocomm.freeserve.co.uk>:
Boost already has Regex. What is different about xpressive?
I can not take part in full review for variety of reasons (primary I have little knowledge of the problem domain and I'm on vacation now), but there is something very important about xpressive - it provides static (compile-time) regular expressions. This allows an extra performance boost, which is very important for some users - eg. my colleagues at work. We use Boost.Regex extensively and parts of our software are very CPU-bound. Switch to xpressive could free some (badly needed) CPU cycles. For the same reason my colleagues are considering switch to Spirit, but learning curve (and library complexity) seems to be steeper. If my voice is worth anything, I would strongly vote YES due to performance benefits. Besides, I believe it is just fine to have more that one implementation of regex if they are sufficiently different (choice is a Good Thing). B.

"Bronek Kozicki" <brok@rubikon.pl> wrote in message news:20050913142103.s6zwgckkcw8ww40k@webmail.spamcop.net...
Cytowanie Andy Little <andy@servocomm.freeserve.co.uk>:
Boost already has Regex. What is different about xpressive?
I can not take part in full review for variety of reasons (primary I have little knowledge of the problem domain and I'm on vacation now), but there is something very important about xpressive - it provides static (compile-time) regular expressions.
I understand that and that seems sufficiently different from Regex, but does xpressive not also duplicate the runtime functionality in Regex, Regex being a candidate for standardisation? regards Andy Little

Andy Little wrote:
Hi Eric,
Boost already has Regex. What is different about xpressive?
regards Andy Little
Although xpressive's primitives are regex-ish in nature, xpressive is far more than just a regular expression engine. (The docs don't make this point strongly enough, my bad.) xpressive is a pattern matching engine with the power of context-free grammars. By allowing you to embed one pattern in another by reference, xpressive lets you find patterns in data that are recursive in nature. Problems that are intractible with "classic" regular expressions, such as matching balanced nested parentheses or matching XML tags, are simple with xpressive. So xpressive is more rightly thought of as a context-free grammar engine with exhaustive backtracking. The goal is to provide one framework and one set of pattern matching primitives that scales from simple strstr-style grepping, through ordinary regexing, all the way to full-blown parsing. That's the primary difference. Other differences flow from that one, like the ability to find patterns in non-char data, and better integration with "real" C++. The latter point is a work in progress, but the ultimate goal is to make it as simple to call C++ code from a regex as it is to call a regex from C++. That includes things like user-defined predicates (extending the set of regex primitives), assignment to variables (string, list, map) from withing the regex, and semantic actions. It's true that there is functional overlap between Regex and xpressive, but xpressive takes a much broader view of pattern matching. And just because one piece of this puzzle has been proposed for standardization doesn't mean that all innovation in this problem space must cease. -- Eric Niebler Boost Consulting www.boost-consulting.com

On 9/13/05 10:56 AM, "Eric Niebler" <eric@boost-consulting.com> wrote:
It's true that there is functional overlap between Regex and xpressive, but xpressive takes a much broader view of pattern matching. And just because one piece of this puzzle has been proposed for standardization doesn't mean that all innovation in this problem space must cease.
If xpressive gets accepted... ..could we factor stuff so common capabilities get to be in one set of code? (In other words, is xpressive is a super-set of Boost.Regex, and if so, can we change the implementation of Boost.Regex to use xpressive? What about any commonalities between these libraries and Spirit. Maybe we could have a grand unification of these three libraries some day, if appropriate. Of course, we could start a Regex and Spirit factoring now.) -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

Daryle Walker wrote:
On 9/13/05 10:56 AM, "Eric Niebler" <eric@boost-consulting.com> wrote:
It's true that there is functional overlap between Regex and xpressive, but xpressive takes a much broader view of pattern matching. And just because one piece of this puzzle has been proposed for standardization doesn't mean that all innovation in this problem space must cease.
If xpressive gets accepted...
..could we factor stuff so common capabilities get to be in one set of code? (In other words, is xpressive is a super-set of Boost.Regex, and if so, can we change the implementation of Boost.Regex to use xpressive? What about any commonalities between these libraries and Spirit. Maybe we could have a grand unification of these three libraries some day, if appropriate. Of course, we could start a Regex and Spirit factoring now.)
xpressive is more like Spirit than Regex, and Joel and I have spent much time talking about unification possibilities. Our current thinking is that in the future, Spirit and xpressive will share a common core. In fact, I have already begun factoring out and generalizing large chunks of xpressive's meta-programming for reuse as a general expression template domain-specific embeded language compiler construction toolset (see boost/xpressive/detail/proto). In addition, xpressive already reuses Fusion, which is part of Spirit. Unification with Boost.Regex would be less straightforward. Look over my reply to Thomas Witt about the design of xpressive's basic_regex<> type: http://article.gmane.org/gmane.comp.lib.boost.devel/130973 xpressive's design goals led to a different interface and a different internal representation for the regex type. The interface chosen by Boost.Regex places some requirements on the implementation -- namely, that the compiled form of a regex much be iterator-neutral. xpressive's implementation is a poor fit for the Boost.Regex interface. IMO, it's best to leave Boost.Regex alone. Its implementation is a far better fit for its interface than xpressive would be. -- Eric Niebler Boost Consulting www.boost-consulting.com

"Eric Niebler" <eric@boost-consulting.com> wrote in message news:4327CAA3.5090305@boost-consulting.com... ...
xpressive's design goals led to a different interface and a different internal representation for the regex type. The interface chosen by Boost.Regex places some requirements on the implementation -- namely, that the compiled form of a regex much be iterator-neutral. xpressive's implementation is a poor fit for the Boost.Regex interface. IMO, it's best to leave Boost.Regex alone. Its implementation is a far better fit for its interface than xpressive would be.
That begs the question: do we need Boost.Regex's interface? I'm not up to date on Boost.Regex's status as far as TR1, so maybe it's a done deal already. But it seems to me that xpressive has a broader application domain, and may be a better candidate? I stopped using Boost.Regex after I invested the time to learn Spirit. Not that there was anything wrong with Boost.Regex. I just started to think in terms of Spirit grammars and functor parsers. My programming productivity seems to be bound by my memory limitations. So I try to utilize a learned technology as often as I can to keep up on the nuances of the tool. Jeff Flinn

Jeff Flinn wrote:
That begs the question: do we need Boost.Regex's interface? I'm not up to date on Boost.Regex's status as far as TR1, so maybe it's a done deal already. But it seems to me that xpressive has a broader application domain, and may be a better candidate?
<snip> It's a done deal. The Boost.Regex interface has already been accepted as the TR1 regex interface. -- Eric Niebler Boost Consulting www.boost-consulting.com

On 9/14/05 11:33 AM, "Eric Niebler" <eric@boost-consulting.com> wrote:
Jeff Flinn wrote:
That begs the question: do we need Boost.Regex's interface? I'm not up to date on Boost.Regex's status as far as TR1, so maybe it's a done deal already. But it seems to me that xpressive has a broader application domain, and may be a better candidate?
<snip> It's a done deal. The Boost.Regex interface has already been accepted as the TR1 regex interface.
It wouldn't be the first time an interface was accepted into C++ that seemed great at the time, but possibly better alternates were discovered later.... (e.g. valarray vs. expression templates, vector<bool> vs. separate type) -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

Thomas Witt <witt@acm.org> writes:
The review of Eric Niebler's xpressive library begins now and runs through Sunday, September 18.
The library is available here:
http://www.boost-consulting.com/vault/index.php?directory=Strings%20-%20Text...
For easy reviewing I've posted a copy of the docs at http://www.boost-consulting.com/projects/xpressive -- Dave Abrahams Boost Consulting www.boost-consulting.com

This may have more to do with boostbook/quickbook than with xpressive, but in my Windows XP FireFox browser, it appears that every place inline code is followed by regular text, there is no space separating them. Spaces also seem to be missing in a few other places, e.g. "usingnamespace" can be seen on http://boost-consulting.com/projects/xpressive/xpressive/user_s_guide.html#x..., as can "representationof" ("representation" is in italics). So this probably applies to all style changes. Also, the full justification of text is difficult to read. I don't think web browsers are smart enough about formatting to make that legible, at least not at the normal smallish text sizes. Also, I think the abbrev "parens" should be changed to the word "parentheses" :) -- Dave Abrahams Boost Consulting www.boost-consulting.com

This is a mini-review, based just on reading the documentation, not on trying to compile anything. My comments on documentation, with a few questions are attached below. Design looks well thought out; library is potentially useful. Documentation was good. I only read the user documentation, not the reference docs. Are you knowledgeable about the problem domain? I've used boost.regex and spirit, as well as using regexes a lot in scripting languages. Do you think the library should be accepted as a Boost library? Yes, but conditional on having benchmarks showing a worthwhile speed improvement over boost.regex. (Or alternatively over spirit.) Without that there is no strong reason to have it in Boost along with both Spirit and boost.regex. Darren ------------------- URLs are relative to http://boost-sandbox.sourceforge.net/libs/xpressive/doc/html/xpressive/ * user_s_guide.html As I read I assumed "sregex" meant static (compile-time) regex. I then thought compile() must be very clever and wondered why bother with the alternative ">>" syntax. So I think you need to make it clearer on this page that sregex means std::string regex, and that compile() is for a run-time regex, and the ">>" syntax is for a compile-time regex. * creating_a_regex_object.html 1. Either the meaning of Perl's /s modifier needs to be defined clearly, or the difference between "_" and "~_n" needs to be shown with an example (incidentally none of your examples at examples.html match strings with carriage-returns). 2. I see I can use icase("Abc") but is there a way to say the whole regex should be case-insensitive? I.e. the equivalent of: "/match something/i" * grammars_and_nested_matches.html In the example that starts: sregex parentheses; parens = '(' should "parens" actually be "parentheses" ? If not, I don't understand what is happening, and this example needs more explanation. 2. In Filtering Nested Results, I wasn't clear what the purpose was. Is it to show all the name matches before all the id matches? If so, choosing a less regular example string would help, e.g. with more names than ids, names following names some of the time, etc. 3. "See the definition of output_nested_results in the Examples section." I think that function should be moved to grammars_and_nested_matches.html; it seemed out of place in examples.html. * Other 1. I'd like to see some fuller examples, that show the I/O part as well. E.g. a full program that takes a list of email addresses on stdin, one per line, and spits out a list of the illegal ones. 2. Benchmarking. I wanted to see the relative speed of compile-time vs. run-time vs. boost::regex (and ideally vs. PCRE or a scripting language) on some realistic application.

Answers inline... Darren Cook wrote:
Do you think the library should be accepted as a Boost library? Yes, but conditional on having benchmarks showing a worthwhile speed improvement over boost.regex. (Or alternatively over spirit.) Without that there is no strong reason to have it in Boost along with both Spirit and boost.regex.
There are results from of performance benchmarks of static xpressive vs. dynamic xpressive vs. Boost.Regex in the Appendix of xpressive's documentation. You must have missed it. See: http://boost-sandbox.sf.net/libs/xpressive/doc/html/xpressive/perf.html In short, xpressive comes out consistently ahead of Boost.Regex on short matches, and roughly on par for longer matches (with wide variation). Results are shown for both gcc 3.4 and vc7.1. The xpressive download includes the code for the perf test, so you can run it yourself, if you like.
* user_s_guide.html As I read I assumed "sregex" meant static (compile-time) regex. I then thought compile() must be very clever and wondered why bother with the alternative ">>" syntax. So I think you need to make it clearer on this page that sregex means std::string regex, and that compile() is for a run-time regex, and the ">>" syntax is for a compile-time regex.
Agreed.
* creating_a_regex_object.html 1. Either the meaning of Perl's /s modifier needs to be defined clearly, or the difference between "_" and "~_n" needs to be shown with an example (incidentally none of your examples at examples.html match strings with carriage-returns).
Agreed. FYI, "_" matches any one character. ~_n matches any character that is not '\n'. I also need to describe _ln which matches a logical newline (eg., "\n" or "\r" or "\r\n" or other line separators) and ~_ln which matches any one character that is not a line separator. This all needs to be documented better.
2. I see I can use icase("Abc") but is there a way to say the whole regex should be case-insensitive? I.e. the equivalent of: "/match something/i"
You can just wrap the whole regex in icase(). I need to show an example of that.
* grammars_and_nested_matches.html In the example that starts: sregex parentheses; parens = '('
should "parens" actually be "parentheses" ?
Yes. My bad.
2. In Filtering Nested Results, I wasn't clear what the purpose was. Is it to show all the name matches before all the id matches? If so, choosing a less regular example string would help, e.g. with more names than ids, names following names some of the time, etc.
I'm not at all sure of the utility of the nested results filter, and I may just cut it. After matching a regex that contains nested regexes, the match_results object contains nested results. Figuring out which results correspond to which regex can be difficult. The filter lets you see only those results corresponding to a particular nested regex. But I've yet to need it in practice. *shrug*
3. "See the definition of output_nested_results in the Examples section." I think that function should be moved to grammars_and_nested_matches.html; it seemed out of place in examples.html.
You're right it doesn't belong in Examples. But I didn't want to clutter the user doc with what is really an implementation detail. I'll think about it.
* Other 1. I'd like to see some fuller examples, that show the I/O part as well. E.g. a full program that takes a list of email addresses on stdin, one per line, and spits out a list of the illegal ones.
Haha! Have you /seen/ the regex that matches email addresses? It's 5 pages long. But I get the idea -- examples are important. I'll see what I can come up with.
2. Benchmarking. I wanted to see the relative speed of compile-time vs. run-time vs. boost::regex (and ideally vs. PCRE or a scripting language) on some realistic application.
It's in there. -- Eric Niebler Boost Consulting www.boost-consulting.com

My reply to Darren seems to have been eaten by the GMane monster. Resending.... Eric Niebler wrote:
Answers inline...
Darren Cook wrote:
Do you think the library should be accepted as a Boost library? Yes, but conditional on having benchmarks showing a worthwhile speed improvement over boost.regex. (Or alternatively over spirit.) Without that there is no strong reason to have it in Boost along with both Spirit and boost.regex.
There are results from of performance benchmarks of static xpressive vs. dynamic xpressive vs. Boost.Regex in the Appendix of xpressive's documentation. You must have missed it. See:
http://boost-sandbox.sf.net/libs/xpressive/doc/html/xpressive/perf.html
In short, xpressive comes out consistently ahead of Boost.Regex on short matches, and roughly on par for longer matches (with wide variation). Results are shown for both gcc 3.4 and vc7.1. The xpressive download includes the code for the perf test, so you can run it yourself, if you like.
* user_s_guide.html As I read I assumed "sregex" meant static (compile-time) regex. I then thought compile() must be very clever and wondered why bother with the alternative ">>" syntax. So I think you need to make it clearer on this page that sregex means std::string regex, and that compile() is for a run-time regex, and the ">>" syntax is for a compile-time regex.
Agreed.
* creating_a_regex_object.html 1. Either the meaning of Perl's /s modifier needs to be defined clearly, or the difference between "_" and "~_n" needs to be shown with an example (incidentally none of your examples at examples.html match strings with carriage-returns).
Agreed. FYI, "_" matches any one character. ~_n matches any character that is not '\n'. I also need to describe _ln which matches a logical newline (eg., "\n" or "\r" or "\r\n" or other line separators) and ~_ln which matches any one character that is not a line separator. This all needs to be documented better.
2. I see I can use icase("Abc") but is there a way to say the whole regex should be case-insensitive? I.e. the equivalent of: "/match something/i"
You can just wrap the whole regex in icase(). I need to show an example of that.
* grammars_and_nested_matches.html In the example that starts: sregex parentheses; parens = '('
should "parens" actually be "parentheses" ?
Yes. My bad.
2. In Filtering Nested Results, I wasn't clear what the purpose was. Is it to show all the name matches before all the id matches? If so, choosing a less regular example string would help, e.g. with more names than ids, names following names some of the time, etc.
I'm not at all sure of the utility of the nested results filter, and I may just cut it. After matching a regex that contains nested regexes, the match_results object contains nested results. Figuring out which results correspond to which regex can be difficult. The filter lets you see only those results corresponding to a particular nested regex. But I've yet to need it in practice. *shrug*
3. "See the definition of output_nested_results in the Examples section." I think that function should be moved to grammars_and_nested_matches.html; it seemed out of place in examples.html.
You're right it doesn't belong in Examples. But I didn't want to clutter the user doc with what is really an implementation detail. I'll think about it.
* Other 1. I'd like to see some fuller examples, that show the I/O part as well. E.g. a full program that takes a list of email addresses on stdin, one per line, and spits out a list of the illegal ones.
Haha! Have you /seen/ the regex that matches email addresses? It's 5 pages long. But I get the idea -- examples are important. I'll see what I can come up with.
2. Benchmarking. I wanted to see the relative speed of compile-time vs. run-time vs. boost::regex (and ideally vs. PCRE or a scripting language) on some realistic application.
It's in there.
-- Eric Niebler Boost Consulting www.boost-consulting.com

There are results from of performance benchmarks of static xpressive vs. dynamic xpressive vs. Boost.Regex in the Appendix of xpressive's documentation. You must have missed it. See:
Sorry, yes I missed the appendices.
http://boost-sandbox.sf.net/libs/xpressive/doc/html/xpressive/perf.html
In short, xpressive comes out consistently ahead of Boost.Regex on short matches, and roughly on par for longer matches (with wide variation).
Interesting. This left me with two questions: 1. Why is dynamic quicker than static xpressive on some expressions? 2. Why is boost::regex quicker on longer strings? Something to do with buffering or dynamic memory usage? I thought "Huck[[:alpha:]]+" (expressive twice as quick) vs. "[[:alpha:]]+ing" (boost::regex twice as quick) was very curious. Is this due to some design decision, or just something waiting to be optimized?
Agreed. FYI, "_" matches any one character. ~_n matches any character that is not '\n'. I also need to describe _ln which matches a logical newline (eg., "\n" or "\r" or "\r\n" or other line separators) and ~_ln which matches any one character that is not a line separator.
_ln sounds useful. Is that in perl/PCRE ? Darren

Darren Cook wrote:
http://boost-sandbox.sf.net/libs/xpressive/doc/html/xpressive/perf.html
In short, xpressive comes out consistently ahead of Boost.Regex on short matches, and roughly on par for longer matches (with wide variation).
Interesting. This left me with two questions: 1. Why is dynamic quicker than static xpressive on some expressions?
It's only that way for gcc. On VC7.1, static xpressive is always faster. I can only guess that gcc's optimizer is at fault here.
2. Why is boost::regex quicker on longer strings? Something to do with buffering or dynamic memory usage?
I haven't fully investigated this, but I suspect that for some of those patterns, Boost.Regex is finding a clever optimization. I have noticed that if you change the pattern: Tom|Sawyer|Huckleberry|Finn to: Tom|Sawyer|.uckleberry|Finn ^ then xpressive is considerably faster than Boost.Regex at finding all matches. Clearly, I need to be testing more patterns to make sure the results are representative.
I thought "Huck[[:alpha:]]+" (expressive twice as quick) vs. "[[:alpha:]]+ing" (boost::regex twice as quick) was very curious. Is this due to some design decision, or just something waiting to be optimized?
This is a case where xpressive is finding a clever optimization that Boost.Regex is missing. When a pattern begins with a string literal, xpressive uses Boyer-Moore. It's a huge win. I have no idea why Boost.Regex is faster at matching "[[:alpha:]]+ing". It's worth looking in to.
Agreed. FYI, "_" matches any one character. ~_n matches any character that is not '\n'. I also need to describe _ln which matches a logical newline (eg., "\n" or "\r" or "\r\n" or other line separators) and ~_ln which matches any one character that is not a line separator.
_ln sounds useful. Is that in perl/PCRE ?
I don't recall where I got that idea. Perhaps from Perl 6. -- Eric Niebler Boost Consulting www.boost-consulting.com

Caleb Epstein wrote:
On 9/16/05, Darren Cook <darren@dcook.org> wrote:
~_ln which matches any one character that is not a line separator.
_ln sounds useful. Is that in perl/PCRE ?
Its not in Perl 5.
It's inspired from Perl 6. See the table at the bottom of: http://www.perl.com/pub/a/2002/06/04/apo5.html?page=7 Perl 5: \r?\n Perl 6: \n Meaning: asserts logical newline Perl 5: [^\n] Perl 6: \N Meaning: not a logical newline For Perl 6, \n really means \r?\n. I liked that idea, but it's not as general as it could be. Also, it could be slow. So I broke it in two: _n == literal newline and _ln == logical newline, where the line separators are determined by the regex traits type. -- Eric Niebler Boost Consulting www.boost-consulting.com

----- Original Message ----- From: "Eric Niebler" <eric@boost-consulting.com> To: <boost@lists.boost.org> Sent: Saturday, September 17, 2005 5:10 AM Subject: Re: [boost] [Review] xpressive
2. In Filtering Nested Results, I wasn't clear what the purpose was. Is it to show all the name matches before all the id matches? If so, choosing a less regular example string would help, e.g. with more names than ids, names following names some of the time, etc.
I'm not at all sure of the utility of the nested results filter, and I may just cut it. After matching a regex that contains nested regexes, the match_results object contains nested results. Figuring out which results correspond to which regex can be difficult. The filter lets you see only those results corresponding to a particular nested regex. But I've yet to need it in practice. *shrug*
I was assuming that nested results would be the way that complex patterns (i..e. grammars) and semantic actions may be associated. Having matched some deeply structured thing, nested results gives you the ability to process the hierarchy in a clean fashion. I'm not particularly championing this approach but its a workable alternative. If you did the semantic actions another way then this point wouldn't hold.

Eric, my apologies for not posting before, I really must get around to a full review, but in the mean time:
In short, xpressive comes out consistently ahead of Boost.Regex on short matches, and roughly on par for longer matches (with wide variation). Results are shown for both gcc 3.4 and vc7.1. The xpressive download includes the code for the perf test, so you can run it yourself, if you like.
I'm attaching the full results I get from my performance test suite (updated to reflect the changes you made in the xpressive one, but without the static expressions: I couldn't be bothered to translate them all!). I compiled with VC7.1 with all the optimisations on (any suitable inline, global optimisations on, inline intrinsics on). Be aware that any such results should be treated with extreme caution: there is absolutely no such thing as a "typical" regex. I haven't run the tests with cygwin: in the past I've found cygwin to be a pretty poor platform for testing code performance, I don't think the cygwin guys have put much effort into runtime efficiency, as compared to Linux say. For the "trivial" short matches test, I see broadly very similar results from xpressive and Boost.Regex, I haven't counted them, but it looks like honors even to me. The bad news is that PCRE kicks us both into touch on this test, however I'm pretty sure that Boost.Regex only dropped behind PCRE on this test section, when I added protection againt stack overflow (a __try __except block). Since I pinched this idea from GRETA, I suspect xpression does the same thing? In any case that protection is a "good thing" so it's staying in Boost.Regex whatever. For the html document search test cases xpressive is consistently ahead: by up 5x, probably the Boyer-Moore code kicking in (but see below). On this test PCRE is somewhere in between us. For the C++ code search test cases, Boost.Regex is consistently ahead by about 2x (it beats PCRE on this test as well), there is one complicated expression that xpressive didn't compile, but which Boost.Regex and PCRE did handle OK. For the plain text search test cases, Boost.Regex is ahead of xpressive in all cases, up to 12x faster in the short text, and 25x in the very long text (these are the extremes though, don't pay too much attention!). The really interesting thing about these tests, is that some of the expressions are string literals: the xpressive Boyer Moore code should really shine here, and yet it comes out quite a bit slower. I'm not completely sure what's happening here, but I would guess that the Boost.Regex code is easier to optimise, and so executes faster, *provided* it doesn't find a tentative match too often. In comparison in the html tests, most of the regular expressions begin with "<", which tends to occur rather often in html. So in this case, a better algorithm wins out over the easier to optimise code. Ideally there would be an heuristist that would choose between the two, and always pick the best, but without analysing the text that you're going to search first I don't what it would be at present :-( Finally, I would note that in the search tests the Boost.Regex results are hampered compared to PCRE by the need to construct a regex_iterator: the class is a pimple which uses a shared_ptr, so there's a couple of memory allocations in there that PCRE doesn't have to do. For short searches this a big hit, but *only* in the rarified atmosphere of a test suite, back in the real world iterators tend to get passed around by value, so this is a another "good thing" in general. I haven't looked at PCRE, but I suspect you do something similar? John.

John Maddock wrote:
Eric, my apologies for not posting before, I really must get around to a full review, but in the mean time:
FYI - tomorrow is the last day of the review.
For the "trivial" short matches test, I see broadly very similar results from xpressive and Boost.Regex, I haven't counted them, but it looks like honors even to me.
I wonder why your results are so different from mine. Please post the code for your modified test so I can run it locally. Thanks. -- Eric Niebler Boost Consulting www.boost-consulting.com

I wonder why your results are so different from mine. Please post the code for your modified test so I can run it locally. Thanks.
It's in cvs now under the usual libs/regex/performance path: there are still some problems with the html output that I don't understand yet, but I haven't had a chance to look at those. As for the results being different from yours, I have noticed that the results can differ quite a bit from run to run, particularly the ftp response expression "^([0-9]+)(\-| |$)(.*)$" I've seen either Boost.Regex or xpressive win out by some margin depending on the machine setup. John.

"John Maddock" <john@johnmaddock.co.uk> writes:
I wonder why your results are so different from mine. Please post the code for your modified test so I can run it locally. Thanks.
It's in cvs now under the usual libs/regex/performance path: there are still some problems with the html output that I don't understand yet, but I haven't had a chance to look at those.
As for the results being different from yours, I have noticed that the results can differ quite a bit from run to run, particularly the ftp response expression "^([0-9]+)(\-| |$)(.*)$" I've seen either Boost.Regex or xpressive win out by some margin depending on the machine setup.
You guys might want to take a look at the performance test Matthias Troyer and I developed for the parameter library. Matthias is an expert in high-performance computing and knows how to correct for some of those effects. libs/parameter/test/efficiency.cpp and libs/parameter/test/timings.txt HTH, -- Dave Abrahams Boost Consulting www.boost-consulting.com

You guys might want to take a look at the performance test Matthias Troyer and I developed for the parameter library. Matthias is an expert in high-performance computing and knows how to correct for some of those effects.
Yep, that's pretty much what the program does: hammer through the test case a few times to eliminate cache effects, and then do a timed run at the end. We could really use a Boost.Performance library rather than lots of folks doing their own thing, should someone want to take the hint and write one :-) John.

"John Maddock" <john@johnmaddock.co.uk> writes:
You guys might want to take a look at the performance test Matthias Troyer and I developed for the parameter library. Matthias is an expert in high-performance computing and knows how to correct for some of those effects.
Yep, that's pretty much what the program does: hammer through the test case a few times to eliminate cache effects, and then do a timed run at the end.
There are other issues you might want to consider, like pipeline length. -- Dave Abrahams Boost Consulting www.boost-consulting.com

John Maddock wrote:
I wonder why your results are so different from mine. Please post the code for your modified test so I can run it locally. Thanks.
It's in cvs now under the usual libs/regex/performance path: there are still some problems with the html output that I don't understand yet, but I haven't had a chance to look at those.
As for the results being different from yours, I have noticed that the results can differ quite a bit from run to run, particularly the ftp response expression "^([0-9]+)(\-| |$)(.*)$" I've seen either Boost.Regex or xpressive win out by some margin depending on the machine setup.
Right off, I spotted a couple of problems with your performance test: 1) The xpressive functions all have try/catch blocks in them, but the boost functions do not. 2) You are not passing the "optimize" syntax_option_type flag to xpressive's regex constructor. 3) People who care about performance *will* take the time to rewrite their patterns as static regexes, so a perf test that excludes static xpressive is less interesting. I fixed the first two problems and took the liberty of committing my changes. (In retrospect, a patch would have been the polite thing to do. Sorry.) I'll work on adding a test for static xpressive, too. After fixing these problems, the numbers for the short-matches comes out as I expected - dynamic xpressive is ahead by as much as 2x or more. The HTML search surprises me a bit -- xpressive does poorly. It could be related to the fact that this is a case-insensitive search. It's possible I have a bug, or it's possible that the silly things I had to do to make Boyer-Moore work with the regex traits interface make Boyer-Moore more trouble than its worth for case-insensitive matches. I haven't yet run the other tests. Testing: "abc" against "abc" Boost regex (C++ locale): 3.6478e-007s xpressive regex: 1.49012e-007s Testing: "^([0-9]+)(\-| |$)(.*)$" against "100- this is a line of ftp response which contains a message string" Boost regex (C++ locale): 7.59125e-007s xpressive regex: 4.46796e-007s Testing: "([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4}" against "1234-5678-1234-456" Boost regex (C++ locale): 1.13106e-006s xpressive regex: 7.00951e-007s Testing: "^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$" against "john@johnmaddock.co.uk" Boost regex (C++ locale): 1.75667e-006s xpressive regex: 1.34087e-006s Testing: "^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$" against "foo12@foo.edu" Boost regex (C++ locale): 1.48964e-006s xpressive regex: 1.19209e-006s Testing: "^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$" against "bob.smith@foo.tv" Boost regex (C++ locale): 1.52016e-006s xpressive regex: 1.16158e-006s Testing: "^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$" against "EH10 2QQ" Boost regex (C++ locale): 5.96046e-007s xpressive regex: 3.20435e-007s Testing: "^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$" against "G1 1AA" Boost regex (C++ locale): 5.80788e-007s xpressive regex: 3.20435e-007s Testing: "^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$" against "SW1 1ZZ" Boost regex (C++ locale): 5.96046e-007s xpressive regex: 3.35217e-007s Testing: "^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$" against "4/1/2001" Boost regex (C++ locale): 5.36919e-007s xpressive regex: 3.12805e-007s Testing: "^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$" against "12/12/2001" Boost regex (C++ locale): 5.51224e-007s xpressive regex: 3.12805e-007s Testing: "^[-+]?[[:digit:]]*\.?[[:digit:]]*$" against "123" Boost regex (C++ locale): 5.65529e-007s xpressive regex: 2.98023e-007s Testing: "^[-+]?[[:digit:]]*\.?[[:digit:]]*$" against "+3.14159" Boost regex (C++ locale): 5.96046e-007s xpressive regex: 3.49998e-007s Testing: "^[-+]?[[:digit:]]*\.?[[:digit:]]*$" against "-3.14159" Boost regex (C++ locale): 5.96046e-007s xpressive regex: 3.42369e-007s Testing: ^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struct)[[:space:]]*(\<\w+\>([ ]*\( [^)]*\))?[[:space:]]*)*(\<\w*\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?(\{|:[^;\{()]*\{) Boost regex (C++ locale): 0.00012207s xpressive regex: 0.000217529s Testing: (^[ ]*#(?:[^\\\n]|\\[^\n_[:punct:][:alnum:]]*[\n[:punct:][:word:]])*)|(//[^\n]*|/\*.*?\*/)|\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\.)?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\>|('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned|using|virtual|void|volatile|wchar_t|while)\> Boost regex (C++ locale): 0.00426563s Exception: mismatched parenthesis xpressive regex: -1s Testing: ^[ ]*#[ ]*include[ ]+("[^"]+"|<[^>]+>) Boost regex (C++ locale): 0.000183105s xpressive regex: 0.000213623s Testing: ^[ ]*#[ ]*include[ ]+("boost/[^"]+"|<boost/[^>]+>) Boost regex (C++ locale): 0.000183105s xpressive regex: 0.000213623s Testing: beman|john|dave Boost regex (C++ locale): 0.000251465s xpressive regex: 0.0003125s Testing: <p>.*?</p> Boost regex (C++ locale): 0.00019458s xpressive regex: 0.00074707s Testing: <a[^>]+href=("[^"]*"|[^[:space:]]+)[^>]*> Boost regex (C++ locale): 0.000716797s xpressive regex: 0.00167773s Testing: <h[12345678][^>]*>.*?</h[12345678]> Boost regex (C++ locale): 0.000202148s xpressive regex: 0.00103711s Testing: <img[^>]+src=("[^"]*"|[^[:space:]]+)[^>]*> Boost regex (C++ locale): 0.000206055s xpressive regex: 0.000533203s Testing: <font[^>]+face=("[^"]*"|[^[:space:]]+)[^>]*>.*?</font> Boost regex (C++ locale): 0.000213623s xpressive regex: 0.000465332s -- Eric Niebler Boost Consulting www.boost-consulting.com

Right off, I spotted a couple of problems with your performance test:
1) The xpressive functions all have try/catch blocks in them, but the boost functions do not.
Right, but it's outside the timing code: I had to add the exception handling to get the tests to run to completion because one of the expressions was throwing, because xpressive wouldn't parse it.
2) You are not passing the "optimize" syntax_option_type flag to xpressive's regex constructor.
Ahah! That would be it probably.
3) People who care about performance *will* take the time to rewrite their patterns as static regexes, so a perf test that excludes static xpressive is less interesting.
Yep, I realise that. John.

I wonder why your results are so different from mine. Please post the code for your modified test so I can run it locally. Thanks.
Still no review :-( But I did find the time to set the performance tester going on Linux before I went out this morning, and: Boost.Regex and PCRE came in with almost spot on the same times as on Windows (they should it's the same machine just rebooted into Mandiva Linux). Xpressive was almost twice as fast as on Windows for the very short "trivial match" tests (!), other results were broadly as before. I also compiled with gcc-4.0.0 and -DBOOST_REGEX_RECURSIVE, the differences between xpressive and regex were much smaller now on those very short tests, the medium length ones they were neck and neck, and the long ones Boost.Regex pulled ahead again. The most noticeable thing was that all the results were about 30% faster with gcc-4 on Linux than with VC7.1 on Windows, I'd say that was a result for Open Source! John.

"John Maddock" <john@johnmaddock.co.uk> writes:
I haven't run the tests with cygwin: in the past I've found cygwin to be a pretty poor platform for testing code performance, I don't think the cygwin guys have put much effort into runtime efficiency, as compared to Linux say.
The Windows EH model hurts. That aside, GCC-4.0.0 is *much* faster than GCC-3.4.x -- Dave Abrahams Boost Consulting www.boost-consulting.com

John Maddock wrote:
there is one complicated expression that xpressive didn't compile, but which Boost.Regex and PCRE did handle OK.
I've tracked this down. xpressive is rejecting this regex because it is invalid according to the TR1 spec. It begins: const char* highlight_expression = "(^[ \t]*#(?:[^\\\\\\n]|" "\\\\[^\\n_[:punct:][:alnum:]]*[\\n[:punct:][:word:]])*)|" ----------------------------------------------^^^^^^^^ The problem is [:word:]. "word" is not a valid char-class-name, according to the TR1 spec: "Expressions of the form [[:class-name:]] are recognized, and are sensitive to the locale encapsulated by the traits class. The range of values for class-name is determined by the traits class, but at least the following names are recognized: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit." If you change your regex to use "\\w" instead of "[:word:]", xpressive compiles it fine. (xpressive sure could give a better error in this case, though. :-/) -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler wrote:
John Maddock wrote:
there is one complicated expression that xpressive didn't compile, but which Boost.Regex and PCRE did handle OK.
I've tracked this down. xpressive is rejecting this regex because it is invalid according to the TR1 spec. It begins:
const char* highlight_expression = "(^[ \t]*#(?:[^\\\\\\n]|" "\\\\[^\\n_[:punct:][:alnum:]]*[\\n[:punct:][:word:]])*)|" ----------------------------------------------^^^^^^^^
The problem is [:word:]. "word" is not a valid char-class-name, according to the TR1 spec:
I've dug a bit deeper and I've found other problems with this regex. First, it uses "\\<" and "\\>" as begin- and end-of-word assertions. This is not recognized ECMA syntax, so xpressive treats them as literals. Also, you seem to be assuming that "\\n" is treated as a newline. Right now, xpressive does not recognize "\\n" as a newline unless you pass the "normalize" syntax_option_type flag, in which case, it will also recognize "\\a", "\\f", "\\r", "\\t" and "\\v". If you don't want to bother with the "normalize" flag, you should use "\n" in the pattern instead of "\\n". Thinking ... In looking over the ECMAScript syntax, I'm thinking this may be a bug in xpressive. I should probably do away with the "normalize" flag and always recognize "\\n" as a newline literal. Not sure what to do about "\\<" and "\\>", though. -- Eric Niebler Boost Consulting www.boost-consulting.com

From: "John Maddock" <john@johnmaddock.co.uk>
Not sure what to do about "\\<" and "\\>", though.
Sigh, yes it's a legacy from BSD style expressions, and probably shouldn't be enabled for Perl style expressions, they're just so useful though, and more efficient and precise than \b.
They are most definitely handy. I used them regularly with grep and vi (usually viper mode in emacs, but still). xpressive could include them as an extension, right? -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

Rob Stewart wrote:
From: "John Maddock" <john@johnmaddock.co.uk>
Not sure what to do about "\\<" and "\\>", though.
Sigh, yes it's a legacy from BSD style expressions, and probably shouldn't be enabled for Perl style expressions, they're just so useful though, and more efficient and precise than \b.
They are most definitely handy. I used them regularly with grep and vi (usually viper mode in emacs, but still). xpressive could include them as an extension, right?
Static xpressive already has them. They're spelled "bow" and "eow". The question is whether they should be recognized by dynamic xpressive by default. If we follow the ECMAScript standard to the letter, "\\<" should match a literal '<' character. So giving it a different behavior would be a non-conforming extension. If we want to make \\< and \\> begin- and end-word-assertions, we'd need to change the regex proposal. It's too late for TR1, but maybe for C++0x. In the mean time, I've changed xpressive to recognize \\< and \\< (in boost-sandbox CVS now). That makes xpressive non-conforming, but that is arguably more useful and less surprising than the standard behavior. -- Eric Niebler Boost Consulting www.boost-consulting.com

If you change your regex to use "\\w" instead of "[:word:]", xpressive compiles it fine.
Understood.
(xpressive sure could give a better error in this case, though. :-/)
Yep, it's not really a mismatched parenthesis :-) Not that any regex engine is that good with it's error messages mind you! John.

Hi boosters, so, here comes my first try at review participation.
* What is your evaluation of the design?
I think the design is sound. The static syntax is, as far as i can see, as readable as it can get using c++ operators.
* What is your evaluation of the implementation?
I did use it in a small one-off project and it performed well and worked as documented. I used both dynamic and static versions.
* What is your evaluation of the documentation?
I like the style of the documentation. It gives a good introduction and serves well as a reference. A table of primitives would be nice.
* What is your evaluation of the potential usefulness of the library?
I think the library is extremely useful. I used boost::regexp and boost::spirit quite a bit in the past, and very often I would have needed somethin "in between", more powerful than regexp, even easier to use than spirit. Xpressive nicely fills that gap.
* Did you try to use the library? With what compiler? Did you have any problems?
Yes, with gcc-4.0 on linux, VC7.1 and gcc-3.3 on windows. Compile times are quite long, as with all template-heavy libraries. I like that it is header only, though.
* How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
I read the docs and used tha library in a small project. I did not do much source-code reading.
* Are you knowledgeable about the problem domain?
I am not an expert, but a frequent user of boost::regexp and boost::spirit (and perl regexps).
And finally, every review should answer this question:
* Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
Yes
participants (13)
-
Andy Little
-
Aschwin Gopalan
-
Bronek Kozicki
-
Caleb Epstein
-
Darren Cook
-
Daryle Walker
-
David Abrahams
-
Eric Niebler
-
Jeff Flinn
-
John Maddock
-
Rob Stewart
-
Scott Woods
-
Thomas Witt