[regex] Boost.Regex + ICU vs. standalone ICU

Hi there,

Is there a document out there describing any substantial differences (in particular, w.r.t. semantics and correctness) between using Boost.Regex with ICU support baked in as opposed to ICU's built-in RegexMatcher/RegexPattern classes? I do realize, for example, that the former's API has a 'modern C++' style, whereas the latter is modeled on Java's -- but these differences seem to be mostly cosmetic. Is there anything else I should be aware of before deciding to use one or the other -- performance, functionality, ease-of-use, etc.?

Thanks!

PS: It doesn't look like xpressive has Unicode support yet, which is why I'm not considering it, but I'd love to know if this impression is false or soon-to-be false.

Is there a document out there describing any substantial differences (in particular, w.r.t. semantics and correctness) between using Boost.Regex with ICU support baked in as opposed to ICU's built-in RegexMatcher/RegexPattern classes?
I do realize, for example, that the former's API has a 'modern C++' style, whereas the latter is modeled on Java's -- but these differences seem to be mostly cosmetic. Is there anything else I should be aware of before deciding to use one or the other -- performance, functionality, ease-of-use, etc.?

John Maddock
To a large extent all regex engines are created remarkably equal.
Looks like ICU hasn't caught up with some of Perl-5.10's additions yet (recursive expressions for example).
The other advantage of Boost.Regex is that being iterator based you can search text in non-contiguous storage.
Those were the only two that jumped out at me from a quick look at ICU.
HTH, John.
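To illustrate the point about iterator-based searching, here is a minimal sketch that searches a std::deque<char>, whose storage is not contiguous. As an assumption made to keep the example free of a Boost dependency, it uses std::regex as a stand-in (its interface was derived from Boost.Regex); boost::regex_search accepts the same kind of bidirectional iterator pair. The function name first_number is hypothetical.

```cpp
#include <deque>
#include <regex>
#include <string>

// Returns the first run of digits found in `text`, or an empty string if
// there is none. `first_number` is a hypothetical helper for illustration.
//
// Assumption: std::regex stands in for boost::regex here; both take a pair
// of bidirectional iterators, so the text never has to be copied into a
// contiguous buffer first.
std::string first_number(const std::deque<char>& text) {
    using Iter = std::deque<char>::const_iterator;

    static const std::regex digits("[0-9]+");

    // match_results is parameterised on the iterator type being searched.
    std::match_results<Iter> m;
    if (std::regex_search(text.begin(), text.end(), m, digits)) {
        return m[0].str();
    }
    return std::string();
}
```

Any container with bidirectional iterators (std::list<char>, a rope, a custom buffer chain) should work the same way, as long as the match_results type is instantiated with that container's iterator type.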
Thanks, it does :). Indeed, it doesn't sound like there's a ton of difference. I actually didn't know that Boost.Regex supported recursive regexen. Would you mind pointing me to the documentation for it, and/or what the syntax looks like? Thanks again.

I actually didn't know that Boost.Regex supported recursive regexen. Would you mind pointing me to the documentation for it, and/or what the syntax looks like?

John Maddock
The new stuff in Perl-5.10 that is supported in Boost.Regex is:
Named sub-expressions:
http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/syntax/... _syntax.html#boost_regex.syntax.perl_syntax.named_subexpressions
Branch resets:
http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/syntax/... _syntax.html#boost_regex.syntax.perl_syntax.branch_reset
Recursion:
http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/syntax/... _syntax.html#boost_regex.syntax.perl_syntax.recursive_expressions
Conditional on recursion or subexpression match:
http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/syntax/... _syntax.html#boost_regex.syntax.perl_syntax.conditional_expressions
The (*OPERATOR) syntax introduced in Perl-5.10 is not currently supported.
HTH, John.
That's fantastic, thank you. I have an unrelated Boost.Regex question: how expensive are basic_regex objects to copy? Say, relative to constructing (and thus reparsing) from a string pattern, anew?

The reason I ask is because I have a variant type with several value types, including a regex type. I was wondering whether it makes more sense to store the regex pattern in that variant as a string, and to construct the actual basic_regex from that string every time I need to operate on it, or to store it as a basic_regex and hope that copies are not too expensive. One other alternative is to store the regex on the heap using a smart pointer, but I'd rather keep value semantics so long as performance isn't an issue.

(Oh, incidentally, does Boost.Regex have support for Boost.Serialization? Right now what I do is serialize it [the pattern] as an std::string -- is that the recommended approach?)

Thanks!

That's fantastic, thank you. I have an unrelated Boost.Regex question: how expensive are basic_regex objects to copy? Say, relative to constructing (and thus reparsing) from a string pattern, anew?
A *lot* more efficient - basic_regex is a pimpl so it's just a shared_ptr copy to copy a basic_regex.
The reason I ask is because I have a variant type with several value types, including a regex type. I was wondering whether it makes more sense to store the regex pattern in that variant as a string, and to construct the actual basic_regex from that string every time I need to operate on it, or to store it as a basic_regex and hope that copies are not too expensive. One other alternative is to store the regex on the heap using a smart pointer, but I'd rather keep value semantics so long as performance isn't an issue.
No, don't construct every time you need it, that's just wasting CPU cycles :-(
(Oh, incidentally, does Boost.Regex have support for Boost.Serialization? Right now what I do is serialize it [the pattern] as an std::string -- is that the recommended approach?)
Yes, there's no explicit serialization support. HTH, John.
Participants (2): AJG, John Maddock