[regex] empty expression issue

Hi All,

I've just started to use boost::regex and ran into the following incompatibility with Perl regular expressions: it is impossible to use an empty expression, either alone or inside another regex.

Sample code:

    #include <iostream>
    #include <string>
    #include <boost/regex.hpp>

    void check_regexp(const char * text)
    {
        try {
            // The second group ends with an empty alternative.
            boost::regex e("(\\d+)\\s*(kbit|)");
            boost::cmatch what;
            if (boost::regex_match(text, what, e)) {
                std::string item1(what[1].first, what[1].second);
                std::string item2(what[2].first, what[2].second);
                std::cout << "matched. " << std::endl
                          << "Value : " << item1 << std::endl
                          << "Units : " << item2 << std::endl
                          << std::endl;
            }
        } catch (boost::regex_error & e) {
            std::cout << "ERROR: " << e.what() << std::endl;
        }
    }

    int main(int argc, char** argv)
    {
        check_regexp("96 kbit");
    }

Program output:

    ERROR: Empty expression

Such regular expressions are quite legal in Perl (e.g. /(Ann|Bob|)/).

One possible (quick-n-dirty) workaround is to replace

    boost::regex e("(\\d+)\\s*(kbit|)");

with

    boost::regex e("(\\d+)\\s*(kbit|.{0})");

The program output would then be:

    matched.
    Value : 96
    Units : kbit

Would this be considered an issue for further improvement?

Thank you.
--
impulse9

impulse9 wrote:
> Hi All,
> I've just started to use boost::regex and ran into the following incompatibility with Perl regular expressions:
> it is impossible to use an empty expression, either alone or inside another regex.
> Sample code:
> boost::regex e("(\\d+)\\s*(kbit|)");
From the Boost.Regex docs (http://boost.org/libs/regex/doc/syntax_perl.html):
Empty alternatives are not allowed (these are almost always a mistake), but if you really want an empty alternative use (?:) as a placeholder, for example:
"|abc" is not a valid expression, but "(?:)|abc" is and is equivalent, also the expression: "(?:abc)??" has exactly the same effect.
This would seem to be non-compliant behavior according to TR1, which references ECMA-262, which describes the regex syntax in section 15.10.1 as:
    Pattern ::
        Disjunction

    Disjunction ::
        Alternative
        Alternative | Disjunction

    Alternative ::
        [empty]
        Alternative Term
The TR1 draft (http://open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1836.pdf) amends ECMA-262 in a few ways, but says nothing that would indicate that empty alternatives are not allowed.

FWIW, you could use Boost.Xpressive (new in 1.34), which allows empty alternates.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
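For comparison, here is a minimal sketch of how the pattern from the original post behaves under the ECMA-262 grammar quoted above. It assumes std::regex (default ECMAScript mode) as a stand-in for the TR1 regex; it does not use Boost.Regex, which reports "Empty expression" for the first pattern according to the original post. The helper parse_rate and the two pattern constants are hypothetical names introduced only for this illustration, and behaviour may differ between standard library implementations.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the thread): matches "<digits> [kbit]" and
// stores the two captures. Returns false when the whole text does not match.
// Assumption: std::regex in ECMAScript mode stands in for the TR1 regex.
bool parse_rate(const std::string& text, const std::string& pattern,
                std::string& value, std::string& units)
{
    const std::regex re(pattern);  // throws std::regex_error if the pattern is rejected
    std::smatch what;
    if (!std::regex_match(text, what, re))
        return false;
    value = what[1].str();
    units = what[2].str();  // empty string when the empty alternative was taken
    return true;
}

// The pattern from the original post, ending in an empty alternative.
const std::string with_empty_alternative = R"re((\d+)\s*(kbit|))re";
// The same pattern with the (?:) placeholder suggested by the Boost.Regex docs.
const std::string with_placeholder = R"re((\d+)\s*(kbit|(?:)))re";
```

Calling parse_rate("96 kbit", with_empty_alternative, value, units) is expected to fill in "96" and "kbit"; with the input "96" the units capture should come back empty. The placeholder form is the one to try when the regex engine rejects the bare empty alternative.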

Eric Niebler wrote:
> Empty alternatives are not allowed (these are almost always a mistake), but if you really want an empty alternative use (?:) as a placeholder, for example:
> "|abc" is not a valid expression, but "(?:)|abc" is and is equivalent, also the expression: "(?:abc)??" has exactly the same effect.
> This would seem to be non-compliant behavior according to TR1, which references ECMA-262, which describes the regex syntax in section 15.10.1 as:
Quite possibly, although I note that Perl 6 (last time I checked anyway) was planning to disallow these, on the grounds that they are a persistent source of buggy regular expressions. And as noted there are alternatives:

    |abc == (?:abc)?? == (?:)|abc
    abc| == (?:abc)?  == abc|(?:)

John.
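The equivalences listed above can be spot-checked with a small sketch. It assumes std::regex (ECMAScript mode) purely for illustration, because per this thread Boost.Regex rejects the bare "|abc" and "abc|" forms; same_full_match_behaviour is a hypothetical helper name. It only compares whole-string matching, so it does not show which alternative an engine tries first.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns true when every pattern gives the same
// std::regex_match verdict (match / no match) on every input string.
// Assumption: std::regex is used here, not Boost.Regex.
bool same_full_match_behaviour(const std::vector<std::string>& patterns,
                               const std::vector<std::string>& inputs)
{
    if (patterns.empty())
        return true;  // nothing to compare
    for (const std::string& input : inputs) {
        // Verdict of the first pattern is the reference for this input.
        const bool expected = std::regex_match(input, std::regex(patterns.front()));
        for (const std::string& p : patterns) {
            if (std::regex_match(input, std::regex(p)) != expected)
                return false;
        }
    }
    return true;
}
```

For example, same_full_match_behaviour({"abc|", "(?:abc)?", "abc|(?:)"}, {"", "abc", "ab"}) compares the three spellings from the second line of the list.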

John Maddock wrote:
JM> Eric Niebler wrote:
JM> > Empty alternatives are not allowed (these are almost always a mistake), but if you really want an empty alternative use (?:) as a placeholder, for example:
JM> > "|abc" is not a valid expression, but "(?:)|abc" is and is equivalent, also the expression: "(?:abc)??" has exactly the same effect.
JM> > This would seem to be non-compliant behavior according to TR1, which references ECMA-262, which describes the regex syntax in section 15.10.1 as:
JM> Quite possibly, although I note that Perl 6 (last time I checked anyway) was
JM> planning to disallow these, on the grounds that they are a persistent source
JM> of buggy regular expressions. And as noted there are alternatives:
JM> |abc == (?:abc)?? == (?:)|abc
JM> abc| == (?:abc)? == abc|(?:)
JM> John.

Disallowing these would do more harm than good. Buggy regular expressions are quite easy to write, even without '|)'.

Using |) is quite intuitive. There are many cases where some symbol(s) may or may not be present, and this is a natural way to say that. Moreover, it makes regular expressions more flexible.

This is from http://en.wikipedia.org/wiki/Regular_expression

[QUOTE]
Regular expressions are particularly useful in the production of code completion systems and syntax highlighting in integrated development environments (IDEs). For example:

    (public|private|protected|)\s*(\w+)\s+(\w+)\s*\(

as used to match type declarations in source code (see also, Regular expression examples).
[/QUOTE]

Alternatives are just alternatives. Or should we assume that boost::regex is only Perl6+ compatible?

That's IMHO.

Thank you.
--
impulse9

impulse9 wrote:
> Disallowing these would do more harm than good. Buggy regular expressions are quite easy to write, even without '|)'
True!
> Using |) is quite intuitive. There are many cases where some symbol(s) may or may not be present, and this is a natural way to say that. Moreover, it makes regular expressions more flexible.
> This is from http://en.wikipedia.org/wiki/Regular_expression
> [QUOTE] Regular expressions are particularly useful in the production of code completion systems and syntax highlighting in integrated development environments (IDEs).
> For example: (public|private|protected|)\s*(\w+)\s+(\w+)\s*\(
> as used to match type declarations in source code (see also, Regular expression examples). [/QUOTE]
> Alternatives are just alternatives. Or should we assume that boost::regex is only Perl6+ compatible?
It's a trivial change to make, but let me think this through...

John.

Participants (3): Eric Niebler, impulse9, John Maddock