[regex] format_perl conundrum

I have a question and a bug report regarding the format_perl flag. First the question ...
I see that, when you specify format_perl, match_results::format() recognizes the escape sequences \l \L \u and \U, which do uppercasing or lowercasing. These are necessarily locale-dependent character transformations, but match_results does not have a Traits parameter. How should the transformations be done?
I note that the basic_regex<> class template has a traits parameter, and that match_results<>::format() can only be called after a successful regex match. One reasonable approach is that match_results<> holds a (shared) pointer to the regex object's traits. It would have to be a polymorphic base pointer, since match_results can't know the exact type of the traits object at the time format() is called.
That doesn't exactly work because the RegexTraits concept doesn't have toupper() and tolower() functions. I suggest adding them.
This isn't only a problem for format_perl, strictly speaking. match_results::format() also needs to know how to turn characters into integers (eg. to parse format strings like "$1"). That is the reason for RegexTraits::value()'s existence, so match_results<>::format() should use it.
(Incidentally, I just implemented all this in xpressive, so I can confirm that this strategy works. It incurs a virtual call for each tolower(), toupper(), and value(), but there doesn't seem to be any other way without changing the interface in a non-TR1 compatible way.)
Finally, a bug report. Consider the following code:
    std::string str ("fOO bAr BaZ");
    regex rx ("\\w+");
    str = regex_replace( str, rx, "\\L\\u$&", format_perl );
    std::cout << str << std::endl;
This prints:
    FOO BAr BaZ
However, the equivalent perl:
    $str= 'fOO bAr BaZ';
    $str =~ s/\w+/\L\u$&/g;
    print "$str\n";
Prints this:
    Foo Bar Baz
Looks like in boost::regex, the \u is stomping the \L rather than merely overriding it for the next character.
--
Eric Niebler
Boost Consulting
www.boost-consulting.com
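[Editorial sketch] The shared-pointer idea described in this message can be pictured roughly as follows. This is only an illustration under stated assumptions, not the xpressive or Boost.Regex implementation: the names traits_base, locale_traits and toy_match_results are hypothetical, and a std::locale adapter stands in for a real regex traits class.

```cpp
#include <cassert>
#include <locale>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type-erased interface; not part of any real regex library.
struct traits_base {
    virtual ~traits_base() = default;
    virtual char toupper(char c) const = 0;
    virtual char tolower(char c) const = 0;
    virtual int value(char c, int radix) const = 0;  // -1 if c is not a digit in radix
};

// Stand-in for a concrete traits class: forwards case mapping to a std::locale.
class locale_traits : public traits_base {
public:
    explicit locale_traits(std::locale loc = std::locale::classic())
        : loc_(std::move(loc)) {}
    char toupper(char c) const override { return std::toupper(c, loc_); }
    char tolower(char c) const override { return std::tolower(c, loc_); }
    int value(char c, int radix) const override {
        int v = -1;
        if (c >= '0' && c <= '9') v = c - '0';
        else if (c >= 'a' && c <= 'f') v = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
        return (v >= 0 && v < radix) ? v : -1;
    }
private:
    std::locale loc_;
};

// Stand-in for match_results: it is not templated on the traits type, so it
// can only reach the traits through the polymorphic base pointer.
class toy_match_results {
public:
    // In the scheme described above, the matching algorithm would call this
    // after a successful match, sharing the regex object's traits.
    void set_traits(std::shared_ptr<const traits_base> t) { traits_ = std::move(t); }

    // One virtual call per character operation, as noted in the message.
    std::string upper_first(std::string s) const {
        assert(traits_ && "requires a prior successful match");
        if (!s.empty()) s[0] = traits_->toupper(s[0]);
        return s;
    }
    int digit(char c) const {
        assert(traits_ && "requires a prior successful match");
        return traits_->value(c, 10);
    }
private:
    std::shared_ptr<const traits_base> traits_;
};
```

With a std::make_shared<locale_traits>() installed through set_traits(), upper_first("foo") goes through the virtual toupper() and digit('7') through the virtual value().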

Eric Niebler wrote:
> I have a question and a bug report regarding the format_perl flag. First the question ...
> I see that, when you specify format_perl, match_results::format() recognizes the escape sequences \l \L \u and \U, which do uppercasing or lowercasing. These are necessarily locale-dependent character transformations, but match_results does not have a Traits parameter. How should the transformations be done?
> I note that the basic_regex<> class template has a traits parameter, and that match_results<>::format() can only be called after a successful regex match. One reasonable approach is that match_results<> holds a (shared) pointer to the regex object's traits. It would have to be a polymorphic base pointer, since match_results can't know the exact type of the traits object at the time format() is called.
> That doesn't exactly work because the RegexTraits concept doesn't have toupper() and tolower() functions. I suggest adding them.
Right, but format_perl isn't part of TR1, so this is all in the realm of vendor-specific extensions. I added some *optional* extra members to the traits class to deal with this: the code detects at compile time whether the members are there, and uses them if they are, otherwise uses some sensible defaults.
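[Editorial sketch] Compile-time detection of an optional traits member can be done with an expression-SFINAE idiom along these lines. This is an assumption about the general technique, not the code Boost.Regex actually uses; has_toupper, traits_toupper, minimal_traits and extended_traits are hypothetical names, and the "sensible default" here is simply std::toupper from the C locale.

```cpp
#include <cctype>
#include <type_traits>
#include <utility>

// Primary template: assume the optional member is absent.
template <class Traits, class = void>
struct has_toupper : std::false_type {};

// Chosen only when `traits.toupper(char)` is a valid expression on a const object.
template <class Traits>
struct has_toupper<Traits,
                   decltype(void(std::declval<const Traits&>().toupper(char())))>
    : std::true_type {};

// Tag dispatch: use the member when present...
template <class Traits>
char do_toupper(const Traits& t, char c, std::true_type) {
    return t.toupper(c);
}

// ...otherwise fall back to a default (here, the C locale).
template <class Traits>
char do_toupper(const Traits&, char c, std::false_type) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

template <class Traits>
char traits_toupper(const Traits& t, char c) {
    return do_toupper(t, c, has_toupper<Traits>());
}

// A traits class that only models the basic concept.
struct minimal_traits {};

// A traits class with the optional extension. The mapping is deliberately
// unusual so that a caller can tell which path was taken.
struct extended_traits {
    char toupper(char c) const { return c == 'a' ? '@' : c; }
};
```

Here traits_toupper(minimal_traits(), 'a') takes the fallback path, while traits_toupper(extended_traits(), 'a') calls the member; a traits class that lacks the member still compiles.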
> This isn't only a problem for format_perl, strictly speaking. match_results::format() also needs to know how to turn characters into integers (eg. to parse format strings like "$1"). That is the reason for RegexTraits::value()'s existence, so match_results<>::format() should use it.
> (Incidentally, I just implemented all this in xpressive, so I can confirm that this strategy works. It incurs a virtual call for each tolower(), toupper(), and value(), but there doesn't seem to be any other way without changing the interface in a non-TR1 compatible way.)
Yep, for regex_replace you can pass the regex object through to the code that does the formatting, but match_results::format has no such object. I use the default locale in this case, but your approach is probably better.
> Finally, a bug report. Consider the following code:
>     std::string str ("fOO bAr BaZ");
>     regex rx ("\\w+");
>     str = regex_replace( str, rx, "\\L\\u$&", format_perl );
>     std::cout << str << std::endl;
> This prints:
>     FOO BAr BaZ
> However, the equivalent perl:
>     $str= 'fOO bAr BaZ';
>     $str =~ s/\w+/\L\u$&/g;
>     print "$str\n";
> Prints this:
>     Foo Bar Baz
> Looks like in boost::regex, the \u is stomping the \L rather than merely overriding it for the next character.
Yep, fixed in CVS, thanks for the report.
John.
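[Editorial sketch] The behaviour the bug report asks for (\u overriding \L for the next character only) can be modelled with two pieces of state: a persistent fold set by \L or \U, and a one-shot fold set by \l or \u. The code below is a toy formatter built on std::regex under that assumption; it is not the Boost.Regex fix, it handles only $& and the five case escapes, and expand_one and replace_all are hypothetical names.

```cpp
#include <cctype>
#include <regex>
#include <string>

enum class fold { none, lower, upper };

inline char apply_fold(char c, fold f) {
    unsigned char u = static_cast<unsigned char>(c);
    if (f == fold::lower) return static_cast<char>(std::tolower(u));
    if (f == fold::upper) return static_cast<char>(std::toupper(u));
    return c;
}

// Expands one replacement string for one match.
// Supports $& and \l \u \L \U \E; any other escaped character is copied as-is.
std::string expand_one(const std::string& fmt, const std::string& match) {
    std::string out;
    fold persistent = fold::none;  // set by \L or \U, cleared by \E
    fold one_shot = fold::none;    // set by \l or \u, consumed by the next character
    auto emit = [&](char c) {
        if (one_shot != fold::none) {
            out += apply_fold(c, one_shot);  // overrides the persistent fold once
            one_shot = fold::none;
        } else {
            out += apply_fold(c, persistent);
        }
    };
    for (std::string::size_type i = 0; i < fmt.size(); ++i) {
        char c = fmt[i];
        if (c == '\\' && i + 1 < fmt.size()) {
            char e = fmt[++i];
            switch (e) {
                case 'l': one_shot = fold::lower; break;
                case 'u': one_shot = fold::upper; break;
                case 'L': persistent = fold::lower; break;
                case 'U': persistent = fold::upper; break;
                case 'E': persistent = fold::none; break;
                default: emit(e); break;
            }
        } else if (c == '$' && i + 1 < fmt.size() && fmt[i + 1] == '&') {
            ++i;
            for (char m : match) emit(m);
        } else {
            emit(c);
        }
    }
    return out;
}

// Replaces every match of rx in `in`; the fold state starts fresh for each match.
std::string replace_all(const std::string& in, const std::regex& rx,
                        const std::string& fmt) {
    std::string out;
    std::string::const_iterator last = in.begin();
    for (std::sregex_iterator it(in.begin(), in.end(), rx), end; it != end; ++it) {
        out.append(last, (*it)[0].first);      // text between matches, unchanged
        out += expand_one(fmt, it->str(0));
        last = (*it)[0].second;
    }
    out.append(last, in.end());
    return out;
}
```

Under this model, replace_all("fOO bAr BaZ", std::regex("\\w+"), "\\L\\u$&") yields "Foo Bar Baz", matching the Perl output quoted above.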

John Maddock wrote:
> Eric Niebler wrote:
>> the RegexTraits concept doesn't have toupper() and tolower() functions. I suggest adding them.
> Right, but format_perl isn't part of TR1, so this is all in the realm of vendor-specific extensions. I added some *optional* extra members to the traits class to deal with this: the code detects at compile time whether the members are there, and uses them if they are, otherwise uses some sensible defaults.
Ah. I didn't know that format_perl didn't make the cut. But it's still a problem for RegexTraits::value(), format_default ("$1") and format_sed ("\\1").
To be honest, I find the need for the hacks you describe above to be a bit distasteful. Any vendor who wants format_perl-like behavior will route around the damage in the RegexTraits concept. A better (?) design might have been to go whole-hog with locales and facets, and define some standard regex traits facets that basic_regex<> and match_results<> can query for at runtime. E.g.: if a regex-trait version-2 facet isn't installed, try for version-1, then defer to the ctype and collate facets, etc. And the traits type doesn't need to be part of the basic_regex<> type (it isn't in xpressive). C-compatibility could have been maintained by providing a ctype facet that is implemented in terms of the global C locale. The whole RegexTraits concept and version tags seem like wheel reinvention to me.
Just thinking out loud,
--
Eric Niebler
Boost Consulting
www.boost-consulting.com
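[Editorial sketch] The facet-based fallback chain suggested in this message might look something like the following. No such standard facets exist; regex_facet_v1, regex_facet_v2 and the free functions are hypothetical names invented for illustration, and only std::locale, std::has_facet, std::use_facet and std::ctype<char> are real library components.

```cpp
#include <cstddef>
#include <locale>

// ASCII helper shared by the facets: value of c in the given radix, or -1.
inline int ascii_digit_value(char c, int radix) {
    int v = -1;
    if (c >= '0' && c <= '9') v = c - '0';
    else if (c >= 'a' && c <= 'z') v = c - 'a' + 10;
    else if (c >= 'A' && c <= 'Z') v = c - 'A' + 10;
    return (v >= 0 && v < radix) ? v : -1;
}

// Hypothetical "version 1" facet: only knows how to convert digits.
class regex_facet_v1 : public std::locale::facet {
public:
    static std::locale::id id;
    explicit regex_facet_v1(std::size_t refs = 0) : std::locale::facet(refs) {}
    virtual int value(char c, int radix) const { return ascii_digit_value(c, radix); }
};
std::locale::id regex_facet_v1::id;

// Hypothetical "version 2" facet: adds case conversion.
class regex_facet_v2 : public std::locale::facet {
public:
    static std::locale::id id;
    explicit regex_facet_v2(std::size_t refs = 0) : std::locale::facet(refs) {}
    virtual int value(char c, int radix) const { return ascii_digit_value(c, radix); }
    virtual char toupper(char c) const {
        return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
    }
};
std::locale::id regex_facet_v2::id;

// Reports which tier of the chain would serve a request: 2, 1, or 0 (ctype).
int which_tier(const std::locale& loc) {
    if (std::has_facet<regex_facet_v2>(loc)) return 2;
    if (std::has_facet<regex_facet_v1>(loc)) return 1;
    return 0;
}

// Runtime query: try version 2, then version 1, then defer to ctype.
int facet_value(const std::locale& loc, char c, int radix) {
    if (std::has_facet<regex_facet_v2>(loc))
        return std::use_facet<regex_facet_v2>(loc).value(c, radix);
    if (std::has_facet<regex_facet_v1>(loc))
        return std::use_facet<regex_facet_v1>(loc).value(c, radix);
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    if (!ct.is(std::ctype_base::xdigit, c)) return -1;
    return ascii_digit_value(c, radix);
}

// Case conversion exists only in version 2; otherwise defer to ctype.
char facet_toupper(const std::locale& loc, char c) {
    if (std::has_facet<regex_facet_v2>(loc))
        return std::use_facet<regex_facet_v2>(loc).toupper(c);
    return std::use_facet<std::ctype<char> >(loc).toupper(c);
}
```

A facet is installed with the std::locale constructor that takes a facet pointer, for example std::locale(std::locale::classic(), new regex_facet_v2). Because the lookup happens on the locale at runtime, nothing here needs to appear in the regex or match-results type.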

Eric Niebler wrote:
> To be honest, I find the need for the hacks you describe above to be a bit distasteful. Any vendor who wants format_perl-like behavior will route around the damage in the RegexTraits concept. A better (?) design might have been to go whole-hog with locales and facets, and define some standard regex traits facets that basic_regex<> and match_results<> can query for at runtime. E.g.: if a regex-trait version-2 facet isn't installed, try for version-1, then defer to the ctype and collate facets, etc. And the traits type doesn't need to be part of the basic_regex<> type (it isn't in xpressive). C-compatibility could have been maintained by providing a ctype facet that is implemented in terms of the global C locale. The whole RegexTraits concept and version tags seem like wheel reinvention to me.
> Just thinking out loud,
Think all you like: it's fair comment. Actually, the original regex implementation used a custom facet installed in the locale rather than a traits class, but broken and poorly behaved locale implementations caused me so much hassle that in the end I gave up and used a traits class instead :-( Use of a traits class also makes it easier IMO to use non-std locale mechanisms (the Win32 APIs or ICU's locale support) if you want to.
Still learning yours....
John.
Participants (2): Eric Niebler, John Maddock