
Eric Niebler wrote:
I have a question and a bug report regarding the format_perl flag. First the question ...
I see that, when you specify format_perl, match_results::format() recognizes the escape sequences \l \L \u and \U, which do uppercasing or lowercasing. These are necessarily locale-dependent character transformations, but match_results does not have a Traits parameter. How should the transformations be done?
I note that the basic_regex<> class template has a traits parameter, and that match_results<>::format() can only be called after a successful regex match. One reasonable approach is that match_results<> holds a (shared) pointer to the regex object's traits. It would have to be a polymorphic base pointer, since match_results can't know the exact type of the traits object at the time format() is called.
That doesn't exactly work because the RegexTraits concept doesn't have toupper() and tolower() functions. I suggest adding them.
Right, but format_perl isn't part of TR1, so this is all in the realm of vendor-specific extensions. I added some *optional* extra members to the traits class to deal with this: the code detects at compile time whether the members are there and uses them if they are; otherwise it falls back to some sensible defaults.
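For illustration only, here is a minimal sketch of how such compile-time detection of an optional traits member might look. It is not the Boost.Regex code: the names minimal_traits, extended_traits, to_upper and to_upper_impl are hypothetical, and the fallback to std::toupper merely stands in for the "sensible defaults" mentioned above.

```cpp
#include <cctype>

// Hypothetical traits class WITHOUT the optional member.
struct minimal_traits {
    typedef char char_type;
};

// Hypothetical traits class WITH the optional member. The odd mapping
// ('a' -> '@') is deliberate, so that the dispatch is observable.
struct extended_traits {
    typedef char char_type;
    char toupper(char c) const { return c == 'a' ? '@' : c; }
};

// Preferred overload: only viable when t.toupper(c) is well-formed (SFINAE).
template <class Traits>
auto to_upper_impl(const Traits& t, typename Traits::char_type c, int)
    -> decltype(t.toupper(c)) {
    return t.toupper(c);
}

// Fallback overload: chosen when the traits class has no toupper() member.
// Using std::toupper here is an assumption about what a "sensible default" is.
template <class Traits>
typename Traits::char_type
to_upper_impl(const Traits&, typename Traits::char_type c, long) {
    return static_cast<typename Traits::char_type>(
        std::toupper(static_cast<unsigned char>(c)));
}

// Entry point: the literal 0 is an int, so the first overload wins
// whenever it survives substitution.
template <class Traits>
typename Traits::char_type
to_upper(const Traits& t, typename Traits::char_type c) {
    return to_upper_impl(t, c, 0);
}
```

With these definitions, to_upper(minimal_traits(), 'a') goes through the std::toupper fallback, while to_upper(extended_traits(), 'a') calls the member function.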
Strictly speaking, this isn't only a problem for format_perl. match_results::format() also needs to know how to turn characters into integers (e.g. to parse format strings like "$1"). That is the reason RegexTraits::value() exists, so match_results<>::format() should use it.
(Incidentally, I just implemented all this in xpressive, so I can confirm that this strategy works. It incurs a virtual call for each tolower(), toupper(), and value(), but there doesn't seem to be any other way without changing the interface in a non-TR1 compatible way.)
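A rough sketch of the polymorphic-base-pointer idea described above, assuming a results object that stores only a shared pointer to an abstract interface. The names traits_base, traits_holder and erase_traits are hypothetical and this is not the xpressive implementation. Because std::regex_traits has no toupper()/tolower() members, the adapter below uses the locale returned by getloc() for case conversion, which is an assumption made for the example.

```cpp
#include <locale>
#include <memory>
#include <regex>

// Hypothetical abstract interface: the only thing a match_results-like
// class would need to know about the regex's traits.
template <class Char>
struct traits_base {
    virtual ~traits_base() {}
    virtual Char toupper(Char c) const = 0;
    virtual Char tolower(Char c) const = 0;
    virtual int value(Char c, int radix) const = 0;
};

// Adapter that forwards each virtual call to a concrete traits object.
template <class Traits>
struct traits_holder : traits_base<typename Traits::char_type> {
    typedef typename Traits::char_type char_type;

    explicit traits_holder(const Traits& t) : traits_(t) {}

    // Assumption: case conversion goes through the traits' locale.
    char_type toupper(char_type c) const override {
        return std::toupper(c, traits_.getloc());
    }
    char_type tolower(char_type c) const override {
        return std::tolower(c, traits_.getloc());
    }
    // value() is part of the standard regex traits interface.
    int value(char_type c, int radix) const override {
        return traits_.value(c, radix);
    }

    Traits traits_;
};

// Erase the concrete traits type behind a shared base pointer.
template <class Traits>
std::shared_ptr<const traits_base<typename Traits::char_type> >
erase_traits(const Traits& t) {
    return std::make_shared<traits_holder<Traits> >(t);
}
```

A results object could keep the pointer returned by erase_traits(std::regex_traits<char>()) and call toupper(), tolower() and value() through it while formatting, at the cost of one virtual call each.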
Yep, for regex_replace you can pass the regex object through to the code that does the formatting, but match_results::format has no such object. In that case I use the default locale, but your approach is probably better.
Finally, a bug report. Consider the following code:
    std::string str("fOO bAr BaZ");
    regex rx("\\w+");
    str = regex_replace(str, rx, "\\L\\u$&", format_perl);
    std::cout << str << std::endl;
This prints:
FOO BAr BaZ
However, the equivalent perl:
    $str = 'fOO bAr BaZ';
    $str =~ s/\w+/\L\u$&/g;
    print "$str\n";
Prints this:
Foo Bar Baz
Looks like in boost::regex, the \u is stomping the \L rather than merely overriding it for the next character.
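To make the expected behaviour concrete, here is a small sketch of the case-folding state the report implies: \L and \U set a persistent mode, while \u and \l override it for the next character only. The function apply_case_escapes is hypothetical, works on plain text with the escapes embedded (no $& substitution), and uses the C locale functions; it is not the Boost.Regex or Perl implementation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: applies \L, \U, \E, \l and \u to the literal text
// that follows them. Any other backslash sequence is copied through as-is.
std::string apply_case_escapes(const std::string& in) {
    enum Mode { none, lower, upper };
    Mode persistent = none;  // set by \L or \U, cleared by \E
    Mode one_shot = none;    // set by \l or \u, consumed by the next character
    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i) {
        char c = in[i];
        if (c == '\\' && i + 1 < in.size()) {
            bool handled = true;
            switch (in[i + 1]) {
                case 'L': persistent = lower; break;
                case 'U': persistent = upper; break;
                case 'E': persistent = none;  break;
                case 'l': one_shot = lower;   break;
                case 'u': one_shot = upper;   break;
                default:  handled = false;    break;
            }
            if (handled) {
                ++i;  // skip the escape letter
                continue;
            }
        }
        // The one-shot override wins for this character only; afterwards
        // the persistent mode applies again instead of being discarded.
        Mode m = (one_shot != none) ? one_shot : persistent;
        one_shot = none;
        unsigned char uc = static_cast<unsigned char>(c);
        if (m == lower) c = static_cast<char>(std::tolower(uc));
        else if (m == upper) c = static_cast<char>(std::toupper(uc));
        out += c;
    }
    return out;
}
```

Under this model, "\L\ufOO" becomes "Foo": the \u uppercases only the first character and \L stays in effect for the rest.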
Yep, fixed in CVS; thanks for the report.
John.