Re: [boost] [regex] format_perl conundrum

19 Mar 2007

Eric Niebler wrote:
...
I have a question and a bug report regarding the format_perl flag.
First the question ...
I see that, when you specify format_perl, match_results::format()
recognizes the escape sequences \l \L \u and \U, which do uppercasing
or lowercasing. These are necessarily locale-dependent character
transformations, but match_results does not have a Traits parameter.
How should the transformations be done?
I note that the basic_regex<> class template has a traits parameter,
and that match_results<>::format() can only be called after a
successful regex match. One reasonable approach is that
match_results<> holds a (shared) pointer to the regex object's
traits. It would have to be a polymorphic base pointer, since
match_results can't know the exact type of the traits object at the
time format() is called.
That doesn't exactly work because the RegexTraits concept doesn't have
toupper() and tolower() functions. I suggest adding them.

Right, but format_perl isn't part of TR1, so this is all in the realm of
vendor-specific extensions.  I added some *optional* extra members to the
traits class to deal with this: the code detects at compile time whether the
members are there, and uses them if they are, otherwise uses some sensible
defaults.
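
A minimal sketch of how such compile-time detection could look. This is not the actual Boost.Regex code: the names has_toupper, traits_toupper, minimal_traits and extended_traits are all made up for illustration, and the fallback assumes a plain char and the "C" locale.

```cpp
#include <cctype>
#include <type_traits>
#include <utility>

// Hypothetical helper: maps any well-formed type list to void.
template <class...> struct make_void { typedef void type; };

// Primary template: assume the traits class has no toupper() member.
template <class Traits, class Char, class = void>
struct has_toupper : std::false_type {};

// Selected only when `traits.toupper(ch)` is a valid expression.
template <class Traits, class Char>
struct has_toupper<Traits, Char,
    typename make_void<decltype(
        std::declval<const Traits&>().toupper(std::declval<Char>()))>::type>
    : std::true_type {};

// Optional member is present: use it.
template <class Traits, class Char>
Char do_toupper(const Traits& t, Char c, std::true_type) {
    return t.toupper(c);
}

// Optional member is absent: fall back to a default (assumed: "C" locale, char only).
template <class Traits, class Char>
Char do_toupper(const Traits&, Char c, std::false_type) {
    return static_cast<Char>(std::toupper(static_cast<unsigned char>(c)));
}

// Entry point: dispatches at compile time on the detection result.
template <class Traits, class Char>
Char traits_toupper(const Traits& t, Char c) {
    return do_toupper(t, c, has_toupper<Traits, Char>());
}

// Two hypothetical traits classes, one without and one with the optional member.
struct minimal_traits {};
struct extended_traits {
    char toupper(char c) const {
        return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
    }
};
```

With these definitions, traits_toupper(minimal_traits(), 'a') takes the fallback path and traits_toupper(extended_traits(), 'a') calls the member; both calls compile without the caller knowing which case applies.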
...
This isn't only a problem for format_perl, strictly speaking.
match_results::format() also needs to know how to turn characters into
integers (eg. to parse format strings like "$1"). That is the reason
for RegexTraits::value()'s existence, so match_results<>::format()
should use it.
(Incidentally, I just implemented all this in xpressive, so I can
confirm that this strategy works. It incurs a virtual call for each
tolower(), toupper(), and value(), but there doesn't seem to be any
other way without changing the interface in a non-TR1 compatible way.)

Yep, for regex_replace you can pass the regex object through to the code
that does the formatting, but match_results::format has no such object.  I
use the default locale in this case, but your approach is probably better.
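
For illustration, the shared polymorphic traits pointer described above might be sketched as follows. This is neither the xpressive nor the Boost.Regex implementation: traits_base, traits_holder, locale_traits and mini_results are hypothetical names, and only the three operations mentioned in the thread (tolower, toupper, value) are modelled.

```cpp
#include <locale>
#include <memory>
#include <string>

// Hypothetical type-erased interface; each call through it is virtual.
template <class Char>
struct traits_base {
    virtual ~traits_base() {}
    virtual Char tolower(Char c) const = 0;
    virtual Char toupper(Char c) const = 0;
    virtual int value(Char c, int radix) const = 0;
};

// Adapter that forwards to a concrete traits object of any type.
template <class Traits, class Char>
struct traits_holder : traits_base<Char> {
    explicit traits_holder(const Traits& t) : traits_(t) {}
    Char tolower(Char c) const override { return traits_.tolower(c); }
    Char toupper(Char c) const override { return traits_.toupper(c); }
    int value(Char c, int radix) const override { return traits_.value(c, radix); }
    Traits traits_;
};

// Hypothetical concrete traits: locale-aware case mapping, decimal digits only.
struct locale_traits {
    std::locale loc;  // defaults to the current global locale
    char tolower(char c) const { return std::tolower(c, loc); }
    char toupper(char c) const { return std::toupper(c, loc); }
    int value(char c, int radix) const {
        int v = (c >= '0' && c <= '9') ? c - '0' : -1;
        return v < radix ? v : -1;
    }
};

// Stand-in for match_results: it does not know the traits type, only the base.
template <class Char>
class mini_results {
public:
    // Would be called by the regex algorithm after a successful match.
    template <class Traits>
    void bind_traits(const Traits& t) {
        traits_ = std::make_shared<traits_holder<Traits, Char> >(t);
    }

    // Reads the decimal sub-match index starting at pos; -1 if there is none.
    int parse_index(const std::basic_string<Char>& fmt, std::size_t pos) const {
        int n = 0;
        bool any = false;
        for (; pos < fmt.size(); ++pos) {
            int d = traits_->value(fmt[pos], 10);
            if (d < 0) break;
            n = n * 10 + d;
            any = true;
        }
        return any ? n : -1;
    }

    Char upper(Char c) const { return traits_->toupper(c); }

private:
    std::shared_ptr<const traits_base<Char> > traits_;
};
```

After r.bind_traits(locale_traits()), a call such as r.parse_index("$12abc", 1) goes through value() for each digit, which is where the per-character virtual call mentioned above comes from.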
...
Finally, a bug report. Consider the following code:
    std::string str ("fOO bAr BaZ");
    regex rx ("\\w+");
    str = regex_replace( str, rx, "\\L\\u$&", format_perl );
    std::cout << str << std::endl;
This prints:
FOO BAr BaZ
However, the equivalent perl:
    $str = 'fOO bAr BaZ';
    $str =~ s/\w+/\L\u$&/g;
    print "$str\n";
Prints this:
Foo Bar Baz
Looks like in boost::regex, the \u is stomping the \L rather than
merely overriding it for the next character.

Yep, fixed in CVS, thanks for the report.
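
A small sketch of the intended semantics, assuming a persistent case mode (\L, \U, ended by \E) and a one-shot mode (\l, \u) that applies to the next character only and then lets the persistent mode resume. The function expand_case is hypothetical, handles only these escapes plus $&, and uses ASCII/"C"-locale case mapping; it is not the fix that went into Boost.Regex.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: expands \l \u \L \U \E and $& against a single match.
std::string expand_case(const std::string& fmt, const std::string& match) {
    enum Mode { none, lower, upper };
    Mode persistent = none;  // set by \L or \U, cleared by \E
    Mode one_shot = none;    // set by \l or \u, consumed by the next character
    std::string out;

    // Appends one character, letting the one-shot mode win for this character only.
    auto emit = [&](char c) {
        Mode m = (one_shot != none) ? one_shot : persistent;
        one_shot = none;
        unsigned char uc = static_cast<unsigned char>(c);
        if (m == lower) c = static_cast<char>(std::tolower(uc));
        else if (m == upper) c = static_cast<char>(std::toupper(uc));
        out += c;
    };

    for (std::size_t i = 0; i < fmt.size(); ++i) {
        if (fmt[i] == '\\' && i + 1 < fmt.size()) {
            switch (fmt[++i]) {
                case 'L': persistent = lower; break;
                case 'U': persistent = upper; break;
                case 'E': persistent = none; break;
                case 'l': one_shot = lower; break;
                case 'u': one_shot = upper; break;
                default: emit(fmt[i]); break;  // unknown escape: emit literally
            }
        } else if (fmt[i] == '$' && i + 1 < fmt.size() && fmt[i + 1] == '&') {
            for (std::size_t k = 0; k < match.size(); ++k) emit(match[k]);
            ++i;  // skip the '&'
        } else {
            emit(fmt[i]);
        }
    }
    return out;
}
```

Under these assumptions, expand_case("\\L\\u$&", "fOO") yields "Foo": the \u uppercases only the first character and the \L stays in force for the rest, which matches the Perl output quoted above.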

John.

John Maddock