Re: [boost] [regex] Boost.Regex and TR1 fundamentally broken?

23 Jun 2005

      ...
I've had a nagging feeling about character ranges and the TR1 regex
proposal, so today I did some digging. I've found what I think is a
showstopping issue for TR1 regex. I hope I'm wrong.
The situation of interest is described in the ECMAScript specification
(ECMA-262), section 15.10.2.15:
"Even if the pattern ignores case, the case of the two ends of a range
is significant in determining which characters belong to the range.
Thus, for example, the pattern /[E-F]/i matches only the letters E, F,
e, and f, while the pattern /[E-f]/i matches all upper and lower-case
ASCII letters as well as the symbols [, \, ], ^, _, and `."
A more interesting case is what should happen when doing a
case-insentitive match on a range such as [Z-a]. It should match z, Z,
a, A and the symbols [, \, ], ^, _, and `. This is not what happens with
Boost.Regex (it throws an exception from the regex constructor).
The tough pill to swallow is that, given the specification in TR1, I
don't think there is any effective way to handle this situation.
According to the spec, case-insensitivity is handled with
regex_traits<>::translate_nocase(CharT) -- two characters are equivalent
if they compare equal after both are sent through the translate_nocase
function. But I don't see any way of using this translation function to
make character ranges case-insensitive. Consider the difficulty of
detecting whether "z" is in the range [Z-a]. Applying the transformation
to "z" has no effect (it is essentially std::tolower). And we're not
allowed to apply the transformation to the ends of the range, because as
ECMA-262 says, "the case of the two ends of a range is significant."
So AFAICT, TR1 regex is just broken, as is Boost.Regex. One possible fix
is to redefine translate_nocase to return a string_type containing all
the characters that should compare equal to the specified character. But
this function is hard to implement for Unicode, and it doesn't play nice
with the existing ctype facet. What a mess!
OK, first up I think you have discovered a problem, and in particular any 
incompatibility between the TR1 proposal and ECMAScript is entirely 
accidental.

I can't see any way of strictly implementing the ECMA semantics within the 
scope of the current regex traits classes either.

However, as noted previously, no one has ever reported this as a bug in 
practice, and anyone writing expressions like [Z-a] needs a poke with a 
sharp stick :-)   Obviously that's no excuse for incompatibility, just that 
the breakage is less "fundamental" than it may appear at first.

Thinking out loud now, lets try and consider how this might be fixed:

I'm assuming that the correct semantics are:

"Given a character range [c1-c2] then character c3 matches this range if 
there is some character c4, that is equivalent to c3, and whose character 
code lies in the range [c1-c2]."

As I understand it ECMA script assumes that for any character c, there is at 
most one other character that is considered equivalent to it (it's opposite 
case), which is clearly implementable either by providing toupper/tolower 
members to the traits class, or just an "other_case" member function.  It's 
easily implementable in terms of ctype as well.  Whew.

Unfortunately that's not the end of the story, for Unicode characters there 
are more than two cases (Upper Lower and Title cases), and that's without 
considering other forms of equivalence - for example the Angstrom symbol 
(U+212B) and character U+00C5 ("LATIN CAPITAL LETTER A WITH RING ABOVE") are 
normally considered as equivalents, likewise full and half width Hangul 
characters may be considered as equivalent for some uses.

The trouble is, for wide character ranges, there is no way we could 
enumerate all the equivalents for the whole range and build a big list of 
all the characters that could match - potentially there are thousands of 
them.

However, if we try to do it on the fly things get expensive - having an API 
that returns a string containing all equivalents to a character is a 
non-starter IMO, it would potentially be called for every single input 
character in the string being matched, allocating and returning a string for 
each one would grind performance to a halt.  We could change the API to 
return a pair of iterators, or even something like:

charT enumerate_equivalents(charT c);  // return the next character 
equivalent to c

But instead of essentially one operation per input character, we'd have N 
operations, if there are N equivalent characters.  Which may or may not be 
an issue in practice.

The main objection I have is that these API's are a lot harder to implement 
than the existing interface - currently in most cases a tolower will do the 
job, and even for fairly strict Unicode conformance a tolower(toupper(c)) 
will work.  If the interface is too hard to implement then it becomes 
useless as a traits class, and we might just as well get rid of it (I'd 
really rather not go down this road, but it has been suggested before).

OK, that's my brain dump over for now!

Regards, John.

Re: [boost] [regex] Boost.Regex and TR1 fundamentally broken?

John Maddock