
I've had a nagging feeling about character ranges and the TR1 regex proposal, so today I did some digging. I've found what I think is a showstopping issue for TR1 regex. I hope I'm wrong.
The situation of interest is described in the ECMAScript specification (ECMA-262), section 15.10.2.15:
"Even if the pattern ignores case, the case of the two ends of a range is significant in determining which characters belong to the range. Thus, for example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well as the symbols [, \, ], ^, _, and `."
A more interesting case is what should happen when doing a case-insentitive match on a range such as [Z-a]. It should match z, Z, a, A and the symbols [, \, ], ^, _, and `. This is not what happens with Boost.Regex (it throws an exception from the regex constructor).
The tough pill to swallow is that, given the specification in TR1, I don't think there is any effective way to handle this situation. According to the spec, case-insensitivity is handled with regex_traits<>::translate_nocase(CharT) -- two characters are equivalent if they compare equal after both are sent through the translate_nocase function. But I don't see any way of using this translation function to make character ranges case-insensitive. Consider the difficulty of detecting whether "z" is in the range [Z-a]. Applying the transformation to "z" has no effect (it is essentially std::tolower). And we're not allowed to apply the transformation to the ends of the range, because as ECMA-262 says, "the case of the two ends of a range is significant."
So AFAICT, TR1 regex is just broken, as is Boost.Regex. One possible fix is to redefine translate_nocase to return a string_type containing all the characters that should compare equal to the specified character. But this function is hard to implement for Unicode, and it doesn't play nice with the existing ctype facet. What a mess!
OK, first up I think you have discovered a problem, and in particular any incompatibility between the TR1 proposal and ECMAScript is entirely accidental. I can't see any way of strictly implementing the ECMA semantics within the scope of the current regex traits classes either. However, as noted previously, no one has ever reported this as a bug in practice, and anyone writing expressions like [Z-a] needs a poke with a sharp stick :-) Obviously that's no excuse for incompatibility, just that the breakage is less "fundamental" than it may appear at first. Thinking out loud now, lets try and consider how this might be fixed: I'm assuming that the correct semantics are: "Given a character range [c1-c2] then character c3 matches this range if there is some character c4, that is equivalent to c3, and whose character code lies in the range [c1-c2]." As I understand it ECMA script assumes that for any character c, there is at most one other character that is considered equivalent to it (it's opposite case), which is clearly implementable either by providing toupper/tolower members to the traits class, or just an "other_case" member function. It's easily implementable in terms of ctype as well. Whew. Unfortunately that's not the end of the story, for Unicode characters there are more than two cases (Upper Lower and Title cases), and that's without considering other forms of equivalence - for example the Angstrom symbol (U+212B) and character U+00C5 ("LATIN CAPITAL LETTER A WITH RING ABOVE") are normally considered as equivalents, likewise full and half width Hangul characters may be considered as equivalent for some uses. The trouble is, for wide character ranges, there is no way we could enumerate all the equivalents for the whole range and build a big list of all the characters that could match - potentially there are thousands of them. However, if we try to do it on the fly things get expensive - having an API that returns a string containing all equivalents to a character is a non-starter IMO, it would potentially be called for every single input character in the string being matched, allocating and returning a string for each one would grind performance to a halt. We could change the API to return a pair of iterators, or even something like: charT enumerate_equivalents(charT c); // return the next character equivalent to c But instead of essentially one operation per input character, we'd have N operations, if there are N equivalent characters. Which may or may not be an issue in practice. The main objection I have is that these API's are a lot harder to implement than the existing interface - currently in most cases a tolower will do the job, and even for fairly strict Unicode conformance a tolower(toupper(c)) will work. If the interface is too hard to implement then it becomes useless as a traits class, and we might just as well get rid of it (I'd really rather not go down this road, but it has been suggested before). OK, that's my brain dump over for now! Regards, John.