[regex] Boost.Regex and TR1 fundamentally broken?

Eric Niebler

20 Jun 2005

6:25 a.m.

I've had a nagging feeling about character ranges and the TR1 regex proposal, so today I did some digging. I've found what I think is a showstopping issue for TR1 regex. I hope I'm wrong.

The situation of interest is described in the ECMAScript specification (ECMA-262), section 15.10.2.15:

"Even if the pattern ignores case, the case of the two ends of a range is significant in determining which characters belong to the range. Thus, for example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well as the symbols [, \, ], ^, _, and `."

A more interesting case is what should happen when doing a case-insensitive match on a range such as [Z-a]. It should match z, Z, a, A and the symbols [, \, ], ^, _, and `. This is not what happens with Boost.Regex (it throws an exception from the regex constructor).

The tough pill to swallow is that, given the specification in TR1, I don't think there is any effective way to handle this situation. According to the spec, case-insensitivity is handled with regex_traits<>::translate_nocase(CharT) -- two characters are equivalent if they compare equal after both are sent through the translate_nocase function. But I don't see any way of using this translation function to make character ranges case-insensitive. Consider the difficulty of detecting whether "z" is in the range [Z-a]. Applying the transformation to "z" has no effect (it is essentially std::tolower). And we're not allowed to apply the transformation to the ends of the range, because as ECMA-262 says, "the case of the two ends of a range is significant."

So AFAICT, TR1 regex is just broken, as is Boost.Regex. One possible fix is to redefine translate_nocase to return a string_type containing all the characters that should compare equal to the specified character. But this function is hard to implement for Unicode, and it doesn't play nice with the existing ctype facet. What a mess!
-- Eric Niebler Boost Consulting www.boost-consulting.com
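
A minimal sketch of the failure described in this message, not Boost.Regex or TR1 code. It assumes ASCII in the "C" locale and uses a free function `translate_nocase` as a stand-in for `regex_traits<>::translate_nocase`; `naive_in_range_nocase` is a hypothetical name for the only test the engine can build when it may fold the input character but must leave the range ends alone.

```cpp
#include <cassert>
#include <cctype>

// Stand-in for regex_traits<char>::translate_nocase (assumption: it
// behaves like std::tolower in the "C" locale).
inline char translate_nocase(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Fold only the input character and compare it against the untouched
// range ends, since ECMA-262 says the case of the ends is significant.
inline bool naive_in_range_nocase(char from, char to, char ch) {
    const char folded = translate_nocase(ch);
    return from <= folded && folded <= to;
}

// Quick self-check of the [Z-a] case discussed above.
inline void check_naive_range() {
    assert(!naive_in_range_nocase('Z', 'a', 'z'));  // should match, does not
    assert(!naive_in_range_nocase('Z', 'a', 'Z'));  // even 'Z' itself is lost
    assert(naive_in_range_nocase('Z', 'a', 'A'));   // 'A' folds to 'a'
}
```

With this approach both "z" and "Z" fall outside [Z-a], because folding moves them past the upper end of the range.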

Caleb Epstein

20 Jun

1:08 p.m.

On 6/20/05, Eric Niebler <eric@boost-consulting.com> wrote:

...

The situation of interest is described in the ECMAScript specification (ECMA-262), section 15.10.2.15:

"Even if the pattern ignores case, the case of the two ends of a range is significant in determining which characters belong to the range. Thus, for example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well as the symbols [, \, ], ^, _, and `."

A more interesting case is what should happen when doing a case-insensitive match on a range such as [Z-a]. It should match z, Z, a, A and the symbols [, \, ], ^, _, and `. This is not what happens with Boost.Regex (it throws an exception from the regex constructor).

Not meaning to be facetious, but why in heaven's name would anyone want to use a character class like [E-f] or [Z-a]? Seriously, does anyone write regexps like this? -- Caleb Epstein caleb dot epstein at gmail dot com

Eric Niebler

2:38 p.m.

Caleb Epstein wrote:

...

Not meaning to be facetious, but why in heaven's name would anyone want to use a character class like [E-f] or [Z-a]? Seriously, does anyone write regexps like this?

ECMA-262 calls this case out specifically, and AFAICT there is no reasonable way a TR1-compliant regex package can conform. Do you really think that's not a big deal? -- Eric Niebler Boost Consulting www.boost-consulting.com

Caleb Epstein

3:39 p.m.

On 6/20/05, Eric Niebler <eric@boost-consulting.com> wrote:

...

ECMA-262 calls this case out specifically, and AFAICT there is no reasonable way a TR1-compliant regex package can conform. Do you really think that's not a big deal?

It's hardly my place to say, but I think anyone who writes regular expressions in that way is a lunatic. Clearly, supporting a standard should hold more weight than my own value judgements though :-) -- Caleb Epstein caleb dot epstein at gmail dot com

John Maddock

4:32 p.m.

...

...
A more interesting case is what should happen when doing a case-insensitive match on a range such as [Z-a]. It should match z, Z, a, A and the symbols [, \, ], ^, _, and `. This is not what happens with Boost.Regex (it throws an exception from the regex constructor).

Not meaning to be facetious, but why in heaven's name would anyone want to use a character class like [E-f] or [Z-a]? Seriously, does anyone write regexps like this?

That probably explains why no one has reported this as a bug before now. Eric, I'm going to have to think about this and come back to you, John.

John Maddock

23 Jun

11:20 a.m.

...

I've had a nagging feeling about character ranges and the TR1 regex proposal, so today I did some digging. I've found what I think is a showstopping issue for TR1 regex. I hope I'm wrong.

The situation of interest is described in the ECMAScript specification (ECMA-262), section 15.10.2.15:

"Even if the pattern ignores case, the case of the two ends of a range is significant in determining which characters belong to the range. Thus, for example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well as the symbols [, \, ], ^, _, and `."

A more interesting case is what should happen when doing a case-insensitive match on a range such as [Z-a]. It should match z, Z, a, A and the symbols [, \, ], ^, _, and `. This is not what happens with Boost.Regex (it throws an exception from the regex constructor).

The tough pill to swallow is that, given the specification in TR1, I don't think there is any effective way to handle this situation. According to the spec, case-insensitivity is handled with regex_traits<>::translate_nocase(CharT) -- two characters are equivalent if they compare equal after both are sent through the translate_nocase function. But I don't see any way of using this translation function to make character ranges case-insensitive. Consider the difficulty of detecting whether "z" is in the range [Z-a]. Applying the transformation to "z" has no effect (it is essentially std::tolower). And we're not allowed to apply the transformation to the ends of the range, because as ECMA-262 says, "the case of the two ends of a range is significant."

So AFAICT, TR1 regex is just broken, as is Boost.Regex. One possible fix is to redefine translate_nocase to return a string_type containing all the characters that should compare equal to the specified character. But this function is hard to implement for Unicode, and it doesn't play nice with the existing ctype facet. What a mess!

OK, first up I think you have discovered a problem, and in particular any incompatibility between the TR1 proposal and ECMAScript is entirely accidental. I can't see any way of strictly implementing the ECMA semantics within the scope of the current regex traits classes either. However, as noted previously, no one has ever reported this as a bug in practice, and anyone writing expressions like [Z-a] needs a poke with a sharp stick :-) Obviously that's no excuse for incompatibility, just that the breakage is less "fundamental" than it may appear at first.

Thinking out loud now, let's try and consider how this might be fixed:

I'm assuming that the correct semantics are:

"Given a character range [c1-c2] then character c3 matches this range if there is some character c4, that is equivalent to c3, and whose character code lies in the range [c1-c2]."

As I understand it ECMA script assumes that for any character c, there is at most one other character that is considered equivalent to it (its opposite case), which is clearly implementable either by providing toupper/tolower members to the traits class, or just an "other_case" member function. It's easily implementable in terms of ctype as well. Whew.

Unfortunately that's not the end of the story, for Unicode characters there are more than two cases (Upper Lower and Title cases), and that's without considering other forms of equivalence - for example the Angstrom symbol (U+212B) and character U+00C5 ("LATIN CAPITAL LETTER A WITH RING ABOVE") are normally considered as equivalents, likewise full and half width Hangul characters may be considered as equivalent for some uses.

The trouble is, for wide character ranges, there is no way we could enumerate all the equivalents for the whole range and build a big list of all the characters that could match - potentially there are thousands of them.

However, if we try to do it on the fly things get expensive - having an API that returns a string containing all equivalents to a character is a non-starter IMO, it would potentially be called for every single input character in the string being matched, allocating and returning a string for each one would grind performance to a halt. We could change the API to return a pair of iterators, or even something like:

charT enumerate_equivalents(charT c); // return the next character equivalent to c

But instead of essentially one operation per input character, we'd have N operations, if there are N equivalent characters. Which may or may not be an issue in practice.

The main objection I have is that these API's are a lot harder to implement than the existing interface - currently in most cases a tolower will do the job, and even for fairly strict Unicode conformance a tolower(toupper(c)) will work. If the interface is too hard to implement then it becomes useless as a traits class, and we might just as well get rid of it (I'd really rather not go down this road, but it has been suggested before).

OK, that's my brain dump over for now!

Regards, John.
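
The "c3 matches if some equivalent c4 lies in [c1-c2]" semantics can be sketched for the simple two-case situation. This is an illustration only: `other_case` is the hypothetical traits member mentioned in the message, written here as a free function over `char` with the "C" locale's 1:1 ASCII mapping assumed, and `matches_range_nocase` is an invented name.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical traits member: the opposite-case form of c, or c itself
// when it has none. Assumes at most one equivalent per character.
inline char other_case(char c) {
    const unsigned char u = static_cast<unsigned char>(c);
    if (std::islower(u)) return static_cast<char>(std::toupper(u));
    if (std::isupper(u)) return static_cast<char>(std::tolower(u));
    return c;
}

// Plain code-point comparison against the untranslated range ends.
inline bool in_range(char c1, char c2, char ch) {
    return c1 <= ch && ch <= c2;
}

// c3 matches [c1-c2] if c3 itself, or its single equivalent c4, lies in
// the range.
inline bool matches_range_nocase(char c1, char c2, char c3) {
    return in_range(c1, c2, c3) || in_range(c1, c2, other_case(c3));
}

// Quick self-check against the [Z-a] example from the thread.
inline void check_other_case() {
    assert(matches_range_nocase('Z', 'a', 'z'));
    assert(!matches_range_nocase('Z', 'a', 'b'));
}
```

For [Z-a] this accepts z, Z, a, A and the six symbols between them, at a cost of two range tests per input character. With N equivalents per character the same loop would need N tests, which is the cost concern raised above.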

Eric Niebler

4:04 p.m.

John Maddock wrote:

...

I'm assuming that the correct semantics are:

"Given a character range [c1-c2] then character c3 matches this range if there is some character c4, that is equivalent to c3, and whose character code lies in the range [c1-c2]."

Yes.

...

As I understand it ECMA script assumes that for any character c, there is at most one other character that is considered equivalent to it (its opposite case), which is clearly implementable either by providing toupper/tolower members to the traits class, or just an "other_case" member function. It's easily implementable in terms of ctype as well. Whew.

Yes.

...

Unfortunately that's not the end of the story, for Unicode characters there are more than two cases (Upper Lower and Title cases)

Yes.

...

and that's without considering other forms of equivalence - for example the Angstrom symbol (U+212B) and character U+00C5 ("LATIN CAPITAL LETTER A WITH RING ABOVE") are normally considered as equivalents, likewise full and half width Hangul characters may be considered as equivalent for some uses.

No. We're only talking about case folding -- specifically the mappings found in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt.

...

The trouble is, for wide character ranges, there is no way we could enumerate all the equivalents for the whole range and build a big list of all the characters that could match - potentially there are thousands of them.

Right. I don't think we have to.

...

However, if we try to do it on the fly things get expensive - having an API that returns a string containing all equivalents to a character is a non-starter IMO, it would potentially be called for every single input character in the string being matched, allocating and returning a string for each one would grind performance to a halt.

"string_type" would have to be a type that uses the Small String Optimization to avoid allocs. These strings will be short (a max of 4 characters for simple Unicode case folding). But see below.

...

We could change the API to return a pair of iterators, or even something like:

charT enumerate_equivalents(charT c); // return the next character equivalent to c

But instead of essentially one operation per input character, we'd have N operations, if there are N equivalent characters. Which may or may not be an issue in practice.

The main objection I have is that these API's are a lot harder to implement than the existing interface - currently in most cases a tolower will do the job, and even for fairly strict Unicode conformance a tolower(toupper(c)) will work.

I don't see how. Can you explain?

...

If the interface is too hard to implement then it becomes useless as a traits class, and we might just as well get rid of it (I'd really rather not go down this road, but it has been suggested before).

The regex_traits should make it possible for implementers to do full Unicode case folding *if they desire*. That's not the case now.

Here's my suggestion. We add to the traits class two functions:

bool in_range(Char from, Char to, Char ch);

bool in_range_nocase(Char from, Char to, Char ch);

We define the behavior of the regex engine in terms of these functions, but we don't require their use. In particular, for narrow character sets, implementers would be free to use a std::bitset<256>, enumerate the char range [from, to], call translate_nocase on each char, and set the appropriate bit in the bitset. Matching happens by calling translate_nocase on the input char and seeing if its bit is set in the bitset. That gives the same behavior.

For wide character sets, implementors will in all likelihood be storing a sparse vector of [from, to] ranges. Matching happens by calling regex_traits::in_range_nocase(from, to, *in). What does this function do? Whatever the regex_traits implementer wants it to. They can simply use ctype::toupper and ctype::tolower and do the easy thing. Or they can do full-on Unicode case folding if they want.

I think it's OK for TR1 (and C++0x?) to specify the default regex_traits::in_range_nocase behavior solely in terms of ctype. The hope is that eventually, C++ will get real Unicode support, and we can require more of regex_traits then. But the key is giving people a way to get full Unicode support if they so choose.

As an interesting data point, I wonder how the regex package in ICU handles this.

<aside> Have you seen http://www.unicode.org/reports/tr18/? It's the Unicode Consortium's recommendations for Unicode-compliant regex. Very sobering. I would like our goal with the regex_traits to be to provide hooks so that full Level 3 compliance with TR18 is possible (but not required). We're far from that goal now. I'm pretty sure we couldn't even provide Level 1, which requires the proper handling of surrogate pairs. </aside>

-- Eric Niebler Boost Consulting www.boost-consulting.com
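
The narrow-character optimization described in this message can be sketched as follows. This is a hedged illustration, not Boost.Regex code: `translate_nocase` is a free-function stand-in assumed to behave like `std::tolower` in the "C" locale, and `compile_range_nocase` and `match_nocase` are invented names.

```cpp
#include <bitset>
#include <cassert>
#include <cctype>

// Stand-in for regex_traits<char>::translate_nocase (assumed: tolower).
inline unsigned char translate_nocase(unsigned char c) {
    return static_cast<unsigned char>(std::tolower(c));
}

// Pattern-compile time: enumerate [from, to] with its ends untouched and
// record the folded form of every member.
inline std::bitset<256> compile_range_nocase(unsigned char from,
                                             unsigned char to) {
    std::bitset<256> set;
    for (unsigned int c = from; c <= to; ++c)
        set.set(translate_nocase(static_cast<unsigned char>(c)));
    return set;
}

// Match time: fold the input character once and test its bit.
inline bool match_nocase(const std::bitset<256>& set, unsigned char ch) {
    return set.test(translate_nocase(ch));
}

// Quick self-check against the [Z-a] example from the thread.
inline void check_bitset_range() {
    const std::bitset<256> set = compile_range_nocase('Z', 'a');
    assert(match_nocase(set, 'z') && match_nocase(set, 'A'));
    assert(!match_nocase(set, 'b'));
}
```

Because the folding of the range members happens when the set is built, each input character costs one `translate_nocase` call and one bit test, and [Z-a] matches z, Z, a, A and the symbols in between.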

John Maddock

24 Jun

10:26 a.m.

...

No. We're only talking about case folding -- specifically the mappings found in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt.

Well maybe you are, but the regex traits classes were always intended to allow for other forms of equivalence as well.

...

...
However, if we try to do it on the fly things get expensive - having an API that returns a string containing all equivalents to a character is a non-starter IMO, it would potentially be called for every single input character in the string being matched, allocating and returning a string for each one would grind performance to a halt.

"string_type" would have to be a type that uses the Small String Optimization to avoid allocs. These strings will be short (a max of 4 characters for simple Unicode case folding). But see below.

Or we could return a reference to a string_type I guess.

...

We could change the API to return a pair of iterators, or even something like:

charT enumerate_equivalents(charT c); // return the next character equivalent to c

But instead of essentially one operation per input character, we'd have N operations, if there are N equivalent characters. Which may or may not be an issue in practice.

The main objection I have is that these API's are a lot harder to implement than the existing interface - currently in most cases a tolower will do the job, and even for fairly strict Unicode conformance a tolower(toupper(c)) will work.

I don't see how. Can you explain?

Well they're harder than just a tolower anyway - you would need to enumerate through the entire code set and build a big table of equivalents and then dump that out as a bunch of C++ data declarations ready for the new API to return. This is actually duplicating the data that's already present in our C runtimes, and/or ICU or other libraries, but just isn't accessible in a form we would like.

...

If the interface is too hard to implement then it becomes useless as a traits class, and we might just as well get rid of it (I'd really rather not go down this road, but it has been suggested before).

The regex_traits should make it possible for implementers to do full Unicode case folding *if they desire*. That's not the case now.

That's what happens in type u32regex in Boost.Regex now.

...

Here's my suggestion. We add to the traits class two functions:

bool in_range(Char from, Char to, Char ch);

What use is this one, or are you allowing equivalents other than case folding now <wink>? If so then I approve :-)

...

bool in_range_nocase(Char from, Char to, Char ch);

OK, but see below.

...

We define the behavior of the regex engine in terms of these functions, but we don't require their use. In particular, for narrow character sets, implementers would be free to use a std::bitset<256>, enumerate the char range [from, to], call translate_nocase on each char, and set the appropriate bit in the bitset. Matching happens by calling translate_nocase on the input char and seeing if its bit is set in the bitset. That gives the same behavior.

I don't like traits class API's that may or may not be called: what happens if a user defined traits class is provided that alters the behavior of in_range, but not translate? The side effects produced by these API's are clearly visible.

...

For wide character sets, implementors will in all likelihood be storing a sparse vector of [from, to] ranges. Matching happens by calling regex_traits::in_range_nocase(from, to, *in). What does this function do? Whatever the regex_traits implementer wants it to. They can simply use ctype::toupper and ctype::tolower and do the easy thing. Or they can do full-on Unicode case folding if they want.

I think it's OK for TR1 (and C++0x?) to specify the default regex_traits::in_range_nocase behavior solely in terms of ctype. The hope is that eventually, C++ will get real Unicode support, and we can require more of regex_traits then. But the key is giving people a way to get full Unicode support if they so choose.

I agree Unicode support is clearly desirable: however on point of procedure, I believe it's too late to change this for TR1, changes for C++0x are clearly still possible though. Whatever, we need to file this as a DR.

...

As an interesting data point, I wonder how the regex package in ICU handles this.

<aside> Have you seen http://www.unicode.org/reports/tr18/? It's the Unicode Consortium's recommendations for Unicode-compliant regex. Very sobering. I would like our goal with the regex_traits to be to provide hooks so that full Level 3 compliance with TR18 is possible (but not required). We're far from that goal now. I'm pretty sure we couldn't even provide Level 1, which requires the proper handling of surrogate pairs. </aside>

Yes I'm familiar with that, I had to twist the regex interface a little to handle different Unicode encoding formats, but you can now search UTF-8, UTF-16 or UTF-32 text with Boost.Regex (see libs/regex/doc/icu_strings.html). Boost.Regex conformance to the Unicode Regex TR is documented at libs/regex/doc/standards.html. And you're right, there's still plenty to do.

The most pressing point for level 1 support is section 1.5 Caseless Matching:

"Supported, note that at this level, case transformations are 1:1, many to many case folding operations are not supported (for example "ß" to "SS")."

This is a real gotcha, neither your suggested API's above, nor the C/C++ API's provide support for many to many case transformations. The problem to be solved here is analogous to that of canonical equivalence, it would be much easier to solve by processing both the string and the expression into the same normalised form (think iterator adapters), except that wouldn't be ECMA compatible again :-(

Just when you thought you had a hold on it, something comes along and bites you on the **** :->

Ducks and runs for cover, John.

John Maddock

12:24 p.m.

...

The most pressing point for level 1 support is section 1.5 Caseless Matching: "Supported, note that at this level, case transformations are 1:1, many to many case folding operations are not supported (for example "ß" to "SS"). "

I forgot to mention: this is part of a larger digraph problem - in some languages more than one character may collate as a single unit - in some cases Unicode may provide predefined ligatures for these, but they don't do so for every case combination of every ligature. Boost.Regex supports things like [[.ae.]-[.ll.]] (match anything that collates in the range "ae" to "ll"), and currently this should work reasonably well in case insensitive mode as well (it fails where a many-to-one case transformation is required).

Also, since there is no way to tell which digraphs (if any) are supported by the current locale, expressions such as [a-z] will only ever match one character, and never match say "ae", even if the current locale does regard "ae" as a single unit. I believe this is the only sensible option, particularly as in many cases whether the next two characters are regarded as a digraph is dependent upon the meaning of the word (which is to say you need a dictionary to work it out, as Martin Bonner pointed out).

Re ICU: this appears to use case folding (convert everything to a case insensitive form) for caseless comparisons, I would assume their regex component does the same, but haven't had a chance to try it out.

John.

Eric Niebler

5:21 p.m.

John Maddock wrote:

...

...
No. We're only talking about case folding -- specifically the mappings found in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt.

Well maybe you are, but the regex traits classes were always intended to allow for other forms of equivalence as well.

...

...
Here's my suggestion. We add to the traits class two functions:

bool in_range(Char from, Char to, Char ch);

What use is this one, or are you allowing equivalents other than case folding now <wink>? If so then I approve :-)

I dunno. I threw it in for completeness, but I don't think any implementation besides: return from <= ch && ch <= to; makes sense. You don't want to do any character translations or fancy equivalence stuff here. Consider what happens if translate(from) > translate(to).
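
A short sketch of the hazard raised in the last sentence, assuming ASCII and using `std::tolower` as a stand-in for the traits translation (the thread says `translate`; the case-folding variant is used here only because it makes the effect visible). `in_range_translated` is a hypothetical name for an `in_range` that translates its arguments first.

```cpp
#include <cassert>
#include <cctype>

// Stand-in for a traits translation function (assumption: tolower).
inline char translate_nocase(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// The only implementation said to make sense: a plain comparison.
inline bool in_range(char from, char to, char ch) {
    return from <= ch && ch <= to;
}

// What happens if the range ends are translated before comparing.
inline bool in_range_translated(char from, char to, char ch) {
    return in_range(translate_nocase(from), translate_nocase(to),
                    translate_nocase(ch));
}

// Quick self-check: [Z-a] becomes the inverted, empty range [z-a].
inline void check_translated_range() {
    assert(translate_nocase('Z') > translate_nocase('a'));
    assert(in_range('Z', 'a', 'a'));
    assert(!in_range_translated('Z', 'a', 'a'));
}
```

Once the ends are folded, [Z-a] turns into [z-a], whose lower end sorts after its upper end, so the range matches nothing at all.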

...

...
bool in_range_nocase(Char from, Char to, Char ch);

OK, but see below.

...
We define the behavior of the regex engine in terms of these functions, but we don't require their use. In particular, for narrow character sets, implementers would be free to use a std::bitset<256>, enumerate the char range [from, to], call translate_nocase on each char, and set the appropriate bit in the bitset. Matching happens by calling translate_nocase on the input char and seeing if its bit is set in the bitset. That gives the same behavior.

I don't like traits class API's that may or may not be called: what happens if a user defined traits class is provided that alters the behavior of in_range, but not translate? The side effects produced by these API's are clearly visible.

As I suggest above, I don't think in_range should depend on translate. Your point is still valid, though, but the optimization is too important to ignore. We could standardize a specialization of regex_traits<char> (like the specialization of char_traits<char>) for which the behavior is known. Or more generally, we could require that for all regex traits for which 1==sizeof(char_type) then in_range_nocase is required to give the same results as the algorithm described above.

...

I agree Unicode support is clearly desirable: however on point of procedure, I believe it's too late to change this for TR1, changes for C++0x are clearly still possible though. Whatever, we need to file this as a DR.

Agreed. How does one file a DR? On comp.std.c++? Do you want to do the honors, or should I?

...

The most pressing point for level 1 support is section 1.5 Caseless Matching: "Supported, note that at this level, case transformations are 1:1, many to many case folding operations are not supported (for example "ß" to "SS"). "

The way I read this, a 1:1 mapping is all that is needed for Level 1 support. So we don't have to worry about "ß" to "SS" unless we are shooting for Level 2 or 3, which IMO we should. But that's a radical change from TR1 regex. Let's fix what we got first. -- Eric Niebler Boost Consulting www.boost-consulting.com

Beman Dawes

27 Jun

10:07 p.m.

At 01:21 PM 6/24/2005, Eric Niebler wrote:

...

John Maddock wrote:

...
I agree Unicode support is clearly desirable: however on point of procedure, I believe it's too late to change this for TR1, changes for C++0x are clearly still possible though. Whatever, we need to file this as a DR.

Agreed. How does one file a DR? On comp.std.c++?

On comp.std.c++ or email directly to Howard Hinnant. Subject should begin "Defect Report:" --Beman

John Maddock

28 Jun

9:32 a.m.

...

...
Agreed. How does one file a DR? On comp.std.c++?

On comp.std.c++ or email directly to Howard Hinnant. Subject should begin "Defect Report:"

Note to Eric: I'm planning to be in touch about this soon, but I'm still reading up about how Unicode handles caseless matching at present: as far as I can see so far it uses case folding, which is what Boost.Regex does now, and isn't really compatible with the ECMAScript semantics. I'm starting to think you could write a whole book on this subject alone.... :-( John.

Eric Niebler

29 Jun

7:12 p.m.

John Maddock wrote:

...

...
...
Agreed. How does one file a DR? On comp.std.c++?

On comp.std.c++ or email directly to Howard Hinnant. Subject should begin "Defect Report:"

Note to Eric: I'm planning to be in touch about this soon, but I'm still reading up about how Unicode handles caseless matching at present

OK, I'll hold off on filing a DR until we have some recommendations. FWIW, I no longer like my suggestion of adding in_range and in_range_nocase to the traits. They are both sub-optimal. The first makes it impossible to binary_search a sorted sparse vector of ranges, and the second forces you to do case folding repeatedly for every [from, to] range in the vector, instead of just once. Still thinking.... -- Eric Niebler Boost Consulting www.boost-consulting.com
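
For context, the lookup an engine would like to perform on a sorted sparse vector of ranges can be sketched like this. It is an assumption-laden illustration: `range` and `in_any_range` are invented names, the ranges are taken to be sorted by their lower end and non-overlapping, and comparisons are plain code-point comparisons rather than calls into a traits class.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using range = std::pair<char32_t, char32_t>;  // inclusive [from, to]

// Binary search over ranges sorted by 'from' and non-overlapping.
inline bool in_any_range(const std::vector<range>& ranges, char32_t ch) {
    // Find the first range that starts after ch ...
    auto it = std::upper_bound(
        ranges.begin(), ranges.end(), ch,
        [](char32_t c, const range& r) { return c < r.first; });
    if (it == ranges.begin()) return false;  // ch precedes every range
    --it;                                    // ... then step back one
    return ch <= it->second;
}

// Quick self-check with three ASCII ranges.
inline void check_sparse_ranges() {
    const std::vector<range> rs = {{U'0', U'9'}, {U'A', U'Z'}, {U'a', U'z'}};
    assert(in_any_range(rs, U'M'));
    assert(!in_any_range(rs, U'@'));
}
```

The search depends on the engine knowing the ordering of the range ends. If each membership test were an opaque traits call, the engine would have to visit the ranges one by one, and a case-insensitive variant would repeat the folding for every range it visits.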

David Abrahams

1 Jul

3:07 a.m.

"Eric Niebler" <eric@boost-consulting.com> writes:

...

John Maddock wrote:

...
...
...
Agreed. How does one file a DR? On comp.std.c++?

On comp.std.c++ or email directly to Howard Hinnant. Subject should begin "Defect Report:"

Note to Eric: I'm planning to be in touch about this soon, but I'm still reading up about how Unicode handles caseless matching at present

OK, I'll hold off on filing a DR until we have some recommendations.

Suggestion: don't keep this under your hat. At least alert the LWG and say that you will have recommendations in a few days. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Eric Niebler

6:49 a.m.

David Abrahams wrote:

...

"Eric Niebler" <eric@boost-consulting.com> writes:

...
OK, I'll hold off on filing a DR until we have some recommendations.

Suggestion: don't keep this under your hat. At least alert the LWG and say that you will have recommendations in a few days.

That's probably good advice. I've sent a DR to comp.std.c++.

Also, I may have found another issue, closely related to the one under discussion. It regards case-insensitive matching of named character classes. The regex_traits<> provides two functions for working with named char classes: lookup_classname and isctype. To match a char class such as [[:alpha:]], you pass "alpha" to lookup_classname and get a bitmask. Later, you pass a char and the bitmask to isctype and get a bool yes/no answer.

But how does case-insensitivity work in this scenario? Suppose we're doing a case-insensitive match on [[:lower:]]. It should behave as if it were [[:lower:][:upper:]], right? But there doesn't seem to be enough smarts in the regex_traits interface to do this.

Imagine I write a traits class which recognizes [[:fubar:]], and the "fubar" char class happens to be case-sensitive. How is the regex engine to know that? And how should it do a case-insensitive match of a character against the [[:fubar:]] char class? John, can you confirm this is a legitimate problem?

I see two options:

1) Add a bool icase parameter to lookup_classname. Then, lookup_classname( "upper", true ) will know to return lower|upper instead of just upper.

2) Add an isctype_nocase function

I prefer (1) because the extra computation happens at the time the pattern is compiled rather than when it is executed. -- Eric Niebler Boost Consulting www.boost-consulting.com
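
Option (1) might look roughly like the sketch below. It is a hypothetical free-function version, not the TR1 interface: it knows only four class names, takes a `std::string` rather than an iterator pair, and answers `isctype` through the classic "C" locale's `std::ctype<char>` facet.

```cpp
#include <cassert>
#include <locale>
#include <string>

using mask = std::ctype_base::mask;

// Hypothetical lookup_classname with the proposed icase flag. Unknown
// names yield an empty mask.
inline mask lookup_classname(const std::string& name, bool icase) {
    mask m = mask();
    if (name == "alpha") m = std::ctype_base::alpha;
    else if (name == "digit") m = std::ctype_base::digit;
    else if (name == "lower") m = std::ctype_base::lower;
    else if (name == "upper") m = std::ctype_base::upper;
    // The widening is decided once, when the pattern is compiled.
    if (icase && (name == "lower" || name == "upper"))
        m = static_cast<mask>(std::ctype_base::lower | std::ctype_base::upper);
    return m;
}

// Stand-in for regex_traits::isctype, using the classic "C" locale.
inline bool isctype(char c, mask m) {
    return std::use_facet<std::ctype<char>>(std::locale::classic()).is(m, c);
}

// Quick self-check: [[:lower:]] matched with and without icase.
inline void check_lookup_icase() {
    assert(!isctype('A', lookup_classname("lower", false)));
    assert(isctype('A', lookup_classname("lower", true)));
}
```

Since the traits class owns the name lookup, a user-defined class such as "fubar" could return whatever mask is appropriate for case-insensitive matching, which the engine cannot work out on its own.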

John Maddock

10:41 a.m.

...

Also, I may have found another issue, closely related to the one under discussion. It regards case-insensitive matching of named character classes. The regex_traits<> provides two functions for working with named char classes: lookup_classname and isctype. To match a char class such as [[:alpha:]], you pass "alpha" to lookup_classname and get a bitmask. Later, you pass a char and the bitmask to isctype and get a bool yes/no answer.

But how does case-insensitivity work in this scenario? Suppose we're doing a case-insensitive match on [[:lower:]]. It should behave as if it were [[:lower:][:upper:]], right? But there doesn't seem to be enough smarts in the regex_traits interface to do this.

I've always thought that a case insensitive match for [[:lower:]] was an abomination frankly, but here's how I currently handle it: If the final bitmask contains all of the bits of the mask returned by lookup_classname("lower") or all the bits of the mask returned by lookup_classname("upper") then I OR the mask with the result of lookup_classname("alpha").

...

Imagine I write a traits class which recognizes [[:fubar:]], and the "fubar" char class happens to be case-sensitive. How is the regex engine to know that? And how should it do a case-insensitive match of a character against the [[:fubar:]] char class? John, can you confirm this is a legitimate problem?

OK, user defined classes may be an issue (see below).

...

I see two options:

1) Add a bool icase parameter to lookup_classname. Then, lookup_classname( "upper", true ) will know to return lower|upper instead of just upper.

2) Add an isctype_nocase function

I prefer (1) because the extra computation happens at the time the pattern is compiled rather than when it is executed.

If we're going to change this then (1) is definitely preferable, it's quite a small change after all. In fact I suspect this may be a real bug in the current Boost.Regex Unicode support: matching a case insensitive [[:Ll:]] will only match lower case letters. Although frankly which of the other L* categories it should match is an open question: should it match Lo or Lm for example? Head swimmingly yours, John.
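
The workaround described earlier in this message (OR the mask with "alpha" when it contains all of "lower" or all of "upper") can be sketched with the standard `std::ctype_base` masks. This is an approximation for illustration, not the Boost.Regex source: `widen_for_icase` is an invented name, and the standard mask constants stand in for the results of lookup_classname.

```cpp
#include <cassert>
#include <locale>

using mask = std::ctype_base::mask;

// If the class mask contains every bit of "lower" or every bit of
// "upper", widen it with "alpha" for case-insensitive matching.
inline mask widen_for_icase(mask m) {
    const mask lower = std::ctype_base::lower;
    const mask upper = std::ctype_base::upper;
    if ((m & lower) == lower || (m & upper) == upper)
        m = static_cast<mask>(m | std::ctype_base::alpha);
    return m;
}

// Stand-in for regex_traits::isctype, using the classic "C" locale.
inline bool isctype(char c, mask m) {
    return std::use_facet<std::ctype<char>>(std::locale::classic()).is(m, c);
}

// Quick self-check: a widened [[:lower:]] accepts upper-case input.
inline void check_widen() {
    assert(!isctype('A', std::ctype_base::lower));
    assert(isctype('A', widen_for_icase(std::ctype_base::lower)));
}
```

The engine can apply this without help from the traits class only because it knows the names "lower", "upper" and "alpha". For a user-defined class it has nothing to compare against, which is the gap the icase parameter is meant to close.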

Eric Niebler

3:24 p.m.

John Maddock wrote:

...

...
Imagine I write a traits class which recognizes [[:fubar:]], and the "fubar" char class happens to be case-sensitive. How is the regex engine to know that? And how should it do a case-insensitive match of a character against the [[:fubar:]] char class? John, can you confirm this is a legitimate problem?

<snip>

...

In fact I suspect this may be a real bug in the current Boost.Regex Unicode support: matching a case insensitive [[:Ll:]] will only match lower case letters. Although frankly which of the other L* categories it should match is an open question: should it match Lo or Lm for example?

Wow, I don't know. Do whatever ICU does. :-) My guess is no, the Lo and Lm categories don't seem to be involved in case folding. OK, I'll file another DR. -- Eric Niebler Boost Consulting www.boost-consulting.com
