[Regex] Japanese character equivalence is not working.

Hi, my name is Raveendra Madala and I work for Adobe Systems. We are using boost/regex 1.33.1 in one of our projects and have encountered the following bug. Character equivalence is supposed to find all variations of a given character, but regex fails when we try to use it with Japanese characters. For example, if the text contains 0x30A2 (full-width katakana letter A), 0xFF71 (half-width katakana letter A) and 0x32D0 (circled katakana A), and we enter [[=a=]], where a is katakana A, regex is unable to find the three characters. This is just one example; the same happens for all Japanese characters. How can we work around this issue? If you require more information on this, please do let me know. Thanks, Raveendra

Raveendra Madala wrote :
Character Equivalence is supposed to find all variations of a given character.
The Boost.Regex documentation clearly says that it does not support canonical equivalence. http://www.boost.org/libs/regex/doc/standards.html
How can we work around this issue? If you require more information on this, please do let me know.
The best way would probably be to have full Unicode support in Boost first, and then to build a full Unicode regex engine on top of it.

Hi, we do have Unicode support in Boost and in the regex build that we use. The issue I was referring to is that Tailored Loose Matches, in the standard's terminology, are failing. Is there anything that can be done to overcome this issue? Thanks, Raveendra
-----Original Message-----
From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of loufoque
Sent: Wednesday, September 13, 2006 8:31 PM
To: boost-users@lists.boost.org
Subject: Re: [Boost-users] [Regex] Japanese character equivalence is not working.

Raveendra Madala wrote:
My name is Raveendra Madala. I work for Adobe Systems. We are using boost/regex 1.33.1 in one of our projects and we have encountered the following bug.
Hi Raveendra: you reported this before on the tracker at https://sourceforge.net/tracker/?func=detail&atid=107586&aid=1531909&group_id=7586 and I've been waiting for more information. But to repeat myself: "There is no portable way to make this work, unfortunately: it requires that the regex engine be able to decode the collation string produced by the locale in order to extract the primary equivalence class. The "kind" of sort key used by the platform is determined in a fairly heuristic way in find_sort_syntax() in boost/regex/v4/primary_transform.hpp, and the actual sort key is produced in cpp_regex_traits::primary_transform(). You may, with a bit of debugging, be able to find out what's going wrong (I don't have access to a Mac, BTW). The most important thing would be to find out what kind of sort keys are returned by std::collate<>::transform." That all assumes that you're using boost::wregex: should you be using boost::u32regex and ICU support, then primary sort keys are produced by ICU's collation engine. This should work for all Unicode characters, but if not, let me know. But basically: more information please :-) John.
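One way to gather the information requested above is to dump the sort keys that std::collate<>::transform returns for the three katakana characters from the original report. The following is only a minimal sketch using the standard library: the helper names (sort_key_hex, environment_locale_or_classic, print_katakana_a_keys) are made up for illustration and are not part of Boost.Regex, and the keys printed depend entirely on the platform and the active locale.

```cpp
#include <cstdint>
#include <cstdio>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the sort key that the locale's std::collate
// facet produces for `text`, as space-separated hexadecimal code units.
std::string sort_key_hex(const std::wstring& text, const std::locale& loc) {
    const std::collate<wchar_t>& coll =
        std::use_facet<std::collate<wchar_t>>(loc);
    const std::wstring key =
        coll.transform(text.data(), text.data() + text.size());
    std::string out;
    char buf[16];
    for (wchar_t unit : key) {
        // Go through uint32_t so a signed wchar_t does not sign-extend.
        std::snprintf(buf, sizeof buf, "%lX",
                      static_cast<unsigned long>(static_cast<std::uint32_t>(unit)));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// Hypothetical helper: uses the user's environment locale if it can be
// constructed, otherwise falls back to the classic "C" locale.
std::locale environment_locale_or_classic() {
    try {
        return std::locale("");
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Prints the sort keys for the three characters mentioned in the report.
void print_katakana_a_keys() {
    const std::locale loc = environment_locale_or_classic();
    const std::wstring samples[] = {
        std::wstring(1, static_cast<wchar_t>(0x30A2)),  // katakana letter A
        std::wstring(1, static_cast<wchar_t>(0xFF71)),  // half-width katakana A
        std::wstring(1, static_cast<wchar_t>(0x32D0)),  // circled katakana A
    };
    for (const std::wstring& s : samples) {
        std::printf("U+%04lX -> %s\n",
                    static_cast<unsigned long>(static_cast<std::uint32_t>(s[0])),
                    sort_key_hex(s, loc).c_str());
    }
}
```

Calling print_katakana_a_keys() from main() under the same locale that is imbued into the regex should show whether the three keys share any common prefix or structure that a primary-equivalence comparison could rely on; if they look unrelated, that is presumably the kind of detail worth adding to the tracker item.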
participants (3)
- John Maddock
- loufoque
- Raveendra Madala