regex with double byte character sets
Hi all,
I am currently trying to use the Boost regex library with Japanese-language strings. It appears that DBCS is not supported. For example, with the following code (compiled with BOOST_REGEX_USE_C_LOCALE defined), I get the output strings:
0 = "。" 1 = "English"
Instead of the expected:
0 = "やゆよわをー。" 1 = "English"
This happens because the SJIS encoding of one of these characters contains the byte value of the [ character.
setlocale( LC_COLLATE, "Japanese" );
setlocale( LC_CTYPE, "Japanese" );
const char * pszText = "やゆよわをー。 [english]";
const char * pszRule = "([^\\[]*)\\[([[:word:]]*)\\]";
// split the string into its components
std::vector<std::string> vPart;
boost::regex eParseExpr( pszRule, boost::regbase::normal | boost::regbase::icase );
// regex_split takes its input string by non-const reference, so use a named std::string
std::string sText( pszText );
boost::regex_split( std::back_inserter(vPart), sText, eParseExpr );
Is there some way to modify the library to enable DBCS? For example, can the char_traits be modified to enable DBCS processing? (Keep in mind that the biggest problem with DBCS is that a single character may consist of 2 bytes, which tends to break all assumptions about the size of characters.)
Brodie.
To be honest I know nothing at all about DBCS, but I assumed that every code point was represented by *exactly two* characters. If that's the case then I think it might be possible: one would have to create a new data type, something like:
struct DBCS_proxy { char bytes[2]; };
then create a traits class for DBCS_proxy, and cast all char* strings to DBCS_proxy*'s when calling the regex functions. Really I'm just thinking out loud here; I haven't tried it and I don't know if it would work.
Otherwise can you use Unicode?
John Maddock http://ourworld.compuserve.com/homepages/john_maddock/index.htm
On Sun, 30 Jun 2002 04:20:32 -0700, John Maddock wrote:
I am currently trying to use the boost regex library with Japanese language strings. It appears like DBCS is not supported. For example, using the following code (with compile definition of BOOST_REGEX_USE_C_LOCALE) I get the output strings as
0 = "。" 1 = "English"
Instead of the expected:
0 = "やゆよわをー。" 1 = "English"
This is due to the fact that the Japanese (SJIS encoding) for one of these characters uses the [ character as one of the characters in the encoding.
[snip]
Brodie.
To be honest I know nothing at all about DBCS, but I assumed that every code point was represented by *exactly two* characters. If that's the case then I think it might be possible
DBCS encodings like SJIS are variable-width. But the real problem is that given an iterator into a DBCS string, it is impossible to tell where the previous character starts without walking back to the beginning of the string. So you can really only make a forward DBCS iterator, not a bidirectional one. And I think regex++ requires bidirectional iterators, right John?
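The forward-only scanning described above can be sketched as follows. This is a minimal illustration, not part of the regex library: the helper names is_sjis_lead_byte, sjis_char_starts and find_single_byte are hypothetical, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are the usual Shift-JIS ones, assumed here without validating trail bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: Shift-JIS lead bytes fall in 0x81-0x9F or 0xE0-0xFC.
bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: byte offsets at which each character begins.
// The scan has to start at the beginning of the string, because a byte
// such as 0x5B can be either '[' or the second half of a two-byte character.
std::vector<std::size_t> sjis_char_starts(const std::string& s) {
    std::vector<std::size_t> starts;
    std::size_t i = 0;
    while (i < s.size()) {
        starts.push_back(i);
        unsigned char b = static_cast<unsigned char>(s[i]);
        // A lead byte consumes the following byte as well, if there is one.
        i += (is_sjis_lead_byte(b) && i + 1 < s.size()) ? 2 : 1;
    }
    return starts;
}

// Hypothetical helper: offset of the first occurrence of ch that is a
// genuine single-byte character, or std::string::npos if there is none.
std::size_t find_single_byte(const std::string& s, char ch) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead_byte(b) && i + 1 < s.size()) {
            i += 2;  // skip both bytes of a two-byte character
            continue;
        }
        if (s[i] == ch) return i;
        ++i;
    }
    return std::string::npos;
}
```

With the SJIS bytes 0x81 0x5B (ー) followed by 0x81 0x42 (。), a space and a real '[', a plain byte search reports the '[' at offset 1, inside the first character, while the forward scan reports it at offset 5. A byte-oriented regex makes the same mistake as the plain search.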
Otherwise can you use Unicode?
Yup, use Unicode.
Eric
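As a rough sketch of the "use Unicode" suggestion: once the text is held as wide characters, each Japanese character is a single element and '[' can no longer appear inside another character. The example below uses the standard library's std::wregex as a stand-in, not the Boost code from the original post (with Boost one would presumably use boost::wregex). The helper name split_tagged is hypothetical, \w replaces [[:word:]], and the input is written with \u escapes so the result does not depend on the source file's encoding. It also assumes that every character fits in one wchar_t, which holds for the characters used here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: splits "<text>[<word>]" into <text> and <word>.
// Returns false if the input does not have that shape.
bool split_tagged(const std::wstring& text, std::wstring& head, std::wstring& tag) {
    // Same rule as in the original post, but matched against wide characters.
    static const std::wregex rule(L"([^\\[]*)\\[(\\w*)\\]");
    std::wsmatch m;
    if (!std::regex_match(text, m, rule)) return false;
    head = m[1].str();
    tag = m[2].str();
    return true;
}
```

Called with the wide-character form of the string from the original post, head receives the whole Japanese text (including the trailing space) and tag receives "english".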
Okay. I kind of expected that this would be the case. Thanks for your responses. Regards, Brodie.
DBCS encodings like SJIS are variable-width. But the real problem is that given an iterator into a DBCS string, it is impossible to tell where the previous character starts without walking back to the beginning of the string. So you can really only make a forward DBCS iterator, not a bidirectional one. And I think regex++ requires bidirectional iterators, right John?
Yes, that's right; obviously my understanding of DBCS strings was flawed.
John Maddock http://ourworld.compuserve.com/homepages/john_maddock/index.htm
participants (4)
- Brodie Thiesfield
- bthiesfield
- Eric Niebler
- John Maddock