On Sun, 30 Jun 2002 04:20:32 -0700, John Maddock wrote:
I am currently trying to use the boost regex library with Japanese-language strings. It appears that DBCS is not supported. For example, using the following code (compiled with BOOST_REGEX_USE_C_LOCALE defined), I get these output strings:
0 = "。" 1 = "English"
Instead of the expected:
0 = "やゆよわをー。" 1 = "English"
This happens because the SJIS encoding of one of these Japanese characters contains the byte value of the [ character.
[snip]
Brodie.
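A minimal sketch of the byte-level collision described above. It assumes the character involved is the prolonged sound mark ー, whose Shift-JIS encoding is the two bytes 0x81 0x5B; the original post does not say which character it is, and the helper names are hypothetical. The second byte has the same value as ASCII '[', so any code that inspects the string one char at a time sees a '[' that is not really there.

```cpp
#include <cassert>
#include <string>

// Shift-JIS bytes for the single character "ー" (U+30FC).
// Assumption for illustration: lead byte 0x81, trail byte 0x5B.
inline std::string sjis_prolonged_sound_mark() {
    const char bytes[] = {static_cast<char>(0x81), static_cast<char>(0x5B)};
    return std::string(bytes, sizeof bytes);
}

// Byte-wise search, which is how a char-based matcher views the string.
inline bool contains_byte(const std::string& s, char b) {
    return s.find(b) != std::string::npos;
}
```

Calling contains_byte(sjis_prolonged_sound_mark(), '[') reports a match even though the text holds one Japanese character and no bracket.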
To be honest I know nothing at all about DBCS, but I assumed that every code point was represented by *exactly two* characters. If that's the case then I think it might be possible
DBCS encodings like SJIS are variable-width. But the real problem is that given an iterator into a DBCS string, it is impossible to tell where the previous character starts without walking back to the beginning of the string. So you can really only make a forward DBCS iterator, not a bidirectional one. And I think regex++ requires bidirectional iterators, right John?
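A rough sketch of the forward-only point, assuming Shift-JIS with lead bytes in 0x81-0x9F and 0xE0-0xFC. The function names are hypothetical and this is not how regex++ handles anything. Character boundaries are found by walking from the start. The two byte strings in the usage note end in the same two bytes, 0x83 0x41, yet the final 0x41 is a trail byte in one and a standalone 'A' in the other, so the last two bytes alone cannot tell you where the final character begins.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <vector>

// Assumed Shift-JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
inline bool is_sjis_lead_byte(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Hypothetical helper: walk forward from the start and record the
// byte offset at which each character begins.
inline std::vector<std::size_t> sjis_char_starts(const std::string& s) {
    std::vector<std::size_t> starts;
    std::size_t i = 0;
    while (i < s.size()) {
        starts.push_back(i);
        const bool two_bytes =
            is_sjis_lead_byte(static_cast<unsigned char>(s[i])) && i + 1 < s.size();
        i += two_bytes ? 2 : 1;
    }
    return starts;
}

// Convenience for building test strings from raw byte values.
inline std::string from_bytes(std::initializer_list<unsigned char> bytes) {
    return std::string(bytes.begin(), bytes.end());
}
```

For from_bytes({0x83, 0x41}) the only character starts at offset 0, so the trailing 0x41 is a trail byte. For from_bytes({0x83, 0x83, 0x41}) the characters start at offsets 0 and 2, so the same trailing 0x41 is a character of its own.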
Otherwise can you use Unicode?
Yup, use Unicode.
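As a small illustration of why Unicode sidesteps the problem, here is the expected match text from the original report held as UTF-16 code units. This is a sketch with hypothetical helper names, and it assumes every character is in the Basic Multilingual Plane, which is true of this sample. Each character occupies one code unit, none of which equals '[', and stepping backwards is a plain decrement.

```cpp
#include <cassert>
#include <string>

// The sample text "やゆよわをー。" written with explicit code points.
inline std::u16string sample_text_utf16() {
    return u"\u3084\u3086\u3088\u308F\u3092\u30FC\u3002";
}

// Search by whole code unit rather than by byte.
inline bool contains_code_unit(const std::u16string& s, char16_t unit) {
    return s.find(unit) != std::u16string::npos;
}
```

Here contains_code_unit(sample_text_utf16(), u'[') is false, while the prolonged sound mark is found as the single unit U+30FC.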
John Maddock http://ourworld.compuserve.com/homepages/john_maddock/index.htm
Eric