Re: regex with double byte character sets

2 Jul 2002


      On Sun, 30 Jun 2002 04:20:32 -0700, John Maddock wrote:
...
...
I am currently trying to use the boost regex library with Japanese
language strings.  It appears that DBCS is not supported. For example,
using the following code (with compile definition of
BOOST_REGEX_USE_C_LOCALE) I get the output strings as
0 = "。"
1 = "English"
Instead of the expected:
0 = "やゆよわをー。"
1 = "English"
This is due to the fact that the Japanese (SJIS encoding) for one of
these characters uses the [ character as one of the characters in the
encoding.
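
(Illustration only, not the code snipped from the original post.) A minimal sketch of the collision described above, assuming the usual Shift_JIS lead-byte ranges 0x81-0x9F and 0xE0-0xFC. The character "ー" is encoded as the bytes 0x81 0x5B, and 0x5B is also ASCII '['. The helper names is_sjis_lead, find_byte and find_sjis_char are hypothetical and are not part of Boost.Regex.

```cpp
#include <cstddef>
#include <string>

// Assumed Shift_JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Byte-oriented search: treats every byte as a whole character,
// which is what a char-based matcher effectively does.
std::size_t find_byte(const std::string& s, char target) {
    return s.find(target);
}

// Character-aware forward search: steps over both bytes of a
// double-byte character so trail bytes are never compared.
std::size_t find_sjis_char(const std::string& s, char target) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead(b) && i + 1 < s.size()) {
            i += 2;  // skip lead byte and trail byte together
            continue;
        }
        if (s[i] == target) return i;
        ++i;
    }
    return std::string::npos;
}

// "ー。" in Shift_JIS (0x81 0x5B, 0x81 0x42) followed by ASCII text.
// The literal is split so that 'E' is not parsed as a hex digit.
const std::string sample = "\x81\x5B\x81\x42" "English";
```

With this sample, the byte-oriented search reports a '[' inside the first Japanese character, while the character-aware search finds none. That is consistent with a '[' being seen where none was intended.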
[snip]
...
Brodie.
To be honest I know nothing at all about DBCS, but I assumed that every
code point was represented by *exactly two* characters.  If that's the
case then I think it might be possible
DBCS encodings like SJIS are variable-width. But the real problem is that
given an iterator into a DBCS string, it is impossible to tell where the
previous character starts without walking back to the beginning of the
string. So you can really only make a forward DBCS iterator, not a
bidirectional one.  And I think regex++ requires bidirectional iterators,
right John?
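
A rough sketch of why stepping backwards is the hard part, again assuming Shift_JIS lead-byte ranges 0x81-0x9F and 0xE0-0xFC. The function name prev_char_start is hypothetical. It finds the start of the character that ends at pos by scanning forward from the beginning of the string, because the bytes just before pos do not settle the question on their own.

```cpp
#include <cstddef>
#include <string>

// Assumed Shift_JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Returns the index where the character ending at `pos` begins.
// Walks forward from index 0, so the cost grows with `pos`.
std::size_t prev_char_start(const std::string& s, std::size_t pos) {
    std::size_t start = 0;
    std::size_t i = 0;
    while (i < pos) {
        start = i;
        unsigned char b = static_cast<unsigned char>(s[i]);
        i += (is_sjis_lead(b) && i + 1 < s.size()) ? 2 : 1;
    }
    return start;
}

// Both strings end in the bytes 0x83 0x83 0x41, yet the last
// character is one byte long in `a` and two bytes long in `b`.
// Literals are split so that 'A' is not parsed as a hex digit.
const std::string a = "\x83\x83" "A";      // [0x83 0x83] [A]
const std::string b = "\x83\x83\x83" "A";  // [0x83 0x83] [0x83 A]
```

Here prev_char_start(a, a.size()) and prev_char_start(b, b.size()) both return 2, which makes the final character one byte wide in a and two bytes wide in b, even though the trailing bytes are identical. A decrement built this way is linear in the distance from the start of the string, which is why such an iterator is only practical as a forward iterator.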
...
Otherwise can you use Unicode?
Yup, use Unicode.
...
John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm
Eric

Eric Niebler