Re: [Boost-Users] regex with double byte character sets

30 Jun 2002

      ...
I am currently trying to use the boost regex library with Japanese
language strings.  It appears that DBCS is not supported.  For
example, using the following code (compiled with
BOOST_REGEX_USE_C_LOCALE defined) I get the output strings:
0 = "。"
1 = "English"
Instead of the expected:
0 = "やゆよわをー。"
1 = "English"
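This is because the SJIS (Shift-JIS) encoding of one of these
characters contains the byte 0x5B, which is also the code for the
[ character.  The character is ー, encoded as 0x81 0x5B, so a
byte-oriented match sees a [ in the middle of the Japanese text.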
    // needs <clocale>, <iterator>, <string>, <vector>, <boost/regex.hpp>
    setlocale( LC_COLLATE, "Japanese" );
    setlocale( LC_CTYPE,   "Japanese" );
    const char * pszText = "やゆよわをー。 [english]";
    const char * pszRule = "([^\\[]*)\\[([[:word:]]*)\\]";
    // split the string into its components
    std::vector<std::string> vPart;
    boost::regex eParseExpr( pszRule,
        boost::regbase::normal | boost::regbase::icase );
    // regex_split takes the input by non-const reference
    std::string sText( pszText );
    boost::regex_split( std::back_inserter(vPart), sText, eParseExpr );
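The byte collision described above can be checked without any regex code at all. The following is only a minimal sketch: the helper names choonpu_sjis and first_byte_equal are made up for illustration, and the bytes are written as hex escapes so the source file's own encoding does not matter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Shift-JIS encoding of "ー" (U+30FC): lead byte 0x81, trail byte 0x5B.
// 0x5B is also the ASCII code for '['.
inline std::string choonpu_sjis()
{
    return std::string("\x81\x5B");
}

// Index of the first byte equal to c, looking at raw bytes only.
// This is roughly what a byte-oriented matcher sees.
// (Hypothetical helper, not part of any library.)
inline std::size_t first_byte_equal(const std::string& bytes, char c)
{
    return bytes.find(c);
}
```

Calling first_byte_equal(choonpu_sjis(), '[') finds a "[" at index 1, even though the string contains no bracket character as far as a Japanese reader is concerned.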
Is there some way to modify the library to enable DBCS?  For
example, can the char_traits be modified to enable DBCS processing?
(Keep in mind that the biggest problem with DBCS is that a single
character may consist of 2 bytes, which tends to break any
assumptions about the size of characters.)
Brodie.

To be honest I know nothing at all about DBCS, but I assumed that every code
point was represented by *exactly two* characters.  If that's the case then
I think it might be possible: one would have to create a new data type,
something like:

struct DBCS_proxy
{
    char bytes[2];
};

then create a traits class for DBCS_proxy, and cast all char* strings to
DBCS_proxy* when calling the regex functions.  Really I'm just thinking
out loud here; I haven't tried it and I don't know if it would work.
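As a rough illustration of the proxy idea, here is a sketch that does not rely on every character being two bytes wide. Shift-JIS mixes one-byte characters (ASCII, half-width katakana) with two-byte ones, so a plain cast from char* would misalign. This variant copies instead of casting and packs each character into one 16-bit value. The member code and the helpers is_sjis_lead_byte and to_proxies are hypothetical names, the lead-byte ranges assume the Windows code page 932 flavour of Shift-JIS, and the traits class that a regex library would still need is not attempted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical proxy type: one value per character, whether the character
// occupied one byte or two in the original Shift-JIS text.
struct DBCS_proxy
{
    std::uint16_t code;
};

inline bool operator==(DBCS_proxy a, DBCS_proxy b)
{
    return a.code == b.code;
}

// Assumption: lead bytes are 0x81-0x9F and 0xE0-0xFC (code page 932).
// Any other byte is treated as a complete one-byte character.
inline bool is_sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Pack each character into one proxy, so a trail byte such as 0x5B is
// never seen on its own and cannot be mistaken for '['.
inline std::vector<DBCS_proxy> to_proxies(const std::string& sjis)
{
    std::vector<DBCS_proxy> out;
    for (std::size_t i = 0; i < sjis.size(); ++i)
    {
        unsigned char b = static_cast<unsigned char>(sjis[i]);
        std::uint16_t code = b;
        if (is_sjis_lead_byte(b) && i + 1 < sjis.size())
        {
            // Combine lead and trail byte, then skip the trail byte.
            unsigned char trail = static_cast<unsigned char>(sjis[++i]);
            code = static_cast<std::uint16_t>((b << 8) | trail);
        }
        out.push_back(DBCS_proxy{code});
    }
    return out;
}
```

With this packing, the bytes 0x81 0x5B become the single value 0x815B, which no longer compares equal to a proxy holding '['.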

Otherwise, can you use Unicode?
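To show what the Unicode route might look like, here is a small sketch using wide strings. It uses std::wregex purely so that it needs nothing beyond the standard library; boost::wregex has a similar interface but is not exercised here. The Parts struct and split_tagged are hypothetical names, \w stands in for the [[:word:]] class of the original rule, and the conversion from SJIS to wide characters (for example with mbstowcs under a Japanese locale) is assumed to have happened already.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct Parts
{
    bool matched;
    std::wstring text;  // everything before the '['
    std::wstring tag;   // the word inside the brackets
};

// Each character in this example is a single wchar_t, so no part of
// "ー" (U+30FC) compares equal to L'['.
inline Parts split_tagged(const std::wstring& input)
{
    static const std::wregex rule(L"([^\\[]*)\\[(\\w*)\\]");
    std::wsmatch m;
    Parts p{false, L"", L""};
    if (std::regex_match(input, m, rule))
    {
        p.matched = true;
        p.text = m[1].str();
        p.tag = m[2].str();
    }
    return p;
}
```

Passing the wide form of the test string, L"\u3084\u3086\u3088\u308F\u3092\u30FC\u3002 [english]", keeps the whole Japanese prefix in text and puts "english" in tag.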

John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm