New subject: [Boost-Users] regex with double byte character sets

24 Jun 2002

      Hi all,

I am currently trying to use the Boost regex library with Japanese
language strings.  It appears that DBCS (double-byte character sets)
is not supported. For example, using the code below (compiled with
BOOST_REGEX_USE_C_LOCALE defined) I get these output strings:

0 = "。"
1 = "English"

Instead of the expected:

0 = "やゆよわをー。"
1 = "English"

This happens because the Japanese (SJIS) encoding of one of these
characters contains the byte value of the [ character as one of its
two bytes, so the byte-wise match treats it as a literal [.
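
To make the collision concrete, here is a minimal sketch. It assumes that the prolonged sound mark "ー" is encoded in Shift-JIS as the two bytes 0x81 0x5B, where 0x5B is also the ASCII code for [. The helper names are hypothetical and only for illustration; this is not how the regex library scans internally, just what any byte-at-a-time search would see.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sample: the assumed Shift-JIS bytes for "ー" (0x81 0x5B)
// followed by a genuine ASCII '[' (0x5B).
inline std::string sample_sjis() {
    const unsigned char bytes[] = { 0x81, 0x5B, 0x5B };
    return std::string(reinterpret_cast<const char*>(bytes), sizeof bytes);
}

// Byte-wise search: every char is treated as a complete character,
// so the trail byte of "ー" is indistinguishable from a real '['.
inline std::size_t naive_find_bracket(const std::string& s) {
    return s.find('[');
}
```

Calling naive_find_bracket(sample_sjis()) reports the position of the trail byte inside "ー" rather than the position of the real bracket that follows it.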

    // needs <clocale>, <iterator>, <string>, <vector>, <boost/regex.hpp>
    setlocale( LC_COLLATE, "Japanese" );
    setlocale( LC_CTYPE,   "Japanese" );

    const char * pszText = "やゆよわをー。 [english]";
    const char * pszRule = "([^\\[]*)\\[([[:word:]]*)\\]";

    // split the string into its components
    std::vector<std::string> vPart;
    boost::regex eParseExpr( pszRule,
        boost::regbase::normal | boost::regbase::icase );
    boost::regex_split( std::back_inserter(vPart),
        std::string(pszText), eParseExpr );

Is there some way to modify the library to enable DBCS?  For
example, can the char_traits be modified to enable DBCS processing?
(Keeping in mind that the biggest problem with DBCS is that a single
character may consist of 2 bytes, which breaks any assumption that
one char is one character.)
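
As an illustration of what "DBCS processing" would have to do, the sketch below steps through a string one character at a time instead of one byte at a time. It assumes CP932-style Shift-JIS, where lead bytes fall in the ranges 0x81-0x9F and 0xE0-0xFC; both function names are hypothetical and are not part of Boost.

```cpp
#include <cstddef>
#include <string>

// Assumption: CP932-style Shift-JIS lead byte ranges.
inline bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: find a single-byte character while skipping the
// trail byte of every double-byte character, so that a trail byte equal
// to `target` is never reported as a match.
inline std::size_t sjis_find(const std::string& s, char target) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead(b) && i + 1 < s.size()) {
            i += 2;              // skip lead byte and trail byte together
            continue;
        }
        if (s[i] == target) {
            return i;            // genuine single-byte match
        }
        ++i;
    }
    return std::string::npos;
}
```

A regex engine would need the same lead-byte awareness at every place where it advances by one character, which is why changing char_traits alone may not be enough.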

Brodie.
