something about UTF8
Hi guys,
I want to use boost::regex on Windows XP to match Japanese kanji. The kanji are encoded in UTF-8. After I use the function MultiByteToWideChar to convert the UTF-8 kanji string to a wstring, can I use boost::wregex (constructed from that wstring) directly to match Japanese?
I'd appreciate any help.
Worldwind
On Wed, Dec 17, 2008 at 1:54 AM, wind world <worldwindjp@gmail.com> wrote:
Hi guys, I want to use boost::regex on Windows XP to match Japanese kanji. The kanji are encoded in UTF-8. After I use the function MultiByteToWideChar to convert the UTF-8 kanji string to a wstring, can I use boost::wregex (constructed from that wstring) directly to match Japanese?
I'm not an expert in this, but if you compile Boost.Regex with ICU support, it should fully support such languages, whereas plain wide-character matching may not. Someone else may need to come around and confirm...
wind world wrote:
Hi guys, I want to use boost::regex on Windows XP to match Japanese kanji. The kanji are encoded in UTF-8. After I use the function MultiByteToWideChar to convert the UTF-8 kanji string to a wstring, can I use boost::wregex (constructed from that wstring) directly to match Japanese?
You would need to check the Windows API docs to make sure you're using the API correctly (does it work with UTF-8 as source? No idea on that), but yes, once you have the text encoded as UTF-16, wregex will behave as you expect. Otherwise you could build Boost.Regex with ICU support and then match UTF-8 directly; the downside is that you then have a dependency on ICU, which is not a small library.
HTH, John.
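On the open question above: MultiByteToWideChar does accept UTF-8 as the source encoding when CP_UTF8 is passed as the code page. The following is only a minimal sketch of that conversion. The helper name utf8_to_wide is hypothetical, dwFlags is left at 0 as a conservative choice, and the non-Windows branch is a simplified stand-in (it assumes a wchar_t wide enough for a full code point and does not reject every malformed sequence) that exists only so the sketch compiles outside Windows.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert a UTF-8 encoded std::string to std::wstring.
std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
#ifdef _WIN32
    const int src_len = static_cast<int>(utf8.size());
    // First call: ask how many UTF-16 code units the result needs.
    // dwFlags is 0; check the API docs before using other flags with CP_UTF8.
    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), src_len, NULL, 0);
    if (len <= 0)
        throw std::runtime_error("MultiByteToWideChar failed");
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    // Second call: convert into the buffer we just sized.
    if (MultiByteToWideChar(CP_UTF8, 0, utf8.data(), src_len, &wide[0], len) != len)
        throw std::runtime_error("MultiByteToWideChar failed");
    return wide;
#else
    // Stand-in for platforms without the Win32 API (assumes wchar_t can
    // hold a whole code point, as is typical where it is 32 bits wide).
    std::wstring wide;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (extra > utf8.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        wide.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
    return wide;
#endif
}
```

The resulting wstring can then be handed to boost::wregex matching functions. Calling the API twice (once to size the buffer, once to convert) avoids guessing the output length.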
Working with wstrings in the regex library should work without problems, except that you cannot rely on Unicode-specific character classes. Just make sure you convert correctly between UTF-8 and wide-character strings.
Rune
(in reply to John Maddock <john@johnmaddock.co.uk>, Thu, Dec 18, 2008 at 10:39 AM)
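To illustrate the point about character classes: without Unicode property classes, one option is to spell out an explicit code point range in the pattern. The sketch below assumes that the CJK Unified Ideographs block (U+4E00 to U+9FFF) is a good enough approximation of "kanji" for the task; it leaves out the extension blocks as well as hiragana and katakana. The function name first_kanji_run and the macro USE_BOOST_REGEX are hypothetical. By default the sketch uses std::wregex so that it compiles without Boost; defining the macro switches it to boost::wregex, which offers the same wregex, wsmatch, and regex_search names for this use.

```cpp
#include <string>
#ifdef USE_BOOST_REGEX   // hypothetical switch, not a Boost-defined macro
#include <boost/regex.hpp>
namespace rx = boost;
#else
#include <regex>
namespace rx = std;
#endif

// Hypothetical helper: return the first run of characters that fall in
// the CJK Unified Ideographs block, or an empty string if there is none.
std::wstring first_kanji_run(const std::wstring& text)
{
    // The universal character names place the two range end points
    // directly into the wide string literal, so the regex engine sees an
    // ordinary [first-last] range and no Unicode property class is needed.
    static const rx::wregex kanji(L"[\u4E00-\u9FFF]+");
    rx::wsmatch m;
    if (rx::regex_search(text, m, kanji))
        return m.str();
    return std::wstring();
}
```

For example, first_kanji_run(L"abc\u6F22\u5B57 def") is expected to return the two kanji in the middle of the string. Both end points lie in the Basic Multilingual Plane, so each fits in a single wchar_t on Windows as well, where wchar_t is 16 bits wide.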
participants (4)
- John Maddock
- OvermindDL1
- Rune Lund Olesen
- wind world