UTF-16

Robert Dailey

14 Jun 2009

5:27 p.m.

Hi everyone, I did a bit of googling to see if Boost 1.39 has any portable support for UTF-16 encoded strings, but I did not find any. I'm currently using wxWidgets in my application, and I need a decent string object to use. I know that wxWidgets has UTF-16 string support through wxString; however, I do not want to expose this object in my interfaces. I want to remain as abstracted away from wxWidgets as possible. Having said that, if someone could tell me if there is any existing UTF-16 string support in Boost, I'd appreciate it. I did not find anything in the vault, sandbox, or trunk in Boost. If Boost has no such string object, could someone give me a head start on where to look? Thanks.

Robert Dailey

14 Jun

5:32 p.m.

Oh, I also forgot to mention, I am also using boost::filesystem::path. I guess this means I need to use wchar_t everywhere (std::wstring, boost::filesystem::wpath, etc) and just let wxWidgets do the encoding/decoding? If I don't have to do any encoding/decoding myself, then there really is no need for a special object. But just in case I would like to have the encoding/decoding abilities.

Zachary Turner

6:08 p.m.

An application I currently work on is stricken with this. If (like us) you are just trying to provide basic internationalization across Windows and Linux and want it to "just work" and be simple, then I would suggest typedefing something like

typedef std::wstring utf_string;
typedef boost::filesystem::wpath utf_path;
typedef wchar_t utf_char;

etc. on Windows, and

typedef std::string utf_string;
typedef boost::filesystem::path utf_path;
typedef char utf_char;

on Linux. Then just use a simple UTF-8 <-> UTF-16 conversion if ever you need to persist / retrieve something, so that it's stored in a common format. We're getting many strange problems relating to locales when we try to use UTF-16 in wpaths on Linux, and if it's not too much effort it's going to be simpler to just have your program always store them in the native format that the OS is expecting.
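The typedef scheme suggested above could be sketched roughly as follows. This is only an illustration: it assumes the `_WIN32` macro is an acceptable platform test, it leaves out the Boost.Filesystem path typedefs so that it compiles without Boost, and both the `UTF_TEXT` literal macro and the `native_units` function are hypothetical additions that were not part of the suggestion.

```cpp
#include <cstddef>
#include <string>

#if defined(_WIN32)
// Windows: the native APIs expect UTF-16 stored in wchar_t.
typedef wchar_t      utf_char;
typedef std::wstring utf_string;
#define UTF_TEXT(s) L##s   // hypothetical helper: wide literal on Windows
#else
// Linux: the native APIs take bytes, conventionally UTF-8.
typedef char        utf_char;
typedef std::string utf_string;
#define UTF_TEXT(s) s      // hypothetical helper: narrow literal elsewhere
#endif

// Example of code written once against the aliases.
// Returns the number of code units (bytes or wchar_t), not characters.
inline std::size_t native_units(const utf_string& s)
{
    return s.size();
}
```

Code that only passes strings between the OS and the GUI toolkit can then use `utf_string` throughout and convert only at persistence boundaries.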

Robert Dailey

7:41 p.m.

Great advice Zach. I'll definitely do this. However, it would be nice to have an already-made conversion routine for UTF8 to UTF16. I'm hoping that most of the cases where I'm converting encodings will be when I'm going through another library where it has already been handled, like wxWidgets or boost.
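For reference, a hand-written UTF-8 to UTF-16 conversion is fairly small. The following is a minimal sketch, not a routine taken from wxWidgets or Boost: the function name `utf8_to_utf16` is made up for illustration, it assumes a C++11 compiler (for `char16_t` and `std::u16string`), and it reports malformed input by throwing `std::runtime_error`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes and re-encode them as UTF-16 code units.
// Throws std::runtime_error on malformed input.
inline std::u16string utf8_to_utf16(const std::string& in)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const char32_t min_cp[] = { 0x0, 0x80, 0x800, 0x10000 };

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow

        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

The reverse direction follows the same pattern: combine surrogate pairs into a code point, then emit one to four bytes.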

Boris Dušek

7:51 p.m.

I have never used it, nor heard that it works for what you want; I just remember that there is some Unicode stuff in Adobe's ASL: http://stlab.adobe.com/group__asl__unicode.html. Skimming through the source, it's probably a header-only library (i.e. no linking), but I could be wrong.

Zachary Turner

9 p.m.

On Sun, Jun 14, 2009 at 2:41 PM, Robert Dailey<rcdailey@gmail.com> wrote:

Great advice Zach. I'll definitely do this. However, it would be nice to have an already-made conversion routine for UTF8 to UTF16. I'm hoping that most of the cases where I'm converting encodings will be when I'm going through another library where it has already been handled, like wxWidgets or boost.

Yea definitely. I know there's a Boost.Unicode library in the works, although I don't know the status.

Space Ship Traveller

15 Jun

12:25 a.m.

I found this to be pretty useful: UTF8-CPP: UTF-8 with C++ in a Portable Way (the SourceForge project page).

Boris Schaeling

8:55 a.m.

On Sun, 14 Jun 2009 21:41:16 +0200, Robert Dailey <rcdailey@gmail.com> wrote:

...

[...]Great advice Zach. I'll definitely do this. However, it would be nice to have an already-made conversion routine for UTF8 to UTF16. I'm hoping

The conversion routines you are looking for are std::mbsrtowcs() and std::wcsrtombs() in <cwchar>. You must set the global locale first before you use them so they know which multi-byte encoding they should use (try for example "en_US.UTF-8"). If you want your application to work on Windows, too, you can't use those functions unfortunately but must use MultiByteToWideChar() and WideCharToMultiByte() instead (as there is no UTF-8 locale on Windows).

The Unicode FAQ for Unix and Linux might also help: http://www.cl.cam.ac.uk/~mgk25/unicode.html

Boris
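A thin wrapper over the two platform routines named above might look like the sketch below. The function name `widen` is hypothetical. On the non-Windows branch the caller is assumed to have already selected a UTF-8 locale with `std::setlocale`, and the result is in whatever encoding `wchar_t` uses on that platform (32-bit code points with glibc, so not UTF-16 there). Input is treated as a C string, so an embedded NUL ends the conversion on that branch.

```cpp
#include <clocale>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

#if defined(_WIN32)
#include <windows.h>

// Windows: convert UTF-8 bytes to UTF-16 with the Win32 API.
inline std::wstring widen(const std::string& in)
{
    if (in.empty()) return std::wstring();
    const int bytes = static_cast<int>(in.size());
    // First call with a null buffer asks for the required length.
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      in.data(), bytes, NULL, 0);
    if (n <= 0) throw std::runtime_error("invalid UTF-8 input");
    std::vector<wchar_t> buf(n);
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        in.data(), bytes, &buf[0], n);
    return std::wstring(buf.begin(), buf.end());
}
#else

// Elsewhere: convert using the multi-byte encoding of the current C locale.
inline std::wstring widen(const std::string& in)
{
    std::mbstate_t state;
    std::memset(&state, 0, sizeof state);
    const char* src = in.c_str();
    // With a null destination, mbsrtowcs only measures the result.
    const std::size_t len = std::mbsrtowcs(NULL, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multi-byte sequence for this locale");

    std::vector<wchar_t> buf(len + 1);  // room for the terminator
    src = in.c_str();
    std::memset(&state, 0, sizeof state);
    std::mbsrtowcs(&buf[0], &src, buf.size(), &state);
    return std::wstring(&buf[0], len);
}
#endif
```

Plain ASCII converts even in the default "C" locale; non-ASCII input needs a call such as `std::setlocale(LC_ALL, "en_US.UTF-8")` to have succeeded first.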

Robert Dailey

3:45 p.m.

Boris, This is a good idea. I had thought about this, but I was hoping there was a more portable solution already out there. If not, then I could just create a simple abstraction for the platform specific routines. Thanks for your help (And to everyone else as well)!

Zachary Turner

6:12 p.m.

The problem with this approach, if I'm not mistaken, is that different encodings have different names across operating systems, and even across different Linux distros. So if you see en_US.UTF-8 on one distro, it might be something else on another distro. Correct me if I'm wrong.
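One way to cope with differing locale names is to try several spellings and keep the first one the C library accepts. The sketch below is only an illustration: the function name `select_utf8_locale` and the candidate list are assumptions, and which names actually exist depends on what is installed on the machine.

```cpp
#include <clocale>
#include <cstddef>

// Try a few common spellings of a UTF-8 locale for character handling.
// Returns the name that was accepted, or NULL if none is available.
inline const char* select_utf8_locale()
{
    // Assumed candidates; extend or reorder for the target systems.
    static const char* const candidates[] = {
        "en_US.UTF-8", "en_US.utf8", "C.UTF-8"
    };
    const std::size_t count = sizeof candidates / sizeof candidates[0];
    for (std::size_t i = 0; i < count; ++i) {
        // setlocale returns NULL and changes nothing if the name is unknown.
        if (std::setlocale(LC_CTYPE, candidates[i]) != NULL)
            return candidates[i];
    }
    return NULL;
}
```

A caller that gets NULL back can fall back to a conversion routine that does not depend on the locale at all.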

plarroy

16 Jul

8:39 a.m.

Hi

My approach is using std::string, etc. all the time and using UTF-8 internally, only converting to other charsets when it's needed.

I use IBM's ICU library and made a boost::iostreams filter to convert encodings; once that's done, it takes a lot of complexity away. I use it like:

// setup a conversion from charset to utf-8
filt_streamb.push(ucnv_filter(charset.c_str(), "utf-8"));
istream is(&filt_streamb);

Perhaps there's interest in pushing this charset conversion into the boost::iostreams filter examples.

Regards.

Robert Dailey

22 Jul

7:25 p.m.

The problem with that is that std::string::length() no longer provides a meaningful value. It will count each byte as one character.
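The difference between bytes and code points in a UTF-8 `std::string` can be illustrated with a short sketch. The function name `count_code_points` is hypothetical, and the code assumes well-formed UTF-8: it performs no validation and simply skips continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8 by counting every byte that is
// not a continuation byte (continuation bytes look like 10xxxxxx).
inline std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the UTF-8 bytes of "grüß" (`"gr\xC3\xBC\xC3\x9F"`), `length()` reports 6 while `count_code_points` reports 4. The count is a linear scan rather than an O(1) lookup.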

plarroy

24 Jul

10:07 a.m.

Yes, but you can easily count code points using ICU if you need this feature. I chose not to leverage that with std::string.

Regards.

Dominique Devienne

3:17 p.m.

On Wed, Jul 22, 2009 at 2:25 PM, Robert Dailey<rcdailey@gmail.com> wrote:

...

Problem with that is that std::string::length() no longer provides a meaningful value. It will count each byte is 1 character.

Instead of ICU, there's also http://utfcpp.sourceforge.net/ with its utf8::distance, which may be lighter weight. --DD

Quoting from that web page:

This function is used to find the length (in code points) of a UTF-8 encoded string. The reason it is called distance, rather than, say, length is mainly because developers are used that length is an O(1) function. Computing the length of an UTF-8 string is a linear operation, and it looked better to model it after std::distance algorithm. In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If last does not point to the past-of-end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.

Scott McMurray

4:47 p.m.

2009/7/22 Robert Dailey <rcdailey@gmail.com>:

...

Problem with that is that std::string::length() no longer provides a meaningful value. It will count each byte is 1 character.

Can you give an example of a situation in which you actually need to know the length of the string (in code points) in the first place? Everything I can think of that might need it would be better done with a regex anyway.

Rainer Deyke

5:59 p.m.

The length of a Unicode string isn't particularly interesting anyway, since simple characters like 'ä' can be encoded as either a single code point or multiple code points.

-- Rainer Deyke - rainerd@eldwood.com
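The point about 'ä' can be made concrete with UTF-32 literals, where each element is exactly one code point. This is a small sketch assuming a C++11 compiler; the two function names are made up for illustration.

```cpp
#include <string>

// 'ä' as one code point: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS.
inline std::u32string precomposed()
{
    return U"\u00E4";
}

// 'ä' as two code points: U+0061 'a' followed by
// U+0308 COMBINING DIAERESIS.
inline std::u32string decomposed()
{
    return U"a\u0308";
}
```

Both strings render as the same character, yet they have sizes 1 and 2 and compare unequal unless they are normalized first.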

participants (9)

Boris Dušek
Boris Schaeling
Dominique Devienne
plarroy
Rainer Deyke
Robert Dailey
Scott McMurray
Space Ship Traveller
Zachary Turner
