Hi everyone, I did a bit of googling to see if Boost 1.39 has any portable support for UTF-16 encoded strings, but I did not find any. I'm currently using wxWidgets in my application, and I need a decent string object to use. I know that wxWidgets has UTF-16 string support through wxString; however, I do not want to expose this object in my interfaces. I want to remain as abstracted away from wxWidgets as possible. Having said that, if someone could tell me if there is any existing UTF-16 string support in Boost, I'd appreciate it. I did not find anything in the vault, sandbox, or trunk in Boost. If Boost has no such string object, could someone give me a head start on where to look? Thanks.
Oh, I also forgot to mention, I am also using boost::filesystem::path. I
guess this means I need to use wchar_t everywhere (std::wstring,
boost::filesystem::wpath, etc) and just let wxWidgets do the
encoding/decoding? If I don't have to do any encoding/decoding myself, then
there really is no need for a special object. But just in case I would like
to have the encoding/decoding abilities.
On Sun, Jun 14, 2009 at 12:27 PM, Robert Dailey wrote: [...]
On Sun, Jun 14, 2009 at 12:32 PM, Robert Dailey wrote: [...]
An application I currently work on is stricken with this. If (like us) you are just trying to provide basic internationalization across Windows and Linux and want it to "just work" and be simple, then I would suggest typedefing something like
typedef std::wstring utf_string;
typedef boost::filesystem::wpath utf_path;
typedef wchar_t utf_char;
etc. on Windows, and
typedef std::string utf_string;
typedef boost::filesystem::path utf_path;
typedef char utf_char;
on Linux. Then just use a simple UTF-8 <-> UTF-16 conversion if ever you need to persist / retrieve something, so that it's stored in a common format. We're getting many strange problems relating to locales when we try to use UTF-16 in wpaths on Linux, and if it's not too much effort it's going to be simpler to just have your program always store them in the native format that the OS is expecting.
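A minimal sketch of the typedef scheme described above, not taken from the poster's code base. The `_WIN32` check, the `UTF_TEXT` literal macro and the `make_greeting` helper are assumptions added for illustration, and the boost::filesystem aliases are shown only as comments so that the sketch builds without Boost.

```cpp
#include <cassert>
#include <string>

// Hypothetical header: one set of aliases per platform.
#if defined(_WIN32)
typedef wchar_t      utf_char;    // UTF-16 code units on Windows
typedef std::wstring utf_string;
// typedef boost::filesystem::wpath utf_path;
#define UTF_TEXT(s) L##s          // hypothetical macro: wide literal
#else
typedef char         utf_char;    // UTF-8 bytes on Linux
typedef std::string  utf_string;
// typedef boost::filesystem::path utf_path;
#define UTF_TEXT(s) s             // hypothetical macro: narrow literal
#endif

// Code written against the aliases never names the platform directly.
inline utf_string make_greeting(const utf_string& name) {
    return utf_string(UTF_TEXT("Hello, ")) + name;
}
```

Calling `make_greeting(UTF_TEXT("Bob"))` yields a `std::wstring` on Windows and a `std::string` elsewhere. Anything written to disk or sent over the network would still need an explicit conversion to one agreed encoding.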
On Sun, Jun 14, 2009 at 1:08 PM, Zachary Turner wrote: [...]
Great advice Zach. I'll definitely do this. However, it would be nice to have an already-made conversion routine for UTF8 to UTF16. I'm hoping that most of the cases where I'm converting encodings will be when I'm going through another library where it has already been handled, like wxWidgets or boost.
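For the UTF-8 to UTF-16 routine asked about here, a hand-written sketch is shown below. It is not from Boost or wxWidgets; the function name `utf8_to_utf16` and the use of `std::u16string` (which needs C++11) are assumptions made for illustration. It rejects truncated sequences, overlong forms, surrogates and values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes and re-encode them as UTF-16 code units.
// Throws std::runtime_error on truncated or malformed input.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, surrogates and values past U+10FFFF.
        static const char32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units could be copied into a std::wstring. On Linux wchar_t is normally 32 bits, so the two types are not interchangeable there.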
I have never used it, nor heard whether it works for what you want; I just remember that there is some Unicode stuff in Adobe's ASL: http://stlab.adobe.com/group__asl__unicode.html. Skimming through the source, it's probably a header-only library (i.e. no linking), but I could be wrong.
On Sun, Jun 14, 2009 at 2:41 PM, Robert Dailey wrote: [...]
Yeah, definitely. I know there's a Boost.Unicode library in the works, although I don't know its status.
I found this to be pretty useful: UTF8-CPP (UTF-8 with C++ in a Portable Way). See the SourceForge project page.
On Sun, 14 Jun 2009 21:41:16 +0200, Robert Dailey
[...]Great advice Zach. I'll definitely do this. However, it would be nice to have an already-made conversion routine for UTF8 to UTF16. I'm hoping
The conversion routines you are looking for are std::mbsrtowcs() and std::wcsrtombs() in <cwchar>. You must set the global locale first before you use them so they know which multi-byte encoding they should use (try for example "en_US.UTF-8"). If you want your application to work on Windows, too, you can't use those functions unfortunately but must use MultiByteToWideChar() and WideCharToMultiByte() instead (as there is no UTF-8 locale on Windows).
The Unicode FAQ for Unix and Linux might also help: http://www.cl.cam.ac.uk/~mgk25/unicode.html
Boris
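A rough sketch of how the two platform-specific routes named above might be wrapped behind one function. The names `widen` and `select_utf8_locale` are hypothetical, and the list of locale names is only a guess at common spellings, since the available names vary between systems. The non-Windows branch relies on whatever LC_CTYPE locale is active and stops at the first embedded null character.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>
#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical helper: try a few spellings of a UTF-8 locale name.
inline bool select_utf8_locale() {
    static const char* const names[] = {"en_US.UTF-8", "en_US.utf8", "C.UTF-8"};
    for (std::size_t i = 0; i < sizeof(names) / sizeof(names[0]); ++i)
        if (std::setlocale(LC_CTYPE, names[i])) return true;
    return false;
}

// Hypothetical wrapper: multi-byte (UTF-8) text to std::wstring.
inline std::wstring widen(const std::string& in) {
    if (in.empty()) return std::wstring();
#if defined(_WIN32)
    // First call asks for the required length, second call converts.
    int len = MultiByteToWideChar(CP_UTF8, 0, in.data(),
                                  static_cast<int>(in.size()), NULL, 0);
    if (len <= 0) throw std::runtime_error("MultiByteToWideChar failed");
    std::wstring out(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, in.data(),
                        static_cast<int>(in.size()), &out[0], len);
    return out;
#else
    // Uses the encoding of the current LC_CTYPE locale; the caller sets it.
    std::mbstate_t state = std::mbstate_t();
    const char* src = in.c_str();
    std::vector<wchar_t> buf(in.size() + 1);  // at most one wchar_t per byte
    std::size_t n = std::mbsrtowcs(&buf[0], &src, buf.size(), &state);
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("mbsrtowcs failed");
    return std::wstring(&buf[0], n);
#endif
}
```

A caller on Linux would run `select_utf8_locale()` once at startup and check its result before passing non-ASCII text to `widen`. The reverse direction would follow the same pattern with std::wcsrtombs() and WideCharToMultiByte().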
On Mon, Jun 15, 2009 at 3:55 AM, Boris Schaeling wrote: [...]
Boris, This is a good idea. I had thought about this, but I was hoping there was a more portable solution already out there. If not, then I could just create a simple abstraction for the platform specific routines. Thanks for your help (And to everyone else as well)!
On Mon, Jun 15, 2009 at 10:45 AM, Robert Dailey wrote: [...]
The problem with this approach, if I'm not mistaken, is that different encodings have different names across operating systems, and even across different Linux distros. So if you see en_US.UTF-8 on one distro, it might be something else on another distro. Correct me if I'm wrong.
Hi
My approach is to use std::string, etc. all the time with UTF-8 internally, converting to other charsets only when needed.
I use IBM's ICU library and made a boost::iostreams filter to convert encodings. Once that is done it takes a lot of complexity away. I use it like this:
// set up a conversion from charset to utf-8
filt_streamb.push(ucnv_filter(charset.c_str(), "utf-8"));
istream is(&filt_streamb);
Perhaps there's interest in pushing this charset conversion into the boost::iostreams filter examples.
Regards.
Robert Dailey wrote:
[...]
------------------------------------------------------------------------
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Problem with that is that std::string::length() no longer provides a
meaningful value. It will count each byte as 1 character.
---------
Robert Dailey
On Thu, Jul 16, 2009 at 3:39 AM, plarroy wrote: [...]
Yes, but you can easily count code points using ICU if you need this feature. I chose not to leverage that with std::string. Regards. Robert Dailey wrote:
Problem with that is that std::string::length() no longer provides a meaningful value. It will count each byte as 1 character.
On Wed, Jul 22, 2009 at 2:25 PM, Robert Dailey
Problem with that is that std::string::length() no longer provides a meaningful value. It will count each byte as 1 character.
Instead of ICU, there's also http://utfcpp.sourceforge.net/ with its utf8::distance, which may be lighter weight. --DD
Quoting from that web page:
This function is used to find the length (in code points) of a UTF-8 encoded string. The reason it is called distance, rather than, say, length is mainly because developers are used that length is an O(1) function. Computing the length of an UTF-8 string is a linear operation, and it looked better to model it after std::distance algorithm. In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If last does not point to the past-of-end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.
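To make the byte count versus code point count difference concrete, here is a small hand-rolled sketch. It is not the utfcpp implementation of utf8::distance; the function name `utf8_code_points` is hypothetical, and unlike the library function it performs no validation and simply assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a code point counter. It counts every byte that
// is not a continuation byte (bit pattern 10xxxxxx), which is a linear scan,
// and it assumes the input is already valid UTF-8.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++n;
    return n;
}
```

For the UTF-8 string "na\xC3\xAFve" (the word "naïve"), std::string::size() reports 6 bytes while `utf8_code_points` reports 5 code points.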
2009/7/22 Robert Dailey
Problem with that is that std::string::length() no longer provides a meaningful value. It will count each byte as 1 character.
Can you give an example of a situation in which you actually need to know the length of the string (in code points) in the first place? Everything I can think of that might need it would be better done with a regex anyway.
Scott McMurray wrote:
Can you give an example of a situation in which you actually need to know the length of the string (in code points) in the first place? Everything I can think of that might need it would be better done with a regex anyway.
The length of a Unicode string isn't particularly interesting anyway, since simple characters like 'ä' can be encoded as either a single code point or multiple code points.
-- Rainer Deyke - rainerd@eldwood.com
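The 'ä' example can be spelled out with a short sketch. The decoder `decode_utf8` below is hypothetical and assumes well-formed UTF-8 with no validation; the two strings are the precomposed form (U+00E4) and the decomposed form (U+0061 followed by the combining diaeresis U+0308).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical decoder, assuming well-formed UTF-8 (no validation).
inline std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) == 0x80) {
            // Continuation byte: append six payload bits to the current code point.
            if (!out.empty()) out.back() = (out.back() << 6) | (b & 0x3Fu);
        }
        else if (b < 0x80)           out.push_back(b);          // ASCII
        else if ((b & 0xE0) == 0xC0) out.push_back(b & 0x1Fu);  // 2-byte lead
        else if ((b & 0xF0) == 0xE0) out.push_back(b & 0x0Fu);  // 3-byte lead
        else                         out.push_back(b & 0x07u);  // 4-byte lead
    }
    return out;
}

const std::string precomposed = "\xC3\xA4";   // U+00E4
const std::string decomposed  = "a\xCC\x88";  // U+0061 U+0308
```

Both strings render as the same character, yet the first decodes to one code point and the second to two, and a byte-wise comparison treats them as different. Treating them as equal would require Unicode normalization, which this sketch does not attempt.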
participants (9)
- Boris Dušek
- Boris Schaeling
- Dominique Devienne
- plarroy
- Rainer Deyke
- Robert Dailey
- Scott McMurray
- Space Ship Traveller
- Zachary Turner