Re: [boost] [General] Always treat std::strings as UTF-8

From: Dave Abrahams <dave@boostpro.com>
Peter Dimov wrote:
Alexander Lamaison wrote:
I'm opposed to this strategy simply because it differs from the way existing libraries treat narrow strings.
It differs from them because it's right, and existing libraries are wrong. Unfortunately, they'll continue being wrong for a long time, because of this same argument.
Does the "right" strategy come with some policies/practices that can allow it to coexist with the existing "wrong" libraries? If so, I'm all +1 on it.

On Sat, Jan 15, 2011 at 8:39 PM, Artyom <artyomtnk@yahoo.com> wrote:
Combining old libraries with new ones:
======================================
It would be simple to combine a library that uses old policies with new ones.
    namespace boost {
        std::string utf8_to_ansi(std::string const &s);
        std::string ansi_to_utf8(std::string const &s);
        std::wstring utf8_to_wide(std::string const &s);
        std::string wide_to_utf8(std::wstring const &s);
    }

(A sketch implementation of the UTF-8/wide pair appears after the list below.)
- If it supports wide strings, call boost::utf8_to_wide **under Windows** and nothing is lost.
- If it supports only narrow strings:
a) If it is encoding-agnostic, like a unit test that only opens files with ASCII names, then you can safely pass UTF-8 strings as ASCII and ASCII strings as UTF-8, since ASCII is a subset of UTF-8.
b) Otherwise, do the following:
1. File a bug with the library owner about the missing Unicode string support under Windows.
2. Use utf8_to_ansi/ansi_to_utf8 to pass strings to this library under Windows.
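A minimal sketch of how the UTF-8/wide pair declared above might be implemented on Windows over the Win32 MultiByteToWideChar/WideCharToMultiByte calls. The bodies are illustrative, not part of the original proposal; utf8_to_ansi/ansi_to_utf8 would be analogous, using CP_ACP instead of CP_UTF8:

    // Windows-only sketch; error handling omitted for brevity.
    #include <string>
    #include <windows.h>

    namespace boost {

    std::wstring utf8_to_wide(std::string const &s)
    {
        if (s.empty())
            return std::wstring();
        // First call computes the required length, second does the conversion.
        int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), 0, 0);
        std::wstring w(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &w[0], n);
        return w;
    }

    std::string wide_to_utf8(std::wstring const &s)
    {
        if (s.empty())
            return std::string();
        int n = WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), 0, 0, 0, 0);
        std::string r(n, '\0');
        WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), &r[0], n, 0, 0);
        return r;
    }

    } // namespace boost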
Current State of Using Wide/ANSI API in Boost:
==============================================
I did a small search to find which libraries use which API:
The following use both types of API:
------------------------------------
thread asio system iostreams regex filesystem
According to the new policy, they should replace the ANSI API with the wide API plus a conversion between UTF-8 and UTF-16.
The following libraries use only the ANSI API:
----------------------------------------------
interprocess spirit test random
They should replace their ANSI API with the wide one, using simple utf8_to_wide/wide_to_utf8 glue.
The following libraries use STL functions that are not Unicode-aware under Windows:
-----------------------------------------------------------------------------------
std::fstream
- Serialization - Graph - wave - datetime - property_tree - program_options
fopen
- gil - spirit - python - regex
These need to be replaced with something like boost::fstream and boost::fopen that work with UTF-8 under Windows.
The rest of the libraries seem to be encoding-agnostic.
Artyom
boost::filesystem::fstream uses a wide string under Windows afaik (assuming it can detect that you're using an STL implementation which has wide-string overloads -- aka Dinkumware). However, there's still the problem that if you're using MinGW (or some other non-MSVC toolset that doesn't use a recent Dinkumware STL implementation), it will drop back to a narrow string and we're back where we started again...

boost::filesystem::fstream uses a wide string under Windows afaik (assuming it can detect that you're using an STL implementation which has wide-string overloads -- aka Dinkumware). However, there's still the problem that if you're using MinGW (or some other non-MSVC toolset that doesn't use a recent Dinkumware STL implementation), it will drop back to a narrow string and we're back where we started again...
Yes, I know this; that is why a boost::fstream written over C stdio.h should be provided. I once wrote a small "nowide" library that does this and calls _wfopen under Windows (available in MinGW), and I actually use it in CppCMS's booster library, which makes my life much simpler.

Artyom
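A hedged sketch of the kind of wrapper described here; the real "nowide" library is more thorough, and utf8_fopen plus the fixed-size buffers are illustrative simplifications:

    #include <cstdio>
    #include <string>

    #ifdef _WIN32
    #include <windows.h>
    #endif

    // Sketch: open a file whose name is UTF-8 on every platform.
    std::FILE *utf8_fopen(char const *name, char const *mode)
    {
    #ifdef _WIN32
        // Convert UTF-8 to UTF-16 and call the wide-character CRT function,
        // which both the MSVC and MinGW runtimes provide.
        wchar_t wname[MAX_PATH];
        wchar_t wmode[16];
        if (!MultiByteToWideChar(CP_UTF8, 0, name, -1, wname, MAX_PATH) ||
            !MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16))
            return 0;
        return _wfopen(wname, wmode);
    #else
        // POSIX: the narrow API is the native one; pass the bytes through.
        return std::fopen(name, mode);
    #endif
    }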

On 15/01/2011 10:39, Artyom wrote:
It would be simple to combine a library that uses old policies with new ones.
    namespace boost {
        std::string utf8_to_ansi(std::string const &s);
        std::string ansi_to_utf8(std::string const &s);
        std::wstring utf8_to_wide(std::string const &s);
        std::string wide_to_utf8(std::wstring const &s);
    }
ANSI doesn't really mean much. It's purely a Windows thing. utf8_to_locale, which would take a std::locale object, would make more sense.

    namespace boost {
        std::string utf8_to_ansi(std::string const &s);
        std::string ansi_to_utf8(std::string const &s);
        std::wstring utf8_to_wide(std::string const &s);
        std::string wide_to_utf8(std::wstring const &s);
    }
ANSI doesn't really mean much. It's purely a Windows thing.
utf8_to_locale, which would take a std::locale object, would make more sense.
1. std::locale-based conversion using the std::codecvt facet depends strongly on the current implementation, and that is a bad point to start from.

2. utf8_to_ansi and its inverse should not be used outside the Windows scope, where ANSI means the narrow Windows API (a.k.a. the ANSI API).

3. On non-Windows platforms they should not do anything to the strings, passing them through as-is, since the native POSIX API is narrow, not wide.

Artyom

On 15/01/2011 14:29, Artyom wrote:
    namespace boost {
        std::string utf8_to_ansi(std::string const &s);
        std::string ansi_to_utf8(std::string const &s);
        std::wstring utf8_to_wide(std::string const &s);
        std::string wide_to_utf8(std::wstring const &s);
    }
ANSI doesn't really mean much. It's purely a Windows thing.
utf8_to_locale, which would take a std::locale object, would make more sense.
1. std::locale-based conversion using the std::codecvt facet depends strongly on the current implementation, and that is a bad point to start from.
It is "reasonably reasonable" to assume the wide character locale is UTF-16 or UTF-32. Some IBM mainframes are the only ones where this is not the case as far as I know. Therefore you can portably convert a locale to UTF-8 by using std::codecvt<char, wchar_t> to convert it to UTF-16 or UTF-32, converting that UTF-16 to UTF-32 if needed, then convert it back to UTF-8. That's, of course, not exactly very efficient, especially when you're unable to pipeline those conversions.
2. utf8_to_ansi and its inverse should not be used outside the Windows scope, where ANSI means the narrow Windows API (a.k.a. the ANSI API).
Good code is code that doesn't expose platform-specific details. The name ANSI is so bad (it means American National Standards Institute, even though Windows locales have nothing to do with that body) that I'd rather not put that in any function I'd use in real code.
3. On non-Windows platforms they should not do anything to the strings, passing them through as-is, since the native POSIX API is narrow, not wide.
Yet you still need to convert between UTF-8 and the POSIX locales. Even if most recent POSIX systems use UTF-8 as their locale, there is no guarantee of that. Indeed, quite a few still run in Latin-1.

3. On non-Windows platforms they should not do anything to the strings, passing them through as-is, since the native POSIX API is narrow, not wide.
Yet you still need to convert between UTF-8 and the POSIX locales. Even if most recent POSIX systems use UTF-8 as their locale, there is no guarantee of that. Indeed, quite a few still run in Latin-1.
No, you don't need to convert UTF-8 to the locale's encoding, because char* is the native system API, unlike on Windows. So you don't need to mess around with encodings at all unless you deal with text-related operations such as collation.

The **only** problem is the badly designed Windows API, which makes it impossible to write cross-platform code. So the idea is that on Windows we treat "char *" as UTF-8 and call the wide API after converting from UTF-8. There is no problem with this. As long as all libraries use the same policy, there are no issues using Unicode any more.

The problem is not locales, encodings, or other such things; the problem is that the Windows API does not let you use "char *" based strings fully, because it does not support UTF-8, and platform-independent programming becomes a total mess.

Artyom

On Sat, 15 Jan 2011 06:46:10 -0800 (PST), Artyom wrote:
Yet you still need to convert between UTF-8 and the POSIX locales. Even if most recent POSIX systems use UTF-8 as their locale, there is no guarantee of that. Indeed, quite a few still run in Latin-1.
No, you don't need to convert UTF-8 to the locale's encoding, because char* is the native system API, unlike on Windows. So you don't need to mess around with encodings at all unless you deal with text-related operations such as collation.
I'm not sure I follow. If you pass a UTF-8 encoded string to a POSIX OS that uses a non-UTF character set, how is the OS meant to interpret that?

Alex
--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Yet you still need to convert between UTF-8 and the POSIX locales. Even if most recent POSIX systems use UTF-8 as their locale, there is no guarantee of that. Indeed, quite a few still run in Latin-1.
No, you don't need to convert UTF-8 to the locale's encoding, because char* is the native system API, unlike on Windows. So you don't need to mess around with encodings at all unless you deal with text-related operations such as collation.
I'm not sure I follow. If you pass a UTF-8 encoded string to a POSIX OS that uses a non-UTF character set, how is the OS meant to interpret that?
As a null-terminated byte sequence. I mean, if your locale is UTF-8 and there is a file named "\xFF\xFF.txt", which is clearly not UTF-8, you can still open it, remove it, and do almost anything with it. The API is locale-agnostic (unless it is a very specific language-related API like strcoll).

Artyom
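A minimal illustration of this point, assuming a POSIX system; the name is invalid UTF-8, yet every call below works, because the file API never interprets the bytes:

    #include <cstdio>

    int main()
    {
        char const *name = "\xFF\xFF.txt";      // not valid UTF-8
        std::FILE *f = std::fopen(name, "w");   // created anyway
        if (f)
            std::fclose(f);
        std::remove(name);                      // removed anyway; no locale involved
        return 0;
    }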

On 15/01/2011 15:46, Artyom wrote:
No, you don't need to convert UTF-8 to the locale's encoding, because char* is the native system API, unlike on Windows. So you don't need to mess around with encodings at all unless you deal with text-related operations such as collation.
POSIX system calls expect the text they receive as char* to be encoded in the current character locale. To write cross-platform code, you need to convert your UTF-8 input to the locale encoding when calling system calls, and convert text you receive from those system calls from the locale encoding to UTF-8. (Note: this is exactly what gtkmm::ustring does.)

Windows is exactly the same, except it's got two sets of locales and two sets of system calls. The wide character locale is more interesting since it is always UTF-16, so the conversion you have to do is only between UTF-8 and UTF-16, which is easy and lossless.

Likewise, you could also choose to use UTF-16 or UTF-32 as your internal representation rather than UTF-8. The choice is completely irrelevant with regard to providing a uniformly encoded interface regardless of platform.
The problem is not locales, encodings, or other such things; the problem is that the Windows API does not let you use "char *" based strings fully, because it does not support UTF-8
The actual locale used by the user is irrelevant. Again, as I said earlier, the fact that UTF-8 is the most common locale on Linux but is not available on Windows shouldn't affect the way the system works. A lot of Linux systems use a Latin-1 locale, and your approach will simply fail on those systems.
and platform-independent programming becomes a total mess.
So your technique for writing platform-independent code is relying on the user having a UTF-8 locale?

Mathias Gaunard wrote:
POSIX system calls expect the text they receive as char* to be encoded in the current character locale.
No, POSIX system calls (under most Unix OSes, except Mac OS X) are encoding-agnostic; they receive a null-terminated byte sequence (NTBS) without interpreting it. On Mac OS X, file paths must be UTF-8. Locales are not considered.
To write cross-platform code, you need to convert your UTF-8 input to the locale encoding when calling system calls, and convert text you receive from those system calls from the locale encoding to UTF-8.
This is one possible way to do it (blindly using UTF-8 is another). Strictly speaking, under an encoding-agnostic file system, you must not convert anything to anything, because this may cause you to irretrievably lose the original path. For display purposes, of course, you have to pick an encoding somehow.

There is no "current" character locale on Unix, by the way, unless you count the environment variables; the OS itself doesn't care. Using the current C locale (LANG=...) allows you to display the file names the same way the 'ls' command does, whereas using UTF-8 allows your user to enter file names which are not representable in the LANG locale.
Windows is exactly the same, except it's got two sets of locales and two sets of system calls.
Nope. It doesn't have two sets of locales.
So your technique for writing platform-independent code is relying on the user having a UTF-8 locale?
More or less. The code itself doesn't depend on the user locale; it always works, but to see the actual names in a terminal, you need a UTF-8 locale. This is now the recommended setup on all Unix OSes.
participants (5)
- Alexander Lamaison
- Artyom
- Joshua Boyce
- Mathias Gaunard
- Peter Dimov