
Mathias Gaunard wrote:
POSIX system calls expect the text they receive as char* to be encoded in the current character locale.
No, POSIX system calls (under most Unix OSes, except on Mac OS X) are encoding-agnostic: they receive a null-terminated byte sequence (NTBS) without interpreting it. On Mac OS X, file paths must be UTF-8. Locales are not considered.
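To make that concrete, here is a minimal sketch (my own illustration, assuming a typical Linux file system) that creates a file whose name contains bytes forming no valid UTF-8 sequence. On Linux and most other Unixes the kernel accepts the name as an opaque NTBS; on Mac OS X the same call should fail, because its file systems enforce UTF-8:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    // 0xFF 0xFE cannot appear in well-formed UTF-8, yet the kernel
    // does not care: a path is just bytes with no NUL and no '/'.
    const char name[] = "prefix-\xFF\xFE-suffix";
    int fd = open(name, O_CREAT | O_WRONLY, 0644);
    if (fd == -1) { perror("open"); return 1; }
    close(fd);
    std::puts("created a file whose name is valid in no common encoding");
    return 0;
}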
To write cross-platform code, you need to convert your UTF-8 input to the locale encoding when calling system calls, and convert text you receive from those system calls from the locale encoding to UTF-8.
This is one possible way to do it (blindly using UTF-8 is another). Strictly speaking, on an encoding-agnostic file system you must not convert anything to anything, because the conversion may irretrievably lose the original path. For display purposes, of course, you have to pick an encoding somehow.

There is no "current" character locale on Unix, by the way, unless you count the environment variables; the OS itself doesn't care. Using the current C locale (LANG=...) lets you display file names the same way the 'ls' command does, whereas using UTF-8 lets your user enter file names that are not representable in the LANG locale.
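As a sketch of that display-time conversion, using only POSIX iconv and nl_langinfo (the function name utf8_from_locale is just an illustrative choice, and error handling is abbreviated):

#include <iconv.h>
#include <langinfo.h>
#include <clocale>
#include <stdexcept>
#include <string>

// Convert a raw file name from the encoding implied by LANG/LC_* to UTF-8.
std::string utf8_from_locale(const std::string& raw)
{
    // nl_langinfo(CODESET) reports e.g. "ISO-8859-1" for LANG=fr_FR
    // and "UTF-8" for LANG=fr_FR.UTF-8.
    iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported locale encoding");

    std::string out(raw.size() * 4 + 4, '\0'); // worst-case growth
    char* in_p = const_cast<char*>(raw.data());
    size_t in_left = raw.size();
    char* out_p = &out[0];
    size_t out_left = out.size();

    if (iconv(cd, &in_p, &in_left, &out_p, &out_left) == (size_t)-1)
        out.clear(); // not representable; the caller decides what to show
    else
        out.resize(out.size() - out_left);

    iconv_close(cd);
    return out;
}

int main()
{
    std::setlocale(LC_ALL, ""); // adopt LANG/LC_* so CODESET is meaningful
    // ... read a directory and pass each d_name through utf8_from_locale ...
}

Going the other way (UTF-8 input to the locale encoding before a system call) is the same sketch with the iconv_open arguments swapped.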
Windows is exactly the same, except it's got two sets of locales and two sets of system calls.
Nope. It doesn't have two sets of locales.
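What Windows does have is one ANSI code page per process plus a parallel set of wide-character entry points (the -A and -W function pairs). A sketch of the usual portable approach, keeping UTF-8 internally and converting to UTF-16 at the call boundary (utf16_from_utf8 is my own helper name):

#include <windows.h>
#include <string>

// Convert UTF-8 to the UTF-16 that the -W system calls expect.
std::wstring utf16_from_utf8(const std::string& u8)
{
    if (u8.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                u8.data(), (int)u8.size(), nullptr, 0);
    if (n == 0) return std::wstring(); // invalid UTF-8 input
    std::wstring u16(n, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        u8.data(), (int)u8.size(), &u16[0], n);
    return u16;
}

int main()
{
    std::wstring path = utf16_from_utf8("caf\xC3\xA9.txt"); // "café.txt"
    // CreateFileW takes the name as UTF-16 directly; CreateFileA would
    // route it through the ANSI code page instead.
    HANDLE h = CreateFileW(path.c_str(), GENERIC_WRITE, 0, nullptr,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
    return 0;
}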
So your technique for writing independent code is relying on the user to use a UTF-8 locale?
More or less. The code itself doesn't depend on the user locale; it always works. But to see the actual names in a terminal, you need a UTF-8 locale, which is now the recommended setup on all Unix OSes.
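For what it's worth, a program can detect whether it is running under such a locale. A small sketch:

#include <langinfo.h>
#include <clocale>
#include <cstdio>
#include <cstring>

int main()
{
    std::setlocale(LC_ALL, ""); // adopt the environment's locale
    const char* cs = nl_langinfo(CODESET);
    // Some systems spell the codeset "utf8"; a robust check would
    // normalize the name before comparing.
    if (std::strcmp(cs, "UTF-8") != 0)
        std::fprintf(stderr, "warning: locale codeset is %s, not UTF-8; "
                             "file names may display incorrectly\n", cs);
    return 0;
}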