Re: [boost] [General] Always treat std::strings as UTF-8? (was [Process] List of small issues)

14 Jan 2011


      ...
On Fri, 14 Jan 2011 00:48:43 -0800 (PST), Artyom wrote:
...
...
Most platforms have a notion of a 'default' encoding.  On Linux, this is
usually UTF-8 but isn't guaranteed to be.  On Windows this is the active
local codepage (i.e. *not* UTF-8) for char and UTF-16 for wchar_t.
The safest approach (and the one taken by the STL and boost) is to assume
the strings are in the OS's default encoding unless explicitly known to be
otherwise.
Two problems with this approach:
- Even if the encoding under POSIX platforms is not UTF-8 you will
  still be able to open files, close them, stat them and do any
  other operations regardless of encoding, as the POSIX API is encoding
  agnostic.  This is why it works well.
This isn't a problem, right?  This is exactly why it _does_ work :D  Assume
the strings are in OS-default encoding, don't mess with them, hand them to
the OS API which knows how to treat them.
...
- Under Windows, on the other hand, you CANNOT do everything with narrow
  strings. For example you can't create the file "שלום-سلام-pease-Мир.txt"
  using the char * API. And this has very bad consequences.
This is indeed true.  I was just describing the situation where the string
came from the result of one call and was being passed around.  If you want
to manipulate the strings, things become more tricky.
...
...
This means you can pass these strings around freely without
worrying about their encoding because, eventually, they get passed to an OS
call which knows how to handle them.
You can't under Windows... "ANSI" API is limited.
You've missed where I said "pass these strings around".  I'm not suggesting
you can change them.  But you can take a narrow string returned by an OS
call and pass it to another OS call without any problems.
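
A minimal sketch of that pass-through, assuming a POSIX system (the helper
names current_directory and is_directory are made up for illustration).  The
bytes returned by getcwd are stored in a std::string and handed to stat
without ever being decoded, so their encoding never matters to the program:

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <vector>
#include <sys/stat.h>
#include <unistd.h>

// Ask the OS for the current directory.  The bytes come back in whatever
// encoding the OS uses and are stored untouched.  Returns an empty string
// on failure.
std::string current_directory()
{
    std::vector<char> buf(256);
    while (getcwd(buf.data(), buf.size()) == NULL) {
        if (errno != ERANGE) return std::string();
        buf.resize(buf.size() * 2);  // buffer too small, grow and retry
    }
    return std::string(buf.data());
}

// Hand the same bytes straight back to another OS call.
bool is_directory(const std::string& path)
{
    // c_str() would silently truncate at an embedded NUL.
    assert(path.find('\0') == std::string::npos);
    struct stat info;
    return stat(path.c_str(), &info) == 0 && S_ISDIR(info.st_mode);
}
```

Nothing here inspects or modifies the characters, which is the whole point:
the string is only a container for bytes the OS already understands.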
...
...
Alternatively, if you need to manipulate the string you can use the OS's
character conversion functions to take your default-encoding string,
convert it to something specific, manipulate the result and then convert it
back.  On Windows you would use MultiByteToWideChar/WideCharToMultiByte
with the CP_ACP code page.
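
A rough sketch of that convert, manipulate, convert-back cycle.  The helper
names to_wide and to_narrow are hypothetical.  The Windows branch uses
MultiByteToWideChar/WideCharToMultiByte with CP_ACP as described above; the
other branch is an assumption added so the sketch builds elsewhere, and uses
the C library's mbstowcs/wcstombs, which convert according to the current C
locale:

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Default narrow encoding -> wide string.
std::wstring to_wide(const std::string& narrow)
{
    if (narrow.empty()) return std::wstring();
#ifdef _WIN32
    const int in_len = static_cast<int>(narrow.size());
    // First call asks for the required length, second call converts.
    const int out_len =
        MultiByteToWideChar(CP_ACP, 0, narrow.data(), in_len, NULL, 0);
    if (out_len <= 0) throw std::runtime_error("MultiByteToWideChar failed");
    std::wstring wide(static_cast<std::size_t>(out_len), L'\0');
    const int written =
        MultiByteToWideChar(CP_ACP, 0, narrow.data(), in_len, &wide[0], out_len);
#else
    const std::size_t out_len = std::mbstowcs(NULL, narrow.c_str(), 0);
    if (out_len == static_cast<std::size_t>(-1))
        throw std::runtime_error("mbstowcs failed");
    std::wstring wide(out_len, L'\0');
    const std::size_t written = std::mbstowcs(&wide[0], narrow.c_str(), out_len);
#endif
    assert(written == out_len);
    (void)written;
    return wide;
}

// Wide string -> default narrow encoding.
std::string to_narrow(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
#ifdef _WIN32
    const int in_len = static_cast<int>(wide.size());
    const int out_len =
        WideCharToMultiByte(CP_ACP, 0, wide.data(), in_len, NULL, 0, NULL, NULL);
    if (out_len <= 0) throw std::runtime_error("WideCharToMultiByte failed");
    std::string narrow(static_cast<std::size_t>(out_len), '\0');
    WideCharToMultiByte(CP_ACP, 0, wide.data(), in_len, &narrow[0], out_len,
                        NULL, NULL);
#else
    const std::size_t out_len = std::wcstombs(NULL, wide.c_str(), 0);
    if (out_len == static_cast<std::size_t>(-1))
        throw std::runtime_error("wcstombs failed");
    std::string narrow(out_len, '\0');
    std::wcstombs(&narrow[0], wide.c_str(), out_len);
#endif
    return narrow;
}
```

Typical use would be along the lines of
to_narrow(to_wide(name) + L"-copy"), with the manipulation done on the wide
string in between.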
I omitted one important caveat here: if you manipulate the string once
you've converted it to UTF-16, you may not be able to convert it back to
the default encoding losslessly.  For example, as in your string above, if
you take the original string in Arabic, up-convert it and append a Russian
word, you can't blindly convert this back as the default encoding may not
be able to represent these two character sets simultaneously.
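
To make the caveat concrete, here is a toy model rather than real code-page
tables: ToyCodePage, arabic_page, cyrillic_page and representable are all
invented for illustration, and each "code page" is assumed to cover ASCII
plus a single Unicode block.  Real code pages are more irregular, but the
failure mode is the same.  (On Windows, I believe the lpUsedDefaultChar
out-parameter of WideCharToMultiByte can report this kind of loss when
converting to CP_ACP.)

```cpp
#include <cassert>
#include <string>

// Toy stand-in for a legacy code page: ASCII plus one contiguous block of
// Unicode code points.  Not a real code-page table.
struct ToyCodePage {
    char32_t first;
    char32_t last;
};

const ToyCodePage arabic_page = {0x0600, 0x06FF};    // Arabic block
const ToyCodePage cyrillic_page = {0x0400, 0x04FF};  // Cyrillic block

// True if every code point in text could be converted back to the given
// toy code page without loss.
bool representable(const std::u32string& text, const ToyCodePage& page)
{
    assert(page.first <= page.last);
    for (char32_t c : text) {
        const bool ascii = c < 0x80;
        const bool in_block = c >= page.first && c <= page.last;
        if (!ascii && !in_block) return false;
    }
    return true;
}
```

With these definitions, the Arabic word alone fits arabic_page and the
Russian word alone fits cyrillic_page, but the two joined together fit
neither, so a blind conversion back would have to drop or substitute
characters.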

Alex


-- 
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Alexander Lamaison