
On Fri, 14 Jan 2011 00:48:43 -0800 (PST), Artyom wrote:
Most platforms have a notion of a 'default' encoding. On Linux, this is usually UTF-8 but isn't guaranteed to be. On Windows this is the active local codepage (i.e. *not* UTF-8) for char and UCS-2 for wchar_t.
The safest approach (and the one taken by the STL and boost) is to assume the strings are in this OS's default encoding unless explicitly known to be otherwise.
Two problems with this approach:
- Even if the encoding under POSIX platforms is not UTF-8, you will still be able to open files, close them, stat them and do any other operations regardless of encoding, as the POSIX API is encoding-agnostic. This is why it works well.
This isn't a problem, right? This is exactly why it _does_ work :D Assume the strings are in OS-default encoding, don't mess with them, hand them to the OS API which knows how to treat them.
- Under Windows, on the other hand, you CANNOT do everything with narrow strings. For example, you can't create the file "שלום-سلام-pease-Мир.txt" using the char * API. And this has very bad consequences.
This is indeed true. I was just describing the situation where the string came from the result of one call and was being passed around. If you want to manipulate the strings, things become more tricky.
This means you can pass these strings around freely without worrying about their encoding because, eventually, they get passed to an OS call which knows how to handle them.
You can't under Windows... the "ANSI" API is limited.
You've missed where I said "pass these strings around". I'm not suggesting you can change them. But you can take a narrow string returned by an OS call and pass it to another OS call without any problems.
Alternatively, if you need to manipulate the string you can use the OS's character conversion functions to take your default-encoding string, convert it to something specific, manipulate the result and then convert it back. On Windows you would use MultiByteToWideChar/WideCharToMultiByte with the CP_ACP code page.
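To make that concrete, here is a rough sketch of the round trip. The helper names widen() and narrow() are made up for illustration and are not from the STL or Boost. The Windows branch assumes the active code page is a legacy (non-UTF-8) one. The non-Windows branch uses the standard mbsrtowcs/wcsrtombs as a stand-in for "the OS's conversion functions". narrow() throws rather than silently substituting characters it cannot represent.

```cpp
#include <cwchar>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: default narrow encoding -> wide string.
std::wstring widen(const std::string& text)
{
    if (text.empty()) return std::wstring();
#ifdef _WIN32
    const int in_len = static_cast<int>(text.size());
    // First call asks for the required length, second call converts.
    const int out_len =
        MultiByteToWideChar(CP_ACP, 0, text.data(), in_len, nullptr, 0);
    if (out_len <= 0) throw std::runtime_error("widen: conversion failed");
    std::wstring wide(out_len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, text.data(), in_len, &wide[0], out_len);
    return wide;
#else
    // Uses the current C locale; stops at the first embedded NUL.
    const char* src = text.c_str();
    std::mbstate_t state = std::mbstate_t();
    const std::size_t out_len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (out_len == static_cast<std::size_t>(-1))
        throw std::runtime_error("widen: invalid input for this locale");
    std::wstring wide(out_len, L'\0');
    src = text.c_str();
    state = std::mbstate_t();
    std::mbsrtowcs(&wide[0], &src, out_len, &state);
    return wide;
#endif
}

// Hypothetical helper: wide string -> default narrow encoding.
// Throws if some character has no representation in that encoding.
std::string narrow(const std::wstring& text)
{
    if (text.empty()) return std::string();
#ifdef _WIN32
    // Assumption: the active code page is not UTF-8; the flag and the
    // used_default out-parameter are only valid for legacy code pages.
    const int in_len = static_cast<int>(text.size());
    const int out_len = WideCharToMultiByte(
        CP_ACP, WC_NO_BEST_FIT_CHARS, text.data(), in_len,
        nullptr, 0, nullptr, nullptr);
    if (out_len <= 0) throw std::runtime_error("narrow: conversion failed");
    std::string out(out_len, '\0');
    BOOL used_default = FALSE;
    WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS, text.data(), in_len,
                        &out[0], out_len, nullptr, &used_default);
    if (used_default)
        throw std::runtime_error("narrow: not representable in code page");
    return out;
#else
    const wchar_t* src = text.c_str();
    std::mbstate_t state = std::mbstate_t();
    const std::size_t out_len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (out_len == static_cast<std::size_t>(-1))
        throw std::runtime_error("narrow: not representable in this locale");
    std::string out(out_len, '\0');
    src = text.c_str();
    state = std::mbstate_t();
    std::wcsrtombs(&out[0], &src, out_len, &state);
    return out;
#endif
}
```

On POSIX you would call std::setlocale(LC_ALL, "") once at startup so that "the current locale" really is the user's default. Without it the program runs in the "C" locale and narrow() will typically reject non-ASCII text.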
I omitted one important caveat here: if you manipulate the string once you've converted it to UTF-16, you may not be able to convert it back to the default encoding losslessly. For example, as in your string above, if you take the original string in Arabic, up-convert it and append a Russian word, you can't blindly convert this back, as the default encoding may not be able to represent these two character sets simultaneously.

Alex

--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)