
At Wed, 19 Jan 2011 23:02:02 +0200, Peter Dimov wrote:
Dave Abrahams wrote: ...
OK. You're designing a portable library that talks to the OS. It has the following functions:
T get_path( ... );
void process_path( T );
What do you use for T? string or utf8_string?
I'm even less of an expert on encodings at the OS boundary than I am on encodings in general, but I'll take a shot at this one.
OK, according to all the experts (like you), we should be trafficking in UTF-8 everywhere, so I guess I'd say T is utf8_string (well, T is boost::filesystem::path, but that begs the same questions, ultimately).
My answer is different. T is std::string, and:
- on POSIX OSes, this string is taken directly from the OS and given directly to the OS, without any conversion;
- on Windows, this string is UTF-8 and is converted to UTF-16 before being given to the OS, and converted from UTF-16 after being received from it. This conversion should tolerate broken UTF-16 because the OS does so as well.
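To make "tolerate broken UTF-16" concrete, here is a minimal portable sketch of the UTF-16 to UTF-8 direction. The names append_utf8 and utf16_to_utf8_lenient are hypothetical, not from any library mentioned in this thread, and passing an unpaired surrogate through as a 3-byte sequence is only one possible reading of "tolerate"; a real Windows implementation would more likely build on the OS conversion routines.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append one code point as 1 to 4 bytes in UTF-8 layout.
// Surrogate code points (U+D800..U+DFFF) are NOT rejected here; they come out
// as 3-byte sequences, which strict UTF-8 forbids. That is the "tolerance".
inline void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical conversion for text received from a UTF-16 API.
// Well-formed surrogate pairs are combined into one code point; an unpaired
// surrogate is passed through unchanged instead of causing a failure.
inline std::string utf16_to_utf8_lenient(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size()) {
            const std::uint32_t low = in[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // consumed the low half of the pair
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Under this choice a file name containing a lone surrogate still round-trips, at the cost that the resulting std::string is not strictly valid UTF-8.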
A fine answer if:
a. you think the interface to std::string is a good one for posterity, and
b. every other std::string that might be used along with your portable library is guaranteed to be utf-8 encoded.
But I don't agree with a), and the interface to std::string makes a future where b) holds look highly unlikely to me. I prefer to have semantic constraints/invariants like "this is UTF-8 encoded" represented in the type system and enforced by public library interfaces. I'm arguing for a future like that.
-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
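As an illustration of what "represented in the type system and enforced by public library interfaces" could look like, here is a minimal sketch. The utf8_string class and is_valid_utf8 function below are hypothetical and written for this example only; they are not an existing Boost or standard interface, and a real design would need far more (iteration by code point, mutation that preserves the invariant, and so on).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Returns true if `s` is well-formed UTF-8. Rejects truncated sequences,
// overlong forms, surrogate code points, and values above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                       // stray continuation or invalid lead byte
        if (n - i < len) return false;           // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}

// Hypothetical wrapper: the only way to obtain one from raw bytes is the
// validating constructor, so every instance holds well-formed UTF-8.
class utf8_string {
public:
    utf8_string() = default;
    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {
        if (!is_valid_utf8(bytes_))
            throw std::invalid_argument("utf8_string: input is not valid UTF-8");
    }
    // Read-only access keeps callers from breaking the invariant.
    const std::string& bytes() const noexcept { return bytes_; }
private:
    std::string bytes_;
};
```

With such a type, a signature like void process_path( utf8_string ) states the encoding requirement in the interface itself, whereas void process_path( std::string ) leaves it to documentation and convention.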