Zach Laine wrote:
On Mon, Jun 12, 2017 at 1:14 PM, Groke, Paul via Boost <boost@lists.boost.org> wrote:
Since the UTF-8 conversion is only done on/for Windows, and Windows doesn't guarantee that all wchar_t paths (or strings in general) will always be valid UTF-16, wouldn't it make more sense to just *define* that the library always uses WTF-8, which allows round-tripping of all possible 16-bit strings? If it's documented that way it shouldn't matter. Especially since users of the library cannot rely on the strings being in UTF-8 anyway, at least not in portable applications.
I agree that round-tripping to wchar_t paths on Windows is very important. I also agree that not detecting invalid UTF-8, or failing to produce an error in such a case, is very important to *avoid*.
Can we get both? Is it possible to add a WTF-8 mode, perhaps only used in user-selectable string processing cases?
Well, I don't see why detecting invalid UTF-8 would be important. In my initial mail I (wrongly) assumed that the library would also be translating between the native encoding and UTF-8 on e.g. Linux (or non-Windows platforms in general). But since this isn't so, I guess the library can simply pass strings through unmodified on all platforms that have narrow APIs. In fact I think it should, since checking for valid UTF-8 would make some files inaccessible on systems like Linux, where you can very easily create file names that aren't valid UTF-8.

In that case the terms "Unicode" and "UTF-8" should not be used in describing the library (name and documentation). The documentation should just say that it transforms strings to some unspecified encoding with the following properties:

- byte-based
- ASCII-compatible (*)
- self-synchronizing
- able to 100% round-trip all characters/NUL-terminated strings from the native platform encoding

And for Windows this unspecified encoding would then just happen to be WTF-8.

(* For POSIX systems one cannot even 100% rely on that... so maybe using an even wider set of constraints would be good.)

On platforms like OS X, the API wrappers of Boost.Nowide would then simply produce the same result as the native APIs would -- because the narrow string would be passed through unmodified. If it's invalid UTF-8, and the OS decides to fail the call because of the invalid UTF-8, then so be it - using the library wouldn't change anything.

Almost the same for Windows: the library would simply round-trip invalid UTF-16 to the same invalid UTF-16. If the native Windows API decides to fail the call because of that, OK, then it just fails -- it would also have failed in a wchar_t application.

But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)
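P.S.: To make the round-trip point concrete, here is a rough sketch of the kind of 16-bit-to-WTF-8 conversion I have in mind. This is not Boost.Nowide's actual code, and the name to_wtf8 is just made up for illustration; it assumes char16_t units, i.e. what wchar_t holds on Windows.

    #include <cstddef>
    #include <string>

    // Sketch only: encodes a sequence of 16-bit units to WTF-8.
    // Valid surrogate pairs become 4-byte sequences (exactly as in UTF-8);
    // lone surrogates are encoded like ordinary code points (3 bytes),
    // which is the only place where WTF-8 differs from UTF-8.
    std::string to_wtf8(const std::u16string& in)
    {
        std::string out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            char32_t cp = in[i];
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
                && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Valid high/low surrogate pair -> supplementary code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            }
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                // Includes lone surrogates 0xD800..0xDFFF.
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

The decoding direction is the exact inverse (3-byte sequences in the surrogate range decode back to lone 16-bit surrogates), so every possible wchar_t string survives the wchar_t -> char -> wchar_t trip unchanged, whether it is valid UTF-16 or not. That is the round-trip property listed above, without ever having to reject anything as "invalid".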