Zach Laine wrote:
On Mon, Jun 12, 2017 at 1:14 PM, Groke, Paul via Boost <boost@lists.boost.org> wrote:
Since the UTF-8 conversion is only done on/for Windows, and Windows doesn't guarantee that all wchar_t paths (or strings in general) will always be valid UTF-16, wouldn't it make more sense to just *define* that the library always uses WTF-8, which allows round-tripping of all possible 16-bit strings? If it's documented that way it shouldn't matter. Especially since users of the library cannot rely on the strings being in UTF-8 anyway, at least not in portable applications.
I agree that round-tripping to wchar_t paths on Windows is very important. I also agree that not detecting invalid UTF-8, or failing to produce an error in such a case, is very important to *avoid*.
Can we get both? Is it possible to add a WTF-8 mode, perhaps only used in user-selectable string processing cases?
Well, I don't see why detecting invalid UTF-8 would be important. In my initial mail I (wrongly) assumed that the library would also be translating between the native encoding and UTF-8 on e.g. Linux (or non-Windows platforms in general). But since this isn't so, I guess the library can simply pass strings through unmodified on all platforms that have narrow APIs. In fact I think it should, since checking for valid UTF-8 would make some files inaccessible on systems like Linux, where you can very easily create file names that aren't valid UTF-8.

In that case the terms "Unicode" and "UTF-8" should not be used in describing the library (name and documentation). The documentation should just say that it transforms strings to some unspecified encoding with the following properties:

- byte-based
- ASCII-compatible (*)
- self-synchronizing
- able to 100% round-trip all characters/NUL-terminated strings from the native platform encoding

And for Windows this unspecified encoding would then just happen to be WTF-8.

(* For POSIX systems one cannot even 100% rely on that... so maybe using an even wider set of constraints would be good.)

On platforms like OS X, the API wrappers of Boost.Nowide would then simply produce the same result as the native APIs would -- because the narrow string would be passed through unmodified. If it's invalid UTF-8, and the OS decides to fail the call because of the invalid UTF-8, then so be it - using the library wouldn't change anything.

Almost the same for Windows: the library would simply round-trip invalid UTF-16 to the same invalid UTF-16. If the native Windows API decides to fail the call because of that, OK, then it just fails -- it would also have failed in a wchar_t application.

But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)
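P.S.: To make the round-trip point concrete, here is a rough sketch of the kind of 16-bit-to-WTF-8 conversion I have in mind. This is not Boost.Nowide's actual code, and the name to_wtf8 is just made up for illustration; it assumes char16_t units, i.e. what wchar_t holds on Windows.

    #include <cstddef>
    #include <string>

    // Sketch only: encodes a sequence of 16-bit units to WTF-8.
    // Valid surrogate pairs become 4-byte sequences (exactly as in UTF-8);
    // lone surrogates are encoded like ordinary code points (3 bytes),
    // which is the only place where WTF-8 differs from UTF-8.
    std::string to_wtf8(const std::u16string& in)
    {
        std::string out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            char32_t cp = in[i];
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
                && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Valid high/low surrogate pair -> supplementary code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            }
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                // Includes lone surrogates 0xD800..0xDFFF.
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

The decoding direction is the exact inverse (3-byte sequences in the surrogate range decode back to lone 16-bit surrogates), so every possible wchar_t string survives the wchar_t -> char -> wchar_t trip unchanged, whether it is valid UTF-16 or not. That is the round-trip property listed above, without ever having to reject anything as "invalid".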