On Tue, Jan 7, 2020 at 3:17 PM Gavin Lambert via Boost <boost@lists.boost.org> wrote:
>> See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset) and back losslessly. The unprecedented introduction of a platform-specific interface into the standard was, still is, and will always be, a horrible mistake.
> Given that WTF-8 is not itself supported by the C++ standard library (and the other formats are), that doesn't seem like a valid argument. You'd have to campaign for that to be added first.
It doesn't need to be added to the standard. My claim was that instead of adding a wchar_t/char Heisenstring to the standard and multiplying the number of fstream constructors, one could stick to char interfaces and demand that "the basic execution character set be capable of storing any Unicode data". A Windows implementation could satisfy that with WTF-8, allowing lossless transcoding.
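To make the "lossless" part concrete, here is a rough sketch of the UTF-16 -> WTF-8 direction (the name utf16_to_wtf8, the container choices and the lack of error handling are mine, purely for illustration; this is not code from any proposal or implementation). The only deviation from plain UTF-8 encoding is that a lone surrogate is encoded like any other BMP code point instead of being rejected or replaced, which is exactly what makes the round trip lossless:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative sketch: encode potentially ill-formed UTF-16 as WTF-8.
    std::string utf16_to_wtf8(const std::vector<uint16_t>& in) {
        std::string out;
        for (size_t i = 0; i < in.size(); ++i) {
            uint32_t cp = in[i];
            // A well-formed surrogate pair becomes one supplementary code point.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            }
            if (cp < 0x80) {                 // 1-byte form
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {         // 2-byte form
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {       // 3-byte form, lone surrogates included
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {                         // 4-byte form
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }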
> The main problem though is that once you start allowing transcoding of any kind, it's a slippery slope to other conversions that can make lossy changes (such as applying different canonicalisation formats, or adding/removing layout codepoints such as RTL markers).
The truth is that there is already transcoding happening: mount a Windows partition on Unix or vice versa. Some breakage is expected there if the filenames contain invalid sequences.
> Also, if you read the WTF-8 spec, it notes that it is not legal to directly concatenate two WTF-8 strings (you either have to convert them back to UTF-16 first, or apply special handling to the trailing characters of the first string), which immediately renders it a poor choice for a path storage format. And indeed a poor choice for any purpose. (I suspect many people who are using it have conveniently forgotten that part.)
Paths are almost always concatenated with ASCII separators (or other valid strings) in between. Even when malformed strings are concatenated directly, the issue does not arise if the result is passed straight back to the "UTF-16" system.
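For illustration only (again a rough sketch with a hypothetical name and no validation), the reverse direction shows why: decoding a naively concatenated WTF-8 sequence back to UTF-16 units gives exactly what concatenating on the UTF-16 side would have given.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative sketch: decode generalized UTF-8 (WTF-8, possibly the
    // result of naive concatenation) back to UTF-16 units.  No validation;
    // the input is assumed to be well-formed generalized UTF-8.
    std::vector<uint16_t> wtf8_to_utf16(const std::string& in) {
        std::vector<uint16_t> out;
        for (size_t i = 0; i < in.size();) {
            unsigned char b = static_cast<unsigned char>(in[i]);
            uint32_t cp;
            size_t len;
            if      (b < 0x80) { cp = b;        len = 1; }
            else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
            else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
            else               { cp = b & 0x07; len = 4; }
            for (size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
            i += len;
            if (cp >= 0x10000) {   // supplementary plane -> surrogate pair
                cp -= 0x10000;
                out.push_back(static_cast<uint16_t>(0xD800 + (cp >> 10)));
                out.push_back(static_cast<uint16_t>(0xDC00 + (cp & 0x3FF)));
            } else {               // BMP code point, lone surrogates included
                out.push_back(static_cast<uint16_t>(cp));
            }
        }
        return out;
    }

    // "\xED\xA0\xBD" is the WTF-8 form of the lone high surrogate U+D83D,
    // "\xED\xB8\x80" of the lone low surrogate U+DE00.  Their byte-wise
    // concatenation is not valid WTF-8, yet decoding it yields
    // { 0xD83D, 0xDE00 } -- exactly the valid pair for U+1F600 that the
    // "UTF-16" system would have produced by concatenating the originals.
    // std::vector<uint16_t> units = wtf8_to_utf16("\xED\xA0\xBD" "\xED\xB8\x80");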
> Although on a related note, I think C++11/17 dropped the ball a bit on the new encoding-specific character types. [...]
C++11 over-engineered it, and you keep over-engineering it even further. I can't think of a time when anybody had to mix ASCII, UTF-8, WTF-8, and EBCDIC strings in one program *at compile time*.

--
Yakov Galka
http://stannum.co.il/