Yakov Galka wrote:
On Mon, Jun 12, 2017 at 12:20 PM, Groke, Paul via Boost <boost@lists.boost.org> wrote:
I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I think it could make sense to support it anyway though. Round tripping (strictly invalid, but possible) file names on Windows, easier interoperability with stuff like JNI, ...
Don't you mean WTF-8 then? AFAIK "Modified UTF-8" is UTF-8 that encodes the null character with an overlong sequence, and thus is incompatible with standard UTF-8, unlike WTF-8 which is a compatible extension.
No, I mean modified UTF-8. Modified UTF-8 is UTF-8 plus the following extensions:
- Allow encoding UTF-16 surrogates as if they were code points (= what "WTF-8" does)
- Allow an over-long 2-byte encoding of the NUL character
Neither is strictly UTF-8 compatible, but neither introduces significant overhead in most situations. I don't see how over-long NUL encodings are "more incompatible" than UTF-8 encoded surrogates, but then again that's not really important.
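For illustration, the two extensions listed above can be sketched as an encoder that treats every UTF-16 code unit as if it were a code point. This is only a minimal sketch: `to_modified_utf8` is a hypothetical helper name, not something Nowide provides, and it encodes each surrogate on its own rather than combining pairs.

```cpp
#include <string>

// Hypothetical helper (not part of Nowide): encodes each UTF-16 code unit
// independently, so surrogates come out as 3-byte sequences and NUL comes
// out as the over-long pair C0 80.
std::string to_modified_utf8(const std::u16string& in) {
    std::string out;
    for (char16_t u : in) {
        if (u == 0) {
            // Extension 2: over-long 2-byte encoding of NUL.
            out += '\xC0';
            out += '\x80';
        } else if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            // Extension 1: this branch also covers D800..DFFF, which
            // strict UTF-8 would refuse to encode.
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

With this sketch, the surrogate pair D83D DE00 becomes the six bytes ED A0 BD ED B8 80 instead of a single 4-byte UTF-8 sequence.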
OTOH it would add overhead for systems with native UTF-8 APIs, because Nowide would at least have to check every string for "modified UTF-8 encoded" surrogate pairs and convert the string if necessary. Which of course is a good argument for not supporting modified UTF-8, because then Nowide could just pass the strings through unmodified on those systems.
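To make the per-string check concrete, here is a rough sketch of the scan that would be needed before handing a string to a strict UTF-8 API. `needs_conversion` is a hypothetical name, and the sketch assumes the input is otherwise well-formed: under that assumption the byte ED is always a lead byte, and in strict UTF-8 it is only ever followed by 80..9F, while C0 never appears at all.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the string contains an encoded
// surrogate (ED A0..BF xx) or an over-long NUL (C0 80), i.e. if it cannot
// be passed unmodified to an API that expects strict UTF-8.
// Assumes the input is otherwise well-formed.
bool needs_conversion(const std::string& s) {
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF) return true;  // surrogate
        if (b0 == 0xC0 && b1 == 0x80) return true;                // over-long NUL
    }
    return false;
}
```

The scan is linear and branch-light, but it still touches every byte of every string, which is the overhead in question.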
Implementing WTF-8 removes a check in UTF-8 -> UTF-16 conversion, and doesn't change anything in the reverse direction when there is a valid UTF-16. I suspect it isn't slower.
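As a rough illustration of the "removed check", consider decoding a single 3-byte sequence. The function below is a hypothetical sketch, not Nowide's decoder; the `strict` flag is an assumption introduced here to show both behaviours side by side, and the sketch ignores the finer rules of the WTF-8 specification.

```cpp
#include <cstdint>

// Hypothetical sketch: decodes one 3-byte sequence and returns the value,
// or -1 if the sequence is rejected.
std::int32_t decode3(unsigned char b0, unsigned char b1, unsigned char b2,
                     bool strict) {
    // Lead byte must be 1110xxxx, both trailing bytes 10xxxxxx.
    if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80)
        return -1;
    const std::int32_t cp =
        ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    if (cp < 0x800) return -1;  // over-long encoding
    // The one check a surrogate-tolerant decoder leaves out:
    if (strict && cp >= 0xD800 && cp <= 0xDFFF) return -1;
    return cp;
}
```

With `strict` set, ED A0 BD is rejected; without it, the same bytes decode to D83D and can be written out as a UTF-16 code unit unchanged.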
Supporting modified UTF-8 or WTF-8 adds overhead on systems where the native OS API accepts UTF-8, but only strictly valid UTF-8. When some UTF-8-enabled function of the library is called on such a system, it would have to check for WTF-8 encoded surrogates and convert them to "true" UTF-8 before passing the string to the OS API, because you would expect and want the "normal" UTF-8 encoding of a string to refer to the same file as the WTF-8 encoding of the same string.

Regards,
Paul Groke