Let me rephrase.
The user gets a path from an external source and passes it to Nowide, for example, to nowide::fopen.
Nowide does no validation on POSIX and strict UTF-8 validation on Windows.
Why is the Windows-only strict validation a good thing?
What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY under Windows, and passing everything through unvalidated on POSIX?
Ok... This is a very good question. On Windows I need to convert for an obvious reason. Now the question is what I accept as valid, what is invalid, and where I draw the line.

As you have seen, there are many possible "non-standard" UTF-8 variants. What should I accept?

- Should I accept CESU-8 (non-BMP characters encoded as two 3-byte "UTF-8-like" sequences) in addition to UTF-8?
- Should I accept WTF-8 with only non-paired surrogates? And what if these surrogates can be combined into correct UTF-16, is that valid? (In fact, concatenation of WTF-8 strings is not a trivial operation: simple string + string does not work and leads to invalid WTF-16.)
- Should I accept Modified UTF-8, as was already asked, i.e. values encoded without the shortest sequence? For example, should I accept "x" encoded in two bytes? What about "."?
- How should I treat something like "\xFF.txt", which is invalid UTF-8? Should I convert it to L"\xFF.txt"? Should I convert it to some sort of WTF-16 to preserve the string?

Now, despite what it may look like from the discussion, WTF-8 is far from being a "standard" for representing invalid UTF-16. Some systems substitute with "?", others with U+FFFD, others just remove the offending unit. I actually tested some real "bad" file names on different systems, and each did something different. I don't think WTF-8 is some widely used industry standard; it is just one of many variants of extending UTF-8.

But there MUST be a clear line between what is accepted and what is not, and the safest and most standard line to draw is well-defined UTF-8 and UTF-16, which are (a) 100% convertible one to the other and (b) widely used, accepted standards. So that was my decision, based on safety and standards (and there is no such thing as non-strict UTF-8/16).

Does it fit everybody? Of course not! Is there some line that fits everybody? There is no such thing! Does it fit the common case for the vast majority of users/developers? IMHO yes. So as a policy I decided to use UTF-8 and UTF-16, since a selection of encoding for each side of widen/narrow is required.
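To make the chosen line concrete, here is a minimal sketch of a strict UTF-8 validity check in the spirit of the policy described above. It is not Nowide's actual implementation, just an illustration of the rule: reject overlong forms (so the Modified UTF-8 two-byte "x" fails), encoded surrogates (so WTF-8 and CESU-8 fail), code points beyond U+10FFFF, and stray bytes such as 0xFF.

    #include <cstddef>
    #include <cstdint>

    // Returns true only for well-formed UTF-8: shortest-form encodings of
    // scalar values U+0000..U+10FFFF, excluding the surrogate range.
    bool is_strict_utf8(const unsigned char* s, std::size_t n)
    {
        for (std::size_t i = 0; i < n;) {
            unsigned char b = s[i];
            if (b < 0x80) { ++i; continue; }               // ASCII
            std::size_t len;
            std::uint32_t cp, min;
            if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; min = 0x80;    }
            else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800;   }
            else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
            else return false;                             // 0xFF, lone continuation byte, ...
            if (i + len > n) return false;                 // truncated sequence
            for (std::size_t j = 1; j < len; ++j) {
                if ((s[i + j] & 0xC0) != 0x80) return false; // bad continuation byte
                cp = (cp << 6) | (s[i + j] & 0x3F);
            }
            if (cp < min) return false;                     // overlong: rejects Modified UTF-8
            if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate: rejects WTF-8 and CESU-8
            if (cp > 0x10FFFF) return false;                // beyond the Unicode range
            i += len;
        }
        return true;
    }

Anything this check accepts converts losslessly to UTF-16 and back, which is exactly property (a) above.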
If a program originates on Windows and, as a result, comes to rely on Nowide's strict validation, and it is later ported to POSIX, aren't its users left with a false sense of security?
You have a valid point. But adding validation on POSIX systems would in general be wrong (as I noted in the Q&A), because there is no single encoding for a POSIX OS: the encoding is a runtime parameter, unlike the Windows API, which provides a wide UTF-16 API. Such validation on Linux/POSIX would likely cause more issues than it solves.

Thanks, Artyom
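A minimal sketch of this policy split, assuming nothing about Nowide's internals: my_fopen and widen_strict are hypothetical names, and MultiByteToWideChar with MB_ERR_INVALID_CHARS stands in for whatever strict UTF-8 to UTF-16 conversion is used. On Windows, invalid UTF-8 makes the call fail; on POSIX, the bytes are handed to fopen untouched because their encoding is only known at runtime.

    #include <cstdio>

    #ifdef _WIN32
    #include <cstring>
    #include <string>
    #include <windows.h>

    // Strict UTF-8 -> UTF-16: MB_ERR_INVALID_CHARS makes the Win32 converter
    // fail on invalid input instead of substituting a replacement character.
    static bool widen_strict(std::wstring& out, const char* utf8)
    {
        int len = static_cast<int>(std::strlen(utf8));
        if (len == 0) { out.clear(); return true; }
        int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, len, nullptr, 0);
        if (n <= 0)
            return false;                             // not well-formed UTF-8
        out.resize(n);
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, len, &out[0], n);
        return true;
    }

    FILE* my_fopen(const char* name, const char* mode)
    {
        std::wstring wname, wmode;
        if (!widen_strict(wname, name) || !widen_strict(wmode, mode))
            return nullptr;                           // strict validation: refuse invalid UTF-8
        return _wfopen(wname.c_str(), wmode.c_str()); // native wide API
    }
    #else
    FILE* my_fopen(const char* name, const char* mode)
    {
        // POSIX: the narrow encoding is a runtime (locale) parameter,
        // so the bytes are passed through without any validation.
        return std::fopen(name, mode);
    }
    #endif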