[review] Review of Nowide (Unicode) starts today
Hi Everyone,

The formal review of Artyom Beilis' Nowide library starts today and will last until Wed. 21st of June. Your participation is encouraged, as the proposed library is uncoupled, focused and rather small. Nowadays everyone needs to handle Unicode, but this is very difficult to do in a platform-independent way using only the standard library. Nowide offers a very simple way to handle Unicode the same way on Windows/MacOS/Linux.

Key features:
* Work with UTF-8 in your code; Nowide converts to the OS encoding
* Easy-to-use functions for converting UTF-8 to/from UTF-16
* A class to fix the argc, argv and env main parameters to use UTF-8
* UTF-8 aware functions:
  - stdio.h functions (fopen, freopen, remove, rename)
  - stdlib.h functions (system, getenv, setenv, unsetenv, putenv)
  - fstream (filebuf, fstream/ofstream/ifstream)
  - iostream (cout, cerr, clog, cin)

Documentation: http://cppcms.com/files/nowide/html
GitHub: https://github.com/artyom-beilis/nowide
git clone https://github.com/artyom-beilis/nowide.git

Latest tarballs:
- to be unzipped in boost source: https://github.com/artyom-beilis/nowide/archive/master.zip
- as a standalone library: http://cppcms.com/files/nowide/nowide_standalone.zip

Nowide has the standard boost layout, so just put it into the boost source tree under the libs/nowide directory for building and running tests. Alternatively, for the header-only part of the library, add the nowide/include path to the include path of your projects.

Please post your comments and review to the boost mailing list (preferably), or privately to the Review Manager (to me ;-).

Here are some questions you might want to answer in your review:
- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any problems?
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
- Are you knowledgeable about the problem domain?

And most importantly:
- Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.

For more information about the Boost Formal Review Process, see: http://www.boost.org/community/reviews.html

Thank you very much for your time and efforts.

Frédéric
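To give a quick feel for the intended use, here is a minimal sketch based on the feature list above (the header paths follow the documentation, but treat the exact names as illustrative rather than authoritative):

    #include <boost/nowide/args.hpp>
    #include <boost/nowide/fstream.hpp>
    #include <boost/nowide/iostream.hpp>
    #include <string>

    int main(int argc, char **argv)
    {
        boost::nowide::args a(argc, argv); // makes argv[] UTF-8 on Windows; a no-op elsewhere
        if (argc < 2)
            return 1;
        boost::nowide::ifstream f(argv[1]); // UTF-8 name, opened via the wide API on Windows
        std::string line;
        while (std::getline(f, line))
            boost::nowide::cout << line << '\n'; // UTF-8 aware replacement for std::cout
        return 0;
    }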
Frédéric Bron wrote
Your participation is encouraged, as the proposed library is uncoupled, focused and rather small. Nowadays everyone needs to handle Unicode, but this is very difficult to do in a platform-independent way using only the standard library. Nowide offers a very simple way to handle Unicode the same way on Windows/MacOS/Linux.
Key features:
* Work with UTF-8 in your code; Nowide converts to the OS encoding
* Easy-to-use functions for converting UTF-8 to/from UTF-16
* A class to fix the argc, argv and env main parameters to use UTF-8
* UTF-8 aware functions:
  - stdio.h functions (fopen, freopen, remove, rename)
  - stdlib.h functions (system, getenv, setenv, unsetenv, putenv)
  - fstream (filebuf, fstream/ofstream/ifstream)
  - iostream (cout, cerr, clog, cin)
Not a review, just a question (actually two): how does Nowide deal with modified UTF-8? I'm asking because if modified UTF-8 is fully supported (converting from/to, passing modified UTF-8 directly to the UTF-8 aware functions etc.), then Nowide could be valuable to us. Also: are any other platforms aside from Windows/OS X/Linux supported (e.g. AIX, z/OS)? Thanks, Paul Groke
Not a review, just a question (actually two): how does Nowide deal with modified UTF-8?
No, modified UTF-8 is not supported. Since it isn't UTF-8, it will be considered an invalid encoding.
Also: are any other platforms aside from Windows/OS X/Linux supported (e.g. AIX, z/OS)?
Actually, if you take a look at the docs, the support of POSIX-based OSes (or any OS with a native narrow API) is transparent.
Thanks, Paul Groke
Artyom Beilis
On Mon, Jun 12, 2017 at 11:15 AM, Artyom Beilis via Boost < boost@lists.boost.org> wrote:
No, modified UTF-8 is not supported. Since it isn't UTF-8, it will be considered an invalid encoding.
On a related note, does it support WTF-8? I.e. encoding lone UTF-16 surrogates (malformed UTF-16 sequences) within the UTF-8 scheme. It is needed to guarantee UTF-16 → UTF-8 → UTF-16 roundtrip of invalid UTF-16 data on Windows, and is not an invalid behavior per se, because all valid UTF-16 sequences still map bijectively onto valid UTF-8 sequences. -- Yakov Galka http://stannum.co.il/
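For concreteness, this is what WTF-8 means at the byte level (plain Unicode arithmetic, nothing library-specific):

    // The unpaired surrogate U+D800 (invalid on its own in UTF-16) is encoded
    // in WTF-8 with the normal three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx:
    unsigned char wtf8_lone_surrogate[3] = { 0xED, 0xA0, 0x80 };
    // A strict UTF-8 decoder must reject these bytes, because code points
    // U+D800..U+DFFF are excluded from UTF-8 by definition. Accepting them is
    // exactly what buys the UTF-16 -> WTF-8 -> UTF-16 round-trip described above.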
On a related note, does it support WTF-8? I.e. encoding lone UTF-16 surrogates (malformed UTF-16 sequences) within the UTF-8 scheme. It is needed to guarantee UTF-16 → UTF-8 → UTF-16 roundtrip of invalid UTF-16 data on Windows, and is not an invalid behavior per se, because all valid UTF-16 sequences still map bijectively onto valid UTF-8 sequences.
-- Yakov Galka http://stannum.co.il/
No, it does not. I considered it before, but I think there is a security risk in creating or accepting malformed UTF-8 or UTF-16. Converting invalid UTF-16 to WTF-8 and the other way around is not obvious behavior and has the potential for security issues, especially for users who are not aware of such an issue. So invalid UTF-8/16 sequences are rejected by design. Artyom
Artyom Beilis wrote:
Converting invalid UTF-16 to WTF-8 and the other way around is not obvious behavior and has the potential for security issues, especially for users who are not aware of such an issue. So invalid UTF-8/16 sequences are rejected by design.
Implying that there are Windows file names that can't be handled by the library?
On 12 June 2017 at 15:11, Peter Dimov via Boost
Artyom Beilis wrote:
Converting invalid UTF-16 to WTF-8 and the other way around is not obvious
behavior and has the potential for security issues, especially for users who are not aware of such an issue. So invalid UTF-8/16 sequences are rejected by design.
Implying that there are Windows file names that can't be handled by the library?
Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined as UB?" How important is this point? I'm oblivious to this problem, and it sounds like I would like to keep it that way. degski -- "*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend, Schmerzen aus Schwäche stillend.*" - Novalis 1798
On 12 June 2017 at 17:57, Peter Dimov via Boost
degski wrote:
Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined
as UB?"
Of course not. Why would one need to use the library then? It defeats the whole purpose of it.
From WP (read up on it now): "RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." *The Unicode Standard* requires decoders to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence.""
So not UB then, but it should not pass either. Are we talking FAT32 or NTFS? What Windows versions are affected? I also think, as some posters below (and in another thread) state, that Windows should not be treated differently. A new Boost library should not accommodate bad/sloppy historic Windows quirks. The library *can* require that its use depends on the system and its users adhering to the standard.

Then WP on overlong encodings: "The standard specifies that the correct encoding of a code point use only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called *overlong* and are not valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between code points and their valid encodings, so that there is a unique valid encoding for each code point."

The key being: "... are not valid UTF-8 representations ...", i.e. we're back to the case above.

degski

WP: https://en.wikipedia.org/wiki/UTF-8

-- "*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend, Schmerzen aus Schwäche stillend.*" - Novalis 1798
On Mon, Jun 12, 2017 at 3:11 PM, Peter Dimov via Boost
Artyom Beilis wrote:
Converting invalid UTF-16 to WTF-8 and the other way around is not obvious behavior and has the potential for security issues, especially for users who are not aware of such an issue. So invalid UTF-8/16 sequences are rejected by design.
Implying that there are Windows file names (i.e. names that contain invalid UTF-16) that can't be handled by the library?
Yes indeed. If you use something like FindFirstFileW and it returns invalid UTF-16, you will get an error trying to convert it to UTF-8. By definition you can't handle file names that can't be represented in UTF-8, as no valid UTF-8 representation exists. Artyom
On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost
On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost
On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost
wrote:
AB> By definition: you can't handle file names that can't be represented
AB> in UTF-8 as no valid UTF-8 representation exists.
This is a nice principle to have in theory, but very unfortunate in practice because at least under Unix systems such file names do occur in the wild (maybe less often now than 10 years ago, when UTF-8 was less ubiquitous, but it's still hard to believe that the problem has completely disappeared). And there are ways to solve it, e.g. I think glib represents such file names using special characters from a PUA and there are other possible approaches, even if, admittedly, none of them is perfect.
Please note: under POSIX platforms no conversions are performed and no UTF-8 validation is done, as this would be incorrect: http://cppcms.com/files/nowide/html/index.html#qna

The only case is when the Windows wide API returns/creates invalid UTF-16 - which can happen only when invalid UTF-16 surrogate pairs are generated - and they have no valid UTF-8 representation. On the other hand, deliberately creating invalid UTF-8 is a very problematic idea.

Regards, Artyom
On Mon, Jun 12, 2017 at 8:39 PM, Artyom Beilis via Boost < boost@lists.boost.org> wrote:
Please note: under POSIX platforms no conversions are performed and no UTF-8 validation is done, as this would be incorrect:
http://cppcms.com/files/nowide/html/index.html#qna
The only case is when the Windows wide API returns/creates invalid UTF-16 - which can happen only when invalid UTF-16 surrogate pairs are generated - and they have no valid UTF-8 representation.
On the other hand, deliberately creating invalid UTF-8 is a very problematic idea.
That's not convincing. Why does Windows have to be different in this regard? In both cases, Windows or POSIX, valid input gives valid UTF-8, and invalid input can give 'invalid' UTF-8. I expect the library to be as transparent a layer as possible and not to interfere with such issues. -- Yakov Galka http://stannum.co.il/
Yakov Galka wrote:
That's not convincing. Why does Windows have to be different in this regard?
Because it is. The file name on Windows physically is a sequence of uint16_t, whereas on Linux, it's a sequence of uint8_t. On Linux, any null-terminated byte sequence results in a perfect roundtrip, as it's stored unmodified, it's literally the exact same byte sequence as the one you gave. If the library tried to mess with the file name in any way, this would result in nothing but problems. On Windows, it is simply not possible to have the exact same byte sequence stored as the file name. The file system does not use byte sequences, it uses uint16_t sequences. (Mac OS X also stores uint16_t by the way, and the functions require UTF-8. So it's not quite POSIX vs Windows.)
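To make the asymmetry concrete, a sketch of the native calls involved (standard Windows and POSIX APIs, nothing Nowide-specific; the file names are deliberately pathological):

    #ifdef _WIN32
    #include <windows.h>
    #else
    #include <fcntl.h>
    #endif

    void native_open_sketch()
    {
    #ifdef _WIN32
        // Windows: names are uint16_t sequences; a lone surrogate is storable.
        wchar_t name[] = { 0xD800, L'.', L't', L'x', L't', 0 };
        HANDLE h = CreateFileW(name, GENERIC_READ, 0, 0, OPEN_EXISTING, 0, 0);
        (void)h;
    #else
        // POSIX: names are opaque byte sequences; any byte except '\0' and '/'
        // round-trips exactly, whether or not it is valid UTF-8.
        int fd = open("\xFF.txt", O_RDONLY);
        (void)fd;
    #endif
    }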
On Mon, Jun 12, 2017 at 9:30 PM, Peter Dimov via Boost < boost@lists.boost.org> wrote:
Yakov Galka wrote:
That's not convincing. Why does Windows have to be different in this regard?
Because it is. The file name on Windows physically is a sequence of uint16_t, whereas on Linux, it's a sequence of uint8_t. On Linux, any null-terminated byte sequence results in a perfect roundtrip, as it's stored unmodified, it's literally the exact same byte sequence as the one you gave. If the library tried to mess with the file name in any way, this would result in nothing but problems.
On Windows, it is simply not possible to have the exact same byte sequence stored as the file name. The file system does not use byte sequences, it uses uint16_t sequences.
Yeah, so? I say that the library can provide a Windows -> std::string -> Windows roundtrip just as it does with any other platform. If FreeBSD -> std::string conversion can return invalid UTF-8, then so does Windows -> std::string conversion. -- Yakov Galka http://stannum.co.il/
Yakov Galka wrote:
Yeah, so? I say that the library can provide a Windows -> std::string -> Windows roundtrip just as it does with any other platform. If FreeBSD -> std::string conversion can return invalid UTF-8, then so does Windows -> std::string conversion.
The security concern here is that under FreeBSD the file name is what it is, and different byte sequences refer to different names, whereas under Windows, if invalid UTF-8 is allowed, many different byte sequences may map to the same file name. This does not necessarily preclude handling unpaired surrogates though. In practice the main problem is probably with overlong encodings of '.', '/' and '\'. Last time this came up, I argued that if you rely on finding '.' as the literal 8-bit '.', your input validation is wrong anyway, but requiring strictly valid UTF-8 is a reasonable first line of defense.

And realistically, if you want to validate the input in the correct manner, you have to #ifdef for each OS anyway, which kind of makes the library redundant. So in the specific use case where you _do_ use the library to avoid #ifdef'ing, it does make sense for it to protect you from invalid UTF-8 on Windows.

With all that said, I don't quite see the concern with WTF-8. What's the attack we're defending from by disallowing it?
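To spell out the overlong concern with concrete bytes (standard UTF-8 arithmetic, not a statement about any library's behavior):

    // '/' is U+002F, correctly encoded as the single byte 0x2F. An overlong
    // two-byte form of the same code point is 110|00000 10|101111:
    unsigned char overlong_slash[2] = { 0xC0, 0xAF };
    // A validator that scans raw bytes for 0x2F never sees the separator, but
    // a decoder that tolerates overlongs still decodes it as '/' - the classic
    // "..%c0%af" directory-traversal trick. Rejecting overlongs closes this hole.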
On Mon, Jun 12, 2017 at 9:51 PM, Peter Dimov via Boost < boost@lists.boost.org> wrote:
... whereas under Windows, if invalid UTF-8 is allowed, many different byte sequences may map to the same file name.
This is a false presumption. Nobody here proposes allowing absolutely ANY byte sequence, only using WTF-8 as a means of guaranteeing a round-trip. And as far as WTF-8 goes, there is a unique representation for every 16-bit code unit sequence.
With all that said, I don't quite see the concern with WTF-8. What's the attack we're defending from by disallowing it?
There are some concerns with WTF-8: specifically, if you concatenate two WTF-8 strings where one ends in an unpaired surrogate and the other begins with one, then the result is an invalid WTF-8 string. Filenames are usually parsed and concatenated on ASCII separators, so I don't see a problem in the typical use case. As for the non-typical use cases, I would argue that they are beyond the responsibility of this library. -- Yakov Galka http://stannum.co.il/
Artyom Beilis wrote:
On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost wrote:
On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost wrote:
AB> By definition: you can't handle file names that can't be
AB> represented in UTF-8 as no valid UTF-8 representation exists.
This is a nice principle to have in theory, but very unfortunate in practice because at least under Unix systems such file names do occur in the wild (maybe less often now than 10 years ago, when UTF-8 was less ubiquitous, but it's still hard to believe that the problem has completely disappeared). And there are ways to solve it, e.g. I think glib represents such file names using special characters from a PUA, and there are other possible approaches, even if, admittedly, none of them is perfect.
Please note: under POSIX platforms no conversions are performed and no UTF-8 validation is done, as this would be incorrect:
Well... what's correct on POSIX platforms is a matter of opinion. If you go with the strict interpretation, then in fact conversion from the current locale to UTF-8 must be considered incorrect. But then you cannot rely on *anything*, except that 0x00 is NUL and 0x2F is the path separator. Which makes any kind of isdigit/toupper/tolower/... string parsing/processing "incorrect".
The only case is when the Windows wide API returns/creates invalid UTF-16 - which can happen only when invalid UTF-16 surrogate pairs are generated - and they have no valid UTF-8 representation.
On the other hand, deliberately creating invalid UTF-8 is a very problematic idea.
Since the UTF-8 conversion is only done on/for Windows, and Windows doesn't guarantee that all wchar_t paths (or strings in general) will always be valid UTF-16, wouldn't it make more sense to just *define* that the library always uses WTF-8, which allows round-tripping of all possible 16 bit strings? If it's documented that way it shouldn't matter. Especially since users of the library cannot rely on the strings being in UTF-8 anyway, at least not in portable applications. I agree that the over-long zero/NUL encoding part of modified UTF-8 might still be problematic though, and therefore WTF-8 might be the better choice.

Now that still leaves some files that can theoretically exist on a Windows system inaccessible (i.e. those with embedded NUL characters), but those are not accessible via the "usual" Windows APIs either (CreateFileW etc.). So this should be acceptable.
On Mon, Jun 12, 2017 at 1:14 PM, Groke, Paul via Boost < boost@lists.boost.org> wrote:
Since the UTF-8 conversion is only done on/for Windows, and Windows doesn't guarantee that all wchar_t paths (or strings in general) will always be valid UTF-16, wouldn't it make more sense to just *define* that the library always uses WTF-8, which allows round-tripping of all possible 16 bit strings? If it's documented that way it shouldn't matter. Especially since users of the library cannot rely on the strings being in UTF-8 anyway, at least not in portable applications.
I agree that round-tripping to wchar_t paths on Windows is very important. I also agree that not detecting invalid UTF-8, or failing to produce an error in such a case, is very important to *avoid*. Can we get both? Is it possible to add a WTF-8 mode, perhaps only used in user-selectable string processing cases? Zach
Zach Laine wrote:
On Mon, Jun 12, 2017 at 1:14 PM, Groke, Paul via Boost < boost@lists.boost.org> wrote:
Since the UTF-8 conversion is only done on/for Windows, and Windows doesn't guarantee that all wchar_t paths (or strings in general) will always be valid UTF-16, wouldn't it make more sense to just *define* that the library always uses WTF-8, which allows round-tripping of all possible 16 bit strings? If it's documented that way it shouldn't matter. Especially since users of the library cannot rely on the strings being in UTF-8 anyway, at least not in portable applications.
I agree that round-tripping to wchar_t paths on Windows is very important. I also agree that not detecting invalid UTF-8, or failing to produce an error in such a case, is very important to *avoid*.
Can we get both? Is it possible to add a WTF-8 mode, perhaps only used in user-selectable string processing cases?
Well, I don't see why detecting invalid UTF-8 would be important. In my initial mail I (wrongly) assumed that the library would be translating between the native encoding and UTF-8 also on e.g. Linux (or non-Windows platforms in general). But since this isn't so, I guess the library can simply pass strings through on all platforms that have narrow APIs. In fact I think it should, since checking for valid UTF-8 would make some files inaccessible on systems like Linux, where you can very easily create file names that aren't valid UTF-8.

In that case the terms "Unicode" and "UTF-8" should not be used in describing the library (name and documentation). The documentation should just say that it transforms strings to some unspecified encoding with the following properties:
- byte-based
- ASCII compatible (*)
- self-synchronizing
- able to 100% round-trip all characters/NUL-terminated strings from the native platform encoding

And for Windows this unspecified encoding then would just happen to be WTF-8. (* For POSIX systems one cannot even 100% rely on that... so maybe using an even wider set of constraints would be good.)

On platforms like OS X, the API wrappers of Boost.Nowide would then simply produce the same result as the native APIs would -- because the narrow string would simply be passed through unmodified. If it's invalid UTF-8, and the OS decides to fail the call because of the invalid UTF-8, then so be it - using the library wouldn't change anything. Almost the same for Windows: the library would simply round-trip invalid UTF-16 to the same invalid UTF-16. If the native Windows API decides to fail the call because of that, OK, then it just fails -- it would also have failed in a wchar_t application.

But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)
But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)
Ok, I want to summarize my response regarding WTF-8, especially from a security point of view. There are lots of dangers in this assumption - generating WTF-8 as if it were UTF-8 can lead to serious issues. It is almost never OK to generate invalid UTF-8, especially since in all likelihood 99% of users will not understand what we are talking about with round-trips, WTF-8 and so on - and I'm talking from experience.

Just to make clear how serious it can be, this is a CVE regarding one bug in Boost.Locale: http://people.canonical.com/~ubuntu-security/cve/2013/CVE-2013-0252.html

Let's show a trivial example of what WTF-8 can lead to.

Tool:
- A user monitoring system that monitors all files and reports all changes by XML to some central server

Denial of Service attack example:
- The user creates a file with invalid UTF-16
- The system monitors the file system and adds it to the XML report in WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- The user does whatever he wants without monitoring
- He removes the file
- No reports were generated during the period the user needed - a DoS attack

Bottom line:
(a) Since 99% of programmers are barely aware of various Unicode issues, it is dangerous to assume that giving such a round trip in a trivial way is OK
(b) If you need to do complex manipulations of the file system you may consider Boost.Filesystem, which keeps the internal representation in native format and converts to UTF-8 for other uses (Nowide provides integration with Boost.Filesystem)
(c) It is possible to add WTF-8 support under the following restrictions:
(c.1) only by adding wtf8_to_wide and wide_to_wtf8 - if the user wants it explicitly(!)
(c.2) No "C/C++"-like API should accept it - i.e. nowide::fopen/nowide::fstream must not do it.

Artyom Beilis
On Mon, Jun 12, 2017 at 2:37 PM, Artyom Beilis via Boost < boost@lists.boost.org> wrote:
But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)
[snip]
Bottom line:
[snip]
To add a bottom line under that, it seems that nowide does not support round-tripping from arbitrary Windows path names to UTF-8 and back. Is that correct? Zach
Artyom Beilis wrote:
Denial of Service attack example:
- The user creates a file with invalid UTF-16
- The system monitors the file system and adds it to the XML report in WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- The user does whatever he wants without monitoring
- He removes the file
- No reports were generated during the period the user needed - a DoS attack
I can't help but note that the same attack would work under Unix. The user can easily create a file with an invalid UTF-8 name. And, since the library doesn't enforce valid UTF-8 on POSIX (right?) it would pass through.
On 12/06/2017 22:05, Peter Dimov via Boost wrote:
Artyom Beilis wrote:
Denial of Service attack example:
- The user creates a file with invalid UTF-16
- The system monitors the file system and adds it to the XML report in WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- The user does whatever he wants without monitoring
- He removes the file
- No reports were generated during the period the user needed - a DoS attack
I can't help but note that the same attack would work under Unix. The user can easily create a file with an invalid UTF-8 name. And, since the library doesn't enforce valid UTF-8 on POSIX (right?) it would pass through.
Also, running sanitisation on UTF sequences is laudable and all, but what about the 80-90% of code which will never, ever feed strings from untrusted remote parties to system APIs? Do they pay for the UTF sequence validation too?

Like Peter and many of us, I have a personal "nowide" library too. It simply wraps the *W() Win32 APIs consuming 16-bit characters with a call to the NT kernel function for converting UTF-8 to UTF-16, which is RtlUTF8ToUnicodeN(). It does nothing else fancy; it leaves it entirely up to the Windows kernel what that conversion means. I'd suggest that the nowide library ought to keep things similarly simple.

Niall

-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
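For reference, a sketch of what such a thin wrapper can look like. The RtlUTF8ToUnicodeN() signature is the MSDN-documented one; loading it from ntdll.dll at runtime is one common approach, and the error handling here is deliberately minimal:

    #include <windows.h>
    #include <string>

    // NTSTATUS-returning converter exported by ntdll.dll (signature per MSDN).
    typedef LONG (WINAPI *RtlUTF8ToUnicodeN_t)(
        wchar_t *dest, ULONG dest_bytes, ULONG *written_bytes,
        const char *src, ULONG src_bytes);

    std::wstring utf8_to_wide(const std::string &s)
    {
        static RtlUTF8ToUnicodeN_t conv = (RtlUTF8ToUnicodeN_t)
            GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "RtlUTF8ToUnicodeN");
        ULONG bytes = 0;
        // A null destination asks for the required size in bytes.
        conv(0, 0, &bytes, s.data(), (ULONG)s.size());
        std::wstring out(bytes / sizeof(wchar_t), L'\0');
        if (!out.empty())
            conv(&out[0], bytes, &bytes, s.data(), (ULONG)s.size());
        return out;
    }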
On Tue, Jun 13, 2017 at 12:05 AM, Peter Dimov via Boost
Artyom Beilis wrote:
Denial of Service attack example:
- The user creates a file with invalid UTF-16
- The system monitors the file system and adds it to the XML report in WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- The user does whatever he wants without monitoring
- He removes the file
- No reports were generated during the period the user needed - a DoS attack
I can't help but note that the same attack would work under Unix. The user can easily create a file with an invalid UTF-8 name. And, since the library doesn't enforce valid UTF-8 on POSIX (right?) it would pass through.
Note, under POSIX the user takes strings as-is and can't trust the source. Under Windows he needs to convert them using Nowide, which can give him the false assumption that he receives valid UTF-8.

Once again, I have no problem providing wtf8-to-wide and the-other-way-around functions when the user EXPLICITLY asks for it. But it shall not be the default behavior or some behavior you turn on with a global define.

Artyom
Artyom Beilis wrote:
On Tue, Jun 13, 2017 at 12:05 AM, Peter Dimov via Boost
wrote: Artyom Beilis wrote:
- The user creates a file with invalid UTF-16
- The system monitors the file system and adds it to the XML report in WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- The user does whatever he wants without monitoring
- He removes the file
- No reports were generated during the period the user needed - a DoS attack
I can't help but note that the same attack would work under Unix. The user can easily create a file with an invalid UTF-8 name. And, since the library doesn't enforce valid UTF-8 on POSIX (right?) it would pass through.
Note, under POSIX the user takes strings as-is and can't trust the source. Under Windows he needs to convert them using Nowide, which can give him the false assumption that he receives valid UTF-8.
OK, thanks for explaining, I understand your concern now. I don't agree that a library like Nowide should necessarily try to address that kind of issue though.

Regarding your example... isn't the whole point of Nowide to make it possible to write portable applications without #ifdef-ing stuff? So... for this to create a problem on Windows (and at the same time not create a problem on POSIX), a bunch of things would have to coincide:
- The programmer didn't read the manual and therefore assumes he gets UTF-8 on Windows
- The application then wants to write some cmdline argument/path/... into some document that requires valid UTF-8
- The software component that builds/writes that document fails to validate the UTF-8 itself
- The programmer decides to skip the UTF-8 validation on Windows, because he assumes it has already been done

Assuming that the application doesn't need an #ifdef _WIN32 in that part of the code anyway, that would mean that the programmer deliberately introduced an #ifdef _WIN32 just to selectively skip some validation on Windows, because he thinks that it might already have been done. No, I wouldn't say that's something that should be driving the design decisions of a library like Nowide.

Additionally, I don't see what validating the UTF-8 would improve here. In that case the user would still create a file with invalid UTF-16; only then the application would even fail to enumerate the directory or convert the path. And I'm not convinced that kind of error is more likely to be handled correctly.
Once again, I have no problem providing wtf8-to-wide and the-other-way-around functions when the user EXPLICITLY asks for it. But it shall not be the default behavior or some behavior you turn on with a global define.
The whole appeal of a library like Nowide is that it works more or less transparently - i.e. without having to write conversion calls all over the place. And duplicating all APIs is something that I would deem a very bad choice - that would be confusing as hell.
Artyom Beilis wrote:
Note, under POSIX the user takes strings as-is and can't trust the source. Under Windows he needs to convert them using Nowide, which can give him the false assumption that he receives valid UTF-8.
What use case are we discussing here? Is it the one where the user gets a path from an external source and passes it to Nowide? In which case Nowide does no validation on POSIX and strict UTF-8 validation on Windows? Giving the Windows user the false assumption that Nowide validates, which turns out to not happen in a later POSIX port. Or is it the one where the user receives a path from Nowide? Which currently doesn't occur anywhere, because the library does not return paths, it only takes them as arguments. (It would occur if you add `readdir`.)
Once again, I have no problem providing wtf8-to-wide and the-other-way-around functions when the user EXPLICITLY asks for it.
What would portable user code look like in this scenario?
On Tue, Jun 13, 2017 at 6:08 PM, Peter Dimov via Boost
Artyom Beilis wrote:
Note, under POSIX the user takes strings as-is and can't trust the source. Under Windows he needs to convert them using Nowide, which can give him the false assumption that he receives valid UTF-8.
What use case are we discussing here?
Is it the one where the user gets a path from an external source and passes it to Nowide? In which case Nowide does no validation on POSIX and strict UTF-8 validation on Windows? Giving the Windows user the false assumption that Nowide validates, which turns out to not happen in a later POSIX port.
Or is it the one where the user receives a path from Nowide? Which currently doesn't occur anywhere, because the library does not return paths, it only takes them as arguments. (It would occur if you add `readdir`.)
For that purpose you have Boost.Filesystem, which internally keeps wide strings on Windows and narrow on POSIX platforms. Nowide does not try to replace Boost.Filesystem. However, once you try to get a std::string from Boost.Filesystem (using the Nowide integration) you will get either an error or a valid UTF-8 string.
Once again, I have no problem providing wtf8-to-wide and the-other-way-around functions when the user EXPLICITLY asks for it.
What would portable user code look like in this scenario?
Nowide does not try to provide a solution for every possible portable API, but it provides a toolset to make such portable solutions easier. However, if you convert some wide string from some Windows API, you will expect to get valid UTF-8 and not WTF-8. Artyom
Artyom Beilis wrote:
On Tue, Jun 13, 2017 at 6:08 PM, Peter Dimov via Boost
wrote: Artyom Beilis wrote:
Note, under POSIX the user takes strings as-is and can't trust the source. Under Windows he needs to convert them using Nowide, which can give him the false assumption that he receives valid UTF-8.
What use case are we discussing here?
Is it the one where the user gets a path from an external source and passes it to Nowide? In which case Nowide does no validation on POSIX and strict UTF-8 validation on Windows? Giving the Windows user the false assumption that Nowide validates, which turns out to not happen in a later POSIX port.
Or is it the one where the user receives a path from Nowide? Which currently doesn't occur anywhere, because the library does not return paths, it only takes them as arguments. (It would occur if you add `readdir`.)
For that purpose you have Boost.Filesystem, which internally keeps wide strings on Windows and narrow on POSIX platforms.
Nowide does not try to replace Boost.Filesystem. However, once you try to get a std::string from Boost.Filesystem (using the Nowide integration) you will get either an error or a valid UTF-8 string.
This doesn't answer my question. Let me rephrase. The user gets a path from an external source and passes it to Nowide, for example, to nowide::fopen. Nowide does no validation on POSIX and strict UTF-8 validation on Windows. Why is the Windows-only strict validation a good thing? What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY under Windows, and passing everything through unvalidated on POSIX? If the program originates on Windows and as a result comes to rely on Nowide's strict validation, and is later ported to POSIX, are not the users left with a false sense of security?
Let me rephrase.
The user gets a path from an external source and passes it to Nowide, for example, to nowide::fopen.
Nowide does no validation on POSIX and strict UTF-8 validation on Windows.
Why is the Windows-only strict validation a good thing?
What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY under Windows, and passing everything through unvalidated on POSIX?
Ok... this is a very good question. On Windows I need to convert for obvious reasons. Now the question is what I accept as valid and what is not valid, and where do I draw the line.

Now, as you have seen, there are many possible "non-standard" UTF-8 variants. What should I accept?

- Should I accept CESU-8 (non-BMP code points encoded as 2 pairs of 3 "UTF-8"-like bytes) as well as UTF-8?
- Should I accept WTF-8, i.e. only non-paired surrogates? What if these surrogates can be combined into correct UTF-16, is it valid? (In fact concatenation of WTF-8 strings isn't a trivial operation, as a simple string + string does not work and can lead to invalid WTF-8.)
- Should I accept modified UTF-8 - it was already asked - i.e. values not encoded with the shortest sequence? For example, should I accept "x" encoded in two bytes? What about "."?
- How should I treat stuff like "\xFF.txt" <- invalid UTF-8? Should I convert it to L"\xFF.txt"? Should I convert it to some sort of WTF-16 to preserve the string?

Now, despite what it may look like from the discussion, WTF-8 is far from being a "standard" for representing invalid UTF-16. Some may substitute it with "?", others with U+FFFD, others just remove it. I actually tested some cases with real "bad" file names on different systems, and each did something different. I don't think WTF-8 is some widely used industry standard; it is just one of many variants of creating a UTF-8 extension.

But there MUST be a clear line between what is accepted and what is not, and the safest and most standard line to draw is well-defined UTF-8 and UTF-16, which are (a) 100% convertible one from the other and (b) widely used, accepted standards.

So that was my decision - based on safety and standards (and there is no such thing as non-strict UTF-8/16). Does it fit everybody? Of course not! Is there some line that fits everybody? There is no such thing! Does it fit the common case for the vast majority of users/developers? IMHO yes.

So as a policy I decided to use UTF-8 and UTF-16, as a selection of encoding for each side of widen/narrow is required.
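For readers keeping track of the variants named above, here is the same non-BMP character in each scheme (plain Unicode arithmetic, nothing Nowide-specific):

    // U+10000, the first non-BMP code point:
    //   UTF-16: the surrogate pair D800 DC00
    //   UTF-8:  F0 90 80 80        (one four-byte sequence)
    //   CESU-8: ED A0 80 ED B0 80  (each surrogate unit encoded separately)
    //   WTF-8:  same as UTF-8 here - it differs from UTF-8 only for
    //           *unpaired* surrogates, which have no UTF-8 form at all.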
If the program originates on Windows and as a result comes to rely on Nowide's strict validation, and is later ported to POSIX, are not the users left with a false sense of security?
You have a valid point. But adding validation on POSIX systems would in general be wrong (as I noted in the Q&A) because there is no single encoding for a POSIX OS - it is a runtime parameter - unlike the Windows API, which has a wide UTF-16 API. Such validation on Linux/POSIX would likely cause more issues than it solves. Thanks, Artyom
Why is the Windows-only strict validation a good thing?
What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY under Windows, and passing everything through unvalidated on POSIX?
Ok...
This is a very good question.
On Windows I need to convert for obvious reasons.
Now the question is what I accept as valid and what is not valid, and where do I draw the line.
Now, as you have seen, there are many possible "non-standard" UTF-8 variants.
What should I accept?
I still strongly suggest you simply call RtlUTF8ToUnicodeN() (https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85)....) to do the UTF-8 conversion. Do **nothing** else. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
On Tue, Jun 13, 2017 at 3:33 PM, Niall Douglas via Boost < boost@lists.boost.org> wrote:
Now, as you have seen, there are many possible "non-standard" UTF-8 variants.
What should I accept?
I still strongly suggest you simply call RtlUTF8ToUnicodeN() (https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx) to do the UTF-8 conversion. Do **nothing** else.
Niall, could you explain why? I don't know any of the Windows-relevant details. Zach
> Now, as you have seen, there are many possible "non-standard" UTF-8 variants.
>
> What should I accept?
I still strongly suggest you simply call RtlUTF8ToUnicodeN() (https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85)....) to do the UTF-8 conversion. Do **nothing** else.
Niall, could you explain why? I don't know any of the Windows-relevant details.
1. RtlUTF8ToUnicodeN() is what the NT kernel uses and isn't polluted by Win32.

2. RtlUTF8ToUnicodeN() has a well designed API, unlike the awful Win32 MultiByteToWideChar() function.

3. RtlUTF8ToUnicodeN() is close to as fast as any implementation of the same thing, unlike MultiByteToWideChar() and some STL implementations of <codecvt>.

4. RtlUTF8ToUnicodeN() treats invalid input in the way which the rest of the NT kernel is built to expect. I caveat this with saying that Win32 functions can mangle input in a really unhelpful way; I would only trust RtlUTF8ToUnicodeN() being fed directly to NT kernel APIs. That I know works well.

5. I've been using RtlUTF8ToUnicodeN() in my own code for years and have found it unsurprising and unproblematic, unlike MultiByteToWideChar() or C++11's UTF-8 support, which doesn't quite work right on some standard libraries.

Niall

-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
This is a very good question.
On Windows I need to convert for obvious reasons.
Now the question is what I accept as valid and what is not valid, and where do I draw the line.
Now, as you have seen, there are many possible "non-standard" UTF-8 variants.
What should I accept?
I still strongly suggest you simply call RtlUTF8ToUnicodeN() (https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85)....) to do the UTF-8 conversion. Do **nothing** else.
Niall
Actually I think you provided me a good direction I hadn't considered before.

RtlUTF8ToUnicodeN and the other-way-around function do something very simple: they substitute invalid code points/encodings with U+FFFD - REPLACEMENT CHARACTER - which is the standard Unicode way to say "I failed to convert a thing". It is similar to the current ANSI/wide conversions creating '?' instead.

It looks like a better way to do it than failing to convert the entire string altogether. If you get an invalid string, the conversion will succeed, but you'll get special characters (usually shown as � in a UI) that will actually tell you something was wrong. This way, for example, getenv on a valid key will not return NULL and create ambiguity about what happened, and it is actually behavior that is more common on Windows.

I like it and I think I'll change the behavior of the conversion functions in Boost.Nowide to this one.

Thanks! Artyom Beilis
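In concrete terms, the substitution policy described above would behave like this (an illustrative sketch, not the shipped Nowide behavior at the time of the review):

    // U+FFFD is EF BF BD in UTF-8 and the single code unit FFFD in UTF-16.
    // narrow -> wide on the invalid byte 0xFF:
    //   "\xFF.txt"    ->  L"\xFFFD.txt"       (conversion succeeds, marker visible)
    // wide -> narrow on the lone surrogate 0xD800:
    //   L"\xD800.txt" ->  "\xEF\xBF\xBD.txt"
    // Note that this is deliberately lossy: the original bytes are not recoverable.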
Actually I think you provided me a good direction I hadn't considered before.
RtlUTF8ToUnicodeN and the other-way-around function do something very simple:
They substitute invalid code points/encodings with U+FFFD - REPLACEMENT CHARACTER - which is the standard Unicode way to say "I failed to convert a thing".
It is similar to the current ANSI/wide conversions creating '?' instead.
It looks like a better way to do it than failing to convert the entire string altogether.
If you get an invalid string, the conversion will succeed, but you'll get special characters (usually shown as � in a UI) that will actually tell you something was wrong.
Some additional information about how the NT kernel treats the FFFD character might be useful to you. Quite a lot of kernel APIs treat UNICODE_STRING as just a bunch of bytes, e.g. the filesystem. You can supply any path with any characters at all, including one containing zero bytes. The only character you can't use is the backslash, as that is the path separator. This matches how most Unix filesystems work, and indeed at the NT kernel level path comparisons are case sensitive as well as byte sensitive (unless you ask otherwise), because it's just a memcmp(). You can have real fun here creating lots of paths completely unparseable by Win32 - as in, files totally inaccessible to any Win32 API, with some really random Win32 error codes being returned.

Other NT kernel APIs will refuse strings containing FFFD or illegal UTF-16 characters, and if they do it's generally because accepting them would be an obvious security risk. But those are definitely a minority.

If the Win32 layer doesn't get in the way, doing as RtlUTF8ToUnicodeN() does should be safe insofar as the kernel team feel it is. They have placed appropriate checks on appropriate kernel APIs. But in the end, it's down to the programmer to correctly validate and check all untrusted input. Doing so always is an unnecessary expense for most end users who have trusted input.

Niall

-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
RtlUTF8ToUnicodeN and the other-way-around function do something very simple: they substitute invalid code points/encodings with U+FFFD - REPLACEMENT CHARACTER - which is the standard Unicode way to say "I failed to convert a thing".
so does this allow roundtrip conversion UTF-16->UTF-8->UTF-16? Frédéric
Artyom Beilis wrote:
Now, as you have seen, there are many possible "non-standard" UTF-8 variants.
What should I accept?
Others have made their case for WTF-8, which has the desirable property of only allowing one encoding for any uint16_t sequence (taken in isolation). My personal choice here is to be more lenient and accept any combination of valid UTF-8 and UTF-8 encoded surrogates (a superset of CESU-8 and WTF-8), but I'm not going to argue very strongly for it over WTF-8, if at all. I think that overlongs should NOT be accepted, including overlong zero.
Peter Dimov wrote:
Others have made their case for WTF-8, which has the desirable property of only allowing one encoding for any uint16_t sequence (taken in isolation).
My personal choice here is to be more lenient and accept any combination of valid UTF-8 and UTF-8 encoded surrogates (a superset of CESU-8 and WTF-8), but I'm not going to argue very strongly for it over WTF-8, if at all.
I have to admit that I was confusing WTF-8 and CESU-8. When converting from wide to narrow I would favor CESU-8, mostly because it makes the rules simpler and creates fewer special cases (e.g. when concatenating). When converting from narrow to wide I agree that accepting any UTF-8/WTF-8/CESU-8 mix would be good.
I think that overlongs should NOT be accepted, including overlong zero.
Agreed.
Please note: under POSIX platforms no conversions are performed and no UTF-8 validation is done, as this would be incorrect:
I do not quite understand the rationale behind not converting to UTF-8 on POSIX platforms. I naively thought I got UTF-8 in argv because my system is configured in UTF-8, but I discover that this is not necessarily always the case. In the example you highlight, I do not see the difference from the Windows case. You could convert to UTF-8 in argv and back to the local encoding in nowide::remove. I understand it is not efficient if you do not really use the content of the filename, but if you have to write, say, an XML report in UTF-8, you would have to convert anyway.

Today, what is the portable way to convert argv to UTF-8, i.e. without having to #ifdef _WIN32...?

Frédéric
On Fri, Jun 16, 2017 at 8:00 AM, Frédéric Bron
Please note: under POSIX platforms no conversions are performed and no UTF-8 validation is done, as this would be incorrect:
I do not quite understand the rationale behind not converting to UTF-8 on POSIX platforms. I naively thought I got UTF-8 in argv because my system is configured in UTF-8, but I discover that this is not necessarily always the case. In the example you highlight, I do not see the difference from the Windows case. You could convert to UTF-8 in argv and back to the local encoding in nowide::remove. I understand it is not efficient if you do not really use the content of the filename, but if you have to write, say, an XML report in UTF-8, you would have to convert anyway.
Today, what is the portable way to convert argv to UTF-8, i.e. without having to #ifdef _WIN32...?
Frédéric
Hello Frédéric,

There are several reasons for this. One of them is actually the original purpose of the library: use the same type of strings internally without creating broken software on Windows; and since only Windows uses a native wide API instead of the narrow API that is native for C++, only the Windows case requires encoding conversion.

However, there is another, deeper issue. Unlike on Windows, where the native wide API has a well-defined UTF-16 encoding, this isn't the case for Unix-like OSes. The encoding is defined by the current locale, which can be defined globally, per user, per process, and can even change trivially in the same process at runtime. There are also several sources of the locale/encoding information:

- Environment variables: LANG/LC_CTYPE - which is UTF-8 on the vast majority of modern Unix-like platforms, but can frequently be undefined or defined as the "C" locale without encoding information. This one is what the OS defines for the process.
- C locale: the setlocale API - which is by default the "C" locale per the standard, unless explicitly defined otherwise
- C++ locale: the std::locale::global() API - which is by default the "C" locale per the standard, unless explicitly defined otherwise

They can all be changed at runtime, they aren't synchronized, and they can be modified to whatever encoding the user wants. Additionally, setting std::locale::global to something other than the "C" locale can lead to some really nasty things, like failing to create correct CSV files due to "," being added to numbers.

So the safest and most correct way to handle it is to pass narrow strings as-is, without any conversion.

Regards, Artyom Beilis
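A small standard-C++ program makes the point that these sources are independent and unsynchronized:

    #include <clocale>
    #include <cstdlib>
    #include <locale>
    #include <iostream>

    int main()
    {
        const char *lang = std::getenv("LANG");
        std::cout << "LANG       = " << (lang ? lang : "(unset)") << "\n";
        std::cout << "C locale   = " << std::setlocale(LC_ALL, 0) << "\n";
        std::cout << "C++ global = " << std::locale().name() << "\n";
        // Unless the program itself calls setlocale()/std::locale::global(),
        // the last two print "C" no matter what LANG says.
        return 0;
    }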
The encoding is defined by the current locale, which can be defined globally, per user, per process, and can even change trivially in the same process at runtime. They can all be changed at runtime, they aren't synchronized, and they can be modified to whatever encoding the user wants.
So the safest and most correct way to handle it is to pass narrow strings as-is, without any conversion.
I understand that this complexity can be a nightmare. How would you therefore do this simple task:
1. get a directory from nowide::getenv -> base_dir (UTF-8 on Windows, unknown narrow encoding on POSIX)
2. create a file in base_dir whose name is file_name, encoded in UTF-8 (because it is created by the application).

If I understand well, I should NOT do this:

    auto f = nowide::ofstream((boost::filesystem::path(base_dir) / file_name).string());

because this is guaranteed to work only on Windows, where I have the guarantee that base_dir is UTF-8, right? On POSIX, there is a risk that base_dir is for example ISO-8859-1 while file_name is UTF-8, so that the combination made by filesystem is wrong. Am I right?

Frédéric
Artyom Beilis wrote:
Implying that there are Windows file names (i.e. names that contain invalid UTF-16) that can't be handled by the library?
Yes indeed. If you use something like FindFirstFileW and it returns invalid UTF-16, you will get an error trying to convert it to UTF-8.
That's what I meant, already existing files that can't be manipulated by the library. Incidentally, the mention of FindFirstFileW leads me to my next question: by looking at my own collection of functions that basically mirror what Nowide does (everyone has his own Nowide, I suppose, as this is the only sane way to program), I see 'stat' and a 'readdir' equivalent. Have you considered providing these? One could go the Filesystem route, of course, but sometimes one may wish to minimize dependencies.
By definition: you can't handle file names that can't be represented in UTF-8 as no valid UTF-8 representation exists.
Well yes, in principle, but the file already exists. Although one can well argue that this is no different from putting ':' in the name or something like that.
Artyom Beilis wrote
Not a review, just a question (actually two): how does Nowide deal with modified UTF-8?
No, modified UTF-8 is not supported. Since it isn't UTF-8, it will be considered an invalid encoding.
I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I think it could make sense to support it anyway though. Round tripping (strictly invalid, but possible) file names on Windows, easier interoperability with stuff like JNI, ... OTOH it would add overhead for systems with native UTF-8 APIs, because Nowide would at least have to check every string for "modified UTF-8 encoded" surrogate pairs and convert the string if necessary. Which of course is a good argument for not supporting modified UTF-8, because then Nowide could just pass the strings through unmodified on those systems.
On Mon, Jun 12, 2017 at 12:20 PM, Groke, Paul via Boost < boost@lists.boost.org> wrote:
I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I think it could make sense to support it anyway though. Round tripping (strictly invalid, but possible) file names on Windows, easier interoperability with stuff like JNI, ...
Don't you mean WTF-8 then? AFAIK "Modified UTF-8" is UTF-8 that encodes the null character with an overlong sequence, and thus is incompatible with standard UTF-8, unlike WTF-8 which is a compatible extension.
OTOH it would add overhead for systems with native UTF-8 APIs, because Nowide would at least have to check every string for "modified UTF-8 encoded" surrogate pairs and convert the string if necessary. Which of course is a good argument for not supporting modified UTF-8, because then Nowide could just pass the strings through unmodified on those systems.
Implementing WTF-8 removes a check in UTF-8 → UTF-16 conversion, and doesn't change anything in the reverse direction when there is a valid UTF-16. I suspect it isn't slower. -- Yakov Galka http://stannum.co.il/
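To illustrate the claim, here is the validation step of a typical three-byte decoder (a generic sketch, not Nowide's actual implementation):

    // cp has been assembled from a sequence 1110xxxx 10xxxxxx 10xxxxxx.
    bool accept_three_byte(unsigned cp, bool wtf8)
    {
        if (cp < 0x800)
            return false; // overlong: rejected by UTF-8 and WTF-8 alike
        if (!wtf8 && cp >= 0xD800 && cp <= 0xDFFF)
            return false; // the surrogate check is the one branch WTF-8 drops
        return true;
    }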
Yakov Galka wrote:
On Mon, Jun 12, 2017 at 12:20 PM, Groke, Paul via Boost mailto:boost@lists.boost.org wrote:
I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I think it could make sense to support it anyway though. Round tripping (strictly invalid, but possible) file names on Windows, easier interoperability with stuff like JNI, ...
Don't you mean WTF-8 then? AFAIK "Modified UTF-8" is UTF-8 that encodes the null character with an overlong sequence, and thus is incompatible with standard UTF-8, unlike WTF-8 which is a compatible extension.
No, I mean modified UTF-8. Modified UTF-8 is UTF-8 plus the following extensions:
- Allow encoding UTF-16 surrogates as if they were code points (= what "WTF-8" does)
- Allow an over-long 2-byte encoding of the NUL character

Both are not strictly UTF-8 compatible, but both don't introduce significant overhead in most situations. I don't see how over-long NUL encodings are "more incompatible" than UTF-8 encoded surrogates, but then again that's not really important.
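The NUL extension, for concreteness (this is what e.g. JNI's modified UTF-8 does; plain byte arithmetic, not library behavior):

    // U+0000 stored as the overlong two-byte form 110|00000 10|000000:
    unsigned char modified_utf8_nul[2] = { 0xC0, 0x80 };
    // The encoded string then never contains a 0x00 byte, so it can pass
    // through NUL-terminated C string routines while still representing an
    // embedded NUL after decoding.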
OTOH it would add overhead for systems with native UTF-8 APIs, because Nowide would at least have to check every string for "modified UTF-8 encoded" surrogate pairs and convert the string if necessary. Which of course is a good argument for not supporting modified UTF-8, because then Nowide could just pass the strings through unmodified on those systems.
Implementing WTF-8 removes a check in UTF-8 -> UTF-16 conversion, and doesn't change anything in the reverse direction when there is a valid UTF-16. I suspect it isn't slower.
Supporting modified UTF-8 or WTF-8 adds overhead on systems where the native OS API accepts UTF-8, but only strictly valid UTF-8. When some UTF-8 enabled function of the library is called on such a system, it would have to check for WTF-8 encoded surrogates and convert them to "true" UTF-8 before passing the string to the OS API. Because you would expect and want the "normal" UTF-8 encoding for a string to refer to the same file as the WTF-8 encoding of the same string. Regards, Paul Groke
On Mon, Jun 12, 2017 at 12:55 PM, Groke, Paul via Boost < boost@lists.boost.org> wrote:
Supporting modified UTF-8 or WTF-8 adds overhead on systems where the native OS API accepts UTF-8, but only strictly valid UTF-8. When some UTF-8 enabled function of the library is called on such a system, it would have to check for WTF-8 encoded surrogates and convert them to "true" UTF-8 before passing the string to the OS API. Because you would expect and want the "normal" UTF-8 encoding for a string to refer to the same file as the WTF-8 encoding of the same string.
That's the point: if the string is at all representable in UTF-8, then its WTF-8 representation is already a valid UTF-8 representation and no conversion has to be done. Thus you don't even have to check anything at all. This is how WTF-8 is 'more compatible' with UTF-8 than Modified UTF-8 is. By analogy, you don't need to do special checks if you want to pass a UTF-8 string to an ASCII-only API, because UTF-8 is a strict superset. -- Yakov Galka http://stannum.co.il/
Documentation: http://cppcms.com/files/nowide/html
Question: what do the functions do (on Windows) when the passed UTF-8 is invalid? From a brief look at the documentation, I don't see this specified, but I may have missed it.
On Mon, Jun 12, 2017 at 3:57 PM, Peter Dimov via Boost
Documentation: http://cppcms.com/files/nowide/html
Question: what do the functions do (on Windows) when the passed UTF-8 is invalid? From a brief look at the documentation, I don't see this specified, but I may have missed it.
It is documented in the reference documentation: http://cppcms.com/files/nowide/html/namespaceboost_1_1nowide.html#a6644397c5...

fopen/ifstream etc. would fail in the usual way these functions fail - by failing to open the file. widen/narrow will return null or throw boost::locale::conv::conversion_error, depending on the function.

Artyom
Artyom Beilis wrote:
Question: what do the functions do (on Windows) when the passed UTF-8 is invalid? From a brief look at the documentation, I don't see this specified, but I may have missed it.
It is documented in the reference documentation.
http://cppcms.com/files/nowide/html/namespaceboost_1_1nowide.html#a6644397c5...
Yes, I see now that it's documented on some, such as for instance fopen:

    If invalid UTF-8 given, NULL is returned and errno is set to EINVAL

but not all, for instance getenv:

    UTF-8 aware getenv. Returns 0 if the variable is not set.

or setenv:

    UTF-8 aware setenv, key - the variable name, value is a new UTF-8 value,.

and it's not clear what happens when one feeds invalid UTF-8 to cout/cerr.

Another minor point is that the library assumes Cygwin is POSIX and tries `using ::setenv;`, which is a compile error. For the purposes of Nowide, Cygwin should be considered Windows.
On Mon, Jun 12, 2017 at 6:03 PM, Peter Dimov via Boost
Artyom Beilis wrote:
Question: what do the functions do (on Windows) when the passed UTF-8 is
invalid? From a brief look at the documentation, I don't see this specified, but I may have missed it.
It is documented in the reference documentation.
http://cppcms.com/files/nowide/html/namespaceboost_1_1nowide.html#a6644397c5...
Yes, I see now that it's documented on some, such as for instance fopen:
If invalid UTF-8 given, NULL is returned and errno is set to EINVAL
but not all, for instance getenv:
UTF-8 aware getenv. Returns 0 if the variable is not set.
or setenv:
UTF-8 aware setenv, key - the variable name, value is a new UTF-8 value,.
I try to keep the interface similar to the standard one. Having getenv change errno isn't something one would expect.
and it's not clear what happens when one feeds invalid UTF-8 to cout/cerr.
Yes, you are right, it should be clearer. In case of invalid UTF-8 the fail bit will be set.
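A sketch of that behavior (an illustration based on the statement above, not an excerpt from the library's tests): once the fail bit is set, later output is swallowed until the stream state is cleared.

#include <boost/nowide/iostream.hpp>

int main()
{
    boost::nowide::cout << "bad: \xFF\n"; // invalid UTF-8 sets failbit (on Windows)
    boost::nowide::cout << "lost\n";      // silently ignored while the stream is failed
    if (boost::nowide::cout.fail())
        boost::nowide::cout.clear();      // recover the stream state
    boost::nowide::cout << "visible again\n";
}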
Another minor point is that the library assumes Cygwin is POSIX and tries `using ::setenv;`, which is a compile error. For the purposes of Nowide, Cygwin should be considered Windows.
Not exactly. Since Cygwin 1.7 it uses UTF-8 by default, so all operations like fopen cooperate properly with the Windows API. So it is correct to consider Cygwin as POSIX and not Windows. Regarding setenv/getenv specifically, and how and whether they handle narrow/wide conversions, I indeed need to check. Thanks, Artyom
Another minor point is that the library assumes Cygwin is POSIX and tries `using ::setenv;`, which is a compile error.
I fixed this by changing cenv.hpp as follows:
#if defined(BOOST_WINDOWS) || defined(__CYGWIN__)
#include
On Tue, Jun 13, 2017 at 8:34 PM, Peter Dimov via Boost
Another minor point is that the library assumes Cygwin is POSIX and tries `using ::setenv;`, which is a compile error.
I fixed this by changing cenv.hpp as follows:
#if defined(BOOST_WINDOWS) || defined(__CYGWIN__)
#include
#endif

namespace boost { namespace nowide {

#if !defined(BOOST_WINDOWS) && !defined(__CYGWIN__) && !defined(BOOST_NOWIDE_DOXYGEN)
and now everything compiles under Cygwin. I get the following two test failures:
testing.capture-output ..\..\..\bin.v2\libs\nowide\test\test_system_n.test\gcc-cxx03-6.3.0\debug\test_system_n.run
====== BEGIN OUTPUT ======
Failed Error boost::nowide::getenv("BOOST_NOWIDE_TEST") in test_system.cpp:25 main
Failed Error boost::nowide::system(command.c_str()) == 0 in test_system.cpp:75 main
EXIT STATUS: 1
====== END OUTPUT ======
Ok, I'll test Cygwin (haven't worked with it for a while) and make sure it works on this platform. Also I need to check how native Cygwin getenv/setenv behaves with regard to Unicode. Artyom
- What is your evaluation of the design?
The general principles on which the library is built are solid; the design is good. As already noted, I would prefer `stat` and `opendir`/`readdir` as additions, while we're at it. Yes, we could use Filesystem. On the other hand, we _could_ use Filesystem for remove, rename, and fstream, too.
- What is your evaluation of the implementation?
I didn't look at the implementation too deeply, but I did take a look at, and run, the test suite.
There are a number of copy/paste references to Boost.System here (test/Jamfile) and Boost.Locale there (index.html), which should ideally be corrected.
Also, the test Jamfile is missing
import testing ;
which I do not hold against the author in any way, as the Boost.System Jamfile also had it missing (it works without it unless invoked from the Boost root).
The doc/ directory is missing a Jamfile, and there's also no meta/libraries.json. There are no .travis.yml and appveyor.yml files in the root, which is odd in this day and age. :-)
The global BOOST_USE_WINDOWS_H macro should probably be respected.
- What is your evaluation of the documentation?
Does the job, although not all functions had their behavior documented in the case where the passed UTF-8 sequence is invalid.
- What is your evaluation of the potential usefulness of the library?
The library, or something isomorphic to it, is indispensable for any serious Windows work.
- Did you try to use the library? With what compiler? Did you have any problems?
As mentioned, I ran the test suite on various versions of msvc (all passed), MinGW gcc (warnings in c++03 mode, errors in c++11 and above), and Cygwin gcc/clang (errors because of missing ::getenv and friends). The problem MinGW gcc 7 (and Cygwin clang, once I enabled the Windows path in cenv.hpp) identified was
In file included from test\test_system.cpp:11:0:
..\../boost/nowide/cenv.hpp: In function 'int boost::nowide::putenv(char*)':
..\../boost/nowide/cenv.hpp:104:41: warning: comparison between pointer and zero character constant [-Wpointer-compare]
while(*key_end!='=' && key_end!='\0')
^~~~
which indicates that (1) the library is missing a test for putenv with the string not containing '=', and (2) the author should consider testing on Cygwin/MinGW, either locally or on Appveyor. These two may not be important production platforms, but they enable testing the Windows-specific code paths with g++ and clang++, which, as the above shows, may uncover potential problems that testing on MSVC alone would miss.
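For the record, the warning points at a missing dereference in the second comparison: the pointer key_end itself, rather than the character it points to, is compared against '\0', which makes the test a disguised null-pointer check. Presumably the intended loop is:

// cenv.hpp:104, presumed intent: stop at '=' or at the terminating NUL
while(*key_end != '=' && *key_end != '\0')
    ++key_end;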
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Quick reading.
- Are you knowledgeable about the problem domain?
More or less.
- Do you think the library should be accepted as a Boost library?
Yes. Nowide should be accepted.
There are a number of copy/paste references to Boost.System here (test/Jamfile) and Boost.Locale there (index.html), which should ideally be corrected.
Ok I'll fix it
Also, the test Jamfile is missing import testing ; which I do not hold against the author in any way as the Boost.System Jamfile also had it missing (it works without it unless invoked from the Boost root.)
What does it do and why do I need one?
The doc/ directory is missing a Jamfile, and there's also no meta/libraries.json.
Ok. What I remember from Boost.Locale is that doxygen wasn't run during the doc build; rather, the HTML was pushed to the directory. If today there is support for automatically building docs with vanilla Doxygen (outside the Quickbook scope) I'll add one.
There are no .travis.yml and appveyor.yml files in the root, which is odd in this day and age. :-)
What am I missing?
The global BOOST_USE_WINDOWS_H macro should probably be respected.
Ok good point I'll fix it.
- What is your evaluation of the documentation?
Does the job, although not all functions had their behavior documented in the case where the passed UTF-8 sequence is invalid.
Good point.
- What is your evaluation of the potential usefulness of the library?
The library, or something isomorphic to it, is indispensable for any serious Windows work.
- Did you try to use the library? With what compiler? Did you have any problems?
As mentioned, I ran the test suite on various versions of msvc (all passed), MinGW gcc (warnings in c++03 mode, errors in c++11 and above), and Cygwin gcc/clang (errors because of missing ::getenv and friends).
The problem MinGW gcc 7 (and Cygwin clang, once I enabled the Windows path in cenv.hpp) identified was
In file included from test\test_system.cpp:11:0:
..\../boost/nowide/cenv.hpp: In function 'int boost::nowide::putenv(char*)':
..\../boost/nowide/cenv.hpp:104:41: warning: comparison between pointer and zero character constant [-Wpointer-compare]
while(*key_end!='=' && key_end!='\0')
^~~~
Oops, that is serious, I'll fix it. https://github.com/artyom-beilis/nowide/issues/14
which indicates that (1) the library is missing a test for putenv with the string not containing '=', and (2) the author should consider testing on Cygwin/MinGW, either locally or on Appveyor. These two may not be important production platforms, but they enable testing the Windows-specific code paths with g++ and clang++, which, as the above shows, may uncover potential problems that testing on MSVC alone would miss.
In general I tested using MinGW and MSVC; I also never tried clang on Windows - it is probably time to.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Quick reading.
- Are you knowledgeable about the problem domain?
More or less.
- Do you think the library should be accepted as a Boost library?
Yes. Nowide should be accepted.
Thank You for your effort and valuable inputs. Artyom Beilis
Artyom Beilis wrote:
Also, the test Jamfile is missing import testing ;
...
What does it do and why do I need one?
It contains the definitions of the testing rules such as 'run'. Not having it is like a missing #include in C++. It only works because someone happened to #include it first.
There are no .travis.yml and appveyor.yml files in the root, which is odd in this day and age. :-)
What am I missing?
If you're not familiar with the Travis and Appveyor CIs, you should check them out, because they are very valuable. What they do is integrate with Github and perform an automatic test on every repo push, which, if set up correctly, includes pull requests. Github then displays the test results on the PR page. This is very useful if you want to decide whether to merge a PR. You can look at the Boost.System .travis.yml and appveyor.yml files as examples.
This simple program compiles nicely on Linux+gcc 6.3.0 but fails to
compile on x86_64-w64-mingw (6.3.0, std=c++14):
#include "nowide/fstream.hpp"
int main() {
auto f = nowide::fstream("toto.log");
return 0;
}
instead of
auto f=...,
nowide::fstream f("toto.log")
works fine.
The error is:
toto.cpp:4:38: error: use of deleted function 'nowide::basic_fstream<char>::basic_fstream(const nowide::basic_fstream<char>&)'
auto f = nowide::fstream("toto.log");
^
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: note: 'nowide::basic_fstream<char>::basic_fstream(const nowide::basic_fstream<char>&)' is implicitly deleted because the default definition would be ill-formed:
class basic_fstream : public std::basic_iostream
On Thu, Jun 15, 2017 at 3:32 PM, Frédéric Bron via Boost
This simple program compiles nicely on Linux+gcc 6.3.0 but fails to compile on x86_64-w64-mingw (6.3.0, std=c++14):
#include "nowide/fstream.hpp" int main() { auto f = nowide::fstream("toto.log"); return 0; }
instead of auto f=..., nowide::fstream f("toto.log") works fine.
The error is: toto.cpp:4:38: error: use of deleted function 'nowide::basic_fstream<char>::basic_fstream(const nowide::basic_fstream<char>&)' auto f = nowide::fstream("toto.log"); ^
Ok, I see. Nowide misses some of the C++11 interfaces - for example the move constructor - but it shouldn't be hard to implement them. Also, I wasn't aware that auto f = nowide::fstream("toto.log"); can actually call the move constructor. Artyom
On Linux nowide fstream is just an alias for std::fstream
Artyom
On 15 June 2017 at 9:27 PM, "Frédéric Bron"
Also, I wasn't aware that auto f = nowide::fstream("toto.log"); can actually call the move constructor.
I am also surprised, I thought it would be equivalent. And I use the same gcc version (6.3.0) and it causes no issue on Linux, only with the MinGW cross compiler... Frédéric
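For reference, the difference between the two forms comes down to C++11/14 copy-initialization rules rather than anything MinGW-specific: auto f = ... initializes from a temporary, and even though the move is elided, an accessible move (or copy) constructor must be declared. On Linux nowide::fstream is an alias for std::fstream, which GCC 6.3's libstdc++ can move; on Windows it is Nowide's own class, which declares neither. A sketch:

#include "nowide/fstream.hpp" // standalone layout, as in the example above

int main()
{
    nowide::fstream f1("toto.log"); // direct-initialization: no move constructor needed

    // Copy-initialization from a temporary requires a declared move (or copy)
    // constructor in C++11/14 even though the move is elided; C++17's
    // guaranteed elision removes the requirement.
    //auto f2 = nowide::fstream("toto.log"); // fails where no move ctor exists
    return 0;
}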
On 6/12/2017 12:20 AM, Frédéric Bron via Boost wrote:
Hi Everyone,
The formal review of Artyom Beilis' Nowide library starts today and will last until Wed. 21st of June.
[snip]
Nowide has standard boost layout. So, just put it into the boost source tree under libs/nowide directory for building and running tests. Or alternatively - for header only part of the library add the nowide/include path to the include path of your projects.
How do I build the documentation for the nowide library ?
[snip]
On Sat, Jun 17, 2017 at 12:06 AM, Edward Diener via Boost
Nowide has standard boost layout. So, just put it into the boost source tree under libs/nowide directory for building and running tests. Or alternatively - for header only part of the library add the nowide/include path to the include path of your projects.
How do I build the documentation for the nowide library ?
Just run doxygen (hadn't integrated it into BB yet). Also note that the full docs can be found here: http://cppcms.com/files/nowide/html/ Artyom
On Sun, Jun 11, 2017 at 11:20 PM, Frédéric Bron via Boost <boost@lists.boost.org> wrote:
Hi Everyone,
The formal review of Artyom Beilis' Nowide library starts today and will last until Wed. 21st of June. [snip] Please post your comments and review to the boost mailing list (preferably), or privately to the Review Manager (to me ;-). Here are some questions you might want to answer in your review:
- What is your evaluation of the design?
1) I'd really much rather have an iterator-based interface for the narrow/wide conversions. There's an existing set of iterators in Boost.Regex already, and I've recently written one here: https://github.com/tzlaine/text/blob/master/boost/text/utf8.hpp The reliance on a new custom string type is a substantial mistake, IMO (boost::nowide::basic_stackstring). Providing an iterator interface (possibly cribbing one of the two implementations above) would negate the need for this new string type -- I could use the existing std::string, MyString, QString, a char buffer, or whatever. Also, I'd greatly prefer that the new interfaces be defined in terms of string_view instead of string/basic_stackstring (there's also a string_view implementation already in Boost.Utility). string_view is simply far more usable, since it binds effortlessly to either a char const * or a string. 2) I don't really understand what happens when a user passes a valid Windows filename that is *invalid* UTF-16 to a program using Nowide. Is the invalid UTF-16 filename just broken in the process of trying to convert it to UTF-8? This is partially a documentation problem, but until I understand how this is intended to work, I'm also counting it as a design issue.
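A hypothetical shape for the interface suggested in 1) - a sketch, not Nowide's actual API. It assumes well-formed UTF-8 for brevity; a real implementation would validate and apply an error policy:

#include <cstdint>
#include <iterator>
#include <string>

// Output-iterator based conversion: binds to any destination container.
template <class InputIt, class OutputIt>
OutputIt utf8_to_utf16(InputIt first, InputIt last, OutputIt out)
{
    while (first != last) {
        std::uint32_t cp = static_cast<unsigned char>(*first++);
        int extra = cp < 0x80 ? 0 : cp < 0xE0 ? 1 : cp < 0xF0 ? 2 : 3;
        if (extra) cp &= 0x3F >> extra; // keep the lead byte's 5, 4 or 3 payload bits
        while (extra--)
            cp = (cp << 6) | (static_cast<unsigned char>(*first++) & 0x3F);
        if (cp >= 0x10000) { // outside the BMP: encode as a surrogate pair
            cp -= 0x10000;
            *out++ = static_cast<char16_t>(0xD800 | (cp >> 10));
            *out++ = static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        } else {
            *out++ = static_cast<char16_t>(cp);
        }
    }
    return out;
}

// Usage with whatever string type the caller already has:
// std::u16string dst;
// utf8_to_utf16(src.begin(), src.end(), std::back_inserter(dst));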
- What is your evaluation of the implementation?
I did not look.
- What is your evaluation of the documentation?
I think the documentation needs a bit of work. The non-reference portion
is quite thin, and drilling down into the reference did not answer at least
one question I had (the one above, about invalid UTF-16):
Looking at some example code in the "Using the Library" section, I saw this:
"
To make this program handle Unicode properly, we do the following changes:
#include
- What is your evaluation of the potential usefulness of the library?
I think this library is attempting to address a real and important issue. I just can't figure out if it's a complete solution, because how invalid UTF-16 is treated remains a question.
- Did you try to use the library? With what compiler? Did you have any
problems?
I did not.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
A quick reading, plus a bit of discussion on the list.
- Are you knowledgeable about the problem domain?
I understand the UTF-8 issues reasonably well, but am ignorant of the Windows-specific issues.
And most importantly: - Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
I do not think the library should be accepted in its current form. It seems not to handle malformed UTF-16, which is a requirement for processing Windows file names (as I understand it -- please correct this if I'm wrong). Independent of this, I don't find the docs to be sufficient. Zach
Question to all:
Why should we try to handle wrong UTF-16 (or wrong UTF-8)?
1. such files should not exist
2. if they exist, why?
- if it is because it is an old file, can the user just rename it properly?
- if it is because it was produced by a program, why should this
program continue to work without being fixed? Isn't that the best way
to get wrong filenames forever?
I do not understand why we cannot just issue an error.
Thanks for explanations,
Frédéric
2017-06-19 19:19 GMT+02:00 Zach Laine via Boost
On Sun, Jun 11, 2017 at 11:20 PM, Frédéric Bron via Boost <boost@lists.boost.org> wrote:
Hi Everyone,
The formal review of Artyom Beilis' Nowide library starts today and will last until Wed. 21st of June. [snip] Please post your comments and review to the boost mailing list (preferably), or privately to the Review Manager (to me ;-). Here are some questions you might want to answer in your review:
- What is your evaluation of the design?
1) I'd really much rather have an iterator-based interface for the narrow/wide conversions. There's an existing set of iterators in Boost.Regex already, and I've recently written one here:
https://github.com/tzlaine/text/blob/master/boost/text/utf8.hpp
The reliance on a new custom string type is a substantial mistake, IMO (boost::nowide::basic_stackstring). Providing an iterator interface (possibly cribbing one of the two implementations above) would negate the need for this new string type -- I could use the existing std::string, MyString, QString, a char buffer, or whatever. Also, I'd greatly prefer that the new interfaces be defined in terms of string_view instead of string/basic_stackstring (there's also a string_view implementation already in Boost.Utility). string_view is simply far more usable, since it binds effortlessly to either a char const * or a string.
2) I don't really understand what happens when a user passes a valid Windows filename that is *invalid* UTF-16 to a program using Nowide. Is the invalid UTF-16 filename just broken in the process of trying to convert it to UTF-8? This is partially a documentation problem, but until I understand how this is intended to work, I'm also counting it as a design issue.
- What is your evaluation of the implementation?
I did not look.
- What is your evaluation of the documentation?
I think the documentation needs a bit of work. The non-reference portion is quite thin, and drilling down into the reference did not answer at least one question I had (the one above, about invalid UTF-16):
Looking at some example code in the "Using the Library" section, I saw this:
" To make this program handle Unicode properly, we do the following changes:
#include
#include #include int main(int argc,char **argv) { boost::nowide::args a(argc,argv); // Fix arguments - make them UTF-8 " Ok, so I clicked "boost::nowide::args", hoping for an answer. The detailed description for args says:
" args is a class that fixes standard main() function arguments and changes them to UTF-8 under Microsoft Windows.
The class uses GetCommandLineW(), CommandLineToArgvW() and GetEnvironmentStringsW() in order to obtain the information. It does not relates to actual values of argc,argv and env under Windows.
It restores the original values in its destructor "
It tells me nothing about what happens when invalid UTF-16 is encountered. Is there an exception? Is 0xfffd inserted? If the latter, am I just stuck? I should not have to read any source code to figure this out, but it looks like I have to.
This criticism can be applied to most of the documentation. My preference is that the semantics of primary functionality of the library should be explained in tutorials or other non-reference formats. The current state of the docs doesn't even explain things in the references. This must be fixed before this library can be accepted.
- What is your evaluation of the potential usefulness of the library?
I think this library is attempting to address a real and important issue. I just can't figure out if it's a complete solution, because how invalid UTF-16 is treated remains a question.
- Did you try to use the library? With what compiler? Did you have any
problems?
I did not.
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
A quick reading, plus a bit of discussion on the list.
- Are you knowledgeable about the problem domain?
I understand the UTF-8 issues reasonably well, but am ignorant of the Windows-specific issues.
And most importantly: - Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
I do not think the library should be accepted in its current form. It seems not to handle malformed UTF-16, which is a requirement for processing Windows file names (as I understand it -- please correct this if I'm wrong). Independent of this, I don't find the docs to be sufficient.
Zach
-- Frédéric Bron ----------------------------------------------------------- Frédéric Bron (frederic.bron@m4x.org) Villa des 4 chemins, Centre Hospitalier, BP 208 38506 Voiron Cedex tél. fixe : +33 4 76 67 17 27, tél. port.: +33 6 67 02 77 35
On Mon, Jun 19, 2017 at 12:37 PM, Frédéric Bron via Boost <boost@lists.boost.org> wrote:
Question to all:
Why should we try to handle wrong UTF-16 (or wrong UTF-8)? 1. such files should not exist 2. if they exist, why? - if it is because it is an old file, can the user just rename it properly? - if it is because it was produced by a program, why should this program continue to work without being fixed? Isn't that the best way to get wrong filenames forever?
This came up at C++Now last month. My understanding, from talking to people who seem to know about this, is that such filenames are considered valid by Windows. To issue an error in such a case means not allowing users to access files that Windows sees as well-formed. Zach
This came up at C++Now last month. My understanding, from talking to people who seem to know about this, is that such filenames are considered valid by Windows. To issue an error in such a case means not allowing users to access files that Windows sees as well-formed.
but who/what needs ill-formed filenames? For what reason? Frédéric
Frédéric Bron wrote:
This came up at C++Now last month. My understanding, from talking to people who seem to know about this, is that such filenames are considered valid by Windows. To issue an error in such a case means not allowing users to access files that Windows sees as well-formed.
but who/what needs ill-formed filenames? For what reason?
The question whether to support such filenames is more a matter of principle. Namely, whether it is the job of the Nowide library to tell you what filenames you may use or not. One might argue that it should be of no concern to it whether, or for what reason, I need such filenames.
On 19 June 2017 at 22:52, Peter Dimov via Boost
The question whether to support such filenames is more a matter of principle. Namely, whether it is the job of the Nowide library to tell you what filenames you may use or not. One might argue that it should be of no concern to it whether, or for what reason, I need such filenames.
I think the ruling principle should be the Unicode standard. Nowide should support the Unicode standard and no more (at least in its intentions). One of the intentions of the standard, as I read it, is to guarantee that a conversion can *un-ambiguously (and safely) round-trip*. On Windows, if WTF-8 (what's in a name? Apart from the obvious, even Wobbly sounds bizarre), CESU-8 or Modified UTF-8 are ways to achieve that, I think they should be supported. If not, an exception should be thrown when encountering these invalid encodings, as this is in my view an IO issue (the uncertain environment in which an application has to live), a context in which throwing seems to be the norm. I'm with Frédéric Bron on this one, though. I don't understand why invalid encodings are found in the wild in the first place and why they should continue to exist in the wild. The whole thing sounds like a vector for exploits: malicious code generates invalid encodings, after which buffer overruns open up the system. Something I cannot understand is that some of those on this list who are most critical of Windows in view of security concerns are also the same people who happily perpetuate these weaknesses. Microsoft is very much in front here, by dropping support for OSes (and therefore their associated compilers and CRTs); Boost should do the same and adopt rigour: "Ceterum autem censeo Carthaginem esse delendam". What's not terribly clear either is whether we are talking about NTFS or Explorer (and then there is WinFS lurking in the shadows, bound to be released someday). Windows does (many) other things that are weird, like the way it deals with capitalized names, long paths etc.; NTFS itself does not actually have these issues and does quite a bit more than what the Explorer shell can do. degski -- "*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend, Schmerzen aus Schwäche stillend.*" - Novalis 1798
On Mon, Jun 19, 2017 at 8:19 PM, Zach Laine via Boost
On Sun, Jun 11, 2017 at 11:20 PM, Frédéric Bron via Boost < boost@lists.boost.org> wrote:
- What is your evaluation of the design?
1) I'd really much rather have an iterator-based interface for the narrow/wide conversions. There's an existing set of iterators in Boost.Regex already, and I've recently written one here:
https://github.com/tzlaine/text/blob/master/boost/text/utf8.hpp
There is one in Boost.Locale as well - Boost.Nowide uses the UTF conversions from the header-only part of Boost.Locale.
The reliance on a new custom string type is a substantial mistake, IMO (boost::nowide::basic_stackstring). [snip] (possibly cribbing one of the two implementations above) would negate the need for this new string type -- I could use the existing std::string, MyString, QString, a char buffer, or whatever. Also, I'd greatly prefer that the new interfaces be defined in terms of string_view instead of string/basic_stackstring (there's also a string_view implementation already in Boost.Utility). string_view is simply far more usable, since it binds effortlessly to either a char const * or a string.
stackstring is merely an on-stack buffer optimization - it isn't a general string and was never intended to be one. It isn't in a detail namespace because it can actually be useful outside the library's scope.
Providing an iterator interface
There is no iterator interface in communication with the OS, so there is no sense in adding one there. If you want a character-by-character interface, it exists in Boost.Locale.
2) I don't really understand what happens when a user passes a valid Windows filename that is *invalid* UTF-16 to a program using Nowide. Is the invalid UTF-16 filename just broken in the process of trying to convert it to UTF-8? This is partially a documentation problem, but until I understand how this is intended to work, I'm also counting it as a design issue.
You are right, it wasn't clear in the docs. My policy was to return an error in case of invalid encoding, but after the comments I received I'll rather replace invalid sequences with the U+FFFD replacement character, since that is what the WinAPI usually does. https://github.com/artyom-beilis/nowide/issues/16 It is easy to fix.
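A sketch of what the planned policy would mean in code (an illustration assuming the std::wstring overload of narrow(); at review time invalid input still reports an error as described earlier):

#include <boost/nowide/convert.hpp>
#include <string>

int main()
{
    std::wstring bad = L"a";
    bad += wchar_t(0xD800); // lone surrogate: invalid UTF-16
    bad += L'z';

    // Under the planned policy the invalid unit becomes U+FFFD,
    // i.e. the bytes EF BF BD in UTF-8, instead of failing:
    std::string s = boost::nowide::narrow(bad);
    // expected: "a\xEF\xBF\xBDz"
}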
- What is your evaluation of the documentation?
I think the documentation needs a bit of work. The non-reference portion is quite thin, and drilling down into the reference did not answer at least one question I had (the one above, about invalid UTF-16):
Actually, the documentation points out that in case of invalid character encoding an error is returned, and how (in most places - but there are some misses). However, I agree that it can be improved, and a special section regarding the bad-encoding conversion policy should and will be added.
It tells me nothing about what happens when invalid UTF-16 is encountered. Is there an exception? Is 0xfffd inserted? If the latter, am I just stuck? I should not have to read any source code to figure this out, but it looks like I have to.
Valid point.
And most importantly: - Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
I do not think the library should be accepted in its current form. It seems not to handle malformed UTF-16, which is a requirement for processing Windows file names (as I understand it -- please correct this if I'm wrong).
I strongly disagree. Converting an invalid ANSI encoding shouldn't lead to invalid UTF-16 either, or the other way around. Since there is no way to represent invalid UTF-16 using valid UTF-8 - just as there is no way to do it in ANSI<->UTF-16 - there is no unambiguous way to provide such support.
Zach
Thank You for your review. Artyom
This is my review of the proposed Boost.Nowide library.
On 06/12/2017 06:20 AM, Frédéric Bron via Boost wrote:
- What is your evaluation of the design?
I had no troubles finding my way. Most APIs are already well known, as they resemble APIs from the standard library. I understand (after reading the discussions on this list) why Nowide does not convert all input to a common encoding on all platforms but only deals with wide strings on Windows. It was a bit of a surprise at first, so this design decision should be documented more prominently, as others have already pointed out. In general I like the idea that only valid Unicode is accepted as input, although I had an issue where invalid encoding written to nowide::cout set the stream's failbit and ignored all subsequent output. The author was already aware of that issue and said invalid characters will be replaced with U+FFFD (the replacement character) after the review. I prefer this solution to stopping all output after any failure.
- What is your evaluation of the implementation?
I did not look at the implementation. It worked as expected/documented when I tested the library.
- What is your evaluation of the documentation?
It is enough to start working with the library. I would like to see more clearly for every function when and how input strings are converted to a different encoding and what happens in case of a conversion error (exception, failbit of the stream is set, invalid character, etc.).
- What is your evaluation of the potential usefulness of the library?
The library is very useful and I intend to use it as a portable fire-and-forget solution to typical Unicode problems on Windows. I think it will be a very helpful library, especially when porting software from POSIX to Windows, since the library enables dealing with Unicode without dealing with (and first learning about) the Windows Unicode idiosyncrasies. As it came up in the discussion, I do not think I have ever encountered non-normalized Unicode file paths on Windows, so I do not think the lack of support for such paths is detrimental to the usefulness of this library.
- Did you try to use the library? With what compiler? Did you have any problems?
I tested with MSVC14 and had no technical problems building or using the library.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
I read the discussions on the list, the documentation and wrote a small toy program to test my expected usage of Nowide. All in all I invested about three hours.
- Are you knowledgeable about the problem domain?
Somewhat. I had to work around typical Unicode issues on Windows multiple times, but always stopped when the issue was solved. So my knowledge is mostly driven by the specific issues I have had so far and is not profound.
- Do you think the library should be accepted as a Boost library?
Yes.
Norbert
On Wed, Jun 21, 2017 at 12:14 AM, Norbert Wenzel via Boost
This is my review of the proposed Boost.Nowide library.
- What is your evaluation of the design?
I had no troubles finding my way. Most APIs are already well known, as they resemble APIs from the standard library. I understand (after reading the discussions on this list) why Nowide does not convert all input to a common encoding on all platforms but only deals with wide strings on Windows. It was a bit of a surprise at first, so this design decision should be documented more prominently, as others have already pointed out.
Yes, indeed, I'll have to provide a section regarding the library policy in the tutorial, as one of the first sections.
- What is your evaluation of the implementation?
I did not look at the implementation. It worked as expected/documented when I tested the library.
- What is your evaluation of the documentation?
It is enough to start working with the library. I would like to see more clearly for every function when and how input strings are converted to a different encoding and what happens in case of a conversion error (exception, failbit of the stream is set, invalid character, etc.).
Indeed - it should be improved.
- Do you think the library should be accepted as a Boost library? Yes.
Norbert
Thank You very much for the review and the comments. Artyom Beilis
participants (11)
- Artyom Beilis
- degski
- Edward Diener
- Frédéric Bron
- Groke, Paul
- Niall Douglas
- Norbert Wenzel
- Peter Dimov
- Vadim Zeitlin
- Yakov Galka
- Zach Laine