[filesystem] becoming worse than the standard?

Vinnie Falco

14 Aug 2022 14 Aug '22

11:34 p.m.

I see this in the commit log for boost.filesystem: Deprecated path construction/assignment/appending from container types. Users are advised to use string types and iterators instead of containers to construct/assign/append to paths. In v4, the support for containers is removed. This can't be right now, can it? Because std::filesystem supports path construction from anything meeting the requirements of Source which is kind of range-like. Boost.URL and its downstream libraries HTTP-Proto, HTTP-IO, Websocket-Proto, and Websocket-IO are depending on the container-based interface to achieve reasonably decent syntax and semantics despite the limitations of the path API (which are the fault of the standard and not Boost). If std::filesystem::path supports container-based parameters to assign() and append() then why is boost::filesystem::path removing it? Thanks

Show replies by date

Andrey Semashev

15 Aug 15 Aug

10:37 a.m.

On 8/15/22 02:34, Vinnie Falco via Boost wrote:

...

I see this in the commit log for boost.filesystem:

Deprecated path construction/assignment/appending from container types. Users are advised to use string types and iterators instead of containers to construct/assign/append to paths. In v4, the support for containers is removed.

This can't be right now, can it? Because std::filesystem supports path construction from anything meeting the requirements of Source which is kind of range-like.

Paths are supposed to interoperate with strings, so that is what the change you refer to[1] is aimed for. Assigning or constructing a path from e.g. std::list<char> doesn't make much sense, and that is being deprecated. That Boost.Filesystem used to accept this is an unfortunate legacy misfeature, which is being removed. Note that the support for string types is *not* being removed. std::filesystem requirements[2] actually restrict Source arguments to std string types, iterators (read: pointers) to C-style strings and arrays of characters (which are interpreted as C-style strings). So Boost.Filesystem is getting closer to the standard here.

...

Boost.URL and its downstream libraries HTTP-Proto, HTTP-IO, Websocket-Proto, and Websocket-IO are depending on the container-based interface to achieve reasonably decent syntax and semantics despite the limitations of the path API (which are the fault of the standard and not Boost).

Could you describe why you need to construct/assign/append to paths from arbitrary containers and why you cannot use string types or iterators? [1]: https://github.com/boostorg/filesystem/commit/d829a46b3120933e3dbaf9b3c85bf1... [2]: http://eel.is/c++draft/fs.path.req#1

Vinnie Falco

2:10 p.m.

On Mon, Aug 15, 2022 at 3:37 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:

...

Could you describe why you need to construct/assign/append to paths from arbitrary containers and why you cannot use string types or iterators?

Yeah, and from your description it sounds like maybe it is just a user error on my part or a bug. The container in question is boost::urls::pct_encoded_view: <https://github.com/CPPAlliance/url/blob/cba301383791f67c31b3ccd52a401849f7cf8940/include/boost/url/pct_encoded_view.hpp#L75> This used to work: pct_encoded_view v; boost::filesystem::path p; p.append( v ); What should happen here is that a path segment will be appended to p formed by applying percent-decoding to the string referenced by 'v' (the percent-decoding happens in the iterator). Thanks

Andrey Semashev

2:57 p.m.

On 8/15/22 17:10, Vinnie Falco wrote:

...

On Mon, Aug 15, 2022 at 3:37 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:

...
Could you describe why you need to construct/assign/append to paths from arbitrary containers and why you cannot use string types or iterators?

Yeah, and from your description it sounds like maybe it is just a user error on my part or a bug. The container in question is boost::urls::pct_encoded_view:

<https://github.com/CPPAlliance/url/blob/cba301383791f67c31b3ccd52a401849f7cf8940/include/boost/url/pct_encoded_view.hpp#L75>

This used to work:

pct_encoded_view v; boost::filesystem::path p;

p.append( v );

What should happen here is that a path segment will be appended to p formed by applying percent-decoding to the string referenced by 'v' (the percent-decoding happens in the iterator).

With std::filesystem::path that is what would happen, though pct_encoded_view::operator std::string(). With boost::filesystem::path, it will construct the path from a string that is obtained from [v.begin(), v.end()) range (which, I presume, will not be percent-encoded). Since the recent change, it will also give you a deprecated warning. The correct code is either use v.to_string() or explicitly use v.begin()/v.end(), depending on what you want. BTW, duplicate "as a" here: https://github.com/CPPAlliance/url/blob/cba301383791f67c31b3ccd52a401849f7cf...

Vinnie Falco

3:32 p.m.

On Mon, Aug 15, 2022 at 7:57 AM Andrey Semashev via Boost < boost@lists.boost.org> wrote:

...

on what you want.

boost::filesystem::path used to meet the requirements of MutableString, but now does not: https://master.url.cpp.al/url/concepts/mutablestring.html -- Regards, Vinnie Follow me on GitHub: https://github.com/vinniefalco

Andrey Semashev

3:42 p.m.

On 8/15/22 18:32, Vinnie Falco wrote:

...

boost::filesystem::path used to meet the requirements of MutableString, but now does not: https://master.url.cpp.al/url/concepts/mutablestring.html <https://master.url.cpp.al/url/concepts/mutablestring.html>

It does not meet the requirements on Windows, and never did - path::value_type is wchar_t there. Also, you don't define the semantics of the operations, in particular, what is the expected result. Regardless of the platform, path::append does not merely appends characters, it appends a path, which may result in appending a directory delimiter or completely replacing the path if the argument is an absolute path. Depending on what exactly you mean by "Appends the characters", this may or may not qualify.

Niall Douglas

3:46 p.m.

On 15/08/2022 16:42, Andrey Semashev via Boost wrote:

...

On 8/15/22 18:32, Vinnie Falco wrote:

...
boost::filesystem::path used to meet the requirements of MutableString, but now does not: https://master.url.cpp.al/url/concepts/mutablestring.html <https://master.url.cpp.al/url/concepts/mutablestring.html>

It does not meet the requirements on Windows, and never did - path::value_type is wchar_t there.

Also, you don't define the semantics of the operations, in particular, what is the expected result. Regardless of the platform, path::append does not merely appends characters, it appends a path, which may result in appending a directory delimiter or completely replacing the path if the argument is an absolute path. Depending on what exactly you mean by "Appends the characters", this may or may not qualify.

Also "foo/bar" == "foo//bar" for filesystem paths. And several other comparisons are equal which wouldn't be if a filesystem path were a basic string. Niall

Andrey Semashev

3:50 p.m.

On 8/15/22 18:42, Andrey Semashev wrote:

...

On 8/15/22 18:32, Vinnie Falco wrote:

...
boost::filesystem::path used to meet the requirements of MutableString, but now does not: https://master.url.cpp.al/url/concepts/mutablestring.html <https://master.url.cpp.al/url/concepts/mutablestring.html>

It does not meet the requirements on Windows, and never did - path::value_type is wchar_t there.

Also, you don't define the semantics of the operations, in particular, what is the expected result. Regardless of the platform, path::append does not merely appends characters, it appends a path, which may result in appending a directory delimiter or completely replacing the path if the argument is an absolute path. Depending on what exactly you mean by "Appends the characters", this may or may not qualify.

I should also note that filesystem::path is not a container either because path::value_type is different from that of path::iterator.

Vinnie Falco

5:17 p.m.

On Mon, Aug 15, 2022 at 8:42 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:

...

you don't define the semantics of the operations, in particular, what is the expected result

Yes. Thank you for noticing this, as it is exactly what I intended. It is the caller's responsibility to determine if the operation makes sense. The library doesn't care what the outcome is, only that the syntactic requirements are met. Which is how I documented it :) Thanks

Peter Dimov

5:20 p.m.

Vinnie Falco wrote:

...

On Mon, Aug 15, 2022 at 8:42 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:

...
you don't define the semantics of the operations, in particular, what is the expected result

Yes. Thank you for noticing this, as it is exactly what I intended. It is the caller's responsibility to determine if the operation makes sense. The library doesn't care what the outcome is, only that the syntactic requirements are met. Which is how I documented it :)

The semantics are pretty clear when value_type is char, which is what is required. It's less clear what should happen when the library calls append or assign with iterators whose value_type is char on something whose value_type is wchar_t, which is what you seem to be doing.

Andrey Semashev

5:46 p.m.

On 8/15/22 20:20, Peter Dimov via Boost wrote:

...

Vinnie Falco wrote:

...
On Mon, Aug 15, 2022 at 8:42 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:

...
you don't define the semantics of the operations, in particular, what is the expected result

Yes. Thank you for noticing this, as it is exactly what I intended. It is the caller's responsibility to determine if the operation makes sense. The library doesn't care what the outcome is, only that the syntactic requirements are met. Which is how I documented it :)

The semantics are pretty clear when value_type is char, which is what is required.

It's less clear what should happen when the library calls append or assign with iterators whose value_type is char on something whose value_type is wchar_t, which is what you seem to be doing.

That part is defined by filesystem::path - character code conversion happens. What isn't clear in the code Vinnie posted earlier is what is supposed to be appended (percent-escaped string or not) and how (path appending or dumb concatenation). That's the problem with the docs, it says "appends", but there are too many ways to "append".

Vinnie Falco

5:52 p.m.

On Mon, Aug 15, 2022 at 10:46 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:

...

What isn't clear in the code Vinnie posted earlier is what is supposed to be appended (percent-escaped string or not)

When pct_encoded_view is iterated it produces normal individual characters. That is, any percent-escapes in the referenced character buffer have the encoding removed.

...

and how (path appending or dumb concatenation). That's the problem with the docs, it says "appends", but there are too many ways to "append".

The docs are correct in leaving the meaning of "append" unspecified. It is only necessary to meet the syntactic requirements. It is up to the user to decide whether or not the behavior of the mutable string's 'append' and/or 'assign' operations are appropriate to achieve the desired outcome. Thanks

Peter Dimov

6:28 p.m.

Andrey Semashev wrote:

...

...
The semantics are pretty clear when value_type is char, which is what is required.

It's less clear what should happen when the library calls append or assign with iterators whose value_type is char on something whose value_type is wchar_t, which is what you seem to be doing.

That part is defined by filesystem::path - character code conversion happens.

It's defined, but in my opinion, the semantics are wrong for the intended use case. The interpretation of paths coming from URLs should not depend on the current server ANSI code page - that's an implementation detail. You should be able to move your server to a different Windows installation (or even to a different OS) and the existing URLs should not suddenly change meaning. That's why I think that the only sensible interpretation of paths coming from URLs is as UTF-8, and for this case, the default filesystem conversion isn't suitable. The current MutableString requirements disallow this, as they mandate a matching value_type of char. This requirement, again in my opinion, is correct, because if relaxed, people will silently get ANSI code page dependent conversions, often without realizing it, if they don't develop on Windows. _If_ this requirement is relaxed, the library should somehow default to transcoding from UTF-8 to the value_type, instead of using the ANSI code page (as with fs::path) or a char to value_type cast (as with e.g. wstring or u16string.)

Vinnie Falco

7:10 p.m.

On Mon, Aug 15, 2022 at 11:28 AM Peter Dimov via Boost <boost@lists.boost.org> wrote:

...

That's why I think that the only sensible interpretation of paths coming from URLs is as UTF-8, and for this case, the default filesystem conversion isn't suitable.

Can you please submit a pull request showing me how route.cpp can be correctly implemented according to these semantics? Thanks

Peter Dimov

7:35 p.m.

Vinnie Falco wrote:

...

On Mon, Aug 15, 2022 at 11:28 AM Peter Dimov via Boost <boost@lists.boost.org> wrote:

...
That's why I think that the only sensible interpretation of paths coming from URLs is as UTF-8, and for this case, the default filesystem conversion isn't suitable.

Can you please submit a pull request showing me how route.cpp can be correctly implemented according to these semantics?

The usual approach is to put boost::filesystem::path::imbue( std::locale( std::locale::classic(), new boost::filesystem::detail::utf8_codecvt_facet ) ); as the first thing in `main`. fs::path has overloads taking codecvt_type, but I've never used them so I don't know how they work. Maybe Andrey can help.

Andrey Semashev

8:57 p.m.

On 8/15/22 22:35, Peter Dimov via Boost wrote:

...

Vinnie Falco wrote:

...
On Mon, Aug 15, 2022 at 11:28 AM Peter Dimov via Boost <boost@lists.boost.org> wrote:

...
That's why I think that the only sensible interpretation of paths coming from URLs is as UTF-8, and for this case, the default filesystem conversion isn't suitable.

Can you please submit a pull request showing me how route.cpp can be correctly implemented according to these semantics?

The usual approach is to put

boost::filesystem::path::imbue( std::locale( std::locale::classic(), new boost::filesystem::detail::utf8_codecvt_facet ) );

I'd rather boost::filesystem::path::imbue(boost::locale::generator()("C.UTF-8")); Don't recommend users to use stuff from namespace detail. :) Note though that after that call all narrow strings you pass to filesystem::path must be UTF-8; you should be careful if you obtain paths from user input, files, network, etc.

...

as the first thing in `main`.

fs::path has overloads taking codecvt_type, but I've never used them so I don't know how they work. Maybe Andrey can help.

If you made call above then that is the only thing you need to do. The overloads taking codecvt is for when you don't want to set the global filesystem locale for some reason. This may be useful during global destructors stage, for example. By default, the codecvt is taken from the global filesystem locale.

Peter Dimov

9:10 p.m.

Andrey Semashev wrote:

...

On 8/15/22 22:35, Peter Dimov via Boost wrote:

...
Vinnie Falco wrote:

...
On Mon, Aug 15, 2022 at 11:28 AM Peter Dimov via Boost <boost@lists.boost.org> wrote:

...
That's why I think that the only sensible interpretation of paths coming from URLs is as UTF-8, and for this case, the default filesystem conversion isn't suitable.

Can you please submit a pull request showing me how route.cpp can be correctly implemented according to these semantics?

The usual approach is to put

boost::filesystem::path::imbue( std::locale( std::locale::classic(), new boost::filesystem::detail::utf8_codecvt_facet ) );

I'd rather

boost::filesystem::path::imbue(boost::locale::generator()("C.UTF-8"));

Don't recommend users to use stuff from namespace detail. :)

Sound advice in principle but that's pretty much the established practice at this point. Perhaps we should expose a non-detail way to perform the above, such as provide a function "set_utf8_as_default".

...

Note though that after that call all narrow strings you pass to filesystem::path must be UTF-8; you should be careful if you obtain paths from user input, files, network, etc.

Right. That's why I alluded to the alternative, using the codecvt_type overloads.

Andrey Semashev

9:56 p.m.

On 8/16/22 00:10, Peter Dimov via Boost wrote:

...

Andrey Semashev wrote:

...
On 8/15/22 22:35, Peter Dimov via Boost wrote:

...
Vinnie Falco wrote:

...
On Mon, Aug 15, 2022 at 11:28 AM Peter Dimov via Boost <boost@lists.boost.org> wrote:

...
That's why I think that the only sensible interpretation of paths coming from URLs is as UTF-8, and for this case, the default filesystem conversion isn't suitable.

Can you please submit a pull request showing me how route.cpp can be correctly implemented according to these semantics?

The usual approach is to put

boost::filesystem::path::imbue( std::locale( std::locale::classic(), new boost::filesystem::detail::utf8_codecvt_facet ) );

I'd rather

boost::filesystem::path::imbue(boost::locale::generator()("C.UTF-8"));

Don't recommend users to use stuff from namespace detail. :)

Sound advice in principle but that's pretty much the established practice at this point. Perhaps we should expose a non-detail way to perform the above, such as provide a function "set_utf8_as_default".

Perhaps, move utf8_codecvt_facet to Utility or Core?

...

...
Note though that after that call all narrow strings you pass to filesystem::path must be UTF-8; you should be careful if you obtain paths from user input, files, network, etc.

Right. That's why I alluded to the alternative, using the codecvt_type overloads.

For completeness, here's a usage example: void append_utf8(std::string const& str, fs::path& p) { std::locale loc = boost::locale::generator()("C.UTF-8"); p.append(str, std::use_facet<fs::path::codecvt_type>(loc)); } Same with iterators.

Peter Dimov

10:07 p.m.

Andrey Semashev wrote:

...

...
Sound advice in principle but that's pretty much the established practice at this point. Perhaps we should expose a non-detail way to perform the above, such as provide a function "set_utf8_as_default".

Perhaps, move utf8_codecvt_facet to Utility or Core?

Maybe it's time. And make it all inline?

Andrey Semashev

10:21 p.m.

On 8/16/22 01:07, Peter Dimov via Boost wrote:

...

Andrey Semashev wrote:

...
...
Sound advice in principle but that's pretty much the established practice at this point. Perhaps we should expose a non-detail way to perform the above, such as provide a function "set_utf8_as_default".

Perhaps, move utf8_codecvt_facet to Utility or Core?

Maybe it's time. And make it all inline?

Yes. Although there's this already: https://github.com/boostorg/locale/blob/develop/include/boost/locale/utf8_co...

Alexander Grund

16 Aug 16 Aug

7 a.m.

Am 15.08.22 um 23:10 schrieb Peter Dimov via Boost:

...

Andrey Semashev wrote:

...
On 8/15/22 22:35, Peter Dimov via Boost wrote:

...
The usual approach is to put

boost::filesystem::path::imbue( std::locale( std::locale::classic(), new boost::filesystem::detail::utf8_codecvt_facet ) ); I'd rather

boost::filesystem::path::imbue(boost::locale::generator()("C.UTF-8"));

Don't recommend users to use stuff from namespace detail. :) Sound advice in principle but that's pretty much the established practice at this point. Perhaps we should expose a non-detail way to perform the above, such as provide a function "set_utf8_as_default". Is it? I see `boost::locale::generator().generate("")` being used more often, e.g. (quick Google search): http://doc.aldebaran.com/1-14/ref/libqi/boost_filesystem_example_8cpp-exampl...

And there is `boost::nowide::nowide_filesystem()` doing pretty much what you suggest: https://github.com/boostorg/nowide/blob/master/include/boost/nowide/filesyst... Am 16.08.22 um 00:21 schrieb Andrey Semashev via Boost:

...

On 8/16/22 01:07, Peter Dimov via Boost wrote:

...
Andrey Semashev wrote:

...
Perhaps, move utf8_codecvt_facet to Utility or Core? Maybe it's time. And make it all inline? Yes.

Although there's this already:

https://github.com/boostorg/locale/blob/develop/include/boost/locale/utf8_co...

And again in Nowide: https://github.com/boostorg/nowide/blob/master/include/boost/nowide/utf8_cod... Am 16.08.22 um 01:53 schrieb Vinnie Falco via Boost:

...

(where is the signature of fopen that accepts a filesystem::path?). That's outside of its purpose: It is meant to accept narrow strings, UTF-8 encoded where applicable.

Gavin Lambert

15 Aug 15 Aug

10:46 p.m.

On 16/08/2022 07:35, Peter Dimov wrote:

...

The usual approach is to put

boost::filesystem::path::imbue( std::locale( std::locale::classic(), new boost::filesystem::detail::utf8_codecvt_facet ) );

as the first thing in `main`.

As a library, you cannot assume that the application has done this (nor can you do it yourself). The code has to act correctly regardless.

Peter Dimov

11:31 p.m.

Gavin Lambert wrote:

...

On 16/08/2022 07:35, Peter Dimov wrote:

...
The usual approach is to put

boost::filesystem::path::imbue( std::locale( std::locale::classic(), new boost::filesystem::detail::utf8_codecvt_facet ) );

as the first thing in `main`.

As a library, you cannot assume that the application has done this (nor can you do it yourself). The code has to act correctly regardless.

Yes in principle, but the question I'm replying to is this:

...

Can you please submit a pull request showing me how route.cpp can be correctly implemented according to these semantics?

In this case I can assume things. route.cpp does create a path from argv[2], but interpreting it as UTF-8 would probably be a feature. Using the ACP is wrong anyway as argv[2] is likely in the console code page.

Gavin Lambert

11:45 p.m.

On 16/08/2022 11:31, Peter Dimov wrote:

...

route.cpp does create a path from argv[2], but interpreting it as UTF-8 would probably be a feature. Using the ACP is wrong anyway as argv[2] is likely in the console code page.

On Windows, the application always receives command line arguments as wchar_t. If you provide the narrow main() entrypoint method then it is the application's runtime startup code that will convert args back to ANSI using the ACP, not the console code page. As such, interpreting argv as ACP is correct (albeit lossy if there were any Unicode characters involved -- the only truly correct design is to only use the wide entrypoint or ignore the entrypoint args and reparse from the original Unicode string). Also on Windows, in most cases the console code page is also the ACP, although that is less true in more recent versions.

Vinnie Falco

11:53 p.m.

On Mon, Aug 15, 2022 at 4:46 PM Gavin Lambert via Boost <boost@lists.boost.org> wrote:

...

...

My experiences with std::filesystem and boost::filesystem have been nothing but negative. I think that the decision to make the character type different on Windows was a mistake. The need for locales and imbuements and global state and... really, it is just giving me a big headache. I think there is room for a new library that handles files, directories, file metadata, and also has some features of nowide (where is the signature of fopen that accepts a filesystem::path?). It should be utf-8 only, use Plain Old char (even on Windows), it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files). Possible names: Boost.LordOfTheFiles Boost.HurrDir Boost.Progra~1 Boost.FlySystem Thoughts?

Peter Dimov

16 Aug 16 Aug

12:04 a.m.

Vinnie Falco wrote:

...

My experiences with std::filesystem and boost::filesystem have been nothing but negative. I think that the decision to make the character type different on Windows was a mistake. The need for locales and imbuements and global state and... really, it is just giving me a big headache.

I think there is room for a new library that handles files, directories, file metadata, and also has some features of nowide (where is the signature of fopen that accepts a filesystem::path?). It should be utf-8 only, use Plain Old char (even on Windows), it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files).

Possible names:

Boost.LordOfTheFiles Boost.HurrDir Boost.Progra~1 Boost.FlySystem

Boost.LLFIO obviously.

Emil Dotchevski

12:08 a.m.

На пн, 15.08.2022 г. в 16:54 ч. Vinnie Falco via Boost < boost@lists.boost.org> написа:

...

it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files).

AFAIK on posix it is usual that file names are simply zero terminated byte sequences, which means that UTF-8 won't be _completely_ portable, and not only on grandpa's EPROMs.

Andrey Semashev

7:41 a.m.

On 8/16/22 03:08, Emil Dotchevski via Boost wrote:

...

На пн, 15.08.2022 г. в 16:54 ч. Vinnie Falco via Boost < boost@lists.boost.org> написа:

...
it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files).

AFAIK on posix it is usual that file names are simply zero terminated byte sequences, which means that UTF-8 won't be _completely_ portable, and not only on grandpa's EPROMs.

In practice, paths are in UTF-8 on all modern POSIX systems.

Tom Honermann

3:15 p.m.

On 8/16/22 3:41 AM, Andrey Semashev via Boost wrote:

...

On 8/16/22 03:08, Emil Dotchevski via Boost wrote:

...
На пн, 15.08.2022 г. в 16:54 ч. Vinnie Falco via Boost < boost@lists.boost.org> написа:

...
it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files). AFAIK on posix it is usual that file names are simply zero terminated byte sequences, which means that UTF-8 won't be _completely_ portable, and not only on grandpa's EPROMs. In practice, paths are in UTF-8 on all modern POSIX systems.

Could you please provide a citation for that? I have experience with people continuing to use Shift-JIS for file names in recent-ish times. Tom.

...

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Andrey Semashev

4:21 p.m.

On 8/16/22 18:15, Tom Honermann wrote:

...

On 8/16/22 3:41 AM, Andrey Semashev via Boost wrote:

...
On 8/16/22 03:08, Emil Dotchevski via Boost wrote:

...
На пн, 15.08.2022 г. в 16:54 ч. Vinnie Falco via Boost < boost@lists.boost.org> написа:

...
it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files). AFAIK on posix it is usual that file names are simply zero terminated byte sequences, which means that UTF-8 won't be _completely_ portable, and not only on grandpa's EPROMs. In practice, paths are in UTF-8 on all modern POSIX systems.

Could you please provide a citation for that?

I have experience with people continuing to use Shift-JIS for file names in recent-ish times.

That's just my experience. I haven't seen a non-UTF-8 POSIX system in decades.

Gavin Lambert

3:15 a.m.

On 16/08/2022 11:53, Vinnie Falco wrote:

...

My experiences with std::filesystem and boost::filesystem have been nothing but negative. I think that the decision to make the character type different on Windows was a mistake. The need for locales and imbuements and global state and... really, it is just giving me a big headache.

Using wchar_t on Windows is actually the least painful option. (And you don't have to worry about locales and imbuements etc if you never try to convert to not-wchar_t.) For correct behaviour, you *must* only use the W variants of the native API methods, or wchar_t methods of standard library functions. Inevitably, everything in the standard library that accepts 'char' params assumes that these are encoded in the ANSI code page, not UTF-8. This can't be "fixed" or it breaks all the legacy apps. In practice, this means that unless you can absolutely guarantee that your paths only contain pure ASCII (and the instant you accept a path or filename from the user, you lose), it is *never* safe to use any of the non-wide library methods. You *can* (and many do) store paths in other libraries and in the application in 'char'-encoded-as-UTF-8, but then you have to remember every single time you hit the standard library or direct WinAPI boundaries to convert your strings to wide before passing them across, or hilarity will ensue (without even a convenient compiler error). Storing paths as wchar_t in the first place both avoids the cost of converting back and forth and potential corruption (often overlooked, unless you regularly test with unicode paths) from accidentally forgetting a conversion.

...

(where is the signature of fopen that accepts a filesystem::path?)

Why are you using fopen in C++ in the first place? Filesystem does provide 'path' overloads for fstreams, which you should have been using instead anyway.

...

It should be utf-8 only, use Plain Old char (even on Windows), it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files).

In theory, the standard library (and other wrapper libraries around the WinAPI, including Filesystem) could start doing more sane things by using the C++20 'char8_t'/'u8string' types to disambiguate between UTF-8 encoded paths and legacy idkwtf-'char'-encoded paths. But this will take a very long time to percolate through the ecosystem, especially as there are a bunch of people who hate the very idea of it. And it doesn't solve the conversion performance angle. (Hopefully, Windows will eventually provide char8_t entrypoints and APIs, which will make it easier to interoperate with not-Windows.) Although as Emil has already pointed out, it's valid in not-Windows to have arbitrary not-UTF-8 byte sequences in paths, so you can get into trouble in that direction as well. That's another reason for using wchar_t in Windows and char in not-Windows: no conversions happen at all (at least where values are accepted natively from the OS), which has maximal compatibility for otherwise-invalid byte sequences that nevertheless exist.

Daniela Engert

6:27 a.m.

Am 16.08.2022 um 05:15 schrieb Gavin Lambert via Boost:

...

On 16/08/2022 11:53, Vinnie Falco wrote:

...
My experiences with std::filesystem and boost::filesystem have been nothing but negative. I think that the decision to make the character type different on Windows was a mistake. The need for locales and imbuements and global state and... really, it is just giving me a big headache.

Using wchar_t on Windows is actually the least painful option. (And you don't have to worry about locales and imbuements etc if you never try to convert to not-wchar_t.)

For correct behaviour, you *must* only use the W variants of the native API methods, or wchar_t methods of standard library functions.

Inevitably, everything in the standard library that accepts 'char' params assumes that these are encoded in the ANSI code page, not UTF-8. This can't be "fixed" or it breaks all the legacy apps.

In practice, this means that unless you can absolutely guarantee that your paths only contain pure ASCII (and the instant you accept a path or filename from the user, you lose), it is *never* safe to use any of the non-wide library methods.

You *can* (and many do) store paths in other libraries and in the application in 'char'-encoded-as-UTF-8, but then you have to remember every single time you hit the standard library or direct WinAPI boundaries to convert your strings to wide before passing them across, or hilarity will ensue (without even a convenient compiler error).

Storing paths as wchar_t in the first place both avoids the cost of converting back and forth and potential corruption (often overlooked, unless you regularly test with unicode paths) from accidentally forgetting a conversion.

...
(where is the signature of fopen that accepts a filesystem::path?)

Why are you using fopen in C++ in the first place?

Filesystem does provide 'path' overloads for fstreams, which you should have been using instead anyway.

...
It should be utf-8 only, use Plain Old char (even on Windows), it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files).

In theory, the standard library (and other wrapper libraries around the WinAPI, including Filesystem) could start doing more sane things by using the C++20 'char8_t'/'u8string' types to disambiguate between UTF-8 encoded paths and legacy idkwtf-'char'-encoded paths. But this will take a very long time to percolate through the ecosystem, especially as there are a bunch of people who hate the very idea of it. And it doesn't solve the conversion performance angle.

(Hopefully, Windows will eventually provide char8_t entrypoints and APIs, which will make it easier to interoperate with not-Windows.)

Although as Emil has already pointed out, it's valid in not-Windows to have arbitrary not-UTF-8 byte sequences in paths, so you can get into trouble in that direction as well.

That's another reason for using wchar_t in Windows and char in not-Windows: no conversions happen at all (at least where values are accepted natively from the OS), which has maximal compatibility for otherwise-invalid byte sequences that nevertheless exist.

Amen brother, you speak wisely! I want to add the following to stay sane on Windows: ensure that *both* the wide and the narrow execution character encoding is Unicode (i.e. UTF-16 for wchar_t (that's the default) and UTF-8 for char), build with _UNICODE defined, and link with <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>. This guarantees consistent semantics throughout the *whole* execution of the program on reasonably recent versions of Windows. And lastly, represent paths with std/boost filesystem paths and use APIs that know how to deal with them *correctly*. Similar advise applies to POSIX systems. UTF-8 everywhere is just a recommendation but no guarantee. Dani

Peter Dimov

7:44 a.m.

Gavin Lambert wrote:

...

Using wchar_t on Windows is actually the least painful option. (And you don't have to worry about locales and imbuements etc if you never try to convert to not-wchar_t.)

That's only if your program never runs on anything else. For portable code, using char and UTF-8 is the least painful option. We have an entire library in Boost for this purpose, whose documentation does a reasonable job explaining that. https://www.boost.org/doc/libs/1_80_0/libs/nowide/doc/html/index.html

Gavin Lambert

10:30 p.m.

On 16/08/2022 19:44, Peter Dimov wrote:

...

Gavin Lambert wrote:

...
Using wchar_t on Windows is actually the least painful option. (And you don't have to worry about locales and imbuements etc if you never try to convert to not-wchar_t.)

That's only if your program never runs on anything else. For portable code, using char and UTF-8 is the least painful option. We have an entire library in Boost for this purpose, whose documentation does a reasonable job explaining that.

Currently, yes. In theory, though, you could adopt a TCHAR-like approach where you use wchar_t on Windows and char/char8_t on not-Windows, selected at compile time. This would avoid all conversions and just use the native character type of the OS, which would be better. (Windows has its own version of the invalid characters problem -- it's legal to have mismatched surrogates in filenames, which work fine as long as you keep everything in wchar_t UCS-2 and never convert it, but break if you convert to UTF-8 and back. It's probably less common than not-UTF-8 non-Windows filenames, though.) The downside is that you need every single bit of code to either use this TCHAR type (which in turn means that you need to be able to recompile everything), or (better) to provide overloads for all possible underlying types (with the same name, so that the actual code is spelled the same either way), and some usages may need macros or char_traits etc. (But then that tends to lead to either code duplication or over-templating, neither of which is good.) Ideally, the standard library would have defined such a platform-specific type alias (notably, not actually a distinct type, so that existing overloads work), which would have made it easier to build up libraries around it, or at least encourage writing both overloads. Or the language would define some kind of compile-time-variant that permits separate-translation-unit implementation of overloaded types that have the "same" implementation without header-only templates. Sadly that hasn't happened yet.

Andrey Semashev

7:47 a.m.

On 8/16/22 02:53, Vinnie Falco via Boost wrote:

...

(where is the signature of fopen that accepts a filesystem::path?).

I was actually thinking of adding one in Boost.Filesystem but haven't got around to do it. In the meantime, you can use boost::filesystem::fstream.

Niall Douglas

3:19 p.m.

On 16/08/2022 00:53, Vinnie Falco via Boost wrote:

...

On Mon, Aug 15, 2022 at 4:46 PM Gavin Lambert via Boost <boost@lists.boost.org> wrote:

...
...

My experiences with std::filesystem and boost::filesystem have been nothing but negative. I think that the decision to make the character type different on Windows was a mistake. The need for locales and imbuements and global state and... really, it is just giving me a big headache.

I think there is room for a new library that handles files, directories, file metadata, and also has some features of nowide (where is the signature of fopen that accepts a filesystem::path?). It should be utf-8 only, use Plain Old char (even on Windows), it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files).

With proposed std::filesystem::path_view you can use UTF-8 everywhere once it is into the standard. The committee has signed off on the design, I just need to get it through LWG and it's into the next IS. I just need non-work free time to reappear to get back to working on my standards papers. It'll happen eventually. Niall

Peter Dimov

12:03 a.m.

Gavin Lambert wrote:

...

On 16/08/2022 11:31, Peter Dimov wrote:

...
route.cpp does create a path from argv[2], but interpreting it as UTF-8 would probably be a feature. Using the ACP is wrong anyway as argv[2] is likely in the console code page.

On Windows, the application always receives command line arguments as wchar_t. If you provide the narrow main() entrypoint method then it is the application's runtime startup code that will convert args back to ANSI using the ACP, not the console code page. As such, interpreting argv as ACP is correct

You're right, it is. So path::imbue is wrong even for route.cpp.

...

(albeit lossy if there were any Unicode characters involved -- the only truly correct design is to only use the wide entrypoint or ignore the entrypoint args and reparse from the original Unicode string).

Also on Windows, in most cases the console code page is also the ACP,

I'm not sure if these ever match. The console code page is 437 for me, and I assume for mostly everyone else.

Vinnie Falco

15 Aug 15 Aug

5:21 p.m.

On Mon, Aug 15, 2022 at 10:17 AM Vinnie Falco <vinnie.falco@gmail.com> wrote:

...

On Mon, Aug 15, 2022 at 8:42 AM Andrey Semashev via Boost

...
...

It comes down to this. boost::urls::pct_encoded_view is a ForwardRange of char. How do we append it as a segment to each of the following 4 things: 1. boost::filesystem::path on Windows 2. boost::filesystem::path not on Windows 3. std::filesystem::path on Windows 4. std::filesystem::path not on Windows I do not know how to answer this question, other than with annoying wide string conversions and extra memory allocations. Thanks

Andrey Semashev

5:41 p.m.

On 8/15/22 20:21, Vinnie Falco wrote:

...

On Mon, Aug 15, 2022 at 10:17 AM Vinnie Falco <vinnie.falco@gmail.com> wrote:

...
On Mon, Aug 15, 2022 at 8:42 AM Andrey Semashev via Boost

...
...

It comes down to this. boost::urls::pct_encoded_view is a ForwardRange of char. How do we append it as a segment to each of the following 4 things:

1. boost::filesystem::path on Windows 2. boost::filesystem::path not on Windows 3. std::filesystem::path on Windows 4. std::filesystem::path not on Windows

I do not know how to answer this question, other than with annoying wide string conversions and extra memory allocations.

filesystem::path will handle character code conversion. Regardless of the platform, you call path::append(). Or path::concat(), if you want dumb concatenation. The question is only what and how exactly do you want to append and what do you expect to get in the end. I didn't read Boost.URL docs, but it seems strange to me that operations are defined without an expected outcome. I mean, you can define loose concepts such as MutableString, but I don't find it helpful as I still can't tell what the operations on its models are doing.

Alan de Freitas

5:14 p.m.

...

Assigning or constructing a path from e.g. std::list<char> doesn't make much sense,

Why not? -- Alan Freitas https://github.com/alandefreitas

Andrey Semashev

5:24 p.m.

On 8/15/22 20:14, Alan de Freitas via Boost wrote:

...

...
Assigning or constructing a path from e.g. std::list<char> doesn't make much sense,

Why not?

Because list is not a string. If you're assigning a list to a path, most likely you are doing something wrong.

Alan de Freitas

6:54 p.m.

...

It's less clear what should happen when the library calls append or assign with iterators whose value_type is char on something whose value_type is wchar_t, which is what you seem to be doing.

...

std::filesystem requirements[2] actually restrict Source arguments to std string types, iterators (read: pointers) to C-style strings and arrays of characters (which are interpreted as C-style strings).

Yes. That sounds right. It looks like boost::filesystem is matching the behavior of std;:filesystem now, even if more restrictive ( https://godbolt.org/z/5WYYce6v8). Boost.URL shouldn't use it and that's for the best, since std::filesystem doesn't support it either. Non-contiguous ranges that dereference to char work with both std:: and boost::filesystem if we use the `append(InputIterator begin, InputIterator end)` overload. I think that's the source of confusion here. What C++ says about the `append(Source const& source)` overload is even more misleading. (2) and (3) participate in overload resolution only if Source and path are not the same type, and either: - Source is a specialization of std::basic_string <https://en.cppreference.com/w/cpp/string/basic_string> or std::basic_string_view <https://en.cppreference.com/w/cpp/string/basic_string_view>, or - std::iterator_traits <http://en.cppreference.com/w/cpp/iterator/iterator_traits><std::decay_t <http://en.cppreference.com/w/cpp/types/decay><Source>>::value_type is valid and denotes a possibly const-qualified encoding character type ( char, char8_t, (since C++20)char16_t, char32_t, or wchar_t). We assumed both overloads should work because of this second condition.

...

If you're assigning a list to a path, most likely you are doing something wrong.

Yes. `append(InputIterator begin, InputIterator end)` would still allow the person to do this wrong thing though. And `append(InputIterator begin, InputIterator end)` doesn't look like it's always wrong. Two obvious use cases could be (i) appending paths from resource trees or (ii) some std::ranges::view::... that transforms the input into the chars to represent a path segment for that input. If `append(InputIterator begin, InputIterator end)` is not wrong, it looks like `append(Source const& source)` would not be less wrong when Source is just the range holding the iterators for the first overload. In any case, both are still dangerous. Boost.URL and other libraries shouldn't count on it. As Peter mentioned, things like wstring and u16string could be appended, but the semantics will probably be wrong. They will convert char by char, without regards of encoding. Thank you for the explanation. Em seg., 15 de ago. de 2022 às 14:24, Andrey Semashev via Boost < boost@lists.boost.org> escreveu:

...

On 8/15/22 20:14, Alan de Freitas via Boost wrote:

...
...
Assigning or constructing a path from e.g. std::list<char> doesn't make much sense,

Why not?

Because list is not a string. If you're assigning a list to a path, most likely you are doing something wrong.

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

-- Alan Freitas https://github.com/alandefreitas

Andrey Semashev

8:41 p.m.

On 8/15/22 21:54, Alan de Freitas via Boost wrote:

...

...
It's less clear what should happen when the library calls append or assign with iterators whose value_type is char on something whose value_type is wchar_t, which is what you seem to be doing.

...
std::filesystem requirements[2] actually restrict Source arguments to std string types, iterators (read: pointers) to C-style strings and arrays of characters (which are interpreted as C-style strings).

Yes. That sounds right. It looks like boost::filesystem is matching the behavior of std;:filesystem now, even if more restrictive ( https://godbolt.org/z/5WYYce6v8).

Boost.Filesystem v3 is more permissive, as it still compiles the std::list case, although with a warning. v4 will fail to compile.

...

Boost.URL shouldn't use it and that's for the best, since std::filesystem doesn't support it either.

Non-contiguous ranges that dereference to char work with both std:: and boost::filesystem if we use the `append(InputIterator begin, InputIterator end)` overload. I think that's the source of confusion here. What C++ says about the `append(Source const& source)` overload is even more misleading.

(2) and (3) participate in overload resolution only if Source and path are not the same type, and either:

- Source is a specialization of std::basic_string <https://en.cppreference.com/w/cpp/string/basic_string> or std::basic_string_view <https://en.cppreference.com/w/cpp/string/basic_string_view>, or - std::iterator_traits <http://en.cppreference.com/w/cpp/iterator/iterator_traits><std::decay_t <http://en.cppreference.com/w/cpp/types/decay><Source>>::value_type is valid and denotes a possibly const-qualified encoding character type ( char, char8_t, (since C++20)char16_t, char32_t, or wchar_t).

We assumed both overloads should work because of this second condition.

std::list is not an iterator, applying std::iterator_traits to it is not valid (for one, std::list does not have an iterator_category).

...

...
If you're assigning a list to a path, most likely you are doing something wrong.

Yes. `append(InputIterator begin, InputIterator end)` would still allow the person to do this wrong thing though.

And `append(InputIterator begin, InputIterator end)` doesn't look like it's always wrong.

The signature with two iterators is the established practice for obtaining elements from a foreign sequence. You have it in every std container, std::string, etc. In particular, it allows to obtain the elements from exotic sources, like reading from an std::istreambuf_iterator. There is no such practice with a single-argument signature.

...

Two obvious use cases could be (i) appending paths from resource trees or (ii) some std::ranges::view::... that transforms the input into the chars to represent a path segment for that input. If `append(InputIterator begin, InputIterator end)` is not wrong, it looks like `append(Source const& source)` would not be less wrong when Source is just the range holding the iterators for the first overload.

In any case, both are still dangerous. Boost.URL and other libraries shouldn't count on it. As Peter mentioned, things like wstring and u16string could be appended, but the semantics will probably be wrong. They will convert char by char, without regards of encoding.

If the user calls a function passing two iterators, he is arguably aware that he is constructing/assigning/appending elements one-by-one, performing element-wise conversion, if needed. Again, this is established practice. If the user passes a single object to constructor/assignment/append, he provides the call with additional information on the nature of the input sequence, and the call is expected to behave according to that knowledge. For example, the call may not copy anything at all and simply move the contents or increment a reference counter, or use strlen to discover the end of the string, or use a locale from the source to perform character code conversion, and so on. As you can see, the behavior of such call can be very different depending on the argument type. Yes, with a range like boost::iterator_range or std::span there's really nothing fancy going on, and semantically the call would be expected to behave the same as with a pair of iterators. However, this is still a special case that has to be supported by the call explicitly, among the other single-argument signatures. This is relatively novel practice, and in some cases like the one that started this discussion, it can be ambiguous as to what the call actually does. In comparison, the two-iterator signature is rather explicit and clear wherever you see it.

Alan de Freitas

8:59 p.m.

...

std::list is not an iterator, applying std::iterator_traits to it is not valid (for one, std::list does not have an iterator_category).

Oh... Right. I failed to notice Source is (i) a basic_string, (ii) basic_string_view, or (iii) an iterator with one of those value types. We assumed Source would be (i) a basic_string, (ii) basic_string_view, or (iii) a container with one of those value types. Great.

...

If the user calls a function passing two iterators, he is arguably aware that he is constructing/assigning/appending elements one-by-one, performing element-wise conversion, if needed. Again, this is established practice.

Yes. That makes sense.

...

In comparison, the two-iterator signature is rather explicit and clear wherever you see it.

Definitely. Thank you again for clarifying. -- Alan Freitas https://github.com/alandefreitas

Andrey Semashev

29 Aug 29 Aug

9:57 p.m.

On 8/15/22 23:41, Andrey Semashev wrote:

...

On 8/15/22 21:54, Alan de Freitas via Boost wrote:

...
...
If you're assigning a list to a path, most likely you are doing something wrong.

Yes. `append(InputIterator begin, InputIterator end)` would still allow the person to do this wrong thing though.

And `append(InputIterator begin, InputIterator end)` doesn't look like it's always wrong.

The signature with two iterators is the established practice for obtaining elements from a foreign sequence. You have it in every std container, std::string, etc. In particular, it allows to obtain the elements from exotic sources, like reading from an std::istreambuf_iterator. There is no such practice with a single-argument signature.

...
Two obvious use cases could be (i) appending paths from resource trees or (ii) some std::ranges::view::... that transforms the input into the chars to represent a path segment for that input. If `append(InputIterator begin, InputIterator end)` is not wrong, it looks like `append(Source const& source)` would not be less wrong when Source is just the range holding the iterators for the first overload.

In any case, both are still dangerous. Boost.URL and other libraries shouldn't count on it. As Peter mentioned, things like wstring and u16string could be appended, but the semantics will probably be wrong. They will convert char by char, without regards of encoding.

If the user calls a function passing two iterators, he is arguably aware that he is constructing/assigning/appending elements one-by-one, performing element-wise conversion, if needed. Again, this is established practice.

Actually, I was incorrect on this one, at least wrt. std::filesystem::path. It only accepts iterators that have value_type that is one of the path character types.

Rainer Deyke

15 Aug 15 Aug

11:32 a.m.

On 15.08.22 01:34, Vinnie Falco via Boost wrote:

...

I see this in the commit log for boost.filesystem:

Deprecated path construction/assignment/appending from container types. Users are advised to use string types and iterators instead of containers to construct/assign/append to paths. In v4, the support for containers is removed.

This can't be right now, can it? Because std::filesystem supports path construction from anything meeting the requirements of Source which is kind of range-like. Boost.URL and its downstream libraries HTTP-Proto, HTTP-IO, Websocket-Proto, and Websocket-IO are depending on the container-based interface to achieve reasonably decent syntax and semantics despite the limitations of the path API (which are the fault of the standard and not Boost).

If std::filesystem::path supports container-based parameters to assign() and append() then why is boost::filesystem::path removing it?

I've already given up on Boost.Filesystem due to https://github.com/boostorg/filesystem/issues/181. std::filesystem seems to work better across different compilers/platforms, and it's standard. -- Rainer Deyke (rainerd@eldwood.com)

1075

Age (days ago)

1090

Last active (days ago)

List overview

Download

45 comments

11 participants

participants (11)

Alan de Freitas
Alexander Grund
Andrey Semashev
Daniela Engert
Emil Dotchevski
Gavin Lambert
Niall Douglas
Peter Dimov
Rainer Deyke
Tom Honermann
Vinnie Falco