[filesystem] becoming worse than the standard?

I see this in the commit log for boost.filesystem: Deprecated path construction/assignment/appending from container types. Users are advised to use string types and iterators instead of containers to construct/assign/append to paths. In v4, the support for containers is removed. This can't be right now, can it? Because std::filesystem supports path construction from anything meeting the requirements of Source which is kind of range-like. Boost.URL and its downstream libraries HTTP-Proto, HTTP-IO, Websocket-Proto, and Websocket-IO are depending on the container-based interface to achieve reasonably decent syntax and semantics despite the limitations of the path API (which are the fault of the standard and not Boost). If std::filesystem::path supports container-based parameters to assign() and append() then why is boost::filesystem::path removing it? Thanks

On 8/15/22 02:34, Vinnie Falco via Boost wrote:
Paths are supposed to interoperate with strings, so that is what the change you refer to[1] is aimed for. Assigning or constructing a path from e.g. std::list<char> doesn't make much sense, and that is being deprecated. That Boost.Filesystem used to accept this is an unfortunate legacy misfeature, which is being removed. Note that the support for string types is *not* being removed. std::filesystem requirements[2] actually restrict Source arguments to std string types, iterators (read: pointers) to C-style strings and arrays of characters (which are interpreted as C-style strings). So Boost.Filesystem is getting closer to the standard here.
Could you describe why you need to construct/assign/append to paths from arbitrary containers and why you cannot use string types or iterators? [1]: https://github.com/boostorg/filesystem/commit/d829a46b3120933e3dbaf9b3c85bf1... [2]: http://eel.is/c++draft/fs.path.req#1

On Mon, Aug 15, 2022 at 3:37 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:
Could you describe why you need to construct/assign/append to paths from arbitrary containers and why you cannot use string types or iterators?
Yeah, and from your description it sounds like maybe it is just a user error on my part or a bug. The container in question is boost::urls::pct_encoded_view: <https://github.com/CPPAlliance/url/blob/cba301383791f67c31b3ccd52a401849f7cf8940/include/boost/url/pct_encoded_view.hpp#L75> This used to work: pct_encoded_view v; boost::filesystem::path p; p.append( v ); What should happen here is that a path segment will be appended to p formed by applying percent-decoding to the string referenced by 'v' (the percent-decoding happens in the iterator). Thanks

On 8/15/22 17:10, Vinnie Falco wrote:
With std::filesystem::path that is what would happen, though pct_encoded_view::operator std::string(). With boost::filesystem::path, it will construct the path from a string that is obtained from [v.begin(), v.end()) range (which, I presume, will not be percent-encoded). Since the recent change, it will also give you a deprecated warning. The correct code is either use v.to_string() or explicitly use v.begin()/v.end(), depending on what you want. BTW, duplicate "as a" here: https://github.com/CPPAlliance/url/blob/cba301383791f67c31b3ccd52a401849f7cf...

On Mon, Aug 15, 2022 at 7:57 AM Andrey Semashev via Boost < boost@lists.boost.org> wrote:
on what you want.
boost::filesystem::path used to meet the requirements of MutableString, but now does not: https://master.url.cpp.al/url/concepts/mutablestring.html -- Regards, Vinnie Follow me on GitHub: https://github.com/vinniefalco

On 8/15/22 18:32, Vinnie Falco wrote:
It does not meet the requirements on Windows, and never did - path::value_type is wchar_t there. Also, you don't define the semantics of the operations, in particular, what is the expected result. Regardless of the platform, path::append does not merely appends characters, it appends a path, which may result in appending a directory delimiter or completely replacing the path if the argument is an absolute path. Depending on what exactly you mean by "Appends the characters", this may or may not qualify.

On Mon, Aug 15, 2022 at 8:42 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:
you don't define the semantics of the operations, in particular, what is the expected result
Yes. Thank you for noticing this, as it is exactly what I intended. It is the caller's responsibility to determine if the operation makes sense. The library doesn't care what the outcome is, only that the syntactic requirements are met. Which is how I documented it :) Thanks

Vinnie Falco wrote:
The semantics are pretty clear when value_type is char, which is what is required. It's less clear what should happen when the library calls append or assign with iterators whose value_type is char on something whose value_type is wchar_t, which is what you seem to be doing.

On 8/15/22 20:20, Peter Dimov via Boost wrote:
That part is defined by filesystem::path - character code conversion happens. What isn't clear in the code Vinnie posted earlier is what is supposed to be appended (percent-escaped string or not) and how (path appending or dumb concatenation). That's the problem with the docs, it says "appends", but there are too many ways to "append".

On Mon, Aug 15, 2022 at 10:46 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:
What isn't clear in the code Vinnie posted earlier is what is supposed to be appended (percent-escaped string or not)
When pct_encoded_view is iterated it produces normal individual characters. That is, any percent-escapes in the referenced character buffer have the encoding removed.
and how (path appending or dumb concatenation). That's the problem with the docs, it says "appends", but there are too many ways to "append".
The docs are correct in leaving the meaning of "append" unspecified. It is only necessary to meet the syntactic requirements. It is up to the user to decide whether or not the behavior of the mutable string's 'append' and/or 'assign' operations are appropriate to achieve the desired outcome. Thanks

Andrey Semashev wrote:
It's defined, but in my opinion, the semantics are wrong for the intended use case. The interpretation of paths coming from URLs should not depend on the current server ANSI code page - that's an implementation detail. You should be able to move your server to a different Windows installation (or even to a different OS) and the existing URLs should not suddenly change meaning. That's why I think that the only sensible interpretation of paths coming from URLs is as UTF-8, and for this case, the default filesystem conversion isn't suitable. The current MutableString requirements disallow this, as they mandate a matching value_type of char. This requirement, again in my opinion, is correct, because if relaxed, people will silently get ANSI code page dependent conversions, often without realizing it, if they don't develop on Windows. _If_ this requirement is relaxed, the library should somehow default to transcoding from UTF-8 to the value_type, instead of using the ANSI code page (as with fs::path) or a char to value_type cast (as with e.g. wstring or u16string.)

Vinnie Falco wrote:
The usual approach is to put boost::filesystem::path::imbue( std::locale( std::locale::classic(), new boost::filesystem::detail::utf8_codecvt_facet ) ); as the first thing in `main`. fs::path has overloads taking codecvt_type, but I've never used them so I don't know how they work. Maybe Andrey can help.

On 8/15/22 22:35, Peter Dimov via Boost wrote:
I'd rather boost::filesystem::path::imbue(boost::locale::generator()("C.UTF-8")); Don't recommend users to use stuff from namespace detail. :) Note though that after that call all narrow strings you pass to filesystem::path must be UTF-8; you should be careful if you obtain paths from user input, files, network, etc.
If you made call above then that is the only thing you need to do. The overloads taking codecvt is for when you don't want to set the global filesystem locale for some reason. This may be useful during global destructors stage, for example. By default, the codecvt is taken from the global filesystem locale.

Andrey Semashev wrote:
Sound advice in principle but that's pretty much the established practice at this point. Perhaps we should expose a non-detail way to perform the above, such as provide a function "set_utf8_as_default".
Right. That's why I alluded to the alternative, using the codecvt_type overloads.

On 8/16/22 00:10, Peter Dimov via Boost wrote:
Perhaps, move utf8_codecvt_facet to Utility or Core?
For completeness, here's a usage example: void append_utf8(std::string const& str, fs::path& p) { std::locale loc = boost::locale::generator()("C.UTF-8"); p.append(str, std::use_facet<fs::path::codecvt_type>(loc)); } Same with iterators.

On 8/16/22 01:07, Peter Dimov via Boost wrote:
Yes. Although there's this already: https://github.com/boostorg/locale/blob/develop/include/boost/locale/utf8_co...

Am 15.08.22 um 23:10 schrieb Peter Dimov via Boost:
And there is `boost::nowide::nowide_filesystem()` doing pretty much what you suggest: https://github.com/boostorg/nowide/blob/master/include/boost/nowide/filesyst... Am 16.08.22 um 00:21 schrieb Andrey Semashev via Boost:
And again in Nowide: https://github.com/boostorg/nowide/blob/master/include/boost/nowide/utf8_cod... Am 16.08.22 um 01:53 schrieb Vinnie Falco via Boost:

Gavin Lambert wrote:
Yes in principle, but the question I'm replying to is this:
Can you please submit a pull request showing me how route.cpp can be correctly implemented according to these semantics?
In this case I can assume things. route.cpp does create a path from argv[2], but interpreting it as UTF-8 would probably be a feature. Using the ACP is wrong anyway as argv[2] is likely in the console code page.

On 16/08/2022 11:31, Peter Dimov wrote:
On Windows, the application always receives command line arguments as wchar_t. If you provide the narrow main() entrypoint method then it is the application's runtime startup code that will convert args back to ANSI using the ACP, not the console code page. As such, interpreting argv as ACP is correct (albeit lossy if there were any Unicode characters involved -- the only truly correct design is to only use the wide entrypoint or ignore the entrypoint args and reparse from the original Unicode string). Also on Windows, in most cases the console code page is also the ACP, although that is less true in more recent versions.

On Mon, Aug 15, 2022 at 4:46 PM Gavin Lambert via Boost <boost@lists.boost.org> wrote:
...
My experiences with std::filesystem and boost::filesystem have been nothing but negative. I think that the decision to make the character type different on Windows was a mistake. The need for locales and imbuements and global state and... really, it is just giving me a big headache. I think there is room for a new library that handles files, directories, file metadata, and also has some features of nowide (where is the signature of fopen that accepts a filesystem::path?). It should be utf-8 only, use Plain Old char (even on Windows), it should be completely portable, except that it requires that directories are possible and that the filesystem isn't weird (I don't really care about compatibility with grandpa's EPROMs that can hold 9-bit flat files). Possible names: Boost.LordOfTheFiles Boost.HurrDir Boost.Progra~1 Boost.FlySystem Thoughts?

On 8/16/22 3:41 AM, Andrey Semashev via Boost wrote:
Could you please provide a citation for that? I have experience with people continuing to use Shift-JIS for file names in recent-ish times. Tom.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On 16/08/2022 11:53, Vinnie Falco wrote:
Using wchar_t on Windows is actually the least painful option. (And you don't have to worry about locales and imbuements etc if you never try to convert to not-wchar_t.) For correct behaviour, you *must* only use the W variants of the native API methods, or wchar_t methods of standard library functions. Inevitably, everything in the standard library that accepts 'char' params assumes that these are encoded in the ANSI code page, not UTF-8. This can't be "fixed" or it breaks all the legacy apps. In practice, this means that unless you can absolutely guarantee that your paths only contain pure ASCII (and the instant you accept a path or filename from the user, you lose), it is *never* safe to use any of the non-wide library methods. You *can* (and many do) store paths in other libraries and in the application in 'char'-encoded-as-UTF-8, but then you have to remember every single time you hit the standard library or direct WinAPI boundaries to convert your strings to wide before passing them across, or hilarity will ensue (without even a convenient compiler error). Storing paths as wchar_t in the first place both avoids the cost of converting back and forth and potential corruption (often overlooked, unless you regularly test with unicode paths) from accidentally forgetting a conversion.
(where is the signature of fopen that accepts a filesystem::path?)
Why are you using fopen in C++ in the first place? Filesystem does provide 'path' overloads for fstreams, which you should have been using instead anyway.
In theory, the standard library (and other wrapper libraries around the WinAPI, including Filesystem) could start doing more sane things by using the C++20 'char8_t'/'u8string' types to disambiguate between UTF-8 encoded paths and legacy idkwtf-'char'-encoded paths. But this will take a very long time to percolate through the ecosystem, especially as there are a bunch of people who hate the very idea of it. And it doesn't solve the conversion performance angle. (Hopefully, Windows will eventually provide char8_t entrypoints and APIs, which will make it easier to interoperate with not-Windows.) Although as Emil has already pointed out, it's valid in not-Windows to have arbitrary not-UTF-8 byte sequences in paths, so you can get into trouble in that direction as well. That's another reason for using wchar_t in Windows and char in not-Windows: no conversions happen at all (at least where values are accepted natively from the OS), which has maximal compatibility for otherwise-invalid byte sequences that nevertheless exist.

Am 16.08.2022 um 05:15 schrieb Gavin Lambert via Boost:
Amen brother, you speak wisely! I want to add the following to stay sane on Windows: ensure that *both* the wide and the narrow execution character encoding is Unicode (i.e. UTF-16 for wchar_t (that's the default) and UTF-8 for char), build with _UNICODE defined, and link with <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>. This guarantees consistent semantics throughout the *whole* execution of the program on reasonably recent versions of Windows. And lastly, represent paths with std/boost filesystem paths and use APIs that know how to deal with them *correctly*. Similar advise applies to POSIX systems. UTF-8 everywhere is just a recommendation but no guarantee. Dani

Gavin Lambert wrote:
That's only if your program never runs on anything else. For portable code, using char and UTF-8 is the least painful option. We have an entire library in Boost for this purpose, whose documentation does a reasonable job explaining that. https://www.boost.org/doc/libs/1_80_0/libs/nowide/doc/html/index.html

On 16/08/2022 19:44, Peter Dimov wrote:
Currently, yes. In theory, though, you could adopt a TCHAR-like approach where you use wchar_t on Windows and char/char8_t on not-Windows, selected at compile time. This would avoid all conversions and just use the native character type of the OS, which would be better. (Windows has its own version of the invalid characters problem -- it's legal to have mismatched surrogates in filenames, which work fine as long as you keep everything in wchar_t UCS-2 and never convert it, but break if you convert to UTF-8 and back. It's probably less common than not-UTF-8 non-Windows filenames, though.) The downside is that you need every single bit of code to either use this TCHAR type (which in turn means that you need to be able to recompile everything), or (better) to provide overloads for all possible underlying types (with the same name, so that the actual code is spelled the same either way), and some usages may need macros or char_traits etc. (But then that tends to lead to either code duplication or over-templating, neither of which is good.) Ideally, the standard library would have defined such a platform-specific type alias (notably, not actually a distinct type, so that existing overloads work), which would have made it easier to build up libraries around it, or at least encourage writing both overloads. Or the language would define some kind of compile-time-variant that permits separate-translation-unit implementation of overloaded types that have the "same" implementation without header-only templates. Sadly that hasn't happened yet.

On 16/08/2022 00:53, Vinnie Falco via Boost wrote:
With proposed std::filesystem::path_view you can use UTF-8 everywhere once it is into the standard. The committee has signed off on the design, I just need to get it through LWG and it's into the next IS. I just need non-work free time to reappear to get back to working on my standards papers. It'll happen eventually. Niall

On Mon, Aug 15, 2022 at 10:17 AM Vinnie Falco <vinnie.falco@gmail.com> wrote:
On Mon, Aug 15, 2022 at 8:42 AM Andrey Semashev via Boost
...
It comes down to this. boost::urls::pct_encoded_view is a ForwardRange of char. How do we append it as a segment to each of the following 4 things: 1. boost::filesystem::path on Windows 2. boost::filesystem::path not on Windows 3. std::filesystem::path on Windows 4. std::filesystem::path not on Windows I do not know how to answer this question, other than with annoying wide string conversions and extra memory allocations. Thanks

On 8/15/22 20:21, Vinnie Falco wrote:
filesystem::path will handle character code conversion. Regardless of the platform, you call path::append(). Or path::concat(), if you want dumb concatenation. The question is only what and how exactly do you want to append and what do you expect to get in the end. I didn't read Boost.URL docs, but it seems strange to me that operations are defined without an expected outcome. I mean, you can define loose concepts such as MutableString, but I don't find it helpful as I still can't tell what the operations on its models are doing.

Assigning or constructing a path from e.g. std::list<char> doesn't make much sense,
Why not? -- Alan Freitas https://github.com/alandefreitas

Yes. That sounds right. It looks like boost::filesystem is matching the behavior of std;:filesystem now, even if more restrictive ( https://godbolt.org/z/5WYYce6v8). Boost.URL shouldn't use it and that's for the best, since std::filesystem doesn't support it either. Non-contiguous ranges that dereference to char work with both std:: and boost::filesystem if we use the `append(InputIterator begin, InputIterator end)` overload. I think that's the source of confusion here. What C++ says about the `append(Source const& source)` overload is even more misleading. (2) and (3) participate in overload resolution only if Source and path are not the same type, and either: - Source is a specialization of std::basic_string <https://en.cppreference.com/w/cpp/string/basic_string> or std::basic_string_view <https://en.cppreference.com/w/cpp/string/basic_string_view>, or - std::iterator_traits <http://en.cppreference.com/w/cpp/iterator/iterator_traits><std::decay_t <http://en.cppreference.com/w/cpp/types/decay><Source>>::value_type is valid and denotes a possibly const-qualified encoding character type ( char, char8_t, (since C++20)char16_t, char32_t, or wchar_t). We assumed both overloads should work because of this second condition.
If you're assigning a list to a path, most likely you are doing something wrong.
Yes. `append(InputIterator begin, InputIterator end)` would still allow the person to do this wrong thing though. And `append(InputIterator begin, InputIterator end)` doesn't look like it's always wrong. Two obvious use cases could be (i) appending paths from resource trees or (ii) some std::ranges::view::... that transforms the input into the chars to represent a path segment for that input. If `append(InputIterator begin, InputIterator end)` is not wrong, it looks like `append(Source const& source)` would not be less wrong when Source is just the range holding the iterators for the first overload. In any case, both are still dangerous. Boost.URL and other libraries shouldn't count on it. As Peter mentioned, things like wstring and u16string could be appended, but the semantics will probably be wrong. They will convert char by char, without regards of encoding. Thank you for the explanation. Em seg., 15 de ago. de 2022 às 14:24, Andrey Semashev via Boost < boost@lists.boost.org> escreveu:
-- Alan Freitas https://github.com/alandefreitas

On 8/15/22 21:54, Alan de Freitas via Boost wrote:
Boost.Filesystem v3 is more permissive, as it still compiles the std::list case, although with a warning. v4 will fail to compile.
std::list is not an iterator, applying std::iterator_traits to it is not valid (for one, std::list does not have an iterator_category).
The signature with two iterators is the established practice for obtaining elements from a foreign sequence. You have it in every std container, std::string, etc. In particular, it allows to obtain the elements from exotic sources, like reading from an std::istreambuf_iterator. There is no such practice with a single-argument signature.
If the user calls a function passing two iterators, he is arguably aware that he is constructing/assigning/appending elements one-by-one, performing element-wise conversion, if needed. Again, this is established practice. If the user passes a single object to constructor/assignment/append, he provides the call with additional information on the nature of the input sequence, and the call is expected to behave according to that knowledge. For example, the call may not copy anything at all and simply move the contents or increment a reference counter, or use strlen to discover the end of the string, or use a locale from the source to perform character code conversion, and so on. As you can see, the behavior of such call can be very different depending on the argument type. Yes, with a range like boost::iterator_range or std::span there's really nothing fancy going on, and semantically the call would be expected to behave the same as with a pair of iterators. However, this is still a special case that has to be supported by the call explicitly, among the other single-argument signatures. This is relatively novel practice, and in some cases like the one that started this discussion, it can be ambiguous as to what the call actually does. In comparison, the two-iterator signature is rather explicit and clear wherever you see it.

std::list is not an iterator, applying std::iterator_traits to it is not valid (for one, std::list does not have an iterator_category).
Oh... Right. I failed to notice Source is (i) a basic_string, (ii) basic_string_view, or (iii) an iterator with one of those value types. We assumed Source would be (i) a basic_string, (ii) basic_string_view, or (iii) a container with one of those value types. Great.
Yes. That makes sense.
In comparison, the two-iterator signature is rather explicit and clear wherever you see it.
Definitely. Thank you again for clarifying. -- Alan Freitas https://github.com/alandefreitas

On 15.08.22 01:34, Vinnie Falco via Boost wrote:
I've already given up on Boost.Filesystem due to https://github.com/boostorg/filesystem/issues/181. std::filesystem seems to work better across different compilers/platforms, and it's standard. -- Rainer Deyke (rainerd@eldwood.com)
participants (11)
-
Alan de Freitas
-
Alexander Grund
-
Andrey Semashev
-
Daniela Engert
-
Emil Dotchevski
-
Gavin Lambert
-
Niall Douglas
-
Peter Dimov
-
Rainer Deyke
-
Tom Honermann
-
Vinnie Falco