My review of the proposed Boost.URL library follows.
Unfortunately I have been away for most of the review period and
I have not been able to spend as much time on this as I would have
liked; in particular I have not tried to run any code. But I have
looked at the source and documentation.
In previous email discussions a few months ago I complained about
the dangerous use of non-owning view types in the library interface,
and I see that these issues are still present. Specifically, quoting
the "Quick Look" docs:
result
On Sun, Aug 21, 2022 at 6:21 AM Phil Endecott via Boost
My review of the proposed Boost.URL library follows.
Thank you for taking the time to have a look at this library.
In previous email discussions a few months ago I complained about the dangerous use of non-owning view types in the library interface,
Yes but... this is how it is. The expectation for this library (and indeed, all my libraries past, present, and future) is that users have a grasp of C++ and are aware of the lifetime rules for both references, and types which have the semantics of references. That's the price of admission for the type-erasure and zero-cost abstraction.
result
r = parse_uri( s );
You could instead write result< url > = rv parse_uri( s ); and then rv.value() would be owning.
auto r = parse_uri( s );
There's no indication here that r is a non-owning view and that the caller is responsible for keeping s alive for the lifetime of r.
From https://master.url.cpp.al/url/ref/boost__urls__parse_uri.html#url.ref.boost_... "Ownership of the string is not transferred; the caller is responsible for ensuring that the lifetime of the character buffer extends until
On this point we agree. But...isn't this always a problem when using `auto`? It deprives the reader of information. When you write "auto" you are opting in for convenience over readability. If your complaint here is that the use of `auto` can make it easy to forget about things then I agree with that too. Still, I'm glad auto is in the language. However if you will note, the documentation meticulously minimizes the usage of auto in the example code, for this reason. The lifetime requirements are documented: the view is no longer being accessed."
A safer API design would, at a minimum, give the parsing function a name with "view" in it. The default, shorter-named function should return a value type.
Yes well, we considered that design and concluded that it wasn't practical. The library has multiple value types for URLs, which one should it return? And a pattern of use is to store the result of parsing in an existing variable which already has storage allocated (to support reuse and avoid unnecessary allocate/copy/free cycles).
More fundamentally, I'm not convinced that this allocation-avoiding view-rich API style is best for a general purpose vocabulary type like URL.
A URL is literally a string which meets the prescribed syntactical requirements of RFC3986. If I receive an HTTP GET request like this: GET /index.htm HTTP/1.1\r\n you think that the library should not let the user parse the substring and return a view to the URL? And instead it should allocate memory, all to protect beginners?
URL should be a "Keep It Simple" type, accessible to C++ beginners, with straightforward lifetime semantics.
std::string_view is C++17, are we saying now that Boost libraries should avoid some C++ features to appease beginners?
Compare with std::filesystem::path, which does not try to optimise by actually being a view.
Not really a good example for bolstering your case, because filesystem::path has significant problems :) Significant enough to motivate std::filesystem::path_view: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1030r4.pdf
Hmm, so url_view doesn't own the underlying storage, except when it does as a result of this mysterious method. What's the rationale for this? (Have I missed some documentation?)
`boost::urls::url` is for mutation and unique ownership. The function persist() is provided for efficient shared ownership of read-only URLs. Users can't write this function themselves, because it relies on implementation details.
What is sizeof(url) or sizeof(url_view)? I think it must be about 160 bytes, since you store offsets to each of the components. I'm not convinced that this is the best approach. ... Making the objects smaller would probably have measurable performance benefits.
This was the choice that we went with, but it can be changed in the future with sufficient experience in the field, without breaking the API. We would need measurements to know for sure.
In your non-view url object, you seem to allocate storage using new char[]. Why not use a std::string? One benefit of a std::string is SBO. Another is that you could provide a move constructor from a string.
No particular reason, this is something we could change in the future with sufficient experience in the field, without breaking the API. Move construction from `std::string` does sound interesting :) I have my doubts that very many URLs are going to be fitting into the small buffer but you never know.
In your query parameter API, you expose something that looks like a standard associative container except that you call the key and value "key" and "value" respectively, rather than "first" and "second". While I think many of us would agree that those are better names, it does make it more difficult to use this container with generic code than it would be otherwise.
This is something we could change, I guess you're proposing: using reference_type = `pair< pct_encoded_view, optional< pct_encoded_view > >` Which generic code uses first and second? Do you have an example you might point us at?
I see that you also have a bool called "has_value" in the value_type. I think this is to handle this:
?a=1 // has_value == true && value == "1" ?a= // has_value == true && value == "" ?a // has_value == false
Is that correct?
Yes
Again having this third member in the container's value_type makes generic code more difficult. I would be inclined to use a std::optional for the value instead.
We are already using boost::optional elsewhere, so we could use it here as well. But, there is a downside. If the user only cares about values which are not empty they could skip checking has_value and just look at key.empty(). If we switch to optional then everyone has to check optional::has_value(). I'm undecided on which is better, hopefully we will get more feedback.
In the past I have had situations where it was necessary to specify the case for hex letters in %-encoding; please consider adding this to pct_encode_opts.
Well percent-encoded escapes always use hexadecimal so I'm not sure what you're asking for here.
The Magnet Link example needs a paragraph at the start to say what a Magnet Link is.
Yes good idea.
Because of these concerns about the basic API design, I believe that the library should be REJECTED.
I remember I was at CppCon, I think it was 2018 and I was at some sort of official dinner - probably the speaker's dinner. I got sat next to this loudmouth who at one point starting going off on string_view and how bad it was, how dangerous it was, all the problems it caused in the jillion-line codebase at his job, and how it never should have been added to the standard. He did make some valid points of course but I couldn't shake the feeling that perhaps a programming language other than C++ would suit him better. That person was Nicolai Josuttis. Anyway the point I'm making here is that string_view in particular, and types that behave like references in general, are a contentious issue for C++. Some people hate them because "beginners" and easy-to-make-mistakes. Some people love them because "muh performance" (and beautiful type-erasure). It seems to me that one of the distinguishing features of C++ is in fact its zero-cost abstractions. How cool is it that the Boost.URL library lets you not only parse a URL in place, which is some part of a larger character buffer, but also lets you seamlessly access all of the parts? This I believe is very much in the spirit of C++. If we start thinking that users need to be coddled, have the useful but sharp edges filed down, pointy ends blunted, and so forth - I feel like we will end up with Rust. For that you can probably just save all the work and start using Rust sooner rather than later. Thanks
Vinnie Falco wrote:
URL should be a "Keep It Simple" type, accessible to C++ beginners, with straightforward lifetime semantics.
std::string_view is C++17, are we saying now that Boost libraries should avoid some C++ features to appease beginners?
Lifetime errors are not only made by beginners. They are a regular occurrence in the Clang code base, for instance, and I don't think the people working on and contributing to Clang are beginners. Returning views is basically a trap that's going to get everyone earlier or later, no matter how experienced.
On Sun, Aug 21, 2022 at 5:06 PM Peter Dimov via Boost
Returning views is basically a trap that's going to get everyone earlier or later, no matter how experienced.
One cool idea that I like is what abseil does in some cases, they return you some "proxy" that morphs into what you specify as lhs. For example absl::StrSplit() https://abseil.io/docs/cpp/guides/strings Not sure if that design has some problems beside being auto hostile.
Vinnie Falco wrote:
On Sun, Aug 21, 2022 at 6:21 AM Phil Endecott via Boost
wrote: In previous email discussions a few months ago I complained about the dangerous use of non-owning view types in the library interface,
Yes but... this is how it is. The expectation for this library (and indeed, all my libraries past, present, and future) is that users have a grasp of C++ and are aware of the lifetime rules for both references, and types which have the semantics of references.
There's an important difference between "references" and "types which have the semantics of references", when auto is used: std::string f1(); std::string& f2(); std::string_view f3(); auto r1 = f1(); // r1 is a string, OK. auto r2 = f2(); // r2 is a string, not a string&, so it's safe. auto& r2a = f2(); // r2a is a string&, but the user knows that because they wrote a &. auto r3 = f3(); // r3 is a string_view, danger!
That's the price of admission for the type-erasure and zero-cost abstraction.
I feel that maybe what I'm saying is not getting through: *I am not insisting that you remove these dangerous functions. You are welcome to keep them for those users who feel that they need the benefits that they offer. I would just like you to *also* offer safe-to-use functions, and to give the safe functions more concise names than the dangerous functions (and mention them first in the docs).*
Yes well, we considered that design and concluded that it wasn't practical. The library has multiple value types for URLs, which one should it return?
One of the safe ones. I'd suggest one that uses a std::string to store the URL.
A URL is literally a string which meets the prescribed syntactical requirements of RFC3986. If I receive an HTTP GET request like this:
GET /index.htm HTTP/1.1\r\n
you think that the library should not let the user parse the substring and return a view to the URL? And instead it should allocate memory, all to protect beginners?
No, I'm NOT SAYING THAT IT SHOULD NOT LET THE USER DO THAT, I'm saying that a user who wants to do that should have to do a bit more typing to make it clear to the next person who reads their code that they are using the unsafe view-returning function. It's also worth noting that a URL value-type using a std::string would do no allocations anyway in this case, since "/index.htm" easily fits into the string's SBO storage.
std::string_view is C++17, are we saying now that Boost libraries should avoid some C++ features to appease beginners?
A good API should use std::string_view only for parameters, and possibly for getter methods.
In the past I have had situations where it was necessary to specify the case for hex letters in %-encoding; please consider adding this to pct_encode_opts.
Well percent-encoded escapes always use hexadecimal so I'm not sure what you're asking for here.
I'm asking for an option to select either upper case or lower case letters in the hex encodings, i.e. %2D or %2d.
I remember I was at CppCon, I think it was 2018 and I was at some sort of official dinner - probably the speaker's dinner. I got sat next to this loudmouth who at one point starting going off on string_view and how bad it was, how dangerous it was, all the problems it caused in the jillion-line codebase at his job, and how it never should have been added to the standard. He did make some valid points of course but I couldn't shake the feeling that perhaps a programming language other than C++ would suit him better. That person was Nicolai Josuttis.
Yeah, well the last time I had lunch with Bjarne Stroustrup we talked a lot about how to make C++ a language that you could actually teach to students. This was around 2003, and he had just started teaching at Texas A&M at the time.
If we start thinking that users need to be coddled, have the useful but sharp edges filed down, pointy ends blunted, and so forth
*I DON'T WANT THE POINTY ENDS FILED DOWN, I JUST WANT THEM TO HAVE DANGER SIGNS ON THEM AND FOR THE TOOL TO ALSO BE USEABLE IN A SAFE WAY.*
- I feel like we will end up with Rust. For that you can probably just save all the work and start using Rust sooner rather than later.
Two of the C++Now videos that I've watched are Dave Abrahams on value semantics in Swift and Val, and David Sankel on "Rust features that I want in C++". Quote from the Swift talk: "C++ does have mutable value semantics, in fact it's the default, it's just the no-one uses it." It would be a disaster for C++ if it were to become only the language to use if Rust or Swift or D or whatever doesn't do what you want. C++ needs to learn from those languages. In particular, Dave Sankel's talk has a lot about inclusiveness in Rust. Do please view it. Regards, Phil.
On Mon, Aug 22, 2022 at 9:04 AM Phil Endecott via Boost
auto r1 = f1(); // r1 is a string, OK. auto r2 = f2(); // r2 is a string, not a string&, so it's safe. auto& r2a = f2(); // r2a is a string&, but the user knows that because they wrote a &. auto r3 = f3(); // r3 is a string_view, danger!
Ah, yes I see. To me the problem here is auto. In other words, the issue is not specific to URLs; it is general for any user that uses auto to capture the result of a function that returns a reference-like type over its arguments. For example, I see no problems here: std::string f1(); std::string& f2(); std::string_view f3(); std::string r1 = f1(); // r1 is a string, OK. std::string r2 = f2(); // r2 is a string, not a string&, so it's safe. std::string& r2a = f2(); // r2a is a string&, but the user knows that because they wrote a &. std::string_view r3 = f3(); // r3 is a string_view, danger! Instead of burdening every function that has ever been written, has been written, and will ever be written with the responsibility of communicating its return type twice - once in the actual return type and again in the function name, it seems to me that a more logical solution is that the author of the code chooses whether to opt-in to brevity (using auto) or opt-in to clarity (naming the return type).
I feel that maybe what I'm saying is not getting through:
*I am not insisting that you remove these dangerous functions. You are welcome to keep them for those users who feel that they need the benefits that they offer. I would just like you to *also* offer safe-to-use functions, and to give the safe functions more concise names than the dangerous functions
Okay so, if I am understanding you right, you are saying that this
would be better:
auto u = parse_uri_into_string_view( s ).value();
I'm not sure that I agree. The "name" of a function includes its
return type and parameter list types. `parse_uri` is really
"result
Yes well, we considered that design and concluded that it wasn't practical. The library has multiple value types for URLs, which one should it return?
One of the safe ones. I'd suggest one that uses a std::string to store the URL.
Using std::string as the storage is an implementation detail. We could do this, and it would enable the use-case where the URL can take ownership of the string, which could be cool in some cases. However if we do that then we close the door to also storing any additional variable-length metadata such as offsets or sizes of the path segments or query params. We need more field experience and real-world measurements before we make this change.
you think that the library should not let the user parse the substring and return a view to the URL? And instead it should allocate memory, all to protect beginners?
No, I'm NOT SAYING THAT IT SHOULD NOT LET THE USER DO THAT, I'm saying that a user who wants to do that should have to do a bit more typing to make it clear to the next person who reads their code that they are using the unsafe view-returning function.
I wonder if clang-tidy has a setting to warn when someone uses auto on a reference-like type...
I'm asking for an option to select either upper case or lower case letters in the hex encodings, i.e. %2D or %2d.
Hah.. I knew someone would ask for that. We have the framework in place (pct_encode_opts struct).)
*I DON'T WANT THE POINTY ENDS FILED DOWN, I JUST WANT THEM TO HAVE DANGER SIGNS ON THEM AND FOR THE TOOL TO ALSO BE USEABLE IN A SAFE WAY.*
I'm not sure how to warn people who use auto variables. Perhaps this could be something added as a compiler warning or to clang-tidy if it isn't already there? And perhaps we could propose an attribute for the language to inform the compiler of reference-like types. Like this maybe? class [[reference-value]] url_view; and then if you write extern url_view get_url_view_from( string_view ); .... auto u = get_url_view_from( s ); the compiler could show the danger sign that you clamour for, if you specify -Wreference-value. Now that I think about it Phil, this is quite a good idea because it solves the problem for everyone, forever, instead of having to have each individual author solve it over and over again for every piece of code they write. And users who want to opt-in to brevity can still do so. If you want to write this up as a proposal I would be in favor of it and you could take credit.
Dave Sankel's talk has a lot about inclusiveness in Rust. Do please view it.
Yes that was quite a good video! Thanks Regards
Thank you for the review.
auto r = parse_uri( s );
There's no indication here that r is a non-owning view and that the caller is responsible for keeping s alive for the lifetime of r.
Yes. This is a relevant issue here. This tends to arise with any function that returns a view and it's not simple to fix. Reference types are contentious and some people still seem to be against std::string_view. But people do need them in practice. I imagine many wouldn't like allocating more memory dynamically in an async server just to decide what to do with an URL string for which you already have allocated storage. On the other hand, it sounds reasonable to have a compromise for other potential use cases. It's not impossible to come up with other alternatives like parse_uri_view( s ), urls::views::parse_uri( s ), etc... I guess what std::ranges does, in practice to deal with that, is put all functions that return views in the "views" namespace. The biggest problem is still deciding which of the URL types to return from parse_uri( s ). https://github.com/CPPAlliance/url/issues/435
More fundamentally, I'm not convinced that this allocation-avoiding view-rich API style is best for a general purpose vocabulary type like URL.
Some defaults where the shorter-named functions return value types are OK. Not providing the views would be a mistake though. Most "non-beginner" use cases we know of require views. We would just be leaving this demand unattended.
I see that there is a method called persist():
std::shared_ptr< url_view const > persist() const;
This function returns a read-only copy of the URL, with shared lifetime. The returned value owns (persists) the underlying string. The algorithm used to create the value minimizes the number of individual memory allocations, making it more efficient than when using direct standard library functions.
Hmm, so url_view doesn't own the underlying storage, except when it does as a result of this mysterious method. What's the rationale for this? (Have I missed some documentation?)
Yes. I agree this function might seem confusing. The types are coherent, but we end up with something that owns the storage but is still called url_view. It allocates the memory for the URL and a url_view that references it. The rationale is to have only one dynamic memory allocation. It's sort of an advanced function and we have reviewed the documentation on that a few times. There's still a lot to improve. https://github.com/CPPAlliance/url/issues/436
I would be inclined to determine the offsets each time they are needed by the accessor methods. In the constructor, parse to validate the syntax but don't store any offsets.
Exploring ideas to make the object smaller is worth it. Implementation details can change and there might be better/lower upper limits for some URL components. I don't think parsing the URL again every time the user calls an accessor function is a reasonable solution for the general case though. Each access to an element would have a linear cost and URLs are allowed be long in many applications. Even for small URLs, reparsing it 9 times if the user asks for the components sounds unreasonable.
I believe that in the case of the path segments and the query parameters, you are not storing offsets (because the number is variable, and that would need allocations); why not adopt that approach for the other offsets?
The rationale is that these views are accessed through forward or bidirectional iterators, so the cost is not linear for each access.
In your non-view url object, you seem to allocate storage using new char[]. Why not use a std::string? One benefit of a std::string is SBO. Another is that you could provide a move constructor from a string.
Yes. That sounds interesting. https://github.com/CPPAlliance/url/issues/437
I find the names of the encoding/decoding views confusing. I would expect a "decoded_view" to refer to underlying encoded data and return equivalent decoded data when it is read from, but you seen to call that an "encoded_view".
I agree. Some uses might be confusing, like in pct_encoded_view decoded_host = u.host(); Most people who have looked at this have proposed something like decode_view, decoded_view, pct_decoded_view, or decoded_view. If we follow the convention used in std::ranges, then it would be pct_decode_view, because it's usually a verb. For instance, https://en.cppreference.com/w/cpp/ranges/transform_view https://en.cppreference.com/w/cpp/ranges#Range_adaptors https://github.com/CPPAlliance/url/issues/438
Is that correct? Again having this third member in the container's value_type makes generic code more difficult. I would be inclined to use a std::optional for the value instead.
Yes. One option is to use std::optional for the value. This looks more coherent. On the other hand, there are applications where an empty string is a reasonable replacement when the value is empty. Another is to have a class instead of a struct and so we could encapsulate the values and return them one way or the other. https://github.com/CPPAlliance/url/issues/439
In the past I have had situations where it was necessary to specify
the case for hex letters in %-encoding; please consider adding this to
pct_encode_opts.
Yes. This could be useful. Uppercase should be the default since this is what URL normalization requires. https://github.com/CPPAlliance/url/issues/440 -- Alan Freitas https://github.com/alandefreitas
participants (5)
-
Alan de Freitas
-
Ivan Matek
-
Peter Dimov
-
Phil Endecott
-
Vinnie Falco