Re: [boost] Boost.URL review


On Mon, Aug 22, 2022 at 7:47 AM Zach Laine
It sounds like what you are proposing, is that Boost.URL should remove its algorithms for performing modifications to the URL. Or at least, remove the ability to modify the path segments and query parameters in terms of their equivalent sequences (BidirectionalRange). The modifiable containers in Boost.URL maintain an important invariant: modifications to the container will always leave the URL in a valid state. The implication is that the stored string is always in its "serialized" form.
This isn't workable at all, for several reasons. First of all when the user passes an encoded parameter to a modification function (named with the suffix "_encoded") the library performs validation on the input to preserve the invariant. Second, the library remembers the decoded size so that it doesn't have to re-parse that portion of the URL again. Third, when the user passes a decoded parameter to a modification function the library applies whatever percent-escapes are necessary to make the string valid for the part being changed, and stores that. And finally, there are certain modifications to the URL which require that the library adjusts the result in order to both satisfy the user's request and preserve the invariant that all stored URLs are valid. For example, consider these statements: url u( "ldap:local:userdb/8675309" ); u.remove_scheme(); What should the resulting URL be? Well, it can't be "local:userdb/8675309" because that still has a scheme. The library has to produce: u == "local%3a/8675309"; In other words we go from an absolute-URI to a relative-ref. In addition to preserving the invariant that all modifications produce valid URLs, the library also ensures that it can satisfy every possible user-requested modification (presuming that no incorrectly-encoded strings are passed to functions which accept encoded parameters). There is a fair bit of cleverness going on behind the scenes to bring this convenience to the user. That would all be lost if we outsourced mutation operations to std containers.
huh...idk about all that. This library was designed for program to program communication and not really user-input and user display. Although it could be used for that. If someone wants to print the URL and show a squiggly line near the part that failed, they can always do that by calling the parse function that uses an iterator as an out-param, indicating where the error occurred. I don't know that I want the URL library to go out of its way to offer this functionality, especially adding a `std::function` to an API. Regards

On Mon, Aug 22, 2022 at 10:40 AM Vinnie Falco
Ok, I'm convinced. I am still not convinced that the containers that maintain these invariants should be lazy. That still seems weird to me. If they own the data, and are regular types, they should probably be eager. The larger issue to me is that they have a subset of the expected STL API.
That's kind of my point. You have a parsing minilib that is useful for URL parsing, but not *general use*. If that's the case, I think you should present it as that, and not a general use parsing lib. (The use of std::function is incidental; it could be a template parameter instead.) Zach

On Tue, Aug 23, 2022 at 8:44 AM Zach Laine
Ok, I'm convinced.
The URL classes are kind of weird in the sense that they have three parts: 1. the getter/setter functions for the singular pieces of the URL 2. the container-like interface for the segments 3. the container-like interface for the params Because each URL effectively exposes two different containers/ranges they are turned into separate types. It wouldn't make sense to have url::begin() and url::end(). Just so that we are on the same page, and anyone who is reading this now or in the future has clarity, when you write url u( "http://www.example.com/path/to/file.txt" ); segments us = u.segments(); The value `us` models a lazy, modifiable BidirectionalRange which references the underlying `url`. That is to say, that when you invoke modifiers on us, such as: us.pop_back(); it is the underlying `url` (or `static_url`) which changes. `segments` is a lightweight type which has the semantics of a reference. If you were to, say, make a copy of `us` you are just getting another reference to the same underlying url. A `segments` cannot be constructed by itself. When we say that the range is lazy, this means that the increment and decrement operation of its iterators executes in linear time rather than constant time. And of course there is no random access. The laziness refers to the fact that incrementing a path iterator requires finding the next slash ('/') character in the underlying URL.
Here is where I am a little lost. When you say "the containers that maintain these invariants" are you referring to `segments` or `url`? Because `segments` does not actually implement any of the business logic required to modify the path. All of that is delegated to private implementation details of `url` (or more correctly: `url_base`). Perhaps when you say "if they own the data", the term "they" refers to the `url`? Even in that case, laziness is required, because this library only stores URLs in their serialized form. There were earlier designs which did it differently but it became apparent very quickly that the tradeoffs were not favorable. I'm not quite sure what "eager" means in this context.
The larger issue to me is that they have a subset of the expected STL API.
Right, well we implemented as much STL-conforming API as possible under the constraint that the URL is stored in its serialized form. Really I consider myself as more of an explorer than a designer, because once we made the design choice that the URL would be stored serialized, the remainder of the API design and implementation was more of an exercise in discovering what the consequences of that design choice would be and how familiar to standard containers we could make the interfaces become. Matching the STL API would require giving up one or more things that we currently have. This is possible, but we leave it up to the user to make this decision (by copying the data into a new std container).
Yes you are right about this, it is not for general use. It is specifically designed for implementing the ABNF grammars found in protocol-related RFCs such as rfc3986 which defines URL grammars used in Boost.URL, non-well-known schemes, HTTP messages, HTTP fields, Websocket fields. For example consider this grammar (from rfc7230) Transfer-Encoding = 1#transfer-coding transfer-coding = "chunked" ; Section 4.1 / "compress" ; Section 4.2.1 / "deflate" ; Section 4.2.2 / "gzip" ; Section 4.2.3 / transfer-extension transfer-extension = token *( OWS ";" OWS transfer-parameter ) transfer-parameter = token BWS "=" BWS ( token / quoted-string ) A downstream library like not-yet-proposed-for-boost.HTTP.Proto could use this minilib thusly: constexpr auto transfer_encoding_rule = list_rule( transfer_coding_rule, 1 ); https://github.com/CPPAlliance/http_proto/blob/f2382d8eab8be2e9d6e6e14c5502d... There's a lot going on here behind the scenes. HTTP defines the list-rule which is a comma separate sequence of elements where due to legacy reasons you can have extra unnecessary comments and whitespace anywhere between the elements. In ABNF the list-rule is denoted by the hash character in the Transfer-Encoding grammar above ( "one or more of transfer-coding" ) Boost.URL provides the lazy range which allows the downstream library to express the list-rule as a ForwardRange of transfer_coding. This allows the caller to iterate the list elements in the Transfer-Encoding value without allocating memory for each element. There is a recurring theme here - I use lazy ranges to defer memory allocation :) This goes back to Beast which offers a crude and quite frankly inelegant set of lazy parsing primitives. I took that concept and formalized it in Boost.URL and used the principle to let users opt-in to interpreting the path and query as ranges of segments and params respectively. Now this is a downstream library so you might wonder what this has to do with URLs. Well, Boost.URL is designed to handle ALL URLs. This includes the well-known hierarchical schemes like http and file but also opaque schemes, of which there are uncountably many as often these schemes are private or unpublished. However, the library can't possibly know how to decompose URLs into the parts defined by these schemes. In order to do this, a user has to write a parsing component which understands the scheme. We will use the mailto scheme as an example. First let me point out that ALL URLs which use the mailto scheme, are still URLs. They follow the generic syntax, and Boost.URL is capable of parsing them with no problem - since it can parse all URLs no matter the scheme. But users who want to do a deep-dive into the mailto scheme can't be satisfied merely with parsing a mailto URL. They want it decomposed, and obviously Boost.URL can't do that in the general case because every scheme is different. Here is the syntax of a mailto URI: mailtoURI = "mailto:" [ to ] [ hfields ] to = addr-spec *("," addr-spec ) hfields = "?" hfield *( "&" hfield ) hfield = hfname "=" hfvalue hfname = *qchar hfvalue = *qchar addr-spec = local-part "@" domain local-part = dot-atom-text / quoted-string domain = dot-atom-text / "[" *dtext-no-obs "]" dtext-no-obs = %d33-90 / ; Printable US-ASCII %d94-126 ; characters not including ; "[", "]", or "\" qchar = unreserved / pct-encoded / some-delims some-delims = "!" / "$" / "'" / "(" / ")" / "*" / "+" / "," / ";" / ":" / "@" To begin, a user might write this function: result< url_view > parse_mailto_uri( string_view s ); This is easy to implement at first because all mailto URIs are URLs. We might start with this: result< url_view > rv parse_mailto_uri( string_view s ) { auto rv = parse_uri( s ); if( ! rv ) return rv.error(); if( ! grammar:ci_is_equal( rv->scheme(), "mailto" ) ) return error::scheme_mismatch; ... return *rv; } This is a good start but it is unsatisfying, because we are getting the "to" fields in the path part of the URL, and Boost.URL doesn't know how to split up the recipients of the mailto since they are comma separated and not slash separated. Remember though, that this is still a valid URL and that Boost.URL can represent it. So now we want to implement this grammar: to = addr-spec *("," addr-spec ) If you expand the addr-spec ABNF rule you will see that it has unreserved character sets, percent-encoding possibilities, and quoted strings. I can't get into all this here (perhaps it would make a good example for a contributor) but you might start like this: constexpr auto addr_spec_rule = grammar::tuple_rule( local_part_rule, squelch(grammar::delim_rule('@')), domain_rule ); then you would continue to define each of those rules, and eventually you would be able to 1. validate that a particular mailto URL matches the scheme, and 2. decompose the elements of the mailto URL based on the requirements of the scheme itself. The idea here is to incubate grammar/ in URL, as more downstream libraries get field experience with it, and then propose it as its own library. I'm hoping to see people implement custom schemes, but maybe that's wishful thinking. Phew... Thanks

Ok, I'm convinced. I am still not convinced that the containers that maintain these invariants should be lazy. That still seems weird to
me. If they own the data, and are regular types, they should probably
be eager. The larger issue to me is that they have a subset of the expected STL API.
Yes. If I understand the discussion correctly, the container that maintains the invariant is not lazy but the documentation could probably be better at explaining some other things. It's the urls::url that maintains the invariants and owns the data in url u( "ldap:local:userdb/8675309" ); u.remove_scheme(); (u == "local%3a/8675309"); The urls::segment and urls::params views are just references to part of this urls::url. The invariants belong to the url. They are helpers to operate on the urls::url segments and parameters at an individual level, while the underlying urls::url adjusts itself to keep the invariants: url u( "" ); u.set_path("//index.htm"); (u == "/.//index.htm"); url u( "" ); u.is_absolute(true); u.segments().append({"", "index.htm"}); (u == "/.//index.htm"); So the eager urls::url is the one maintaining the invariants, but the segment or parameter views are a better way to operate on the URL when dealing with individual subcomponents. Thus, the functions are not exactly a subset of the STL because the operations are ensuring the invariants of the underlying url remain valid. (The names "segments" and "params" might be a little confusing because it sounds like it owns the data. Some other review has recommended mutable_segments_view/mutable_params_view.) The views are useful because it gives us a single contiguous string with the complete URL at the end. This buffer is always ready to be used by another application. All consecutive URL components are also always immediately available as a contiguous buffer. boost::asio::connect(sock, r.resolve(u.encoded_host)); or boost::beast::http::requesthttp::string_body req{http::verb::get, target, version}; req.set(http::field::host, u.encoded_host()); Besides performance reasons, this contiguous representation is quite useful because we don't have to reassemble the URLs or combinations of components in another string to be able to use it. They can go straight in a buffer sequence. The documentation should probably highlight that design a little more: the idea that we try to keep things in a single contiguous buffer and why. An isolated container owning only the segments or parameters wouldn't be very useful because it wouldn't do much more than a std::vector or a std::map would. These containers would maintain all the invariants we need in decoded segments and params and some other function could then percent-encode the data in these containers to paste that into a url. But what the library is attempting to achieve is avoiding this process. -- Alan Freitas https://github.com/alandefreitas

On Tue, Aug 23, 2022 at 12:08 PM Alan de Freitas via Boost
This seems to be the heart of the matter. I expect params to be a real flat_map or map or sorted vec, and it's not. If my expectation is wrong because of the lack of a _view suffix in the name, that's fine -- but please add that to make things clear. However, the _view suffix in std C++ is used exclusively for immutable types, so maybe something else is indicated, like _facade or some variation of span. Zach

On Wed, Aug 24, 2022 at 7:59 AM Zach Laine via Boost
This seems to be the heart of the matter. I expect params to be a real flat_map or map or sorted vec, and it's not.
Yeah, I see what you mean. "params" sounds like its a first-class object, but it isn't.
If my expectation is wrong because of the lack of a _view suffix in the name, that's fine -- but please add that to make things clear.
There are 8 containers which need names: segments segments_view segments_encoded segments_encoded_view params params_view params_encoded params_encoded_view We prefix the type name with the type of element "segment" or "param", so that it sorts nicely. We could have went with "encoded_segments" which also makes sense but then they don't group together as well and it hurts discoverability in the docs (Ramey Rule). The _encoded suffix is self-explanatory. We chose the suffix _view to distinguish from the read-only containers from the mutating ones. Someone suggested putting "const" and "mutable" in the name but in my opinion that is ugly: const_segments_view mutable_segments_view const_encoded_segments_view mutable_encoded_segments_view const_params_view mutable_params_view const_params_view mutable_params_view
As if we didn't have a hard enough time coming up with 8 names that have synergy and aesthetics :) How about the suffix _ref? segments_ref segments_cref segments_encoded_ref segments_encoded_cref params_ref params_cref params_encoded_ref params_encoded_cref Or we could mix _ref and _view? segments_ref segments_view segments_encoded_ref segments_encoded_view params_ref params_view params_encoded_ref params_encoded_view We did the best we could with the current names and I know they aren't perfect. We're more than happy to hear feedback on whether any of these alternatives are resonating with folks, or if someone wants to propose a new set of 8 identifier names. Thanks

On Wed, Aug 24, 2022 at 10:16 AM Zach Laine
For this case, I definitely prefer ref to view.
\phew... progress :) So, to be explicit - does everyone agree that this set of container names is currently the best alternative thus far? segments_ref segments_view segments_encoded_ref segments_encoded_view params_ref params_view params_encoded_ref params_encoded_view Thanks

On Wed, Aug 24, 2022, at 9:02 PM, Vinnie Falco via Boost wrote:
Even though this is solliciting to become design-by-committee (there will never be full consensus) I will go on record that I think the `_ref`/`_cref` suffixes are reasonable to me. At least they're relatively short and consistent (ref/cref is a closer pair than mutable/const) In my code I would ironically be tempted to use auto more to hide the crufty names because I know they're views anyways. Which is probably why I don't mind the old names. In terms of minimizing surprise these names do a better job.

On Wed, Aug 24, 2022 at 2:44 PM Seth via Boost
I prefer to think of it as "focus group sampling." This is on the table now: params_view params_const_view params_encoded_view params_encoded_const_view segments_view segments_const_view segments_encoded_view segments_encoded_const_view Alan brings up a good point that while these are longer they are more memorable and it leaves no doubt as to meaning. Thanks
participants (4)
-
Alan de Freitas
-
Seth
-
Vinnie Falco
-
Zach Laine