On Tue, Aug 23, 2022 at 8:44 AM Zach Laine wrote:
> Ok, I'm convinced.
The URL classes are kind of weird in the sense that they have three parts:

1. the getter/setter functions for the singular pieces of the URL
2. the container-like interface for the segments
3. the container-like interface for the params

Because each URL effectively exposes two different containers/ranges, they are turned into separate types. It wouldn't make sense to have url::begin() and url::end().

Just so that we are on the same page, and anyone who is reading this now or in the future has clarity: when you write

    url u( "http://www.example.com/path/to/file.txt" );
    segments us = u.segments();

the value `us` models a lazy, modifiable BidirectionalRange which references the underlying `url`. That is to say, when you invoke modifiers on `us`, such as:

    us.pop_back();

it is the underlying `url` (or `static_url`) which changes. `segments` is a lightweight type which has the semantics of a reference. If you were to, say, make a copy of `us`, you are just getting another reference to the same underlying url. A `segments` cannot be constructed by itself.

When we say that the range is lazy, this means that the increment and decrement operations of its iterators execute in linear time rather than constant time, and of course there is no random access. The laziness refers to the fact that incrementing a path iterator requires finding the next slash ('/') character in the underlying URL.
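Putting that into a small sketch (illustrative only; I am glossing over the exact return and value types, so treat the member names below as approximate rather than authoritative):

    #include <boost/url.hpp>
    #include <cassert>
    #include <iostream>

    int main()
    {
        using namespace boost::urls;

        url u( "http://www.example.com/path/to/file.txt" );

        // Both of these are lightweight references to u's path;
        // copying the range copies the reference, not the path data.
        auto us  = u.segments();
        auto us2 = us;

        // Lazy iteration: each increment scans forward to the next
        // '/' in the serialized URL and produces one decoded segment.
        for( auto seg : us )
            std::cout << seg << "\n";   // path, to, file.txt

        // Modifiers write through to the underlying url...
        us.pop_back();

        // ...and the change is visible through every reference to it.
        assert( u.encoded_path() == "/path/to" );
        assert( us2.size() == 2 );
    }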
> I am still not convinced that the containers that maintain these invariants should be lazy. That still seems weird to me. If they own the data, and are regular types, they should probably be eager.
Here is where I am a little lost. When you say "the containers that maintain these invariants" are you referring to `segments` or `url`? Because `segments` does not actually implement any of the business logic required to modify the path. All of that is delegated to private implementation details of `url` (or more correctly: `url_base`). Perhaps when you say "if they own the data", the term "they" refers to the `url`? Even in that case, laziness is required, because this library only stores URLs in their serialized form. There were earlier designs which did it differently but it became apparent very quickly that the tradeoffs were not favorable. I'm not quite sure what "eager" means in this context.
> The larger issue to me is that they have a subset of the expected STL API.
Right, well we implemented as much STL-conforming API as possible under the constraint that the URL is stored in its serialized form. Really I consider myself more of an explorer than a designer here: once we made the design choice that the URL would be stored serialized, the remainder of the API design and implementation was an exercise in discovering the consequences of that choice and how close to the standard containers we could make the interfaces. Matching the STL API exactly would require giving up one or more things that we currently have. This is possible, but we leave that decision up to the user (by copying the data into a new std container).
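For concreteness, the kind of copy I mean is just ordinary construction from the ranges, nothing library-specific (a sketch; I am assuming the params element type exposes .key/.value members, so check that against the headers):

    #include <boost/url.hpp>
    #include <string>
    #include <utility>
    #include <vector>

    int main()
    {
        boost::urls::url u( "http://www.example.com/a/b/c?x=1&y=2" );

        // Eager, owning copy of the path: a regular std::vector with
        // the full STL API, at the cost of one allocation per segment.
        auto ss = u.segments();
        std::vector< std::string > segs( ss.begin(), ss.end() );

        // Same idea for the query parameters.
        std::vector< std::pair< std::string, std::string > > ps;
        for( auto p : u.params() )
            ps.emplace_back( p.key, p.value );
    }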
> That's kind of my point. You have a parsing minilib that is useful for URL parsing, but not *general use*. If that's the case, I think you should present it as that, and not a general use parsing lib.
Yes, you are right about this, it is not for general use. It is specifically designed for implementing the ABNF grammars found in protocol-related RFCs such as rfc3986 (which defines the URL grammar used in Boost.URL), as well as grammars for non-well-known schemes, HTTP messages, HTTP fields, and Websocket fields. For example, consider this grammar (from rfc7230):

    Transfer-Encoding  = 1#transfer-coding

    transfer-coding    = "chunked"   ; Section 4.1
                       / "compress"  ; Section 4.2.1
                       / "deflate"   ; Section 4.2.2
                       / "gzip"      ; Section 4.2.3
                       / transfer-extension
    transfer-extension = token *( OWS ";" OWS transfer-parameter )
    transfer-parameter = token BWS "=" BWS ( token / quoted-string )

A downstream library like not-yet-proposed-for-boost.HTTP.Proto could use this minilib thusly:

    constexpr auto transfer_encoding_rule = list_rule( transfer_coding_rule, 1 );

https://github.com/CPPAlliance/http_proto/blob/f2382d8eab8be2e9d6e6e14c5502d...

There's a lot going on here behind the scenes. HTTP defines the list-rule, which is a comma-separated sequence of elements where, for legacy reasons, extra unnecessary commas and whitespace can appear anywhere between the elements. In ABNF the list-rule is denoted by the hash character in the Transfer-Encoding grammar above ("one or more of transfer-coding"). Boost.URL provides the lazy range which allows the downstream library to express the list-rule as a ForwardRange of transfer_coding. This allows the caller to iterate the list elements in the Transfer-Encoding value without allocating memory for each element. There is a recurring theme here - I use lazy ranges to defer memory allocation :)

This goes back to Beast, which offers a crude and quite frankly inelegant set of lazy parsing primitives. I took that concept and formalized it in Boost.URL, and used the same principle to let users opt in to interpreting the path and query as ranges of segments and params respectively.
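Sketching the caller's side of that (transfer_encoding_rule and the transfer_coding value type belong to the downstream library, so take the names below as assumptions rather than working code against the Boost.URL headers):

    // Count the codings in a Transfer-Encoding field value without
    // allocating per element. transfer_encoding_rule is the downstream
    // rule shown above, not something Boost.URL itself provides.
    std::size_t
    count_transfer_codings( string_view v )
    {
        auto rv = grammar::parse( v, transfer_encoding_rule );
        if( ! rv )
            return 0;

        // *rv is a lazy ForwardRange over the original text: each
        // increment re-scans forward to the next list element, so
        // iteration performs no per-element allocation.
        std::size_t n = 0;
        for( auto const& tc : *rv )
        {
            (void) tc;   // a transfer_coding, itself a view into v
            ++n;
        }
        return n;
    }

The path and query ranges returned by segments() and params() have this same shape.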
Now, this is a downstream library, so you might wonder what this has to do with URLs. Well, Boost.URL is designed to handle ALL URLs. This includes the well-known hierarchical schemes like http and file, but also opaque schemes, of which there are uncountably many, as these schemes are often private or unpublished. However, the library can't possibly know how to decompose URLs into the parts defined by these schemes. In order to do this, a user has to write a parsing component which understands the scheme. We will use the mailto scheme as an example.

First let me point out that ALL URLs which use the mailto scheme are still URLs. They follow the generic syntax, and Boost.URL is capable of parsing them with no problem - since it can parse all URLs no matter the scheme. But users who want to do a deep dive into the mailto scheme can't be satisfied merely with parsing a mailto URL. They want it decomposed, and obviously Boost.URL can't do that in the general case because every scheme is different. Here is the syntax of a mailto URI:

    mailtoURI    = "mailto:" [ to ] [ hfields ]
    to           = addr-spec *( "," addr-spec )
    hfields      = "?" hfield *( "&" hfield )
    hfield       = hfname "=" hfvalue
    hfname       = *qchar
    hfvalue      = *qchar
    addr-spec    = local-part "@" domain
    local-part   = dot-atom-text / quoted-string
    domain       = dot-atom-text / "[" *dtext-no-obs "]"
    dtext-no-obs = %d33-90 /  ; Printable US-ASCII
                   %d94-126   ; characters not including
                              ; "[", "]", or "\"
    qchar        = unreserved / pct-encoded / some-delims
    some-delims  = "!" / "$" / "'" / "(" / ")" / "*" / "+"
                 / "," / ";" / ":" / "@"

To begin, a user might write this function:

    result< url_view > parse_mailto_uri( string_view s );

This is easy to implement at first because all mailto URIs are URLs. We might start with this:

    result< url_view >
    parse_mailto_uri( string_view s )
    {
        auto rv = parse_uri( s );
        if( ! rv )
            return rv.error();
        if( ! grammar::ci_is_equal( rv->scheme(), "mailto" ) )
            return error::scheme_mismatch;
        ...
        return *rv;
    }

This is a good start, but it is unsatisfying, because we are getting the "to" fields in the path part of the URL, and Boost.URL doesn't know how to split up the recipients of the mailto since they are comma-separated rather than slash-separated. Remember though, that this is still a valid URL and that Boost.URL can represent it. So now we want to implement this grammar:

    to = addr-spec *( "," addr-spec )

If you expand the addr-spec ABNF rule you will see that it has unreserved character sets, percent-encoding possibilities, and quoted strings. I can't get into all of this here (perhaps it would make a good example for a contributor) but you might start like this:

    constexpr auto addr_spec_rule =
        grammar::tuple_rule(
            local_part_rule,
            squelch( grammar::delim_rule( '@' ) ),
            domain_rule );

Then you would continue to define each of those rules, and eventually you would be able to 1. validate that a particular mailto URL matches the scheme, and 2. decompose the elements of the mailto URL based on the requirements of the scheme itself.

The idea here is to incubate grammar/ in URL, as more downstream libraries get field experience with it, and then propose it as its own library. I'm hoping to see people implement custom schemes, but maybe that's wishful thinking.

Phew...

Thanks