Boost.URL: Something Wicked This Way Comes! (Was: Needs Review)
Folx:

Well, I was on a pause but I picked Boost.URL back up and whipped it into shape. The good news: this Belle is ready for the Ball! I invite you to kick the tires and light the fires on the next major component in my line of network and protocol libraries:

Repository: https://github.com/CPPAlliance/url/
Documentation: https://master.url.cpp.al/

Docs are still being worked on, of course (docs are never "done"). `url_view`, `url`, `static_url`, and `authority_view` should be completely solid and ready to get beat up. And I'm ready for all that lovely negative feedback!!!

If a kind soul would like to volunteer to be the review manager, that would be swell.

I'm very happy to answer any questions about the library, and happy to get feedback regarding the design and implementation.

-- Regards, Vinnie
Follow me on GitHub: https://github.com/vinniefalco
On Tue, Oct 12, 2021 at 10:17 AM Ivan Matek
I see examples work with std::cout, would it work (in C++20 compilers) with fmt::format?
Good question! I'm not sure... the library provides operator<< for all important types. If fmt::format uses operator<< internally, then I suppose it should work. But if it doesn't work, and there's something special that Boost.URL needs to do to make it work, there's no problem adding it - conditionally compiled when C++20 is available. Thanks
Vinnie Falco wrote:
On Tue, Oct 12, 2021 at 10:17 AM Ivan Matek
wrote: I see examples work with std::cout, would it work (in C++20 compilers) with fmt::format?
Good question! I'm not sure... the library provides operator<< for all important types. If fmt::format uses operator<< internally, then I suppose it should work.
No, it doesn't. std::format has its own customization mechanism.
On Tue, Oct 12, 2021 at 10:32 AM Peter Dimov
No, it doesn't. std::format has its own customization mechanism.
The issue may be tracked here: https://github.com/CPPAlliance/url/issues/62 Thanks
Why did you use RFC 3986 as your specification? How do you feel about the WhatWG URL specification at https://url.spec.whatwg.org/? It has a large body of tests at https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestd... that you may consider looking at. Do you have any general plan for a strategy for handling non-ASCII input? I haven't tested it yet, but what do you plan to do if someone passes a UTF-8 encoded non-ASCII string into the constructor?
On Oct 12, 2021, at 8:54 AM, Vinnie Falco via Boost wrote: [original announcement quoted in full, snipped]
On Tue, Oct 12, 2021 at 10:57 AM Alex Christensen
Why did you use RFC 3986 as your specification?
Well, this one seems to be the latest RFC regarding URLs, modulo a bit of HTTP-specific stuff like authority-form appearing in rfc7230. Is there something newer?
How do you feel about the WhatWG URL specification at https://url.spec.whatwg.org?
Quite frankly, I hate it. This "specification" manages to organize and present the information in the most obtuse manner possible. I find it hostile to implementers like myself.
It has a large body of tests ...that you may consider looking at.
Yep, it does! Integrating them is on my to-do list: https://github.com/CPPAlliance/url/tree/c6c4b433c3b1057161b6ce50bb4fba0b5f59...
Do you have any general plan for a strategy for handling non-ASCII input?
Yes, the plan is to reject such input. Strings containing non-ASCII characters are not valid URLs. And even some ASCII characters are not allowed to appear in a URL, for example all control characters.
I haven’t tested it yet, but what do you plan to do if someone passes a UTF-8 encoded non-ASCII string...
...into the constructor? You can't construct a URL from a string; you have to go through one of the parsing functions. This is because the library recognizes several variations of URL grammar, and does not favor any particular grammar by choosing one to support construction. See Table 1.1: https://master.url.cpp.al/url/parsing.html

I think what you're asking is: what if someone supplies a URL which has escaped characters which, when percent-decoding is applied, become valid UTF-8 code point sequences? That's perfectly fine. Percent-encoded URL parts are in fact "ASCII strings." Thanks
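To make the "ASCII strings" point concrete, here is a minimal standalone percent-decoder (a sketch under my own assumptions, not Boost.URL's API) showing that the escaped form is pure ASCII while the decoded bytes can be an arbitrary payload such as UTF-8:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Standalone sketch: decode percent-escapes to raw bytes. For example,
// "%F0%9F%92%A9" decodes to the four UTF-8 bytes of U+1F4A9, even
// though the escaped string itself contains only ASCII characters.
std::string pct_decode_sketch(const std::string& s)
{
    auto hex = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        throw std::invalid_argument("bad hex digit");
    };
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '%') {
            if (i + 2 >= s.size())
                throw std::invalid_argument("truncated escape");
            out += static_cast<char>(hex(s[i + 1]) * 16 + hex(s[i + 2]));
            i += 2;
        } else {
            out += s[i];
        }
    }
    return out;
}
```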
On Oct 12, 2021, at 12:17 PM, Vinnie Falco
wrote: On Tue, Oct 12, 2021 at 10:57 AM Alex Christensen
wrote: Why did you use RFC 3986 as your specification?
Well, this one seems to be the latest RFC regarding URLs, modulo a bit of HTTP-specific stuff like authority-form appearing in rfc7230. Is there something newer?

I would say that the WhatWG URL specification is that something newer, but I sympathize. It is difficult to get started with.
Do you have any general plan for a strategy for handling non-ASCII input?
Yes, the plan is to reject such input. Strings containing non-ASCII characters are not valid URLs. And even some ASCII characters are not allowed to appear in a URL, for example all control characters.

That is certainly a choice you can make, but at some point you may run into issues with people trying to give your library input like http://example.com/💩 and expecting the URL parser to normalize it to http://example.com/%F0%9F%92%A9 for you like it does in some other URL libraries. I see Punycode encoding and decoding doesn't seem to be in the scope of this library; for your use cases that may be fine, and for others that might not be fine. It seems like you're aware of this design choice, though.
Alex Christensen wrote:
That is certainly a choice you can make, but at some point you may run into issues with people trying to give your library input like http://example.com/💩 and expecting the URL parser to normalize it to http://example.com/%F0%9F%92%A9 for you like it does in some other URL libraries.
Assuming this is supported, it raises the question of what the parser would be expected to do with http://example.com/💩/%F0%9F%92%A9. Should it encode the percents, under the assumption that they are literal because everything is non-encoded? Or should it leave the percents as-is and only encode the emoji? If the use case is "the user typed or pasted something into the address bar", I suppose a best-effort DWIM (e.g. option 2) makes more sense. But in other scenarios, option 1 might.
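The two interpretations can be sketched with a hypothetical helper (not part of Boost.URL) whose flag selects between them: option 1 treats the whole input as literal, so '%' itself gets escaped; option 2 preserves '%' on the assumption it introduces an existing escape.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper contrasting the two options discussed above.
// keep_escapes == false: everything is literal, '%' becomes "%25".
// keep_escapes == true:  '%' passes through, only other reserved or
// non-ASCII bytes are escaped.
std::string encode_path_sketch(const std::string& s, bool keep_escapes)
{
    static const char* hex = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        bool plain = std::isalnum(c) != 0 || c == '-' || c == '.' ||
                     c == '_' || c == '~' || c == '/';
        if (plain || (keep_escapes && c == '%')) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 15];
        }
    }
    return out;
}
```

Running both on the mixed input from the example shows how far apart the results are, which is why a library has to pick a policy explicitly.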
On Tue, Oct 12, 2021 at 1:02 PM Alex Christensen
I would say that the WhatWG URL specification is that something newer, but I sympathize. It is difficult to get started with.
I plan to pick through the WhatWG specification and see if there are any tidbits that could have value. The procedural exposition (append this character, execute this algorithm, output this string) makes it very difficult to grasp the higher-level semantics of the thing. The BNF in the RFC is way easier to grok, e.g.:

    scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

and in fact this is exactly how I have decomposed the parsing: into each individual named element described by the RFC.
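As an illustration of how directly that rule transcribes into code, here is a standalone predicate implementing exactly the quoted BNF (a sketch, not the library's parser):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Direct transcription of the RFC 3986 rule:
//   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
// First character must be a letter; the rest may be letters, digits,
// '+', '-', or '.'.
bool is_valid_scheme(const std::string& s)
{
    if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
        return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        unsigned char c = s[i];
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.')
            return false;
    }
    return true;
}
```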
That is certainly a choice you can make, but at some point you may run into issues with people trying to give your library input like http://example.com/💩
I'm not sure how someone would give the library that input. Is this expressible in a string_view?
I see Punycode encoding and decoding doesn’t seem to be in the scope of this library, and for your use cases that may be fine and for others that might not be fine.
I actually have all the punycode algorithms ready including tests: https://github.com/CPPAlliance/url/tree/punycode But after giving it some thought, I couldn't see the use-case for it. Callers who want to perform name resolution on a host can't pass a UTF-8 string; they have to pass the punycode-encoded string. The only use-case that I can discern for punycode is for display purposes, which is out of scope. Unless, do you know of any other use-case?
It seems like you’re aware of this design choice, though.
Yes. Actually, come to think of it, there is a use-case for punycode encoding, and that is to take an international domain name string in UTF-8 encoding and apply punycode encoding. I think... I have no experience with this so some guidance would be helpful. -- Regards, Vinnie Follow me on GitHub: https://github.com/vinniefalco
On Tue, Oct 12, 2021 at 1:02 PM Alex Christensen
at some point you may run into issues with people trying to give your library input like http://example.com/💩 and expecting the URL parser to normalize it to http://example.com/%F0%9F%92%A9
Okay, I think what you're saying is that you will have this string literal:

    string_view s = "http://example.com/\xf0\x9f\x92\xa9";

Unfortunately, this is not a valid URL and I don't think that the library should accept this input. However, you could write this:

    url u = parse_uri( "http://example.com" ).value();
    u.set_path( "/\xf0\x9f\x92\xa9" );

This will produce:

    assert( u.encoded_url() == "http://example.com/%f0%9f%92%a9" );

Is this what you meant? I'm guessing that the turd emoji is inserted into the C++ source file as a UTF-8 encoded code point, so that's what you get in the string literal. Thanks
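A minimal sketch of the escaping step that set_path would perform on that unencoded input (standalone code, assuming lowercase hex to match the assert shown; not the library's actual implementation):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Sketch: escape every byte outside the unreserved set (plus '/',
// since set_path treats the whole string as a path), emitting
// lowercase hex digits.
std::string encode_path_lower(const std::string& s)
{
    static const char* hex = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '/' || c == '-' || c == '.' ||
            c == '_' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 15];
        }
    }
    return out;
}
```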
On Oct 12, 2021, at 3:15 PM, Vinnie Falco
wrote: On Tue, Oct 12, 2021 at 1:02 PM Alex Christensen
wrote: at some point you may run into issues with people trying to give your library input like http://example.com/💩 and expecting the URL parser to normalize it to http://example.com/%F0%9F%92%A9
Okay, I think what you're saying is that you will have this string literal:
string_view s = "http://example.com/\xf0\x9f\x92\xa9";
Unfortunately, this is not a valid URL and I don't think that the library should accept this input.

It is perfectly valid input that some URL libraries I work with accept and percent encode, and some URL libraries I work with reject as an invalid URL. I think it's a valid URL parser input that ought to produce a valid URL, but not everyone agrees on this yet.
However, you could write this:
url u = parse_uri( "http://example.com" ).value();
u.set_path( "/\xf0\x9f\x92\xa9" );
This will produce:
assert( u.encoded_url() == "http://example.com/%f0%9f%92%a9" );
Is this what you meant?

Welcome to the crazy world of URL parsing!
I'm guessing that the turd emoji is inserted into the C++ source file as a utf-8 encoded code point, so that's what you get in the string literal.
Thanks
On Tue, Oct 12, 2021 at 2:52 PM Alex Christensen
It is perfectly valid input that some URL libraries I work with accept and percent encode, and some URL libraries I work with reject it as an invalid URL. I think it’s a valid URL parser input that ought to produce a valid URL, but not everyone agrees on this yet.
Not so fast, I think that this can be decided objectively. A URL in the context of Boost.URL refers to "URI" in the rfc3986 sense. I use URL because most people have never heard of URI. What you are thinking of as a "valid URL parser input" is actually an Internationalized Resource Identifier, which supports the broader universal character set instead of just ASCII and is abbreviated by the even more obscure acronym "IRI." It is covered by rfc3987: https://datatracker.ietf.org/doc/html/rfc3987

Translating your comment, I think you're saying "Boost.URL should support Internationalized Resource Identifiers." That is unfortunately out of scope for the library, as Boost.URL is mostly designed for the exchange of URLs between machines or programs and not necessarily for display to users. Perhaps someday the entire world will have switched to IRIs (maybe after IPv4 is no longer in use) but we are not there yet, and most systems require IRIs to be mapped to their URI equivalent: https://datatracker.ietf.org/doc/html/rfc3987#section-3

There is some value to IRIs but not as much as there is for the ASCII URLs, which fill a tremendous user need (HTTP/WebSocket clients and servers using Beast or Asio). Thanks
On Tue, Oct 12, 2021 at 6:06 PM Vinnie Falco
What you are thinking of as a "valid URL parser input" is actually an Internationalized Resource Identifier, which supports the broader universal character set instead of just ASCII and is abbreviated by the even more obscure acronym "IRI." It is covered by rfc3987:
We looked over this RFC and I think it would be possible to support IRIs simply by providing a new set of parsing functions, for example:

    void parse_iri ( string_view, error_code&, url& );
    void parse_irelative_ref ( string_view, error_code&, url& );
    void parse_absolute_iri ( string_view, error_code&, url& );
    void parse_iri_reference ( string_view, error_code&, url& );

It wouldn't be possible to parse into a url_view, since UTF-8 encoded characters have to be converted to percent-encoded escapes. But this could be made to work, and it fits neatly into the current implementation. There would be an additional function to take a url and convert it back into its IRI string, which is mostly just decoding percent-escaped characters. The library would disallow invalid UTF-8. Thanks
On 13/10/2021 10:15, Vinnie Falco wrote:
Okay, I think what you're saying is that you will have this string literal:
string_view s = "http://example.com/\xf0\x9f\x92\xa9";
Unfortunately, this is not a valid URL and I don't think that the library should accept this input. However, you could write this:
url u = parse_uri( "http://example.com" ).value();
u.set_path( "/\xf0\x9f\x92\xa9" );
Why would set_path accept encoding that parse_uri does not? That sounds like a red flag.

I note in the docs that there is also a set_encoded_path, which implies that set_path might be "unencoded", which might explain it -- but set_path has no documentation, so this is unclear.

On an unrelated note, set_query_part seems wrong: it should have a leading question mark but the BNF specifies a leading hash. Also it claims its input is an encoded string, which seems inconsistently named with set_encoded_query (which in turn mentions fragments in its docs, which also seems wrong).
On Tue, Oct 12, 2021 at 3:28 PM Gavin Lambert via Boost
I note in the docs that there is also a set_encoded_path, which implies that set_path might be "unencoded", which might explain it -- but set_path has no documentation, so this is unclear.
The docs could use work as usual. Yes, there are two functions:

    set_encoded_path
    set_path

For almost all functions that set a string, there is the "encoded" version which requires a valid percent-encoded string, and the plain version which accepts any string and then escapes the reserved characters as needed.
On an unrelated note, set_query_part seems wrong, it should have a leading question mark but the BNF specifies a leading hash. Also it claims its input is an encoded string, which seems inconsistently named with set_encoded_query (which in turn mentions fragments in its docs, which also seems wrong).
Little bit of under-construction going on there, it will be resolved today, thanks for looking! Regards
Le mercredi 13 octobre 2021 à 11:27 +1300, Gavin Lambert via Boost a écrit :
On 13/10/2021 10:15, Vinnie Falco wrote:
Okay, I think what you're saying is that you will have this string literal:
string_view s = "http://example.com/\xf0\x9f\x92\xa9";
Unfortunately, this is not a valid URL and I don't think that the library should accept this input. However, you could write this:
url u = parse_uri( "http://example.com" ).value();
u.set_path( "/\xf0\x9f\x92\xa9" );
Why would set_path accept encoding that parse_uri does not? That sounds like a red flag.
I would not say so. I see two different use cases:

* Parse a URI: the URI must be properly encoded.
* Programmatically build a URI: the different components shall not be encoded. -> u.set_path("%f0%9f%92%a9") will then produce "http://example.com/%25f0%259f%2592%25a9".

However, that seems inconsistent with the way set_host works (the docs say it needs to be encoded), so I tend to agree with the red flag here (or maybe they're just documentation issues, since there is also set_encoded_host). And then there's the issue of the '/' character in set_path if you take unencoded strings (not sure how this should be handled...). Regards, Julien
On 13/10/2021 18:56, Julien Blanc wrote:
And then there's the issue of the '/' character in set_path if you take unencoded strings (not sure how this should be handled...)
Exactly, that's a fundamental problem with the concept of accepting not-yet-encoded strings -- the URL specification does not have uniform encoding; it is not possible to take an arbitrary URL and round-trip it between encoded and decoded forms as a whole (even ignoring equivalent variants like %20 vs + or %-encoding more characters than strictly required).

Where a / occurs inside a single path-component, it must be escaped so as to not be seen as a path-component-separator. And as such, it's not possible to (reliably) pass multiple unencoded path-components as a single string. The same problems occur with query parameters and &= characters, or with ?# characters appearing anywhere. (Where the authority component is not a DNS hostname there may be issues with :@/ characters appearing there too.)

I think the only sane choice here is (if supported at all) to only provide access to unencoded values at the smallest possible unit (i.e. only for a single parameter or a single path component, or a collection of such, but never as a single string). (But to still provide easy access to the encoder/decoder for arbitrary external usage.)
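The "smallest possible unit" approach can be sketched with hypothetical helpers (not Boost.URL API): each segment is escaped on its own, so an embedded '/' becomes data (%2F) rather than a separator when the path is assembled.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Escape one path segment; '/' is NOT in the allowed set here, so a
// slash inside a segment is emitted as %2F.
std::string encode_segment(const std::string& s)
{
    static const char* hex = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 15];
        }
    }
    return out;
}

// Assemble a path from already-separated segments; the joining slashes
// are unambiguous separators because segment contents were escaped.
std::string join_segments(const std::vector<std::string>& segments)
{
    std::string out;
    for (const auto& seg : segments)
        out += "/" + encode_segment(seg);
    return out;
}
```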
On Tue, Oct 12, 2021 at 11:46 PM Gavin Lambert via Boost
it is not possible to take an arbitrary URL and round-trip it between encoded and decoded forms as a whole (even ignoring equivalent variants like %20 vs + or %-encoding more characters than strictly required).
That's right.
Where a / occurs inside a single path-component, it must be escaped so as to not be seen as a path-component-separator. And as such, it's not possible to (reliably) pass multiple unencoded path-components as a single string.
Not exactly. If you call set_path() it treats slash as a separator, since it works with the entire path portion of the URL. If you call segments().push_back( "my/slash" ) then you get the slash percent-encoded. The library APIs are biased towards interpreting the path hierarchically but you can still treat the path as a monolithic string using set_path and set_encoded_path.
The same problems occur with query parameters and &= characters, or with ?# characters appearing anywhere.
There isn't any actual ambiguity here. set_query() treats '&' and '=' as separators. If that is not your intent you can use params().insert() or params().push_back() which will percent-encode these symbols. If you call set_query() with a literal string you will get that literal string as a percent-encoded query in the final URL. If you then call query() you will get back the original unencoded string. Modulo the wrinkle where "+" becomes a space on decoding, but even that can be controlled by the user: https://master.url.cpp.al/url/ref/boost__urls__pct_decode_opts.html
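The params().insert()/push_back() behavior described above can be sketched as standalone helpers (illustrative, not the library's code): '&' and '=' inside keys and values are escaped, so they can never collide with the separators of the assembled query string.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Escape one key or value; '&' and '=' fall outside the allowed set
// and are therefore percent-encoded.
std::string encode_query_component(const std::string& s)
{
    static const char* hex = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 15];
        }
    }
    return out;
}

// Build "k=v&k=v..." from pairs; the literal '&' and '=' emitted here
// are unambiguous separators because component contents were escaped.
std::string make_query(const std::vector<std::pair<std::string, std::string>>& params)
{
    std::string out;
    for (const auto& p : params) {
        if (!out.empty())
            out += '&';
        out += encode_query_component(p.first);
        out += '=';
        out += encode_query_component(p.second);
    }
    return out;
}
```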
(Where the authority component is not a DNS hostname there may be issues with :@/ characters appearing there too.)
Well, there is no set_authority() function yet, only set_encoded_authority(). I haven't looked closely at it but it would work similarly to the others. However, I'm not convinced it is necessary. If you call set_host() you can put any string you want in there and it will be correctly percent-encoded. If there is no user, password, or port, then that percent-encoded string is effectively the entire "authority" - so there is already a way to set the authority to an arbitrary string. Thanks
On Tue, Oct 12, 2021 at 11:24 PM Julien Blanc via Boost
However, that seems inconsistent with the way set_host works (the docs says it needs to be encoded), so i tend to agree with the red flag here (or maybe they're just documentation issues, since there is also set_encoded_host).
Probably a doc issue; there are two functions:

    url::set_host( string_view )
    url::set_encoded_host( string_view )

As with all the APIs, functions with "set_encoded" in the name accept a percent-encoded string and throw an exception on invalid input. "set_host" will apply percent-encoding to the input as needed.
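The validation a set_encoded_* function has to perform before accepting input can be sketched as a standalone check (illustrative, not the library's actual code): every '%' must introduce exactly two hex digits.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Return true if every '%' in s is followed by two hex digits; this is
// the minimal well-formedness check a "set_encoded" setter would run
// before accepting the string (and throw on failure).
bool is_valid_pct_encoding(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] != '%')
            continue;
        if (i + 2 >= s.size() ||
            !std::isxdigit(static_cast<unsigned char>(s[i + 1])) ||
            !std::isxdigit(static_cast<unsigned char>(s[i + 2])))
            return false;
        i += 2;
    }
    return true;
}
```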
And then there's the issue of the '/' character in set_path if you take unencoded strings (not sure how this should be handled...)
Well, there's not much of an issue there. set_path() treats slashes as segment separators. If that's not what you want, you can set the individual unencoded segments through the container returned by url::segments(). The same goes for the query parameters (call url::params()). Thanks
Le 2021-10-13 16:54, Vinnie Falco a écrit :
On Tue, Oct 12, 2021 at 11:24 PM Julien Blanc via Boost
And then there's the issue of the '/' character in set_path if you take unencoded strings (not sure how this should be handled...)
Well, there's not much of an issue there. set_path() treats slashes as segment separators. If that's not what you want, you can set the individual unencoded segments through the container returned by url::segments(). The same goes for the query parameters (call url::params()).
I'm not at ease with that. If I build a mailto: URI and call set_path with an email address containing a / character in it (such as valid/email@example.com), it will result in a URI containing two segments, which is nonsense for a mailto scheme (and obviously not what I expected). I understand the API is fitted toward web URIs, but I see there a potential for hard-to-find bugs due to API misuse. Maybe it's just the naming that needs to be improved. Regards, Julien
Hi Vinnie, Vinnie Falco wrote:
Well, I was on a pause but I picked Boost.URL back up and whipped it into shape.
Thanks for posting that. Assorted comments follow.

Back in 2005-ish I decided to write a URL parser based on the RFC BNF using Boost.Spirit. At that time the most recent RFCs were RFC 2616 (1999) and RFC 2396 (1998). I did a fairly direct translation of the BNF to Spirit and ... it didn't work. It turned out that the BNF in the RFCs was wrong, and had been for 6 years at that point. So I advise that you be careful about directly using the RFC BNF. Have you looked at the RFC 3986 errata at https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0 ? For example, I note that your rule for relative-part on the "Parsing" page doesn't seem to include the path-abempty alternative added in erratum 5428.

I would be interested to see how compile time, run time, and lines-of-code compare between your implementation, a Spirit parser, and a regexp.

You also have the choice of (a) parsing and storing pointers to each component as I believe you do, or (b) parsing to validate but storing only the complete string, and doing a partial parse when necessary to extract components. (b) could be better if you're mostly just passing around complete URLs, storing them in containers, etc.

There's also the question of worst-case complexity; does your parser backtrack? If so, is there a pathological input that causes it to take exponential time? Can you add a complexity guarantee to the docs? (I posted about this in the context of regular expression libraries a few weeks ago.)

This worries me:

    auto url = parse_url(str);

The problem is that url is a view of the string parameter, but nothing warns the user about that and...

    str = something_else;

...now url is dangling. Am I right, or is there something that prevents this? I believe that functions that return views should be named so that that is obvious, and in cases like this the function with the simplest name should be the safer version that returns a value-type. (There was a similar discussion for fixed_string, which originally proposed to return a view from its substr() method.)

Also I note that parse_uri() returns a result<> and doesn't throw. My preference would be to name that e.g. try_parse_uri() and for the simpler-named function to be a wrapper that returns a plain uri or throws on error. My rationale is to make the simplest-named functions provide the easiest-to-use functionality, so that *teaching C++ to a Javascript programmer is not so embarrassing*, while retaining the more performant API variants for those who need them. (Is it parse_uri() or parse_url()? I thought you had decided on URL.)

"pct" doesn't immediately mean "percent" to me. I'd prefer percent_encode() to pct_encode(). If that's too long for you, truncate "encode" to "enc". Or maybe I would call this "escaping" rather than "encoding". Actually I think the RFC terminology might be "URL encoding"; I don't know if I prefer that but if you want "strict adherence to the RFCs"...

It doesn't look like your url type is comparable. I think it should be. The comparisons need to be careful about case-sensitivity. And it should be hashable.

I recently had to implement request signing for AWS. This requires canonicalisation of URLs. There are some subtleties in percent encoding: you need to be able to control exactly which characters are escaped (I think you allow this); you need to be able to control whether + is used to escape space (do you?); and you need to be able to control whether the hex characters are upper or lower case (do you?). This code also needed to sort the query parameters by key; that raises the question of whether the query is a sequence or an associative container of key-value pairs, and if an associative container, what the comparison function is.

Actually you might like to have a look at the AWS SDK URI class: http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_http_1_1_u_r_i.html One thing that I believe it does is to set the port to a default based on the scheme, if it's not specified in the URL string. I'd say that's a helpful thing to do. Are there any other URI classes that you could compare/contrast with?

I would hope that the path would be similar to a std::filesystem::path, i.e. it should use the same vocabulary where possible (or could even actually be a std::filesystem::path, on C++17). Conversion between std::filesystem::path and file: URLs would be nice to have.

In your table 1.7, "Return the type of host specified, if any." should I think say "Return the host, if any."

Your percent-decoding methods return strings. I'm a bit surprised that you've not tried to return lazy decoding views, at least as an option.

Anyway that's enough ramblings for now.... Regards, Phil.
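The default-port behavior mentioned above can be sketched as a simple scheme-to-port lookup (illustrative table of well-known ports; this is an assumption about what such a feature would do, not Boost.URL behavior):

```cpp
#include <cassert>
#include <string>

// Return the well-known default port for a scheme, or 0 when the
// scheme has no default in this (deliberately small) table.
int default_port(const std::string& scheme)
{
    if (scheme == "http" || scheme == "ws")   return 80;
    if (scheme == "https" || scheme == "wss") return 443;
    if (scheme == "ftp")                      return 21;
    return 0; // no well-known default
}
```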
On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost
Have you looked at the RFC 3986 errata at https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0
ruh roh.. I have not gotten to that yet but yeah. I see problems :) Not difficult problems, but there are relevant errata that must be fixed before any review. Thanks!
? For example, I note that your rule for relative-part on the "Parsing" page doesn't seem to include the path-abempty alternative added in erratum 5428.
Thanks, I need to update the javadocs and exposition. Although, I think the parsing is still correct despite not showing path-abempty. I will add more tests.
I would be interested to see how compile time, run time, and lines-of-code compare between your implementation, a Spirit parser, and a regexp.
Someone volunteered to do that so I might explore it at one point.
You also have the choice of (a) parsing and storing pointers to each component as I believe you do,
Right, the library keeps a small table of offsets to various parts. For the modifiable containers I plan to also keep a table for the segments and query parameters, to allow O(1) access of indexed elements (right now it's linear). I think this is a good balance for accessing parts of the URL and helping speed up mutation operations.
(b) could be better if you're mostly just passing around complete URLs, storing them in containers, etc.
Well, in url the table adds overhead. If you don't want that overhead then... just store the URL string :) and parse it again when you need it, this is equivalent to your (b), and less impact on the library.
There's also the question of worst-case complexity; does your parser backtrack? If so, is there a pathological input that causes it to take exponential time?
I have done no profiling or optimization to speak of, except some minor design work to facilitate the use of character-based SIMD algorithms later. I don't have a monolithic "parser" per se; there are classes called "bnf elements" which can be used in the generic bnf::parse function. These elements call parse on child elements recursively. I do backtrack. For example, URI-reference is defined as:

    URI-reference = URI / relative-ref

The class uri_reference_bnf implements this by first attempting to parse the scheme and colon, and if that fails then it backtracks and attempts parsing a "relative-ref." You can see this here: https://github.com/CPPAlliance/url/blob/680f6db547ffc3367e1ba93bae29e4c9278f...

I'm not really sure how this can be improved; you need to know if there's a scheme, and that can only be determined by scanning forward until reaching a colon or an invalid character for a scheme. In theory this could scan the entire input. Thus I have not put any effort into making this better.
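The scan-forward decision described above can be sketched as a standalone function (illustrative; uri_reference_bnf itself lives in the repository):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Walk from the start of the input: a ':' preceded only by valid
// scheme characters means "URI with a scheme"; any other character
// means the input must be parsed as a relative-ref instead. The worst
// case scans the whole input, which is the backtracking cost discussed.
bool starts_with_scheme(const std::string& s)
{
    if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
        return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        unsigned char c = s[i];
        if (c == ':')
            return true; // complete scheme followed by colon
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.')
            return false; // not a scheme character: relative-ref
    }
    return false; // ran out of input without finding a colon
}
```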
Can you add a complexity guarantee to the docs? (I posted about this in the context of regular expression libraries a few weeks ago.)
Maybe... I did it for JSON because performance is critical and the difference between a poorly performing library and a greatly performing library is massive. I'm not sure that URL parsing is the same. URLs tend to be short and it is sort of difficult to really get bad performance out of parsing them even if you go a character at a time. Unlike JSON you don't need to convert between base 2 and base 10 for numbers (which is a headache). Mutations always give the strong exception safety guarantee and it also guarantees only one memmove, I will probably document that. I can give you a better answer in the future when it is at the point where benchmarks are explored.
auto url = parse_url(str);
The problem is that url is a view of the string parameter, but nothing warns the user about that and...
Well, it is documented profusely. I agree there is some danger here but with great danger comes great power. Or something about great responsibility?
..now url is dangling. Am I right, or is there something that prevents this?
No, you are right; in the statement below `uv` references the string and does not take ownership:

    url_view uv = parse_uri( "https://example.com/" ).value();

However, url is constructible from url_view and url DOES make a copy with ownership, so this statement also works and doesn't have the problem of dangling:

    url u = parse_uri( "https://example.com/" ).value();
I believe that functions that returns views should be named so that that is obvious, and in cases like this the function with the simplest name should be the safer version that returns a value-type.
These algorithms only return views, because there are a few different owning URL containers and they can all be constructed from the views. Overloading on return type doesn't really work and some containers may have template parameters or allocators which makes APIs to parse into them awkward. If we're debating the name though, it means we have accepted the general design of the library :)
Also I note that parse_uri() returns a result<> and doesn't throw. My preference would be to name that e.g. try_parse_uri() and for the simpler-named function to be a wrapper that returns a plain uri or throws on error. My rationale is to make the simplest-named functions provide the easiest-to-use functionality, so that *teaching C++ to a Javascript programmer is not so embarrassing*, while retaining the more performant API variants for those who need them.
Well the whole point of boost::system::result (the new type that Boost.URL uses) is to cut down on those extra names and overloads. Your suggestion would keep the additional complexity of result and also increase the number of overloads which isn't sounding like a clear cut win to me... although, this "result" return type is sort of new.
(Is it parse_uri() or parse_url()? I thought you had decided on URL.)
I use the term URL generally, but I use the precise wording when referring to grammar. "parse_uri" uses the URI grammar specified in RFC 3986, modulo any errata fixes which I have yet to apply.
"pct" doesn't immediately mean "percent" to me.
The grammar says "pct-encoded"; I followed that. I think there is sufficient use of the abbreviation that it is justified, but I could be convinced otherwise if there is new data.
It doesn't look like your url type is comparable. I think it should be. The comparisons need to be careful about case-sensitivity.
And it should be hashable.
You can track those issues here: https://github.com/CPPAlliance/url/issues/64 https://github.com/CPPAlliance/url/issues/65
I recently had to implement request signing for AWS. This requires canonicalisation of URLs. There are some subtleties in percent encoding: you need to be able to control exactly which characters are escaped (I think you allow this); you need to be able to control whether + is used to escape space (do you?)
Yes and yes.
and you need to be able to control whether the hex characters are upper or lower case (do you?).
But no to this... we can add it though, there's a "pct_encode_opts" structure: https://master.url.cpp.al/url/ref/boost__urls__pct_encode_opts.html
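As an illustration of the knob being requested (this is a hand-rolled sketch, not the library's pct_encode_opts), hex case comes down to a single table choice in an encoder:

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Sketch only: encode everything outside the RFC 3986 "unreserved" set,
// with a flag selecting the case of the hex digits.
std::string pct_encode(std::string_view s, bool lower_case_hex)
{
    char const* hex = lower_case_hex ? "0123456789abcdef"
                                     : "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s)
    {
        // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
        if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~')
            out += static_cast<char>(c);
        else
        {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0xF];
        }
    }
    return out;
}
```

RFC 3986 section 6.2.2.1 normalizes percent-encodings to uppercase hex, which is presumably why uppercase is the natural default; a lowercase option is still needed for consumers that demand it.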
This code also needed to sort the query parameters by key; that raises the question of whether the query is a sequence or an associative container of key-value pairs, and if an associative container, what the comparison function is.
The library treats the query parameters as a ForwardRange (url_view) or RandomAccessRange (url) of key/value pairs, allowing for duplicate keys. The functions find/find_next can be used to visit all values with the same key. If you are using the "encoded" params container (by calling encoded_params()) then comparisons are bitwise. If you are using the "plain" params container (by calling params()) which decodes everything for you, then the comparison is made using notional percent-decoding. That is, your decoded key is compared against each container key as-if it had percent-decoding applied.
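The "notional percent-decoding" comparison can be done in one pass over the encoded key, without allocating a decoded copy. A rough sketch of the idea (invented helper names, assuming well-formed two-digit escapes; not the library's implementation):

```cpp
#include <cstddef>
#include <string_view>

// Value of a hex digit, or -1 if the character is not one.
int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Compare a plain (decoded) key against a percent-encoded key as if the
// latter were decoded, character by character, without allocating.
bool decoded_equal(std::string_view plain, std::string_view encoded)
{
    std::size_t i = 0, j = 0;
    while (j < encoded.size())
    {
        char c;
        if (encoded[j] == '%' && j + 2 < encoded.size())
        {
            int hi = hexval(encoded[j + 1]);
            int lo = hexval(encoded[j + 2]);
            if (hi < 0 || lo < 0)
                return false; // malformed escape
            c = static_cast<char>(hi * 16 + lo);
            j += 3;
        }
        else
        {
            c = encoded[j++];
        }
        if (i >= plain.size() || plain[i] != c)
            return false;
        ++i;
    }
    return i == plain.size();
}
```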
I would hope that the path would be similar to a std::filesystem::path i.e. it should use the same vocabulary where possible (or could even actually be a std::filesystem::path, on C++17).
OOF I hope never to see this. std::filesystem is kind of a mess. These are the path containers:

https://master.url.cpp.al/url/ref/boost__urls__segments.html
https://master.url.cpp.al/url/ref/boost__urls__segments_encoded.html
https://master.url.cpp.al/url/ref/boost__urls__segments_view.html
https://master.url.cpp.al/url/ref/boost__urls__segments_encoded_view.html

The "view" containers are read-only and reference the character buffer without ownership; these containers are obtained by calling functions on url_view. The other containers allow modification (the iterators are still read-only, because a proxy-assignment implementation proved to be clumsy), and these containers are obtained by calling functions on url.
Conversion between std::filesystem::path and file: URLs would be nice to have.
Yeah something like that is possible, falling back to boost::filesystem on C++11.
In your table 1.7, "Return the type of host specified, if any." should I think say "Return the host, if any."
hmm, yeah - the function "host_type()" is missing from that list (it returns the type of host).
Your percent-decoding methods return strings. I'm a bit surprised that you've not tried to return lazy decoding views, at least as an option.
Now that is a very interesting idea! I didn't think of it. But it's hard to see the use case; what would you do with a lazy view other than unroll it into a linear buffer? Thanks
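One answer to the use-case question: a lazy view lets a caller compare, search, or stream a decoded value without ever unrolling it. A toy sketch of what such a view might look like (invented type, not a proposed interface; assumes well-formed escapes):

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Toy lazy percent-decoding view: decodes one character at a time on
// demand, so consumers only pay for a buffer when they actually want one.
class pct_decoded_view
{
    std::string_view s_;

    static int hexval(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return 0; // assumes well-formed input
    }

public:
    explicit pct_decoded_view(std::string_view s) : s_(s) {}

    // Visit each decoded character in order, without allocating.
    template<class F>
    void for_each(F f) const
    {
        for (std::size_t i = 0; i < s_.size();)
        {
            if (s_[i] == '%' && i + 2 < s_.size())
            {
                f(static_cast<char>(hexval(s_[i + 1]) * 16 + hexval(s_[i + 2])));
                i += 3;
            }
            else
                f(s_[i++]);
        }
    }

    // Unroll only when a linear buffer is actually needed.
    std::string to_string() const
    {
        std::string out;
        for_each([&](char c) { out += c; });
        return out;
    }
};
```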
On 14/10/2021 05:02, Vinnie Falco via Boost wrote:
On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott wrote:
Have you looked at the RFC 3986 errata at https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0
ruh roh.. I have not gotten to that yet but yeah. I see problems :) Not difficult problems, but there are relevant errata that must be fixed before any review. Thanks!
Bear in mind that most of those are Reported and not Verified, so they should be treated as non-authoritative comments. For example, the suggested amendment for parsing too many .. path components seems incorrect and the original is more correct. (Unmatched .. components should not be left in a canonical URL.)
On Wed, Oct 13, 2021 at 3:52 PM Gavin Lambert via Boost
Bear in mind that most of those are Reported and not Verified, so they should be treated as non-authoritative comments.
Never a dull moment!!
For example, the suggested amendment for parsing too many .. path components seems incorrect and the original is more correct. (Unmatched .. components should not be left in a canonical URL.)
I'm still working on that part. But I think that the reg-name change is legit? The addition of path-abempty doesn't actually change anything it seems, but I have updated the documentation. Thanks
On 14/10/2021 11:59, Vinnie Falco wrote:
On Wed, Oct 13, 2021 at 3:52 PM Gavin Lambert wrote:
Bear in mind that most of those are Reported and not Verified, so they should be treated as non-authoritative comments.
Never a dull moment!!
For example, the suggested amendment for parsing too many .. path components seems incorrect and the original is more correct. (Unmatched .. components should not be left in a canonical URL.)
I'm still working on that part. But I think that the reg-name change is legit?
The addition of path-abempty doesn't actually change anything it seems, but I have updated the documentation.
I have no particular opinion on path-abempty; that one might be ok. But the suggested change to .. parsing is actively harmful and you should not do it -- a canonical path should never contain .. components, especially not leading ones.
On Wed, Oct 13, 2021 at 4:09 PM Gavin Lambert via Boost
...
Given the following path:

    /path/to/file.txt

What should iterating the segments return?

    { "path", "to", "file.txt" } or { "/path", "/to", "/file.txt" }

or something else? Given this path:

    my/download/folder/

What should iterating the segments return?

    { "my", "download", "folder", "" } or { "my/", "download/", "folder/" } or { "my", "/download", "/folder", "/" }

or something else? Given an empty relative-ref (empty string) in a url u, what is the encoded path after executing this statement?

    u.segments().push_back( "index.htm" );

Anyone feel free to answer! Thanks
On 10/14/21 2:52 AM, Vinnie Falco via Boost wrote:
On Wed, Oct 13, 2021 at 4:09 PM Gavin Lambert via Boost
wrote: ...
Given the following path:
/path/to/file.txt
What should iterating the segments return?
{ "path", "to", "file.txt" } or { "/path", "/to", "/file.txt" }
or something else?
Given this path:
my/download/folder/
What should iterating the segments return?
{ "my", "download", "folder", "" } or { "my/", "download/", "folder/" } or { "my", "/download", "/folder", "/" }
or something else?
Given an empty relative-ref (empty string) in a url u, what is the encoded path after executing this statement?
u.segments().push_back( "index.htm" );
Anyone feel free to answer!
If std::filesystem::path is an example to follow, then path elements should not contain path separators, except the root directory. The trailing separator is indicated by an empty final element. So:

    /my/download/folder/

would correspond to:

    { "/", "my", "download", "folder", "" }

and

    /path/to/file.txt

to

    { "/", "path", "to", "file.txt" }

I think URL paths must always have the root directory, as the leading slash is a necessary separator from the host and port, so you could probably omit it from the list of elements.
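For reference, this matches what iterating a std::filesystem::path actually produces in C++17: the root directory appears as its own element, and a trailing separator yields a trailing empty element. A quick check (fs_elements is just a helper for this demonstration):

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Collect the elements produced by iterating a std::filesystem::path,
// using generic (forward-slash) formatting for portability.
std::vector<std::string> fs_elements(std::string const& s)
{
    std::vector<std::string> out;
    for (auto const& part : std::filesystem::path(s))
        out.push_back(part.generic_string());
    return out;
}
```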
On 10/14/21 3:25 AM, Andrey Semashev wrote:
On 10/14/21 2:52 AM, Vinnie Falco via Boost wrote:
On Wed, Oct 13, 2021 at 4:09 PM Gavin Lambert via Boost
wrote: ...
Given the following path:
/path/to/file.txt
What should iterating the segments return?
{ "path", "to", "file.txt" } or { "/path", "/to", "/file.txt" }
or something else?
Given this path:
my/download/folder/
What should iterating the segments return?
{ "my", "download", "folder", "" } or { "my/", "download/", "folder/" } or { "my", "/download", "/folder", "/" }
or something else?
Given an empty relative-ref (empty string) in a url u, what is the encoded path after executing this statement?
u.segments().push_back( "index.htm" );
Anyone feel free to answer!
If std::filesystem::path is an example to follow then path elements should not contain path separators, except the root directory. The trailing separator is indicated by an empty final element. So:
/my/download/folder/
would correspond to:
{ "/", "my", "download", "folder", "" }
and
/path/to/file.txt
to
{ "/", "path", "to", "file.txt" }
I think, URL paths must always have the root directory as the leading slash is a necessary separator from the host and port, so you could probably omit it from the list of elements.
Two corner cases to consider are when you have no path at all, and when you have a root directory as the path. Those two cases have to be distinct in terms of the observed sequence of elements. I.e.

    https://example.com?foo=42

should be different from

    https://example.com/?foo=42

in terms of observed elements of the path.
On Wed, Oct 13, 2021 at 5:32 PM Andrey Semashev via Boost
https://example.com?foo=42
should be different from
https://example.com/?foo=42
Well... in your specific case, no, because normalization of the https scheme inserts "/" when the path is zero length. Although this is scheme-specific. Generally speaking, there is always a path; it can be zero length. This is different from the query and fragment, which may or may not exist. A query can exist and be zero length, or it can not exist at all. Thanks
On 10/14/21 4:15 AM, Vinnie Falco wrote:
On Wed, Oct 13, 2021 at 5:32 PM Andrey Semashev via Boost
wrote: https://example.com?foo=42
should be different from
https://example.com/?foo=42
Well... in your specific case, no, because normalization of the https scheme inserts "/" when the path is zero length. Although this is scheme-specific. Generally speaking, there is always a path; it can be zero length. This is different from the query and fragment, which may or may not exist. A query can exist and be zero length, or it can not exist at all.
Ok. But do other schemes make the same requirement? And does Boost.URL support those? My point is that if there is a case supported by Boost.URL where an empty path and a path of "/" are considered distinct, Boost.URL has to recognize and expose that distinction in its path elements sequence. For example, an empty path would correspond to an empty sequence, and the path of "/" would correspond to a single element sequence, with the element being "".
On Wed, Oct 13, 2021 at 5:25 PM Andrey Semashev via Boost
{ "/", "my", "download", "folder", "" } ... { "/", "path", "to", "file.txt" }
These are not really great because then elements are no longer homogeneous. That is, "/" is allowed in the front element but not the others. So

    u.segments().insert( u.segments().begin(), "/" );

is okay, but these are not:

    u.segments().insert( u.segments().begin(), "home" );
    u.segments().push_back( "/" );

Thanks
On 10/14/21 4:13 AM, Vinnie Falco wrote:
On Wed, Oct 13, 2021 at 5:25 PM Andrey Semashev via Boost
wrote: { "/", "my", "download", "folder", "" } ... { "/", "path", "to", "file.txt" }
These are not really great because then elements are no longer homogeneous. That is, "/" is allowed in the front element but not the others. So
u.segments().insert( u.segments().begin(), "/" );
is okay, but these are not:
u.segments().insert( u.segments().begin(), "home" );
u.segments().push_back( "/" );
As I said, you're probably fine omitting the root directory.
On 14/10/2021 12:52, Vinnie Falco wrote:
Given the following path:
/path/to/file.txt
What should iterating the segments return?
{ "path", "to", "file.txt" } or { "/path", "/to", "/file.txt" }
or something else?
It would have to be the former, because supplying the latter as input to the unencoded variants would presumably try to %-encode the slash (or if it didn't, you'd have problems dealing with slashes elsewhere in the string). Also, part of the point of splitting paths into subcomponents is so that you don't have to think about the path separators.

Some URI libraries will also explicitly iterate "" or "/" as the first component (by itself) to indicate the difference between absolute and relative URIs, although that could be indicated another way instead. In which case it might be best to render the absolute path as:

    { "/", "path", "to", "file.txt" }
Given this path:
my/download/folder/
What should iterating the segments return?
{ "my", "download", "folder", "" } or { "my/", "download/", "folder/" } or { "my", "/download", "/folder", "/" }
or something else?
Again, the first.
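The rule being converged on here (strip one leading slash, track absoluteness separately, split on '/', keep an empty final segment for a trailing slash) fits in a few lines. A sketch with an invented name, deliberately ignoring the dot-segment cases discussed later in the thread:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Split a path into segments: one leading '/' marks an absolute path and
// is not itself a segment; a trailing '/' yields a trailing empty
// segment. Dot-segments ("." and "..") are not normalized here.
std::vector<std::string> split_segments(std::string_view path, bool& absolute)
{
    absolute = !path.empty() && path.front() == '/';
    if (absolute)
        path.remove_prefix(1);

    std::vector<std::string> segs;
    if (path.empty())
        return segs; // "" and "/" both iterate as zero segments

    std::size_t pos;
    while ((pos = path.find('/')) != std::string_view::npos)
    {
        segs.emplace_back(path.substr(0, pos));
        path.remove_prefix(pos + 1);
    }
    segs.emplace_back(path);
    return segs;
}
```

Note that with this rule the absoluteness flag plays the role of is_absolute(): "/" and "" both produce zero segments and are distinguished only by the flag.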
Given an empty relative-ref (empty string) in a url u, what is the encoded path after executing this statement?
u.segments().push_back( "index.htm" );
Anyone feel free to answer!
I would imagine if u.segments() == { "my", "download", "folder", "" } before the push_back, then after it == { "my", "download", "folder", "index.htm" }. But that would be a special case only when the trailing component is empty. The app would have to subsequently push_back("") if it was intending to leave a trailing final slash (or just push_back the next component if it wasn't finished yet).

But either way it's the app's responsibility to explicitly indicate whether the URL has a trailing slash or not; they are semantically different and a library cannot decide that for it. Basically it ought to behave somewhat consistently with the URI resolution rules in section 5 of the RFC (except for not replacing a trailing non-empty component).

(On that note, url::resolve is undocumented and doesn't seem to have the correct signature -- having both a base and relative parameter only makes sense if it's static or free, but it appears to be declared as an instance method. Unless the instance is only used as a completely-replaced output, but that seems weird and a free function with return value makes more sense to me.)
On Wed, Oct 13, 2021 at 5:56 PM Gavin Lambert via Boost
(On that note, url::resolve is undocumented and doesn't seem to have the correct signature -- having both a base and relative parameter only makes sense if it's static or free, but it appears to be declared as instance. Unless the instance is only used as a completely-replaced output, but that seems weird and a free function with return value makes more sense to me.)
Yes, `this` is used as a "completely-replaced output." A free function with a return value doesn't work, because there are a few different flavors of url (url, static_url, and possibly a couple more if users request them). They differ in how they obtain their storage, but they use url as a base class. So having `url::resolve` as a member function makes it all work out.
On Wed, Oct 13, 2021 at 5:56 PM Gavin Lambert via Boost
part of the point of splitting paths into subcomponents is so that you don't have to think about the path separators.
Yeah.
Some URI libraries will also explicitly iterate "" or "/" as the first component (by itself) to indicate the difference between absolute and relative URIs, although that could be indicated another way instead.
Currently this is done with:

    // returns true if the path has a leading slash ('/')
    bool segments_view::is_absolute() const noexcept;
In which case it might be best to render the absolute path as: ... { "/", "path", "to", "file.txt" }
The problem comes with defining the mutation API. How does the user put a URL into the state above using the segments container interface?
I would imagine if u.segments() == { "my", "download", "folder", "" } before the push_back then after == { "my", "download", "folder", "index.htm" }.
But that would be a special case only when the trailing component is empty.
I didn't even think of this, but that's yet another wrinkle.. lol
The app would have to subsequently push_back("") if it was intending to leave a trailing final slash (or just push_back the next component if it wasn't finished yet). But either way it's the app's responsibility to explicitly indicate whether the URL has a trailing slash or not; they are semantically different and a library cannot decide that for it.
Given the current container-like interface for modifiable encoded segments:

https://master.url.cpp.al/url/ref/boost__urls__segments_encoded.html

What might that API look like that allows the caller to indicate where the slashes are? Thanks
On Wed, Oct 13, 2021 at 6:19 PM Vinnie Falco
...
Thinking out loud here, it seems that the segments container models the path as

    vector< string >

when it actually needs to model it as

    struct { bool leading_slash; vector< string > };

Should I just add this function

    void segments::set_absolute( bool path_should_be_absolute );

and keep

    bool segments::is_absolute() const noexcept;

?
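For what it's worth, that model round-trips the example paths in this thread. A tiny sketch of the flag-plus-vector representation and its rendering rule (path_model is an invented name, and real code would also percent-encode each segment):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The path as a flag plus a segment list; '/' never appears inside a
// segment, so the elements stay homogeneous.
struct path_model
{
    bool leading_slash = false;        // plays the role of is_absolute()
    std::vector<std::string> segments;

    // Render the encoded path: optional leading slash, then segments
    // joined by '/'. A trailing empty segment produces a trailing slash.
    std::string encoded() const
    {
        std::string s;
        if (leading_slash)
            s += '/';
        for (std::size_t i = 0; i < segments.size(); ++i)
        {
            if (i != 0)
                s += '/';
            s += segments[i];
        }
        return s;
    }
};
```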
On 14/10/2021 14:28, Vinnie Falco wrote:
Thinking out loud here, it seems that the segments container models the path as
vector< string >
When it actually needs to model the container as
struct { bool leading_slash; vector< string > };
Should I just add this function
void segments::set_absolute( bool path_should_be_absolute );
and keep
bool segments::is_absolute() const noexcept;
?
I guess the answer to that depends on how you want clear-and-multiple-push_back to behave. Does that preserve or alter absoluteness of the path?

Most of my experience with manipulating URIs (other than just using hard-coded strings) is via the .NET Uri class, and FWIW with those:

1. Instances are immutable; the only way to get a new Uri is to parse a new one from scratch or resolve a relative one from a base. (Both the base and the 'relative' can actually be either absolute or relative.)

2. Its 'Segments' property does include trailing slashes in the components -- so { "/", "foo/", "bar/", "index.html" } -- and is also read-only. (And it also doesn't support relative URIs, though I don't know why not, as they wouldn't be ambiguous.)

I'm not saying these are necessarily good choices for your implementation as well; just a point of comparison.
Vinnie Falco wrote:
On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost
wrote:
There's also the question of worst-case complexity; does your parser backtrack? If so, is there a pathological input that causes it to take exponential time?
I have done no profiling or optimization to speak of, except some minor design work to facilitate the use of character-based SIMD algorithms later. I don't have a monolithic "parser" per se; there are classes called "bnf elements" which can be used in the generic bnf::parse function. These elements call parse on child elements recursively. I do backtrack. For example, URI-reference is defined as:
URI-reference = URI / relative-ref
The class uri_reference_bnf implements this by first attempting to parse the scheme and colon, and if that fails then it backtracks and attempts parsing a "relative-ref." You can see this here:
https://github.com/CPPAlliance/url/blob/680f6db547ffc3367e1ba93bae29e4c9278f...
I'm not really sure how this can be improved, you need to know if there's a scheme and that can only be determined by scanning forward until reaching a colon or an invalid character for scheme. In theory this could scan the entire input. Thus I have not put any effort into making this better.
Can you add a complexity guarantee to the docs? (I posted about this in the context of regular expression libraries a few weeks ago.)
Maybe... I did it for JSON because performance is critical and the difference between a poorly performing library and a greatly performing library is massive. I'm not sure that URL parsing is the same. URLs tend to be short and it is sort of difficult to really get bad performance out of parsing them even if you go a character at a time.
data: URLs are a good example of long URLs that actually exist in practice. But what I'm really thinking about is guarding against malicious malformed input. You don't want your parser to have N^2, let alone exp(N), complexity when the input could legitimately be a data: URL of a few kbytes. And there's no need for it to have worse than linear complexity, because I'm pretty sure that the URL syntax is a Regular Language (in the computer science sense, not in the "regular expression is a synonym for search pattern" I'm-a-perl-programmer sense), and all regular languages can by definition be parsed in linear time.

Regarding your example, i.e. "http://foo.com/" vs the relative URL "http.html", the point is that when the parser has reached the 'p' it will be in a state meaning "either this is a scheme or the first segment of the path of a relative URL". The transformation from BNF to a no-backtracking state machine of that sort is what tools like flex and libraries like RE2 do.

Now having said all that, I do suspect that the particular grammar of URLs can be parsed even by a backtracking parser in linear time (with some constant factor), but it will depend on the rules. I think you aspire to making the library "secure", and if that includes not having pathological performance in the presence of malicious input, you need to understand this stuff!
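The scheme-vs-relative-ref ambiguity mentioned above can in fact be settled with one forward scan and no re-reading: stop at the first ':' (a scheme exists) or at the first character outside the scheme alphabet (it cannot be a scheme). A sketch of just that classification step, not of either library's parser:

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// One forward pass: returns the scheme's length if the input begins with
// "scheme:", otherwise 0. No characters are ever re-read.
std::size_t scheme_length(std::string_view s)
{
    if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
        return 0; // a scheme must begin with ALPHA
    for (std::size_t i = 1; i < s.size(); ++i)
    {
        char c = s[i];
        if (c == ':')
            return i; // found "scheme:"
        // scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
        if (!std::isalnum(static_cast<unsigned char>(c))
            && c != '+' && c != '-' && c != '.')
            return 0; // cannot be a scheme: treat as relative reference
    }
    return 0; // ran out of input without a colon
}
```

On "http.html" this scans to the end without finding a colon and reports "no scheme"; a full DFA-style parser would additionally remember the scanned prefix as the first path segment instead of re-parsing it.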
auto url = parse_url(str);
The problem is that url is a view of the string parameter, but nothing warns the user about that and...
Well, it is documented profusely. I agree there is some danger here but with great danger comes great power. Or something about great responsibility?
I don't think this is acceptable. What do others think?
I would hope that the path would be similar to a std::filesystem::path i.e. it should use the same vocabulary where possible (or could even actually be a std::filesystem::path, on C++17).
OOF I hope never to see this. std::filesystem is kind of a mess.
Well, it's not my top favourite library either. But it's familiar, and the authors have considered many of the issues that have been mentioned in other threads, i.e. whether path components include directory separators, distinguishing absolute and relative paths, etc.

Having mentioned data: URLs above, I have tried to understand how these would be handled by your library. (mailto: also.) Note that a data: URL will generally contain /s, e.g. in the content type (e.g. image/jpeg) and also in the base64 data, but these are not directory separators. I think what's really needed is a specific getter/setter for the body of an "unconventional" URL scheme like these. It may be possible to use your set_encoded_path() but it's not obvious.

Please fix the docs for set[_encoded]_query(), which currently seem confused with the operations for fragments.

Regards, Phil.
On Thu, Oct 14, 2021 at 2:04 PM Phil Endecott via Boost < boost@lists.boost.org> wrote:
auto url = parse_url(str);
The problem is that url is a view of the string parameter, but nothing warns the user about that and...
Well, it is documented profusely. I agree there is some danger here but with great danger comes great power. Or something about great responsibility?
I don't think this is acceptable. What do others think?
I don't mind. It's no different than std::string and std::string_view. You pay for copies only if you really want to. Maybe having parse_url_as_view(str) and a safer parse_url(str) using the former but not returning a view would satisfy you?
On Thu, Oct 14, 2021 at 6:02 AM Dominique Devienne via Boost
You pay for copies only if you really want to. Maybe having parse_url_as_view(str) and a safer parse_url(str) using the former but not returning a view would satisfy you?
Remember that there are potentially many url types that have ownership of the buffer:

    url
    static_url
    dynamic_url<Allocator>
    pmr_url

Which one would the safe_parse_url() return? Thanks
Dominique Devienne wrote:
On Thu, Oct 14, 2021 at 2:04 PM Phil Endecott via Boost < boost@lists.boost.org> wrote:
auto url = parse_url(str);
The problem is that url is a view of the string parameter, but nothing warns the user about that and...
Well, it is documented profusely. I agree there is some danger here but with great danger comes great power. Or something about great responsibility?
I don't think this is acceptable. What do others think?
I don't mind. It's no different than std::string and std::string_view.
Exactly, it's like returning a string_view from a function - which you should never do, at least not without an obvious indication...
Maybe having parse_url_as_view(str) and a safer parse_url(str) using the former but not returning a view would satisfy you?
Yes, that's exactly what I suggested. The alternative is to try to find some way to return something that can't be assigned to auto. This may be possible with something like a private_url_view class that wraps a view and makes everything private:

    url u = parse_url(s);      // parse_url() returns a private_url_view,
                               // url is constructible from private_url_view
    url_view v = parse_url(s); // url_view is constructible from private_url_view
    auto u = parse_url(s);     // Can we make that fail to compile by making
                               // private_url_view's ctors private?
    // If not...
    auto p = u.path();         // ...we can prevent the user from using it for
                               // anything by hiding all of its methods.
    // But:
    url u2 = u;
    func(u);
    // We need to prevent those too. I think we can prevent assigning from a
    // private_url_view to a url[_view] by qualifying url[_view]'s assignment
    // and copy ctor with &&:
    url::operator=(const private_url_view& v) = delete;
    url::operator=(private_url_view&&) { ... }

I'm sure that smarter people than me already know the correct pattern for doing this. Searching, I found P0672R0, which addresses how to fix similar issues for proxy types and expression templates but doesn't mention views. I also found blog posts by Arthur O'Dwyer talking about string_view, e.g.

https://quuxplusone.github.io/blog/2018/03/27/string-view-is-a-borrow-type/

Quote: "A function may have a borrow type as its return type, but if so, the function must be explicitly annotated as returning a potentially dangling reference. That is, the programmer must explicitly acknowledge responsibility for the annotated function's correctness. Regardless, if f is a function so annotated, the result of f must not be stored into any named variable except a function parameter or for-loop control variable. For example, auto x = f() must still be diagnosed as a violation."

Regards, Phil.
On Fri, Oct 15, 2021 at 5:15 AM Phil Endecott via Boost
Exactly, it's like returning a string_view from a function - which you should never do, at least not without an obvious indication...
All functions in url_view which return encoded parts have a string_view return type...
Maybe having parse_url_as_view(str) and a safer parse_url(str) using the former but not returning a view would satisfy you?
Yes, that's exactly what I suggested.
How would you convert this code to use "safer parse_url(str)"?

    url_view u1 = parse_relative_ref( "/" ).value();
    url u2 = parse_relative_ref( "/" ).value();
    static_url<1024> u3 = parse_relative_ref( "/" ).value();
    pmr_url u4 = parse_relative_ref( "/" ).value();
    dynamic_url< std::allocator<char> > u5 = parse_relative_ref( "/" ).value();
The alternative is to try to find some way to return something that can't be assigned to auto. This may be possible with something like a private_url_view class that wraps a view and makes everything private:
This sounds like an awful lot of try-hard. string_view is one of those controversial types which is going to split people into two camps. I'm of the camp that you have to be careful how you hold the knife. Rather than put the library through contortions with questionable degrees of effectiveness at solving the "problem" I think it is just easier and more straightforward to say that ownership of strings is simply not transferred using parse functions or url_view. Thanks
On 15.10.21 17:13, Vinnie Falco via Boost wrote:
How would you convert this code to use "safer parse_url(str)"?
url_view u1 = parse_relative_ref( "/" ).value();
url u2 = parse_relative_ref( "/" ).value();
static_url<1024> u3 = parse_relative_ref( "/" ).value();
pmr_url u4 = parse_relative_ref( "/" ).value();
dynamic_url< std::allocator<char> > u5 = parse_relative_ref( "/" ).value();
The following would do the trick:

    auto u1 = parse_relative_ref<url_view>( "/" ).value();
    auto u2 = parse_relative_ref<url>( "/" ).value();
    auto u3 = parse_relative_ref<static_url<1024>>( "/" ).value();
    auto u4 = parse_relative_ref<pmr_url>( "/" ).value();
    auto u5 = parse_relative_ref<dynamic_url<std::allocator<char>>>( "/" ).value();
On Fri, Oct 15, 2021 at 9:04 AM Rainer Deyke via Boost
The following would do the trick:
auto u1 = parse_relative_ref<url_view>( "/" ).value();
auto u2 = parse_relative_ref<url>( "/" ).value();
auto u3 = parse_relative_ref<static_url<1024>>( "/" ).value();
auto u4 = parse_relative_ref<pmr_url>( "/" ).value();
auto u5 = parse_relative_ref<dynamic_url<std::allocator<char>>>( "/" ).value();
Yeah, kinda. This is not expert-friendly, but it also does not accommodate this use-case:

    struct connection
    {
        url u_;

        void process( string_view s )
        {
            u_ = parse_uri( s );
            ...
        }
    };

In the code above, the capacity for u_ is reused for each I/O cycle. Thanks
On 15.10.21 18:34, Vinnie Falco via Boost wrote:
On Fri, Oct 15, 2021 at 9:04 AM Rainer Deyke via Boost
wrote: The following would do the trick:
auto u1 = parse_relative_ref<url_view>( "/" ).value();
auto u2 = parse_relative_ref<url>( "/" ).value();
auto u3 = parse_relative_ref<static_url<1024>>( "/" ).value();
auto u4 = parse_relative_ref<pmr_url>( "/" ).value();
auto u5 = parse_relative_ref<dynamic_url<std::allocator<char>>>( "/" ).value();

Yeah, kinda. This is not expert friendly but also, does not accommodate this use-case:
struct connection { url u_;
void process( string_view s ) { u_ = parse_uri( s ); ... } };
In the code above, the capacity for u_ is reused for each I/O cycle.
You can always do this:
struct connection {
url u_;
void process( string_view s )
{
u_ = parse_uri
Vinnie Falco wrote:
On Fri, Oct 15, 2021 at 5:15 AM Phil Endecott via Boost
wrote: Exactly, it's like returning a string_view from a function - which you should never do, at least not without an obvious indication...
All functions in url_view which return encoded parts have a string_view return type...
Right. I think it's OK for a type whose name is *_view to have getters that return views. Actually Arthur O'Dwyer's blog post that I linked to before suggested that it would be OK for a getter method of any object to return a view whose lifetime is the same as the object - I'm not sure that I agree with that:

    auto old_host = url.host();
    url.set_host( new_host ); // oops, old_host is invalid because it was a view.
How would you convert this code to use "safer parse_url(str)"?
url_view u1 = parse_relative_ref( "/" ).value(); url u2 = parse_relative_ref( "/" ).value(); static_url<1024> u3 = parse_relative_ref( "/" ).value(); pmr_url u4 = parse_relative_ref( "/" ).value(); dynamic_url< std::allocator<char>> u5 = parse_relative_ref( "/" ).value();
There are plenty of possibilities, including:

1. Verbose names that indicate the return type.
2. Out parameters for the result.
3. Template parameter for the result type.
4. Make these constructors for the url_* types, not a free function.

These all have pros and cons; none are perfect. Here's another idea that has just occurred to me:

5. Base the return type on the input string type. I.e. if the input is a string_view, return a url_view; if the input is a string with a particular allocator, return a url with the same allocator, etc. (Does this avoid all the possible nasty surprises?)

In particular, if the input is a string&&, you can move that into the string stored by the url:

    url parse_url(std::string&& s)
    {
        url u;
        u.str = std::move(s);
        ... parse u.str ...
        return u;
    }

In this case, if the caller writes:

    auto u = parse_url( get_string_from_somewhere() );

then there is no allocation, and u is a "safe" non-view type. This seems to work for this example:
struct connection
{
    url u_;

    void process( string_view s )
    {
        u_ = parse_uri( s );
        ...
    }
};
s is a string_view, so the result of parse is a url_view, which you can assign to the url without re-allocation (subject to capacity). Regards, Phil
On Sat, Oct 16, 2021 at 8:02 AM Phil Endecott via Boost
...
To all, thanks for all the feedback. I'm not ignoring it, but we discovered a super duper annoying problem with how the path segments and query parameters containers represent their respective parts of the URL, and fixing it is requiring some heroics so I am fixing that up. To keep things spicy this is the unit test that I am trying to make work:

void testPathContainer()
{
    auto const check = []( string_view s,
        std::initializer_list<string_view> init, bool abs)
    {
        url_view u = parse_uri_reference(s).value();
        auto const segs = u.encoded_segments();
        BOOST_TEST(abs == u.is_path_absolute());
        if(! BOOST_TEST(segs.size() == init.size()))
            return;
        BOOST_TEST(equal(segs.begin(), segs.end(),
            init.begin(), init.end()));
    };
    auto const abs = [&check]( string_view s,
        std::initializer_list<string_view> init)
    {
        check(s, init, true);
    };
    auto const rel = [&check]( string_view s,
        std::initializer_list<string_view> init)
    {
        check(s, init, false);
    };

    rel("", {});
    rel("./", { "" });
    rel("index.htm", { "index.htm" });
    rel("path/to/file.txt", { "path", "to", "file.txt" });
    rel("//example.com", {});
    rel("x:y:z", { "y:z" });
    rel("x:y:z/", { "y:z", "" });
    rel("./y:z", { "y:z" });
    rel("./y:z/", { "y:z", "" });

    abs("/", {});
    abs("/./", { "" });
    abs("//example.com/", {});
    abs("//example.com/./", { "" });
    abs("/index.htm", { "index.htm" });
    abs("/home/", { "home", "" });
    abs("//x//", { "", "" });
    abs("/.//", { "", "" });
    abs("//x/y", { "y" });
    abs("/././/", { ".", "", "" });
    abs("/.//", { "", "" });
    abs("x:/.//", { "", "" });
}

So yeah, if you are reading it correctly it means that if you have the relative-ref "./" and you iterate the path segments you will get just one element, an empty string { "" }. This might seem counterintuitive but it is necessary to provide the invariant that the segments container acts like vector<string>. If you push_back a series of elements to an empty path, then iterating the container will give you exactly those strings back.
And the implementation preserves the "absoluteness" of the path. That is, if the path previously started with a slash then it will continue to start with a slash. If the path previously did not start with a slash, then the new path will not start with a slash - modulo the provision that any URL with an authority can never be relative. There's a new function set_absolute_path(bool) that adjusts the path accordingly. Making this all work seamlessly is a chore. Thanks
Vinnie Falco wrote:
To all, thanks for all the feedback. I'm not ignoring it, but we discovered a super duper annoying problem with how the path segments and query parameters containers represent their respective parts of the URL
I would be very interested to see a comparison of how your decomposition of paths into segments, and the reverse, compares to what std::filesystem::path does, and rationale for the differences.
abs("/././/", { ".", "", "" });
Well that's an interesting one. Why is the middle element "" not "."? Can you clarify? I might perhaps hope that something like this would work:

for (auto dir: path.segments) {
    chdir(dir.c_str());
}

i.e. I'd end up in the same directory as chdir(path.c_str()) would get me to. But chdir("") is an error. I believe that std::filesystem::path merges adjacent directory separators, so the "" segments disappear. Regards, Phil.
On Sat, Oct 16, 2021 at 11:10 AM Phil Endecott via Boost
I would be very interested to see a comparison of how your decomposition of paths into segments, and the reverse, compares to what std::filesystem::path does, and rationale for the differences.
To be honest I have no idea, I have never used std::filesystem to any meaningful extent.
abs("/././/", { ".", "", "" });
Well that's an interesting one. Why is the middle element "" not "."? Can you clarify?
So yeah, there are some basic rules. For example, if you call

u.encoded_segments() = { ... }; // initializer_list
I might perhaps hope that something like this would work:
for (auto dir: path.segments) { chdir(dir.c_str()); }
I believe it will work for the cases where it should work, although I can't say with certainty until I have finished making these changes.
I believe that std::filesystem::path merges adjacent directory separators, so the "" segments disappear.
Yep, that needs to be a supported operation, but it should not be called push_back or insert, because operations that share names with their vector equivalents should behave exactly like vector. And vector<string> doesn't delete a trailing "" when you push_back a non-empty string. I would have to add more functions such as (bikeshedding aside) merge_back or merge_at. Thanks
On 17/10/2021 07:53, Vinnie Falco wrote:
What should the resulting encoded URL be? Well a naive algorithm might leave us with "https://index.htm" but this is clearly wrong, as index.htm is now the host! To fix this, the library prepends "/." to the path (this is guidance from RFC 3986):
assert( u.encoded_url() == "https:/.//index.htm" );
I assume this was intended to be "https://./index.htm"?
There are a handful of other cases like this (for example, removing the scheme from a relative-ref whose first segment has a colon). Coming back to:
abs("/././/", { ".", "", "" });
We treat a leading "/." as not appearing in the segments, to make the behavior of the library doing these syntactic adjustments transparent and satisfy the rule that assignments from segments produce the same result when iterated.
If you're stripping leading ./ then shouldn't the result just be "/" alone? Same reason that "/../../foo/../bar/" should become "/bar/".
On Sun, Oct 17, 2021 at 2:52 PM Gavin Lambert via Boost
assert( u.encoded_url() == "https:/.//index.htm" );
I assume this was intended to be "https://./index.htm"?
Nope, it was correct as I wrote it. You managed to produce an authority with a single dot :)
abs("/././/", { ".", "", "" });
We treat a leading "/." as not appearing in the segments, to make the behavior of the library doing these syntactic adjustments transparent and satisfy the rule that assignments from segments produce the same result when iterated.
If you're stripping leading ./ then shouldn't the result just be "/" alone? Same reason that "/../../foo/../bar/" should become "/bar/".
Well no, there's a difference between what is in the value returned by url_view::encoded_path() and what you get when you iterate the segments. Leading "/." or "./" stays in the encoded path but is not returned by iterating segments, for the reason that it is considered "metadata" about the path that keeps it regular without changing the meaning. I think you are getting confused with "normalization" which is a different thing entirely. Given the following URL:

https:/.//index.htm

Normalization would leave it as-is. Given:

https:/././/index.htm

normalization would return

https:/.//index.htm

If you start with the URL above and add an authority:

url u = parse_uri( "https:/.//index.htm" ).value();
u.set_encoded_authority( "example.com" );

The result is

assert( u.encoded_url() == "https://example.com//index.htm" );

So there are three things at play here:

1. Modifying the path for normalization
2. Tweaking the path to match the grammar
3. Tweaking the path to provide segments() container invariants

Number 1 above is what people are mostly familiar with, for example collapsing double dotted segments ".." safely. Number 2 is understood by fewer people but is a consequence of the grammar in the RFC. For example, if you have an authority, and a path that starts with a double slash "//", then if you remove the authority you have to prepend "/." to the path. Another one: if you have a scheme and a relative path whose first segment contains a colon, and you remove the scheme, then you have to prepend "./" to the path. These tweaks let the library guarantee that all mutation operations leave the URL in a syntactically valid state without having to do weird things like throw exceptions, return error codes, ignore the request, or, worse, impose additional semantic changes on the URL (for example, turning an absolute path into a relative one in a case where the user didn't explicitly request it).
Number 3 is the most controversial and unintuitive; it falls out as a consequence of making the segments and encoded_segments containers behave exactly like vector<string>. For example, if you call clear() on the container, then it should return an empty list:

url u = parse_relative_ref( "path/to/file.txt" ).value();
u.segments().clear();
assert( u.segments().begin() == u.segments().end() );

However, what if you have an absolute path?

url u = parse_relative_ref( "/path/to/file.txt" ).value();
assert( u.segments() == { "path", "to", "file.txt" } );

Okay, so far so good, but what if you clear?

u.segments().clear();
assert( u.encoded_url() == "/" );

Wait, that's not clear, there's still a path segment! Well of course there is: if you clear an absolute path you should get back an absolute path. But the segments container should be empty:

assert( u.segments().begin() == u.segments().end() ); // has to pass

See what's happening here? Let's start with a relative path:

url u = parse_relative_ref( "index.htm" ).value();

Now let's reassign the path:

u.segments() = { "my:file.txt" };

Well, we can't leave the URL as "my:file.txt" because that would be interpreted as a scheme. So the library prepends "./":

assert( u.encoded_path() == "./my:file.txt" );

But we have an invariant: after you assign the segments you have to get the same thing back:

assert( u.segments() == { "my:file.txt" } ); // invariant holds

To enforce this invariant we have to treat some path prefixes as if they weren't there: "/" by itself, "./", and "/.". And that's how it's done. Thanks
On 18/10/2021 13:01, Vinnie Falco wrote:
assert( u.encoded_url() == "https:/.//index.htm" );
I assume this was intended to be "https://./index.htm"?
Nope, it was correct as I wrote it. You managed to produce an authority with a single dot :)
Yes, that was my intent, from your description of replacing the authority with a dot. Though I see the issue now. I was reading the input as:

"https://example.com/index.htm"

For which removing the authority should result in:

"https:/index.htm"

(Although this is unusual; typically relative URIs will omit the scheme as well.)

For the input:

"https://example.com//index.htm"

then it does make sense at a purely-URL level to transform this to:

"https:/.//index.htm"

(Although most web servers would treat either as illegal, but you could envisage some not-HTTP protocol that requires such syntax.)

Adding the authority back to this URL should result in "https://example.com/.//index.htm", however -- it should not be "ignoring" the prefix once it exists. At least not until the URL is normalised. (Unless you're documenting that URLs are always stored in normalised form, or that setters will automatically normalise.)
We treat a leading "/." as not appearing in the segments, to make the behavior of the library doing these syntactic adjustments transparent and satisfy the rule that assignments from segments produce the same result when iterated.
So what about the input "https://example.com/./index.htm"? Unless you're documenting automatic normalisation, this should still iterate the "." and "index.htm" path components separately.
On Sun, Oct 17, 2021 at 6:05 PM Gavin Lambert via Boost
So what about the input "https://example.com/./index.htm"? Unless you're documenting automatic normalisation, this should still iterate the "." and "index.htm" path components separately.
Well, I wouldn't say that the library is "doing automatic normalization." It is just a special-case treatment of up to 2 leading characters of the path. As a general rule I think the library needs to make a best effort to preserve the components making up the path iteration. That means, if you have a URL with index.htm and for whatever reason the library needs to put this prefix "./" in front of it, then iterating the path should still only give index.htm. And furthermore, the implementation does not hide any "state" information outside of the string itself. If you have "//x//y" and you remove the authority to get ".//y", then add the authority back to get "//x/.//y", and iterating would produce { ".", "", "", "y" }, then we have changed what iterating the segments returns without the user asking for it.

So I think the library has to treat up to 2 leading characters of the path as special ("/", "./", and "/."), adding and removing them as needed to preserve syntactic correctness and semantic equivalence when performing mutation operations. To ensure the stability of iterated segments, this means that we shouldn't return that malleable prefix of the path. Thanks
On 17/10/2021 07:53, Vinnie Falco wrote:
On Sat, Oct 16, 2021 at 11:10 AM Phil Endecott wrote:
I would be very interested to see a comparison of how your decomposition of paths into segments, and the reverse, compares to what std::filesystem::path does, and rationale for the differences.
To be honest I have no idea, I have never used std::filesystem to any meaningful extent.
Getting back to this point, as I mentioned in another branch it looks like {std,boost}::filesystem breaks down "/foo/bar/baz.htm" to { "/", "foo/", "bar/", "baz.htm" }. Like it or hate it, this is the standard now, and it's likely that people would expect that Boost.URL would use similar segmentation.

On a related note, you should strongly consider cross-compatibility with {std,boost}::filesystem::path anyway, since inevitably an URL-parsing library will hit one of the following scenarios sooner or later:

- conversion of an absolute file:/// uri to a filesystem path and back
- conversion of a relative uri path (only) to a relative filesystem path and back (or perhaps better: conversion of [absolute uri, base uri, base path] to absolute path)

While you could get away with doing the second kind of interop only with complete strings, having an explicit method for it is better as it allows you to encourage security checks such as considering it illegal to use a relative path that would escape the base path.
On Sun, Oct 17, 2021 at 7:04 PM Gavin Lambert via Boost
Getting back to this point, as I mentioned in another branch it looks like {std,boost}::filesystem breaks down "/foo/bar/baz.htm" to { "/", "foo/", "bar/", "baz.htm" }.
I'm not sure that's viable or correct for URL paths though.
On a related note, you should strongly consider cross-compatibility with {std,boost}::filesystem::path anyway, since inevitably an URL-parsing library will hit one of the following scenarios sooner or later:
Yeah, I agree that there needs to be something to facilitate working with paths that are formed from URLs. I haven't given it any thought yet because I am still trying to finish the baseline functionality. Thanks
On 18/10/2021 15:06, Vinnie Falco wrote:
On Sun, Oct 17, 2021 at 7:04 PM Gavin Lambert wrote:
Getting back to this point, as I mentioned in another branch it looks like {std,boost}::filesystem breaks down "/foo/bar/baz.htm" to { "/", "foo/", "bar/", "baz.htm" }.
I'm not sure that's viable or correct for URL paths though.
Actually, sorry, I was getting my wires crossed. The above is how .NET's System.Uri class behaves. (So Microsoft evidently thinks it's viable for URLs.) {std,boost}::filesystem will break down "/foo/bar/baz.htm" to { "/", "foo", "bar", "baz.htm" } (and "/foo/bar/" to { "/", "foo", "bar", "" } -- except boost:: returns "." for the final element).

Filesystems perhaps have a slight advantage here as AFAIK no filesystem accepts an escaped path separator within a filename (modulo Linux allowing backslashes; including WSL, but the latter actually stores a different character), while there's no rule against %2F appearing inside an URL path (although it's more common in a query string). But this isn't insurmountable.

You can perhaps argue that the leading "/" element is not needed if there is another method to indicate whether the URI is absolute or not; Filesystem includes it for lexicographic sorting & comparison purposes, but those don't entirely apply since we're only considering part of the URL here anyway. So that is more justifiable.

With that in mind, it's probably reasonable for this iteration:

- "/foo/bar/baz.htm" => { "foo", "bar", "baz.htm" }
- "/foo/bar/" => { "foo", "bar", "" }
- "/foo/bar" => { "foo", "bar" }
- "/.//foo/bar" => { ".", "", "foo", "bar" }

But as I said before, this makes it impossible to convert between relative and absolute paths via the path-segment API alone (as the initial "/" isn't exposed, and instead a different method must be used). Whether this is a feature or a bug is in the eye of the beholder.
On a related note, you should strongly consider cross-compatibility with {std,boost}::filesystem::path anyway, since inevitably an URL-parsing library will hit one of the following scenarios sooner or later:
Yeah, I agree that there needs to be something to facilitate working with paths that are formed from URLs. I haven't given it any thought yet because I am still trying to finish the baseline functionality.
It's worthwhile considering these things from the start, as they can inform design of your baseline (such as compatibility of path segment iteration).
On Sun, Oct 17, 2021 at 8:56 PM Gavin Lambert via Boost
You can perhaps argue that the leading "/" element is not needed if there is another method to indicate whether the URI is absolute or not;
Yep, you're right that is needed, and we have these functions:

bool url_view::is_path_absolute() const noexcept;
bool url::set_path_absolute( bool should_be_absolute ); // returns true on success

set_path_absolute(false) can fail in the one case where there is an authority and at least one segment.
- "/.//foo/bar" => { ".", "", "foo", "bar" }
The list looks fine except for the above, which I think has to be { "", "foo", "bar" } for the reason that assigning the path should give you back the same results when iterated:

url u = parse_uri( "http:" ).value();
u.segments() = { "", "foo", "bar" };
assert( u.encoded_url() == "http:/.//foo/bar" );
assert( u.segments() == { "", "foo", "bar" } ); // same list

If we then remove the scheme, I think the library needs to remove the prefix that it added. Not a full "normalization" (that's a separate member function that the user calls explicitly). The rationale is that if the library can add something extra to make things consistent, it should strive to remove that something extra when it can do so. Yes, this means that if the user adds the prefix themselves and then performs a mutation, the library could end up removing the thing that the user added. I think that's ok though, because the (up to 2 character) path prefixes that the library can add are all semantic no-ops even in the filesystem cases.
It's worthwhile considering these things from the start, as they can inform design of your baseline (such as compatibility of path segment iteration).
Library-added prefixes are semantic no-ops anyway, even for the filesystem case, so they should not change anything in terms of filesystem compatibility. And they don't appear in segment iteration so it is like they were never there, thus zero impact. Thanks
On Sun, Oct 17, 2021 at 8:56 PM Gavin Lambert via Boost
It's worthwhile considering these things from the start, as they can inform design of your baseline (such as compatibility of path segment iteration).
Segment iteration is not going to be compatible. In addition to adding an initial "/" segment for absolute paths, Filesystem also collapses consecutive / separators. So iterating "/foo//bar//baz///" produces

"/" │ "foo" │ "bar" │ "baz" │ ""

(https://godbolt.org/z/EsjKzc5f1)

A design goal of URL seems to be that the information that the accessors give accurately reflects the contents of the string (and that there's no hidden metadata that the string doesn't reflect). So the segments of the above path are

{ "foo", "", "bar", "", "baz", "", "", "" }

because otherwise the segments of the above and "/foo/bar/baz/" will be the same, which means that it won't be possible to reconstruct the string from the information the URL accessors give.

That is, if you have a string, and you construct an URL object from it, then take its state as returned by the accessors, copy it into another empty URL, the resultant string should be the same as the original. And similarly, if you take an empty URL, set its parts to some values, take the string, create another URL object from the string, the accessors should give the original values. Destructive transformations such as what Filesystem does above cannot work in this model.
On Mon, Oct 18, 2021 at 12:38 PM Peter Dimov via Boost
That is, if you have a string, and you construct an URL object from it, then take its state as returned by the accessors, copy it into another empty URL, the resultant string should be the same as the original
Yes, and note that the "absoluteness" of the path is a property of the URL which is reflected in the url API and not the segments:

bool url_view::is_path_absolute() const noexcept;
bool url::set_path_absolute( bool should_be_absolute );

Thanks
Peter Dimov wrote:
On Sun, Oct 17, 2021 at 8:56 PM Gavin Lambert via Boost
It's worthwhile considering these things from the start, as they can inform design of your baseline (such as compatibility of path segment iteration).
Segment iteration is not going to be compatible. In addition to adding an initial "/" segment for absolute paths, Filesystem also collapses consecutive / separators. So iterating "/foo//bar//baz///" produces
"/" ? "foo" ? "bar" ? "baz" ? ""
(https://godbolt.org/z/EsjKzc5f1)
A design goal of URL seems to be that the information that the accessors give accurately reflects the contents of the string (and that there's no hidden metadata that the string doesn't reflect.)
So the segments of the above path are
{ "foo", "", "bar", "", "baz", "", "", "" }
because otherwise the segments of the above and "/foo/bar/baz/" will be the same, which means that it won't be possible to reconstruct the string from the information the URL accessors give.
Right. But why has it chosen that goal, rather than the alternatives? What's the rationale?

It seems to me that a URL with redundant /s (e.g. http://foo.com/path/////to/file) is either (a) malicious or erroneous input, or (b) equivalent to the versions without the redundant /s. So a user might want to (a) get an exception or error, or (b) ignore the redundant segments. Under what circumstances would a user want to see the empty segments between those /s?

Here's an alternative:

- Skip over duplicate adjacent '/' when iterating segments.
- Return "/" as the first segment for absolute paths.
- Return "" as the last segment for paths with a trailing "/".
- Give p.push_back(s) a precondition that s must not be empty if p.back() is empty.

I think this gives pretty sane behaviour. The invariant that push_back()ing a series of segments and then iterating returns the same strings holds.

Vinnie Falco wrote:
note that the "absoluteness" of the path is a property of the URL which is reflected in the url API and not the segments:
You're saying this because that's what the BNF says. Your URL API doesn't have to exactly mirror the BNF. If it would make sense for "absoluteness" to be a property of the path rather than the URL from the point of view of the library user, you can do that.

Two other things to consider:

- What is your operator== going to do about redundant /s?
- How does this all work with data: and mailto: URLs?

Regards, Phil.
On 20/10/2021 02:12, Phil Endecott wrote:
Right. But why has it chosen that goal, rather than the alternatives? What's the rationale?
It seems to me that a URL with redundant /s (e.g. http://foo.com/path/////to/file) is either (a) malicious or erroneous input, or (b) equivalent to the versions without the redundant /s. So a user might want to (a) get an exception or error, or (b) ignore the redundant segments. Under what circumstances would a user want to see the empty segments between those /s?
The rationale is the URI standard RFC. It is not permitted for URI parsers to perform such collapsing because such URIs are entirely legal. Unusual, certainly, especially for http[s] URIs that usually eventually end up at a filesystem that would collapse consecutive slashes. Most web servers (regardless of whether evaluating vs a filesystem or not) will indeed collapse consecutive slashes too, at least while matching against resource rules. But URIs are not required to wind up leading to a filesystem or to be served by a web server -- it's entirely possible that some other scheme may treat additional consecutive slashes as significant for some purpose.
On 19/10/2021 05:10, Vinnie Falco wrote:
- "/.//foo/bar" => { ".", "", "foo", "bar" }
The list looks fine except for the above, which I think has to be { "", "foo", "bar" } for the reason that assigning the path should give you back the same results when iterated:
url u = parse_uri( "http:" ).value();
u.segments() = { "", "foo", "bar" };
assert( u.encoded_url() == "http:/.//foo/bar" ); assert( u.segments() == { "", "foo", "bar" } ); // same list
Again, what about the case where the original input URL contained that leading dot? You can't argue "we must report it unchanged" when by definition there are conditions when you are changing it. The only mechanism that seems sane to me is that encoded_url() and friends are documented to normalise (or at least to partially normalise, limited to adding/removing the path prefix) the URL before returning a string, at which point segments() may change content. (But it's important that it doesn't break if you push_back each segment individually instead of assigning it all at once.)
If we then remove the scheme, I think the library needs to remove the prefix that it added. Not a full "normalization" (that's a separate member function that the user calls explicitly). The rationale is that if the library can add something extra to make things consistent, it should strive to remove that something extra when it can do so. Yes this means that if the user adds the prefix themselves and then performs a mutation, the library could end up removing the thing that the user added. I think that's ok though, because the (up to 2 character) path prefixes that the library can add are all semantic no-ops even in the filesystem cases.
I don't disagree with this, but I do disagree with the iteration methods trying to "hide" elements that are actually present in the URL.

On 19/10/2021 08:37, Peter Dimov wrote:

Segment iteration is not going to be compatible. In addition to adding an initial "/" segment for absolute paths, Filesystem also collapses consecutive / separators. So iterating "/foo//bar//baz///" produces

"/" │ "foo" │ "bar" │ "baz" │ ""

Fair point; I hadn't considered that one. That's unfortunate. I agree that URL cannot collapse adjacent separators.
On Friday, October 15th, 2021 at 5:14 AM, Phil Endecott via Boost
Exactly, it's like returning a string_view from a function - which you should never do, at least not without an obvious indication...
My primary objection to this line of thinking is that the standard begin/end functions have the same dangerous implications as this function does. The input can be a string and the output is something that becomes invalid when that string is destroyed. C++ doesn't have a good solution to this situation and clearly begin/end don't give an obvious indication.
Ahmed Charles wrote:
On Friday, October 15th, 2021 at 5:14 AM, Phil Endecott via Boost wrote:
Exactly, it's like returning a string_view from a function - which you should never do, at least not without an obvious indication...
My primary objection to this line of thinking is that the standard begin/end functions have the same dangerous implications as this function does. The input can be a string and the output is something that becomes invalid when that string is destroyed.
begin/end don't have this problem in practice because you need to call both, and that's hard to do on the same temporary. c_str does, but virtually nobody makes this mistake for some reason.
On 21/10/2021 13:42, Peter Dimov wrote:
begin/end don't have this problem in practice because you need to call both, and that's hard to do on the same temporary.
It's a constant thorn in Range's side, though, especially with all the various kinds of non-owning filters. But by the same token people should start getting more used to these sorts of considerations and be less prone to making such errors. And the static analysis tooling is a bit better now at spotting such things, though not perfect.
c_str does, but virtually nobody makes this mistake for some reason.
I recall some older (broken) C++ compiler (or rather: optimizer) implementations where the temporary in (rvalue).c_str() was sometimes destroyed after calling c_str() instead of after the entire full-expression, which caused all sorts of Fun™ for the common cases of passing this to a method call, which in turn led to defensively creating a lot of otherwise unnecessary lvalues. Thankfully modern compilers don't have that problem any more.
Gavin Lambert wrote:
And the static analysis tooling is a bit better now at spotting such things, though not perfect.
Not really.

https://godbolt.org/z/P8xb4Ern4
https://godbolt.org/z/Y56KnxGeT
https://godbolt.org/z/M7aGb7jq8
https://pdimov.github.io/blog/2020/05/17/state-of-c-static-analysis-circa-20...
https://godbolt.org/z/Pba64YjYv

Well, there is https://godbolt.org/z/nz9Pvzfoj, I suppose.

Oh, -fanalyzer has improved: https://godbolt.org/z/xhnr88ah1

It wasn't very useful for C++ last I checked, but it works now.
Oh, -fanalyzer has improved: https://godbolt.org/z/xhnr88ah1
It wasn't very useful for C++ last I checked, but it works now.
Or not. That's a false positive. :-)
On Wed, Oct 20, 2021 at 5:58 PM Gavin Lambert via Boost
people should start getting more used to these sorts of considerations and be less prone to making such errors.
This is my position as well. string_view is "dangerous" but you get a corresponding improvement in power. It really isn't much different than calling async functions with unowned buffers, e.g.:

std::string s;
sock.async_write_some( asio::buffer(s), ... );

There isn't any realistic way to protect from this, which is why I think that the proposed schemes for helping users avoid shooting themselves in the foot are not good solutions. Thanks
Ahmed Charles wrote:
On Friday, October 15th, 2021 at 5:14 AM, Phil Endecott via Boost wrote:
Exactly, it's like returning a string_view from a function - which you should never do, at least not without an obvious indication...
My primary objection to this line of thinking is that the standard begin/end functions have the same dangerous implications as this function does. The input can be a string and the output is something that becomes invalid when that string is destroyed.
I'd be happy with the parsing function returning a view if its name reflected that. My complaint is that the short name parse_url() doesn't give the user that "obvious indication".

There are a couple of differences with begin()/end(). Firstly, "everyone knows" that begin() and end() return iterators so the names do give the required clue. Secondly, these are methods, not free functions; you can argue that it's OK to return iterator/pointer/view types from getter methods if the lifetime of the pointee is the same as the object. (Yes, I'm conveniently overlooking the begin()/end() free functions.)

I'd also suggest that the existence of foot-guns already in the language is not an excuse for adding more! These are not absolute, yes/no decisions; there are trade-offs to be made. The only benefit of naming this view-returning function parse_url() is less typing. I don't think that's worthwhile.

Regards, Phil.
On Thu, Oct 21, 2021 at 6:47 AM Phil Endecott via Boost
I'd be happy with the parsing function returning a view if its name reflected that. My complaint is that the short name parse_url() doesn't give the user that "obvious indication".
In C++ the "name" of a function includes its return type and parameter types, so actually the name is: result
On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost
Have you looked at the RFC 3986 errata at https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0?
Okay, this is taken care of now!!! Thanks
On 10/13/21 7:09 AM, Phil Endecott via Boost wrote:
Hi Vinnie,
Vinnie Falco wrote:
Well, I was on a pause but I picked Boost.URL back up and whipped it into shape.
Thanks for posting that. Assorted comments follow.
Back in 2005-ish I decided to write a URL parser based on the RFC BNF using Boost.Spirit. At that time the most recent RFCs were RFC2616 (1999) and RFC2396 (1998). I did a fairly direct translation of the BNF to Spirit and ... it didn't work. It turned out that the BNF in the RFCs was wrong, and had been for 6 years at that point. So I advise that you are careful about directly using the RFC BNF. ...
To me, all this argues for the usage of Boost.Spirit as a URL parser. Maintenance, verification, etc. of a protocol with the problems you mention is much easier when the specification of the syntax is separated from the code which does the parsing. Before one objects on "performance" grounds: don't bother, even if it were true in this case, which I actually doubt.
Anyway that's enough ramblings for now....
Regards, Phil.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
On Thu, Oct 14, 2021 at 11:35 AM Robert Ramey via Boost
To me, all this argues for the usage of Boost.Spirit as a URL parser.
Never going to happen. I would never take a dependency on Spirit for anything network related or that will have security reviews. And users don't want this dependency either. Thanks
On 10/14/21 11:38 AM, Vinnie Falco via Boost wrote:
On Thu, Oct 14, 2021 at 11:35 AM Robert Ramey via Boost
wrote: To me, all this argues for the usage of Boost.Spirit as a URL parser.
Never going to happen. I would never take a dependency on Spirit for anything network related or that will have security reviews. And users don't want this dependency either.
Hmmm - how about this idea. Create all your tests and examples first. Basically this will consist of a long list of URLs - each one more elaborate and indecipherable than the last. Write the Spirit parser to run against these tests. When you get all this working, write your "hand-rolled" parser (this will take a while) and test it against the same list of URLs, verifying that the parsing matches the Spirit-generated results. Then you'll have two things: a) verifiable, maintainable confirmation that the test parsing is correct; b) test results which show that your hand-rolled parser is either correct or difficult to prove otherwise. Robert Ramey
Thanks
On Thu, Oct 14, 2021 at 11:00 PM Robert Ramey via Boost < boost@lists.boost.org> wrote:
On 10/14/21 11:38 AM, Vinnie Falco via Boost wrote:
On Thu, Oct 14, 2021 at 11:35 AM Robert Ramey via Boost
wrote: To me, all this argues for the usage of Boost.Spirit as a URL parser.
Never going to happen. I would never take a dependency on Spirit for anything network related or that will have security reviews. And users don't want this dependency either.
Hmmm - how about this idea. Create all your tests and examples first. Basically this will consist of a long list of URLs - each one more elaborate and indecipherable than the last. Write the Spirit parser to run against these tests. When you get all this working, write your "hand-rolled" parser (this will take a while) and test it against the same list of URLs, verifying that the parsing matches the Spirit-generated results. Then you'll have two things:
a) verifiable, maintainable confirmation that the test parsing is correct; b) test results which show that your hand-rolled parser is either correct or difficult to prove otherwise.
I think maybe even better is to take some other implementations from other languages (if they exist); maybe C# or Rust have official/unofficial equivalents of Boost.URL, so one could do a "diff" against them.
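A minimal sketch of this differential-testing idea (all function names here are hypothetical stand-ins, not Boost.URL or Spirit APIs): run two independent parsers over the same corpus and require identical results for every entry.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Stand-ins for the two implementations under comparison. In practice
// one would be the hand-rolled parser and the other the Spirit-based
// (or foreign-language) reference implementation.
std::string scheme_handrolled(std::string_view url)
{
    std::size_t pos = url.find(':');
    return std::string(url.substr(0, pos == std::string_view::npos ? 0 : pos));
}

std::string scheme_reference(std::string_view url)
{
    std::string out;
    for (char c : url) {
        if (c == ':')
            return out;
        out.push_back(c);
    }
    return std::string(); // no scheme found
}

// Returns the number of corpus entries where the two parsers disagree;
// a differential test asserts this is zero over the whole URL list.
std::size_t count_disagreements(const std::vector<std::string>& corpus)
{
    std::size_t mismatches = 0;
    for (const std::string& url : corpus)
        if (scheme_handrolled(url) != scheme_reference(url))
            ++mismatches;
    return mismatches;
}
```

The value of the approach is that the two implementations fail independently: a bug has to appear identically in both to slip through, and every mismatch pinpoints a concrete input to investigate.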
On Thu, Oct 14, 2021 at 8:33 PM Ivan Matek via Boost
a) verifiable, maintainable, confirmation that the test parsing is very correct b) test results which show that your hand rolled parser is either correct or difficult to prove otherwise.
I already have all that... this is like HTTP parsing redux. The parsing of URLs is the least interesting and least difficult aspect of this project, but it seems to be what people focus on. Where there is difficulty is in the API for presenting and modifying the path segments and the query parameters as a container. Have you had a look at the query parameters container value_type? https://github.com/CPPAlliance/url/blob/d866b5b09f08d12d597f783da295be823ad6... Thanks
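For those who haven't clicked through: the design question is what the container's value_type should look like when a query parameter's value may be absent ("?key" vs "?key="). A simplified sketch of one possible shape (illustrative only, not the library's actual declaration):

```cpp
#include <cstddef>
#include <string_view>

// Sketch of a query-parameter value type. The has_value flag
// distinguishes "?key" (no value at all) from "?key=" (empty value),
// which a plain pair<string_view, string_view> cannot express.
struct query_param_view
{
    std::string_view key;
    std::string_view value;   // meaningful only when has_value is true
    bool has_value = false;
};

// Example classification of raw query fragments:
//   "key=abc" -> { "key", "abc", true  }
//   "key="    -> { "key", "",    true  }
//   "key"     -> { "key", "",    false }
query_param_view classify(std::string_view fragment)
{
    std::size_t eq = fragment.find('=');
    if (eq == std::string_view::npos)
        return { fragment, std::string_view(), false };
    return { fragment.substr(0, eq), fragment.substr(eq + 1), true };
}
```

The three-state distinction is where most of the container-design difficulty lives, since round-tripping a URL must preserve it exactly.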
Robert Ramey wrote:
On 10/13/21 7:09 AM, Phil Endecott via Boost wrote:
Back in 2005-ish I decided to write a URL parser based on the RFC BNF using Boost.Spirit. At that time the most recent RFCs were RFC2616 (1999) and RFC2396 (1998). I did a fairly direct translation of the BNF to Spirit and ... it didn't work. It turned out that the BNF in the RFCs was wrong, and had been for 6 years at that point. So I advise that you are careful about directly using the RFC BNF. ...
Too me, all this argues for the usage of boost spirit as a url parser. Maintenance, Verification, etc. Of a protocol with the problems you mention is much easier when the specification of the syntax is separated from the code which does the parsing.
Spirit is also a backtracking parser, so it can't promise linear complexity - unless you can show that it does for the particular grammar you are using. This isn't a performance issue; it's a security issue, i.e. you don't want your application to suffer denial of service when it encounters malicious input. Regards, Phil.
participants (12)
- Ahmed Charles
- Alex Christensen
- Andrey Semashev
- Dominique Devienne
- Gavin Lambert
- Ivan Matek
- Julien Blanc
- Peter Dimov
- Phil Endecott
- Rainer Deyke
- Robert Ramey
- Vinnie Falco