Hi Vinnie, Vinnie Falco wrote:
Well, I was on a pause but I picked Boost.URL back up and whipped it into shape.
Thanks for posting that. Assorted comments follow. Back in 2005-ish I decided to write a URL parser based on the RFC BNF using Boost.Spirit. At that time the most recent RFCs were RFC2616 (1999) and RFC2396 (1998). I did a fairly direct translation of the BNF to Spirit and ... it didn't work. It turned out that the BNF in the RFCs was wrong, and had been for 6 years at that point. So I advise that you are careful about directly using the RFC BNF. Have you looked at the RFC 3986 errata at https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0 ? For example, I note that your rule for relative-part on the "Parsing" page doesn't seem to include the path-abempty alternative added in erratum 5428. I would be interested to see how compile time, run time, and lines-of-code compare between your implementation, a Spirit parser, and a regexp. You also have the choice of (a) parsing and storing pointers to each component as I believe you do, or (b) parsing to validate but storing only the complete string, and doing a partial parse when necessary to extract components. (b) could be better if you're mostly just passing around complete URLs, storing them in containers, etc. There's also the question of worst-case complexity; does your parser backtrack? If so, is there a pathological input that causes it to take exponential time? Can you add a complexity guarantee to the docs? (I posted about this in the context of regular expression libraries a few weeks ago.) This worries me: auto url = parse_url(str); The problem is that url is a view of the string parameter, but nothing warns the user about that and... str = something_else; ..now url is dangling. Am I right, or is there something that prevents this? I believe that functions that returns views should be named so that that is obvious, and in cases like this the function with the simplest name should be the safer version that returns a value-type. (There was a similar discussion for fixed_string, which originally proposed to return a view from its substr() method.) Also I note that parse_uri() returns a result<> and doesn't throw. My preference would be to name that e.g. try_parse_uri() and for the simpler-named function to be a wrapper that returns a plain uri or throws on error. My rationale is to make the simplest-named functions provide the easiest-to-use functionality, so that *teaching C++ to a Javascript programmer is not so embarrassing*, while retaining the more performant API variants for those who need them. (Is it parse_uri() or parse_url()? I thought you had decided on URL.) "pct" doesn't immediately mean "percent" to me. I'd prefer percent_encode() to pct_encode(). If that's too long for you, truncate "encode" to "enc". Or maybe I would call this "escaping" rather than "encoding". Actually I think the RFC terminology might be "URL encoding"; I don't know if I prefer that but if you want "strict adherence to the RFCs"... It doesn't look like your url type is comparable. I think it should be. The comparisons need to be careful about case-sensitivity. And it should be hashable. I recently had to implement request signing for AWS. This requires canonicalisation of URLs. There are some subtleties in percent encoding: you need to be able to control exactly which characters are escaped (I think you allow this); you need to be able to control whether + is used to escape space (do you?) and you need to be able to control whether the hex characters are upper or lower case (do you?). This code also needed to sort the query parameters by key; that raises the question of whether the query is a sequence or an associative container of key-value pairs, and if an associative container, what the comparison function is. Actually you might like to have a look at the AWS SDK URI class: http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_http_1_1_u_r_i.html One thing that I believe it does is to set port to a default based on the scheme, if it's not specified in the url string. I'd say that's a helpful thing to do. Are there any other URI classes that you could compare/contrast with? I would hope that the path would be similar to a std::filesystem::path i.e. it should use the same vocabulary where possible (or could even actually be a std::filesystem::path, on C++17). Conversion between std::filesystem::path and file: URLs would be nice to have. In your table 1.7, "Return the type of host specified, if any." should I think say "Return the host, if any." Your percent-decoding methods return strings. I'm a bit surprised that you've not tried to return lazy decoding views, at least as an option. Anyway that's enough ramblings for now.... Regards, Phil.