URL Review

21 Aug 2022

      My review of the proposed Boost.URL library follows.

Unfortunately I have been away for most of the review period and
I have not been able to spend as much time on this as I would have
liked; in particular I have not tried to run any code. But I have
looked at the source and documentation.

In previous email discussions a few months ago I complained about
the dangerous use of non-owning view types in the library interface,
and I see that these issues are still present. Specifically, quoting
the "Quick Look" docs:

   result<url_view> r = parse_uri( s );

   url u = parse_uri( s ).value();

That's fine for C++03, but since C++11 we have auto, and this is
where it becomes dangerous:

   auto r = parse_uri( s );

There's no indication here that r is a non-owning view and that
the caller is responsible for keeping s alive for the lifetime of r.
Such errors could be difficult to detect; the code may appear to work
correctly much of the time if the storage for s is not re-used.

   auto u = parse_uri( s ).value();

That's even worse, because we're calling a method called value()
but what it returns is not actually a value, it's still a view.

A safer API design would, at a minimum, give the parsing function
a name with "view" in it. The default, shorter-named function
should return a value type.

More fundamentally, I'm not convinced that this allocation-avoiding
view-rich API style is best for a general purpose vocabulary type
like URL. URL should be a "Keep It Simple" type, accessible to
C++ beginners, with straightforward lifetime semantics. Compare with
std::filesystem::path, which does not try to optimise by actually
being a view.

Because of these concerns about the basic API design, I believe that
the library should be REJECTED.

Following are some comments about other more minor issues that I
have noticed:

I see that there is a method called persist():

   std::shared_ptr< url_view const >
   persist() const;

   This function returns a read-only copy of the URL, with shared
   lifetime. The returned value owns (persists) the underlying string.
   The algorithm used to create the value minimizes the number of
   individual memory allocations, making it more efficient than when
   using direct standard library functions.

Hmm, so url_view doesn't own the underlying storage, except when it
does as a result of this mysterious method. What's the rationale for
this? (Have I missed some documentation?)

What is sizeof(url) or sizeof(url_view)? I think it must be about
160 bytes, since you store offsets to each of the components. I'm
not convinced that this is the best approach. I would be inclined to
determine the offsets each time they are needed by the accessor
methods. In the constructor, parse to validate the syntax but don't
store any offsets.

I believe that in the case of the path segments and the query
parameters, you are not storing offsets (because the number is variable,
and that would need allocations); why not adopt that approach for the
other offsets? Making the objects smaller would probably have
measurable performance benefits.

In your non-view url object, you seem to allocate storage using
new char[]. Why not use a std::string? One benefit of a std::string
is SBO. Another is that you could provide a move constructor from
a string.

I find the names of the encoding/decoding views confusing. I would
expect a "decoded_view" to refer to underlying encoded data and
return equivalent decoded data when it is read from, but you seen to
call that an "encoded_view".

In your query parameter API, you expose something that looks like a
standard associative container except that you call the key and value
"key" and "value" respectively, rather than "first" and "second".
While I think many of us would agree that those are better names, it
does make it more difficult to use this container with generic code
than it would be otherwise.

I see that you also have a bool called "has_value" in the value_type.
I think this is to handle this:

  ?a=1   // has_value == true && value == "1"
  ?a=    // has_value == true && value == ""
  ?a     // has_value == false

Is that correct? Again having this third member in the container's
value_type makes generic code more difficult. I would be inclined to
use a std::optional for the value instead.

In the past I have had situations where it was necessary to specify
the case for hex letters in %-encoding; please consider adding this to
pct_encode_opts.

The Magnet Link example needs a paragraph at the start to say what a
Magnet Link is.

Regards, Phil.

Phil Endecott

Vinnie Falco

Peter Dimov

Ivan Matek

Phil Endecott

Vinnie Falco

Alan de Freitas

tags

participants (5)