Vinnie Falco wrote:
On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost wrote:
There's also the question of worst-case complexity; does your parser backtrack? If so, is there a pathological input that causes it to take exponential time?
I have done no profiling or optimization to speak of, except some minor design work to facilitate the later use of character-based SIMD algorithms. I don't have a monolithic "parser" per se; there are classes called "bnf elements" which can be used with the generic bnf::parse function. These elements call parse on their child elements recursively. I do backtrack. For example, URI-reference is defined as:
URI-reference = URI / relative-ref
The class uri_reference_bnf implements this by first attempting to parse the scheme and colon; if that fails, it backtracks and attempts to parse a "relative-ref". You can see this here:
https://github.com/CPPAlliance/url/blob/680f6db547ffc3367e1ba93bae29e4c9278f...
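Roughly, the element behaves like this (a simplified sketch of the idea, not the actual library code; parse_uri and parse_relative_ref here stand in for the real child elements):

    // Hypothetical child parsers standing in for the real bnf elements
    bool parse_uri(char const*& it, char const* end);
    bool parse_relative_ref(char const*& it, char const* end);

    // First-match-wins alternative with backtracking: 'it' is the
    // current parse position; if the first branch fails, the saved
    // position is restored before the second branch is tried.
    bool parse_uri_reference(char const*& it, char const* end)
    {
        char const* const start = it;       // save position
        if (parse_uri(it, end))             // try "URI" first
            return true;
        it = start;                         // backtrack
        return parse_relative_ref(it, end); // then try "relative-ref"
    }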
I'm not really sure how this can be improved: you need to know whether there's a scheme, and that can only be determined by scanning forward until reaching a colon or a character that is invalid in a scheme. In theory this could scan the entire input, so I have not put any effort into making this better.
Can you add a complexity guarantee to the docs? (I posted about this in the context of regular expression libraries a few weeks ago.)
Maybe... I did it for JSON because performance there is critical, and the difference between a poorly performing library and a high-performing one is massive. I'm not sure URL parsing is the same. URLs tend to be short, and it is difficult to get really bad performance out of parsing them even if you go one character at a time.
data: URLs are a good example of long URLs that actually exist in practice. But what I'm really thinking about is guarding against malicious malformed input. You don't want your parser to have N^2, let alone exp(N), complexity when the input could legitimately be a data: URL of a few kbytes.

And there's no need for it to have worse than linear complexity, because I'm pretty sure that the URL syntax is a Regular Language (in the computer-science sense, not in the "regular expression is a synonym for search pattern" I'm-a-perl-programmer sense), and all regular languages can by definition be parsed in linear time.

Regarding your example, i.e. "http://foo.com/" vs. the relative URL "http.html": the point is that when the parser has reached the 'p', it will be in a state meaning "either this is a scheme or the first segment of the path of a relative URL". The transformation from BNF to a no-backtracking state machine of that sort is what tools like flex and libraries like RE2 do.

Now having said all that, I do suspect that the particular grammar of URLs can be parsed even by a backtracking parser in linear time (with some constant factor) - but it will depend on the rules. I think you aspire to making the library "secure", and if that includes not having pathological performance in the presence of malicious input, you need to understand this stuff!
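To make that concrete, here is the kind of single-pass decision I mean (my own illustration in plain C++, not your library's code): one forward scan that stays in the "this could still be a scheme" state until a character resolves the ambiguity.

    #include <cctype>
    #include <cstddef>
    #include <string_view>

    // RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    // One linear pass, no backtracking: the first character that is
    // not valid in a scheme resolves "scheme vs. relative reference".
    bool starts_with_scheme(std::string_view s)
    {
        if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
            return false;
        for (std::size_t i = 1; i < s.size(); ++i)
        {
            char const c = s[i];
            if (c == ':')
                return true;   // e.g. "http://foo.com/" at the ':'
            if (!std::isalnum(static_cast<unsigned char>(c))
                && c != '+' && c != '-' && c != '.')
                return false;  // e.g. a '/' resolves it: relative-ref
        }
        return false; // e.g. "http.html": end of input, no ':' found
    }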
auto url = parse_url(str);
The problem is that url is a view of the string parameter, but nothing warns the user about that and...
Well, it is documented profusely. I agree there is some danger here, but with great danger comes great power. Or something about great responsibility?
I don't think this is acceptable. What do others think?
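To spell out the danger (a sketch of my own, with placeholder names - I am not quoting the library's actual declarations):

    #include <string>
    #include <string_view>

    // Placeholder for the library's view type: it references the
    // caller's string rather than owning a copy.
    struct url_view { std::string_view scheme, host, path; };

    // Toy stand-in for the parser: real parsing elided; it just
    // demonstrates that the result references the argument's storage.
    url_view parse_url(std::string const& s)
    {
        return { std::string_view(s).substr(0, 4), {}, {} };
    }

    url_view broken()
    {
        std::string s = "http://example.com/index.html";
        auto url = parse_url(s); // 'url' holds views into 's'
        return url;              // 's' is destroyed here; the caller
                                 // receives dangling views
    }

Nothing at the call site hints that the lifetime of 'url' is tied to 's'.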
I would hope that the path would be similar to a std::filesystem::path, i.e. it should use the same vocabulary where possible (or could even actually be a std::filesystem::path on C++17).
OOF I hope never to see this. std::filesystem is kind of a mess.
Well, it's not my top favourite library either. But it's familiar, and its authors have considered many of the issues that have been mentioned in other threads, i.e. whether path components include directory separators, distinguishing absolute and relative paths, etc.

Having mentioned data: URLs above, I have tried to understand how these would be handled by your library. (mailto: also.) Note that a data: URL will generally contain /s, e.g. in the content type (image/jpeg) and also in the base64 data, but these are not directory separators. I think what's really needed is a specific getter/setter for the body of an "unconventional" URL scheme like these. It may be possible to use your set_encoded_path(), but it's not obvious.

Please fix the docs for set[_encoded]_query(), which currently seem confused with the operations for fragments.

Regards, Phil.