Vinnie Falco wrote:
On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost wrote:
There's also the question of worst-case complexity; does your parser backtrack? If so, is there a pathological input that causes it to take exponential time?
I have done no profiling or optimization to speak of, except some minor design work to facilitate the later use of character-based SIMD algorithms. I don't have a monolithic "parser" per se; there are classes called "bnf elements" which can be used with the generic bnf::parse function. These elements call parse on their child elements recursively. I do backtrack. For example, URI-reference is defined as:
URI-reference = URI / relative-ref
The class uri_reference_bnf implements this by first attempting to parse the scheme and colon; if that fails, it backtracks and attempts to parse a "relative-ref". You can see this here:
https://github.com/CPPAlliance/url/blob/680f6db547ffc3367e1ba93bae29e4c9278f...
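Roughly, the element behaves like this (a simplified sketch of the idea, not the actual library code; parse_uri and parse_relative_ref here stand in for the real child elements):

    // Hypothetical child parsers standing in for the real bnf elements
    bool parse_uri(char const*& it, char const* end);
    bool parse_relative_ref(char const*& it, char const* end);

    // First-match-wins alternative with backtracking: 'it' is the
    // current parse position; if the first branch fails, the saved
    // position is restored before the second branch is tried.
    bool parse_uri_reference(char const*& it, char const* end)
    {
        char const* const start = it;       // save position
        if (parse_uri(it, end))             // try "URI" first
            return true;
        it = start;                         // backtrack
        return parse_relative_ref(it, end); // then try "relative-ref"
    }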
I'm not really sure how this can be improved: you need to know whether there's a scheme, and that can only be determined by scanning forward until reaching a colon or a character that is invalid in a scheme. In theory this could scan the entire input, so I have not put any effort into making this better.
Can you add a complexity guarantee to the docs? (I posted about this in the context of regular expression libraries a few weeks ago.)
Maybe... I did it for JSON because performance there is critical, and the difference between a poorly performing library and a high-performing one is massive. I'm not sure URL parsing is the same. URLs tend to be short, and it is difficult to get really bad performance out of parsing them even if you go one character at a time.
data: URLs are a good example of long URLs that actually exist in practice. But what I'm really thinking about is guarding against malicious malformed input. You don't want your parser to have N^2, let alone exp(N), complexity when the input could legitimately be a data: URL of a few kbytes.

And there's no need for it to have worse than linear complexity, because I'm pretty sure that the URL syntax is a Regular Language (in the computer-science sense, not in the "regular expression is a synonym for search pattern" I'm-a-perl-programmer sense), and all regular languages can by definition be parsed in linear time.

Regarding your example, i.e. "http://foo.com/" vs. the relative URL "http.html": the point is that when the parser has reached the 'p', it will be in a state meaning "either this is a scheme or the first segment of the path of a relative URL". The transformation from BNF to a no-backtracking state machine of that sort is what tools like flex and libraries like RE2 do.

Now having said all that, I do suspect that the particular grammar of URLs can be parsed even by a backtracking parser in linear time (with some constant factor) - but it will depend on the rules. I think you aspire to making the library "secure", and if that includes not having pathological performance in the presence of malicious input, you need to understand this stuff!
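To make that concrete, here is the kind of single-pass decision I mean (my own illustration in plain C++, not your library's code): one forward scan that stays in the "this could still be a scheme" state until a character resolves the ambiguity.

    #include <cctype>
    #include <cstddef>
    #include <string_view>

    // RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    // One linear pass, no backtracking: the first character that is
    // not valid in a scheme resolves "scheme vs. relative reference".
    bool starts_with_scheme(std::string_view s)
    {
        if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
            return false;
        for (std::size_t i = 1; i < s.size(); ++i)
        {
            char const c = s[i];
            if (c == ':')
                return true;   // e.g. "http://foo.com/" at the ':'
            if (!std::isalnum(static_cast<unsigned char>(c))
                && c != '+' && c != '-' && c != '.')
                return false;  // e.g. a '/' resolves it: relative-ref
        }
        return false; // e.g. "http.html": end of input, no ':' found
    }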
auto url = parse_url(str);
The problem is that url is a view of the string parameter, but nothing warns the user about that and...
Well, it is documented profusely. I agree there is some danger here, but with great danger comes great power. Or something about great responsibility?
I don't think this is acceptable. What do others think?
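To spell out the danger (a sketch of my own, with placeholder names - I am not quoting the library's actual declarations):

    #include <string>
    #include <string_view>

    // Placeholder for the library's view type: it references the
    // caller's string rather than owning a copy.
    struct url_view { std::string_view scheme, host, path; };

    // Toy stand-in for the parser: real parsing elided; it just
    // demonstrates that the result references the argument's storage.
    url_view parse_url(std::string const& s)
    {
        return { std::string_view(s).substr(0, 4), {}, {} };
    }

    url_view broken()
    {
        std::string s = "http://example.com/index.html";
        auto url = parse_url(s); // 'url' holds views into 's'
        return url;              // 's' is destroyed here; the caller
                                 // receives dangling views
    }

Nothing at the call site hints that the lifetime of 'url' is tied to 's'.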
I would hope that the path would be similar to a std::filesystem::path, i.e. it should use the same vocabulary where possible (or could even actually be a std::filesystem::path on C++17).
OOF I hope never to see this. std::filesystem is kind of a mess.
Well, it's not my top favourite library either. But it's familiar, and its authors have considered many of the issues that have been mentioned in other threads, i.e. whether path components include directory separators, distinguishing absolute and relative paths, etc.

Having mentioned data: URLs above, I have tried to understand how these would be handled by your library. (mailto: also.) Note that a data: URL will generally contain /s, e.g. in the content type (image/jpeg) and also in the base64 data, but these are not directory separators. I think what's really needed is a specific getter/setter for the body of an "unconventional" URL scheme like these. It may be possible to use your set_encoded_path(), but it's not obvious.

Please fix the docs for set[_encoded]_query(), which currently seem confused with the operations for fragments.

Regards, Phil.