
On Thu, Jun 18, 2020 at 1:22 PM Phil Endecott via Boost <boost@lists.boost.org> wrote:
This is my review of Zach Laine's proposed Boost.Text.
Thanks for the review, Phil.
Zach is offering several things here: Unicode, a "better string", a trie, and more. I'm pleased that these components are all individually exposed to users and are independent of each other, rather than some being hidden as implementation details.
I've spent a couple of days looking at a few parts of the library. I'll discuss at the Unicode stuff first.
Boost gained support for Unicode functionality when Boost.Locale was accepted almost a decade ago, but I was not particularly impressed by that library and have not used it in my own work. Does the current proposal do better? I think it does, and though it's not perfect I think it would be worth accepting but subject to a concern about copyright.
See another thread for the copyright discussion. Basically, if I understand correctly, Zach's code includes tables that are derived from Unicode data and the licence for that data requires a copyright attribution. The Boost licence does not require attribution and I support that. So if I've understood the situation correctly - and I am not a lawyer! - this means that sadly Boost must REJECT the affected code.
Agreed.
I'll say some more about the other aspects of the Unicode code, just in case the copyright issue turns out to be surmountable.
Me too. I've sent an email to someone at ICU, as alluded to earlier in this thread.
I've tried just a couple of things. Firstly, I tried the UTF-8 to UTF-32 conversion. This seems to work, and the API is reasonable. Performance is not bad, but it could be better (see another thread); that could be fixed later. I think it would be worth having just a UTF library in Boost!
Could you provide the exact scenario you tried, and point me to the conversion technique that you used that was faster?
Secondly I looked at the line-splitting algorithm, i.e.
template<typename CPIter, typename Sentinel, typename Extent, typename CPExtentFunc> unspecified lines(CPIter first, Sentinel last, Extent max_extent, CPExtentFunc cp_extent, bool break_overlong_lines = true);
Again this seems to work, and the API is reasonable. Reference documentation is somewhat lacking though; in the description of lines(), the requirements for CPExtentFunc need to be described, and the "unspecified" return type needs to document that it includes members including iter and hard_line_break. I think there are probably similar omissions in much of the reference documentation.
Thanks! I think all those omissions in the reference docs are covered by the tutorial docs, but yeah, I definitely need to add explanations there.
I'm unconvinced by the use of the word "sentinel" here. A sentinel is a value with a special meaning, e.g. a terminating null; what you have here is an end-iterator. It's good that you're allowing the begin and end iterators to have different types, but I think you should name that type something like CPEndIter.
This is standard terminology, as of C++20. See http://eel.is/c++draft/ranges for numerous uses. It means end-iterator-or-other-type-that-acts-as-the-end-tag.
The implementation of line splitting seems to call its word length callback more often than necessary, i.e. when a word overflows a line, that word is subsequently measured a second time. That should be easily fixable.
Thanks, I'll revisit that.
Moving on to other parts of the proposal, I'm less happy with text::string. The first thing I noticed was the very strange decision to use int, rather than (signed?) size_t, for indexing; yes I do work with chunks of text larger than 2 GB! I would REJECT text::string for this alone.
Sure. You're in good company.
I also dislike the use of operator() to extract substrings; the meaning just isn't obvious:
auto x = y(4);
I think this is true because of lack of familiarity. Seeing this a few times would make it familiar pretty quickly. That being said, I'd prefer if I could write y[0:4] or y[0,4] but I can't, at least not without using some kind of strong typedef and a UDL, like y[0_something,4]. Anyway, the string types are moot at this point.
Is that the 4th character of y? Or the substring from 4 to the end? No, it's the substring from the start to 3. I'd need to add a comment to explain what it's doing. But actually there's another problem with that line: x is a non-owning view of that substring. I think that methods that return a view (or span, etc.) should have a name that makes that clear, e.g. "subview". (This wouldn't have been a problem before auto.) This came up in the static_string review, where the substr() method was originally proposed to return a view; that was renamed subview() with substr() returning a copied string.
Again, this is a question of familiarity. If std::string did not have substr() (as text::string does not) there would not be confusion. Silent memory allocations (substr() semantics) are a larger footgun than dangling references (op() semantics) in the age of ASan, IMO.
I don't mind that the string isn't templated on the char type, but I do miss the allocator support; I've used strings with arena allocators (for performance) and with shared memory allocators. I know allocator support is a lot of work but it is necessary, IMO.
I disagree. Allocators are a poor design, in that they are too general. Consider std::vector. An insert operation has different optimal implementations depending on whether the vector is a fixed capacity, inline-storage type or a heap-allocated type. An allocator changed the allocation strategy, but leaves all the other code that often depends on the allocation semantics unchanged. A better design is to have a separate templates for inline storage (e.g. boost::container::static_vector), sbo-optimized heap storage (e.g. boost::container::small_vector), and heap storage (e.g. std::vector). Many other libraries have come to the same conclusion, including FB Folly, LLVM's internal lib, and more.
text::text seems to be the obvious combination of text::string with unicode functionality, and I've not looked at that.
If you have time, please take a look. text::text is almost certainly soon to be reimplemented in terms of std::string.
I have had a look at the trie; I have sometimes wondered to what extent a trie can be implemented in C++ that works as much as possible like a standard container so this is interesting.
The docs should mention briefly why trie is here, i.e. where it is used, just in anticipation of that obvious question from users.
trie::operator[] returns an "optional ref". But it's not like a std::optional; as well as operator* it has conversion operators to access the value. Zach notes that this is "wonky" when the value type is bool, but it also seems to break for me when it's int:
boost::trie::trie_map<std::string, std::string> counts; auto x = counts["Hello"]; if (x) std::cout << "present\n"; else std::cout << "absent\n"; // Prints absent, OK.
boost::trie::trie_map<std::string, int> int_counts; // Now int. auto y = int_counts["Hello"]; if (y) std::cout << "present\n"; else std::cout << "absent\n";
// ../text/include/boost/text/trie.hpp:78: // boost::trie::optional_ref<T, Const>::operator T&() & // [with T = int; bool Const = false]: Assertion `t_' failed.
Thanks! I was missing an operator bool() overload for *mutable* lvalue ref. It works when I add that.
operator T&() is being called by if(y), rather than operator bool(). There aren't has_value() and value() methods that I can use to work around this.
Like a few other things in this proposal, Zach has implemented trie::operator[] how he wishes the standard library worked. As another example, dereferencing a trie_map iterator doesn't return a pair<KEY,VALUE> as a normal associative container would, but rather a struct with members called key and value.
At worst, this will hit problems that the standard solutions avoid (as above). At best, it will make the library more difficult to learn for users who are familiar with the standard library (including its problems),
I think Boost should be a place to experiment with approaches to addressing those problems, rather than propagating them. I find .first and .second to be horrible names for the key-value pair used to represent the mapping of a std::map. I don't think it's appropriate for new code. Code that uses structured bindings should not be affected.
and make generic code less likely to work with it.
As a test I wrote a simple program to count word frequencies:
using counts_t = std::map<std::string,int>; counts_t counts;
while (std::cin.good()) { std::string s; std::cin >> s; counts[s]++; }
for (auto i: counts) { std::cout << i.first << " = " << i.second << "\n"; }
It would be great if I could just change it to:
using counts_t = boost::trie_map<std::string,int>;
but I can't do that; I need to change things:
struct count { int value = 0; }; // Wrap the int. boost::trie::trie_map<std::string, count> counts;
while (std::cin.good()) { std::string s; std::cin >> s;
// operator[] doesn't insert, so: auto r = counts[s]; if (r) r->value++; else counts.insert(s,count{1}); }
for (auto i: counts) { std::cout << i.key << " = " << i.value.value << "\n"; }
That works; how about performance? Trie should perform well when the stored strings frequently share common prefixes, so I've tested with a large web server log file; this has timestamps and URLs which share common prefixes. The results:
std::map: run time 1.2 seconds, memory 1007 pages trie_map: run time 1.8 seconds, memory 4868 pages
So that's rather disappointing.
First, it's important to note that frequent common prefixes don't necessarily make the trie more lookup-efficient. If you are looking for "fred" and you actually have a trie containing f -> r- > e -> d as nodes, you'll visit all 4 nodes as you look up "fred". For a string of 10 chars, you'll visit all 10 nodes for a match, etc. For an R/B tree like std::map, you'll only visit log(N) nodes, where N is the size() of the map. This is unlikely to be nearly as large as the longest string for many data sets. The docs explicitly state that the trie_map and trie_set templates are there for convenience, and that if you care about performance you should use trie. Moreover, if you're trying to eat the longest token that matches some list of tokens encoded in a trie, the trie will be a lot faster, since you don't have to redo a log(N) search M times, where M is the length of the token you're eating. If you're using a trie outside of its niche, you should not expect ideal outcomes. You should use the thing that actually delivers ideal outcomes in that case you care about.
Some general comments:
Some people care about whether libraries are header-only or not, and the docs should mention this, and possibly say which functionality can be expected to work header-only.
I got quite a few missing field initializer and unused parameter warnings; please fix these if at all possible. (gcc 6.3, -W -Wall). I think you should re-order the docs so that the "Hello World" example is much earlier.
Some functions clearly need to embed large Unicode tables; it would be useful if the docs mentioned roughly how much the executable size is increased by.
Some source files are executable.
Conclusions:
I fear that the only part of this that Boost should accept is the UTF transcoding. The implementation is not perfect but we could work on that. The other Unicode stuff would be good to have but the copyright issue is a blocker.
Zach