Re: [boost] Boost.Text review

22 Jun 2020

      On Thu, Jun 18, 2020 at 1:22 PM Phil Endecott via Boost
<boost@lists.boost.org> wrote:
...
This is my review of Zach Laine's proposed Boost.Text.
Thanks for the review, Phil.
...
Zach is offering several things here: Unicode, a "better string",
a trie, and more. I'm pleased that these components are all
individually exposed to users and are independent of each other,
rather than some being hidden as implementation details.
I've spent a couple of days looking at a few parts of the library.
I'll discuss at the Unicode stuff first.
Boost gained support for Unicode functionality when Boost.Locale
was accepted almost a decade ago, but I was not particularly
impressed by that library and have not used it in my own work.
Does the current proposal do better? I think it does, and
though it's not perfect I think it would be worth accepting but
subject to a concern about copyright.
See another thread for the copyright discussion. Basically,
if I understand correctly, Zach's code includes tables that are
derived from Unicode data and the licence for that data requires
a copyright attribution. The Boost licence does not require
attribution and I support that. So if I've understood the
situation correctly - and I am not a lawyer! - this means that
sadly Boost must REJECT the affected code.
Agreed.
...
I'll say some more about the other aspects of the Unicode code,
just in case the copyright issue turns out to be surmountable.
Me too.  I've sent an email to someone at ICU, as alluded to earlier
in this thread.
...
I've tried just a couple of things. Firstly, I tried the UTF-8
to UTF-32 conversion. This seems to work, and the API is
reasonable. Performance is not bad, but it could be better
(see another thread); that could be fixed later. I think it
would be worth having just a UTF library in Boost!
Could you provide the exact scenario you tried, and point me to the
conversion technique that you used that was faster?
...
Secondly I looked at the line-splitting algorithm, i.e.
template<typename CPIter, typename Sentinel, typename Extent,
         typename CPExtentFunc>
  unspecified lines(CPIter first, Sentinel last, Extent max_extent,
                    CPExtentFunc cp_extent, bool break_overlong_lines = true);
Again this seems to work, and the API is reasonable. Reference
documentation is somewhat lacking though; in the description of
lines(), the requirements for CPExtentFunc need to be described,
and the "unspecified" return type needs to document that it includes
members including iter and hard_line_break. I think there are
probably similar omissions in much of the reference documentation.
Thanks!  I think all those omissions in the reference docs are covered
by the tutorial docs, but yeah, I definitely need to add explanations
there.
...
I'm unconvinced by the use of the word "sentinel" here. A sentinel
is a value with a special meaning, e.g. a terminating null; what
you have here is an end-iterator. It's good that you're allowing
the begin and end iterators to have different types, but I think
you should name that type something like CPEndIter.
This is standard terminology, as of C++20.  See
http://eel.is/c++draft/ranges for numerous uses.  It means
end-iterator-or-other-type-that-acts-as-the-end-tag.
...
The implementation of line splitting seems to call its word
length callback more often than necessary, i.e. when a word
overflows a line, that word is subsequently measured a second
time. That should be easily fixable.
Thanks, I'll revisit that.
...
Moving on to other parts of the proposal, I'm less happy with
text::string. The first thing I noticed was the very strange
decision to use int, rather than (signed?) size_t, for indexing;
yes I do work with chunks of text larger than 2 GB! I would
REJECT text::string for this alone.
Sure.  You're in good company.
...
I also dislike the use of operator() to extract substrings; the
meaning just isn't obvious:
auto x = y(4);
I think this is true because of lack of familiarity.  Seeing this a
few times would make it familiar pretty quickly.  That being said, I'd
prefer if I could write y[0:4] or y[0,4] but I can't, at least not
without using some kind of strong typedef and a UDL, like
y[0_something,4].  Anyway, the string types are moot at this point.
...
Is that the 4th character of y? Or the substring from 4 to the
end? No, it's the substring from the start to 3. I'd need to
add a comment to explain what it's doing. But actually there's
another problem with that line: x is a non-owning view of that
substring. I think that methods that return a view (or span,
etc.) should have a name that makes that clear, e.g. "subview".
(This wouldn't have been a problem before auto.)  This came up in
the static_string review, where the substr() method was originally
proposed to return a view; that was renamed subview() with
substr() returning a copied string.
Again, this is a question of familiarity.  If std::string did not have
substr() (as text::string does not) there would not be confusion.
Silent memory allocations (substr() semantics) are a larger footgun
than dangling references (op() semantics) in the age of ASan, IMO.
...
I don't mind that the string isn't templated on the char type,
but I do miss the allocator support; I've used strings with arena
allocators (for performance) and with shared memory allocators.
I know allocator support is a lot of work but it is necessary, IMO.
I disagree.  Allocators are a poor design, in that they are too
general.  Consider std::vector.  An insert operation has different
optimal implementations depending on whether the vector is a fixed
capacity, inline-storage type or a heap-allocated type.  An allocator
changed the allocation strategy, but leaves all the other code that
often depends on the allocation semantics unchanged.  A better design
is to have a separate templates for inline storage (e.g.
boost::container::static_vector), sbo-optimized heap storage (e.g.
boost::container::small_vector), and heap storage (e.g. std::vector).
Many other libraries have come to the same conclusion, including FB
Folly, LLVM's internal lib, and more.
...
text::text seems to be the obvious combination of text::string
with unicode functionality, and I've not looked at that.
If you have time, please take a look.  text::text is almost certainly
soon to be reimplemented in terms of std::string.
...
I have had a look at the trie; I have sometimes wondered to what
extent a trie can be implemented in C++ that works as much as
possible like a standard container so this is interesting.
The docs should mention briefly why trie is here, i.e. where it
is used, just in anticipation of that obvious question from users.
trie::operator[] returns an "optional ref". But it's not like
a std::optional; as well as operator* it has conversion operators
to access the value. Zach notes that this is "wonky" when the
value type is bool, but it also seems to break for me when it's
int:
boost::trie::trie_map<std::string, std::string> counts;
auto x = counts["Hello"];
if (x) std::cout << "present\n"; else std::cout << "absent\n";
// Prints absent, OK.
boost::trie::trie_map<std::string, int> int_counts;  // Now int.
auto y = int_counts["Hello"];
if (y) std::cout << "present\n"; else std::cout << "absent\n";
// ../text/include/boost/text/trie.hpp:78:
// boost::trie::optional_ref<T, Const>::operator T&() &
// [with T = int; bool Const = false]: Assertion `t_' failed.
Thanks!  I was missing an operator bool() overload for *mutable*
lvalue ref.  It works when I add that.
...
operator T&() is being called by if(y), rather than
operator bool(). There aren't has_value() and value() methods
that I can use to work around this.
Like a few other things in this proposal, Zach has implemented
trie::operator[] how he wishes the standard library worked.
As another example, dereferencing a trie_map iterator doesn't
return a pair<KEY,VALUE> as a normal associative container
would, but rather a struct with members called key and value.
At worst, this will hit problems that the standard solutions
avoid (as above). At best, it will make the library more
difficult to learn for users who are familiar with the standard
library (including its problems),
I think Boost should be a place to experiment with approaches to
addressing those problems, rather than propagating them.  I find
.first and .second to be horrible names for the key-value pair used to
represent the mapping of a std::map.  I don't think it's appropriate
for new code.  Code that uses structured bindings should not be
affected.
...
and make generic code less
likely to work with it.
As a test I wrote a simple program to count word frequencies:
using counts_t = std::map<std::string,int>;
counts_t counts;
while (std::cin.good()) {
  std::string s;
  std::cin >> s;
  counts[s]++;
}
for (auto i: counts) {
  std::cout << i.first << " = " << i.second << "\n";
}
It would be great if I could just change it to:
using counts_t = boost::trie_map<std::string,int>;
but I can't do that; I need to change things:
struct count { int value = 0; };  // Wrap the int.
boost::trie::trie_map<std::string, count> counts;
while (std::cin.good()) {
  std::string s;
  std::cin >> s;
// operator[] doesn't insert, so:
  auto r = counts[s];
  if (r) r->value++;
  else counts.insert(s,count{1});
}
for (auto i: counts) {
  std::cout << i.key << " = " << i.value.value << "\n";
}
That works; how about performance? Trie should perform
well when the stored strings frequently share common prefixes,
so I've tested with a large web server log file; this has
timestamps and URLs which share common prefixes. The results:
std::map: run time 1.2 seconds, memory 1007 pages
trie_map: run time 1.8 seconds, memory 4868 pages
So that's rather disappointing.
First, it's important to note that frequent common prefixes don't
necessarily make the trie more lookup-efficient.  If you are looking
for "fred" and you actually have a trie containing f -> r- > e -> d as
nodes, you'll visit all 4 nodes as you look up "fred".  For a string
of 10 chars, you'll visit all 10 nodes for a match, etc.  For an R/B
tree like std::map, you'll only visit log(N) nodes, where N is the
size() of the map.  This is unlikely to be nearly as large as the
longest string for many data sets.

The docs explicitly state that the trie_map and trie_set templates are
there for convenience, and that if you care about performance you
should use trie.  Moreover, if you're trying to eat the longest token
that matches some list of tokens encoded in a trie, the trie will be a
lot faster, since you don't have to redo a log(N) search M times,
where M is the length of the token you're eating.

If you're using a trie outside of its niche, you should not expect
ideal outcomes.  You should use the thing that actually delivers ideal
outcomes in that case you care about.
...
Some general comments:
Some people care about whether libraries are header-only or not,
and the docs should mention this, and possibly say which
functionality can be expected to work header-only.
I got quite a few missing field initializer and unused parameter
warnings; please fix these if at all possible. (gcc 6.3, -W -Wall).
I think you should re-order the docs so that the "Hello World"
example is much earlier.
Some functions clearly need to embed large Unicode tables; it
would be useful if the docs mentioned roughly how much the
executable size is increased by.
Some source files are executable.
Conclusions:
I fear that the only part of this that Boost should accept is
the UTF transcoding. The implementation is not perfect but we
could work on that. The other Unicode stuff would be good to
have but the copyright issue is a blocker.
Zach