This is a review for the Boost.Text library, submitted a day late (but hopefully not a dollar short! (U.S. colloquialism, don't mind me!)).

The library has 3 somewhat related but (somewhat?) separable sub-libraries. In "building block" order, these are:

- A string layer (a new std::string)
- A unicode layer (algorithms and data)
- A text layer (string, but if it gave a single flying crap about Unicode)

There are 4 (3?) types to care about in the string layer: unencoded_rope, string, segmented_vector (and string_builder...?). There are not many new types in the unicode layer, save for things that help the algorithms/data do their thing well and report findings from algorithm calls. There are 2 types to care about in the text layer: text and rope. We will start with the lowest building block layer.

This will likely not be a typical review: most others have already called out the things that have failed, in the documentation and elsewhere, so I will focus primarily on the utility and design of the layers and what they can bring to the table.

====== Layer 0 ======

[[string]]

It gets rid of char_traits (yay!) but then also throws out the allocator (ewwww!).

...

That's it. This type does not affect my view of the library because it can be (mostly) safely ignored. It can be nuked from orbit and nothing of value will be lost. I would actually recommend std::string be used underneath, because why do this to the ecosystem for the (N+1)th time?

[[unencoded_rope]]
[[segmented_vector]]

These two data structures are FAR more spicy and incredibly interesting. They provide different guarantees of insertion and erasure complexity. Of course, neither has allocators built in, so I can't really customize how this works without hijacking global new and delete, but Zach has made clear his distaste for the allocator world and, having recently built several standard containers, I don't blame him.
Nevertheless, both of these data structures are being talked about together because they provide the same type of functionality: unencoded_rope is just specialized for char storage. Notably, segmented_vector has an insert for (iterator, value_type), while unencoded_rope seemed to be missing that and only wanted to deal with "strings"/ranges rather than single elements. This made applying my fun text-wrapper on unencoded_rope mildly annoying, because single-insert was just not present:

    phd::text::basic_text<utf8, nfd, boost::text::unencoded_rope> wee;
    wee.insert(u8'A'); // kabloosh!

Then again, my "shortcuts" for single insertion are honestly a waste of space, because I can just turn that into a range of size 1 and use less of the "required" SequenceContainer (https://en.cppreference.com/w/cpp/named_req/SequenceContainer) bits 'n' bobs anyway. So no real harm, no actual foul!

Running my tests with an encoding slapped on top of the unencoded rope or the segmented vector worked, which meant I could get a different storage policy with the nfd normalization form and the encoding of my choice. (Well, I only tested utf8/16/32, one single-byte encoding, and the current execution character set, which was just utf8 anyways, so that's not really that exhaustive, is it?)

This layer has immense value. Keep it and ship it; great job, Zach!

[[string_builder]]

I think this is vestigial. So, uh, it doesn't really affect the review, and I don't care for it?

======= Layer 1 =======

Yes.

...

That's it. That's literally it: ship it. Goddamn, ship this layer like your favorite movie couple. This is what we need. This is what we crave. It's ICU, except if ICU went to study under Stepanov, Lee and Plauger instead of Gosling, Sheridan and Naughton. No complaints, no problems: having this layer makes this library ABSOLUTELY worth it, 210%. There are even special normalize-in-place algorithms for strings, which can save on performance.
You can implement your own Unicode text-aware layer on top of this stuff. It provides a robust set of algorithms and normalization forms (hell yeah!) and makes every second that this library is not in Boost a tragedy. It passed the necessary tests on my machine despite taking an age, but that's more because the generated Unicode tests are a doozy.

Speaking of "implement your own Unicode text-aware layer..."

====== Layer 2 ======

This is the layer I am -- on a library design level and a personal philosophy level -- the most opposed to. But my answer is still to accept it (well, modulo it being based on the above string type. Please just use std::string).

[[ text ]]
[[ rope ]]

While these containers can be evaluated individually, other reviews have already picked at them a great deal, so I won't bother. There was some grumbling about how a rope-like data structure is not interesting enough to be included, and I will just quietly wave that off as "my use case is the only use case that matters and therefore I don't care about other people's invariants or needs".

There are many implicitly (and explicitly) stated and maintained opinions in this layer:

- UTF-8 is the way, truth, and life.
- Unicode is the only encoding that matters ever, for all time, in perpetuity.
- Allocators are shit!
- NFC is probably the best we can do here for varying reasons.
- Who needs non-contiguous storage anyways?
- Who needs non-SBO storage, anyways?

These are all opinions, many of which are present in the design of the text container. And they allow this text container to ship. But that lack of flexibility -- while okay for Qt or Apple's CoreText or whatever other platform-specific hoo-ha you want to get involved with -- does not help everyone else. In fact, it cuts some users off: more than one person during Meeting C++ spoke to me of Boost.Text and said it could not meet their needs because it maintained encoding or normalization invariants that did not interoperate with their existing system.
Storage is also an issue: while "I use boost::text::string underneath" is fine and dandy, not many systems (next to none, maybe?) are going to speak in "text" or its related string type. They will want the underlying container to speak to. For duck-type purposes, it works. But for everyone else, it fails. Since the string layer uses an `int` for its size and capacity, it is lopsidedly incompatible with existing STL implementations of string, to the point that a reinterpret_cast -- however evil -- is not suitable for transporting a reference-without-copy into these APIs. God bless string_view and its friends, because they allow us to at least continue to talk to some APIs, since the text type guarantees contiguous storage.

This means that at the boundaries of an application -- or even as a plugin to a wider ecosystem -- I am paying a (sometimes obscene) cost to interoperate between std::string/llvm::SmallString/unicode_code_unit_sequence and all the other things people have developed to sit between them and what they believe their string needs are. And while it is whack that so many of these classes exist, they do.

That lack of interoperability -- and once again, the lack of an allocator template parameter -- hampers this library from COMPLETELY DOMINATING the string scene. It will always be used as a solution, maybe even 80% of the time. Those seeking more will have to figure out how to build their own UTF16 containers, or their own special-encoded containers, with very little support from the text library (save for some transcoding functions they can leverage, but only from specific Unicode encodings).

Onto the good news: the text and rope classes work like I expect them to. They pass my tests. A+, great job: it keeps my text in utf8 and the prescribed normalization form!

Despite the length of my previous critique, which basically amounts to "who died and made you King of my string layout and memory allocation?", this layer and the library should still be accepted.
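
To make the storage/interoperability complaint above a bit more concrete, here is a minimal sketch of the boundary problem. Everything in it is an assumption made for illustration: `int_sized_string` is a hypothetical stand-in that only mimics the two properties discussed here (an `int` size and contiguous storage). It is NOT boost::text::string, and the two `count_bytes*` functions are made-up examples of "an API that speaks string_view" versus "an API that speaks std::string".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <limits>
#include <memory>
#include <string>
#include <string_view>

// HYPOTHETICAL stand-in for a string whose size type is `int`.
// Not the real boost::text::string; it only models "int size,
// contiguous storage" as described in the review text.
class int_sized_string {
public:
    explicit int_sized_string(std::string_view s)
        : size_(static_cast<int>(s.size())),
          data_(std::make_unique<char[]>(s.size() + 1)) {
        // An `int` size cannot represent anything longer than INT_MAX.
        assert(s.size() <= static_cast<std::size_t>(
                               std::numeric_limits<int>::max()));
        std::memcpy(data_.get(), s.data(), s.size());
        data_[s.size()] = '\0';
    }

    const char* data() const noexcept { return data_.get(); }
    int size() const noexcept { return size_; }

    // Contiguous storage is what makes a zero-copy view possible.
    operator std::string_view() const noexcept {
        return std::string_view(data_.get(),
                                static_cast<std::size_t>(size_));
    }

private:
    int size_;
    std::unique_ptr<char[]> data_;
};

// An API that speaks string_view: the stand-in converts without copying.
inline std::size_t count_bytes(std::string_view sv) { return sv.size(); }

// An API that speaks std::string: the caller has to build a std::string,
// which copies the bytes at the boundary.
inline std::size_t count_bytes_owned(const std::string& s) { return s.size(); }

// Half of INT_MAX, which is the figure quoted for the size limit.
// With a 32-bit int this is 1073741823, i.e. just under 1 GiB.
constexpr int half_int_max = std::numeric_limits<int>::max() / 2;
```

With this sketch, `count_bytes(s)` costs nothing extra for an `int_sized_string s`, while `count_bytes_owned(std::string(std::string_view(s)))` pays for a full copy. That copy is the "(sometimes obscene) cost" at the boundary.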
============ Okay, Seriously? =============

Yep.

See, the problem right now with C++ -- and the standard in general -- is that we like to wait for something to bake for an eternity, often long after it's useful and necessary for the end user. In C++11 we introduced a "codecvt"-style thing called "std::wstring_convert", whose sole purpose was transcoding, plus or minus some platform shenanigans. It was implemented poorly on almost all platforms, its performance is hot garbage (https://github.com/ThePhD/sol2/issues/571), and it generally was a bug-ridden mess.

But it shipped.

What we did when we both deprecated and removed std::wstring_convert and its related facets is we took a real pain point in the C++ community and decided to make it far worse than it already was. See, C++ -- and C++11 -- were steaming piles of dogpoo when it came to Unicode (https://stackoverflow.com/a/17106065). So when wstring_convert came on the scene, it was a breath of fresh air. Yeah, the performance is garbage, yes the interface is trash, yes it hasn't learned anything from Stepanov's fantastic work, but it was there. It was workable. And it was standard.

And the Committee ripped it out of the user's hands.

Boost.Text, for however many extremely opinionated decisions it makes that end up excluding certain parts of the C++ ecosystem, provides a SORELY needed relief for the majority of the C++ community, who have been struggling for the tiniest bit of a text solution. So even if the storage has a mandated encoding, a strict normalization form is given, and everything else costs you a pound of flesh to build yourself, the whole point is that there is a default, and it is a pretty good default. This is something that cannot be overstated in the slightest; we have nothing -- and I mean, N O T H I N G -- that reflects a good C++ library for Unicode. Even if you do not like Zach's decisions, other people can pick up Zach's container types and run with them for quite a while.
Sure, the 7x performance gains I got in my last job using solely allocators are impossible with Boost.Text! But Layer 1 exists: I can leverage well-done Unicode algorithms to do the job I need to, even if it is not as convenient and pre-packaged as I would like it to be.

This is not only important for the ecosystem at large, but for the Boost Community. For a long time people have wondered if Boost will lead the charge towards a better, brighter future by solving the problems that users face the most, or if it would fade into compatibility-library obscurity and be repeatedly reviled for its special build needs and required setup over its standard library equivalents. Boost.Text is one of many libraries I *expect* to see land in Boost to solve critical problems, to be iterated and shipped towards the wider C++ ecosystem, and to have an impact that most library developers would only dream of.

========== In Conclusion ==========

Just one really big thing for me:

- Use std::string underneath. `int` is not a good size type. People work with strings larger than 1 GB (INT_MAX / 2, as reported by the string implementation).

Other people commented on the other fixes I would care about and most of those have already been noted, thanks! Other than that...

Please accept Boost.Text for inclusion in the next available version of Boost and continue to work towards the end of our collective 40-year string nightmare. We can sort out COMPLETE DOMINATION of the design space a little later, since this design is -- thankfully -- not one that is immune to source-backwards-compatible improvements.