
This is my review of Zach Laine's proposed Boost.Text. Zach is offering several things here: Unicode, a "better string", a trie, and more. I'm pleased that these components are all individually exposed to users and are independent of each other, rather than some being hidden as implementation details. I've spent a couple of days looking at a few parts of the library. I'll discuss at the Unicode stuff first. Boost gained support for Unicode functionality when Boost.Locale was accepted almost a decade ago, but I was not particularly impressed by that library and have not used it in my own work. Does the current proposal do better? I think it does, and though it's not perfect I think it would be worth accepting but subject to a concern about copyright. See another thread for the copyright discussion. Basically, if I understand correctly, Zach's code includes tables that are derived from Unicode data and the licence for that data requires a copyright attribution. The Boost licence does not require attribution and I support that. So if I've understood the situation correctly - and I am not a lawyer! - this means that sadly Boost must REJECT the affected code. I'll say some more about the other aspects of the Unicode code, just in case the copyright issue turns out to be surmountable. I've tried just a couple of things. Firstly, I tried the UTF-8 to UTF-32 conversion. This seems to work, and the API is reasonable. Performance is not bad, but it could be better (see another thread); that could be fixed later. I think it would be worth having just a UTF library in Boost! Secondly I looked at the line-splitting algorithm, i.e. template<typename CPIter, typename Sentinel, typename Extent, typename CPExtentFunc> unspecified lines(CPIter first, Sentinel last, Extent max_extent, CPExtentFunc cp_extent, bool break_overlong_lines = true); Again this seems to work, and the API is reasonable. Reference documentation is somewhat lacking though; in the description of lines(), the requirements for CPExtentFunc need to be described, and the "unspecified" return type needs to document that it includes members including iter and hard_line_break. I think there are probably similar omissions in much of the reference documentation. I'm unconvinced by the use of the word "sentinel" here. A sentinel is a value with a special meaning, e.g. a terminating null; what you have here is an end-iterator. It's good that you're allowing the begin and end iterators to have different types, but I think you should name that type something like CPEndIter. The implementation of line splitting seems to call its word length callback more often than necessary, i.e. when a word overflows a line, that word is subsequently measured a second time. That should be easily fixable. Moving on to other parts of the proposal, I'm less happy with text::string. The first thing I noticed was the very strange decision to use int, rather than (signed?) size_t, for indexing; yes I do work with chunks of text larger than 2 GB! I would REJECT text::string for this alone. I also dislike the use of operator() to extract substrings; the meaning just isn't obvious: auto x = y(4); Is that the 4th character of y? Or the substring from 4 to the end? No, it's the substring from the start to 3. I'd need to add a comment to explain what it's doing. But actually there's another problem with that line: x is a non-owning view of that substring. I think that methods that return a view (or span, etc.) should have a name that makes that clear, e.g. "subview". (This wouldn't have been a problem before auto.) This came up in the static_string review, where the substr() method was originally proposed to return a view; that was renamed subview() with substr() returning a copied string. I don't mind that the string isn't templated on the char type, but I do miss the allocator support; I've used strings with arena allocators (for performance) and with shared memory allocators. I know allocator support is a lot of work but it is necessary, IMO. text::text seems to be the obvious combination of text::string with unicode functionality, and I've not looked at that. I have had a look at the trie; I have sometimes wondered to what extent a trie can be implemented in C++ that works as much as possible like a standard container so this is interesting. The docs should mention briefly why trie is here, i.e. where it is used, just in anticipation of that obvious question from users. trie::operator[] returns an "optional ref". But it's not like a std::optional; as well as operator* it has conversion operators to access the value. Zach notes that this is "wonky" when the value type is bool, but it also seems to break for me when it's int: boost::trie::trie_map<std::string, std::string> counts; auto x = counts["Hello"]; if (x) std::cout << "present\n"; else std::cout << "absent\n"; // Prints absent, OK. boost::trie::trie_map<std::string, int> int_counts; // Now int. auto y = int_counts["Hello"]; if (y) std::cout << "present\n"; else std::cout << "absent\n"; // ../text/include/boost/text/trie.hpp:78: // boost::trie::optional_ref<T, Const>::operator T&() & // [with T = int; bool Const = false]: Assertion `t_' failed. operator T&() is being called by if(y), rather than operator bool(). There aren't has_value() and value() methods that I can use to work around this. Like a few other things in this proposal, Zach has implemented trie::operator[] how he wishes the standard library worked. As another example, dereferencing a trie_map iterator doesn't return a pair<KEY,VALUE> as a normal associative container would, but rather a struct with members called key and value. At worst, this will hit problems that the standard solutions avoid (as above). At best, it will make the library more difficult to learn for users who are familiar with the standard library (including its problems), and make generic code less likely to work with it. As a test I wrote a simple program to count word frequencies: using counts_t = std::map<std::string,int>; counts_t counts; while (std::cin.good()) { std::string s; std::cin >> s; counts[s]++; } for (auto i: counts) { std::cout << i.first << " = " << i.second << "\n"; } It would be great if I could just change it to: using counts_t = boost::trie_map<std::string,int>; but I can't do that; I need to change things: struct count { int value = 0; }; // Wrap the int. boost::trie::trie_map<std::string, count> counts; while (std::cin.good()) { std::string s; std::cin >> s; // operator[] doesn't insert, so: auto r = counts[s]; if (r) r->value++; else counts.insert(s,count{1}); } for (auto i: counts) { std::cout << i.key << " = " << i.value.value << "\n"; } That works; how about performance? Trie should perform well when the stored strings frequently share common prefixes, so I've tested with a large web server log file; this has timestamps and URLs which share common prefixes. The results: std::map: run time 1.2 seconds, memory 1007 pages trie_map: run time 1.8 seconds, memory 4868 pages So that's rather disappointing. Some general comments: Some people care about whether libraries are header-only or not, and the docs should mention this, and possibly say which functionality can be expected to work header-only. I got quite a few missing field initializer and unused parameter warnings; please fix these if at all possible. (gcc 6.3, -W -Wall). I think you should re-order the docs so that the "Hello World" example is much earlier. Some functions clearly need to embed large Unicode tables; it would be useful if the docs mentioned roughly how much the executable size is increased by. Some source files are executable. Conclusions: I fear that the only part of this that Boost should accept is the UTF transcoding. The implementation is not perfect but we could work on that. The other Unicode stuff would be good to have but the copyright issue is a blocker. Regards, Phil.

On Thu, Jun 18, 2020 at 11:22 AM Phil Endecott via Boost <boost@lists.boost.org> wrote:
if I understand correctly, Zach's code includes tables that are derived from Unicode data and the licence for that data requires a copyright attribution.
Could we ask the owner of the copyright to dual-license the data under the BSL? That might be a cheap and easy solution. Thanks

I also am not a lawyer but know from experience that you can't copyright data, or simple expressions of it. A program is an original work. A table of facts it uses is not. Someone can write what they like in a licence but if the table can be (ideally, has been) produced manually or algorithmically from information (published in whatever form, even if copyright) that simply conveys facts you can use it. There is some risk/issue if you use the way the facts are arranged "creatively" e.g. as a graph or figure or if included in a database (that is organised "creatively" to allow use/retrieval) etc.. Also copyright does not protect an underlying algorithm - only the program that implements it. And if there is simply/only one obvious way to represent said algorithm etc it doesn't prevent one writing the same thing... Obviously a whole program composed of fragments that aren't individually copyrightable but "creatively arranged" in some original program can't then be arranged into a "new" program to form a whole that is - effectively - the original work without that copying of creative arrangement being a breach of copyright ... Also there may be a need to cite sources. and this is only polite - reasonable anyway - but that does NOT require things like inclusion of copyright notices with distributed binaries etc. Citing the source in the comments/ docs / readme other human readable content of the work (the source code) is sufficient. I have purposely not, so far, looked at the specifics of this (original source of this table or usage in proposed boost.text). I might be being naive in my assumption that this is a relatively simple expression of facts. It might be necessary for the removal of all doubt to produce a boost table and code that uses it from primary sources and descriptions (not programs) re how the table/algorithm is used/works. On Fri, 19 Jun 2020, 4:25 am Vinnie Falco via Boost, <boost@lists.boost.org> wrote:
On Thu, Jun 18, 2020 at 11:22 AM Phil Endecott via Boost <boost@lists.boost.org> wrote:
if I understand correctly, Zach's code includes tables that are derived from Unicode data and the licence for that data requires a copyright attribution.
Could we ask the owner of the copyright to dual-license the data under the BSL? That might be a cheap and easy solution.
Thanks
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

El vie., 19 jun. 2020 a las 0:45, Darryl Green via Boost (< boost@lists.boost.org>) escribió:
I also am not a lawyer but know from experience that you can't copyright data, or simple expressions of it. A program is an original work. A table of facts it uses is not. Someone can write what they like in a licence but if the table can be (ideally, has been) produced manually or algorithmically from information (published in whatever form, even if copyright) that simply conveys facts you can use it. There is some risk/issue if you use the way the facts are arranged "creatively" e.g. as a graph or figure or if included in a database (that is organised "creatively" to allow use/retrieval) etc.. Also copyright does not protect an underlying algorithm - only the program that implements it. And if there is simply/only one obvious way to represent said algorithm etc it doesn't prevent one writing the same thing... Obviously a whole program composed of fragments that aren't individually copyrightable but "creatively arranged" in some original program can't then be arranged into a "new" program to form a whole that is - effectively - the original work without that copying of creative arrangement being a breach of copyright ...
The problem is that what is copyright, what is copyrightable, and what is legal or illegal is defined in a country by country basis. Exploring a WW application/interpretation of law is extremely expensive and that is why when an org settles on a license it sticks to it to keep it simple and avoid risk of overlooking something in the process and get in big trouble somewhere.

I seem to be agreeing with you re sticking to Boost licence. Only suggesting what can be done to allow the library to do that. The concepts I described apply in many jurisdictions (I don't presume to know all of them). However for certitude the options are more limited. One is to "clean room" functionally equivalent table/code given information/primary sources needed to do so (i.e. the spec rather than any machine usable form of it). This is potentially impractical to do and maintain (including testing that all is correct and consistent with unicode spec). Another is to NOT include the unicode.org headers/data in the lib itself but to have a compile time dependency on them, some doc's re downloading or referencing them where (potentially) already installed on a system, and ref to the unicode licence, which each user is free to assess for themselves. This approach avoids the need to make the (probably invalid in most jurisdictions) "act of download is unequivocal acceptance" licence text part of a boost distribution and avoiding a need to build equivalent data independently or argue about licence validity, copyright of tabulated data necessary to conform to a spec etc.. Each Boost user can make that call without involving Boost. For this potential user at least, that is fine. Are there any other options? A lib that has to deliver conformance with a spec that has licence terms should not be rejected due to its dependency on that spec. Neither should depending on that spec cause any Boost code, doc's etc to have a different licence. I'm out of suggestions for achieving that. On Sat, 20 Jun 2020, 12:37 am Damian Vicino, <damian@vicino.com.ar> wrote:
El vie., 19 jun. 2020 a las 0:45, Darryl Green via Boost (< boost@lists.boost.org>) escribió:
I also am not a lawyer but know from experience that you can't copyright data, or simple expressions of it. A program is an original work. A table of facts it uses is not. Someone can write what they like in a licence but if the table can be (ideally, has been) produced manually or algorithmically from information (published in whatever form, even if copyright) that simply conveys facts you can use it. There is some risk/issue if you use the way the facts are arranged "creatively" e.g. as a graph or figure or if included in a database (that is organised "creatively" to allow use/retrieval) etc.. Also copyright does not protect an underlying algorithm - only the program that implements it. And if there is simply/only one obvious way to represent said algorithm etc it doesn't prevent one writing the same thing... Obviously a whole program composed of fragments that aren't individually copyrightable but "creatively arranged" in some original program can't then be arranged into a "new" program to form a whole that is - effectively - the original work without that copying of creative arrangement being a breach of copyright ...
The problem is that what is copyright, what is copyrightable, and what is legal or illegal is defined in a country by country basis. Exploring a WW application/interpretation of law is extremely expensive and that is why when an org settles on a license it sticks to it to keep it simple and avoid risk of overlooking something in the process and get in big trouble somewhere.

On Thu, Jun 18, 2020 at 1:25 PM Vinnie Falco via Boost <boost@lists.boost.org> wrote:
On Thu, Jun 18, 2020 at 11:22 AM Phil Endecott via Boost <boost@lists.boost.org> wrote:
if I understand correctly, Zach's code includes tables that are derived from Unicode data and the licence for that data requires a copyright attribution.
Could we ask the owner of the copyright to dual-license the data under the BSL? That might be a cheap and easy solution.
Maybe? I'll email the one ICU developer I've ever spoken to, and see if he can direct me to the right person to ask about this. Zach

On Sun, 21 Jun 2020, 7:42 am Zach Laine via Boost, <boost@lists.boost.org> wrote:
On Thu, Jun 18, 2020 at 1:25 PM Vinnie Falco via Boost <boost@lists.boost.org> wrote:
On Thu, Jun 18, 2020 at 11:22 AM Phil Endecott via Boost <boost@lists.boost.org> wrote:
if I understand correctly, Zach's code includes tables that are derived from Unicode data and the licence for that data requires a copyright attribution.
Could we ask the owner of the copyright to dual-license the data under the BSL? That might be a cheap and easy solution.
Maybe? I'll email the one ICU developer I've ever spoken to, and see if he can direct me to the right person to ask about this.
The use that was made of the .txt data files to extract the information needed to produce headers that are part of Boost.Text is in my opinion fine. Boost.Text users never use or need access to these .txt files. If one takes the contrary view that the licence of these files is something that the user of Boost.Text needs to care about, then be aware that it is unlikely any single person or entity can re-license these files as I believe they may have various contributors/licences. I would try to avoid pulling on that string for fear of the tangle that is likely at the end of it. That said, it might be possible to ask ICU for explicit authorisation (to the extent they alone can give it) for the use of the .txt files as a source of constant data required to process unicode in accordance with the standard without requiring that headers in Boost.Text that embed those constants, for that purpose, be licenced under the ICU licence. One reasonably expects that was their intent in producing these files, but it doesn't hurt to ask. With or without that confirmation it would be good practice to cite the source of the DATA values (not code) in Boost.Text documentation and/or in the headers that embed the constants.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On Thu, 18 Jun 2020 at 13:22, Phil Endecott via Boost <boost@lists.boost.org> wrote:
I have had a look at the trie; I have sometimes wondered to what extent a trie can be implemented in C++ that works as much as possible like a standard container so this is interesting.
https://github.com/ned14/nedtries : no need to start from scratch, I have seen the author's name being mentioned here and there, one in a while. Explosive development without reflection, design and consolidation, amounts to anarchy. degski

On 19/06/2020 03:01, degski via Boost wrote:
On Thu, 18 Jun 2020 at 13:22, Phil Endecott via Boost <boost@lists.boost.org> wrote:
I have had a look at the trie; I have sometimes wondered to what extent a trie can be implemented in C++ that works as much as possible like a standard container so this is interesting.
https://github.com/ned14/nedtries : no need to start from scratch, I have seen the author's name being mentioned here and there, one in a while. Explosive development without reflection, design and consolidation, amounts to anarchy.
Alas nedtries is a *bitwise* trie implementation, not a character based one. A bitwise trie indexing algorithm is superb for associative containers where the key is a bunch of bits small enough to allow the CPU's bit scan reverse opcode to work on them (i.e. as an unordered_map or map replacement), but I'm not sure that that is directly useful for UTF-8 parsing. Niall

On Thu, Jun 18, 2020 at 1:22 PM Phil Endecott via Boost <boost@lists.boost.org> wrote:
This is my review of Zach Laine's proposed Boost.Text.
Thanks for the review, Phil.
Zach is offering several things here: Unicode, a "better string", a trie, and more. I'm pleased that these components are all individually exposed to users and are independent of each other, rather than some being hidden as implementation details.
I've spent a couple of days looking at a few parts of the library. I'll discuss at the Unicode stuff first.
Boost gained support for Unicode functionality when Boost.Locale was accepted almost a decade ago, but I was not particularly impressed by that library and have not used it in my own work. Does the current proposal do better? I think it does, and though it's not perfect I think it would be worth accepting but subject to a concern about copyright.
See another thread for the copyright discussion. Basically, if I understand correctly, Zach's code includes tables that are derived from Unicode data and the licence for that data requires a copyright attribution. The Boost licence does not require attribution and I support that. So if I've understood the situation correctly - and I am not a lawyer! - this means that sadly Boost must REJECT the affected code.
Agreed.
I'll say some more about the other aspects of the Unicode code, just in case the copyright issue turns out to be surmountable.
Me too. I've sent an email to someone at ICU, as alluded to earlier in this thread.
I've tried just a couple of things. Firstly, I tried the UTF-8 to UTF-32 conversion. This seems to work, and the API is reasonable. Performance is not bad, but it could be better (see another thread); that could be fixed later. I think it would be worth having just a UTF library in Boost!
Could you provide the exact scenario you tried, and point me to the conversion technique that you used that was faster?
Secondly I looked at the line-splitting algorithm, i.e.
template<typename CPIter, typename Sentinel, typename Extent, typename CPExtentFunc> unspecified lines(CPIter first, Sentinel last, Extent max_extent, CPExtentFunc cp_extent, bool break_overlong_lines = true);
Again this seems to work, and the API is reasonable. Reference documentation is somewhat lacking though; in the description of lines(), the requirements for CPExtentFunc need to be described, and the "unspecified" return type needs to document that it includes members including iter and hard_line_break. I think there are probably similar omissions in much of the reference documentation.
Thanks! I think all those omissions in the reference docs are covered by the tutorial docs, but yeah, I definitely need to add explanations there.
I'm unconvinced by the use of the word "sentinel" here. A sentinel is a value with a special meaning, e.g. a terminating null; what you have here is an end-iterator. It's good that you're allowing the begin and end iterators to have different types, but I think you should name that type something like CPEndIter.
This is standard terminology, as of C++20. See http://eel.is/c++draft/ranges for numerous uses. It means end-iterator-or-other-type-that-acts-as-the-end-tag.
The implementation of line splitting seems to call its word length callback more often than necessary, i.e. when a word overflows a line, that word is subsequently measured a second time. That should be easily fixable.
Thanks, I'll revisit that.
Moving on to other parts of the proposal, I'm less happy with text::string. The first thing I noticed was the very strange decision to use int, rather than (signed?) size_t, for indexing; yes I do work with chunks of text larger than 2 GB! I would REJECT text::string for this alone.
Sure. You're in good company.
I also dislike the use of operator() to extract substrings; the meaning just isn't obvious:
auto x = y(4);
I think this is true because of lack of familiarity. Seeing this a few times would make it familiar pretty quickly. That being said, I'd prefer if I could write y[0:4] or y[0,4] but I can't, at least not without using some kind of strong typedef and a UDL, like y[0_something,4]. Anyway, the string types are moot at this point.
Is that the 4th character of y? Or the substring from 4 to the end? No, it's the substring from the start to 3. I'd need to add a comment to explain what it's doing. But actually there's another problem with that line: x is a non-owning view of that substring. I think that methods that return a view (or span, etc.) should have a name that makes that clear, e.g. "subview". (This wouldn't have been a problem before auto.) This came up in the static_string review, where the substr() method was originally proposed to return a view; that was renamed subview() with substr() returning a copied string.
Again, this is a question of familiarity. If std::string did not have substr() (as text::string does not) there would not be confusion. Silent memory allocations (substr() semantics) are a larger footgun than dangling references (op() semantics) in the age of ASan, IMO.
I don't mind that the string isn't templated on the char type, but I do miss the allocator support; I've used strings with arena allocators (for performance) and with shared memory allocators. I know allocator support is a lot of work but it is necessary, IMO.
I disagree. Allocators are a poor design, in that they are too general. Consider std::vector. An insert operation has different optimal implementations depending on whether the vector is a fixed capacity, inline-storage type or a heap-allocated type. An allocator changed the allocation strategy, but leaves all the other code that often depends on the allocation semantics unchanged. A better design is to have a separate templates for inline storage (e.g. boost::container::static_vector), sbo-optimized heap storage (e.g. boost::container::small_vector), and heap storage (e.g. std::vector). Many other libraries have come to the same conclusion, including FB Folly, LLVM's internal lib, and more.
text::text seems to be the obvious combination of text::string with unicode functionality, and I've not looked at that.
If you have time, please take a look. text::text is almost certainly soon to be reimplemented in terms of std::string.
I have had a look at the trie; I have sometimes wondered to what extent a trie can be implemented in C++ that works as much as possible like a standard container so this is interesting.
The docs should mention briefly why trie is here, i.e. where it is used, just in anticipation of that obvious question from users.
trie::operator[] returns an "optional ref". But it's not like a std::optional; as well as operator* it has conversion operators to access the value. Zach notes that this is "wonky" when the value type is bool, but it also seems to break for me when it's int:
boost::trie::trie_map<std::string, std::string> counts; auto x = counts["Hello"]; if (x) std::cout << "present\n"; else std::cout << "absent\n"; // Prints absent, OK.
boost::trie::trie_map<std::string, int> int_counts; // Now int. auto y = int_counts["Hello"]; if (y) std::cout << "present\n"; else std::cout << "absent\n";
// ../text/include/boost/text/trie.hpp:78: // boost::trie::optional_ref<T, Const>::operator T&() & // [with T = int; bool Const = false]: Assertion `t_' failed.
Thanks! I was missing an operator bool() overload for *mutable* lvalue ref. It works when I add that.
operator T&() is being called by if(y), rather than operator bool(). There aren't has_value() and value() methods that I can use to work around this.
Like a few other things in this proposal, Zach has implemented trie::operator[] how he wishes the standard library worked. As another example, dereferencing a trie_map iterator doesn't return a pair<KEY,VALUE> as a normal associative container would, but rather a struct with members called key and value.
At worst, this will hit problems that the standard solutions avoid (as above). At best, it will make the library more difficult to learn for users who are familiar with the standard library (including its problems),
I think Boost should be a place to experiment with approaches to addressing those problems, rather than propagating them. I find .first and .second to be horrible names for the key-value pair used to represent the mapping of a std::map. I don't think it's appropriate for new code. Code that uses structured bindings should not be affected.
and make generic code less likely to work with it.
As a test I wrote a simple program to count word frequencies:
using counts_t = std::map<std::string,int>; counts_t counts;
while (std::cin.good()) { std::string s; std::cin >> s; counts[s]++; }
for (auto i: counts) { std::cout << i.first << " = " << i.second << "\n"; }
It would be great if I could just change it to:
using counts_t = boost::trie_map<std::string,int>;
but I can't do that; I need to change things:
struct count { int value = 0; }; // Wrap the int. boost::trie::trie_map<std::string, count> counts;
while (std::cin.good()) { std::string s; std::cin >> s;
// operator[] doesn't insert, so: auto r = counts[s]; if (r) r->value++; else counts.insert(s,count{1}); }
for (auto i: counts) { std::cout << i.key << " = " << i.value.value << "\n"; }
That works; how about performance? Trie should perform well when the stored strings frequently share common prefixes, so I've tested with a large web server log file; this has timestamps and URLs which share common prefixes. The results:
std::map: run time 1.2 seconds, memory 1007 pages trie_map: run time 1.8 seconds, memory 4868 pages
So that's rather disappointing.
First, it's important to note that frequent common prefixes don't necessarily make the trie more lookup-efficient. If you are looking for "fred" and you actually have a trie containing f -> r- > e -> d as nodes, you'll visit all 4 nodes as you look up "fred". For a string of 10 chars, you'll visit all 10 nodes for a match, etc. For an R/B tree like std::map, you'll only visit log(N) nodes, where N is the size() of the map. This is unlikely to be nearly as large as the longest string for many data sets. The docs explicitly state that the trie_map and trie_set templates are there for convenience, and that if you care about performance you should use trie. Moreover, if you're trying to eat the longest token that matches some list of tokens encoded in a trie, the trie will be a lot faster, since you don't have to redo a log(N) search M times, where M is the length of the token you're eating. If you're using a trie outside of its niche, you should not expect ideal outcomes. You should use the thing that actually delivers ideal outcomes in that case you care about.
Some general comments:
Some people care about whether libraries are header-only or not, and the docs should mention this, and possibly say which functionality can be expected to work header-only.
I got quite a few missing field initializer and unused parameter warnings; please fix these if at all possible. (gcc 6.3, -W -Wall). I think you should re-order the docs so that the "Hello World" example is much earlier.
Some functions clearly need to embed large Unicode tables; it would be useful if the docs mentioned roughly how much the executable size is increased by.
Some source files are executable.
Conclusions:
I fear that the only part of this that Boost should accept is the UTF transcoding. The implementation is not perfect but we could work on that. The other Unicode stuff would be good to have but the copyright issue is a blocker.
Zach

On Mon, 22 Jun 2020 at 09:32, Zach Laine via Boost <boost@lists.boost.org> wrote:
I don't mind that the string isn't templated on the char type, but I do miss the allocator support; I've used strings with arena allocators (for performance) and with shared memory allocators. I know allocator support is a lot of work but it is necessary, IMO.
I disagree. Allocators are a poor design, in that they are too general. Consider std::vector. An insert operation has different optimal implementations depending on whether the vector is a fixed capacity, inline-storage type or a heap-allocated type. An allocator changed the allocation strategy, but leaves all the other code that often depends on the allocation semantics unchanged. A better design is to have a separate templates for inline storage (e.g. boost::container::static_vector), sbo-optimized heap storage (e.g. boost::container::small_vector), and heap storage (e.g. std::vector). Many other libraries have come to the same conclusion, including FB Folly, LLVM's internal lib, and more.
I don't claim to understand the details of insertion for different-type vectors, but the container<->allocator interface certainly cannot really communicate a whole lotta things that a container might want to tell the allocator, like "I'm going to be replacing this old allocation with this new allocation", and "I'm going to be allocating this many elements with individual allocations, but I'm not seeking to allocate an array". Such things can be communicated straight-to-the-allocator in an allocator-specific way outside the container's control, but there's no generic interface for that sort of hinting, which has always seemed quite funny to me, considering that memory allocation is the allocator's responsibility, but things that might have an immense impact on the proper allocation strategy that should be selected are not hinted to the allocator. And they're not necessarily static.

On Sun, Jun 21, 2020 at 11:32 PM Zach Laine via Boost <boost@lists.boost.org> wrote:
Allocators are a poor design, in that they are too general.
I agree that containers could do better than the general Allocator interface. But it does not then follow that making a container capable of using only the default allocator is an improvement. Regards

Hi, Il 18/06/20 20:21, Phil Endecott via Boost ha scritto:
See another thread for the copyright discussion. Basically, if I understand correctly, Zach's code includes tables that are derived from Unicode data and the licence for that data requires a copyright attribution. The Boost licence does not require attribution and I support that. So if I've understood the situation correctly - and I am not a lawyer! - this means that sadly Boost must REJECT the affected code.
Are these tables the same that are embedded in Boost.Spirit headers? I am referring to files in libs/spirit/include/boost/spirit/home/support/char_encoding/unicode. If those files are problematic, then probably something should be done about Boost.Spirit as well. Also, it seems those files are more than 10 years old now. They should probably be updated, or, even better, not distributed at all with Boost (and the headers should be regenerated at compile time starting from updated Unicode description files, possibly those already available with the operating system when it is the case). Giovanni. -- Giovanni Mascellani <g.mascellani@gmail.com> Postdoc researcher - Université Libre de Bruxelles

Giovanni Mascellani wrote:
Hi,
Il 18/06/20 20:21, Phil Endecott via Boost ha scritto:
See another thread for the copyright discussion. Basically, if I understand correctly, Zach's code includes tables that are derived from Unicode data and the licence for that data requires a copyright attribution. The Boost licence does not require attribution and I support that. So if I've understood the situation correctly - and I am not a lawyer! - this means that sadly Boost must REJECT the affected code.
Are these tables the same that are embedded in Boost.Spirit headers? I am referring to files in libs/spirit/include/boost/spirit/home/support/char_encoding/unicode. If those files are problematic, then probably something should be done about Boost.Spirit as well.
Well spotted. See also: https://github.com/boostorg/spirit/blob/develop/workbench/unicode/DerivedCor... That starts: # DerivedCoreProperties-5.2.0.txt # Date: 2009-08-26, 00:45:22 GMT [MD] # # Unicode Character Database # Copyright (c) 1991-2009 Unicode, Inc. # For terms of use, see http://www.unicode.org/terms_of_use.html # For documentation, see http://www.unicode.org/reports/tr44/ # It is ok to redistribute this file "solely for informational # purposes in the creation of products supporting the Unicode Standard". # We don't nee to add a Boost License to this file: boostinspect:nolicense. Note that I don't know who wrote that, nor where the "solely for informational purposes..." quote comes from. And what are "informational purposes" anyway? - it doesn't sound to me as if that were written by a lawyer. This really needs an expert to look at it, IMO. Regards, Phil.

On 26/06/2020 00:07, Phil Endecott wrote:
# It is ok to redistribute this file "solely for informational # purposes in the creation of products supporting the Unicode Standard". # We don't nee to add a Boost License to this file: boostinspect:nolicense.
Note that I don't know who wrote that, nor where the "solely for informational purposes..." quote comes from. And what are "informational purposes" anyway? - it doesn't sound to me as if that were written by a lawyer.
This really needs an expert to look at it, IMO.
That text appears at https://spdx.org/licenses/Unicode-TOU.html Although that is perhaps an older version of the Unicode license, as it doesn't match what is at https://www.unicode.org/license.html (or at http://www.unicode.org/copyright.html)
participants (10)
-
Damian Vicino
-
Darryl Green
-
degski
-
Gavin Lambert
-
Giovanni Mascellani
-
Niall Douglas
-
Phil Endecott
-
Ville Voutilainen
-
Vinnie Falco
-
Zach Laine