On Fri, Nov 1, 2019 at 3:41 PM Mathias Gaunard <mathias.gaunard(a)ens-lyon.org>
wrote:
> This email appears to only be addressed to me and not the list, is
> that intentional?
> I typed this on my phone so maybe things didn't work exactly as I intended.
>
It was not intentional.
> On Fri, 1 Nov 2019 at 16:09, Zach Laine <whatwasthataddress(a)gmail.com>
> wrote:
>
> > Right. Unicode encodes all natural languages that anyone has taken the
> > time to put into Unicode. I stand by the implication that natural
> > languages are crazy.
>
> Natural languages -- and their transliteration to bytes -- are
> arbitrary conventions yes, but they're sufficiently consistent to be
> used by millions of people in their personal and professional life
> everyday.
> There are rules, and however alien they might look to someone who only
> needs the fairly straightforward ASCII system, they are not crazy, and
> have their own reasons.
>
>
> > So then maybe don't use those parts? They're independent; you don't
> > have to use them to use the Unicode algorithms.
>
> As for separation of concerns, they do appear coupled in the documentation.
> That could just be due to the way the documentation is written.
>
Could be ... could you point to something in the docs that left you with
that impression? I explicitly divided up the documentation along the lines
of the separation of concerns found in the code -- the string layer, the
Unicode layer, and the text layer. I also tried to be explicit that the
first is without dependencies, the second does not depend on the first or
the third, and that the third depends on the second.
> > Clearly you are more capable than I am. It took me a lot longer to do
> > than 2 months. Why did you never submit this for a Boost review? You were
> > thinking about it ~10 years ago, but you never did....
>
> The use case for Unicode is quite niche, and I wasn't particularly
> interested in working in that niche further.
> Any serious natural language processing engine would probably bypass
> Unicode and come up with its own heuristics to parse text, and
> applications like GUI toolkits or text editors are specialized enough
> that they'd probably never use a library which wasn't specifically
> designed with their use case in mind.
> That leaves everyday programming, which very rarely needs to do
> anything with text beyond treating it as a byte stream, and I believe
> wanting to impose the heavyweight Unicode machinery systematically
> everywhere is a mistake.
>
> Anyway, I was just mentioning this because there is prior work
> (Boost.Unicode) that you apparently never looked at.
> Of course, having been developed for only a short period of time, it
> doesn't have the same level of polish as your library.
>
Ok. FWIW, I did know about your library -- I was there for the BoostCon
presentation -- and I was looking forward to using it. Ten years later, I
did not look at it again, simply because it had been ten years.
> >> and that the database itself is not even accessible,
> >
> >
> > That's also intentional. Another goal of the library is to make Unicode
> > as simple as possible for naive users who just want to do the basics. If I
> > find requests for any new feature that has a compelling use case, I'll add
> > that.
>
> Being able to replace things like std::isalpha etc. is pretty basic,
> I'd say.
> Having an API that tells you a number of properties about a code point
> is one of the most useful parts of Unicode.
>
I agree in the abstract, but I think that there is a real problem that the
properties for a given code point are not necessarily consistent across the
various Unicode algorithms. For example, there are multiple notions of
"number" that are treated differently in the bidirectional algorithm.
Moreover, I don't know of a great use case for a boost::text::is_alpha().
Specifically, it seems that if you are looking for alphabetical characters,
you are usually doing something like word-breaking, for which there is
already an algorithm, or doing a regex match, for which is_alpha() is
insufficient, etc. I'm open to hearing about such use cases, of course.
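For concreteness, I read the request as being for something shaped roughly
like this (purely hypothetical names -- nothing below exists in the library
today):

    // Purely hypothetical sketch of a std::isalpha-style, per-code-point
    // property API; these names do not exist in Boost.Text today.
    #include <cstdint>

    namespace boost { namespace text {

        using code_point = std::uint32_t;

        // Queries against the Unicode character database, analogous to
        // std::isalpha()/std::isdigit()/std::isspace(), but code-point-aware.
        bool is_alphabetic(code_point cp);   // Unicode Alphabetic property
        bool is_numeric(code_point cp);      // Numeric_Type != None
        bool is_whitespace(code_point cp);   // White_Space property

    }}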
> > I don't consider 1.5MB for a database containing all human languages in
> > widespread use on computers to be a ridiculous size, but YMMV.
>
> I suppose that's reasonable.
> I think the Boost.Unicode one was only 500k but that's an old Unicode
> version, and I'm not sure it ever had full collation support.
>
> >
> >>
> >> It also doesn't provide the ability to do fast substring search, which
> >> you'd typically do by searching for a substring at the character
> >> encoding level and then eliminating matches that do not fall on a
> >> satisfying boundary, instead suggesting to do the search at the
> >> grapheme level which is much slower, and the facility to test for
> >> boundary isn't provided anyway.
> >
> >
> > I honestly don't know what you mean here.
>
> I'm not sure how to clarify this better, but I'll try.
>
> To search for the utf-8 substring "foo" in the utf-8 string "I really
> like foo dogs", there is no need to iterate the string per code point
> or per grapheme as you do in your examples. You can just perform the
> search at the code unit level, then check that the positions before and
> after the match do not lie inside a grapheme cluster, i.e. that they are
> on a valid boundary.
> What you need to be able to do that is a function that tells you
> whether an arbitrary position in your sequence of utf-8 code units
> lies at a grapheme cluster boundary or not (which would probably be a
> composition of two separate functions, one that tests whether the code
> unit is on a code point boundary, and one that tests whether the code
> point is on a grapheme cluster boundary). This functionality is not
> provided.
>
> This sort of thing is briefly touched upon in Unicode TR#29 6.4.
>
I see. This seems like it might be really useful to add. I'll open a
ticket for it on GitHub.
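Concretely, I take the request to be enough to make something like the
following sketch possible (at_grapheme_boundary() is the hypothetical,
currently-missing piece; the code point boundary check needs no Unicode
data at all):

    #include <cstddef>
    #include <string_view>

    // A position is a UTF-8 code point boundary iff it does not point at a
    // continuation byte (10xxxxxx).  No Unicode data needed for this half.
    bool at_code_point_boundary(std::string_view str, std::size_t pos)
    {
        return pos == 0 || pos == str.size() ||
               (static_cast<unsigned char>(str[pos]) & 0xC0) != 0x80;
    }

    // Hypothetical: true if the code point starting at pos begins a new
    // extended grapheme cluster (per UAX #29).  This is the function the
    // library would need to expose.
    bool at_grapheme_boundary(std::string_view str, std::size_t pos);

    // Code-unit-level substring search that rejects matches whose endpoints
    // split a grapheme cluster.
    std::size_t find_substring(std::string_view haystack,
                               std::string_view needle)
    {
        std::size_t pos = 0;
        while ((pos = haystack.find(needle, pos)) != std::string_view::npos) {
            if (at_code_point_boundary(haystack, pos) &&
                at_code_point_boundary(haystack, pos + needle.size()) &&
                at_grapheme_boundary(haystack, pos) &&
                at_grapheme_boundary(haystack, pos + needle.size()))
                return pos;
            ++pos;  // false positive; keep scanning past it
        }
        return std::string_view::npos;
    }
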
Zach