Re: [Boost-users] Interest in a Unicode library for Boost?
On Fri, Nov 1, 2019 at 3:41 PM Mathias Gaunard wrote:
This email appears to be addressed only to me and not the list; is that intentional? I typed this on my phone, so maybe things didn't work exactly as I intended.
It was not intentional.
On Fri, 1 Nov 2019 at 16:09, Zach Laine wrote:
Right. Unicode encodes all natural languages that anyone has taken the time to put into Unicode. I stand by the implication that natural languages are crazy.
Natural languages -- and their transliteration to bytes -- are arbitrary conventions, yes, but they're sufficiently consistent to be used by millions of people in their personal and professional lives every day. There are rules, and however alien they might look to someone who only needs the fairly straightforward ASCII system, they are not crazy, and they have their own reasons.
So then maybe don't use those parts? They're independent; you don't have to use them to use the Unicode algorithms.
Regarding separation of concerns: they do appear coupled in the documentation. That could just be due to the way the documentation is written.
Could be ... could you point to something in the docs that left you with that impression? I explicitly divided up the documentation along the lines of the separation of concerns found in the code -- the string layer, the Unicode layer, and the text layer. I also tried to be explicit that the first is without dependencies, the second does not depend on the first or the third, and that the third depends on the second.
Clearly you are more capable than I am; it took me a lot longer than two months. Why did you never submit this for a Boost review? You were thinking about it ~10 years ago, but you never did...
The use case for Unicode is quite niche, and I wasn't particularly interested in working in that niche further. Any serious natural language processing engine would probably bypass Unicode and come up with its own heuristics to parse text, and applications like GUI toolkits or text editors are specialized enough that they'd probably never use a library that wasn't designed with their use case in mind. That leaves everyday programming, which very rarely needs to do anything with text beyond treating it as a byte stream, and I believe wanting to impose the heavyweight Unicode machinery systematically everywhere is a mistake.
Anyway, I was just mentioning this because there is prior work (Boost.Unicode) that you apparently never looked at. Of course, having been developed for only a short period of time, it doesn't have the same level of polish as your library.
Ok. FWIW, I did know about your library -- I was there for the BoostCon presentation -- and I was looking forward to using it. Ten years later, I did not look at it again, simply because it had been ten years.
and that the database itself is not even accessible,
That's also intentional. Another goal of the library is to make Unicode as simple as possible for naive users who just want to do the basics. If I find requests for any new feature that has a compelling use case, I'll add that.
Being able to replace things like std::isalpha etc. is pretty basic, I'd say. Having an API that tells you a number of properties about a code point is one of the most useful parts of Unicode.
I agree in the abstract, but I think that there is a real problem: the properties for a given code point are not necessarily consistent across the various Unicode algorithms. For example, there are multiple notions of "number" that are treated differently in the bidirectional algorithm. Moreover, I don't know of a great use case for a boost::text::is_alpha(). Specifically, it seems that if you are looking for alphabetical characters, you are usually doing something like word-breaking (for which there is already an algorithm), doing a regex match (for which is_alpha() is insufficient), etc. I'm open to hearing about such use cases, of course.
I don't consider 1.5MB for a database containing all human languages in widespread use on computers to be a ridiculous size, but YMMV.
I suppose that's reasonable. I think the Boost.Unicode one was only 500k, but that's for an old Unicode version, and I'm not sure it ever had full collation support.
It also doesn't provide the ability to do fast substring search, which you'd typically do by searching for the substring at the character encoding level and then eliminating matches that do not fall on a satisfying boundary. Instead, it suggests doing the search at the grapheme level, which is much slower, and the facility to test for a boundary isn't provided anyway.
I honestly don't know what you mean here.
I'm not sure how to clarify this better, but I'll try.
To search for the UTF-8 substring "foo" in the UTF-8 string "I really like foo dogs", there is no need to iterate the string per code point or per grapheme as you do in your examples. You can just perform the search at the code unit level, then check that the positions before and after the match do not lie inside a grapheme cluster, i.e. that they are on a valid boundary. What you need to be able to do that is a function that tells you whether an arbitrary position in your sequence of UTF-8 code units lies at a grapheme cluster boundary or not (which would probably be a composition of two separate functions: one that tests whether the code unit is on a code point boundary, and one that tests whether the code point is on a grapheme cluster boundary). This functionality is not provided.
This sort of thing is briefly touched upon in Unicode TR#29 6.4.
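A minimal sketch of what I mean, in plain C++17 with std::string_view (the names here are made up for illustration); the code-point-boundary half of the test is easy to write by hand, and the grapheme-cluster half is exactly the piece that is missing:

    #include <cstddef>
    #include <string_view>

    // True if pos is a code point boundary in a UTF-8 sequence: either an
    // endpoint, or a byte that is not a continuation byte (10xxxxxx).
    bool at_code_point_boundary(std::string_view sv, std::size_t pos)
    {
        if (pos == 0 || pos == sv.size())
            return true;
        return (static_cast<unsigned char>(sv[pos]) & 0xC0) != 0x80;
    }

    // Code-unit-level search, keeping only matches whose endpoints lie on
    // valid boundaries. A complete version would also check that each
    // endpoint is on a grapheme cluster boundary -- the missing function.
    std::size_t find_on_boundaries(std::string_view haystack,
                                   std::string_view needle)
    {
        for (std::size_t pos = haystack.find(needle);
             pos != std::string_view::npos;
             pos = haystack.find(needle, pos + 1)) {
            if (at_code_point_boundary(haystack, pos) &&
                at_code_point_boundary(haystack, pos + needle.size()))
                return pos;
        }
        return std::string_view::npos;
    }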
I see. This seems like it might be really useful to add. I'll open a ticket for it on GitHub. Zach
On 01.11.19 22:22, Zach Laine via Boost-users wrote:
Moreover, I don't know of a great use case for a boost::text::is_alpha(). Specifically, it seems that if you are looking for alphabetical characters, you are usually doing something like word-breaking (for which there is already an algorithm), doing a regex match (for which is_alpha() is insufficient), etc. I'm open to hearing about such use cases, of course.
Filtering text input. Parsing programming languages or data description languages. Gathering statistics on a piece of text.
My own codebase has four instances of #include <cctype> and three instances of #include, but that's an artificially low number because character classification is trivial to do by hand for ASCII and because cctype doesn't support Unicode. Exactly zero of these instances can be replaced by any algorithm provided by the proposed library. All of them could technically be replaced by regular expressions, but only in the sense that it is possible to (inefficiently) implement the cctype interface in terms of regular expressions.
On Sat, Nov 2, 2019 at 4:12 AM Rainer Deyke via Boost-users <boost-users@lists.boost.org> wrote:
On 01.11.19 22:22, Zach Laine via Boost-users wrote:
Moreover, I don't know of a great use case for a boost::text::is_alpha(). Specifically, it seems that if you are looking for alphabetical characters, you are usually doing something like word-breaking (for which there is already an algorithm), doing a regex match (for which is_alpha() is insufficient), etc. I'm open to hearing about such use cases, of course.
Filtering text input. Parsing programming languages or data description languages. Gathering statistics on a piece of text.
My own codebase has four instances of #include <cctype> and three instances of #include, but that's an artificially low number because character classification is trivial to do by hand for ASCII and because cctype doesn't support Unicode. Exactly zero of these instances can be replaced by any algorithm provided by the proposed library. All of them could technically be replaced by regular expressions, but only in the sense that it is possible to (inefficiently) implement the cctype interface in terms of regular expressions.
These use cases fall under the regex use case I mentioned. I still think they're more appropriately solved that way. Have you heard of CTRE? Hana is working on adding Unicode support to that, including character classes like is_alpha.
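For reference, basic CTRE usage looks something like this (a sketch, assuming C++20 and the single-header ctre.hpp; the Unicode property classes are the part still in progress, so this uses an ASCII class):

    #include <string_view>
    #include "ctre.hpp"  // https://github.com/hanickadot/compile-time-regular-expressions

    // A compile-time regex standing in for an is_alpha-style check. The
    // character class is ASCII-only pending CTRE's Unicode support.
    bool all_alpha(std::string_view s)
    {
        return static_cast<bool>(ctre::match<"[A-Za-z]+">(s));
    }

Zach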
On Fri, Nov 1, 2019 at 4:22 PM Zach Laine wrote:
On Fri, Nov 1, 2019 at 3:41 PM Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
To search for the UTF-8 substring "foo" in the UTF-8 string "I really like foo dogs", there is no need to iterate the string per code point or per grapheme as you do in your examples. You can just perform the search at the code unit level, then check that the positions before and after the match do not lie inside a grapheme cluster, i.e. that they are on a valid boundary. What you need to be able to do that is a function that tells you whether an arbitrary position in your sequence of UTF-8 code units lies at a grapheme cluster boundary or not (which would probably be a composition of two separate functions: one that tests whether the code unit is on a code point boundary, and one that tests whether the code point is on a grapheme cluster boundary). This functionality is not provided.
This sort of thing is briefly touched upon in Unicode TR#29 6.4.
I see. This seems like it might be really useful to add. I'll open a ticket for it on GitHub.
After writing this, I realized this is supported by calling prev_grapheme_break(first, it, last) == it. There is an exception to this, though, when it == last. I should either remove that exception (which sounds like the right answer regardless of the rest), or provide at_grapheme_break(first, it, last) (probably a good thing to do regardless of the rest).
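Something like this, sketched in terms of prev_grapheme_break as it exists today (the header path is assumed, and treating it == last as a break sidesteps the exception mentioned above):

    #include <boost/text/grapheme_break.hpp>  // assumed header path

    // Sketch: true iff `it` lies on a grapheme cluster boundary within the
    // code point range [first, last). it == last counts as a boundary here,
    // which resolves the exception described above.
    template<typename CPIter>
    bool at_grapheme_break(CPIter first, CPIter it, CPIter last)
    {
        if (it == last)
            return true;
        return boost::text::prev_grapheme_break(first, it, last) == it;
    }

Zach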