On Fri, Nov 1, 2019 at 3:41 PM Mathias Gaunard <mathias.gaunard(a)ens-lyon.org>
wrote:
> This email appears to only be addressed to me and not the list, is
> that intentional?
> I typed this on my phone so maybe things didn't work exactly as I intended.
>
It was not intentional.
> On Fri, 1 Nov 2019 at 16:09, Zach Laine <whatwasthataddress(a)gmail.com>
> wrote:
>
> > Right. Unicode encodes all natural languages that anyone has taken the
> > time to put into Unicode. I stand by the implication that natural
> > languages are crazy.
>
> Natural languages -- and their transliteration to bytes -- are
> arbitrary conventions yes, but they're sufficiently consistent to be
> used by millions of people in their personal and professional life
> everyday.
> There are rules, and however alien they might look to someone who only
> needs the fairly straightforward ASCII system, they are not crazy, and
> have their own reasons.
>
>
> > So then maybe don't use those parts? They're independent; you don't
> > have to use them to use the Unicode algorithms.
>
> As for separation of concerns, they do appear coupled in the documentation.
> That could just be due to the way the documentation is written.
>
Could be ... could you point to something in the docs that left you with
that impression? I explicitly divided up the documentation along the lines
of the separation of concerns found in the code -- the string layer, the
Unicode layer, and the text layer. I also tried to be explicit that the
first is without dependencies, the second does not depend on the first or
the third, and that the third depends on the second.
> > Clearly you are more capable than I am. It took me a lot longer to do
> > than 2 months. Why did you never submit this for a Boost review? You were
> > thinking about it ~10 years ago, but you never did....
>
> The use case for Unicode is quite niche, and I wasn't particularly
> interested in working in that niche further.
> Any serious natural language processing engine would probably bypass
> Unicode and come up with its own heuristics to parse text, and
> applications like GUI toolkits or text editors are specialized enough
> that they'd probably never use a library which wasn't specifically
> designed with their use case in mind.
> That leaves everyday programming, which very rarely needs to do
> anything with text beyond treating it as a byte stream, and I believe
> wanting to impose the heavyweight Unicode machinery systematically
> everywhere is a mistake.
>
> Anyway, I was just mentioning this because there is prior work
> (Boost.Unicode) that you apparently never looked at.
> Of course, having been developed for only a short period of time, it
> doesn't have the same level of polish as your library.
>
Ok. FWIW, I did know about your library -- I was there for the BoostCon
presentation -- and I was looking forward to using it. Ten years later, I
did not look at it again, simply because it had been ten years.
> >> and that the database itself is not even accessible,
> >
> >
> > That's also intentional. Another goal of the library is to make Unicode
> > as simple as possible for naive users who just want to do the basics. If I
> > find requests for any new feature that has a compelling use case, I'll add
> > that.
>
> Being able to replace things like std::isalpha etc. is pretty basic,
> I'd say.
> Having an API that tells you a number of properties about a code point
> is one of the most useful parts of Unicode.
>
I agree in the abstract, but I think that there is a real problem that the
properties for a given code point are not necessarily consistent across the
various Unicode algorithms. For example, there are multiple notions of
"number" that are treated differently in the bidirectional algorithm.
Moreover, I don't know of a great use case for a boost::text::is_alpha().
Specifically, it seems that if you are looking for alphabetical characters,
you are usually doing something like word-breaking, for which there is
already an algorithm, or doing a regex match, for which is_alpha() is
insufficient, etc. I'm open to hearing about such use cases, of course.
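For concreteness, I read the request as being for something shaped roughly
like this (purely hypothetical names -- nothing below exists in the library
today):

    // Purely hypothetical sketch of a std::isalpha-style, per-code-point
    // property API; these names do not exist in Boost.Text today.
    #include <cstdint>

    namespace boost { namespace text {

        using code_point = std::uint32_t;

        // Queries against the Unicode character database, analogous to
        // std::isalpha()/std::isdigit()/std::isspace(), but code-point-aware.
        bool is_alphabetic(code_point cp);   // Unicode Alphabetic property
        bool is_numeric(code_point cp);      // Numeric_Type != None
        bool is_whitespace(code_point cp);   // White_Space property

    }}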
> > I don't consider 1.5MB for a database containing all human languages in
> > widespread use on computers to be a ridiculous size, but YMMV.
>
> I suppose that's reasonable.
> I think the Boost.Unicode one was only 500k but that's an old Unicode
> version, and I'm not sure it ever had full collation support.
>
> >
> >>
> >> It also doesn't provide the ability to do fast substring search, which
> >> you'd typically do by searching for a substring at the character
> >> encoding level and then eliminating matches that do not fall on a
> >> satisfying boundary, instead suggesting to do the search at the
> >> grapheme level which is much slower, and the facility to test for
> >> boundary isn't provided anyway.
> >
> >
> > I honestly don't know what you mean here.
>
> I'm not sure how to clarify this better, but I'll try.
>
> To search for the utf-8 substring "foo" in the utf-8 string "I really
> like foo dogs", there is no need to iterate the string per code point
> or per grapheme as you do in your examples. You can just perform the
> search at the code unit level, then check that the positions before and
> after the match do not lie inside a grapheme cluster, i.e. that they are
> on a valid boundary.
> What you need to be able to do that is a function that tells you
> whether an arbitrary position in your sequence of utf-8 code units
> lies at a grapheme cluster boundary or not (which would probably be a
> composition of two separate functions, one that tests whether the code
> unit is on a code point boundary, and one that tests whether the code
> point is on a grapheme cluster boundary). This functionality is not
> provided.
>
> This sort of thing is briefly touched upon in Unicode TR#29 6.4.
>
I see. This seems like it might be really useful to add. I'll open a
ticket for it on GitHub.
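Concretely, I take the request to be enough to make something like the
following sketch possible (at_grapheme_boundary() is the hypothetical,
currently-missing piece; the code point boundary check needs no Unicode
data at all):

    #include <cstddef>
    #include <string_view>

    // A position is a UTF-8 code point boundary iff it does not point at a
    // continuation byte (10xxxxxx).  No Unicode data needed for this half.
    bool at_code_point_boundary(std::string_view str, std::size_t pos)
    {
        return pos == 0 || pos == str.size() ||
               (static_cast<unsigned char>(str[pos]) & 0xC0) != 0x80;
    }

    // Hypothetical: true if the code point starting at pos begins a new
    // extended grapheme cluster (per UAX #29).  This is the function the
    // library would need to expose.
    bool at_grapheme_boundary(std::string_view str, std::size_t pos);

    // Code-unit-level substring search that rejects matches whose endpoints
    // split a grapheme cluster.
    std::size_t find_substring(std::string_view haystack,
                               std::string_view needle)
    {
        std::size_t pos = 0;
        while ((pos = haystack.find(needle, pos)) != std::string_view::npos) {
            if (at_code_point_boundary(haystack, pos) &&
                at_code_point_boundary(haystack, pos + needle.size()) &&
                at_grapheme_boundary(haystack, pos) &&
                at_grapheme_boundary(haystack, pos + needle.size()))
                return pos;
            ++pos;  // false positive; keep scanning past it
        }
        return std::string_view::npos;
    }
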
Zach