On Fri, Nov 1, 2019 at 4:22 PM Zach Laine
On Fri, Nov 1, 2019 at 3:41 PM Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
To search for the UTF-8 substring "foo" in the UTF-8 string "I really
like foo dogs", there is no need to iterate the string per code point
or per grapheme as you do in your examples. You can just perform the search at the code unit level, then check that the positions just before and just after the match do not lie inside a grapheme cluster, i.e. that both are on a valid boundary.

What you need in order to do that is a function that tells you whether an arbitrary position in your sequence of UTF-8 code units lies on a grapheme cluster boundary. This would probably be a composition of two separate functions: one that tests whether the code unit is on a code point boundary, and one that tests whether the code point is on a grapheme cluster boundary. This functionality is not provided.
This sort of thing is briefly touched upon in Unicode TR#29 6.4.
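A minimal C++17 sketch of the approach described above, for illustration only. The names `at_code_point_boundary` and `find_on_boundaries` are hypothetical and are not part of the library under discussion. The code point test assumes valid UTF-8; the grapheme-level test needs Unicode property data, so here it is left as a caller-supplied predicate.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True if `pos` is on a code point boundary in the UTF-8 code units of `s`.
// Assumes `s` is valid UTF-8: a position is a boundary if it is the start or
// end of the sequence, or if the code unit there is not a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
inline bool at_code_point_boundary(std::string_view s, std::size_t pos)
{
    assert(pos <= s.size());
    if (pos == 0 || pos == s.size())
        return true;
    return (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
}

// Searches `haystack` for `needle` at the code unit level, and accepts a match
// only if both its start and its end satisfy `at_boundary(haystack, pos)`.
// `at_boundary` is a hypothetical predicate; a grapheme-aware one would be
// supplied by a Unicode library. Returns npos if no acceptable match exists.
template <typename BoundaryPred>
std::size_t find_on_boundaries(std::string_view haystack,
                               std::string_view needle,
                               BoundaryPred at_boundary)
{
    constexpr std::size_t npos = std::string_view::npos;
    if (needle.empty())
        return npos; // An empty needle is treated as "not found" in this sketch.

    for (std::size_t pos = haystack.find(needle); pos != npos;
         pos = haystack.find(needle, pos + 1)) {
        if (at_boundary(haystack, pos) &&
            at_boundary(haystack, pos + needle.size()))
            return pos;
    }
    return npos;
}
```

With only the code point predicate, searching for "e" in "e" followed by U+0301 (COMBINING ACUTE ACCENT) still reports a match at offset 0, because the match ends on a code point boundary. Rejecting that match is exactly the job of the grapheme-level predicate.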
I see. This seems like it might be really useful to add. I'll open a ticket for it on GitHub.
After writing this, I realized that this is already supported: calling prev_grapheme_break(first, it, last) == it tells you whether it is on a grapheme break. There is an exception to this, though, when it == last. I should either remove that exception (which sounds like the right answer regardless of the rest), or provide at_grapheme_break(first, it, last) (probably a good thing to do regardless of the rest).

Zach
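As a rough sketch of what such a wrapper could look like, assuming the exception is that the "previous break" function never returns `last` when called with `it == last`: the template `at_break` and the stand-in `prev_code_point_break` below are hypothetical. The stand-in works on UTF-8 code point boundaries only, and is there so the example compiles without any Unicode library; it is not the library's `prev_grapheme_break`.

```cpp
#include <cassert>

// Stand-in for a "previous break" function, operating on UTF-8 code point
// boundaries rather than grapheme boundaries. By assumption it mimics the
// exception discussed above: when it == last, it steps back before searching,
// so it never returns last.
inline const char* prev_code_point_break(const char* first,
                                         const char* it,
                                         const char* last)
{
    assert(first <= it && it <= last);
    if (it == first)
        return first;
    if (it == last)
        --it;
    // Walk back over continuation bytes (10xxxxxx) to the lead byte.
    while (it != first && (static_cast<unsigned char>(*it) & 0xC0) == 0x80)
        --it;
    return it;
}

// Hypothetical at-break test built on any prev-break function with the
// (first, it, last) signature. The end of the range is always treated as a
// break, which sidesteps the it == last exception.
template <typename Iter, typename PrevBreak>
bool at_break(Iter first, Iter it, Iter last, PrevBreak prev_break)
{
    return it == last || prev_break(first, it, last) == it;
}
```

Handling `it == last` in the wrapper keeps the check correct whether or not the exception is later removed from the underlying function.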