Re: [boost] [rfc] Unicode GSoC project

Dear Eric/ Mathias,

> That's good, these are needed. Also needed are tables that store the various character properties, and (hopefully) some parsers that build the tables directly from the Unicode character database so we can easily rev it whenever the database changes.

A good reloadable character library is in the vault.

> And UnicodeGrapheme concept doesn't make sense to me. You say, "A model of UnicodeGrapheme is a range of Unicode code points that is a single grapheme cluster in Normalized Form C." A grapheme cluster != Unicode code point. It may be many code points representing a base character and many zero-width combining characters. So what exactly is being traversed by a UnicodeGrapheme range?

> It is thus important to be able to apply algorithms with graphemes as the unit rather than code points to deal with graphemes not representable by a single code point.

I think that a grapheme is more of an iterator concept than a data type concept. By specialising it you will unnecessarily complicate any library. Don't forget that, for example, the current grapheme may start as one character, then suddenly 'grab' the surrounding characters as it makes a combined glyph. I have never found a use case in practice where specialising the grapheme as other than a validated series of code points was helpful. The two cases where graphemes are important are in display [which requires intermediate glyph conversion anyway, and works just as well on runs of code points, so code points are fine] and in editing - and the grapheme-ness here alters during typing.

> The Unicode standard also specifies various features such as a collation algorithm in Technical Standard #10 - Unicode Collation Algorithm for comparison and ordering of strings with a locale-specific criterion, as well as mechanisms to iterate over words, sentences and lines

Have a look at the character library that I posted in the vault - if you can do graphemes then you can do words, paragraphs etc., as they are all just attributes of the characters with simple rules. Graphemes come into their own for text display and editing, and you would need these as well to be able to support that. Don't forget that Windows GDI only supports point arithmetic, and this means that you need to be able to locate word boundaries to display text well at different scales to work around the GDI scaling rounding [and GDI+ is not much better].
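To make the idea of building property tables straight from the Unicode character database a bit more concrete, here is a minimal sketch. It is not code from the vault library. It assumes input in the semicolon-separated layout of UnicodeData.txt (code point in hex, name, general category, canonical combining class as the first four fields). The names char_props, split_fields and load_props are hypothetical, and the sketch ignores the First/Last range entries that the real file uses for large blocks.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: only three of the per-character fields are kept.
struct char_props {
    std::string name;
    std::string general_category;  // e.g. "Lu", "Mn"
    int combining_class;           // canonical combining class
};

// Splits one semicolon-separated line into its fields.
std::vector<std::string> split_fields(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, ';')) fields.push_back(field);
    return fields;
}

// Reads UnicodeData.txt-style lines into a table keyed by code point.
// Assumption: well-formed input; First/Last range entries are not expanded.
std::map<std::uint32_t, char_props> load_props(std::istream& in) {
    std::map<std::uint32_t, char_props> table;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> f = split_fields(line);
        if (f.size() < 4) continue;  // skip blank or truncated lines
        std::uint32_t cp =
            static_cast<std::uint32_t>(std::stoul(f[0], nullptr, 16));
        table[cp] = char_props{f[1], f[2], std::stoi(f[3])};
    }
    return table;
}
```

Re-running a loader of this kind against a newer copy of the data file is what makes the tables easy to rev. A production table would want a far more compact layout than std::map.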

Graham wrote:
> A good reloadable character library is in the vault.
I'll be reviewing it in a while. I'm not too sure about the memory layout it uses (__uni_char_data could really be compressed to use less memory, for example), nor about the interface it exposes, but it does seem to work well. As for is_grapheme_break, isn't the implementation for legacy grapheme clusters rather than extended ones?
> I think that a grapheme is more of an iterator concept than a data type concept. By specialising it you will unnecessarily complicate any library. Don't forget that, for example, the current grapheme may start as one character, then suddenly 'grab' the surrounding characters as it makes a combined glyph. I have never found a use case in practice where specialising the grapheme as other than a validated series of code points was helpful.
A grapheme is nothing more than a subrange of code points, at least in my current design.
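As a rough illustration of "a grapheme is a subrange of code points" (a sketch only, not the design under discussion), the function below splits a code point sequence into half-open index ranges. The boundary rule is a deliberate simplification: a code point extends the previous grapheme only if it falls in one of a few combining-mark blocks. The real rules in UAX #29 cover much more (Hangul, CR LF, and so on), and is_combining_mark, grapheme_range and graphemes are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the Grapheme_Extend property: only a few
// combining-mark blocks are covered (assumption for illustration).
bool is_combining_mark(char32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F)   // Combining Diacritical Marks
        || (cp >= 0x1AB0 && cp <= 0x1AFF)   // ... Extended
        || (cp >= 0x1DC0 && cp <= 0x1DFF)   // ... Supplement
        || (cp >= 0x20D0 && cp <= 0x20FF)   // ... for Symbols
        || (cp >= 0xFE20 && cp <= 0xFE2F);  // Combining Half Marks
}

// A grapheme as a half-open [first, last) index range into the text.
using grapheme_range = std::pair<std::size_t, std::size_t>;

// Splits text into graphemes: each range is one base code point followed
// by any run of combining marks.
std::vector<grapheme_range> graphemes(const std::u32string& text) {
    std::vector<grapheme_range> out;
    std::size_t start = 0;
    for (std::size_t i = 1; i <= text.size(); ++i) {
        // Close the current range at the end of the text or whenever the
        // next code point does not extend it.
        if (i == text.size() || !is_combining_mark(text[i])) {
            out.push_back({start, i});
            start = i;
        }
    }
    return out;
}
```

For the code points f, o, o, U+20D7 this yields three ranges, the last one spanning the second "o" together with its combining mark.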
> The two cases where graphemes are important are in display [which requires intermediate glyph conversion anyway, and works just as well on runs of code points, so code points are fine] and in editing - and the grapheme-ness here alters during typing.
It's also useful for grapheme-level searching. Searching for the substring "foo" in the string "foo\u20d7" shouldn't match anything, because the end of the match does not fall on a grapheme boundary: the combining mark U+20D7 attaches to the final "o".
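A minimal sketch of such a search, under a simplified boundary rule (a position is treated as a grapheme boundary unless the code point there is in one of a few combining-mark blocks, which is much cruder than UAX #29). The names is_combining_mark, is_boundary and grapheme_find are hypothetical and not part of any proposed interface.

```cpp
#include <cstddef>
#include <string>

// Simplified stand-in for the Grapheme_Extend property (assumption).
bool is_combining_mark(char32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F)
        || (cp >= 0x1AB0 && cp <= 0x1AFF)
        || (cp >= 0x1DC0 && cp <= 0x1DFF)
        || (cp >= 0x20D0 && cp <= 0x20FF)
        || (cp >= 0xFE20 && cp <= 0xFE2F);
}

// True when pos lies between two graphemes: start of text, end of text,
// or a code point that does not extend the previous grapheme.
bool is_boundary(const std::u32string& text, std::size_t pos) {
    return pos == 0 || pos >= text.size() || !is_combining_mark(text[pos]);
}

// Returns the index of the first code point match whose two ends both
// fall on grapheme boundaries, or npos if there is none.
std::size_t grapheme_find(const std::u32string& text,
                          const std::u32string& needle) {
    std::size_t pos = text.find(needle);
    while (pos != std::u32string::npos) {
        if (is_boundary(text, pos) && is_boundary(text, pos + needle.size()))
            return pos;
        pos = text.find(needle, pos + 1);  // try the next candidate
    }
    return std::u32string::npos;
}
```

With this, grapheme_find(U"foo\u20d7", U"foo") returns npos, while a plain code point find would report a match at index 0.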
> if you can do graphemes then you can do words, paragraphs etc as they are all just attributes of the characters with simple rules. Graphemes come in to their own for text display and editing and you would need these as well to be able to support that.
Those are not as important in my opinion, and given that the time I have is restricted, the focus won't be on these. Adding them later makes perfect sense, however.
participants (2): Graham, Mathias Gaunard