
Anthony Williams wrote:
"Reece Dunn" <msclrhd@hotmail.com> writes:
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage, we should not be concerned with combining characters and marks, only with complete characters.
Here is the issue. What constitutes a complete character? At the lowest level, a single codepoint is a character. At the next level, a collection of codepoints (base+combining marks) is a character (e.g. e + acute accent is a single character). Sometimes there are many equivalent sequences of codepoints that constitute the same character. Sometimes there may be a single codepoint that is equivalent to a set of codepoints (e.g. e + acute accent => e-acute).
At another level, a set of codepoints represents a glyph. This glyph may cover one or more characters. There may be several alternative glyphs for a single set of codepoints.
Yes, but there may be more glyphs for one codepoint as well. If your definition of glyph is the same as mine (to me it has to do with graphics rather than meaning), glyphs have nothing to do with Unicode text handling, but rather with font drawing (AFAIK ICU deals with both). [...]
I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc. Thus, there would also be a function
unicode_string::utf32_t combine ( std::pair< unicode_string::iterator, unicode_string::iterator > & ucr )
that will map the range into a single code point. You could therefore have a combined_iterator that will provide access to the sequence of combined characters.
You cannot always map the sequence of codepoints that make up a character into a single codepoint.
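To make the objection concrete, here is a minimal sketch of a combine that is allowed to fail. It is not the interface proposed above: it takes a std::vector instead of an iterator pair, returns std::optional (C++17), and uses a hypothetical two-entry table (sample_compositions) where a real library would consult the Unicode Character Database.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using utf32_t = std::uint32_t;  // assumption: one value per codepoint

// Hypothetical, deliberately tiny composition table.
// A real implementation would derive this from the Unicode Character Database.
struct composition { utf32_t base, mark, composed; };
constexpr composition sample_compositions[] = {
    {0x0061, 0x0300, 0x00E0},  // a + combining grave -> a-grave
    {0x0065, 0x0301, 0x00E9},  // e + combining acute -> e-acute
};

// Try to fold a base codepoint plus following marks into one codepoint.
// Returns std::nullopt when no single-codepoint form is known.
std::optional<utf32_t> combine(const std::vector<utf32_t>& seq) {
    if (seq.empty()) return std::nullopt;
    utf32_t current = seq[0];
    for (std::size_t i = 1; i < seq.size(); ++i) {
        bool found = false;
        for (const auto& c : sample_compositions) {
            if (c.base == current && c.mark == seq[i]) {
                current = c.composed;
                found = true;
                break;
            }
        }
        if (!found) return std::nullopt;  // this sequence has no precomposed form in the table
    }
    return current;
}
```

With this sketch, e + combining acute folds to U+00E9, while q + combining acute yields no value, because Unicode has no precomposed q-acute.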
However, it is agreed that it would be nice to have a means of dealing with "character" chunks, independently of the number of codepoints that make up that character.
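One possible way to get at such "character" chunks is to report each base codepoint together with the marks that follow it as an index range. The sketch below is only an illustration: character_ranges and is_combining_mark are hypothetical names, and the mark test covers only the Combining Diacritical Marks block (U+0300 to U+036F) rather than the full set of combining characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using utf32_t = std::uint32_t;  // assumption: one value per codepoint

// Simplification: only the Combining Diacritical Marks block is recognised.
inline bool is_combining_mark(utf32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Split a codepoint sequence into half-open [begin, end) index ranges,
// each holding one base codepoint and any combining marks that follow it.
std::vector<std::pair<std::size_t, std::size_t>>
character_ranges(const std::vector<utf32_t>& text) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t end = i + 1;  // the chunk always contains its first codepoint
        while (end < text.size() && is_combining_mark(text[end])) ++end;
        ranges.emplace_back(i, end);
        i = end;
    }
    return ranges;
}
```

An iterator over these ranges would step through characters regardless of how many codepoints each one occupies, without needing every chunk to collapse to a single codepoint.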
Yes. It seems to me that the discussion so far is about storage, rather than use of Unicode strings. One character may be defined by more than one codepoint, and the different ways to define one character are semantically equivalent (canonically equivalent, see the Unicode standard, 3.7). So U+00E0 ("a with grave") is equivalent to U+0061 U+0300 ("a" "combining grave"). I think characters in this sense should be at the heart of a usable Unicode string.
I would propose a class unicode_char, containing one or more codepoints (e.g., in a vector<utf32_t>). operator== (unicode_char, unicode_char) should return true for equivalent sequences. A Unicode string would be a basic_string-like container of unicode_chars. The find_first_of and similar functions would then have the expected behaviour.
The implementation should probably be more optimised than requiring an allocation for every character, but IMO a good Unicode library should *transparently* deal with such things as canonical equivalence for all operations, like searching, deleting characters, et cetera. unicode_string should be as easy to use as basic_string.
Regards,
Rogier
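The unicode_char idea could be sketched roughly as follows. This is an illustration under loud assumptions, not a proposed implementation: the decomposition table (sample_decompositions) is a hypothetical two-entry stand-in for real Unicode data, the comparison ignores the reordering of multiple marks by combining class, and the names unicode_string_sketch and find_char are invented here. It also allocates per character, which is exactly what the proposal says a real implementation should avoid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

using utf32_t = std::uint32_t;  // assumption: one value per codepoint

// Hypothetical, deliberately tiny decomposition table.
struct decomposition { utf32_t composed, base, mark; };
constexpr decomposition sample_decompositions[] = {
    {0x00E0, 0x0061, 0x0300},  // a-grave -> a + combining grave
    {0x00E9, 0x0065, 0x0301},  // e-acute -> e + combining acute
};

class unicode_char {
public:
    unicode_char(std::initializer_list<utf32_t> cps) : codepoints_(cps) {}

    // Expand every precomposed codepoint found in the table; copy the rest.
    std::vector<utf32_t> decomposed() const {
        std::vector<utf32_t> out;
        for (utf32_t cp : codepoints_) {
            bool expanded = false;
            for (const auto& d : sample_decompositions) {
                if (d.composed == cp) {
                    out.push_back(d.base);
                    out.push_back(d.mark);
                    expanded = true;
                    break;
                }
            }
            if (!expanded) out.push_back(cp);
        }
        return out;
    }

    // Two characters are equal when their decomposed forms match.
    friend bool operator==(const unicode_char& a, const unicode_char& b) {
        return a.decomposed() == b.decomposed();
    }
    friend bool operator!=(const unicode_char& a, const unicode_char& b) {
        return !(a == b);
    }

private:
    std::vector<utf32_t> codepoints_;
};

// A string as a plain container of characters (no optimisation attempted).
using unicode_string_sketch = std::vector<unicode_char>;

// Index of the first character equal to c, or s.size() if there is none.
std::size_t find_char(const unicode_string_sketch& s, const unicode_char& c) {
    auto it = std::find(s.begin(), s.end(), c);
    return static_cast<std::size_t>(it - s.begin());
}
```

Because equality is defined on characters rather than codepoints, a search for U+00E0 finds a character stored as U+0061 U+0300, and standard algorithms such as std::find pick this up without special handling.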