[boost] Re: Thoughts on Unicode and Strings

17 Apr 2004

Anthony Williams wrote:
"Reece Dunn" <msclrhd@hotmail.com> writes:
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with
character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
string). At this stage, we should not be concerned with combining characters
and marks, only with complete characters.
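
To make the "character boundary" point concrete for the UTF-8 view, here is
a minimal sketch that walks a byte string and yields one code point per
complete sequence. The name decode_utf8 and the free-standing utf32_t
typedef are assumptions made for illustration; nothing in the proposal
fixes them. The validation is deliberately incomplete (no overlong or
surrogate checks).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed stand-in for unicode_string::utf32_t from the proposal.
typedef std::uint32_t utf32_t;

// Hypothetical helper: decode a UTF-8 byte string into code points.
// Simplified: rejects bad lead bytes, truncated sequences and bad
// continuation bytes, but does not reject overlong forms or surrogates.
std::vector<utf32_t> decode_utf8(const std::string& bytes)
{
    std::vector<utf32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        utf32_t cp = 0;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;  // the next character boundary
    }
    return out;
}
```

With this, the two bytes C3 A0 decode to the single code point U+00E0,
while "a" followed by CC 80 decodes to the two code points U+0061 U+0300.
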
Here is the issue. What constitutes a complete character? At the lowest level,
a single codepoint is a character. At the next level, a collection of
codepoints (base+combining marks) is a character (e.g. e + acute accent is a
single character). Sometimes there are many equivalent sequences of codepoints
that constitute the same character. Sometimes there may be a single codepoint
that is equivalent to a set of codepoints (e.g. e + acute accent => e-acute).
At another level, a set of codepoints represents a glyph. This glyph may cover
one or more characters. There may be several alternative glyphs for a single
set of codepoints.
Yes, but there may also be several glyphs for a single codepoint. If your
definition of glyph is the same as mine (to me it has to do with
graphics rather than meaning), glyphs have nothing to do with Unicode
text handling, but rather with font drawing (AFAIK ICU deals with both).

[...]
I would also suggest that there be another iterator that operates on
std::pair< unicode_string::iterator, unicode_string::iterator > to group
combining marks, etc. Thus, there would also be a function

  unicode_string::utf32_t combine(
      std::pair< unicode_string::iterator, unicode_string::iterator > & ucr );

that will map the range into a single code point. You could therefore have a
combined_iterator that will provide access to the sequence of combined
characters.
You cannot always map the sequence of codepoints that make up a character into
a single codepoint.
However, it is agreed that it would be nice to have a means of dealing with
"character" chunks, independently of the number of codepoints that make up
that character.
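
One way to picture such "character" chunks is to split a code point
sequence so that each base code point carries its trailing combining
marks with it. The sketch below is only an illustration under a strong
assumption: is_combining_mark treats just the Combining Diacritical Marks
block (U+0300 to U+036F) as combining, whereas a real library would
consult the Unicode Character Database. The names split_characters and
code_points are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint32_t utf32_t;            // assumed code point type
typedef std::vector<utf32_t> code_points; // hypothetical alias

// Simplifying assumption: only U+0300..U+036F count as combining marks.
inline bool is_combining_mark(utf32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Split a code point sequence into chunks of base + combining marks.
// A combining mark with no preceding base starts a chunk of its own.
std::vector<code_points> split_characters(const code_points& text)
{
    std::vector<code_points> chunks;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (chunks.empty() || !is_combining_mark(text[i]))
            chunks.push_back(code_points());  // start a new character
        chunks.back().push_back(text[i]);
    }
    return chunks;
}
```

For the sequence U+0061 U+0300 U+0062 this yields two chunks, the first
holding two code points and the second one, without ever needing to map
a chunk onto a single code point.
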
Yes. It seems to me that the discussion so far is about storage, rather
than use of Unicode strings. One character may be defined by more than
one codepoint, and the different ways to define one character are
semantically equivalent (canonically equivalent; see section 3.7 of the
Unicode standard). So U+00E0 ("a with grave") is equivalent to U+0061
U+0300 ("a" "combining grave"). I think characters in this sense should
be at the heart of a usable Unicode string.

I would propose a class unicode_char, containing one or more codepoints
(e.g., in a std::vector<utf32_t>). operator==(unicode_char, unicode_char)
should return true for canonically equivalent sequences. A Unicode string
would be a basic_string-like container of unicode_chars. The find_first_of
and similar functions would then have the expected behaviour.
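
As a rough sketch of what I mean (not a worked-out design), equality
could compare decomposed forms. The two-entry decomposition table below
is a placeholder of my own; a real implementation would take the
mappings from the Unicode Character Database and would also have to
reorder combining marks canonically, which this sketch skips. The member
name decomposed() is an assumption as well.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint32_t utf32_t;  // assumed code point type

class unicode_char {
public:
    explicit unicode_char(const std::vector<utf32_t>& cps) : cps_(cps) {}

    // Replace precomposed code points by base + combining mark.
    // Placeholder table: only U+00E0 and U+00E9 are known here.
    std::vector<utf32_t> decomposed() const
    {
        std::vector<utf32_t> out;
        for (std::size_t i = 0; i < cps_.size(); ++i) {
            switch (cps_[i]) {
            case 0x00E0:  // a with grave -> a + combining grave
                out.push_back(0x0061); out.push_back(0x0300); break;
            case 0x00E9:  // e with acute -> e + combining acute
                out.push_back(0x0065); out.push_back(0x0301); break;
            default:
                out.push_back(cps_[i]);
            }
        }
        return out;
    }

private:
    std::vector<utf32_t> cps_;
};

// Two characters are equal when their decomposed forms match.
inline bool operator==(const unicode_char& a, const unicode_char& b)
{
    return a.decomposed() == b.decomposed();
}

inline bool operator!=(const unicode_char& a, const unicode_char& b)
{
    return !(a == b);
}
```

With this, a unicode_char holding U+00E0 compares equal to one holding
U+0061 U+0300, and both differ from a plain U+0061.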

The implementation should probably be more optimised than requiring an
allocation for every character, but IMO a good Unicode library should
*transparently* handle such things as canonical equivalence in all
operations, such as searching and deleting characters.
unicode_string should be as easy to use as basic_string.

Regards,
Rogier

Rogier van Dalen