
"Reece Dunn" <msclrhd@hotmail.com> writes:
[1] Storage And Representation
The storage can easily be represented as a container type, and so we have:
template< typename CharT,
          template< typename T, class A > class Container = std::vector,
          class AllocT = std::allocator< CharT > >
class string_storage: public Container< CharT, AllocT > { };
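For illustration only, instantiating such a storage type might look like this (a minimal sketch; std::deque is just an example of an alternative backing container, and nothing here is part of any proposed interface):

#include <deque>
#include <memory>
#include <vector>

// Illustrative only: UTF-8 code units in the default std::vector backing,
// and the same template backed by std::deque instead.
string_storage< char > utf8_vector_storage;
string_storage< char, std::deque > utf8_deque_storage;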
I am not sure this really gains us anything over just using the underlying container directly.
[2] Basic Type And Iteration
The basic representation is more complex, because now we have to deal with character boundaries (in the UTF-8 and UTF-16 views of a Unicode string). At this stage, we should not be concerned with combining characters and marks, only with complete characters.
Here is the issue. What constitutes a complete character? At the lowest level, a single codepoint is a character. At the next level, a collection of codepoints (base+combining marks) is a character (e.g. e + acute accent is a single character). Sometimes there are many equivalent sequences of codepoints that constitute the same character. Sometimes there may be a single codepoint that is equivalent to a set of codepoints (e.g. e + acute accent => e-acute). At another level, a set of codepoints represents a glyph. This glyph may cover one or more characters. There may be several alternative glyphs for a single set of codepoints.
The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, how they are implemented will depend on how the string is represented):
* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.
I agree we need conversions to/from all 3 formats.
Therefore, no matter what the representation, it should be possible to use the UTF-32 iterator variant and "see" the string in native Unicode; this should be the standard iterator, and the others should be used when converting between formats.
That is my POV.
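For illustration, a UTF-32 iterator over UTF-8 storage would essentially decode one code point per step; a minimal sketch of that decoding step (no validation, and the name is purely illustrative) might be:

#include <cstddef>

// Decode one code point starting at p and report how many octets it used.
// No validation is performed; a real implementation must reject malformed
// sequences.
unsigned long decode_utf8( const unsigned char * p, std::size_t & length )
{
    if( p[0] < 0x80 )                        // 1 octet: ASCII
    {
        length = 1;
        return p[0];
    }
    else if( ( p[0] & 0xE0 ) == 0xC0 )       // 2 octets
    {
        length = 2;
        return ( ( p[0] & 0x1FUL ) << 6 ) | ( p[1] & 0x3F );
    }
    else if( ( p[0] & 0xF0 ) == 0xE0 )       // 3 octets
    {
        length = 3;
        return ( ( p[0] & 0x0FUL ) << 12 ) | ( ( p[1] & 0x3FUL ) << 6 )
             | ( p[2] & 0x3F );
    }
    else                                     // 4 octets
    {
        length = 4;
        return ( ( p[0] & 0x07UL ) << 18 ) | ( ( p[1] & 0x3FUL ) << 12 )
             | ( ( p[2] & 0x3FUL ) << 6 ) | ( p[3] & 0x3F );
    }
}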
NOTE: I am not well versed in how Unicode is represented, so I do not know how feasible it is to implement backwards traversal, but it would probably be wise to keep track of the position of the last known good end of a Unicode character (e.g. when dealing with multi-unit UTF-8 and UTF-16 sequences).
Backwards traversal is generally possible, though with UTF-8 it is very slow, as you don't know how many bytes there are until the beginning of the character (though you know when you've got there).
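That scan-back amounts to skipping continuation octets (10xxxxxx) until a lead octet is found; a minimal sketch, assuming well-formed input:

// Step back to the start of the previous UTF-8 character. Continuation
// octets have the bit pattern 10xxxxxx, so we skip them until we hit a
// lead octet. Assumes well-formed input and that p is not at the start.
const unsigned char * prev_utf8_char( const unsigned char * p )
{
    do
    {
        --p;
    }
    while( ( *p & 0xC0 ) == 0x80 );  // 0x80..0xBF are continuation octets
    return p;
}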
As a side note, it should be feasible to provide specialist wrappers around existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?),
Agreed.
so I would suggest having something akin to char_traits in basic_string.
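Something along these lines might be what is meant; purely a sketch with illustrative names, not an existing interface:

// Hypothetical traits-style customisation point, analogous to char_traits,
// through which a backend such as ICU, Win32 or libiconv could be plugged
// in. Declarations only; all names are illustrative.
template< typename CharT >
struct unicode_traits
{
    typedef unsigned long code_point;  // a UTF-32 value

    static const CharT * next( const CharT * pos, const CharT * end );
    static const CharT * prev( const CharT * begin, const CharT * pos );
    static code_point decode( const CharT * pos, const CharT * end );
};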
I am not sure how that helps.
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 encode a single character as one or more code units (not considering combining marks at this stage), whereas UTF-32 uses exactly one code unit per character.
Yes, that is why I believe we should use UTF-32 as the base (despite the performance considerations others have raised).
[3] Algorithms, Locales, etc.
These are built upon the UTF-32 view of the Unicode string, like the string algorithms in the Boost library. Therefore, instead of str.find( unicode_string( "World" )), you would have find( str, unicode_string( "World" )).
I am not sure how non-member vs member makes any difference.
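For comparison, the non-member form over the UTF-32 view could look something like this (a sketch in the style of the Boost string algorithms; utf32_begin()/utf32_end() and const_utf32_iterator are assumed names, not an existing interface):

#include <algorithm>

// Hypothetical non-member find expressed over the UTF-32 view.
template< typename UnicodeString >
typename UnicodeString::const_utf32_iterator
find( const UnicodeString & str, const UnicodeString & what )
{
    return std::search( str.utf32_begin(), str.utf32_end(),
                        what.utf32_begin(), what.utf32_end() );
}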
I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc. Thus, there would also be a function
unicode_string::utf32_t combine ( std::pair< unicode_string::iterator, unicode_string::iterator > & ucr )
that will map the range into a single code point. You could therefore have a combined_iterator that will provide access to the sequence of combined characters.
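For concreteness, one step of such a combined_iterator over the UTF-32 view might look like the sketch below, assuming a placeholder is_combining_mark() test (this groups a base code point with the combining marks that follow it, which is a simplification of full Unicode segmentation):

#include <utility>

// Placeholder property test: only the Combining Diacritical Marks block
// (U+0300..U+036F) is recognised here; a real implementation would consult
// the Unicode character database.
inline bool is_combining_mark( unsigned long cp )
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Advance over one "combined character" chunk (base code point plus any
// following combining marks) and return the range covering it. The
// iterator is assumed to yield UTF-32 code points.
template< typename Utf32Iterator >
std::pair< Utf32Iterator, Utf32Iterator >
next_combined( Utf32Iterator pos, Utf32Iterator end )
{
    Utf32Iterator first = pos;
    if( pos != end )
        ++pos;                                   // the base character
    while( pos != end && is_combining_mark( *pos ) )
        ++pos;                                   // attached combining marks
    return std::make_pair( first, pos );
}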
You cannot always map the sequence of codepoints that make up a character into a single codepoint. However, it is agreed that it would be nice to have a means of dealing with "character" chunks, independently of the number of codepoints that make up that character.

Anthony
--
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.