
"Reece Dunn" <msclrhd@hotmail.com> writes:
[1] Storage And Representation
The storage can easily be represented as a container type, and so we have:
template< typename CharT,
          template< typename T, class A > class Container = std::vector,
          class AllocT = std::allocator< CharT > >
class string_storage: public Container< CharT, AllocT > { };
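For illustration only, instantiating such a storage type might look like this (a minimal sketch; std::deque is just an example of an alternative backing container, and nothing here is part of any proposed interface):

#include <deque>
#include <memory>
#include <vector>

// Illustrative only: UTF-8 code units in the default std::vector backing,
// and the same template backed by std::deque instead.
string_storage< char > utf8_vector_storage;
string_storage< char, std::deque > utf8_deque_storage;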
I am not sure this really gains us anything over just using the underlying container directly.
[2] Basic Type And Iteration
The basic representation is more complex, because now we have to deal with character boundaries (in the UTF-8 and UTF-16 views of a Unicode string). At this stage, we should not be concerned with combining characters and marks, only with complete characters.
Here is the issue. What constitutes a complete character? At the lowest level, a single codepoint is a character. At the next level, a collection of codepoints (base+combining marks) is a character (e.g. e + acute accent is a single character). Sometimes there are many equivalent sequences of codepoints that constitute the same character. Sometimes there may be a single codepoint that is equivalent to a set of codepoints (e.g. e + acute accent => e-acute). At another level, a set of codepoints represents a glyph. This glyph may cover one or more characters. There may be several alternative glyphs for a single set of codepoints.
The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, how they are implemented will depend on how the string is represented):
* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.
I agree we need conversions to/from all 3 formats.
Therefore, no matter what the representation, it should be possible to use the UTF-32 iterator variant and "see" the string in native Unicode; this should be the standard iterator, and the others should be used when converting between formats.
That is my POV.
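For illustration, a UTF-32 iterator over UTF-8 storage would essentially decode one code point per step; a minimal sketch of that decoding step (no validation, and the name is purely illustrative) might be:

#include <cstddef>

// Decode one code point starting at p and report how many octets it used.
// No validation is performed; a real implementation must reject malformed
// sequences.
unsigned long decode_utf8( const unsigned char * p, std::size_t & length )
{
    if( p[0] < 0x80 )                        // 1 octet: ASCII
    {
        length = 1;
        return p[0];
    }
    else if( ( p[0] & 0xE0 ) == 0xC0 )       // 2 octets
    {
        length = 2;
        return ( ( p[0] & 0x1FUL ) << 6 ) | ( p[1] & 0x3F );
    }
    else if( ( p[0] & 0xF0 ) == 0xE0 )       // 3 octets
    {
        length = 3;
        return ( ( p[0] & 0x0FUL ) << 12 ) | ( ( p[1] & 0x3FUL ) << 6 )
             | ( p[2] & 0x3F );
    }
    else                                     // 4 octets
    {
        length = 4;
        return ( ( p[0] & 0x07UL ) << 18 ) | ( ( p[1] & 0x3FUL ) << 12 )
             | ( ( p[2] & 0x3FUL ) << 6 ) | ( p[3] & 0x3F );
    }
}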
NOTE: I am not well versed in how Unicode is represented, so I do not know how feasible it is to implement backwards traversal, but it would probably be wise to keep track of the position of the last known good end of a Unicode character (e.g. when dealing with multi-unit UTF-8 and UTF-16 sequences).
Backwards traversal is generally possible, though with UTF-8 it is very slow, as you don't know how many bytes there are until the beginning of the character (though you know when you've got there).
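That scan-back amounts to skipping continuation octets (10xxxxxx) until a lead octet is found; a minimal sketch, assuming well-formed input:

// Step back to the start of the previous UTF-8 character. Continuation
// octets have the bit pattern 10xxxxxx, so we skip them until we hit a
// lead octet. Assumes well-formed input and that p is not at the start.
const unsigned char * prev_utf8_char( const unsigned char * p )
{
    do
    {
        --p;
    }
    while( ( *p & 0xC0 ) == 0x80 );  // 0x80..0xBF are continuation octets
    return p;
}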
As a side note, it should be feasible to provide specialist wrappers around existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?),
Agreed.
so I would suggest having something akin to char_traits in basic_string.
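Something along these lines might be what is meant; purely a sketch with illustrative names, not an existing interface:

// Hypothetical traits-style customisation point, analogous to char_traits,
// through which a backend such as ICU, Win32 or libiconv could be plugged
// in. Declarations only; all names are illustrative.
template< typename CharT >
struct unicode_traits
{
    typedef unsigned long code_point;  // a UTF-32 value

    static const CharT * next( const CharT * pos, const CharT * end );
    static const CharT * prev( const CharT * begin, const CharT * pos );
    static code_point decode( const CharT * pos, const CharT * end );
};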
I am not sure how that helps.
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 encode a single character as one or more code units (not considering combining marks at this stage), whereas UTF-32 uses exactly one code unit per character.
Yes, that is why I believe we should use UTF-32 as the base (despite the performance considerations others have raised).
[3] Algorithms, Locales, etc.
These are built upon the UTF-32 view of the Unicode string, like the string algorithms in the Boost library. Therefore, instead of str.find( unicode_string( "World" )), you would have find( str, unicode_string( "World" )).
I am not sure how non-member vs member makes any difference.
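For comparison, the non-member form over the UTF-32 view could look something like this (a sketch in the style of the Boost string algorithms; utf32_begin()/utf32_end() and const_utf32_iterator are assumed names, not an existing interface):

#include <algorithm>

// Hypothetical non-member find expressed over the UTF-32 view.
template< typename UnicodeString >
typename UnicodeString::const_utf32_iterator
find( const UnicodeString & str, const UnicodeString & what )
{
    return std::search( str.utf32_begin(), str.utf32_end(),
                        what.utf32_begin(), what.utf32_end() );
}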
I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc. Thus, there would also be a function
unicode_string::utf32_t combine ( std::pair< unicode_string::iterator, unicode_string::iterator > & ucr )
that will map the range into a single code point. You could therefore have a combined_iterator that will provide access to the sequence of combined characters.
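For concreteness, one step of such a combined_iterator over the UTF-32 view might look like the sketch below, assuming a placeholder is_combining_mark() test (this groups a base code point with the combining marks that follow it, which is a simplification of full Unicode segmentation):

#include <utility>

// Placeholder property test: only the Combining Diacritical Marks block
// (U+0300..U+036F) is recognised here; a real implementation would consult
// the Unicode character database.
inline bool is_combining_mark( unsigned long cp )
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Advance over one "combined character" chunk (base code point plus any
// following combining marks) and return the range covering it. The
// iterator is assumed to yield UTF-32 code points.
template< typename Utf32Iterator >
std::pair< Utf32Iterator, Utf32Iterator >
next_combined( Utf32Iterator pos, Utf32Iterator end )
{
    Utf32Iterator first = pos;
    if( pos != end )
        ++pos;                                   // the base character
    while( pos != end && is_combining_mark( *pos ) )
        ++pos;                                   // attached combining marks
    return std::make_pair( first, pos );
}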
You cannot always map the sequence of codepoints that make up a character into a single codepoint. However, it is agreed that it would be nice to have a means of dealing with "character" chunks, independently of the number of codepoints that make up that character.

Anthony
--
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.