
Here are my thoughts on Unicode strings, based partially on the current
discussions of the topic. As I understand it, the problem with strings
(standard character and Unicode strings) can be broken down into several
stages:

[1] Storage And Representation

This is how the underlying string is stored (allocation and memory mapping
policy) and how it is represented (which is governed by locale, but at this
stage it is best to know the type: UTF-8, UTF-16, UTF-32, etc.). The storage
can easily be represented as a container type, and so we have:

   template
   <
      typename CharT,
      template< typename T, class A > class Container = std::vector,
      class AllocT = std::allocator< CharT >
   >
   class string_storage: public Container< CharT, AllocT >
   {
   };

Here, I have chosen std::vector as the standard storage policy, as this
reflects the current storage policies; thus, basic_string< CharT, Traits >
would be based on string_storage< CharT >. It would then be easy to select
other representations, like a reference-counted storage (a variant of
std::auto_ptr< std::vector >) and even an SGI-like rope! (Although this would
mean that a new std::roped_vector class would need to be implemented: does
such a thing already exist in Boost?) There is a rough sketch of how the
storage policy would be selected at the end of this message.

[2] Basic Type And Iteration

The basic representation is more complex, because now we are dealing with
character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
string). At this stage we are not concerned with combining characters and
marks, only with complete characters.

The Unicode string should provide at least 3 types of iterator, regardless of
the internal representation (NOTE: as such, how they are implemented will
depend on how the string is represented):

   *  UTF-8  -- provides access to the UTF-8 representation of the string;
   *  UTF-16 -- provides access to the UTF-16 representation of the string;
   *  UTF-32 -- provides access to the Unicode character type.

Therefore, no matter what the representation, it should be possible to use
the UTF-32 iterator variant and "see" the string in native Unicode; this
should be the standard iterator, and the others should be used when
converting between formats. A sketch of such an iterator is given at the end
of this message.

NOTE: I am not well versed in how Unicode is represented, so I do not know
how feasible it is to implement backwards traversal, but it would probably
be wise to know the position of the last good end of a Unicode character
(e.g. when dealing with multi-unit UTF-8 and UTF-16 sequences).

As a side note, it should be feasible to provide specialist wrappers around
existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?),
so I would suggest having something akin to char_traits in basic_string.

RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are
multi-unit encodings of a Unicode code point (not considering combining marks
at this stage), whereas UTF-32 encodes each code point as a single unit.

[3] Algorithms, Locales, etc.

These are built upon the UTF-32 view of the Unicode string, like the string
algorithms in the Boost library. Therefore, instead of
str.find( unicode_string( "World" )), you would have
find( str, unicode_string( "World" )); the last sketch at the end of this
message shows what this could look like.

I would also suggest that there be another iterator that operates on
std::pair< unicode_string::iterator, unicode_string::iterator > to group
combining marks, etc. Thus, there would also be a function

   unicode_string::utf32_t combine
   (
      std::pair< unicode_string::iterator, unicode_string::iterator > & ucr
   );

that will map the range into a single code point. You could therefore have a
combined_iterator that provides access to the sequence of combined
characters.

NOTE: If ucr.first == ucr.second, then combine( ucr ) == *( ucr.first ).
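To make [1] a little more concrete, here is a minimal, untested sketch of how
the Container policy would select the representation. The string_storage
template from [1] is repeated so the sketch stands alone; utf16_t is just a
placeholder for whatever 16-bit code unit type is chosen, and std::deque
stands in for any alternative container (a roped or reference-counted
container would slot in the same way):

   #include <deque>
   #include <memory>
   #include <vector>

   // string_storage as proposed in [1], repeated here for completeness
   template
   <
      typename CharT,
      template< typename T, class A > class Container = std::vector,
      class AllocT = std::allocator< CharT >
   >
   class string_storage: public Container< CharT, AllocT >
   {
   };

   typedef unsigned short utf16_t;  // placeholder code unit type

   // default policy: contiguous storage, as with the current basic_string
   typedef string_storage< utf16_t > contiguous_utf16_storage;

   // selecting another representation is just a matter of changing the
   // Container policy; std::deque is only used to illustrate the mechanism
   typedef string_storage< utf16_t, std::deque > deque_utf16_storage;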
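For [2], here is a rough, untested sketch of the kind of adapter that could
provide the UTF-32 view over a UTF-8 code unit sequence. It assumes
well-formed input, is forward-only, and omits the iterator_traits
boilerplate; utf8_to_utf32_iterator and utf32_t are names made up purely for
illustration:

   #include <iterator>

   typedef unsigned long utf32_t;  // placeholder for a 32-bit code point type

   template< typename BaseIterator >
   class utf8_to_utf32_iterator
   {
      public:
         explicit utf8_to_utf32_iterator( BaseIterator pos ): pos_( pos )
         {
         }
         utf32_t operator*() const  // decode the character starting at pos_
         {
            unsigned char lead = static_cast< unsigned char >( *pos_ );
            if( lead < 0x80 ) return lead;                            // 1 unit
            BaseIterator next = pos_;
            if( lead < 0xE0 ) return decode( lead & 0x1F, next, 1 );  // 2 units
            if( lead < 0xF0 ) return decode( lead & 0x0F, next, 2 );  // 3 units
            return decode( lead & 0x07, next, 3 );                    // 4 units
         }
         utf8_to_utf32_iterator & operator++()  // skip the current character
         {
            unsigned char lead = static_cast< unsigned char >( *pos_ );
            std::advance( pos_, lead < 0x80 ? 1 : lead < 0xE0 ? 2 :
                                lead < 0xF0 ? 3 : 4 );
            return *this;
         }
         bool operator==( const utf8_to_utf32_iterator & rhs ) const
         {
            return pos_ == rhs.pos_;
         }
         bool operator!=( const utf8_to_utf32_iterator & rhs ) const
         {
            return pos_ != rhs.pos_;
         }
      private:
         // accumulate the trailing continuation bytes onto the lead bits
         static utf32_t decode( utf32_t value, BaseIterator & pos, int trailing )
         {
            for( int i = 0; i < trailing; ++i )
            {
               ++pos;  // step onto the next continuation byte
               value = ( value << 6 ) |
                       ( static_cast< unsigned char >( *pos ) & 0x3F );
            }
            return value;
         }
         BaseIterator pos_;
   };

With something like this in place, the UTF-32 iterator of a UTF-8 backed
unicode_string could simply be a typedef for utf8_to_utf32_iterator over the
storage's own iterator.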
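Finally, for [3], a small sketch of what the non-member find() could look
like when written purely against the UTF-32 view. UnicodeString stands for
the proposed unicode_string (or anything exposing UTF-32 iterators via
begin()/end()), which is assumed rather than defined here:

   #include <algorithm>

   template< typename UnicodeString >
   typename UnicodeString::const_iterator
   find( const UnicodeString & str, const UnicodeString & what )
   {
      // std::search compares the UTF-32 views element by element, so the
      // result does not depend on how either string is stored internally
      return std::search( str.begin(), str.end(), what.begin(), what.end() );
   }

   // usage:  find( str, unicode_string( "World" ));

I have not sketched combine(), since a real implementation would need the
Unicode composition data; that is where a wrapper over an existing library
such as ICU would come in.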
Regards,
Reece