
Here are my thoughts on Unicode strings, based partially on the current
discussions of the topic. As I understand it, the problem with strings
(standard character and Unicode strings) can be broken down into several
stages:

[1] Storage And Representation

This is how the underlying string is stored (allocation and memory mapping
policy) and how it is represented (which is governed by locale, but at this
stage it is best to know the type: UTF-8, UTF-16, UTF-32, etc.). The storage
can easily be represented as a container type, and so we have:

   template
   <
      typename CharT,
      template< typename T, class A > class Container = std::vector,
      class AllocT = std::allocator< CharT >
   >
   class string_storage: public Container< CharT, AllocT >
   {
   };

Here, I have chosen std::vector as the standard storage policy, as this
reflects the current storage policies; thus, basic_string< CharT, Traits >
would be based on string_storage< CharT >. It would then be easy to select
other representations, like a reference-counted storage (a variant of
std::auto_ptr< std::vector >) and even an SGI-like rope! (Although this would
mean that a new std::roped_vector class would need to be implemented: does
such a thing already exist in Boost?) There is a rough sketch of how the
storage policy would be selected at the end of this message.

[2] Basic Type And Iteration

The basic representation is more complex, because now we are dealing with
character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode
string). At this stage we are not concerned with combining characters and
marks, only with complete characters.

The Unicode string should provide at least 3 types of iterator, regardless of
the internal representation (NOTE: as such, how they are implemented will
depend on how the string is represented):

   *  UTF-8  -- provides access to the UTF-8 representation of the string;
   *  UTF-16 -- provides access to the UTF-16 representation of the string;
   *  UTF-32 -- provides access to the Unicode character type.

Therefore, no matter what the representation, it should be possible to use
the UTF-32 iterator variant and "see" the string in native Unicode; this
should be the standard iterator, and the others should be used when
converting between formats. A sketch of such an iterator is given at the end
of this message.

NOTE: I am not well versed in how Unicode is represented, so I do not know
how feasible it is to implement backwards traversal, but it would probably
be wise to know the position of the last good end of a Unicode character
(e.g. when dealing with multi-unit UTF-8 and UTF-16 sequences).

As a side note, it should be feasible to provide specialist wrappers around
existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?),
so I would suggest having something akin to char_traits in basic_string.

RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are
multi-unit encodings of a Unicode code point (not considering combining marks
at this stage), whereas UTF-32 encodes each code point as a single unit.

[3] Algorithms, Locales, etc.

These are built upon the UTF-32 view of the Unicode string, like the string
algorithms in the Boost library. Therefore, instead of
str.find( unicode_string( "World" )), you would have
find( str, unicode_string( "World" )); the last sketch at the end of this
message shows what this could look like.

I would also suggest that there be another iterator that operates on
std::pair< unicode_string::iterator, unicode_string::iterator > to group
combining marks, etc. Thus, there would also be a function

   unicode_string::utf32_t combine
   (
      std::pair< unicode_string::iterator, unicode_string::iterator > & ucr
   );

that will map the range into a single code point. You could therefore have a
combined_iterator that provides access to the sequence of combined
characters.

NOTE: If ucr.first == ucr.second, then combine( ucr ) == *( ucr.first ).
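To make [1] a little more concrete, here is a minimal, untested sketch of how
the Container policy would select the representation. The string_storage
template from [1] is repeated so the sketch stands alone; utf16_t is just a
placeholder for whatever 16-bit code unit type is chosen, and std::deque
stands in for any alternative container (a roped or reference-counted
container would slot in the same way):

   #include <deque>
   #include <memory>
   #include <vector>

   // string_storage as proposed in [1], repeated here for completeness
   template
   <
      typename CharT,
      template< typename T, class A > class Container = std::vector,
      class AllocT = std::allocator< CharT >
   >
   class string_storage: public Container< CharT, AllocT >
   {
   };

   typedef unsigned short utf16_t;  // placeholder code unit type

   // default policy: contiguous storage, as with the current basic_string
   typedef string_storage< utf16_t > contiguous_utf16_storage;

   // selecting another representation is just a matter of changing the
   // Container policy; std::deque is only used to illustrate the mechanism
   typedef string_storage< utf16_t, std::deque > deque_utf16_storage;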
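For [2], here is a rough, untested sketch of the kind of adapter that could
provide the UTF-32 view over a UTF-8 code unit sequence. It assumes
well-formed input, is forward-only, and omits the iterator_traits
boilerplate; utf8_to_utf32_iterator and utf32_t are names made up purely for
illustration:

   #include <iterator>

   typedef unsigned long utf32_t;  // placeholder for a 32-bit code point type

   template< typename BaseIterator >
   class utf8_to_utf32_iterator
   {
      public:
         explicit utf8_to_utf32_iterator( BaseIterator pos ): pos_( pos )
         {
         }
         utf32_t operator*() const  // decode the character starting at pos_
         {
            unsigned char lead = static_cast< unsigned char >( *pos_ );
            if( lead < 0x80 ) return lead;                            // 1 unit
            BaseIterator next = pos_;
            if( lead < 0xE0 ) return decode( lead & 0x1F, next, 1 );  // 2 units
            if( lead < 0xF0 ) return decode( lead & 0x0F, next, 2 );  // 3 units
            return decode( lead & 0x07, next, 3 );                    // 4 units
         }
         utf8_to_utf32_iterator & operator++()  // skip the current character
         {
            unsigned char lead = static_cast< unsigned char >( *pos_ );
            std::advance( pos_, lead < 0x80 ? 1 : lead < 0xE0 ? 2 :
                                lead < 0xF0 ? 3 : 4 );
            return *this;
         }
         bool operator==( const utf8_to_utf32_iterator & rhs ) const
         {
            return pos_ == rhs.pos_;
         }
         bool operator!=( const utf8_to_utf32_iterator & rhs ) const
         {
            return pos_ != rhs.pos_;
         }
      private:
         // accumulate the trailing continuation bytes onto the lead bits
         static utf32_t decode( utf32_t value, BaseIterator & pos, int trailing )
         {
            for( int i = 0; i < trailing; ++i )
            {
               ++pos;  // step onto the next continuation byte
               value = ( value << 6 ) |
                       ( static_cast< unsigned char >( *pos ) & 0x3F );
            }
            return value;
         }
         BaseIterator pos_;
   };

With something like this in place, the UTF-32 iterator of a UTF-8 backed
unicode_string could simply be a typedef for utf8_to_utf32_iterator over the
storage's own iterator.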
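Finally, for [3], a small sketch of what the non-member find() could look
like when written purely against the UTF-32 view. UnicodeString stands for
the proposed unicode_string (or anything exposing UTF-32 iterators via
begin()/end()), which is assumed rather than defined here:

   #include <algorithm>

   template< typename UnicodeString >
   typename UnicodeString::const_iterator
   find( const UnicodeString & str, const UnicodeString & what )
   {
      // std::search compares the UTF-32 views element by element, so the
      // result does not depend on how either string is stored internally
      return std::search( str.begin(), str.end(), what.begin(), what.end() );
   }

   // usage:  find( str, unicode_string( "World" ));

I have not sketched combine(), since a real implementation would need the
Unicode composition data; that is where a wrapper over an existing library
such as ICU would come in.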
Regards,
Reece