Re: [boost] Thoughts on Unicode and Strings

Jeremy Maitin-Shepard wrote:
"Reece Dunn" writes:
[1] Storage And Representation
This is how the underlying string is stored (allocation and memory mapping policy) and how it is represented (which is governed by locale, but at this stage, it is best to know the type: UTF-8, UTF-16, UTF-32, etc.)
I am not sure what you mean by "governed by locale." I believe the
What I meant was that things like character identification, upper/lower case conversion, etc. would be in the locale, although I did not express this very well. I know there are issues associated with using locales, and I have too little experience with them to comment further.
The storage can easily be represented as a container type, and so we have:
template< typename CharT,
          template< typename T, class A > class Container = std::vector,
          class AllocT = std::allocator< CharT > >
class string_storage: public Container< CharT, AllocT > { };
Here, I have chosen std::vector as the default storage policy, as this reflects the current storage policies; basic_string< CharT, Traits > would therefore be based on string_storage< CharT >.
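As a quick, compilable illustration of that sketch (the usage in main is mine, not part of the proposal):

#include <memory>
#include <vector>

// The string_storage sketch from above: with the default policy it behaves
// exactly like a std::vector of code units.
template< typename CharT,
          template< typename T, class A > class Container = std::vector,
          class AllocT = std::allocator< CharT > >
class string_storage: public Container< CharT, AllocT > { };

int main()
{
    string_storage< char > s;   // std::vector< char > underneath
    s.push_back( 'h' );
    s.push_back( 'i' );
    return s.size() == 2 ? 0 : 1;
}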
I do like this idea, although I think something like this might be better:
template <encoding enc, class Container = ...> class unicode_string;
"encoding" would be an enumeration type, such that enc would specify one of UTF-8, UTF-16, or UTF-32. This is, I would say, a more explicit way to specify the encoding that relying on the size of the type specified, and also it avoids problems in cases where the platform does not have distinct types for UTF-16 and UTF-32 code units (unlikely, I admit).
That would be a better idea. You would need something like:

template< int > class encoding_type {};
template<> class encoding_type< utf8_enc > { public: typedef char type; };
// ...

template< encoding enc,
          template< typename T, class A = std::allocator< T > > class Container = std::vector >
class unicode_string: public Container< typename encoding_type< enc >::type >
{
    ...
};
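To show how the pieces could fit together, here is a self-contained sketch; the encoding enumeration, the code-unit type choices, and the UTF-16/UTF-32 specializations are my assumptions, not something fixed by the proposal (fixed-width integer typedefs would be more portable than unsigned short/int):

#include <memory>
#include <vector>

// Hypothetical encoding enumeration; the names are illustrative only.
enum encoding { utf8_enc, utf16_enc, utf32_enc };

// Map each encoding onto an assumed code-unit type.
template< encoding > struct encoding_type {};
template<> struct encoding_type< utf8_enc >  { typedef char           type; };
template<> struct encoding_type< utf16_enc > { typedef unsigned short type; };
template<> struct encoding_type< utf32_enc > { typedef unsigned int   type; };

template< encoding enc,
          template< typename T, class A > class Container = std::vector >
class unicode_string:
    public Container< typename encoding_type< enc >::type,
                      std::allocator< typename encoding_type< enc >::type > >
{
    // Unicode-aware facilities would be layered on top of the raw storage.
};

int main()
{
    unicode_string< utf8_enc > s;   // std::vector< char > underneath
    s.push_back( 'A' );
    return 0;
}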
One issue which would need to be dealt with is that while it seems necessary for some containers, such as a rope, to have direct access to the container, publicly inheriting this unicode_string from the container type means that the additions to the interface must be more limited.
I don't get this. Surely the roped_vector, or whatever rope-like container is used, will have an STL container interface like std::vector, so you could use them interchangeably. The unicode_string facilities would make use of the insert/append functions, iterators, etc. of the storage container to implement their specific facilities. Thus, all the rope internals would be handled by roped_vector (or whatever the rope container is called), allowing you to use it like a std::vector, so the user of the container would be removed from the internals. This is the point of having Container as a template parameter in the first place.
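A small sketch of what is being described here (the function name is mine): as long as higher-level code is written only against the standard container operations, any conforming storage container can be dropped in.

#include <string>
#include <vector>

// Append a range of code units using nothing but the generic container
// interface (insert + iterators); std::vector, std::string, or a
// hypothetical roped_vector would all work here unchanged.
template< class Storage >
void append_code_units( Storage & store, const char * first, const char * last )
{
    store.insert( store.end(), first, last );
}

int main()
{
    const char hello[] = "hello";

    std::vector< char > v;
    append_code_units( v, hello, hello + 5 );

    std::string s;
    append_code_units( s, hello, hello + 5 );

    return ( v.size() == 5 && s == "hello" ) ? 0 : 1;
}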
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage, we should not be concerned with combining characters and marks, only with complete characters.
The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, their implementation will depend on how the string is represented):
* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.
This seems reasonable, although in practice the UTF-32/code-point iterator would be the most likely to be used.
Agreed, but the others would be useful: writing the string to a file, for example. This is why I suggest that the UTF-32 iterator is the default iterator (i.e. unicode_string::iterator is a UTF-32 iterator).
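To illustrate why a UTF-8 view is handy for tasks like writing to a file, here is a minimal hand-rolled encoder; a plain vector of code points stands in for the UTF-32 view, and surrogate/range checking is omitted:

#include <cstddef>
#include <fstream>
#include <vector>

// Encode a sequence of code points as UTF-8 bytes and stream them out.
// Error handling (surrogates, values above U+10FFFF) is omitted for brevity.
void write_utf8( std::ofstream & out, const std::vector< unsigned int > & cps )
{
    for ( std::size_t i = 0; i < cps.size(); ++i )
    {
        unsigned int cp = cps[ i ];
        if ( cp < 0x80 )                          // 1 code unit
        {
            out.put( static_cast< char >( cp ) );
        }
        else if ( cp < 0x800 )                    // 2 code units
        {
            out.put( static_cast< char >( 0xC0 | ( cp >> 6 ) ) );
            out.put( static_cast< char >( 0x80 | ( cp & 0x3F ) ) );
        }
        else if ( cp < 0x10000 )                  // 3 code units
        {
            out.put( static_cast< char >( 0xE0 | ( cp >> 12 ) ) );
            out.put( static_cast< char >( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) );
            out.put( static_cast< char >( 0x80 | ( cp & 0x3F ) ) );
        }
        else                                      // 4 code units
        {
            out.put( static_cast< char >( 0xF0 | ( cp >> 18 ) ) );
            out.put( static_cast< char >( 0x80 | ( ( cp >> 12 ) & 0x3F ) ) );
            out.put( static_cast< char >( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) );
            out.put( static_cast< char >( 0x80 | ( cp & 0x3F ) ) );
        }
    }
}

int main()
{
    std::ofstream out( "hello.txt", std::ios::binary );
    std::vector< unsigned int > cps;
    cps.push_back( 0x48 );     // 'H'
    cps.push_back( 0xE9 );     // U+00E9, encoded as two UTF-8 code units
    write_utf8( out, cps );
    return 0;
}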
As a side note, it should be feasible to provide specialist wrappers around existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?), so I would suggest having something akin to char_traits in basic_string.
I would say there is not much need to provide "specialist wrappers" over other libraries. Presumably, a lot of a Boost Unicode library could use code from ICU, but there is no advantage in attempting to use platform-specific facilities, and doing so would surely introduce inefficiency and great complication.
Okay. It was just an idea :)
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are multi-character encodings of UTF-32 (not considering combining marks at this stage), whereas UTF-32 is a single character encoding.
Yes, most processing would at the very least need to internally use this code-point iterator.
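A tiny example of the difference the rationale points at (the helper names are mine): counting code points in UTF-8 means skipping continuation code units, whereas in UTF-32 it is just pointer arithmetic.

#include <cstddef>

// In UTF-8 a code point occupies one to four code units, so "one character
// forward" means skipping 10xxxxxx continuation bytes.
std::size_t count_code_points_utf8( const unsigned char * first,
                                    const unsigned char * last )
{
    std::size_t n = 0;
    for ( ; first != last; ++first )
        if ( ( *first & 0xC0 ) != 0x80 )   // not a continuation byte
            ++n;
    return n;
}

// In UTF-32 every code point is exactly one code unit.
std::size_t count_code_points_utf32( const unsigned int * first,
                                     const unsigned int * last )
{
    return static_cast< std::size_t >( last - first );
}

int main()
{
    const unsigned char utf8[] = { 0x48, 0xC3, 0xA9 };   // "H" + U+00E9
    const unsigned int utf32[] = { 0x48, 0xE9 };
    return ( count_code_points_utf8( utf8, utf8 + 3 ) ==
             count_code_points_utf32( utf32, utf32 + 2 ) ) ? 0 : 1;
}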
There is a significant advantage, however, in standardizing on a single POD-array representation as well, such that much of the library need not be templated (and thus implemented largely in header files and compiled on each use), and less efficient functions could also be provided which allow arbitrary iterators through runtime polymorphism. I think it will be particularly important to examine this trade-off, because Unicode support involves a large number of heavy-weight facilities.
Agreed. However, this conflicts with allowing the user to specify the container used for string storage. Perhaps there could be a templated version for users who want a custom storage policy, such as a rope, and a fixed representation (UTF-16?) for those who are not bothered about how the Unicode string is stored. The interfaces of these should be the same, so that the higher-level facilities can interoperate with both representations.
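A rough sketch of the non-templated route described above (all names here are mine): heavy algorithms can be compiled once against a small runtime-polymorphic code-point source, with a faster array-based overload reserved for the common case.

#include <cstddef>

// Abstract code-point source: any storage or encoding can sit behind it.
class code_point_source
{
public:
    virtual ~code_point_source() {}
    virtual bool next( unsigned int & cp ) = 0;   // false at end of sequence
};

// Adapter over a contiguous UTF-32 buffer.
class utf32_array_source: public code_point_source
{
public:
    utf32_array_source( const unsigned int * first, const unsigned int * last )
        : cur_( first ), last_( last ) {}

    virtual bool next( unsigned int & cp )
    {
        if ( cur_ == last_ )
            return false;
        cp = *cur_++;
        return true;
    }

private:
    const unsigned int * cur_;
    const unsigned int * last_;
};

// A non-templated library function: compiled once, usable with any source.
bool contains( code_point_source & src, unsigned int what )
{
    unsigned int cp;
    while ( src.next( cp ) )
        if ( cp == what )
            return true;
    return false;
}

int main()
{
    const unsigned int text[] = { 'H', 'e', 'l', 'l', 'o' };
    utf32_array_source src( text, text + 5 );
    return contains( src, 'e' ) ? 0 : 1;
}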
[3] Algorithms, Locales, etc.
These are built upon the UTF-32 view of the Unicode string, like the string algorithms in the Boost library. Therefore, instead of str.find( unicode_string( "World" )), you would have find( str, unicode_string( "World" )).
Well, except that you would want the strength, etc. to be adjustable, and of course localized, and string literals pose additional problems...
The logic behind this was for unicode_string to deal with navigating through, and mapping to, the internal representation. The find functions, etc. could then be implemented as templates that iterate over the UTF-32 iterators, e.g. like the string algorithms.
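A sketch of that free-function style (the signature is mine, and locale-aware collation is deliberately ignored): find() written purely against UTF-32 iterators, in the spirit of the Boost string algorithms.

#include <algorithm>

// Generic find over code-point iterators; std::search does the actual work.
template< class Utf32Iterator1, class Utf32Iterator2 >
Utf32Iterator1 find( Utf32Iterator1 first, Utf32Iterator1 last,
                     Utf32Iterator2 pat_first, Utf32Iterator2 pat_last )
{
    return std::search( first, last, pat_first, pat_last );
}

int main()
{
    const unsigned int hay[]    = { 'H', 'e', 'l', 'l', 'o', ' ',
                                    'W', 'o', 'r', 'l', 'd' };
    const unsigned int needle[] = { 'W', 'o', 'r', 'l', 'd' };

    const unsigned int * pos = find( hay, hay + 11, needle, needle + 5 );
    return ( pos == hay + 6 ) ? 0 : 1;   // found at code-point offset 6
}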
I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc.
These code point groups are referred to as grapheme clusters, and I certainly agree that it is necessary to provide an iterator interface to grapheme clusters. I would not suggest, however, that normalization be integrated into that interface, because only a small portion of the possible grapheme clusters can be normalized into a single code point, and I don't think it is a particularly common operation to do so, especially for only a single grapheme cluster.
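A very rough sketch of grouping code points into grapheme-cluster ranges as iterator pairs: real boundaries follow Unicode's segmentation rules (UAX #29); here a "combining mark" is crudely approximated by the U+0300..U+036F block purely for illustration, and the names are mine.

#include <utility>
#include <vector>

typedef const unsigned int * cp_iterator;
typedef std::pair< cp_iterator, cp_iterator > cluster;   // [begin, end)

// Crude approximation: only the combining diacritical marks block.
inline bool is_combining_mark( unsigned int cp )
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Group a code-point range into clusters: a base code point plus any
// immediately following combining marks.
std::vector< cluster > clusters( cp_iterator first, cp_iterator last )
{
    std::vector< cluster > result;
    while ( first != last )
    {
        cp_iterator start = first++;
        while ( first != last && is_combining_mark( *first ) )
            ++first;                         // absorb trailing combining marks
        result.push_back( cluster( start, first ) );
    }
    return result;
}

int main()
{
    // 'e' + COMBINING ACUTE ACCENT, then 'x': two clusters, not three.
    const unsigned int text[] = { 0x65, 0x0301, 0x78 };
    return ( clusters( text, text + 3 ).size() == 2 ) ? 0 : 1;
}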
That makes sense. Going further into grapheme clusters would be too complicated for a generic unicode library, as you would need to then consider how to map the cluster into the appropriate font: this would be platform specific and far too complex (e.g. overlaying combining marks, etc.)!

Regards,
Reece

At 9:48 PM +0100 4/16/04, Reece Dunn wrote:
Jeremy Maitin-Shepard wrote:
"Reece Dunn" writes:
[ big snip ]
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage, we should not be concerned with combining characters and marks, only with complete characters.
The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, their implementation will depend on how the string is represented):
* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.
This seems reasonable, although in practice the UTF-32/code-point iterator would be the most likely to be used.
Agreed, but the others would be useful: writing the string to a file, for example. This is why I suggest that the UTF-32 iterator is the default iterator (i.e. unicode_string::iterator is a UTF-32 iterator).
[ more snipped ]
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are multi-character encodings of UTF-32 (not considering combining marks at this stage), whereas UTF-32 is a single character encoding.
I'm pretty sure that this is a bad assumption. You can't just ignore combining characters. I believe that Miro posted an example of how (even using UTF-32), you may not have a single character <<-->> single "entry" mapping.

-- Marshall Clow, Idio Software <mailto:marshall@idio.com>
"I want a machine that thinks I'm more important than it is, and acts like it." -- Eric Herrmann

"Reece Dunn" <msclrhd@hotmail.com> writes:
[snip]
One issue which would need to be dealt with is that while it seems necessary for some containers, such as a rope, to have direct access to the container, publicly inheriting this unicode_string from the container type means that the additions to the interface must be more limited.
I don't get this. Surely the roped_vector, or whatever rope-like container is used, will have an STL container interface like std::vector, so you could use them interchangeably. The unicode_string facilities would make use of the insert/append functions, iterators, etc. of the storage container to implement their specific facilities.
The issue is that the benefits of using a specialized data structure such as a rope are likely seen only by using the rope-specific interface; the container interface would probably not provide many advantages. Access to the underlying container could nonetheless be provided by a container() method.
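A sketch of that alternative (names mine): keep the storage private and hand it out through container(), so rope-specific operations stay reachable without widening unicode_string's own interface.

#include <vector>

template< class Container = std::vector< char > >
class unicode_string
{
public:
    // Expose the underlying storage on request instead of inheriting from it.
    Container &       container()       { return storage_; }
    const Container & container() const { return storage_; }

    // ... the Unicode-aware interface would be built on storage_ ...

private:
    Container storage_;
};

int main()
{
    unicode_string<> s;
    s.container().push_back( 'A' );   // drop down to the raw container
    return ( s.container().size() == 1 ) ? 0 : 1;
}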
[snip]
There is a significant advantage, however, in standardizing on a single POD-array representation as well, such that much of the library need not be templated (and thus implemented largely in header files and compiled on each use), and less efficient functions could also be provided which allow arbitrary iterators through runtime polymorphism. I think it will be particularly important to examine this trade-off, because Unicode support involves a large number of heavy-weight facilities.
Agreed. However, this conflicts with allowing the user to specify the container used for string storage.
Yes, I realize that. On the one hand, I really like making everything work with any encoding, any container, etc. On the other hand, I don't think it is feasible to stick everything in header files, although it may prove possible to make the access to the locale and Unicode data non-templated, and thus limit the code that must be in the header files.
Perhaps there could be a templated version for users who want a custom storage policy, such as a rope, and a fixed representation (UTF-16?) for those who are not bothered about how the Unicode string is stored. The interfaces of these should be the same, so that the higher-level facilities can interoperate with both representations.
It seems that this might introduce even more overhead. A less run-time efficient, but more code-size and compile-time efficient solution would be, as I described, to provide a UTF-16 array interface and a run-time polymorphic iterator interface, which would be used (automatically) for all non-UTF-16 array sources/iterators. In practice, then, it might be useful to limit unicode_string at least to containers which can provide an array of code units, such as vector or basic_string. (Unfortunately, the interface for getting the array of code units differs for vector and basic_string.) Alternatively, it might make sense to not allow the user to specify a storage container to unicode_string at all. Maybe you have some other ideas about this.

To get a sense of just how complex Unicode handling is, download the source to the ICU library: ftp://www-126.ibm.com/pub/icu/2.6.2/icu-2.6.2.zip (9.4 MB) or ftp://www-126.ibm.com/pub/icu/2.6.2/icu-2.6.2.tgz (8.3 MB). For instance, searching is implemented in usearch.cpp.
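For the parenthetical point about the differing interfaces, a small pair of overloads could paper over it; here wchar_t stands in for a UTF-16 code unit (true on Windows, an assumption elsewhere), and the function name is mine.

#include <string>
#include <vector>

// &v[0] yields the contiguous array for std::vector, while basic_string
// exposes it via data(); two overloads hide the difference.
inline const wchar_t * code_unit_array( const std::vector< wchar_t > & v )
{
    return v.empty() ? 0 : &v[ 0 ];
}

inline const wchar_t * code_unit_array( const std::wstring & s )
{
    return s.data();
}

int main()
{
    std::wstring s( L"Hi" );
    std::vector< wchar_t > v( s.begin(), s.end() );
    return ( code_unit_array( s )[ 0 ] == code_unit_array( v )[ 0 ] ) ? 0 : 1;
}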
[snip: searching]
[snip: grapheme cluster iterator notes]
That makes sense. Going further into grapheme clusters would be too complicated for a generic unicode library, as you would need to then consider how to map the cluster into the appropriate font: this would be platform specific and far too complex (e.g. overlaying combining marks, etc.)!
The ICU library provides some additional facilities which could be used by formatting engines, such as an implementation of the Unicode Bidirectional Algorithm. It might be best to avoid trying to add such facilities to Boost, however.

-- Jeremy Maitin-Shepard
participants (3)
- Jeremy Maitin-Shepard
- Marshall Clow
- Reece Dunn