Re: [boost] Thoughts on Unicode and Strings

Jeremy Maitin-Shepard wrote:
"Reece Dunn" writes:
[1] Storage And Representation
This is how the underlying string is stored (allocation and memory mapping policy) and how it is represented (which is governed by locale, but at this stage, it is best to know the type: UTF-8, UTF-16, UTF-32, etc.)
I am not sure what you mean by "governed by locale." I believe the
What I meant was that things like character identification, upper/lower case conversion, etc. would be in the locale, although I did not express this very well. I know there are issues associated with using locales, and I have too little experience with them to comment further.
The storage can easily be represented as a container type, and so we have:
template< typename CharT,
          template< typename T, class A > class Container = std::vector,
          class AllocT = std::allocator< CharT > >
class string_storage: public Container< CharT, AllocT > { };
Here, I have chosen std::vector as the default storage policy, as this reflects the current storage policies; basic_string< CharT, Traits > would therefore be based on string_storage< CharT >.
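As a quick, compilable illustration of that sketch (the usage in main is mine, not part of the proposal):

#include <memory>
#include <vector>

// The string_storage sketch from above: with the default policy it behaves
// exactly like a std::vector of code units.
template< typename CharT,
          template< typename T, class A > class Container = std::vector,
          class AllocT = std::allocator< CharT > >
class string_storage: public Container< CharT, AllocT > { };

int main()
{
    string_storage< char > s;   // std::vector< char > underneath
    s.push_back( 'h' );
    s.push_back( 'i' );
    return s.size() == 2 ? 0 : 1;
}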
I do like this idea, although I think something like this might be better:
template <encoding enc, class Container = ...> class unicode_string;
"encoding" would be an enumeration type, such that enc would specify one of UTF-8, UTF-16, or UTF-32. This is, I would say, a more explicit way to specify the encoding that relying on the size of the type specified, and also it avoids problems in cases where the platform does not have distinct types for UTF-16 and UTF-32 code units (unlikely, I admit).
That would be a better idea. You would need something like:

template< int > class encoding_type {};
template<> class encoding_type< utf8_enc > { public: typedef char type; };
// ...

template< encoding enc,
          template< typename T, class A = std::allocator< T > > class Container = std::vector >
class unicode_string: public Container< typename encoding_type< enc >::type >
{
    ...
};
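To show how the pieces could fit together, here is a self-contained sketch; the encoding enumeration, the code-unit type choices, and the UTF-16/UTF-32 specializations are my assumptions, not something fixed by the proposal (fixed-width integer typedefs would be more portable than unsigned short/int):

#include <memory>
#include <vector>

// Hypothetical encoding enumeration; the names are illustrative only.
enum encoding { utf8_enc, utf16_enc, utf32_enc };

// Map each encoding onto an assumed code-unit type.
template< encoding > struct encoding_type {};
template<> struct encoding_type< utf8_enc >  { typedef char           type; };
template<> struct encoding_type< utf16_enc > { typedef unsigned short type; };
template<> struct encoding_type< utf32_enc > { typedef unsigned int   type; };

template< encoding enc,
          template< typename T, class A > class Container = std::vector >
class unicode_string:
    public Container< typename encoding_type< enc >::type,
                      std::allocator< typename encoding_type< enc >::type > >
{
    // Unicode-aware facilities would be layered on top of the raw storage.
};

int main()
{
    unicode_string< utf8_enc > s;   // std::vector< char > underneath
    s.push_back( 'A' );
    return 0;
}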
One issue which would need to be dealt with is that while it seems necessary for some containers, such as a rope, to have direct access to the container, publicly inheriting this unicode_string from the container type means that the additions to the interface must be more limited.
I don't get this. Surely the roped_vector, or whatever rope-like container is used, will have an STL container interface like std::vector, so you could use them interchangeably. The unicode_string facilities would make use of the insert/append functions, iterators, etc. of the storage container to implement their specific facilities. Thus, all the rope internals would be handled by roped_vector (or whatever the rope container is called), allowing you to use it like a std::vector, so the user of the container would be removed from the internals. This is the point of having Container as a template parameter in the first place.
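A small sketch of what is being described here (the function name is mine): as long as higher-level code is written only against the standard container operations, any conforming storage container can be dropped in.

#include <string>
#include <vector>

// Append a range of code units using nothing but the generic container
// interface (insert + iterators); std::vector, std::string, or a
// hypothetical roped_vector would all work here unchanged.
template< class Storage >
void append_code_units( Storage & store, const char * first, const char * last )
{
    store.insert( store.end(), first, last );
}

int main()
{
    const char hello[] = "hello";

    std::vector< char > v;
    append_code_units( v, hello, hello + 5 );

    std::string s;
    append_code_units( s, hello, hello + 5 );

    return ( v.size() == 5 && s == "hello" ) ? 0 : 1;
}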
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage, we should not be concerned with combining characters and marks, only with complete characters.
The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, their implementation will depend on how the string is represented):
* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.
This seems reasonable, although in practice the UTF-32/code-point iterator would be the most likely to be used.
Agreed, but the others would be useful: writing the string to a file, for example. This is why I suggest that the UTF-32 iterator is the default iterator (i.e. unicode_string::iterator is a UTF-32 iterator).
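To illustrate why a UTF-8 view is handy for tasks like writing to a file, here is a minimal hand-rolled encoder; a plain vector of code points stands in for the UTF-32 view, and surrogate/range checking is omitted:

#include <cstddef>
#include <fstream>
#include <vector>

// Encode a sequence of code points as UTF-8 bytes and stream them out.
// Error handling (surrogates, values above U+10FFFF) is omitted for brevity.
void write_utf8( std::ofstream & out, const std::vector< unsigned int > & cps )
{
    for ( std::size_t i = 0; i < cps.size(); ++i )
    {
        unsigned int cp = cps[ i ];
        if ( cp < 0x80 )                          // 1 code unit
        {
            out.put( static_cast< char >( cp ) );
        }
        else if ( cp < 0x800 )                    // 2 code units
        {
            out.put( static_cast< char >( 0xC0 | ( cp >> 6 ) ) );
            out.put( static_cast< char >( 0x80 | ( cp & 0x3F ) ) );
        }
        else if ( cp < 0x10000 )                  // 3 code units
        {
            out.put( static_cast< char >( 0xE0 | ( cp >> 12 ) ) );
            out.put( static_cast< char >( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) );
            out.put( static_cast< char >( 0x80 | ( cp & 0x3F ) ) );
        }
        else                                      // 4 code units
        {
            out.put( static_cast< char >( 0xF0 | ( cp >> 18 ) ) );
            out.put( static_cast< char >( 0x80 | ( ( cp >> 12 ) & 0x3F ) ) );
            out.put( static_cast< char >( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) );
            out.put( static_cast< char >( 0x80 | ( cp & 0x3F ) ) );
        }
    }
}

int main()
{
    std::ofstream out( "hello.txt", std::ios::binary );
    std::vector< unsigned int > cps;
    cps.push_back( 0x48 );     // 'H'
    cps.push_back( 0xE9 );     // U+00E9, encoded as two UTF-8 code units
    write_utf8( out, cps );
    return 0;
}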
As a side note, it should be feasible to provide specialist wrappers around existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?), so I would suggest having something akin to char_traits in basic_string.
I would say there is not much need to provide "specialist wrappers" over other libraries. Presumably, a lot of a Boost Unicode library could use code from ICU, but there is no advantage in attempting to use platform-specific facilities, and doing so would surely introduce inefficiency and great complication.
Okay. It was just an idea :)
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are multi-character encodings of UTF-32 (not considering combining marks at this stage), whereas UTF-32 is a single character encoding.
Yes, most processing would at the very least need to internally use this code-point iterator.
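A tiny example of the difference the rationale points at (the helper names are mine): counting code points in UTF-8 means skipping continuation code units, whereas in UTF-32 it is just pointer arithmetic.

#include <cstddef>

// In UTF-8 a code point occupies one to four code units, so "one character
// forward" means skipping 10xxxxxx continuation bytes.
std::size_t count_code_points_utf8( const unsigned char * first,
                                    const unsigned char * last )
{
    std::size_t n = 0;
    for ( ; first != last; ++first )
        if ( ( *first & 0xC0 ) != 0x80 )   // not a continuation byte
            ++n;
    return n;
}

// In UTF-32 every code point is exactly one code unit.
std::size_t count_code_points_utf32( const unsigned int * first,
                                     const unsigned int * last )
{
    return static_cast< std::size_t >( last - first );
}

int main()
{
    const unsigned char utf8[] = { 0x48, 0xC3, 0xA9 };   // "H" + U+00E9
    const unsigned int utf32[] = { 0x48, 0xE9 };
    return ( count_code_points_utf8( utf8, utf8 + 3 ) ==
             count_code_points_utf32( utf32, utf32 + 2 ) ) ? 0 : 1;
}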
There is a significant advantage, however, in standardizing on a single POD-array representation as well, such that much of the library need not be templated (and thus implemented largely in header files and compiled on each use), and less efficient functions could also be provided which allow arbitrary iterators through runtime polymorphism. I think it will be particularly important to examine this trade-off, because Unicode support involves a large number of heavy-weight facilities.
Agreed. However, this conflicts with allowing the user to specify the container used for string storage. Perhaps there could be a templated version for users who want a custom storage policy, such as a rope, and a fixed representation (UTF-16?) for those who are not bothered about how the Unicode string is stored. The interfaces of these should be the same, so that the higher-level facilities can interoperate with both representations.
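A rough sketch of the non-templated route described above (all names here are mine): heavy algorithms can be compiled once against a small runtime-polymorphic code-point source, with a faster array-based overload reserved for the common case.

#include <cstddef>

// Abstract code-point source: any storage or encoding can sit behind it.
class code_point_source
{
public:
    virtual ~code_point_source() {}
    virtual bool next( unsigned int & cp ) = 0;   // false at end of sequence
};

// Adapter over a contiguous UTF-32 buffer.
class utf32_array_source: public code_point_source
{
public:
    utf32_array_source( const unsigned int * first, const unsigned int * last )
        : cur_( first ), last_( last ) {}

    virtual bool next( unsigned int & cp )
    {
        if ( cur_ == last_ )
            return false;
        cp = *cur_++;
        return true;
    }

private:
    const unsigned int * cur_;
    const unsigned int * last_;
};

// A non-templated library function: compiled once, usable with any source.
bool contains( code_point_source & src, unsigned int what )
{
    unsigned int cp;
    while ( src.next( cp ) )
        if ( cp == what )
            return true;
    return false;
}

int main()
{
    const unsigned int text[] = { 'H', 'e', 'l', 'l', 'o' };
    utf32_array_source src( text, text + 5 );
    return contains( src, 'e' ) ? 0 : 1;
}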
[3] Algorithms, Locales, etc.
These are built upon the UTF-32 view of the Unicode string, like the string algorithms in the Boost library. Therefore, instead of str.find( unicode_string( "World" )), you would have find( str, unicode_string( "World" )).
Well, except that you would want the strength, etc. to be adjustable, and of course localized, and string literals pose additional problems...
The logic behind this was for unicode_string to deal with navigating through, and mapping to, the internal representation. The find functions, etc. could then be implemented as templates that iterate over the UTF-32 iterators, e.g. like the string algorithms.
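A sketch of that free-function style (the signature is mine, and locale-aware collation is deliberately ignored): find() written purely against UTF-32 iterators, in the spirit of the Boost string algorithms.

#include <algorithm>

// Generic find over code-point iterators; std::search does the actual work.
template< class Utf32Iterator1, class Utf32Iterator2 >
Utf32Iterator1 find( Utf32Iterator1 first, Utf32Iterator1 last,
                     Utf32Iterator2 pat_first, Utf32Iterator2 pat_last )
{
    return std::search( first, last, pat_first, pat_last );
}

int main()
{
    const unsigned int hay[]    = { 'H', 'e', 'l', 'l', 'o', ' ',
                                    'W', 'o', 'r', 'l', 'd' };
    const unsigned int needle[] = { 'W', 'o', 'r', 'l', 'd' };

    const unsigned int * pos = find( hay, hay + 11, needle, needle + 5 );
    return ( pos == hay + 6 ) ? 0 : 1;   // found at code-point offset 6
}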
I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc.
These code point groups are referred to as grapheme clusters, and I certainly agree that it is necessary to provide an iterator interface to grapheme clusters. I would not suggest, however, that normalization be integrated into that interface, because only a small portion of the possible grapheme clusters can be normalized into a single code point, and I don't think it is a particularly common operation to do so, especially for only a single grapheme cluster.
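A very rough sketch of grouping code points into grapheme-cluster ranges as iterator pairs: real boundaries follow Unicode's segmentation rules (UAX #29); here a "combining mark" is crudely approximated by the U+0300..U+036F block purely for illustration, and the names are mine.

#include <utility>
#include <vector>

typedef const unsigned int * cp_iterator;
typedef std::pair< cp_iterator, cp_iterator > cluster;   // [begin, end)

// Crude approximation: only the combining diacritical marks block.
inline bool is_combining_mark( unsigned int cp )
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Group a code-point range into clusters: a base code point plus any
// immediately following combining marks.
std::vector< cluster > clusters( cp_iterator first, cp_iterator last )
{
    std::vector< cluster > result;
    while ( first != last )
    {
        cp_iterator start = first++;
        while ( first != last && is_combining_mark( *first ) )
            ++first;                         // absorb trailing combining marks
        result.push_back( cluster( start, first ) );
    }
    return result;
}

int main()
{
    // 'e' + COMBINING ACUTE ACCENT, then 'x': two clusters, not three.
    const unsigned int text[] = { 0x65, 0x0301, 0x78 };
    return ( clusters( text, text + 3 ).size() == 2 ) ? 0 : 1;
}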
That makes sense. Going further into grapheme clusters would be too complicated for a generic unicode library, as you would need to then consider how to map the cluster into the appropriate font: this would be platform specific and far too complex (e.g. overlaying combining marks, etc.)!

Regards,
Reece

At 9:48 PM +0100 4/16/04, Reece Dunn wrote:
Jeremy Maitin-Shepard wrote:
"Reece Dunn" writes:
[ big snip ]
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage, we should not be concerned with combining characters and marks, only with complete characters.
The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, their implementation will depend on how the string is represented):
* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.
This seems reasonable, although in practice the UTF-32/code-point iterator would be the most likely to be used.
Agreed, but the others would be useful: writing the string to a file, for example. This is why I suggest that the UTF-32 iterator is the default iterator (i.e. unicode_string::iterator is a UTF-32 iterator).
[ more snipped ]
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 are multi-character encodings of UTF-32 (not considering combining marks at this stage), whereas UTF-32 is a single character encoding.
I'm pretty sure that this is a bad assumption. You can't just ignore combining characters. I believe that Miro posted an example of how (even using UTF-32), you may not have a single character <<-->> single "entry" mapping.

-- Marshall Clow, Idio Software <mailto:marshall@idio.com>
"I want a machine that thinks I'm more important than it is, and acts like it." -- Eric Herrmann

"Reece Dunn" <msclrhd@hotmail.com> writes:
[snip]
One issue which would need to be dealt with is that while it seems necessary for some containers, such as a rope, to have direct access to the container, publicly inheriting this unicode_string from the container type means that the additions to the interface must be more limited.
I don't get this. Surely the roped_vector, or whatever rope-like container is used, will have an STL container interface like std::vector, so you could use them interchangeably. The unicode_string facilities would make use of the insert/append functions, iterators, etc. of the storage container to implement their specific facilities.
The issue is that the benefits of using a specialized data structure such as a rope are likely seen only by using the rope-specific interface; the container interface would probably not provide many advantages. Access to the underlying container could nonetheless be provided by a container() method.
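A sketch of that alternative (names mine): keep the storage private and hand it out through container(), so rope-specific operations stay reachable without widening unicode_string's own interface.

#include <vector>

template< class Container = std::vector< char > >
class unicode_string
{
public:
    // Expose the underlying storage on request instead of inheriting from it.
    Container &       container()       { return storage_; }
    const Container & container() const { return storage_; }

    // ... the Unicode-aware interface would be built on storage_ ...

private:
    Container storage_;
};

int main()
{
    unicode_string<> s;
    s.container().push_back( 'A' );   // drop down to the raw container
    return ( s.container().size() == 1 ) ? 0 : 1;
}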
[snip]
There is a significant advantage, however, in standardizing on a single POD-array representation as well, such that much of the library need not be templated (and thus implemented largely in header files and compiled on each use), and less efficient functions could also be provided which allow arbitrary iterators through runtime polymorphism. I think it will be particularly important to examine this trade-off, because Unicode support involves a large number of heavy-weight facilities.
Agreed. However, this conflicts with allowing the user to specify the container used for string storage.
Yes, I realize that. On the one hand, I really like making everything work with any encoding, any container, etc. On the other hand, I don't think it is feasible to stick everything in header files, although it may prove possible to make the access to the locale and Unicode data non-templated, and thus limit the code that must be in the header files.
Perhaps there could be a templated version for users who want a custom storage policy, such as a rope, and a fixed representation (UTF-16?) for those who are not bothered about how the Unicode string is stored. The interfaces of these should be the same, so that the higher-level facilities can interoperate with both representations.
It seems that this might introduce even more overhead. A less run-time efficient, but more code-size and compile-time efficient solution would be, as I described, to provide a UTF-16 array interface and a run-time polymorphic iterator interface, which would be used (automatically) for all non-UTF-16 array sources/iterators. In practice, then, it might be useful to limit unicode_string at least to containers which can provide an array of code units, such as vector or basic_string. (Unfortunately, the interface for getting the array of code units differs for vector and basic_string.) Alternatively, it might make sense to not allow the user to specify a storage container to unicode_string at all. Maybe you have some other ideas about this.

To get a sense of just how complex Unicode handling is, download the source to the ICU library: ftp://www-126.ibm.com/pub/icu/2.6.2/icu-2.6.2.zip (9.4 MB) or ftp://www-126.ibm.com/pub/icu/2.6.2/icu-2.6.2.tgz (8.3 MB). For instance, searching is implemented in usearch.cpp.
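For the parenthetical point about the differing interfaces, a small pair of overloads could paper over it; here wchar_t stands in for a UTF-16 code unit (true on Windows, an assumption elsewhere), and the function name is mine.

#include <string>
#include <vector>

// &v[0] yields the contiguous array for std::vector, while basic_string
// exposes it via data(); two overloads hide the difference.
inline const wchar_t * code_unit_array( const std::vector< wchar_t > & v )
{
    return v.empty() ? 0 : &v[ 0 ];
}

inline const wchar_t * code_unit_array( const std::wstring & s )
{
    return s.data();
}

int main()
{
    std::wstring s( L"Hi" );
    std::vector< wchar_t > v( s.begin(), s.end() );
    return ( code_unit_array( s )[ 0 ] == code_unit_array( v )[ 0 ] ) ? 0 : 1;
}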
[snip: searching]
[snip: grapheme cluster iterator notes]
That makes sense. Going further into grapheme clusters would be too complicated for a generic unicode library, as you would need to then consider how to map the cluster into the appropriate font: this would be platform specific and far too complex (e.g. overlaying combining marks, etc.)!
The ICU library provides some additional facilities which could be used by formatting engines, such as an implementation of the Unicode Bidirectional Algorithm. It might be best to avoid trying to add such facilities to Boost, however.

-- Jeremy Maitin-Shepard
participants (3)
- Jeremy Maitin-Shepard
- Marshall Clow
- Reece Dunn