Thoughts on Unicode and Strings

Here are my thoughts on Unicode strings, based partially on the current discussions of the topic. As I understand it, the problem with strings (standard character and Unicode strings) can be broken down into several stages:

[1] Storage And Representation

This is how the underlying string is stored (allocation and memory mapping policy) and how it is represented (which is governed by locale, but at this stage, it is best to know the type: UTF-8, UTF-16, UTF-32, etc.)

The storage can easily be represented as a container type, and so we have:

template< typename CharT,
          template< typename T, class A > class Container = std::vector,
          class AllocT = std::allocator< CharT > >
class string_storage: public Container< CharT, AllocT > { };

Here, I have chosen std::vector as the standard storage policy, as this reflects the current storage policies; thus, basic_string< CharT, Traits > would therefore be based on string_storage< CharT >. It would be easy, then, to select other representations like a reference counted storage (e.g. a shared-ownership wrapper around std::vector) and even an SGI-like rope! (Although this would mean that a new std::roped_vector class would need to be implemented: does such a thing already exist in Boost?)

[2] Basic Type And Iteration

The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage, combining characters and marks are not a concern, only complete characters.

The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, their implementation will depend on how the string is represented):

* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.

Therefore, no matter what the representation, it should be possible to use the UTF-32 iterator variant and "see" the string in native Unicode; this should, therefore, be the standard iterator and the others should be used when converting between formats.

NOTE: I am not well versed in how Unicode is represented, so I do not know how feasible it is to implement backwards traversal, but I do know that it would probably be wise to know the position of the last good end of a Unicode character (e.g. when dealing with multi-unit UTF-8 and UTF-16 sequences).

As a side note, it should be feasible to provide specialist wrappers around existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?), so I would suggest having something akin to char_traits in basic_string.

RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 encode a code point as a sequence of code units, whereas UTF-32 uses a single code unit per code point (not considering combining marks at this stage).

[3] Algorithms, Locales, etc.

These are built upon the UTF-32 view of the Unicode string, like the string algorithms in the Boost library. Therefore, instead of str.find( unicode_string( "World" )), you would have find( str, unicode_string( "World" )).

I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc. Thus, there would also be a function

unicode_string::utf32_t combine ( std::pair< unicode_string::iterator, unicode_string::iterator > & ucr )

that will map the range into a single code point. You could therefore have a combined_iterator that will provide access to the sequence of combined characters.

NOTE: If ucr.first == ucr.second, then combine( ucr ) = *( ucr.first ).

Regards,
Reece
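To make the proposed UTF-32 view concrete, here is a minimal sketch (in modern C++, not part of the proposal above) of reading code points out of UTF-8 storage one at a time. It assumes well-formed UTF-8; a real iterator would also validate continuation bytes and report errors, and the name next_code_point is purely illustrative.

// Decode the code point starting at 'it' and advance 'it' past it.
// Assumes well-formed UTF-8 input.
#include <cstdint>
#include <iostream>
#include <string>

std::uint32_t next_code_point(std::string::const_iterator& it)
{
    unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80)                                             // 1-byte sequence (ASCII)
        return lead;

    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;     // number of trailing bytes
    std::uint32_t cp = lead & (0x3F >> extra);                   // payload bits of the lead byte
    for (int i = 0; i < extra; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    return cp;
}

int main()
{
    std::string s = "A\xC3\xA9\xE2\x82\xAC";                     // "A", U+00E9, U+20AC in UTF-8
    for (auto it = s.cbegin(); it != s.cend(); )
        std::cout << "U+" << std::hex << next_code_point(it) << '\n';
}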

"Reece Dunn" <msclrhd@hotmail.com> writes:
[1] Storage And Representation
The storage can easily be represented as a container type, and so we have:
template< typename CharT,
          template< typename T, class A > class Container = std::vector,
          class AllocT = std::allocator< CharT > >
class string_storage: public Container< CharT, AllocT > { };
I am not sure this really gains us anything over just using the underlying container directly.
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage, combining characters and marks are not a concern, only complete characters.
Here is the issue. What constitutes a complete character? At the lowest level, a single codepoint is a character. At the next level, a collection of codepoints (base+combining marks) is a character (e.g. e + acute accent is a single character). Sometimes there are many equivalent sequences of codepoints that constitute the same character. Sometimes there may be a single codepoint that is equivalent to a set of codepoints (e.g. e + acute accent => e-acute). At another level, a set of codepoints represents a glyph. This glyph may cover one or more characters. There may be several alternative glyphs for a single set of codepoints.
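That distinction can be made concrete with the code point values involved: U+00E9 is the precomposed e-acute, while U+0065 followed by U+0301 is the decomposed, canonically equivalent spelling. A small sketch (using std::u32string purely for brevity) shows that a naive code-unit comparison treats them as different strings:

#include <iostream>
#include <string>

int main()
{
    std::u32string precomposed = { 0x00E9 };           // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    std::u32string decomposed  = { 0x0065, 0x0301 };   // U+0065 + U+0301 COMBINING ACUTE ACCENT

    // Both render as e-acute, but compare unequal without normalization.
    std::cout << std::boolalpha << (precomposed == decomposed) << '\n';   // prints false
}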
The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, their implementation will depend on how the string is represented):
* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.
I agree we need conversions to/from all 3 formats.
Therefore, no matter what the representation, it should be possible to use the UTF-32 iterator variant and "see" the string in native Unicode; this should, therefore, be the standard iterator and the others should be used when converting between formats.
That is my POV.
NOTE: I am not well versed in how Unicode is represented, so I do not know how feasible it is to implement backwards traversal, but I do know that it would probably be wise to know the position of the last good end of a Unicode character (e.g. when dealing with multi-unit UTF-8 and UTF-16 sequences).
Backwards traversal is generally possible, though with UTF-8 it is very slow, as you don't know how many bytes there are until the beginning of the character (though you know when you've got there).
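The usual trick is that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so stepping back to the previous code point means walking backwards until a byte that is not a continuation byte. A minimal sketch, assuming well-formed UTF-8 and that the iterator is not already at the beginning (prev_code_point is an illustrative name, not an agreed interface):

#include <iostream>
#include <string>

// Return an iterator to the lead byte of the code point before 'it'.
std::string::const_iterator prev_code_point(std::string::const_iterator it)
{
    do {
        --it;
    } while ((static_cast<unsigned char>(*it) & 0xC0) == 0x80);   // skip 10xxxxxx bytes
    return it;
}

int main()
{
    std::string s = "a\xC3\xA9";                        // "a" then U+00E9 in UTF-8
    auto it = prev_code_point(s.cend());                // steps back over 0xA9 to the 0xC3 lead byte
    std::cout << (it - s.cbegin()) << '\n';             // prints 1
}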
As a side note, it should be feasible to provide specialist wrappers around existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?),
Agreed.
so I would suggest having something akin to char_traits in basic_string.
I am not sure how that helps.
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 encode a code point as a sequence of code units, whereas UTF-32 uses a single code unit per code point (not considering combining marks at this stage).
Yes, that is why I believe we should use UTF-32 as the base (despite the performance considerations others have raised).
[3] Algorithms, Locales, etc.
These are built upon the UTF-32 view of the Unicode string, like the string algorithms in the Boost library. Therefore, instead of str.find( unicode_string( "World" )), you would have find( str, unicode_string( "World" )).
I am not sure how non-member vs member makes any difference.
I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc. Thus, there would also be a function
unicode_string::utf32_t combine ( std::pair< unicode_string::iterator, unicode_string::iterator > & ucr )
that will map the range into a single code point. You could therefore have a combined_iterator that will provide access to the sequence of combined characters.
You cannot always map the sequence of codepoints that make up a character into a single codepoint. However, it is agreed that it would be nice to have a means of dealing with "character" chunks, independently of the number of codepoints that make up that character.

Anthony
--
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.
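One possible shape for such a "character" chunk interface is a function that, given a position in a sequence of code points, returns the end of the chunk by attaching any combining marks that follow the base code point. The sketch below is a deliberate simplification: real boundaries are defined by the Unicode grapheme cluster rules and need character property data, and is_combining_mark covers only a couple of common ranges, purely for illustration.

#include <cstdint>
#include <vector>

bool is_combining_mark(std::uint32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F)     // Combining Diacritical Marks
        || (cp >= 0x20D0 && cp <= 0x20FF);    // Combining Diacritical Marks for Symbols
}

// Returns an iterator one past the last code point of the chunk starting at 'first'.
std::vector<std::uint32_t>::const_iterator
chunk_end(std::vector<std::uint32_t>::const_iterator first,
          std::vector<std::uint32_t>::const_iterator last)
{
    if (first == last)
        return last;
    ++first;                                   // consume the base code point
    while (first != last && is_combining_mark(*first))
        ++first;                               // attach trailing combining marks
    return first;
}

// e.g. over { 0x0065, 0x0301, 0x0062 }, chunk_end from the start stops before 0x0062.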

Anthony Williams wrote:
"Reece Dunn" <msclrhd@hotmail.com> writes:
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage, combining characters and marks are not a concern, only complete characters.
Here is the issue. What constitutes a complete character? At the lowest level, a single codepoint is a character. At the next level, a collection of codepoints (base+combining marks) is a character (e.g. e + acute accent is a single character). Sometimes there are many equivalent sequences of codepoints that constitute the same character. Sometimes there may be a single codepoint that is equivalent to a set of codepoints (e.g. e + acute accent => e-acute).
At another level, a set of codepoints represents a glyph. This glyph may cover one or more characters. There may be several alternative glyphs for a single set of codepoints.
Yes, but there may be more glyphs for one codepoint as well. If your definition of glyph is the same as mine (to me it has to do with graphics rather than meaning), glyphs have nothing to do with Unicode text handling, but rather with font drawing (AFAIK ICU deals with both). [...]
I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc. Thus, there would also be a function
unicode_string::utf32_t combine ( std::pair< unicode_string::iterator, unicode_string::iterator > & ucr )
that will map the range into a single code point. You could therefore have a combined_iterator that will provide access to the sequence of combined characters.
You cannot always map the sequence of codepoints that make up a character into a single codepoint.
However, it is agreed that it would be nice to have a means of dealing with "character" chunks, independently of the number of codepoints that make up that character.
Yes. It seems to me that the discussion so far is about storage, rather than use of Unicode strings. One character may be defined by more than one codepoint, and the different ways to define one character are semantically equivalent (canonically equivalent, see the Unicode standard, 3.7). So U+00E0 ("a with grave") is equivalent to U+0061 U+0300 ("a" "combining grave"). I think characters in this sense should be at the heart of a usable Unicode string.

I would propose a class unicode_char, containing one or more codepoints (e.g., in a vector <utf32_t>). operator== (unicode_char, unicode_char) should return true for equivalent sequences. A Unicode string would be a basic_string-like container of unicode_char's. The find_first_of and such functions would then have the expected behaviour.

The implementation should probably be more optimised than requiring an allocation for every character, but IMO a good Unicode library should *transparently* deal with such things as canonical equivalence for all operations, like searching, deleting characters, etcetera. unicode_string should be as easy to use as basic_string.

Regards,
Rogier
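A minimal sketch of that unicode_char idea, where operator== compares canonical decompositions. Real canonical equivalence needs the full Unicode decomposition data plus canonical reordering; the tiny table below covers only the a-grave and e-acute examples mentioned in this thread and merely stands in for that data.

#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <vector>

using utf32_t = std::uint32_t;

// Stand-in for the real canonical decomposition tables.
static std::vector<utf32_t> canonically_decompose(const std::vector<utf32_t>& in)
{
    std::vector<utf32_t> out;
    for (utf32_t cp : in) {
        switch (cp) {
            case 0x00E0: out.insert(out.end(), { 0x0061, 0x0300 }); break;   // a-grave
            case 0x00E9: out.insert(out.end(), { 0x0065, 0x0301 }); break;   // e-acute
            default:     out.push_back(cp); break;
        }
    }
    return out;                                // (canonical reordering omitted)
}

class unicode_char
{
public:
    unicode_char(std::initializer_list<utf32_t> cps) : codepoints_(cps) {}

    friend bool operator==(const unicode_char& a, const unicode_char& b)
    {
        return canonically_decompose(a.codepoints_) == canonically_decompose(b.codepoints_);
    }

private:
    std::vector<utf32_t> codepoints_;          // one or more code points per character
};

int main()
{
    assert((unicode_char{ 0x00E0 } == unicode_char{ 0x0061, 0x0300 }));   // canonically equivalent
}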

Rogier van Dalen <R.C.van.Dalen@umail.leidenuniv.nl> writes:
Anthony Williams wrote:
At another level, a set of codepoints represents a glyph. This glyph may cover one or more characters. There may be several alternative glyphs for a single set of codepoints.
Yes, but there may be more glyphs for one codepoint as well. If your definition of glyph is the same as mine (to me it has to do with graphics rather than meaning), glyphs have nothing to do with Unicode text handling, but rather with font drawing (AFAIK ICU deals with both).
Yes, but I was referring to more than just font differences. IIRC, there are examples in Arabic, where there are alternative representations of whole words, so the rendering engine does more than just translate characters to images, it may rearrange the characters, or treat groups of characters as a single item. But yes, in general it is beyond simple text handling, which is partly what I meant by "At another level".
[...]
I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc. Thus, there would also be a function
unicode_string::utf32_t combine ( std::pair< unicode_string::iterator, unicode_string::iterator > & ucr )
that will map the range into a single code point. You could therefore have a combined_iterator that will provide access to the sequence of combined characters. You cannot always map the sequence of codepoints that make up a character into a single codepoint. However, it is agreed that it would be nice to have a means of dealing with "character" chunks, independently of the number of codepoints that make up that character.
Yes. It seems to me that the discussion so far is about storage, rather than use of Unicode strings. One character may be defined by more than one codepoint, and the different ways to define one character are semantically equivalent (canonically equivalent, see the Unicode standard, 3.7). So U+00E0 ("a with grave") is equivalent to U+0061 U+0300 ("a" "combining grave"). I think characters in this sense should be at the heart of a usable Unicode string.
I would propose a class unicode_char, containing one or more codepoints (e.g., in a vector <utf32_t>). operator== (unicode_char, unicode_char) should return true for equivalent sequences. A Unicode string would be a basic_string-like container of unicode_char's. The find_first_of and such functions would then have the expected behaviour.
I think that actually storing character strings like that would be too slow, but you would certainly want an interface that dealt with such constructs, which is what I meant by '"character" chunks'.
The implementation should probably be more optimised than requiring an allocation for every character, but IMO a good Unicode library should *transparently* deal with such things as canonical equivalence for all operations, like searching, deleting characters, etcetera. unicode_string should be as easy to use as basic_string.
Yes, ideally.

Anthony
--
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.

Anthony Williams wrote: [..]
Yes, but I was referring to more than just font differences. IIRC, there are examples in Arabic, where there are alternative representations of whole words, so the rendering engine does more than just translate characters to images, it may rearrange the characters, or treat groups of characters as a single item.
But yes, in general it is beyond simple text handling, which is partly what I meant by "At another level".
Agreed. AFAIU, rearranging is done for text rendering only, which means it would not at all be relevant for text handling, just like right-to-left issues in mixed Latin/Arabic text, even for complex text handling. Please correct me if I'm wrong. [...]
I would propose a class unicode_char, containing one or more codepoints (e.g., in a vector <utf32_t>). operator== (unicode_char, unicode_char) should return true for equivalent sequences. A Unicode string would be a basic_string-like container of unicode_char's. The find_first_of and such functions would then have the expected behaviour.
I think that actually storing character strings like that would be too slow, but you would certainly want an interface that dealt with such constructs, which is what I meant by '"character" chunks'.
The implementation should probably be more optimised than requiring an allocation for every character, but IMO a good Unicode library should *transparently* deal with such things as canonical equivalence for all operations, like searching, deleting characters, etcetera. unicode_string should be as easy to use as basic_string.
Yes, ideally.
I would like to make my point slightly clearer than I did before. I don't think it would do for a Unicode string library to concentrate on code points. Yes, the raw Unicode data should be available somewhere, so it can be written to file or sent to the OS's display routines. However, IMO it should use characters as its *only* interface for manipulation. The library should discourage using codepoints directly, because it will lead to all kinds of errors that do not often appear in English text manipulation but will for other languages. Think of such simple examples as the equivalence of rôle and rôle (one with a precomposed ô, the other with a combining circumflex) in different normalisations.

Regards,
Rogier

Rogier van Dalen <R.C.van.Dalen@umail.leidenuniv.nl> writes:
[snip]
I would like to make my point slightly clearer than I did before. I don't think it would do for a Unicode string library to concentrate on code points. Yes, the raw Unicode data should be available somewhere, so it can be written to file or sent to the OS's display routines. However, IMO it should use characters as its *only* interface for manipulation.
In practice, that isn't very useful. The most common operations would probably be collation and substring matching (as well as regular expression handling, perhaps). None of these operations are defined directly in terms of grapheme clusters, for a variety of reasons. What it comes down to is that the string will be the most basic unit for most operations.
[snip]
-- Jeremy Maitin-Shepard

In article <BAY7-F139vuK0SqXY4300057386@hotmail.com>, "Reece Dunn" <msclrhd@hotmail.com> wrote:
unicode_string::utf32_t combine ( std::pair< unicode_string::iterator, unicode_string::iterator > & ucr )
that will map the range into a single code point. You could therefore have a combined_iterator that will provide access to the sequence of combined characters.
It doesn't work that way; not every sequence of composing characters can be mapped to a single code point. (If I understand what you meant by that.)

meeroh
--
If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

"Reece Dunn" <msclrhd@hotmail.com> writes:
Here are my thoughts on Unicode strings, based partially on the current discussions of the topic. As I understand it, the problem with strings (standard character and Unicode strings) can be broken down into several stages:
[1] Storage And Representation
This is how the underlying string is stored (allocation and memory mapping policy) and how it is represented (which is governed by locale, but at this stage, it is best to know the type: UTF-8, UTF-16, UTF-32, etc.)
I am not sure what you mean by "governed by locale." I believe the intention is that platform-specific encodings be abandoned, and that within this library, the internal representation of strings and characters be one of the Unicode encodings.
The storage can easily be represented as a container type, and so we have:
template< typename CharT,
          template< typename T, class A > class Container = std::vector,
          class AllocT = std::allocator< CharT > >
class string_storage: public Container< CharT, AllocT > { };
Here, I have chosen std::vector as the standard storage policy, as this reflects the current storage policies; thus, basic_string< CharT, Traits > would therefore be based on string_storage< CharT >.
I do like this idea, although I think something like this might be better:

template <encoding enc, class Container = ...> class unicode_string;

"encoding" would be an enumeration type, such that enc would specify one of UTF-8, UTF-16, or UTF-32. This is, I would say, a more explicit way to specify the encoding than relying on the size of the type specified, and it also avoids problems in cases where the platform does not have distinct types for UTF-16 and UTF-32 code units (unlikely, I admit).

The purpose of unicode_string would be to wrap an existing container with an encoding specifier and a few Unicode-specific operations, such as appending a Unicode code point, or inserting from another unicode string, and possibly adding some iterator typedefs.

One issue which would need to be dealt with: while it seems necessary for some containers, such as a rope, to have direct access to the container, publicly inheriting unicode_string from the container type means that the additions to the interface must be more limited.
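As a rough sketch of that shape (assuming the encoding enumeration and showing only one Unicode-specific operation, appending a code point; the UTF-16 case is omitted and the member names are only illustrative):

#include <cstdint>
#include <vector>

enum class encoding { utf8, utf16, utf32 };

template <encoding Enc, class Container>
class unicode_string
{
public:
    // Append a single Unicode code point, encoding it as required.
    void push_back(std::uint32_t cp)
    {
        if constexpr (Enc == encoding::utf32) {
            storage_.push_back(cp);
        } else if constexpr (Enc == encoding::utf8) {
            if (cp < 0x80) {
                storage_.push_back(static_cast<unsigned char>(cp));
            } else if (cp < 0x800) {
                storage_.push_back(static_cast<unsigned char>(0xC0 | (cp >> 6)));
                storage_.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
            } else if (cp < 0x10000) {
                storage_.push_back(static_cast<unsigned char>(0xE0 | (cp >> 12)));
                storage_.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
                storage_.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
            } else {
                storage_.push_back(static_cast<unsigned char>(0xF0 | (cp >> 18)));
                storage_.push_back(static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
                storage_.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
                storage_.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
            }
        }
        // (UTF-16 omitted: one code unit below 0x10000, a surrogate pair above.)
    }

    const Container& code_units() const { return storage_; }   // direct access to the raw storage

private:
    Container storage_;
};

int main()
{
    unicode_string<encoding::utf8, std::vector<unsigned char>> s;
    s.push_back(0x20AC);                        // U+20AC EURO SIGN encodes as E2 82 AC
    return s.code_units().size() == 3 ? 0 : 1;
}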
[snip: rope_vector in boost]
[2] Basic Type And Iteration
The basic representation is more complex, because now we are dealing with character boundaries (when dealing with UTF-8 and UTF-16 views of a Unicode string). At this stage, combining characters and marks are not a concern, only complete characters.
The Unicode string should provide at least 3 types of iterator, regardless of the internal representation (NOTE: as such, their implementation will depend on how the string is represented):
* UTF-8 -- provides access to the UTF-8 representation of the string;
* UTF-16 -- provides access to the UTF-16 representation of the string;
* UTF-32 -- provides access to the Unicode character type.
This seems reasonable, although in practice the UTF-32/code-point iterator would be the most likely to be used.
[snip]
As a side note, it should be feasible to provide specialist wrappers around existing Unicode libraries (like Win32 (CharNext, etc.), ICU and libiconv?), so I would suggest having something akin to char_traits in basic_string.
I would say there is not much need to provide "specialist wrappers" over other libraries. Presumably, a lot of a Boost Unicode library could use code from ICU, but there is no advantage in attempting to use platform-specific facilities, and doing so would surely introduce inefficiency and great complication.
RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16 encode a code point as a sequence of code units, whereas UTF-32 uses a single code unit per code point (not considering combining marks at this stage).
Yes, most processing would at the very least need to use this code-point iterator internally. There is a significant advantage, however, in standardizing on a single POD-array representation as well, so that much of the library need not be templated (and thus implemented largely in header files and recompiled on every use), and less efficient functions could also be provided which allow arbitrary iterators through runtime polymorphism. I think it will be particularly important to examine this trade-off, because Unicode support involves a large number of heavy-weight facilities.
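The split might look something like this: a non-template core compiled once over a flat array of code points, plus a type-erased source for callers holding arbitrary iterators. The algorithm here is a deliberately trivial placeholder and every name is illustrative.

#include <cstddef>
#include <cstdint>
#include <vector>

// Non-template core: can live in a .cpp file and be compiled once.
std::size_t count_code_points_above(const std::uint32_t* first, std::size_t n,
                                    std::uint32_t threshold)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (first[i] > threshold)
            ++count;
    return count;
}

// Type-erased source of code points; a small adapter over any iterator pair
// would implement next().
struct code_point_source
{
    virtual bool next(std::uint32_t& cp) = 0;   // false at the end of the sequence
    virtual ~code_point_source() = default;
};

// Less efficient path: buffer through the abstract interface, then call the core.
std::size_t count_code_points_above(code_point_source& src, std::uint32_t threshold)
{
    std::vector<std::uint32_t> buffer;
    for (std::uint32_t cp; src.next(cp); )
        buffer.push_back(cp);
    return count_code_points_above(buffer.data(), buffer.size(), threshold);
}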
[3] Algorithms, Locales, etc.
These are built upon the UTF-32 view of the Unicode string, like the string algorithms in the Boost library. Therefore, instead of str.find( unicode_string( "World" )), you would have find( str, unicode_string( "World" )).
Well, except that you would want the strength, etc. to be adjustable, and of course localized, and string literals pose additional problems...
I would also suggest that there be another iterator that operates on std::pair< unicode_string::iterator, unicode_string::iterator > to group combining marks, etc.
These code point groups are referred to as grapheme clusters, and I certainly agree that it is necessary to provide an iterator interface to grapheme clusters. I would not suggest, however, that normalization be integrated into that interface, because only a small portion of the possible grapheme clusters can be normalized into a single code point, and I don't think it is a particularly common operation to do so, especially for only a single grapheme cluster.

--
Jeremy Maitin-Shepard
participants (5)
- Anthony Williams
- Jeremy Maitin-Shepard
- Miro Jurisic
- Reece Dunn
- Rogier van Dalen