Re: [boost] Re: Boost Unicode support ideas

13 Apr 2004

Miro Jurisic <macdev@meeroh.org> writes:
[snip]
> You are forgetting that abstract Unicode characters are defined as
> sequences of code points (even if those code points are 32-bit) and
> string manipulation has to take this into account (there are numerous
> combinations of characters and combining marks that must be treated as
> single units for the purpose of searching, collation, etc.). A single
> encoded character type may be 32 bits, but encoded characters are often
> not the level on which the clients need to manipulate strings.

Right, it will certainly be necessary to provide a
grapheme_cluster_iterator (with value_type = the Unicode string
type).  ICU should help with this.  Nonetheless, it is useful to
represent a single code point, for several reasons:

 - For the purpose of string construction, the Unicode specification
   explicitly states that any sequence of code points is well formed
   (strictly, any sequence of Unicode scalar values, i.e. code points
   other than surrogates), so the code point is the smallest unit
   from which guaranteed-well-formed strings can be built.

 - It would be useful to provide functions for querying the Unicode
   properties of individual code points, and this code_point type
   would be the only suitable parameter type.
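
To make the two points above concrete, here is a rough sketch of what
such a type might look like.  Every name in it (code_point,
is_scalar_value, is_combining_mark, append_utf8, count_clusters) is
hypothetical and not an existing Boost or ICU interface.  The property
query is a toy that only knows the Combining Diacritical Marks block
(U+0300..U+036F), and the cluster count is a gross simplification of
real grapheme cluster segmentation; a real implementation would
consult the Unicode Character Database, probably via ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical code point type: a thin wrapper over a 32-bit value.
struct code_point {
    std::uint32_t value;
};

// True for values that may appear in well-formed UTF-8/16/32 text:
// U+0000..U+10FFFF, excluding the surrogate range U+D800..U+DFFF.
inline bool is_scalar_value(code_point cp) {
    return cp.value <= 0x10FFFF &&
           !(cp.value >= 0xD800 && cp.value <= 0xDFFF);
}

// Toy property query (assumption: only U+0300..U+036F is recognized).
inline bool is_combining_mark(code_point cp) {
    return cp.value >= 0x0300 && cp.value <= 0x036F;
}

// Append one code point to a UTF-8 string. Appending a scalar value to
// a well-formed string always leaves the string well formed.
inline void append_utf8(std::string& out, code_point cp) {
    assert(is_scalar_value(cp));
    const std::uint32_t v = cp.value;
    if (v < 0x80) {
        out += static_cast<char>(v);
    } else if (v < 0x800) {
        out += static_cast<char>(0xC0 | (v >> 6));
        out += static_cast<char>(0x80 | (v & 0x3F));
    } else if (v < 0x10000) {
        out += static_cast<char>(0xE0 | (v >> 12));
        out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (v & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (v >> 18));
        out += static_cast<char>(0x80 | ((v >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (v & 0x3F));
    }
}

// Simplified cluster count: the first code point and every later
// non-combining code point start a new cluster; combining marks attach
// to the cluster before them.
inline std::size_t count_clusters(const std::vector<code_point>& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i == 0 || !is_combining_mark(s[i])) ++n;
    }
    return n;
}
```

With this sketch, the sequence U+0065 U+0301 U+0078 ("e", combining
acute accent, "x") is three code points but only two clusters, which
is the distinction Miro is pointing at: construction and property
queries work per code point, while searching and collation need the
cluster view.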

I do agree, however, that for almost any output formatting, the
locale-specific or user-specified fill text/symbols should be passed
as strings rather than as individual characters.
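
As a small illustration of why a string is the safer parameter type
here, consider a padding helper.  The name pad_left and its signature
are made up for this example, and it assumes UTF-8 text held in a
std::string.  It takes the number of fill repetitions rather than a
target width, which sidesteps the separate question of how to measure
display width.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: prepend `count` copies of a fill *string* to
// `text`. The fill may be several code units, or several code points
// (e.g. a base character plus a combining mark), so a single char
// parameter could not represent it.
inline std::string pad_left(const std::string& text,
                            std::size_t count,
                            const std::string& fill) {
    assert(!fill.empty());
    std::string out;
    out.reserve(text.size() + count * fill.size());
    for (std::size_t i = 0; i < count; ++i) {
        out += fill;
    }
    out += text;
    return out;
}
```

A fill of U+00B7 MIDDLE DOT, for instance, is the two bytes 0xC2 0xB7
in UTF-8, so pad_left("42", 2, "\xC2\xB7") yields a six-byte string; a
std::setfill-style interface that accepts one char cannot express
that.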

-- 
Jeremy Maitin-Shepard