
Miro Jurisic <macdev@meeroh.org> writes:
[snip]
> You are forgetting that abstract Unicode characters are defined as sequences of code points (even if those code points are 32-bit), and string manipulation has to take this into account: there are numerous combinations of characters and combining marks that must be treated as single units for the purposes of searching, collation, etc. A single encoded character type may be 32 bits, but encoded characters are often not the level at which clients need to manipulate strings.
Right, it will certainly be necessary to provide a grapheme_cluster_iterator (with value_type = the Unicode string type). ICU should help with this.

Nonetheless, it is useful to represent a single code point, for several reasons:

- For the purpose of string construction, the Unicode specification explicitly states that any sequence of code points is well formed, and so this provides the smallest unit by which guaranteed-well-formed strings can be formed.

- It would be useful to provide functions for querying the Unicode properties of individual code points, and this code_point type would be the only suitable parameter type.

I do agree, however, that for almost any output formatting, the locale-specific or user-specified fill text/symbols should be specified as strings, rather than as individual characters.

--
Jeremy Maitin-Shepard
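
To make the distinction between code points and larger units concrete, here is a minimal C++ sketch. Everything in it is an assumption for illustration only: the names code_point, code_point_string, is_combining_mark, and count_simple_clusters are hypothetical and not part of any proposed interface, and the "cluster" rule is a deliberate simplification that recognizes only the Combining Diacritical Marks block (U+0300..U+036F). Real grapheme cluster segmentation follows the rules in Unicode Standard Annex #29 and would normally be delegated to a library such as ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical code point type: a thin wrapper around a 32-bit value.
struct code_point {
    std::uint32_t value;
};

// Hypothetical string type for this sketch: simply a sequence of code points.
using code_point_string = std::vector<code_point>;

// Simplified per-code-point property query. It returns true only for the
// Combining Diacritical Marks block (U+0300..U+036F); a real implementation
// would consult the Unicode Character Database instead.
inline bool is_combining_mark(code_point cp) {
    return cp.value >= 0x0300 && cp.value <= 0x036F;
}

// Counts units under the simplified rule "one base code point followed by
// any number of combining marks". This is NOT full grapheme cluster
// segmentation; it only illustrates that units can span several code points.
inline std::size_t count_simple_clusters(const code_point_string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i == 0 || !is_combining_mark(s[i])) {
            ++n;
        }
    }
    return n;
}

// Usage: 'e' + U+0301 (combining acute accent) + 'x' is three code points
// but only two units under the simplified rule above.
inline void example() {
    code_point_string s{code_point{0x0065}, code_point{0x0301}, code_point{0x0078}};
    assert(s.size() == 3);
    assert(count_simple_clusters(s) == 2);
}
```

The sketch shows both sides of the discussion: the string is assembled one code point at a time and a property query takes a single code_point as its parameter, while searching or collation would need to operate on the larger combined units.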