Re: [boost] Strings tagged with their character set

27 Sep 2007

"James Porter" <porterj@alum.rit.edu> writes:

> I see what you mean. Still, fixed-width-encoded strings are a lot easier to
> code, and I think we should focus on them first just to get something
> working and to have a platform to test code conversion on, which in my
> opinion is the most important part.

As others have said, I think that in practice a fixed-width encoding
gains you very little, if anything.  Needing random access to code
points is, I think, extremely rare.  Replacing one code point with
another code point is likewise rare; in general you would replace one
substring (perhaps a grapheme cluster) with another substring (which
may also be a grapheme cluster).
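
To illustrate, here is a minimal sketch of substring replacement done
directly on UTF-8 code units, with no random access to code points.  The
name replace_all is made up for this example and is not part of any
proposed interface; it assumes both the text and the search string are
well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Replace every occurrence of `from` with `to` in a UTF-8 encoded string.
// The search runs over bytes. Because UTF-8 lead bytes and continuation
// bytes never overlap, a well-formed `from` can only match at code point
// boundaries of a well-formed `text`.
std::string replace_all(std::string text,
                        const std::string& from,
                        const std::string& to) {
    if (from.empty()) {
        return text;  // nothing to search for
    }
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}
```

For example, this can replace "e" followed by U+0301 COMBINING ACUTE
ACCENT (three bytes) with the precomposed U+00E9 (two bytes).  Note that
it only guarantees code point boundaries; matching on grapheme cluster
boundaries would need segmentation logic on top.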

[snip]
> That said, I think a good (general) roadmap for this project would be:
> 1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though
> string constants may pose a problem)

UCS-2 is bogus and should not be used at all, since it cannot represent
code points outside the Basic Multilingual Plane.  Conceivably UCS-4 is
legitimate, but in practice it is unlikely to be used by anyone.  Still,
it is probably important to support it.  The primary encodings of
Unicode to be supported should be UTF-8 and UTF-16.
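
As a rough sketch of why a fixed 16-bit unit is not enough, this is how
a single code point maps to UTF-16.  The function name to_utf16 is
hypothetical, the std::u16string and char32_t types assume a C++11
compiler, and the input is assumed to be a valid Unicode scalar value
(not a surrogate, and no greater than U+10FFFF).

```cpp
#include <string>

// Encode one Unicode scalar value as UTF-16 code units.
// Assumption: `cp` is a valid scalar value; no validation is done here.
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Inside the Basic Multilingual Plane: a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

U+00E9 comes out as one unit, while U+1D11E comes out as the pair
0xD834 0xDD1E, which a strict UCS-2 string has no way to store.
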
> 2) Add code conversion to move between encodings, especially for I/O
> 3) Create VWE string class (fairly easy if immutable, hard if mutable)

I don't think the issues of a mutable UTF-8/UTF-16 representation are
very different from the issues of a mutable UTF-32 representation.  In
practice, in handling non-ASCII text, all searching and replacement will
be in terms of substrings (likely single grapheme clusters or sequences
of them).
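
A small sketch of the same point for UTF-32: even with one code unit per
code point, replacing a grapheme cluster can change the length of the
string, so a mutable UTF-32 string has to shift its contents just as a
UTF-8 one does.  The helper replace_first is made up for this example,
and std::u32string assumes a C++11 compiler.

```cpp
#include <cstddef>
#include <string>

// Replace the first occurrence of `from` in a UTF-32 string with `to`.
// Returns false if `from` does not occur in `text`.
bool replace_first(std::u32string& text,
                   const std::u32string& from,
                   const std::u32string& to) {
    const std::size_t pos = text.find(from);
    if (pos == std::u32string::npos) {
        return false;
    }
    // `from` and `to` may differ in length, so this may move the tail
    // of the string even though every code point is fixed width.
    text.replace(pos, from.size(), to);
    return true;
}
```

Replacing "e" plus U+0301 (two code points) with U+00E9 (one code point)
in a five-code-point string leaves four code points.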

-- 
Jeremy Maitin-Shepard