Re: [boost] Strings tagged with their character set

27 Sep 2007

On 9/27/07, Jeremy Maitin-Shepard <jbms@cmu.edu> wrote:
> I think as others have said, in practice a fixed-width encoding really
> gains you very little or nothing at all.  Needing random access to code
> points is, I think, an extremely rare operation.

I know, but it'd be easy to put together a fixed-width encoded basic_string,
and we could use that as a basis for building a code conversion framework,
at least as a proof-of-concept. Of course, that assumes that we'd be using
basic_string for fixed-width strings, which isn't necessarily the case.
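
To make the proof-of-concept idea concrete, here is a rough sketch of what a fixed-width string built on basic_string might look like. It assumes a C++11 compiler, where std::u32string is std::basic_string<char32_t>; the helper name utf8_to_utf32 is hypothetical and not part of any existing library. The decoder checks only lead and continuation bytes; it does not reject overlong forms or surrogate code points, so treat it as an illustration rather than a validating converter.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into a fixed-width
// UTF-32 string, one char32_t per code point.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte determines the sequence length and the payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    assert(i == in.size());  // every input byte was consumed
    return out;
}
```

With the text in this form, operator[] indexes code points directly, which is the one thing a fixed-width encoding buys; it still does not index grapheme clusters.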

> UCS-2 is bogus and should not be used at all.  Conceivably UCS-4 is
> legitimate but in practice not likely to be used by anyone.  Still, it
> is probably important to support it.

Are there any situations where UCS-2 is actually needed (deprecated
libraries, for instance)? If not, then I agree that we can eliminate it.
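
For context on why UCS-2 is a problem: it has exactly one 16-bit unit per code point, so it cannot represent anything above U+FFFF, whereas UTF-16 covers those code points with surrogate pairs. A small sketch of the UTF-16 encoding step, assuming C++11 (the function name encode_utf16 is made up for illustration):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// Precondition: cp is a scalar value (at most 0x10FFFF, not a surrogate).
std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one unit, identical in UCS-2 and UTF-16.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: a surrogate pair, which UCS-2 cannot express.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

Any code point that takes the second branch is one that a UCS-2-only interface would have to drop or mangle.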

> I don't think the issues of a mutable UTF-8/UTF-16 representation are
> very different from the issues of a mutable UTF-32 representation.  In
> practice, in handling non-ASCII text, all searching and replacement will
> be in terms of substrings (likely single or sequences of grapheme
> clusters).

I suppose it depends on how we allow UTF-8/UTF-16 strings to be modified.
Direct (mutable) character access through operator [] would be bad, but
substrings would be better. Depending on the situation, it may be better to
use a stringstream to compose a new string from the old. I'd have to think
about it some more.
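
As a starting point for that, here is a sketch of the stringstream approach: rather than mutating a UTF-8 string in place, search for a substring and compose a new string from the pieces. The name replace_all is hypothetical. It assumes both the text and the search string are valid, complete UTF-8 sequences; under that assumption a byte-wise find cannot match in the middle of a character, because no valid UTF-8 sequence begins with a continuation byte.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: return a copy of `text` with every occurrence of
// `from` replaced by `to`. All three are assumed to hold valid UTF-8.
std::string replace_all(const std::string& text,
                        const std::string& from,
                        const std::string& to) {
    assert(!from.empty());  // an empty pattern would never advance
    std::ostringstream out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t hit = text.find(from, pos);
        if (hit == std::string::npos) break;
        // Copy the untouched bytes before the match, then the replacement.
        out << text.substr(pos, hit - pos) << to;
        pos = hit + from.size();
    }
    out << text.substr(pos);  // remainder after the last match
    return out.str();
}
```

Because the result is built fresh, the replacement may be longer or shorter in bytes than the text it replaces, which is exactly the case that makes a mutable operator[] awkward for variable-width encodings.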

- James

James Porter