Re: [boost] Strings tagged with their character set

27 Sep 2007

      On 9/27/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
...
Just nit-picking here: it converts to wchar_t, which may or may not be
UTF-16. On Win32 platforms, it is, but on Linux, for example, it's UTF-32.
Yeah, I realized that after I clicked "send". I guess I should eat breakfast
before sending email. :)
...
True. I think the strings should be immutable. I think experience with
Java and C# compared to C++ shows that an immutable string class is
superior in most use cases.
There should be some means to (possibly indirectly) modify a
variable-width-encoded string, though it doesn't necessarily have to be
through the class itself. A stringstream may be more appropriate.
...
...
That said, I think a good (general) roadmap for this project would be:
...
1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though
string constants may pose a problem)
Doesn't basic_string<wchar_t> do just that already?
It doesn't do it in a portable manner. On Windows, basic_string<wchar_t> is,
ostensibly, UTF-16, but on Linux, it's UTF-32. There should be a portable
solution that guarantees a particular fixed-width encoding. I'd argue that
basic_string<wchar_t> isn't exactly Unicode at all, though I'm being
nit-picky. char_traits<wchar_t>::state_type is mbstate_t, which is the state
type used by codecvt to convert a narrow (ASCII) stream into a wide stream.
In short, the stream (and ultimately the string) isn't Unicode; it's just
ASCII stored with 2 (or 4) bytes per character. This goes back to the
problems with using wfstream.
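
To make the portability point concrete, here is a small sketch. The helper
names wide_code_unit_bits and clef_code_units are made up for illustration and
are not part of any library. It only checks what the platform reports, so the
results will differ between, say, Win32 and Linux.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>
#include <string>
#include <type_traits>

// The state type tells us nothing about the encoding of the wide characters:
// it is mbstate_t regardless of how wide wchar_t happens to be.
static_assert(std::is_same<std::char_traits<wchar_t>::state_type,
                           std::mbstate_t>::value,
              "char_traits<wchar_t>::state_type should be mbstate_t");

// Hypothetical helper: width of one wchar_t code unit in bits
// (typically 16 on Win32 and 32 on Linux).
constexpr std::size_t wide_code_unit_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: how many wchar_t code units the compiler uses for a
// single character outside the Basic Multilingual Plane (U+1D11E).
// With a 16-bit wchar_t this is usually a surrogate pair; with a 32-bit
// wchar_t it is a single unit.
inline std::size_t clef_code_units() {
    const std::wstring clef = L"\U0001D11E";
    return clef.size();
}
```

The same source line therefore yields strings of different lengths on
different platforms, which is the sense in which basic_string<wchar_t> does
not guarantee a particular fixed-width encoding.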

I think, to have a truly distinct basic_string specialization, we'd need
portable 16- and 32-bit char types, and a way to unambiguously specify its
encoding. My hope is that we can use char_traits<...>::state_type as a way
to make code conversion simpler. Ideally, I'd like something that examines
the state_type of the source and the target, and builds a converter based on
those two pieces of information. It would be great if I could say something
like:

  ofstream<utf8> file("out.txt");
  file << ucs4string << utf16string << jisstring << asciistring << endl;

and have it work automatically.
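
A minimal sketch of how a converter could be picked from the source and target
types. This uses plain encoding tag types rather than
char_traits<...>::state_type, and every name in it (utf8_tag, utf32_tag,
tagged_string, converter, convert) is hypothetical. Only one conversion,
UTF-32 to UTF-8, is filled in.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical encoding tags; each names its code unit type.
struct utf8_tag  { using code_unit = char; };
struct utf32_tag { using code_unit = char32_t; };

// A string whose type records its encoding.
template <class Encoding>
struct tagged_string {
    std::basic_string<typename Encoding::code_unit> units;
};

// Primary template is left undefined, so an unsupported
// source/target pair fails at compile time.
template <class From, class To>
struct converter;

// Same encoding on both sides: nothing to convert, just copy.
template <class E>
struct converter<E, E> {
    static tagged_string<E> apply(const tagged_string<E>& s) { return s; }
};

// UTF-32 to UTF-8: encode each code point as 1 to 4 bytes.
template <>
struct converter<utf32_tag, utf8_tag> {
    static tagged_string<utf8_tag> apply(const tagged_string<utf32_tag>& s) {
        tagged_string<utf8_tag> out;
        for (char32_t cp : s.units) {
            // Reject surrogates and values beyond the Unicode range.
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                throw std::invalid_argument("not a Unicode scalar value");
            if (cp < 0x80) {
                out.units += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out.units += static_cast<char>(0xC0 | (cp >> 6));
                out.units += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.units += static_cast<char>(0xE0 | (cp >> 12));
                out.units += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out.units += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out.units += static_cast<char>(0xF0 | (cp >> 18));
                out.units += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out.units += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out.units += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }
};

// Entry point: the target is named explicitly, the source is deduced.
template <class To, class From>
tagged_string<To> convert(const tagged_string<From>& s) {
    return converter<From, To>::apply(s);
}
```

A stream tagged with its own encoding could, in principle, call something like
convert<utf8_tag>(s) inside operator<< for each string it is handed, which is
roughly the "work automatically" behaviour described above.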

- James

James Porter