
On 9/27/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Just nit-picking here: it converts to wchar_t, which may or may not be UTF-16. On Win32 platforms, it is, but on Linux, for example, it's UTF-32.
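A quick way to see this difference on a given platform is the sketch below. The helper names `wchar_bits` and `units_for_g_clef` are made up for illustration and are not from any library; the second one assumes a compiler that encodes a wide literal outside the BMP as a surrogate pair when wchar_t is 16 bits wide.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bits in one wchar_t code unit here.
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// Hypothetical helper: how many wchar_t code units this platform needs
// for one character outside the BMP (U+1D11E, MUSICAL SYMBOL G CLEF).
inline std::size_t units_for_g_clef() {
    const std::wstring clef = L"\U0001D11E";
    return clef.size();
}

// Self-check: with a 32-bit wchar_t one unit is enough; with a 16-bit
// wchar_t the character is assumed to be stored as a surrogate pair.
inline void check_wchar_width() {
    assert(units_for_g_clef() == (wchar_bits() >= 32 ? 1u : 2u));
}
```

So the same wstring source line can yield a different length depending on the platform, which is the portability problem in a nutshell.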
Yeah, I realized that after I clicked "send". I guess I should eat breakfast before sending email. :)
True. I think the strings should be immutable. I think experience with Java and C# compared to C++ shows that an immutable string class is superior in most use cases.
There should be some means to (possibly indirectly) modify a variable-width-encoded string, though it doesn't necessarily have to be through the class itself. A stringstream may be more appropriate.
That said, I think a good (general) roadmap for this project would be:
1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though string constants may pose a problem)
Doesn't basic_string<wchar_t> do just that already?
It doesn't do it in a portable manner. In Windows, basic_string<wchar_t> is, ostensibly, UTF-16, but in Linux, it's UTF-32. There should be a portable solution that guarantees a particular fixed-width encoding.

I'd argue that basic_string<wchar_t> isn't exactly Unicode at all, though I'm being nit-picky. char_traits<wchar_t>::state_type is mbstate_t, which is the state type used by codecvt to convert a narrow (ASCII) stream into a wide stream. In short, the stream (and ultimately the string) isn't Unicode, it's just ASCII stored with 2 (or 4) bytes per character. This goes back to the problems with using wfstream.

I think, to have a truly distinct basic_string specialization, we'd need portable 16- and 32-bit char types, and a way to unambiguously specify its encoding. My hope is that we can use char_traits<...>::state_type as a way to make code conversion simpler. Ideally, I'd like something that examines the state_type of the source and the target, and builds a converter based on those two pieces of information.

It would be great if I could say something like:

    ofstream<utf8> file("out.txt");
    file << ucs4string << utf16string << jisstring << asciistring << endl;

and have it work automatically.

- James
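To make the "pick a converter from the source and target encodings" idea a bit more concrete, here is a minimal sketch. It is not a proposed interface: the tag types `utf8_tag` and `utf32_tag`, the `converter` template, and `utf32_to_utf8` are all hypothetical names, and it assumes the char32_t / std::u32string types from the later C++11 standard as the portable 32-bit character type. It covers only the UTF-32 to UTF-8 direction.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical encoding tags, standing in for whatever would identify
// the source and target encodings (e.g. something hung off state_type).
struct utf8_tag {};
struct utf32_tag {};

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
inline std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        // Reject surrogates and values beyond the Unicode range.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {                      // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// The converter is selected at compile time from the (From, To) pair.
// The primary template is left undefined, so an unsupported pair fails
// to compile instead of silently doing the wrong thing.
template <class From, class To> struct converter;

template <> struct converter<utf32_tag, utf8_tag> {
    static std::string convert(const std::u32string& s) {
        return utf32_to_utf8(s);
    }
};

// Self-check: U+1D11E (outside the BMP) takes four bytes in UTF-8.
inline void check_converter() {
    assert((converter<utf32_tag, utf8_tag>::convert(U"\U0001D11E").size() == 4));
}
```

A stream wrapper along the lines of the ofstream<utf8> example could, in principle, look up converter<SourceTag, utf8_tag> for each string it is handed; that part is left out here.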