
"James Porter" <porterj@alum.rit.edu> writes:
> I see what you mean. Still, fixed-width-encoded strings are a lot easier to code, and I think we should focus on them first just to get something working and to have a platform to test code conversion on, which in my opinion is the most important part.
As others have said, I think that in practice a fixed-width encoding gains you very little, if anything. Needing random access to code points is, I think, an extremely rare operation. Replacing one code point with another code point is likewise rare; in general you would replace one substring (perhaps a grapheme cluster) with another substring (which may also be a grapheme cluster).

[snip]
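To illustrate the substring point, here is a minimal sketch, assuming the text is held as well-formed UTF-8 in a plain std::string. The helper name `replace_all` is hypothetical. Because UTF-8 is self-synchronizing, a byte-level match of a well-formed search string always falls on code point boundaries, so no code point indexing is needed:

```cpp
#include <string>

// Hypothetical helper: replace every occurrence of `from` with `to`.
// Assumes `text`, `from`, and `to` are all well-formed UTF-8. The search
// and the replacement operate on bytes; lengths are byte counts, never
// code point counts.
std::string replace_all(std::string text,
                        const std::string& from,
                        const std::string& to) {
    if (from.empty()) {
        return text;  // nothing to search for
    }
    std::string::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}
```

For example, `replace_all(s, "\xC3\xA9", "e\xCC\x81")` swaps precomposed U+00E9 for "e" followed by combining U+0301. The two substrings have different byte lengths and different code point counts, which is the usual case and the reason single-code-point replacement rarely comes up.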
> That said, I think a good (general) roadmap for this project would be:
> 1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though string constants may pose a problem)
UCS-2 is bogus and should not be used at all. Conceivably UCS-4 is legitimate but in practice not likely to be used by anyone. Still, it is probably important to support it. The primary encodings of Unicode to be supported should be UTF-8 and UTF-16.
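The problem with UCS-2 is that a single 16-bit unit cannot represent code points above U+FFFF, whereas UTF-16 encodes them as a surrogate pair. A minimal sketch of the UTF-16 encoding step, assuming a C++11 compiler (for `char16_t`, `char32_t`, and `std::u16string`) and a valid Unicode scalar value as input; `append_utf16` is a hypothetical name:

```cpp
#include <string>

// Hypothetical helper: append the UTF-16 encoding of `cp` to `out`.
// Assumes `cp` is a valid Unicode scalar value (at most 0x10FFFF and
// not itself in the surrogate range 0xD800..0xDFFF); no validation here.
void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit, same as UCS-2.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into two halves.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}
```

U+1D11E, for instance, comes out as the two units 0xD834 0xDD1E, so UTF-16 is a variable-width encoding just as UTF-8 is.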
> 2) Add code conversion to move between encodings, especially for I/O
> 3) Create VWE string class (fairly easy if immutable, hard if mutable)
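For step 2, the core of a code conversion routine is small. A minimal sketch of UTF-32 to UTF-8 conversion, assuming C++11 and input made up only of valid Unicode scalar values; `to_utf8` is a hypothetical name, and a real converter would also need error handling for invalid input:

```cpp
#include <string>

// Hypothetical helper: convert UTF-32 text to UTF-8.
// Assumes every element of `in` is a valid Unicode scalar value
// (at most 0x10FFFF, not a surrogate); invalid input is not detected.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```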
I don't think the issues of a mutable UTF-8/UTF-16 representation are very different from the issues of a mutable UTF-32 representation. In practice, in handling non-ASCII text, all searching and replacement will be in terms of substrings (likely single grapheme clusters or sequences of them).

-- 
Jeremy Maitin-Shepard