
John Maddock wrote:
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
basic_string is a sequence of code points, no more, no less; all of basic_string's performance guarantees can be met on that basis.
Using basic_string as a container for code points is fine, but what do we do with the other operations: find, replace, and so on? It would be nice if the interface that users work with most of the time were also the most convenient one. If we agree that users most likely want find/replace on a sequence of *characters*, it's not good to require something like

    std::find(unicode_iterator(s.begin()), unicode_iterator(s.end()), .....);

to do that. And since it's not possible to change the definition of std::string, you might want a boost::unicode_string whose find methods work on characters.

Another possible approach:

    typedef basic_string<wchar_t> unicode_codepoints_string;

    class unicode_characters_string {
    public:
        unicode_characters_string(const unicode_codepoints_string&);

        class iterator { };
        iterator begin();
        iterator end();
        // no find* methods!
    private:
        // might even hold rep by reference.
        const unicode_codepoints_string& m_rep;
    };

After that, one simply states that to do find/replace on a 'unicode_characters_string' one should use the string_algo library. Together with a big warning that basic_string<> does not do 100% correct find/replace, this might be enough.

In fact, I'm still not sure basic_string is all that useful. If you have unicode_characters_string, which does all operations correctly, and basic_string, which does only some operations correctly, why would you use basic_string? For efficiency?
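To make the "not 100% correct find" point concrete, here is a minimal sketch of a character-aware search layered on top of the code point find. It assumes C++11's std::u32string as the code point container (the thread itself uses basic_string<wchar_t>), and it treats only U+0300..U+036F as combining marks, which is a deliberate simplification. The names is_combining_mark and find_characters are made up for illustration and are not part of any proposed library.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F (Combining Diacritical Marks)
// count as combining. Real code would consult the Unicode property tables.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: like haystack.find(needle, pos), but skips matches
// that begin or end in the middle of a character (base + combining marks).
inline std::size_t find_characters(const std::u32string& haystack,
                                   const std::u32string& needle,
                                   std::size_t pos = 0) {
    while (true) {
        pos = haystack.find(needle, pos);
        if (pos == std::u32string::npos) return pos;

        const std::size_t end = pos + needle.size();
        // haystack[haystack.size()] is the terminating null, so this is safe.
        const bool starts_on_boundary = !is_combining_mark(haystack[pos]);
        const bool ends_on_boundary =
            end == haystack.size() || !is_combining_mark(haystack[end]);
        if (starts_on_boundary && ends_on_boundary) return pos;

        ++pos;  // the match split a character; keep looking
    }
}
```

For the three code points 'e', U+0301, 'e', the member find reports a match for "e" at index 0, inside the accented character, whereas this helper reports index 2.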
I'm talking about code points (and sequences thereof), not characters or glyphs, which, as you say, can consist of multiple code points.
I would handle "characters" and "glyphs" as iterator adapters sitting on top of sequences of code points. For code points, basic_string is as good a container as any (as are vector and deque and anything else you care to define).
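A rough sketch of what such an adapter might look like, offered as an illustration rather than a proposed design. It assumes std::u32string as the underlying code point sequence and a simplified notion of "character" (one base code point followed by any marks in U+0300..U+036F). The class character_iterator and the function count_characters are hypothetical names, and the class is not a complete standard-conforming iterator (no iterator_traits typedefs, no post-increment).

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical adapter: walks a code point sequence one "character"
// (base code point plus trailing combining marks) at a time.
class character_iterator {
public:
    typedef std::u32string::const_iterator base_iterator;

    character_iterator(base_iterator pos, base_iterator last)
        : first_(pos), next_(pos), last_(last) { scan(); }

    // The code points that make up the current character.
    std::u32string operator*() const { return std::u32string(first_, next_); }

    character_iterator& operator++() {
        first_ = next_;
        scan();
        return *this;
    }

    bool operator==(const character_iterator& other) const {
        return first_ == other.first_;
    }
    bool operator!=(const character_iterator& other) const {
        return !(*this == other);
    }

private:
    // Set next_ to one past the last code point of the character at first_.
    void scan() {
        next_ = first_;
        if (next_ == last_) return;
        ++next_;
        while (next_ != last_ && is_combining_mark(*next_)) ++next_;
    }

    base_iterator first_, next_, last_;
};

// Count characters rather than code points.
inline std::size_t count_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (character_iterator it(s.begin(), s.end()), stop(s.end(), s.end());
         it != stop; ++it) {
        ++n;
    }
    return n;
}
```

The container underneath stays an ordinary sequence of code points; only the view changes, so the same adapter would work over a vector or deque with a different base_iterator typedef.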
Iterator adapters are fine as an implementation technique. I fear that requiring users to employ iterator adapters directly is a bad decision.
Working on sequences of code points always requires care: clearly one could erase a low surrogate and leave a high surrogate "orphaned" behind, for example. One would need to make it clear in the documentation that potential problems like this can occur.
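The orphaned-surrogate hazard can be shown in a few lines, along with one possible guard against it. This sketch assumes a UTF-16 sequence held in C++11's std::u16string; erase_code_point is a hypothetical helper name, not an existing library function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: erase the code point containing the code unit at
// index pos, removing both halves of a surrogate pair together.
inline void erase_code_point(std::u16string& s, std::size_t pos) {
    assert(pos < s.size());
    std::size_t first = pos;
    std::size_t count = 1;
    if (is_low_surrogate(s[pos]) && pos > 0 && is_high_surrogate(s[pos - 1])) {
        first = pos - 1;  // pos pointed at the second half of a pair
        count = 2;
    } else if (is_high_surrogate(s[pos]) && pos + 1 < s.size() &&
               is_low_surrogate(s[pos + 1])) {
        count = 2;        // pos pointed at the first half of a pair
    }
    s.erase(first, count);
}
```

A plain s.erase(pos, 1) on either half of a pair leaves the other half behind; the helper widens the erased range so the pair goes as a unit. It does nothing about combining sequences, which would still need a character-level layer.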
And what can the user do to avoid such problems, other than not using basic_string? - Volodya