
Sundell Software wrote:
Each of UTF-8/16/32 has its own iterator type, but all of them yield UTF-32 code points when dereferenced. Look at std::istream_iterator/std::ostream_iterator for the design. There would probably be helper functions for the most common tasks, and I think you should be able to do all the necessary tasks with just iterators.
Yep. That is basically how the current implementation works. It's all (bi-directional) iterators. A Unicode string is by nature a bi-directional sequence, so you're basically forced to work that way.
typedef basic_string<utf_8> ustring8;
typedef basic_string<utf_16> ustring16;

ustring8 u8;
ustring16 u16;

// Would probably make .begin() the default.
unicode_iterator i8(u8, u8.begin());

// This would be a slow way of doing operator[]. The assignment would
// insert/remove elements from the basic_string if necessary.
// (std::advance returns void and needs an lvalue, hence the named iterator.)
unicode_iterator i16(u16, u16.begin());
std::advance(i16, 5);
*i16 = *(i8++);
Note that the client is responsible for giving a valid iterator to unicode_iterator.
An implementation like this is already in place, but it is not locked to basic_string. A mutable code_point_iterator (unicode_iterator in your code) can be created from any random access sequence. You won't be getting random access to the Unicode sequence though, as I mentioned above.
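To illustrate why the code point view is only bi-directional even when the underlying storage is random access, here is a minimal sketch over UTF-8 in a std::string. The helper names (next_code_point, prev_code_point, nth_code_point) are made up for this example and are not the interface of the implementation discussed here; the code also assumes well-formed input and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode the code point whose lead byte is at s[pos]
// and advance pos past it. Assumes well-formed UTF-8.
char32_t next_code_point(const std::string& s, std::size_t& pos)
{
    unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80) return lead;                  // single byte (ASCII)
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);          // payload bits of the lead byte
    while (extra-- > 0)                            // 6 payload bits per trail byte
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    return cp;
}

// Hypothetical helper: move pos back to the start of the previous code point
// (skipping 10xxxxxx trail bytes) and decode it.
char32_t prev_code_point(const std::string& s, std::size_t& pos)
{
    do { --pos; } while ((static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80);
    std::size_t tmp = pos;
    return next_code_point(s, tmp);
}

// Reaching the n-th code point means walking from the start: O(n), not O(1).
char32_t nth_code_point(const std::string& s, std::size_t n)
{
    std::size_t pos = 0;
    char32_t cp = next_code_point(s, pos);
    while (n-- > 0) cp = next_code_point(s, pos);
    return cp;
}
```

Stepping forward or backward costs at most a few bytes of work, but there is no way to jump straight to the n-th code point without visiting the ones before it.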
BTW, is using UTF-8/16 in the container really cheaper overall than UTF-32? If the client changes a character and the new one happens to encode to a larger or smaller size, then all the elements after it would need to be moved. Does that happen rarely enough? Though the client should probably know that themselves.
UTF-8, no. That is for people who require small size above all. But UTF-16 usually is, unless you are using some obscure script that lies outside the BMP (Basic Multilingual Plane).
- Erik
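For anyone who wants to see the cost difference concretely, the following is a rough sketch, not code from the implementation under discussion. The names utf8_length, utf16_length and replace_utf16 are hypothetical, and the input is assumed to be valid. It shows that a UTF-16 assignment only has to move the tail of the string when the old or new code point lies outside the BMP, while UTF-8 has four possible sizes.

```cpp
#include <cstddef>
#include <string>

// Code units needed to encode one code point (valid scalar values assumed).
std::size_t utf8_length(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

std::size_t utf16_length(char32_t cp)
{
    return cp < 0x10000 ? 1 : 2;   // outside the BMP: surrogate pair
}

// Hypothetical helper: overwrite the code point that starts at unit index pos.
// Returns how many code units the tail shifted by (0 means in-place).
std::ptrdiff_t replace_utf16(std::u16string& s, std::size_t pos, char32_t cp)
{
    char16_t first = s[pos];
    std::size_t old_len = (first >= 0xD800 && first <= 0xDBFF) ? 2 : 1;

    std::u16string enc;
    if (cp < 0x10000) {
        enc.push_back(static_cast<char16_t>(cp));
    } else {
        char32_t v = cp - 0x10000;                 // 20 bits split 10/10
        enc.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        enc.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }

    s.replace(pos, old_len, enc);                  // moves the tail if sizes differ
    return static_cast<std::ptrdiff_t>(enc.size())
         - static_cast<std::ptrdiff_t>(old_len);
}
```

Replacing one BMP character with another returns 0, so nothing after it moves. The same operation on UTF-8 would shift the tail whenever the two characters fall in different length classes, which is far more common in mixed text.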