[boost] Re: [Unicode strings] We're off

4 Apr 2005


      Sundell Software wrote:
...
Each UTF-8/16/32 has its own iterator type, but all output UTF-32 when
accessed. Look at std::istream_iterator/std::ostream_iterator for
design. There would probably be helper functions for the most common
tasks, and I think you should be able to do all the necessary tasks
with just iterators.

Yep. That is basically how the current implementation works. It's all 
(bi-directional) iterators. A unicode string is by nature a 
bi-directional sequence, so you're basically forced to work that way.
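
To make the "bi-directional by nature" point concrete, here is a minimal sketch of stepping forwards and backwards over UTF-8 storage while always yielding UTF-32 code points. The names next_code_point and prev_code_point are made up for illustration and are not the interface of the implementation discussed in this thread. The sketch assumes well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers, for illustration only.
// Assumption: `s` holds well-formed UTF-8; nothing is validated.

// True for bytes of the form 10xxxxxx, which never start a code point.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Decode the code point that starts at byte offset `pos`, then move
// `pos` past it (the "operator++" direction).
inline char32_t next_code_point(const std::string& s, std::size_t& pos) {
    assert(pos < s.size());
    unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80) return lead;                     // 1-byte (ASCII)
    int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);             // payload bits of the lead byte
    while (extra-- > 0) {
        assert(pos < s.size());
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    }
    return cp;
}

// Move `pos` back to the start of the previous code point and decode it
// (the "operator--" direction). Skips continuation bytes until a lead byte.
inline char32_t prev_code_point(const std::string& s, std::size_t& pos) {
    assert(pos > 0);
    do {
        --pos;
    } while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])));
    std::size_t tmp = pos;
    return next_code_point(s, tmp);
}
```

Both directions only ever look at neighbouring bytes, which is why stepping is cheap. Jumping straight to the n-th code point is not possible without walking from one end.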
...
typedef basic_string<utf_8> ustring8;
typedef basic_string<utf_16> ustring16;
ustring8  u8;
ustring16 u16;
// Would probably make .begin() the default.
unicode_iterator i8(u8, u8.begin());
// This would be a slow way of doing operator[]. The assignment would
// insert/remove elements from the basic_string if necessary.
unicode_iterator i16(u16, u16.begin());
std::advance(i16, 5);  // std::advance needs an lvalue and returns void
*i16 = *(i8++);
Note that the client is responsible for giving a valid iterator to
unicode_iterator.

An implementation like this is already in place, but not locked to 
basic_string. A mutable code_point_iterator (unicode_iterator in your 
code) can be created from any random access sequence. You won't be 
getting random access to the unicode sequence though, like I mentioned 
above.
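
As a rough sketch of what assignment through a mutable code point iterator implies for the underlying storage, the helpers below overwrite the n-th code point of a UTF-8 std::string. All names (encode_utf8, sequence_length, offset_of, assign_code_point) are hypothetical and only stand in for whatever the real iterator does internally. Well-formed UTF-8 and a valid code point are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers, for illustration only.
// Assumption: strings hold well-formed UTF-8; nothing is validated.

// Encode one code point as 1 to 4 UTF-8 bytes.
inline std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Number of bytes in the sequence introduced by lead byte `lead`.
inline std::size_t sequence_length(unsigned char lead) {
    return lead < 0x80 ? 1 : lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
}

// Byte offset of the n-th code point (0-based). This is a linear walk:
// there is no random access to code points in variable-width storage.
inline std::size_t offset_of(const std::string& s, std::size_t n) {
    std::size_t pos = 0;
    while (n-- > 0) {
        assert(pos < s.size());
        pos += sequence_length(static_cast<unsigned char>(s[pos]));
    }
    return pos;
}

// Overwrite the n-th code point. If the new encoding is longer or shorter
// than the old one, std::string::replace shifts every byte behind it.
inline void assign_code_point(std::string& s, std::size_t n, char32_t cp) {
    std::size_t pos = offset_of(s, n);
    assert(pos < s.size());
    std::size_t old_len = sequence_length(static_cast<unsigned char>(s[pos]));
    s.replace(pos, old_len, encode_utf8(cp));
}
```

For example, replacing the middle character of "abc" with U+20AC grows the string from 3 to 5 bytes and moves the trailing 'c'.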
...
BTW, is using UTF-8/16 in the container really cheaper overall than
UTF-32? If the client changes a character and it happens to be
larger/smaller, then all the elements behind it would need to be moved.
Does that happen rarely enough? Though the client should probably know
that themselves.

UTF-8, no. That is for people who require small size above all. But 
UTF-16 usually is, unless you are using some obscure language that is 
not within the BMP (Basic Multilingual Plane).
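
To put numbers on that, here is a small sketch that counts how many code units a single code point needs in each encoding. The function names are made up for this example. Only the standard encoding ranges are used.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers, for illustration only.

// Code units (bytes) needed for `cp` in UTF-8.
inline std::size_t utf8_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Code units (16-bit) needed for `cp` in UTF-16: one inside the BMP,
// two (a surrogate pair) outside it.
inline std::size_t utf16_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}

// Would overwriting `old_cp` with `new_cp` change the stored length,
// forcing the elements behind it to be moved?
inline bool utf8_replace_moves_tail(char32_t old_cp, char32_t new_cp) {
    return utf8_units(old_cp) != utf8_units(new_cp);
}
inline bool utf16_replace_moves_tail(char32_t old_cp, char32_t new_cp) {
    return utf16_units(old_cp) != utf16_units(new_cp);
}
```

Replacing 'a' with U+20AC changes the length in UTF-8 (1 byte to 3) but not in UTF-16, where both are a single unit. In UTF-16 the tail only moves when a replacement crosses the BMP boundary.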

- Erik

Erik Wien