
Joseph Gauterin wrote:
IIRC, some of the non-const std::basic_string methods aren't suitable for handling variable width encodings like utf8 and utf16 - non-const operator[] in particular returns a reference to the character type - a big problem if you want to assign a value > 0x7F (i.e. a character that uses 2 or more bytes).
Yes, very true. One option is to convert to a fixed-size character set before doing anything like operator[], and to not allow strings of variable-width character sets. If you do want to apply operator[] to a UTF8 string, what type should it return? A reference to a range of bytes, somehow? A proxy that encodes/decodes to a UCS4 character? Or, you could say that the iterator is a byte iterator, not a character iterator. Lots of possibilities.
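To make the "decode to a UCS4 character" option concrete, here is a minimal sketch of a read-only iterator that walks a UTF-8 std::string and yields one code point per step. All the names (utf8_sequence_length, utf8_const_iterator, decode_utf8) are hypothetical and not part of any existing library. It assumes C++11 (for char32_t) and well-formed input: it does not check for truncated sequences, overlong forms or surrogates. It also sidesteps the harder question of writing through operator[], since a write may change the number of bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with lead byte b, or 0 if b cannot start a sequence.
inline std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                          // continuation or invalid byte
}

// Hypothetical read-only iterator over the code points of a UTF-8 string.
// Assumes well-formed input; no bounds checking is done on trailing bytes.
class utf8_const_iterator {
public:
    explicit utf8_const_iterator(std::string::const_iterator pos) : pos_(pos) {}

    // Decode the code point at the current position.
    char32_t operator*() const {
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        const std::size_t len = utf8_sequence_length(lead);
        if (len == 0) return 0xFFFD;   // replacement character for a bad lead byte
        if (len == 1) return lead;
        // Keep only the payload bits of the lead byte, then append
        // six bits from each continuation byte.
        char32_t cp = lead & (0xFF >> (len + 1));
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3F);
        return cp;
    }

    // Advance by one whole sequence (at least one byte, so we never stall).
    utf8_const_iterator& operator++() {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(*pos_));
        pos_ += (len == 0 ? 1 : len);
        return *this;
    }

    bool operator==(const utf8_const_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const utf8_const_iterator& o) const { return pos_ != o.pos_; }

private:
    std::string::const_iterator pos_;
};

// Convenience wrapper: decode a whole UTF-8 string to UCS4 code points.
inline std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (utf8_const_iterator it(s.begin()), end(s.end()); it != end; ++it)
        out.push_back(*it);
    return out;
}
```

With this, the 6-byte string "a\xC3\xA9\xE2\x82\xAC" decodes to three code points (U+0061, U+00E9, U+20AC), which shows why a byte index and a character index are not interchangeable.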
I've noticed that there are frequent requests/proposals for some sort of boost unicode/string encoding library. I've thought about the problem and it seems too big for one person to handle in their spare time
Let me say "part time" rather than "spare time"...
- perhaps a group of us should get together to discuss working on one? I'd be happy to participate.
I would definitely encourage breaking the work up into smaller chunks. IMHO "smaller is better" for Boost libraries; there have been a number of occasions when I've discovered that a feature I want is hidden as an internal component of a Boost library, and I've felt that it should have been a stand-alone public entity. So let's think about how this work can be split up:

- A charset_trait class. I have started on this. The missing piece is a way to look up traits of character sets that are known at run-time; input would be appreciated.
- Compile-time and run-time tagged strings. The basics of this are straightforward and done.
- Conversions. My approach at present is to use iconv via a functor that I wrote a while ago. I believe iconv is widely available; however, some implementations may support only a small set of character sets. Alternatives would be interesting.
- Variable width iterators, including the issue that you raised above.
- Interaction with locales, internationalisation, and system APIs.

and no doubt more. Thinking about the interfaces between these areas and the user would be a good place to start.

Regards,
Phil.