
Yes, very true. One option is to convert to a fixed-width character set before doing anything like operator[], and to disallow strings in variable-width character sets. If you do want to apply operator[] to a UTF-8 string, what type should it return? A reference to a range of bytes, somehow? A proxy that encodes/decodes to a UCS-4 character? Or, you could say that the iterator is a byte iterator, not a character iterator. Lots of possibilities.
I'd add making the string classes immutable to the list. That way, dereferencing an iterator of any type (by which I mean calling unary operator*) could return a Unicode code point by value. Mutable sequences that pretend to hold a different type than they actually do don't work well with C++ idioms (e.g. vector<bool>). Strings could be built using a stringstream-like approach or by concatenation (possibly with expression-template optimizations).
Making the iterator a byte iterator, not a code point iterator, pushes the responsibility for handling the variable width of the different encodings back onto the user. There are certainly a lot of possibilities, and we should try to reach some sort of consensus before we go further with this.
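To make the immutable-string idea concrete, here is a rough sketch of what I have in mind. The names (utf8_string, const_iterator) are made up for illustration and are not a proposed interface, and the sketch assumes the bytes are already well-formed UTF-8, so there is no validation at all. The point is only that operator* can decode and return a char32_t by value, with no proxy, because nothing can write through the iterator.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical immutable UTF-8 string. Assumes the bytes passed in are
// already well-formed UTF-8; this sketch does no validation.
class utf8_string {
public:
    class const_iterator {
    public:
        // Only an input iterator: operator* yields a value, not a reference.
        using iterator_category = std::input_iterator_tag;
        using value_type = char32_t;
        using difference_type = std::ptrdiff_t;
        using pointer = const char32_t*;
        using reference = char32_t;

        explicit const_iterator(const char* p = nullptr) : p_(p) {}

        // Decode the code point that starts at the current byte.
        char32_t operator*() const {
            const unsigned char b0 = static_cast<unsigned char>(*p_);
            const std::size_t n = width(b0);
            // Keep only the payload bits of the lead byte.
            char32_t cp = (n == 1) ? static_cast<char32_t>(b0)
                                   : static_cast<char32_t>(b0 & (0xFF >> (n + 1)));
            // Each continuation byte contributes its low six bits.
            for (std::size_t i = 1; i < n; ++i)
                cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
            return cp;
        }

        // Advance by the byte width of the current code point.
        const_iterator& operator++() {
            p_ += width(static_cast<unsigned char>(*p_));
            return *this;
        }
        const_iterator operator++(int) {
            const_iterator old = *this;
            ++*this;
            return old;
        }

        friend bool operator==(const const_iterator& a, const const_iterator& b) {
            return a.p_ == b.p_;
        }
        friend bool operator!=(const const_iterator& a, const const_iterator& b) {
            return a.p_ != b.p_;
        }

    private:
        // Sequence length in bytes, derived from the lead byte.
        static std::size_t width(unsigned char lead) {
            if (lead < 0x80) return 1;
            if ((lead >> 5) == 0x6) return 2;
            if ((lead >> 4) == 0xE) return 3;
            return 4;
        }

        const char* p_;
    };

    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {}

    // Only const access is offered; the contents cannot be edited in place.
    const_iterator begin() const { return const_iterator(bytes_.data()); }
    const_iterator end() const { return const_iterator(bytes_.data() + bytes_.size()); }

private:
    std::string bytes_;
};
```

With this, utf8_string s("h\xC3\xA9\xE2\x82\xAC") iterates as three code points (U+0068, U+00E9, U+20AC) even though it stores six bytes. The cost is that the iterator is only an input iterator in the strict standard sense, since it does not return a real reference.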
I would definitely encourage breaking the work up into smaller chunks. Agreed.
Conversions. My approach at present is to use iconv via a functor that I wrote a while ago. I believe iconv is widely available; however, some implementations may support only a small set of character sets. Alternatives would be interesting. IIRC, iconv is licensed under the GPL, which would prevent it from being integrated into boost. We should make whatever interface we come up with easily extensible, so that people can add support for whatever encoding they require, possibly using iconv if using GPL software isn't a problem for them.
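As a strawman for the extensible interface, something along these lines might work. All the names here (codec, latin1_codec, utf8_codec, transcode) are hypothetical, and pivoting through UTF-32 is just one possible design, not something we have agreed on. Each encoding is one small class, so an iconv-backed codec could be added by anyone who wants it without the library itself depending on iconv. The UTF-8 decoder only checks for truncation and otherwise assumes well-formed input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical extension point: one codec per encoding, pivoting via UTF-32.
struct codec {
    virtual ~codec() = default;
    virtual std::u32string decode(const std::string& bytes) const = 0;
    virtual std::string encode(const std::u32string& cps) const = 0;
};

// ISO-8859-1: every byte maps directly to the code point of the same value.
struct latin1_codec : codec {
    std::u32string decode(const std::string& bytes) const override {
        std::u32string out;
        for (unsigned char b : bytes) out.push_back(b);
        return out;
    }
    std::string encode(const std::u32string& cps) const override {
        std::string out;
        for (char32_t cp : cps) {
            if (cp > 0xFF) throw std::range_error("not representable in Latin-1");
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
        }
        return out;
    }
};

// UTF-8: checks only for truncated sequences, not for other malformed input.
struct utf8_codec : codec {
    std::u32string decode(const std::string& bytes) const override {
        std::u32string out;
        for (std::size_t i = 0; i < bytes.size();) {
            const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
            const std::size_t n = b0 < 0x80 ? 1
                                : (b0 >> 5) == 0x6 ? 2
                                : (b0 >> 4) == 0xE ? 3 : 4;
            if (i + n > bytes.size()) throw std::range_error("truncated UTF-8 sequence");
            char32_t cp = (n == 1) ? static_cast<char32_t>(b0)
                                   : static_cast<char32_t>(b0 & (0xFF >> (n + 1)));
            for (std::size_t k = 1; k < n; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
            out.push_back(cp);
            i += n;
        }
        return out;
    }
    std::string encode(const std::u32string& cps) const override {
        std::string out;
        auto put = [&out](char32_t v) { out.push_back(static_cast<char>(static_cast<unsigned char>(v))); };
        for (char32_t cp : cps) {
            if (cp < 0x80) {
                put(cp);
            } else if (cp < 0x800) {
                put(0xC0 | (cp >> 6));
                put(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                put(0xE0 | (cp >> 12));
                put(0x80 | ((cp >> 6) & 0x3F));
                put(0x80 | (cp & 0x3F));
            } else {
                put(0xF0 | (cp >> 18));
                put(0x80 | ((cp >> 12) & 0x3F));
                put(0x80 | ((cp >> 6) & 0x3F));
                put(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }
};

// Convert between any two encodings that have a codec.
std::string transcode(const codec& from, const codec& to, const std::string& bytes) {
    return to.encode(from.decode(bytes));
}
```

For example, transcode(latin1_codec(), utf8_codec(), "caf\xE9") yields the UTF-8 bytes "caf\xC3\xA9". Going through a full UTF-32 intermediate string is wasteful for large inputs; a streaming interface would probably be better, but this keeps the extension point easy to see.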
- Interaction with locales, internationalisation, and system APIs. We'll definitely need a way to convert to a raw pointer representation (like std::string::c_str()) for interaction with some APIs.
Lots to think about.