
I think we could use the locale/code conversion functionality available in the standard I/O streams library to minimize the amount of new code needed and to make it more, well, standard. In general, I'd expect most code conversions to occur during I/O anyway (exceptions to this could probably be handled using stringstreams). Appendix D of "The C++ Programming Language" has a fair amount of information on the topic (online here: http://www.research.att.com/~bs/3rd_loc0.html ).

The I/O streams' code conversion (through std::codecvt) can potentially convert between any two encodings/character sets, assuming code is written for that particular conversion. std::codecvt takes 3 template parameters: the internal character type, the external character type, and the conversion state type (called "state"). We could specialize this to take, in effect, 4 parameters, replacing the single state parameter with a pair of conversion schemes: one from the internal encoding to the character set itself, and one from the character set to the external encoding. So something like this:

    std::codecvt< utf16, utf8, pair<utf16_to_ucs4, ucs4_to_utf8> >

would convert an internal UTF-16 encoding of a string to an external UTF-8 encoding. However, an I/O stream can only have one codecvt instance at a time (via imbuing a locale), so this raises the question of how we should handle streaming out two Unicode strings with different encodings.

On a different note, does anyone see a practical use in having (mutable) strings with variable-width character encodings? I can't think of any use for them that wouldn't be equally well served by an array of bytes (like the email MIME-type example). As for run-time tagging of strings, I doubt it would work very well, since it would be difficult to extend a run-time tagged string class to handle new encodings/character sets.

- James

Phil Endecott wrote:
I would definitely encourage breaking the work up into smaller chunks. IMHO "smaller is better" for Boost libraries; there have been a number of occasions when I've discovered that a feature I want is hidden as an internal component of a Boost library, and I've felt that it should have been a stand-alone public entity. So let's think about how this work can be split up:
- A charset_trait class. I have started on this. The missing piece is a way to look up traits of character sets that are known at run-time; input would be appreciated.
- Compile-time and run-time tagged strings. The basics of this are straightforward and done.
- Conversions. My approach at present is to use iconv via a functor that I wrote a while ago. I believe iconv is widely available; however, some implementations may support only a small set of character sets. Alternatives would be interesting.
- Variable width iterators, including the issue that you raised above.
- Interaction with locales, internationalisation, and system APIs.
and no doubt more. Thinking about the interfaces between these areas and the user would be a good place to start.
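To make the first two items a little more concrete, here is a minimal sketch of what a charset_trait and a compile-time tagged string might look like. Every name in it (the tag types, the members of charset_trait, tagged_string, charset_info, find_charset) is hypothetical and is not taken from the code mentioned above. The run-time lookup is only one possible answer to the open question: a fixed table searched by name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical tag types, one per character set / encoding.
struct utf8_tag {};
struct utf16_tag {};
struct latin1_tag {};

// Compile-time traits. The primary template is left undefined so that
// using an unknown tag is a compile-time error.
template <class Tag> struct charset_trait;

template <> struct charset_trait<utf8_tag> {
    typedef char unit_type;
    static constexpr bool variable_width = true;
    static const char* name() { return "UTF-8"; }
};
template <> struct charset_trait<utf16_tag> {
    typedef char16_t unit_type;
    static constexpr bool variable_width = true;
    static const char* name() { return "UTF-16"; }
};
template <> struct charset_trait<latin1_tag> {
    typedef char unit_type;
    static constexpr bool variable_width = false;
    static const char* name() { return "ISO-8859-1"; }
};

// Compile-time tagged string: the tag is part of the type, so strings in
// different character sets cannot be mixed up by accident.
template <class Tag>
class tagged_string {
public:
    typedef charset_trait<Tag> traits;
    typedef std::basic_string<typename traits::unit_type> storage_type;

    tagged_string() {}
    explicit tagged_string(const storage_type& s) : units_(s) {}

    const storage_type& units() const { return units_; }
    static const char* charset_name() { return traits::name(); }

private:
    storage_type units_;
};

// Both of these store plain char, yet they are distinct types.
static_assert(!std::is_same<tagged_string<utf8_tag>,
                            tagged_string<latin1_tag> >::value,
              "tags must yield distinct string types");

// One possible run-time view of the same information (an assumption).
struct charset_info {
    const char* name;
    std::size_t unit_size;
    bool variable_width;
};

template <class Tag>
charset_info make_info() {
    return charset_info{ charset_trait<Tag>::name(),
                         sizeof(typename charset_trait<Tag>::unit_type),
                         charset_trait<Tag>::variable_width };
}

// Looks a character set up by name; returns a null pointer if unknown.
inline const charset_info* find_charset(const std::string& name) {
    static const charset_info table[] = {
        make_info<utf8_tag>(), make_info<utf16_tag>(), make_info<latin1_tag>()
    };
    for (const charset_info& entry : table) {
        if (name == entry.name) return &entry;
    }
    return nullptr;
}
```

With this layout, tagged_string<utf8_tag> and tagged_string<latin1_tag> cannot be assigned to one another without an explicit conversion, while find_charset("UTF-16") gives run-time code the same facts the traits expose at compile time. The fixed table is the weak point: supporting a new character set at run time would need some form of registration.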
Regards,
Phil.
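
For reference, the two-step conversion described at the top of this message (internal encoding to the character set, then character set to the external encoding) can be sketched without any codecvt machinery. This is only an illustration under assumptions: the function names reuse the utf16_to_ucs4 / ucs4_to_utf8 labels from the example above, but they are hypothetical free functions rather than an existing API, and invalid input is simply replaced with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Step 1 (hypothetical): decode UTF-16 code units into UCS-4 code points.
// Unpaired surrogates are replaced with U+FFFD.
inline std::u32string utf16_to_ucs4(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        if (c >= 0xD800 && c <= 0xDBFF) {               // high surrogate
            if (i + 1 < in.size() &&
                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                char32_t low = in[++i];                 // consume the pair
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
            } else {
                c = 0xFFFD;                             // no matching low half
            }
        } else if (c >= 0xDC00 && c <= 0xDFFF) {        // stray low surrogate
            c = 0xFFFD;
        }
        out.push_back(c);
    }
    return out;
}

// Step 2 (hypothetical): encode UCS-4 code points as UTF-8 bytes.
// Surrogate values and values above U+10FFFF are replaced with U+FFFD.
inline std::string ucs4_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t c : in) {
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) c = 0xFFFD;
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (c >> 18)));
            out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// The "pair" idea: any internal-to-external conversion is the composition
// of the two halves, with UCS-4 as the pivot.
inline std::string utf16_to_utf8(const std::u16string& in) {
    return ucs4_to_utf8(utf16_to_ucs4(in));
}
```

The attraction of the pivot is that N encodings need only 2N conversion halves rather than N*N direct converters. The cost is an intermediate UCS-4 buffer, which a real codecvt facet would presumably avoid by converting incrementally.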