
Phil Endecott wrote:
Yes, other people have suggested similar things. Even if it were true that most charset conversion occurred during I/O - and that's not been my experience in my own work - I would still argue that charset conversion should be available for use in other contexts.
I don't mean to say that the recode function shouldn't exist, but that it should exist only as a convenience function. The actual conversion should be directly usable by I/O operations, so a string doesn't need to be fully converted (and allocated) before output. For all their problems (mostly runtime-specified conversion), the std::codecvt facets make it fairly easy to handle partial conversion and shift states.
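For illustration, driving a codecvt facet piecewise might look roughly like the sketch below (the function name and buffer size are invented, not part of any proposed interface). The point is that the shift state and the unconverted tail survive across chunks, so nothing ever needs the whole converted string in memory at once:

    #include <iostream>
    #include <locale>
    #include <stdexcept>
    #include <string>

    // Convert `src` to the external encoding of `loc`, writing each chunk
    // as soon as it is ready.  The shift state lives in `state` and the
    // unconverted tail in `from`, so no full-size buffer is ever needed.
    void write_converted(std::ostream& os, const std::wstring& src,
                         const std::locale& loc)
    {
        typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_t;
        const facet_t& cvt = std::use_facet<facet_t>(loc);

        std::mbstate_t state = std::mbstate_t();
        const wchar_t* from     = src.data();
        const wchar_t* from_end = from + src.size();

        char buf[64];               // small on purpose, to force partial results
        while (from != from_end) {
            const wchar_t* from_next = from;
            char* to_next = buf;
            facet_t::result r = cvt.out(state, from, from_end, from_next,
                                        buf, buf + sizeof buf, to_next);
            if (r == facet_t::error)
                throw std::runtime_error("conversion failed");
            os.write(buf, to_next - buf);   // flush what was produced
            from = from_next;               // resume where the facet stopped
            if (r == facet_t::partial && to_next == buf)
                break;                      // no progress possible with this buffer
        }
    }

An I/O layer could call something like this from its flush path instead of converting the whole string up front.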
I imagine that an I/O streams library or some sort of adapter layer compatible with these strings would be necessary.
I think this is key, and goes back to my argument that string conversion should be seen as an I/O operation, and separate from the strings themselves. Unless you merely want a raw byte array, you're (conceptually) converting bytes into code points and then into some internal storage container. For ASCII this is trivial, since each byte is equivalent to a code point and the storage container is just a char/byte, so you're back to where you started. For Unicode, this is considerably more complicated (or we wouldn't be discussing it!). The stream must at least be aware of the external (file) encoding, in order to keep track of shift states. I don't think we'd be able to delegate that responsibility to the strings we'd be filling with data.
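As a concrete illustration of that division of labour - bytes in, code points out, with the in-progress state held by the decoding layer rather than by the string - an incremental UTF-8 decoder might look something like this (class and member names are invented, and validation of malformed input is omitted):

    // Incremental byte -> code point decoder for UTF-8.  The half-finished
    // sequence is kept between calls, the way a stream must keep it across
    // buffer boundaries; the string being filled never sees incomplete input.
    class utf8_decoder {
        unsigned long cp_;   // code point accumulated so far
        int remaining_;      // continuation bytes still expected
    public:
        utf8_decoder() : cp_(0), remaining_(0) {}

        // Feed one byte; returns true and sets `out` when a code point is
        // complete.
        bool put(unsigned char b, unsigned long& out) {
            if (remaining_ == 0) {
                if (b < 0x80)                { out = b; return true; }  // ASCII
                else if ((b & 0xE0) == 0xC0) { cp_ = b & 0x1F; remaining_ = 1; }
                else if ((b & 0xF0) == 0xE0) { cp_ = b & 0x0F; remaining_ = 2; }
                else if ((b & 0xF8) == 0xF0) { cp_ = b & 0x07; remaining_ = 3; }
            } else {
                cp_ = (cp_ << 6) | (b & 0x3F);
                if (--remaining_ == 0)       { out = cp_; return true; }
            }
            return false;   // sequence not complete yet
        }
    };

The particular decoder doesn't matter; the point is that `remaining_`-style state has a natural home in the stream or adapter layer and no natural home in the destination string.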
Yes, this has some advantages. But using a map has the disadvantage that lookups are more expensive than with the array indexed by enum that I have; in my code, getting the char* name of a charset is a compile-time-constant operation. I'm not sure how much that matters in practice.
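Concretely, such an enum-indexed table might look like this (the enumerators and charset names here are invented for illustration):

    // Charsets known to the library, used both as tags and as array indices.
    enum charset_id { cs_ascii, cs_latin1, cs_utf8, cs_count };

    // Array indexed by the enum: fetching a name is a single indexing
    // operation, which the compiler can fold away when the id is a constant.
    const char* const charset_names[cs_count] = {
        "US-ASCII",
        "ISO-8859-1",
        "UTF-8"
    };

    inline const char* charset_name(charset_id id) { return charset_names[id]; }

A std::map keyed by name goes the other way: easier to extend at run time, but every lookup walks the tree and compares strings.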
Another option would be, for every encoding, to create a class (for compile-time tagging) and a global instance of that class (for run-time tagging) - roughly along the lines of the sketch below.
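The names here are purely illustrative:

    #include <cstdio>

    // Run-time tag: a common base with a virtual name().
    struct encoding {
        virtual const char* name() const = 0;
        virtual ~encoding() {}
    };

    // Compile-time tags: one class per encoding.
    struct utf8_encoding   : encoding { const char* name() const { return "UTF-8"; } };
    struct latin1_encoding : encoding { const char* name() const { return "ISO-8859-1"; } };

    // Global instances, for when the encoding is only known at run time.
    utf8_encoding   utf8;
    latin1_encoding latin1;

    // Run-time dispatch through the base class...
    void report(const encoding& e)  { std::printf("run-time:     %s\n", e.name()); }

    // ...and compile-time dispatch through the tag type itself.
    template <class Enc>
    void report()                   { std::printf("compile-time: %s\n", Enc().name()); }

    int main() {
        report(utf8);               // encoding chosen at run time
        report<latin1_encoding>();  // encoding fixed at compile time
        return 0;
    }

Code that knows the encoding statically then never pays for the virtual call, while code that doesn't can still pass one of the globals around by reference. - James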