
Phil Endecott wrote:
Yes, other people have suggested similar things. Even if it were true that most charset conversion occurred during I/O - and that's not been my experience in my own work - I would still argue that charset conversion should be available for use in other contexts.
I don't mean to say that the recode function shouldn't exist, but that it should exist only as a convenience function. The actual conversion should be directly usable by I/O operations, so a string doesn't need to be fully converted (and allocated) before output. For all their problems (mostly runtime-specified conversion), the std::codecvt facets make it fairly easy to handle partial conversion and shift states.
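For illustration, driving a codecvt facet piecewise might look roughly like the sketch below (the function name and buffer size are invented, not part of any proposed interface). The point is that the shift state and the unconverted tail survive across chunks, so nothing ever needs the whole converted string in memory at once:

    #include <iostream>
    #include <locale>
    #include <stdexcept>
    #include <string>

    // Convert `src` to the external encoding of `loc`, writing each chunk
    // as soon as it is ready.  The shift state lives in `state` and the
    // unconverted tail in `from`, so no full-size buffer is ever needed.
    void write_converted(std::ostream& os, const std::wstring& src,
                         const std::locale& loc)
    {
        typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_t;
        const facet_t& cvt = std::use_facet<facet_t>(loc);

        std::mbstate_t state = std::mbstate_t();
        const wchar_t* from     = src.data();
        const wchar_t* from_end = from + src.size();

        char buf[64];               // small on purpose, to force partial results
        while (from != from_end) {
            const wchar_t* from_next = from;
            char* to_next = buf;
            facet_t::result r = cvt.out(state, from, from_end, from_next,
                                        buf, buf + sizeof buf, to_next);
            if (r == facet_t::error)
                throw std::runtime_error("conversion failed");
            os.write(buf, to_next - buf);   // flush what was produced
            from = from_next;               // resume where the facet stopped
            if (r == facet_t::partial && to_next == buf)
                break;                      // no progress possible with this buffer
        }
    }

An I/O layer could call something like this from its flush path instead of converting the whole string up front.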
I imagine that an I/O streams library or some sort of adapter layer compatible with these strings would be necessary.
I think this is key, and goes back to my argument that string conversion should be seen as an I/O operation, and separate from the strings themselves. Unless you merely want a raw byte array, you're (conceptually) converting bytes into code points and then into some internal storage container. For ASCII this is trivial, since each byte is equivalent to a code point and the storage container is just a char/byte, so you're back to where you started. For Unicode, this is considerably more complicated (or we wouldn't be discussing it!). The stream must at least be aware of the external (file) encoding, in order to keep track of shift states. I don't think we'd be able to delegate that responsibility to the strings we'd be filling with data.
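As a concrete illustration of that division of labour - bytes in, code points out, with the in-progress state held by the decoding layer rather than by the string - an incremental UTF-8 decoder might look something like this (class and member names are invented, and validation of malformed input is omitted):

    // Incremental byte -> code point decoder for UTF-8.  The half-finished
    // sequence is kept between calls, the way a stream must keep it across
    // buffer boundaries; the string being filled never sees incomplete input.
    class utf8_decoder {
        unsigned long cp_;   // code point accumulated so far
        int remaining_;      // continuation bytes still expected
    public:
        utf8_decoder() : cp_(0), remaining_(0) {}

        // Feed one byte; returns true and sets `out` when a code point is
        // complete.
        bool put(unsigned char b, unsigned long& out) {
            if (remaining_ == 0) {
                if (b < 0x80)                { out = b; return true; }  // ASCII
                else if ((b & 0xE0) == 0xC0) { cp_ = b & 0x1F; remaining_ = 1; }
                else if ((b & 0xF0) == 0xE0) { cp_ = b & 0x0F; remaining_ = 2; }
                else if ((b & 0xF8) == 0xF0) { cp_ = b & 0x07; remaining_ = 3; }
            } else {
                cp_ = (cp_ << 6) | (b & 0x3F);
                if (--remaining_ == 0)       { out = cp_; return true; }
            }
            return false;   // sequence not complete yet
        }
    };

The particular decoder doesn't matter; the point is that `remaining_`-style state has a natural home in the stream or adapter layer and no natural home in the destination string.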
Yes, this has some advantages. But using a map has the disadvantage that lookups are more expensive than with the array indexed by enum that I have; in my code, getting the char* name of a charset is a compile-time-constant operation. I'm not sure how much that matters in practice.
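Concretely, such an enum-indexed table might look like this (the enumerators and charset names here are invented for illustration):

    // Charsets known to the library, used both as tags and as array indices.
    enum charset_id { cs_ascii, cs_latin1, cs_utf8, cs_count };

    // Array indexed by the enum: fetching a name is a single indexing
    // operation, which the compiler can fold away when the id is a constant.
    const char* const charset_names[cs_count] = {
        "US-ASCII",
        "ISO-8859-1",
        "UTF-8"
    };

    inline const char* charset_name(charset_id id) { return charset_names[id]; }

A std::map keyed by name goes the other way: easier to extend at run time, but every lookup walks the tree and compares strings.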
Another option would be, for every encoding, to create a class (for compile-time tagging) and a global instance of that class (for run-time tagging) - roughly along the lines of the sketch below.
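The names here are purely illustrative:

    #include <cstdio>

    // Run-time tag: a common base with a virtual name().
    struct encoding {
        virtual const char* name() const = 0;
        virtual ~encoding() {}
    };

    // Compile-time tags: one class per encoding.
    struct utf8_encoding   : encoding { const char* name() const { return "UTF-8"; } };
    struct latin1_encoding : encoding { const char* name() const { return "ISO-8859-1"; } };

    // Global instances, for when the encoding is only known at run time.
    utf8_encoding   utf8;
    latin1_encoding latin1;

    // Run-time dispatch through the base class...
    void report(const encoding& e)  { std::printf("run-time:     %s\n", e.name()); }

    // ...and compile-time dispatch through the tag type itself.
    template <class Enc>
    void report()                   { std::printf("compile-time: %s\n", Enc().name()); }

    int main() {
        report(utf8);               // encoding chosen at run time
        report<latin1_encoding>();  // encoding fixed at compile time
        return 0;
    }

Code that knows the encoding statically then never pays for the virtual call, while code that doesn't can still pass one of the globals around by reference. - James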