Re: [boost] Strings tagged with their character set

26 Sep 2007

I think we could use the locale/code conversion functionality available 
in the standard I/O streams library to minimize the amount of new code 
needed and to make it more, well, standard. In general, I'd expect most 
code conversions to be occurring during I/O anyway (exceptions to this 
could probably be handled using stringstreams). Appendix D of "The C++ 
Programming Language" has a fair amount of information on the topic 
(online here: http://www.research.att.com/~bs/3rd_loc0.html )

The I/O streams' code conversion (through std::codecvt) can potentially 
convert between any two encodings/character sets, assuming code is 
written for that particular conversion. std::codecvt takes 3 template 
parameters: the internal character type, the external character type, and 
a "state" type, which can also serve to identify the conversion scheme. 
We could specialize this so that it effectively takes 4 parameters, 
replacing the single conversion scheme with a pair: one from the internal 
encoding to the character set itself, and one from the character set to 
the external encoding. So something like this:

   std::codecvt< utf16,utf8,pair<utf16_to_ucs4,ucs4_to_utf8> >

would convert an internal UTF-16 encoding of a string to an external 
UTF-8 encoding.
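
To make the two-step idea concrete, here is a minimal sketch of the pair 
of conversions as plain functors, outside of std::codecvt. The names 
utf16_to_ucs4 and ucs4_to_utf8 come from the example above; their 
implementations and the two_step_converter wrapper are only assumptions 
for illustration, not a proposed interface.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical first step: internal encoding (UTF-16) to the character set (UCS-4).
struct utf16_to_ucs4 {
    std::u32string operator()(const std::u16string& in) const {
        std::u32string out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            char32_t u = in[i];
            if (u >= 0xD800 && u <= 0xDBFF) {  // high surrogate: needs a partner
                if (i + 1 >= in.size())
                    throw std::runtime_error("truncated surrogate pair");
                char32_t lo = in[++i];
                if (lo < 0xDC00 || lo > 0xDFFF)
                    throw std::runtime_error("invalid low surrogate");
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
            } else if (u >= 0xDC00 && u <= 0xDFFF) {
                throw std::runtime_error("unpaired low surrogate");
            }
            out.push_back(u);
        }
        return out;
    }
};

// Hypothetical second step: character set (UCS-4) to the external encoding (UTF-8).
struct ucs4_to_utf8 {
    std::string operator()(const std::u32string& in) const {
        std::string out;
        for (char32_t c : in) {
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else if (c < 0x800) {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out += static_cast<char>(0xE0 | (c >> 12));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else if (c <= 0x10FFFF) {
                out += static_cast<char>(0xF0 | (c >> 18));
                out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else {
                throw std::runtime_error("code point out of range");
            }
        }
        return out;
    }
};

// Hypothetical wrapper that chains the pair; it assumes the external
// encoding is stored in a std::string.
template <class First, class Second>
struct two_step_converter {
    First first;
    Second second;

    template <class Input>
    std::string operator()(const Input& in) const {
        return second(first(in));
    }
};
```

With this, two_step_converter<utf16_to_ucs4, ucs4_to_utf8> applied to a 
std::u16string yields the UTF-8 bytes. Going through the character set in 
the middle means N encodings need 2N functors rather than N*N.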

However, an I/O stream can only have one codecvt instance at a time (via 
imbuing a locale), so this raises the question of how we should handle 
streaming out two Unicode strings with different encodings.

On a different note, does anyone see a practical use in having (mutable) 
strings with variable-width character encodings? I can't think of one 
that wouldn't be equally well served by an array of bytes (like the 
email MIME-type example).

As for run-time tagging of strings, I doubt it would work very well, 
since it would be difficult to extend a run-time tagged string class to 
handle new encodings/character sets.
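
For comparison, here is a rough sketch of what compile-time tagging could 
look like. Everything in it (tagged_string, the *_tag types, unit_type, 
name()) is a hypothetical interface made up for illustration, not the 
design under discussion.

```cpp
#include <string>
#include <utility>

// Hypothetical tag types: each names a character set/encoding and the
// code unit type used to store it.
struct utf8_tag {
    typedef char unit_type;
    static const char* name() { return "UTF-8"; }
};
struct utf16_tag {
    typedef char16_t unit_type;
    static const char* name() { return "UTF-16"; }
};

// A string whose encoding is part of its type.
template <class Charset>
class tagged_string {
public:
    typedef typename Charset::unit_type unit_type;
    typedef std::basic_string<unit_type> storage_type;

    tagged_string() {}
    explicit tagged_string(storage_type s) : data_(std::move(s)) {}

    const storage_type& units() const { return data_; }
    static const char* charset_name() { return Charset::name(); }

    // Appending only compiles when both sides carry the same tag.
    tagged_string& operator+=(const tagged_string& other) {
        data_ += other.data_;
        return *this;
    }

private:
    storage_type data_;
};

// A user can add an encoding by writing a new tag; tagged_string itself
// does not need to change.
struct latin1_tag {
    typedef char unit_type;
    static const char* name() { return "ISO-8859-1"; }
};
```

Here tagged_string<utf8_tag> and tagged_string<latin1_tag> are distinct 
types, so mixing them by accident is a compile error, and supporting a 
new encoding means adding a tag rather than editing a central class.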

- James

Phil Endecott wrote:
...
I would definitely encourage breaking the work up into smaller chunks.  
IMHO "smaller is better" for Boost libraries; there have been a number 
of occasions when I've discovered that a feature I want is hidden as an 
internal component of a Boost library, and I've felt that it should 
have been a stand-alone public entity.  So let's think about how this 
work can be split up:
- A charset_trait class.  I have started on this.  The missing piece is 
a way to look up traits of character sets that are known at run-time; 
input would be appreciated.
- Compile-time and run-time tagged strings.  The basics of this are 
straightforward and done.
- Conversions.  My approach at present is to use iconv via a functor 
that I wrote a while ago.  I believe iconv is widely available; 
however, some implementations may support only a small set of character 
sets.  Alternatives would be interesting.
- Variable width iterators, including the issue that you raised above.
- Interaction with locales, internationalisation, and system APIs.
and no doubt more.  Thinking about the interfaces between these areas 
and the user would be a good place to start.
Regards,
Phil.

James Porter