New subject: Comment on string / unicode discussion

6 Jul 2006

      Sean Parent wrote:
...
What we do need - are good standard algorithms which can be applied
to any string class.

Agreed. This is where Boost.StringAlgorithms comes in.

What I am interested in is *efficient* codepage -> codepage conversion.
For example, I may want to read in a file that is stored as 8-bit ASCII
and convert it to UTF8. Likewise, I may want to take UTF8 data and save
it as UTF16, MacRoman or some other encoding.

What you need is an encoding -> UTF32 converter and a UTF32 ->
encoding converter. Ideally, I would like each of these to be as
efficient as possible. They should also be able to accept partial data.
That is, if I am reading in a UTF8 file in blocks, it is possible to hit
the middle of a character. This can be solved in a random access
stream by seeking to the previous character, but this is not always
possible. Consider:

   std::basic_ostringstream< uchar32_t > utf;
   utf.imbue( std::locale( utf.getloc(), new utf8_to_utf32_cvt )); // not sure on exact code here

   std::string utf8 = some_utf8_data();
   std::copy( utf8.begin(), utf8.end(), stream_inserter( utf ));

The problem with this is that the stringstream's character type is
uchar32_t, but the data being written to it has an *input* character
type of char.
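
To make the partial-data case concrete, here is a rough sketch of a decoder
that keeps its state between blocks, so a block may end in the middle of a
character. This is only an illustration: utf8_decoder, feed() and pending()
are names I have made up (they are not Boost or standard library APIs), it
uses C++11's char32_t in place of uchar32_t, and it does not reject overlong
or out-of-range sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical incremental UTF-8 -> UTF-32 decoder (illustration only).
// State is carried between calls to feed(), so input may be split anywhere.
class utf8_decoder
{
public:
   utf8_decoder() : value_( 0 ), remaining_( 0 ) {}

   // Decode len bytes, appending each completed code point to out.
   void feed( const char * data, std::size_t len, std::u32string & out )
   {
      for( std::size_t i = 0; i != len; ++i )
      {
         unsigned char b = static_cast< unsigned char >( data[ i ]);
         if( remaining_ != 0 )
         {
            if(( b & 0xC0 ) == 0x80 ) // continuation byte
            {
               value_ = ( value_ << 6 ) | ( b & 0x3F );
               if( --remaining_ == 0 ) out.push_back( value_ );
               continue;
            }
            // Sequence cut short: emit U+FFFD, then treat b as a new lead byte.
            out.push_back( replacement );
            remaining_ = 0;
         }
         if( b < 0x80 )                 out.push_back( b );
         else if(( b & 0xE0 ) == 0xC0 ) start( b & 0x1F, 1 );
         else if(( b & 0xF0 ) == 0xE0 ) start( b & 0x0F, 2 );
         else if(( b & 0xF8 ) == 0xF0 ) start( b & 0x07, 3 );
         else                           out.push_back( replacement );
      }
   }

   // True if the input so far ends in the middle of a character.
   bool pending() const { return remaining_ != 0; }

private:
   static const char32_t replacement = 0xFFFD;

   void start( unsigned char bits, int continuation_bytes )
   {
      value_     = bits;
      remaining_ = continuation_bytes;
   }

   char32_t value_;     // code point bits accumulated so far
   int      remaining_; // continuation bytes still expected
};
```

With this, feeding "\xE2\x82" and then "\xAC" as two separate blocks should
give the single code point U+20AC, with pending() returning true in between.
No seeking back is needed, so it would also work on a non-seekable stream.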

The conversion mappings are of the form:

   n source characters -> m destination characters

where these may be encoded byte sequences (e.g. UTF8),
surrogate pairs (e.g. UTF16) or combining characters (e.g.
a + umlaut).
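
As a small illustration of the 1 -> m case, here is a sketch of the UTF32 ->
UTF16 direction, where one source character becomes either one code unit or
a surrogate pair. append_utf16 is a made-up helper name, not an existing
library function, and it uses C++11's char32_t and std::u16string.

```cpp
#include <string>

// Hypothetical helper (illustration only): append one code point to a
// UTF-16 string. Returns the number of UTF-16 code units written (1 or 2),
// or 0 if cp is a surrogate or lies beyond U+10FFFF.
inline int append_utf16( char32_t cp, std::u16string & out )
{
   if( cp > 0x10FFFF || ( cp >= 0xD800 && cp <= 0xDFFF )) return 0;

   if( cp < 0x10000 ) // Basic Multilingual Plane: 1 -> 1
   {
      out.push_back( static_cast< char16_t >( cp ));
      return 1;
   }

   cp -= 0x10000;     // supplementary planes: 1 -> 2 (surrogate pair)
   out.push_back( static_cast< char16_t >( 0xD800 + ( cp >> 10 )));
   out.push_back( static_cast< char16_t >( 0xDC00 + ( cp & 0x3FF )));
   return 2;
}
```

For example, U+20AC should map to the single unit 0x20AC, while U+1F600
should map to the pair 0xD83D 0xDE00.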

I am not sure how well suited locales are to this kind of functionality,
or how well suited C++ streams are. However, it would be nice to have a
stream interface (i.e. << and >>).
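
One possible shape for the << side, sidestepping locales entirely, is a thin
wrapper that takes UTF32 code points and writes UTF8 bytes to an ordinary
std::ostream. This is only a sketch under my own assumptions: utf8_writer
and to_utf8 are invented names, and invalid code points are replaced with
U+FFFD rather than reported.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical wrapper (illustration only): accepts UTF-32 code points via
// << and writes the UTF-8 encoding to an underlying narrow stream.
class utf8_writer
{
public:
   explicit utf8_writer( std::ostream & os ) : os_( os ) {}

   utf8_writer & operator<<( char32_t cp )
   {
      if( cp > 0x10FFFF || ( cp >= 0xD800 && cp <= 0xDFFF )) cp = 0xFFFD;

      if( cp < 0x80 )
         put( cp );
      else if( cp < 0x800 )
      {
         put( 0xC0 | ( cp >> 6 ));
         put( 0x80 | ( cp & 0x3F ));
      }
      else if( cp < 0x10000 )
      {
         put( 0xE0 | ( cp >> 12 ));
         put( 0x80 | (( cp >> 6 ) & 0x3F ));
         put( 0x80 | ( cp & 0x3F ));
      }
      else
      {
         put( 0xF0 | ( cp >> 18 ));
         put( 0x80 | (( cp >> 12 ) & 0x3F ));
         put( 0x80 | (( cp >> 6 ) & 0x3F ));
         put( 0x80 | ( cp & 0x3F ));
      }
      return *this;
   }

private:
   // Write the low 8 bits of byte to the underlying stream.
   void put( char32_t byte )
   {
      os_.put( static_cast< char >( static_cast< unsigned char >( byte )));
   }

   std::ostream & os_;
};

// Usage sketch: encode a whole UTF-32 string through the << interface.
inline std::string to_utf8( const std::u32string & text )
{
   std::ostringstream os;
   utf8_writer writer( os );
   for( char32_t cp : text ) writer << cp;
   return os.str();
}
```

The >> side would need the reverse: a reader that pulls bytes from a
std::istream and hands back whole code points, holding on to any incomplete
trailing bytes until more input arrives.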

- Reece