Re: [boost] Comment on string / unicode discussion

Sean Parent wrote:
> What we do need - are good standard algorithms which can be applied to any string class.

Agreed. This is where Boost.StringAlgorithms comes in.

What I am interested in is *efficient* codepage -> codepage conversion. For example, I may want to read a file in that is stored as 8-bit ASCII as UTF8. Likewise, I may want to take UTF8 data and save it as UTF16, MacRoman or some other encoding.

What you need is an encoding -> UTF32 converter and a UTF32 -> encoding converter. Ideally, I would like each of these to be as efficient as possible.

They should also be able to accept partial data. That is, if I am reading in a UTF8 file in blocks, a block boundary can fall in the middle of a character. With a random access stream this can be solved by seeking back to the start of that character, but seeking is not always possible.

Consider:

    std::basic_ostringstream< uchar32_t > utf;
    utf.imbue( std::locale( utf.getloc(), new utf8_to_utf32_cvt )); // not sure on exact code here
    std::string utf8 = some_utf8_data();
    std::copy( utf8.begin(), utf8.end(), stream_inserter( utf ));

The problem with this is that the stringstream's character type is uchar32_t, but its *input* character type is char.

The conversion mappings are of the form:

    n source characters -> m destination characters

where these may be encoded byte sequences (e.g. UTF8), surrogate pairs (e.g. UTF16) or combining characters (e.g. a + umlaut).

I am not sure how well suited locales are to this kind of functionality, or how well suited C++ streams are. However, it would be nice to have a stream interface (i.e. << and >>).

- Reece
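
To make the "partial data" requirement concrete, here is a minimal sketch of a UTF8 -> UTF32 decoder that keeps its state between blocks, so a character split across two reads is still decoded correctly. The names utf8_decoder, decode_in_blocks and replacement_char are made up for this illustration; none of them come from Boost or the standard library, and this is not tuned for speed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical names throughout; this is an illustration, not a library API.
const std::uint32_t replacement_char = 0xFFFD; // U+FFFD marks malformed input

class utf8_decoder
{
public:
    utf8_decoder() : code_( 0 ), min_( 0 ), pending_( 0 ) {}

    // Decode one block, appending every code point completed so far to `out`.
    // An unfinished multi-byte sequence is remembered until the next call.
    void feed( const char * data, std::size_t size, std::vector< std::uint32_t > & out )
    {
        for( std::size_t i = 0; i != size; ++i )
        {
            const unsigned char b = static_cast< unsigned char >( data[ i ] );
            if( pending_ > 0 )
            {
                if(( b & 0xC0 ) == 0x80 ) // continuation byte: 10xxxxxx
                {
                    code_ = ( code_ << 6 ) | ( b & 0x3Fu );
                    if( --pending_ == 0 )
                        out.push_back( valid() ? code_ : replacement_char );
                    continue;
                }
                // Sequence was cut short: report it, then treat b as a new lead byte.
                pending_ = 0;
                out.push_back( replacement_char );
            }
            if( b < 0x80 )                 out.push_back( b ); // ASCII
            else if(( b & 0xE0 ) == 0xC0 ) start( b & 0x1Fu, 1, 0x80 );
            else if(( b & 0xF0 ) == 0xE0 ) start( b & 0x0Fu, 2, 0x800 );
            else if(( b & 0xF8 ) == 0xF0 ) start( b & 0x07u, 3, 0x10000 );
            else                           out.push_back( replacement_char ); // stray or invalid byte
        }
    }

    // True if the data seen so far ends in the middle of a character.
    bool incomplete() const { return pending_ > 0; }

private:
    void start( std::uint32_t bits, int count, std::uint32_t min )
    {
        code_ = bits; pending_ = count; min_ = min;
    }

    // Reject overlong forms, surrogates and values above U+10FFFF.
    bool valid() const
    {
        return code_ >= min_ && code_ <= 0x10FFFF
            && !( code_ >= 0xD800 && code_ <= 0xDFFF );
    }

    std::uint32_t code_;    // code point assembled so far
    std::uint32_t min_;     // smallest value allowed for this sequence length
    int           pending_; // continuation bytes still expected
};

// Simulate reading a source in fixed-size blocks.
std::vector< std::uint32_t > decode_in_blocks( const std::string & utf8, std::size_t block_size )
{
    assert( block_size > 0 );
    utf8_decoder decoder;
    std::vector< std::uint32_t > out;
    for( std::size_t pos = 0; pos < utf8.size(); pos += block_size )
    {
        const std::size_t n = std::min( block_size, utf8.size() - pos );
        decoder.feed( utf8.data() + pos, n, out );
    }
    if( decoder.incomplete()) out.push_back( replacement_char ); // input ended mid-character
    return out;
}
```

The only state carried across blocks is three small integers, so no seeking is needed. A UTF32 -> encoding converter could be written in the same style for the other half of the pipeline.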

Reece Dunn wrote:
> What I am interested in is *efficient* codepage -> codepage conversion. I may want to read a file in that is stored as 8-bit ASCII
8-bit ASCII doesn't exist. ASCII is defined on 7 bits.
> as UTF8.
ASCII is already valid UTF-8, so no conversion is needed there.
> What you need is an encoding -> UTF32 converter and a UTF32 -> encoding converter.
Honestly, I don't see what's so good about UTF-32. Yes, it has a fixed size per code point, but it wastes memory; usually a bidirectional iterator is all you need to manipulate a string, so UTF-8 seems like a more interesting base.
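
A rough sketch of what bidirectional iteration over UTF-8 can look like: every continuation byte has the form 10xxxxxx, so stepping forward or backward is just a matter of skipping those bytes until a lead byte is found. The names utf8_pos, next_char, prev_char, count_chars and reverse_chars are invented for this example, and the code assumes the string is already well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; input is assumed to be well-formed UTF-8.
typedef std::string::const_iterator utf8_pos;

// Continuation bytes are the ones of the form 10xxxxxx.
inline bool is_continuation( char c )
{
    return ( static_cast< unsigned char >( c ) & 0xC0 ) == 0x80;
}

// Move forward to the first byte of the next character.
inline utf8_pos next_char( utf8_pos pos, utf8_pos last )
{
    assert( pos != last );
    ++pos;
    while( pos != last && is_continuation( *pos )) ++pos;
    return pos;
}

// Move backward to the first byte of the previous character.
inline utf8_pos prev_char( utf8_pos pos, utf8_pos first )
{
    assert( pos != first );
    --pos;
    while( pos != first && is_continuation( *pos )) --pos;
    return pos;
}

// Count characters (code points) rather than bytes.
inline std::size_t count_chars( const std::string & s )
{
    std::size_t n = 0;
    for( utf8_pos p = s.begin(); p != s.end(); p = next_char( p, s.end())) ++n;
    return n;
}

// Reverse the order of the characters without breaking multi-byte sequences.
inline std::string reverse_chars( const std::string & s )
{
    std::string result;
    utf8_pos end = s.end();
    while( end != s.begin())
    {
        const utf8_pos start = prev_char( end, s.begin());
        result.append( start, end );
        end = start;
    }
    return result;
}
```

Neither direction needs random access or a separate index, which is the sense in which a bidirectional iterator is enough here. Note that this walks code points only; combining sequences such as a + umlaut would still need handling on top.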
participants (2): loufoque, Reece Dunn