Re: [boost] Strings tagged with their character set

26 Sep 2007

Joseph Gauterin wrote:
...
> Making the iterator a byte iterator, not a code point iterator, pushes
> the responsibility for knowing how to handle the variable widthness of
> the different encodings back onto the user.

Indeed, and smart users might prefer to take that responsibility 
sometimes.  For example, if I want to break up a lump of UTF8 text into 
lines at each \n then I can just treat it as bytes and look for \n, 
since the \n byte never occurs inside a multibyte character in UTF8.  As 
another example, an XML parser can exploit this when looking for its 
various punctuation characters.  Because a UTF8 character iterator has 
the overhead of determining each character's width, and because 
variable-width iterator operations like operator- are not O(1), having 
the option to use a byte iterator could be a significant performance help.
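
To make the line-splitting case concrete, here is a minimal sketch that scans UTF8 text purely as bytes.  The function name split_lines_bytes is made up for illustration, and std::string stands in for whatever tagged string type ends up being used.  It relies only on the fact that every byte of a multibyte UTF8 sequence has its top bit set, so the byte 0x0A can only ever be a real \n.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split UTF-8 text at '\n' by scanning bytes only.
// No character decoding is needed, because lead and continuation bytes of
// multibyte sequences are all >= 0x80 and so can never equal '\n' (0x0A).
std::vector<std::string> split_lines_bytes(const std::string& utf8)
{
    std::vector<std::string> lines;
    std::string::const_iterator start = utf8.begin();
    for (;;) {
        // Plain byte search; the cost is independent of character widths.
        std::string::const_iterator nl = std::find(start, utf8.end(), '\n');
        lines.push_back(std::string(start, nl));
        if (nl == utf8.end()) break;
        start = nl + 1;  // skip the '\n' itself
    }
    assert(!lines.empty());  // even empty input yields one (empty) line
    return lines;
}
```

Each returned piece is still valid UTF8 as long as the input was, since the cuts only ever fall on single-byte characters.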

Of course you could just use a vector<char> or similar when you want to 
do this sort of thing, but that's not great if you want to 
mix-and-match byte and character operations without copying the whole string.

I'm wondering about offering distinct "unit" (e.g. byte) and 
"character" types in the charset_traits class, and providing separate 
unit_iterator and character_iterator types and operations.  Or maybe 
the character_iterators are best provided by some sort of "adapter" layer?
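
One way the unit/character split might look, as a rough sketch only: the names utf8_tag, unit_type, char_type, width, decode, character_iterator and count_characters are all assumptions made up for this example, not an existing interface.  The traits class carries both types plus a charset name, and the character iterator is an adapter over any unit iterator.  It assumes well-formed UTF8 and only supports forward traversal.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

struct utf8_tag {};  // hypothetical charset tag

template <typename Charset> struct charset_traits;

template <> struct charset_traits<utf8_tag> {
    typedef char     unit_type;  // storage unit (a byte)
    typedef char32_t char_type;  // decoded code point
    static const char* name() { return "utf8"; }

    // Number of units in the character that starts with this lead unit.
    static std::size_t width(unit_type u) {
        unsigned char b = static_cast<unsigned char>(u);
        assert((b & 0xC0) != 0x80);  // must not be a continuation byte
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        return 4;
    }

    // Decode the character starting at p (input assumed well-formed).
    template <typename UnitIt>
    static char_type decode(UnitIt p) {
        std::size_t n = width(*p);
        unsigned char lead = static_cast<unsigned char>(*p);
        char_type cp = (n == 1) ? lead : (lead & (0xFF >> (n + 1)));
        for (std::size_t i = 1; i < n; ++i) {
            ++p;
            cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3F);
        }
        return cp;
    }
};

// Adapter: walks characters on top of any forward unit iterator.
template <typename Charset, typename UnitIt>
class character_iterator {
    typedef charset_traits<Charset> traits;
public:
    explicit character_iterator(UnitIt pos) : pos_(pos) {}
    typename traits::char_type operator*() const { return traits::decode(pos_); }
    character_iterator& operator++() {
        typedef typename std::iterator_traits<UnitIt>::difference_type diff_t;
        std::advance(pos_, static_cast<diff_t>(traits::width(*pos_)));
        return *this;
    }
    bool operator==(const character_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const character_iterator& o) const { return pos_ != o.pos_; }
    UnitIt base() const { return pos_; }  // drop back to the unit level
private:
    UnitIt pos_;
};

// Counting characters is O(n); s.size() (units) stays O(1).
template <typename Charset>
std::size_t count_characters(const std::string& s) {
    character_iterator<Charset, std::string::const_iterator>
        it(s.begin()), end(s.end());
    std::size_t n = 0;
    for (; it != end; ++it) ++n;
    return n;
}
```

The base() member is what allows byte and character operations to be mixed on the same string without copying: find a position at one level, then continue at the other.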
...
> IIRC, iconv is licensed under the GPL

The iconv API is a POSIX and SUS standard.  There is an implementation 
in glibc, which is LGPLed; I believe that other OSes have their own 
implementations (including BSD-licensed ones).  I thought that it was 
included in Windows since NT but Google tells me I'm wrong.

We would certainly want a conversion interface that could be adapted to 
std::codecvt, iconv, recode (which is a GNU-only thing), icu, etc.  I 
have already written functor wrappers for iconv and recode which work 
like this:

Iconver latin1_to_utf8("latin1","utf8");
utf8string s = latin1_to_utf8(x);

The functor can store any state for variable-width charsets.  Iconv 
takes charset names as char*s; I have put a char* name in my 
charset_traits class to support this.  Something is needed to indicate 
policy for conversion problems, e.g. throw or insert '?' when there is 
no corresponding character in the target charset.  How compatible could 
this be made with codecvt and icu?
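
For illustration, a functor of this shape with an explicit error policy could be sketched on top of the POSIX iconv calls as follows.  This is not the wrapper described above: the OnError enum, the use of std::string instead of tagged string types, and the '?' handling are all assumptions.  In particular the substitute path skips a single input unit and emits a literal '?' byte, which is only sensible for single-byte source charsets and ASCII-compatible targets.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

enum class OnError { Throw, Substitute };  // hypothetical policy type

class Iconver {
public:
    Iconver(const char* from, const char* to, OnError policy = OnError::Throw)
        : cd_(iconv_open(to, from)), policy_(policy)  // note: (to, from) order
    {
        if (cd_ == reinterpret_cast<iconv_t>(-1))
            throw std::runtime_error("iconv_open failed");
    }
    ~Iconver() { iconv_close(cd_); }
    Iconver(const Iconver&) = delete;
    Iconver& operator=(const Iconver&) = delete;

    std::string operator()(const std::string& in)
    {
        iconv(cd_, nullptr, nullptr, nullptr, nullptr);  // reset shift state
        std::string out;
        char buf[256];
        // glibc declares the input pointer as char**; iconv does not modify
        // the input bytes. Some other platforms use const char** here.
        char* src = const_cast<char*>(in.data());
        std::size_t src_left = in.size();
        while (src_left > 0) {
            char* dst = buf;
            std::size_t dst_left = sizeof buf;
            std::size_t r = iconv(cd_, &src, &src_left, &dst, &dst_left);
            int err = errno;
            out.append(buf, sizeof buf - dst_left);
            if (r != static_cast<std::size_t>(-1)) continue;
            if (err == E2BIG) {  // buffer full: it was drained, go round again
                assert(dst_left < sizeof buf);  // must have made progress
                continue;
            }
            // EILSEQ or EINVAL. On glibc an unrepresentable character is
            // also reported as EILSEQ; other implementations may differ.
            if (policy_ == OnError::Throw)
                throw std::runtime_error("iconv: cannot convert input");
            out += '?';
            if (err == EINVAL) break;  // truncated sequence at end of input
            ++src;                     // skip one offending unit
            --src_left;
        }
        // Flush any pending shift sequence for stateful target charsets.
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        iconv(cd_, nullptr, nullptr, &dst, &dst_left);
        out.append(buf, sizeof buf - dst_left);
        return out;
    }

private:
    iconv_t cd_;      // conversion descriptor; holds per-conversion state
    OnError policy_;
};
```

Making the policy a constructor argument keeps the call site identical to the two-line usage above; a codecvt- or icu-backed functor could in principle expose the same operator() and policy argument.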

Thanks for the many replies.  Do keep posting.  I'm not going to try to 
keep up with replies to everything, though; I'm going to try and write 
some code!

Regards,

Phil.

Phil Endecott