
Joseph Gauterin wrote:
Making the iterator a byte iterator, not a code point iterator, pushes the responsibility for knowing how to handle the variable widthness of the different encodings back onto the user.
Indeed, and smart users might sometimes prefer to take that responsibility. For example, if I want to break up a lump of UTF-8 text into lines at each \n, I can just treat it as bytes and look for \n, since \n never occurs inside a multibyte character in UTF-8. As another example, an XML parser can exploit this when looking for its various punctuation characters.

A UTF-8 character iterator has the overhead of determining each character's width, and variable-width iterator operations like operator- are not O(1), so having the option to use a byte iterator could be a significant performance win. Of course you could just use a vector<char> or similar when you want to do this sort of thing, but that's not great if you want to mix and match byte and character operations without copying the whole string.

I'm wondering about offering distinct "unit" (e.g. byte) and "character" types in the charset_traits class, and providing separate unit_iterator and character_iterator types and operations. Or maybe the character_iterators are best provided by some sort of "adapter" layer?
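To make the byte-versus-character distinction concrete, here is a rough sketch of both views over the same std::string. Nothing here is the proposed library interface: split_lines_bytes, utf8_char_iterator and char_distance are names invented for illustration, and the adapter assumes well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// "Unit" view: find line boundaries by scanning raw bytes. Every byte of a
// multibyte UTF-8 sequence has its high bit set, so 0x0A is always a real '\n'.
std::vector<std::string> split_lines_bytes(const std::string& utf8) {
    std::vector<std::string> lines;
    std::size_t start = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (utf8[i] == '\n') {
            lines.push_back(utf8.substr(start, i - start));
            start = i + 1;
        }
    }
    if (start < utf8.size()) lines.push_back(utf8.substr(start));
    return lines;
}

// "Character" view: a hypothetical adapter layered over a byte iterator.
// Assumes well-formed UTF-8; a real version would validate its input.
class utf8_char_iterator {
public:
    explicit utf8_char_iterator(std::string::const_iterator pos) : pos_(pos) {}

    // Decode the code point starting at the current byte position.
    char32_t operator*() const {
        unsigned char lead = static_cast<unsigned char>(*pos_);
        std::size_t width = width_of(lead);
        char32_t cp = (width == 1) ? lead : (lead & (0xFF >> (width + 1)));
        for (std::size_t i = 1; i < width; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3F);
        return cp;
    }

    // Advancing has to inspect the lead byte to learn the character width.
    utf8_char_iterator& operator++() {
        pos_ += width_of(static_cast<unsigned char>(*pos_));
        return *this;
    }

    bool operator==(const utf8_char_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const utf8_char_iterator& o) const { return pos_ != o.pos_; }

    // Drop back to the byte level without copying anything.
    std::string::const_iterator base() const { return pos_; }

private:
    static std::size_t width_of(unsigned char lead) {
        assert((lead & 0xC0) != 0x80 && "positioned on a continuation byte");
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }
    std::string::const_iterator pos_;
};

// Character distance has to walk the sequence, so it is O(n),
// whereas the distance between two byte iterators is O(1).
std::size_t char_distance(utf8_char_iterator first, utf8_char_iterator last) {
    std::size_t n = 0;
    for (; first != last; ++first) ++n;
    return n;
}
```

The base() member is one possible way to mix the two levels: scan for punctuation by bytes, then wrap the resulting position in a character iterator only where decoding is actually needed.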
IIRC, iconv is licensed under the GPL
The iconv API is a POSIX and SUS standard. There is an implementation in glibc, which is LGPLed; I believe that other OSes have their own implementations (including BSD-licensed ones). I thought that it had been included in Windows since NT, but Google tells me I'm wrong.

We would certainly want a conversion interface that could be adapted to std::codecvt, iconv, recode (which is a GNU-only thing), ICU, etc. I have already written functor wrappers for iconv and recode which work like this:

    Iconver latin1_to_utf8("latin1","utf8");
    utf8string s = latin1_to_utf8(x);

The functor can store any state needed for variable-width charsets. Iconv takes charset names as char*s; I have put a char* name in my charset_traits class to support this.

Something is needed to indicate the policy for conversion problems, e.g. throw or insert '?' when there is no corresponding character in the target charset. How compatible could this be made with codecvt and ICU?

Thanks for the many replies. Do keep posting. I'm not going to try to keep up with replies to everything, though; I'm going to try to write some code!

Regards, Phil.
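For readers who have not used iconv directly, the following is a minimal sketch of what a functor wrapper along the lines of the Iconver mentioned above might look like. It is not the actual code referred to in this message: the on_error policy enum is a hypothetical stand-in for the "throw or insert '?'" choice, plain std::string replaces utf8string, and the const_cast assumes the glibc-style signature in which iconv takes a char** input pointer.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical policy for characters that cannot be converted.
enum class on_error { throw_exception, substitute };

class Iconver {
public:
    // Note the argument order: (from, to) here, but iconv_open takes (to, from).
    Iconver(const char* from, const char* to,
            on_error policy = on_error::throw_exception)
        : cd_(iconv_open(to, from)), policy_(policy) {
        if (cd_ == (iconv_t)(-1))
            throw std::runtime_error("iconv_open failed");
    }
    ~Iconver() { iconv_close(cd_); }
    Iconver(const Iconver&) = delete;
    Iconver& operator=(const Iconver&) = delete;

    std::string operator()(const std::string& in) {
        // Reset the descriptor to its initial shift state before each string.
        iconv(cd_, nullptr, nullptr, nullptr, nullptr);

        std::string out;
        char buf[256];
        char* src = const_cast<char*>(in.data());  // assumes char** signature
        std::size_t src_left = in.size();

        while (src_left > 0) {
            char* dst = buf;
            std::size_t dst_left = sizeof buf;
            std::size_t rc = iconv(cd_, &src, &src_left, &dst, &dst_left);
            int err = errno;  // save before anything else can overwrite it
            out.append(buf, sizeof buf - dst_left);

            if (rc != (std::size_t)(-1) || err == E2BIG)
                continue;  // finished, or output buffer full: go round again
            if (policy_ == on_error::substitute && err == EILSEQ) {
                // Simplification: skips one input unit and assumes '?' is
                // valid in the target charset. A multibyte source would need
                // to skip the whole offending character.
                out += '?';
                ++src;
                --src_left;
                continue;
            }
            throw std::runtime_error("iconv conversion failed");
        }

        // Flush any closing shift sequence for stateful target encodings.
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        iconv(cd_, nullptr, nullptr, &dst, &dst_left);
        out.append(buf, sizeof buf - dst_left);
        return out;
    }

private:
    iconv_t cd_;
    on_error policy_;
};
```

With this sketch, Iconver latin1_to_utf8("ISO-8859-1", "UTF-8"); followed by latin1_to_utf8(x) mirrors the usage shown above. The charset names are passed straight through to iconv_open, so which spellings are accepted depends on the iconv implementation in use.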