
Dear All,

Something that I have been thinking about for a while is storing strings tagged with their character set. Since I now have a practical need for this, I plan to try to implement something. Your feedback would be appreciated.

The starting point is the idea that the character set of a string may be known at compile time or at run time, and so two types of tagging are possible. First, compile-time tagging:

  template <character_set> class tagged_string { ... };

  tagged_string<utf8> s1;
  tagged_string<latin1> s2;

Some typedefs would be appropriate:

  typedef tagged_string<utf8> utf8string;

Now run-time tagging:

  class rt_tagged_string {
  private:
    character_set cs;
  public:
    rt_tagged_string(character_set cs_): cs(cs_) ...
    ...
  };

  rt_tagged_string s3(utf8);

(Concise-yet-clear names for any of these classes would be great.)

I propose to implement conversion between the strings using iconv and/or GNU recode. It would be easy to allow this conversion to happen invisibly, but it might be wiser to make conversion explicit.

I'm not sure what the 'character_set' that I've used above should be. It needs to be some sort of user-extensible enum or type-tag.

We need character types of 8, 16 and 32 bits. wchar_t is not useful here because it's not defined whether it's 16 or 32 bits. So I propose the following, modelled after cstdint:

  typedef char char8_t;
  typedef <implementation-defined> char16_t;
  typedef <implementation-defined> char32_t;

I then propose a character_set_traits class:

  template <character_set> class character_set_traits;

  template <> class character_set_traits<utf8> {
    typedef char8_t char_t;
    static const bool variable_width = true;
    ...
  };

For the fixed-width, compile-time-tagged strings I think it makes sense to inherit from std::basic_string< character_set_traits<charset>::char_t >. The only problem I can see with this is that

  latin1string s1 = "hello world";
  s1.substr(1,5)   <--- this returns a std::string, not a latin1string

If latin1string has a constructor from std::string (which is its own base type) that's fine, i.e. we can still write

  latin1string s2 = s1.substr(1,5);

but unfortunately we can also write

  latin2string s3 = s1.substr(1,5);

which is not so good. So a different approach is to define a set of character-set-specific character types, and build string types from them:

  typedef char8_t latin1char;
  typedef char8_t latin2char;

For variable-width character sets, the methods of std::string are less useful (though far from useless). I understand that there's already a UTF-8 iterator somewhere in Boost; can it help? For run-time character sets, is there any way to provide e.g. run-time iterators?

I imagine these strings being used as follows:

- Input to the program is either run-time or compile-time tagged with any character set.
- Data that is not manipulated in any way is just passed through.
- Data that will be processed is first converted to a suitable, compile-time-tagged character set, and if appropriate converted back afterwards.

So the absence of (useful) string operations on run-time-tagged or variable-width character set data is not a problem.

For conversions, there is the question of partial characters in variable-width character sets. If a program is processing data in chunks, it may be legitimate for a chunk boundary to fall in the middle of a UTF-8 character.
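To make that concrete, here is a tiny self-contained sketch (the string and the chunk size are arbitrary, chosen only for illustration):

  #include <string>
  #include <cassert>

  int main() {
    std::string text = "caf\xC3\xA9";        // "café" in UTF-8: 5 bytes but only 4 characters
    std::string chunk1 = text.substr(0, 4);  // "caf" plus 0xC3, the lead byte of the é
    std::string chunk2 = text.substr(4);     // 0xA9, the continuation byte of the é
    assert(chunk1.size() == 4 && chunk2.size() == 1);
    // chunk1 on its own is not valid UTF-8: it ends in the middle of a
    // character, so a chunk-at-a-time converter has to remember the 0xC3.
  }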
IIRC, iconv has a way of dealing with this which we could expose in a stateful converter:

  charset_converter utf8_to_ucs4(utf8, ucs4);
  while (!eof) {
    utf8string s = get_chunk();
    ucs4string t = utf8_to_ucs4(s);
    send_chunk(t);
  }
  utf8_to_ucs4.flush();

- but many applications may only need a stateless converter.

I will be working on this over the next couple of weeks, so any feedback would be much appreciated.

Regards,

Phil.
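P.S. For what it's worth, here is a rough sketch of how such a stateful charset_converter might sit on top of iconv. It is not a proposed interface: I have used plain std::string and iconv's textual encoding names just to keep it self-contained, and I am assuming the POSIX prototype where iconv() takes char** for the input pointer. The key point is that iconv reports an incomplete multi-byte sequence at the end of the input buffer with EINVAL, which lets us carry the partial character over to the next chunk:

  #include <iconv.h>
  #include <cerrno>
  #include <string>
  #include <stdexcept>

  class charset_converter {
    iconv_t cd;
    std::string pending;  // bytes of an incomplete character left over from the previous chunk
  public:
    charset_converter(const char* from, const char* to): cd(iconv_open(to, from)) {
      if (cd == (iconv_t)(-1)) throw std::runtime_error("conversion not supported");
    }
    ~charset_converter() { iconv_close(cd); }

    std::string operator()(const std::string& chunk) {
      std::string in = pending + chunk;
      pending.clear();
      std::string out;
      if (in.empty()) return out;
      char* inptr = &in[0];
      size_t inleft = in.size();
      while (inleft > 0) {
        char buf[4096];
        char* outptr = buf;
        size_t outleft = sizeof(buf);
        size_t r = iconv(cd, &inptr, &inleft, &outptr, &outleft);
        out.append(buf, outptr - buf);
        if (r == (size_t)(-1)) {
          if (errno == EINVAL) {            // partial character at end of input:
            pending.assign(inptr, inleft);  // save it for the next chunk
            break;
          }
          if (errno == E2BIG) continue;     // output buffer full: go round again
          throw std::runtime_error("invalid input sequence");  // EILSEQ
        }
      }
      return out;
    }

    void flush() {
      // At end of input there should be no half-converted character left over.
      if (!pending.empty()) throw std::runtime_error("incomplete character at end of input");
    }
  };

Usage would then be something like

  charset_converter utf8_to_ucs4("UTF-8", "UCS-4");

- and a stateless convenience function could simply construct one of these, convert, and flush. (A real version would also need to deal with copying the iconv_t, shift states in stateful encodings, and the output character type, all of which I have glossed over here.)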