
Dear All,

Something that I have been thinking about for a while is storing strings tagged with their character set. Since I now have a practical need for this, I plan to try to implement something. Your feedback would be appreciated.

The starting point is the idea that the character set of a string may be known at compile time or at run time, and so two types of tagging are possible. First, compile-time tagging:

  template <character_set> class tagged_string { ... };

  tagged_string<utf8> s1;
  tagged_string<latin1> s2;

Some typedefs would be appropriate:

  typedef tagged_string<utf8> utf8string;

Now run-time tagging:

  class rt_tagged_string {
  private:
    character_set cs;
  public:
    rt_tagged_string(character_set cs_): cs(cs_) ...
    ...
  };

  rt_tagged_string s3(utf8);

(Concise-yet-clear names for any of these classes would be great.)

I propose to implement conversion between the strings using iconv and/or GNU recode. It would be easy to allow this conversion to happen invisibly, but it might be wiser to make conversion explicit.

I'm not sure what the 'character_set' that I've used above should be. It needs to be some sort of user-extensible enum or type-tag.

We need character types of 8, 16 and 32 bits. wchar_t is not useful here because it's not defined whether it's 16 or 32 bits. So I propose the following, modelled after cstdint:

  typedef char char8_t;
  typedef <implementation-defined> char16_t;
  typedef <implementation-defined> char32_t;

I then propose a character_set_traits class:

  template <character_set> class character_set_traits;

  template <> class character_set_traits<utf8> {
    typedef char8_t char_t;
    static const bool variable_width = true;
    ...
  };

For the fixed-width, compile-time-tagged strings I think it makes sense to inherit from std::basic_string< character_set_traits<charset>::char_t >. The only problem I can see with this is that

  latin1string s1 = "hello world";
  s1.substr(1,5)   <--- this returns a std::string, not a latin1string

If latin1string has a constructor from std::string (which is its own base type) that's fine, i.e. we can still write

  latin1string s2 = s1.substr(1,5);

but unfortunately we can also write

  latin2string s3 = s1.substr(1,5);

which is not so good. So a different approach is to define a set of character-set-specific character types, and build string types from them:

  typedef char8_t latin1char;
  typedef char8_t latin2char;

For variable-width character sets, the methods of std::string are less useful (though far from useless). I understand that there's already a UTF-8 iterator somewhere in Boost; can it help? For run-time character sets, is there any way to provide e.g. run-time iterators?

I imagine these strings being used as follows:

- Input to the program is either run-time or compile-time tagged with any character set.
- Data that is not manipulated in any way is just passed through.
- Data that will be processed is first converted to a suitable, compile-time-tagged character set, and if appropriate converted back afterwards.

So the absence of (useful) string operations on run-time-tagged or variable-width character set data is not a problem.

For conversions, there is the question of partial characters in variable-width character sets. If a program is processing data in chunks, it may be legitimate for a chunk boundary to fall in the middle of a UTF-8 character.
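To make that concrete, here is a tiny self-contained sketch (the string and the chunk size are arbitrary, chosen only for illustration):

  #include <string>
  #include <cassert>

  int main() {
    std::string text = "caf\xC3\xA9";        // "café" in UTF-8: 5 bytes but only 4 characters
    std::string chunk1 = text.substr(0, 4);  // "caf" plus 0xC3, the lead byte of the é
    std::string chunk2 = text.substr(4);     // 0xA9, the continuation byte of the é
    assert(chunk1.size() == 4 && chunk2.size() == 1);
    // chunk1 on its own is not valid UTF-8: it ends in the middle of a
    // character, so a chunk-at-a-time converter has to remember the 0xC3.
  }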
IIRC, iconv has a way of dealing with this which we could expose in a stateful converter:

  charset_converter utf8_to_ucs4(utf8, ucs4);
  while (!eof) {
    utf8string s = get_chunk();
    ucs4string t = utf8_to_ucs4(s);
    send_chunk(t);
  }
  utf8_to_ucs4.flush();

- but many applications may only need a stateless converter.

I will be working on this over the next couple of weeks, so any feedback would be much appreciated.

Regards,

Phil.
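P.S. For what it's worth, here is a rough sketch of how such a stateful charset_converter might sit on top of iconv. It is not a proposed interface: I have used plain std::string and iconv's textual encoding names just to keep it self-contained, and I am assuming the POSIX prototype where iconv() takes char** for the input pointer. The key point is that iconv reports an incomplete multi-byte sequence at the end of the input buffer with EINVAL, which lets us carry the partial character over to the next chunk:

  #include <iconv.h>
  #include <cerrno>
  #include <string>
  #include <stdexcept>

  class charset_converter {
    iconv_t cd;
    std::string pending;  // bytes of an incomplete character left over from the previous chunk
  public:
    charset_converter(const char* from, const char* to): cd(iconv_open(to, from)) {
      if (cd == (iconv_t)(-1)) throw std::runtime_error("conversion not supported");
    }
    ~charset_converter() { iconv_close(cd); }

    std::string operator()(const std::string& chunk) {
      std::string in = pending + chunk;
      pending.clear();
      std::string out;
      if (in.empty()) return out;
      char* inptr = &in[0];
      size_t inleft = in.size();
      while (inleft > 0) {
        char buf[4096];
        char* outptr = buf;
        size_t outleft = sizeof(buf);
        size_t r = iconv(cd, &inptr, &inleft, &outptr, &outleft);
        out.append(buf, outptr - buf);
        if (r == (size_t)(-1)) {
          if (errno == EINVAL) {            // partial character at end of input:
            pending.assign(inptr, inleft);  // save it for the next chunk
            break;
          }
          if (errno == E2BIG) continue;     // output buffer full: go round again
          throw std::runtime_error("invalid input sequence");  // EILSEQ
        }
      }
      return out;
    }

    void flush() {
      // At end of input there should be no half-converted character left over.
      if (!pending.empty()) throw std::runtime_error("incomplete character at end of input");
    }
  };

Usage would then be something like

  charset_converter utf8_to_ucs4("UTF-8", "UCS-4");

- and a stateless convenience function could simply construct one of these, convert, and flush. (A real version would also need to deal with copying the iconv_t, shift states in stateful encodings, and the output character type, all of which I have glossed over here.)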