Re: [boost] Strings tagged with their character set

26 Sep 2007

Joseph Gauterin wrote:
...
IIRC, some of the non-const std::basic_string methods aren't suitable
for handling variable width encodings like utf8 and utf16 - non-const
operator[] in particular returns a reference to the character type - a
big problem if you want to assign a value > 0x7F (i.e. a character
that uses 2 or more bytes).

Yes, very true.  One option is to convert to a fixed-size character set 
before doing anything like operator[], and to not allow strings of 
variable-width character sets.  If you do want to apply operator[] to a 
UTF8 string, what type should it return?  A reference to a range of 
bytes, somehow?  A proxy that encodes/decodes to a UCS4 character?  Or, 
you could say that the iterator is a byte iterator, not a character 
iterator.  Lots of possibilities.
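
To make two of these options concrete, here is a minimal C++ sketch, not a proposed interface.  The names encode_utf8, utf8_decode_iterator and count_code_points are hypothetical, and the code assumes well-formed UTF-8 input with no validation.  It shows that a code point above 0x7F encodes to more than one byte, so it cannot be assigned through a single char&, and it shows a read-only iterator that decodes to a UCS4 value on dereference:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one UCS4 code point as a UTF-8 byte sequence.
inline std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: inside the Unicode range
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical read-only iterator over the code points of a UTF-8 byte
// sequence.  Assumes well-formed input; performs no validation.
class utf8_decode_iterator {
public:
    explicit utf8_decode_iterator(const char* p) : p_(p) {}

    // Decode the code point that starts at the current byte.
    std::uint32_t operator*() const {
        const unsigned char b0 = static_cast<unsigned char>(p_[0]);
        const std::size_t n = length(b0);
        // Strip the length marker bits from the lead byte.
        std::uint32_t cp = (n == 1) ? b0 : (b0 & (0xFFu >> (n + 1)));
        for (std::size_t i = 1; i < n; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
        return cp;
    }

    // Advance by one code point, which is 1 to 4 bytes.
    utf8_decode_iterator& operator++() {
        p_ += length(static_cast<unsigned char>(*p_));
        return *this;
    }

    bool operator==(const utf8_decode_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const utf8_decode_iterator& o) const { return p_ != o.p_; }

private:
    // Number of bytes in the sequence introduced by this lead byte.
    static std::size_t length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }
    const char* p_;
};

// Count code points (not bytes) by walking the string with the iterator.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    utf8_decode_iterator it(s.data()), end(s.data() + s.size());
    for (; it != end; ++it) ++n;
    return n;
}
```

A writable proxy would be harder than this read-only version, because assigning a code point of a different encoded length has to grow or shrink the underlying byte storage.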
...
I've noticed that there are frequent requests/proposals for some sort
of boost unicode/string encoding library. I've thought about the
problem and it seems too big for one person to handle in their spare
time

Let me say "part time" rather than "spare time"...
...
- perhaps a group of us should get together to discuss working on
one? I'd be happy to participate.

I would definitely encourage breaking the work up into smaller chunks.  
IMHO "smaller is better" for Boost libraries; there have been a number 
of occasions when I've discovered that a feature I want is hidden as an 
internal component of a Boost library, and I've felt that it should 
have been a stand-alone public entity.  So let's think about how this 
work can be split up:

- A charset_trait class.  I have started on this.  The missing piece is 
a way to look up traits of character sets that are known at run-time; 
input would be appreciated.

- Compile-time and run-time tagged strings.  The basics of this are 
straightforward and done.

- Conversions.  My approach at present is to use iconv via a functor 
that I wrote a while ago.  I believe iconv is widely available; 
however, some implementations may support only a small set of character 
sets.  Alternatives would be interesting.

- Variable width iterators, including the issue that you raised above.

- Interaction with locales, internationalisation, and system APIs.

and no doubt more.  Thinking about the interfaces between these areas 
and the user would be a good place to start.
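
As a starting point for that discussion, the first two items might be sketched as follows.  This is only an illustration under stated assumptions, not the code referred to above: the tag types, charset_traits, tagged_string, char_at and to_utf8 are all hypothetical names, only three character sets are covered, and the run-time lookup is not addressed.  The Latin-1 to UTF-8 conversion is written by hand here simply to keep the example self-contained:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical charset tags; the names are illustrative only.
struct latin1_tag {};
struct utf8_tag {};
struct ucs4_tag {};

// Hypothetical compile-time traits, specialised once per tag.
template <typename Tag> struct charset_traits;

template <> struct charset_traits<latin1_tag> {
    typedef char unit_type;
    static constexpr bool fixed_width = true;
    static const char* name() { return "ISO-8859-1"; }
};
template <> struct charset_traits<utf8_tag> {
    typedef char unit_type;
    static constexpr bool fixed_width = false;
    static const char* name() { return "UTF-8"; }
};
template <> struct charset_traits<ucs4_tag> {
    typedef char32_t unit_type;
    static constexpr bool fixed_width = true;
    static const char* name() { return "UCS-4"; }
};

// A string whose character set is part of its type, so that strings in
// different character sets cannot be mixed without an explicit conversion.
template <typename Tag>
class tagged_string {
public:
    typedef charset_traits<Tag> traits;
    typedef std::basic_string<typename traits::unit_type> storage_type;

    tagged_string() {}
    explicit tagged_string(storage_type units) : units_(std::move(units)) {}

    const storage_type& units() const { return units_; }
    static const char* charset_name() { return traits::name(); }

private:
    storage_type units_;
};

// Indexing by character is only offered for fixed-width character sets;
// calling this with a UTF-8 string is rejected at compile time.
template <typename Tag>
typename charset_traits<Tag>::unit_type
char_at(const tagged_string<Tag>& s, std::size_t i) {
    static_assert(charset_traits<Tag>::fixed_width,
                  "indexing by character needs a fixed-width character set");
    return s.units()[i];
}

// Explicit conversion.  Latin-1 bytes are code points 0x00 to 0xFF, so each
// one becomes either one or two UTF-8 bytes.
inline tagged_string<utf8_tag> to_utf8(const tagged_string<latin1_tag>& in) {
    std::string out;
    for (char c : in.units()) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return tagged_string<utf8_tag>(out);
}
```

With these definitions, assigning a tagged_string<latin1_tag> directly to a tagged_string<utf8_tag> does not compile, while to_utf8 makes the conversion visible at the call site.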

Regards,

Phil.

Phil Endecott