Re: [boost] Strings tagged with their character set

10 Jan 2008

Sebastian Redl wrote:
...
Phil Endecott wrote:
...
>> For a UTF-8 string, my proposal offered
>> a mutable random-access byte iterator
> What is the use case for this?

It's for when you want to treat the data as a sequence of bytes.  For 
example, another thread at the moment is discussing base64 encoding.  
The input to a base64 encoder could be a byte stream iterator.
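As a rough sketch of that use case: the encoder below is written purely against a pair of byte iterators, so it does not care whether the bytes come from a tagged UTF-8 string or anywhere else. The name base64_encode and its signature are made up for illustration and are not part of the proposal; the alphabet and '=' padding follow RFC 4648.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the proposal): Base64-encodes the bytes
// in [first, last) using the standard alphabet with '=' padding.
template <typename ByteIterator>
std::string base64_encode(ByteIterator first, ByteIterator last)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    while (first != last) {
        // Gather up to three input bytes into one 24-bit group.
        std::uint32_t group = 0;
        int count = 0;
        for (; count < 3 && first != last; ++count, ++first) {
            const unsigned char byte = static_cast<unsigned char>(*first);
            group |= static_cast<std::uint32_t>(byte) << (16 - 8 * count);
        }
        // count input bytes yield count + 1 output characters; pad to four.
        for (int i = 0; i < 4; ++i) {
            if (i <= count)
                out += alphabet[(group >> (18 - 6 * i)) & 0x3F];
            else
                out += '=';
        }
    }
    return out;
}
```

With a std::string standing in for the byte range, base64_encode(s.begin(), s.end()) on "Man" gives "TWFu".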

There are also cases where you can exploit knowledge about the encoding 
to use a byte iterator in place of a character iterator.  Specifically, 
in UTF-8 all bytes after the first of a multi-byte character are >=128.
So in a parser, I might want to skip forward to the next '"', 
or '<' or whatever; since those are both <128, I can do this 
significantly more efficiently using the byte iterator.
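A minimal sketch of that kind of skip, assuming the string exposes its storage as a plain byte range (a std::string stands in for it here, and the name skip_to_delimiter is hypothetical). Since every byte of a multi-byte UTF-8 sequence, lead byte included, is >=128, a byte equal to '"' or '<' can only be that character itself, so no decoding is needed.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns an iterator to the next '"' or '<' byte in
// [first, last), or last if neither occurs. Safe on UTF-8 input because
// both delimiters are <128 and so never appear inside a multi-byte sequence.
template <typename ByteIterator>
ByteIterator skip_to_delimiter(ByteIterator first, ByteIterator last)
{
    return std::find_if(first, last, [](unsigned char b) {
        return b == '"' || b == '<';
    });
}
```

The returned position is a byte offset, which is fine for a parser that only needs to resume from the delimiter; converting it to a character index would still require walking the string.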
...
...
>> Concerning mutable vs. immutable strings: which is best in any
>> particular case clearly depends on the size of the string, the
>> operation being performed, and whether it has a variable-length
>> encoding.  The programmer should be allowed to choose which to use.
>> (An interesting case is where the size or character set changes at
>> run-time, and a run-time choice of algorithm is appropriate.)
> Why on earth would you change the character set of a string at runtime?

I should have written "where the size or character set _varies_ at run-time".

Phil.