Re: [boost] Strings tagged with their character set

9 Jan 2008

It's nice to see this thread from September picked up again, as I was a 
bit disappointed by the volume of response to my proposal at the time.  
I may be plugging this code into something real quite soon, and will 
try to drum up some interest here again if I do.  With XML being 
mentioned again, I think character sets are something that needs attention.

[Be warned that some readers will not see new messages on this old thread.]

Sebastian Redl wrote:
...
David Rodríguez Ibeas wrote:
...
On Sep 27, 2007 5:31 PM, Joseph Gauterin <joseph.gauterin@googlemail.com> wrote:
[putting back the context]
...
...
...
If we had mutable strings, consider how badly the following would perform:
std::replace(utfString.begin(),utfString.end(),SingleByteChar,MultiByteChar);
Although this looks O(n) at first glance, it's actually O(n^2), as the 
container has to expand itself, shifting every byte that follows, for 
each replacement. I don't think a library should make it that easy to 
write worst-case code.
...
...
While I don't know whether this problem has a solution, an alternative
replace can be implemented in the library that runs in linear time by
constructing a new string, copying values and replacing in the same pass.
Could std::replace() be disabled somehow? (SFINAE?)
It ought to be possible to overload it and, if the string is not part of
std, have the overloaded version be picked up with ADL. That only works if
replace() isn't explicitly qualified, of course, which is a problem.
But I think immutable strings are the way forward anyway.
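
To make the ADL point concrete, here is a minimal sketch. Everything in it 
is hypothetical (the textlib namespace, byte_iterator, and the adl_calls 
counter are illustration only, not part of any proposal): an unqualified 
replace() call finds the library overload through the iterator's 
namespace, while a call qualified as std::replace() bypasses it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

namespace textlib {  // hypothetical library namespace

// Minimal mutable forward iterator over bytes. Because it is a class declared
// in textlib, unqualified calls that take it also search textlib (ADL).
struct byte_iterator {
    using iterator_category = std::forward_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = char*;
    using reference = char&;

    explicit byte_iterator(char* q = nullptr) : p(q) {}
    char& operator*() const { return *p; }
    byte_iterator& operator++() { ++p; return *this; }
    byte_iterator operator++(int) { byte_iterator t = *this; ++p; return t; }
    bool operator==(const byte_iterator& o) const { return p == o.p; }
    bool operator!=(const byte_iterator& o) const { return p != o.p; }

    char* p;
};

int adl_calls = 0;  // counts calls to the library overload (demonstration only)

// Library overload. Here it just replaces in place; a real library could
// substitute a linear-time rebuild or refuse the operation altogether.
void replace(byte_iterator first, byte_iterator last, char from, char to) {
    ++adl_calls;
    for (; first != last; ++first)
        if (*first == from) *first = to;
}

}  // namespace textlib

void demo() {
    std::string s = "a-b-c";
    textlib::byte_iterator first(&s[0]);
    textlib::byte_iterator last(&s[0] + s.size());

    replace(first, last, '-', '+');       // unqualified: ADL finds textlib::replace
    std::replace(first, last, '+', '-');  // qualified: the overload is bypassed

    assert(textlib::adl_calls == 1);      // only the unqualified call was counted
    assert(s == "a-b-c");                 // both calls still did their replacement
}
```

The second call is the problem case: nothing the library does can 
intercept a call that names std::replace explicitly.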
For a UTF-8 string, my proposal offered

   a mutable random-access byte iterator
   a const bidirectional character iterator
   a mutable output character iterator

std::replace needs a mutable forward iterator, so you wouldn't be able 
to apply it to the character iterator.  The library wouldn't "let you 
write worst case code".
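
As a rough illustration of why the call would be rejected, the sketch 
below uses std::string's own iterators as stand-ins (an assumption; the 
proposed utf8_string iterators are not shown here) and a hypothetical 
writable_v trait.  std::replace assigns through the iterator, and a const 
iterator's reference type cannot be assigned to.

```cpp
#include <iterator>
#include <string>
#include <type_traits>

// Hypothetical trait: true when "*it = value" would compile for iterator It.
template <class It>
constexpr bool writable_v =
    std::is_assignable<typename std::iterator_traits<It>::reference,
                       typename std::iterator_traits<It>::value_type>::value;

// A mutable iterator can be written through, so std::replace accepts it.
static_assert(writable_v<std::string::iterator>,
              "mutable iterator is writable");

// A const iterator cannot, so std::replace over it fails to compile.
static_assert(!writable_v<std::string::const_iterator>,
              "const iterator is read-only");
```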

There is, however, the replace_copy algorithm, which I think does 
exactly what you need; it takes a pair of input iterators and an output 
iterator, i.e. something like

utf8_string s1 = "......";
utf8_string s2;
std::replace_copy(s1.begin(),s1.end(),
                  utf8_string::character_output_iterator(s2),
                  L'x',L'y');
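
For anyone who wants to try the idea without the proposed classes, here 
is a self-contained sketch of the same pattern.  It is not the proposal's 
implementation: std::string stands in for utf8_string, the names 
decode_one, utf8_reader, utf8_appender and replace_chars are made up for 
illustration, char32_t is used instead of wchar_t, and the input is 
assumed to be valid UTF-8 (no error checking).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper: decode the code point starting at byte index i of
// valid UTF-8 (no error checking) and advance i past it.
inline char32_t decode_one(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);
    for (; extra > 0; --extra)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical read-only character iterator: yields code points, never writes.
class utf8_reader {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    utf8_reader(const std::string& s, std::size_t pos) : s_(&s), pos_(pos) {}
    char32_t operator*() const { std::size_t i = pos_; return decode_one(*s_, i); }
    utf8_reader& operator++() { decode_one(*s_, pos_); return *this; }
    utf8_reader operator++(int) { utf8_reader t = *this; ++*this; return t; }
    bool operator==(const utf8_reader& o) const { return pos_ == o.pos_; }
    bool operator!=(const utf8_reader& o) const { return pos_ != o.pos_; }
private:
    const std::string* s_;
    std::size_t pos_;
};

// Hypothetical output character iterator: assigning a code point appends its
// UTF-8 encoding to the target string.
class utf8_appender {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit utf8_appender(std::string& out) : out_(&out) {}
    utf8_appender& operator=(char32_t cp) {
        if (cp < 0x80) {
            put(cp);
        } else if (cp < 0x800) {
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        } else {
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
        return *this;
    }
    utf8_appender& operator*() { return *this; }
    utf8_appender& operator++() { return *this; }
    utf8_appender operator++(int) { return *this; }
private:
    void put(char32_t byte) { out_->push_back(static_cast<char>(byte)); }
    std::string* out_;
};

// One pass over the input; the output is built by appending, so a replacement
// that changes the encoded length never shifts bytes already written.
inline std::string replace_chars(const std::string& in, char32_t from, char32_t to) {
    std::string out;
    out.reserve(in.size());
    std::replace_copy(utf8_reader(in, 0), utf8_reader(in, in.size()),
                      utf8_appender(out), from, to);
    return out;
}

// Usage sketch: replace one-byte 'a' with two-byte U+00E9.
inline void self_check() {
    assert(replace_chars("banana", U'a', U'\u00E9') == "b\xC3\xA9n\xC3\xA9n\xC3\xA9");
}
```

Since the source is only read and the destination only appended to, the 
single-byte to multi-byte case costs no more than any other replacement.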

Concerning mutable vs. immutable strings: which is best in any 
particular case clearly depends on the size of the string, the 
operation being performed, and whether the string uses a variable-length 
encoding.  The programmer should be allowed to choose which to use.  
(An interesting case is where the size or character set changes at 
run-time, and a run-time choice of algorithm is appropriate.)

Regards,

Phil.

Phil Endecott