Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]

20 Jan 2011

      On 01/19/2011 12:56 PM, Robert Ramey wrote:
...
... elision by patrick ...
std::string - a sequence of bytes
utf8_string - a sequence of "code points" implemented in terms of
std::string.
With the ability to specify a conversion facet to convert from your 
local encoding to utf-8.  The string would still validate the utf-8 
received from the conversion facet.
What do you do about things that can validly be represented either by one 
character or by a base character followed by one or more combining 
characters?  For example, Ü can be represented by U+00DC, a capital U 
with diaeresis, or by the two code points U+0055 U+0308, a U followed by 
a combining diaeresis. Ü <=- That one is done with the two-code-point 
sequence and the previous one is just one character.  The spec says 
that these are canonically equivalent and must be treated as the same 
text.  Will our utf8_string class always choose one representation over 
the other, i.e. normalize to a single form?  Certainly to make choices 
like this you'd need the Unicode Character Database.
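
To make the problem concrete, here is a minimal sketch of the two spellings 
as raw UTF-8 bytes.  The names precomposed, decomposed and bytes_match are 
made up for illustration and are not part of any proposed utf8_string.

```cpp
#include <cassert>
#include <string>

// U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS: two bytes in UTF-8.
const std::string precomposed = "\xC3\x9C";

// U+0055 LATIN CAPITAL LETTER U followed by U+0308 COMBINING DIAERESIS:
// one byte for the U plus two bytes for the combining mark.
const std::string decomposed = "U\xCC\x88";

// Hypothetical helper: reports whether a plain byte-wise comparison,
// which is all std::string::operator== does, sees the spellings as equal.
inline bool bytes_match() {
    assert(precomposed.size() == 2 && decomposed.size() == 3);
    return precomposed == decomposed;
}
```

bytes_match() returns false, so a utf8_string built directly on std::string 
comparison would treat the two spellings as different unless it normalizes 
them first.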

So, if you're iterating the utf8_string with an iterator iter, what type 
does *iter return?  It could _consume_ a lot of bytes.

Is it a char32_t holding the character, or is it another utf8_string 
with only one character in it?  I'd say char32_t because that can hold 
any code point in UCS.
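
As a rough sketch of what dereferencing could do if *iter yields a 
char32_t: decode one code point starting at a byte offset and report how 
many bytes it consumed.  Decoded and decode_at are hypothetical names, and 
returning U+FFFD for malformed input is just one possible policy, not 
something settled in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Result of decoding one code point (hypothetical type for this sketch).
struct Decoded {
    char32_t code_point;  // U+FFFD when the bytes at pos are malformed
    std::size_t length;   // number of bytes consumed, always at least 1
};

// Decode the UTF-8 sequence starting at byte offset pos.
// Precondition: pos < s.size().
inline Decoded decode_at(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    const Decoded malformed{U'\uFFFD', 1};

    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    if (lead < 0x80) return Decoded{static_cast<char32_t>(lead), 1};

    std::size_t len = 0;
    char32_t cp = 0;
    char32_t min = 0;  // smallest value allowed for this length
    if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
    else return malformed;  // stray continuation byte or invalid lead byte

    if (pos + len > s.size()) return malformed;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) return malformed;  // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }

    // Reject overlong forms, surrogates, and values above U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return malformed;
    return Decoded{cp, len};
}
```

An iterator built on this would advance by the returned length, which is 
why one increment can step over up to four bytes.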

So then what about *iter = thechar?  What type or types can thechar be?

char32_t, char16_t, wchar_t, char, unsigned char, int, int32_t, a 
utf8_string with only one "character" to be copied in, or a utf8_string 
from which we'll just take the first char?

I'd probably use char32_t in both those cases.
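
One wrinkle with assignment: writing a char32_t through the iterator can 
change the number of bytes stored, so *iter could not simply return a 
reference into the buffer.  A minimal sketch of what the write would have 
to do, with encode_utf8 and assign_code_point as hypothetical helpers of 
my own:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one Unicode scalar value as UTF-8.  Returns an empty string for
// surrogates and for values above U+10FFFF.
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Overwrite the code point occupying old_len bytes at byte offset pos.
// The string may grow or shrink as a result.
inline bool assign_code_point(std::string& s, std::size_t pos,
                              std::size_t old_len, char32_t cp) {
    assert(pos + old_len <= s.size());
    const std::string bytes = encode_utf8(cp);
    if (bytes.empty()) return false;  // not a valid scalar value
    s.replace(pos, old_len, bytes);
    return true;
}
```

Replacing the one-byte U in "aUb" with U+00DC grows the string from three 
bytes to four, so any other iterators into the string would be 
invalidated by the write.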

Food for thought.  I agree I'd like to see it be derived from 
std::string so you can pass it to things that expect a std::string and 
don't care so much about encoding.

Patrick


Patrick Horgan