
On 1/19/2011 6:25 PM, Brent Spillner wrote:
On 1/19/2011 11:33 AM, Peter Dimov wrote:
This was the prevailing thinking once. First this number of bits was 16 (an incorrect assumption that claimed Microsoft and Java as victims), then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?
It is technically enough. Unicode only assigns code points in the range 0 to 0x10FFFF, so a UTF-32 value will never exceed 0x10FFFF, and UTF-32 can easily represent every code point in Unicode. But Unicode also has the idea of an abstract character, which may be represented by more than one code point. Whether an abstract character should always be considered a single character, or an amalgam of a base character (one code point) plus various combining/formatting code points, is probably debatable. But if one treats an abstract character as a single "character" in some encoding, then the way Unicode defines abstract characters means that such a "character" can be larger than what fits into a single UTF-32 code unit.
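
To make that concrete, here is a small sketch of my own (assuming only a C++11 compiler, not any particular Boost library): the abstract character "e with acute accent" can be spelled as a base letter plus a combining mark, which takes two UTF-32 code units even though the user perceives one character, while the precomposed form U+00E9 takes only one.

  #include <iostream>
  #include <string>

  int main()
  {
      // One abstract character, but two code points, hence two UTF-32 code units:
      // U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
      std::u32string e_acute = { U'e', U'\u0301' };
      std::cout << "decomposed code units:  " << e_acute.size() << '\n';   // prints 2

      // The precomposed form U+00E9 is a single code point that renders the
      // same abstract character, which is why normalization matters when
      // comparing or counting "characters".
      std::u32string e_acute_nfc = { U'\u00E9' };
      std::cout << "precomposed code units: " << e_acute_nfc.size() << '\n'; // prints 1
  }

So even with a fixed-width 32-bit encoding, "number of code units" and "number of abstract characters" are not the same thing.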