Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]

20 Jan 2011


      On 01/19/2011 02:33 AM, Matus Chochlik wrote:
...
... elision by patrick ...
- It is extensible, so once we have done the painful
transition we will not have to do it again. Currently
utf-8 uses 1-4 (or 1-6) byte sequences to encode code
The 5 and 6 byte sequences are from early versions of utf-8 and have 
known negative security implications.  You should never use them in your 
encoding, nor should you ever accept them as valid utf-8.  The entire 
unicode code space (U+0000 through U+10FFFF) is encodable in 4 byte, 
standard compliant utf-8.  Please see RFC 3629, "UTF-8, a transformation 
format of ISO 10646", F. Yergeau, November 2003.  This is also STD 63.  
Also see Table 3-7, "Well-Formed UTF-8 Byte Sequences", from version 5.2 
of the Unicode Standard.  I can't emphasize this enough.  There have been 
real, serious problems that cost people money, caused by following the 
older, naive spec.
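
To make the "never accept them" point concrete, here is a minimal sketch of a 
validator that follows the byte ranges of the well-formed sequences table 
cited above.  The function name is_well_formed_utf8 is made up for this 
example and is not from Boost or any other library.  It rejects 5 and 6 byte 
lead bytes, overlong forms, surrogates, and anything above U+10FFFF.  Treat 
it as an illustration, not a vetted implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Boost or standard library function).
// Returns true only if s consists entirely of well-formed UTF-8
// sequences as restricted by RFC 3629.
bool is_well_formed_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;      // total length of this sequence
        unsigned char lo = 0x80;  // allowed range for the second byte
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F) {
            len = 1;                        // ASCII
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;                        // C0 and C1 would be overlong
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;      // reject overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;      // reject surrogates D800..DFFF
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;      // reject overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;      // reject anything above U+10FFFF
        } else {
            // Stray continuation byte (80..BF), overlong lead (C0, C1),
            // or F5..FF, which includes the old 5 and 6 byte lead bytes.
            return false;
        }

        if (len > n - i) return false;      // sequence is truncated

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            const unsigned char first = (k == 1) ? lo : 0x80;
            const unsigned char last  = (k == 1) ? hi : 0xBF;
            if (b < first || b > last) return false;
        }
        i += len;
    }
    return true;
}
```

With this sketch, the four bytes F4 8F BF BF (U+10FFFF) are accepted, while 
C0 80 (an overlong NUL) and any sequence starting with F8 (an old 5 byte 
lead) are rejected.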
...
to 1-N bytes (unlike UCS-X and I'm not sure about
UTF-16/32).
If you extended it, then it would not be utf-8, which is an encoding of UCS.
So,
[dark-sarcasm]
even if we dig out the stargate or join the United
Federation of Planets and Captain Kirk, every time
he returns home, brings a truckload of new writing
scripts to support, UTF-8 will be able to handle it.
Well, most of the code space of UCS is still unused.  There's plenty of 
room.  The code space runs from U+0000 to U+10FFFF, and 1,114,112 code 
points is a lot.
just my 0.02 strips of gold pressed latinum :)
[/dark-sarcasm]
Best regards,
Matus

Patrick Horgan