
On 01/19/2011 02:33 AM, Matus Chochlik wrote:
... elision by patrick ...

> - It is extensible, so once we have done the painful transition we
> will not have to do it again. Currently utf-8 uses 1-4 (or 1-6) byte
> sequences to encode code

The 5- and 6-byte sequences come from early versions of UTF-8 and have known negative security implications. You should never use them in your encoding, nor should you ever accept them as valid UTF-8. The entire Unicode code space (U+0000 through U+10FFFF, 1,114,112 code points) is encodable in at most 4 bytes of standard-compliant UTF-8.

Please see RFC 3629, "UTF-8, a transformation format of ISO 10646", F. Yergeau, November 2003. This is also STD 63. Also see Table 3-7, "Well-Formed UTF-8 Byte Sequences", in version 5.2 of the Unicode Standard.

I can't emphasize this enough. There have been real, serious problems that cost people money from following the older, naive spec.

> to 1-N bytes (unlike UCS-X and i'm not sure about UTF-16/32).

If you extended it, then it would no longer be UTF-8, which is an encoding of UCS.

> So, [dark-sarcasm] even if we dig out the stargate or join the United
> Federation of Planets and captain Kirk, every time he returns home,
> brings a truckload of new writing scripts to support, UTF-8 will be
> able to handle it.

Well, most of the code space of UCS is still unused. There's plenty of room. Even the 1,114,112 code points that standard UTF-8 can reach are a lot.

> just my 0.02 strips of gold pressed latinum :) [/dark-sarcasm]
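
To make the point about Table 3-7 concrete, here is a minimal sketch of a well-formedness check along the lines of RFC 3629 and that table. The names is_well_formed_utf8 and utf8_examples are made up for illustration; they are not from Boost or any other library, and this is only one way the check could be written, not a reference implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration, not a library API):
// returns true if `s` consists only of byte sequences allowed by
// Table 3-7 of the Unicode Standard 5.2 / RFC 3629.
inline bool is_well_formed_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }   // single-byte (ASCII) sequence

        std::size_t len = 0;       // number of continuation bytes expected
        unsigned char lo = 0x80;   // allowed range of the FIRST continuation byte
        unsigned char hi = 0xBF;

        if      (b0 >= 0xC2 && b0 <= 0xDF) { len = 1; }
        else if (b0 == 0xE0)               { len = 2; lo = 0xA0; } // no overlong 3-byte forms
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 2; }
        else if (b0 == 0xED)               { len = 2; hi = 0x9F; } // no surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { len = 2; }
        else if (b0 == 0xF0)               { len = 3; lo = 0x90; } // no overlong 4-byte forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 3; }
        else if (b0 == 0xF4)               { len = 3; hi = 0x8F; } // nothing above U+10FFFF
        else return false; // 0x80..0xC1 and 0xF5..0xFF never start a sequence,
                           // which rejects the old 5- and 6-byte forms

        if (n - i <= len) return false;      // sequence is truncated

        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;

        // Any remaining continuation bytes must be in 0x80..0xBF.
        for (std::size_t k = 2; k <= len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len + 1;
    }
    return true;
}

// A few illustrative inputs.
inline void utf8_examples()
{
    assert(is_well_formed_utf8("\xF4\x8F\xBF\xBF"));       // U+10FFFF, the last code point
    assert(!is_well_formed_utf8("\xC0\x80"));              // overlong encoding of U+0000
    assert(!is_well_formed_utf8("\xF8\x88\x80\x80\x80"));  // old-style 5-byte sequence
}
```

The lead byte alone decides how long the sequence is and how far the first continuation byte is restricted, which is what rules out overlong forms, surrogates, and anything past U+10FFFF.
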
>
> Best regards,
> Matus