
On 1/19/2011 6:25 PM, Brent Spillner wrote:
On 1/19/2011 11:33 AM, Peter Dimov wrote:
This was the prevailing thinking once. First this number of bits was 16 (an incorrect assumption that claimed Microsoft and Java as victims), then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?
It is technically enough. Unicode only assigns code points in the range 0 to 0x10FFFF, so a UTF-32 value will never exceed 0x10FFFF, and UTF-32 can easily represent every code point in Unicode. But Unicode also has the idea of an abstract character, which may be represented by more than one code point. Whether an abstract character should always be considered a single character, or an amalgam of a base character (one code point) plus various combining/formatting code points, is probably debatable. But if one treats an abstract character as a single "character" in some encoding, then the way Unicode defines abstract characters means that such a "character" can be larger than what fits into a single UTF-32 code unit.
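
To make that concrete, here is a small sketch of my own (assuming only a C++11 compiler, not any particular Boost library): the abstract character "e with acute accent" can be spelled as a base letter plus a combining mark, which takes two UTF-32 code units even though the user perceives one character, while the precomposed form U+00E9 takes only one.

  #include <iostream>
  #include <string>

  int main()
  {
      // One abstract character, but two code points, hence two UTF-32 code units:
      // U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
      std::u32string e_acute = { U'e', U'\u0301' };
      std::cout << "decomposed code units:  " << e_acute.size() << '\n';   // prints 2

      // The precomposed form U+00E9 is a single code point that renders the
      // same abstract character, which is why normalization matters when
      // comparing or counting "characters".
      std::u32string e_acute_nfc = { U'\u00E9' };
      std::cout << "precomposed code units: " << e_acute_nfc.size() << '\n'; // prints 1
  }

So even with a fixed-width 32-bit encoding, "number of code units" and "number of abstract characters" are not the same thing.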