Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]

On 1/19/2011 11:33 AM, Peter Dimov wrote:
This was the prevailing thinking once. First this number of bits was 16, which incorrect assumption claimed Microsoft and Java as victims, then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?

On 1/19/2011 6:25 PM, Brent Spillner wrote:
On 1/19/2011 11:33 AM, Peter Dimov wrote:
This was the prevailing thinking once. First this number of bits was 16, which incorrect assumption claimed Microsoft and Java as victims, then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?
It is technically enough. In fact Unicode only assigns code points in the range 0 to 0x10FFFF, so a UTF-32 value will never exceed 0x10FFFF, and UTF-32 can easily handle every code point in Unicode. But Unicode also has the notion of an abstract character, which may be represented by more than one code point. Whether an abstract character is always considered a single character, or an amalgam of a base character (code point) and various formatting/graphical code points, is probably debatable. But if one assumes that an abstract character is a single "character" in some encoding, then the way Unicode has mapped out abstract characters allows that "character" to be larger than what will fit into a single UTF-32 code unit.
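A concrete illustration of that point (my own example, not from the original mail): the user-perceived character "é" can be written as the two code points U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT, so even in UTF-32 it occupies two code units.

#include <iostream>
#include <string>

int main()
{
    // One user-perceived ("abstract") character, two Unicode code points:
    // U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
    std::u32string decomposed = U"e\u0301";

    // Even in UTF-32, where every code point fits in one code unit,
    // this single visible character occupies two elements.
    std::cout << decomposed.size() << "\n"; // prints 2
}

Counting code units, even UTF-32 ones, therefore does not count user-perceived characters.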

At Wed, 19 Jan 2011 23:25:34 +0000, Brent Spillner wrote:
On 1/19/2011 11:33 AM, Peter Dimov wrote:
This was the prevailing thinking once. First this number of bits was 16, which incorrect assumption claimed Microsoft and Java as victims, then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?
Even if it's theoretically possible, the best standards organization the world has come up with for addressing these issues was unable to produce a standard that did it. As far as I'm concerned, Boost is stuck with the results of the Unicode Consortium until some better standards body comes along, and the likelihood of anyone generating the will to overturn their results as the dominant paradigm is so low as to render that possibility unworthy of attention. Certainly, doing it ourselves is out of scope for Boost.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

Dave Abrahams <dave <at> boostpro.com> writes:
At Wed, 19 Jan 2011 23:25:34 +0000, Brent Spillner wrote:
On 1/19/2011 11:33 AM, Peter Dimov wrote:
This was the prevailing thinking once. First this number of bits was 16, which incorrect assumption claimed Microsoft and Java as victims, then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?
Even if it's theoretically possible, the best standards organization the world has come up with for addressing these issues was unable to produce a standard that did it.
I must confess a lack of knowledge with respect to encodings, but my understanding is that strings are sequences of some raw data (without semantics), of code points, and of glyphs. The current/upcoming std::string, std::u16string and std::u32string would be the raw data containers, with char*, char16_t* and char32_t* as random-access iterators.

I believe that with respect to encoding, one size does not fit all, because of the domain/architecture-specific trade-offs between memory consumption and random-access speed. (However, maybe two sizes fit all, namely UTF-8 for compact representation and UTF-32 for random access.) So my uninformed wish would be for something along the lines of (disregarding constness issues for the moment):

namespace std { namespace unicode {

template<typename CharT>
struct code_points {
    typedef /* implementation defined */ iterator;
    explicit code_points(std::basic_string<CharT>& s_) : s(s_) {}
    iterator begin();
    iterator end();
    // ...
    std::basic_string<CharT>& s;
};

// convenience function
template<typename CharT>
code_points<CharT> as_code_points(std::basic_string<CharT>& s)
{ return code_points<CharT>(s); }

}}

code_points<> would be specialized to provide a random-access code_points<char32_t>::iterator, while code_points<char>::iterator would be a forward iterator. Algorithms processing sequences of code points could be specialized to take advantage of random access when available. A template<typename CharT> struct glyphs {}; would also be provided, but no random access could be offered there (UTF-64, anyone? :))

Note that the usual idiom

for ( ; b != e; ++b) { process(*b); }

would not be as efficient as possible for variable-length encodings of code points (e.g. UTF-8), because process() almost certainly performs the same operations as ++b to retrieve the whole code point, so we should prefer

while (b != e) { b = process(b); }

The problem is that I don't have the knowledge to know whether processing code points (instead of glyphs) is truly relevant in practice. If it is, I believe that something along the lines of my proposal would (1) leverage the existing std::basic_string<>, and (2) empower the end user to select the memory-consumption / algorithmic-complexity trade-off when processing code points.

What do others think of this?

Best Regards,

Bernard
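To make that trade-off concrete, here is a minimal sketch (my own illustration, not part of Bernard's proposal) of why reading a UTF-8 code point and advancing past it share the same work; the helper name decode_one is made up for the example, and validation of malformed input is deliberately omitted:

#include <cstdint>
#include <iostream>
#include <string>

// Decode the code point starting at 'it' and advance 'it' past it.
// Assumes well-formed UTF-8; real code must validate continuation bytes.
std::uint32_t decode_one(std::string::const_iterator& it)
{
    unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;                        // 1 byte: ASCII
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    std::uint32_t cp = lead & (0x3F >> extra);           // keep the payload bits of the lead byte
    while (extra--)
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    return cp;
}

int main()
{
    const std::string utf8 = "a\xC3\xA9\xE2\x82\xAC";    // 'a', U+00E9, U+20AC
    for (auto it = utf8.begin(); it != utf8.end(); )      // note: no ++it here;
        std::cout << std::hex << decode_one(it) << '\n';  // decode_one advances it
}

Here decode_one plays the role Bernard gives to process(): it reads the code point and advances the iterator in one step, so there is no separate ++b repeating the length computation.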

On 20/01/2011 09:41, bernardH wrote:
Dave Abrahams<dave<at> boostpro.com> writes:
At Wed, 19 Jan 2011 23:25:34 +0000, Brent Spillner wrote:
On 1/19/2011 11:33 AM, Peter Dimov wrote:
This was the prevailing thinking once. First this number of bits was 16, which incorrect assumption claimed Microsoft and Java as victims, then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?
Even if it's theoretically possible, the best standards organization the world has come up with for addressing these issues was unable to produce a standard that did it.
I must confess a lack of knowledge wrt to encodings, but my understanding is that strings are sequences of some raw data (without semantic), code points and glyphs.
The difference between graphemes and glyphs is the main reason for the complications of dealing with text on computers. A grapheme is the unit of natural text, while glyphs are the units used for its graphical representation. Different glyphs can represent the same grapheme (this is usually considered a typeface difference, although some typefaces support multiple glyphs for the same grapheme). A grapheme can be represented by several glyphs (mostly diacritics). A single glyph can represent several graphemes, as with ligatures, although some consider this a typeface quirk and not really a glyph, since a glyph should cover at most one grapheme.

Unicode mostly tries to encode graphemes (it doesn't encode all variations of 'a', for example, nor all graphic variations of CJK characters), but for historical reasons the whole thing is quite a mess. A code point is therefore an element in the Unicode mapping, whose semantics depend on what that element actually is. It can be a ligature, a diacritic, a code that is semantically equivalent to another but not necessarily functionally equivalent, etc.

The UTF-X formats are then a family of encodings that describe how code points are encoded as a series of X-sized code units.
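To illustrate that last sentence concretely (my own example, not from the original mail; it assumes a C++11 compiler, where u8 string literals have type const char[]): the same single code point occupies a different number of code units in each UTF-X encoding.

#include <cstdio>

int main()
{
    // One code point, U+1F600, in three encoding forms.
    const char     u8s[]  = u8"\U0001F600"; // UTF-8:  4 code units of 8 bits
    const char16_t u16s[] = u"\U0001F600";  // UTF-16: 2 code units (a surrogate pair)
    const char32_t u32s[] = U"\U0001F600";  // UTF-32: 1 code unit

    std::printf("%zu %zu %zu\n",
                sizeof u8s  / sizeof u8s[0]  - 1,   // 4
                sizeof u16s / sizeof u16s[0] - 1,   // 2
                sizeof u32s / sizeof u32s[0] - 1);  // 1
}

The code point is the same in all three cases; only the number and width of the code units differ.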

Hi,
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?
It could be. It depends on which problems you are trying to solve. Languages in the world operate in many interestingly different ways. Enabling computers to input, store, display, typeset, hyphenate, search, spell-check, render to speech, and perform other multilingual text tasks sometimes involves rules more complex than those used for English. The Unicode Consortium (unicode.org) provides lots of excellent material on these issues, including FAQs. If you are genuinely interested in solving text processing issues for the entire world, I highly recommend a visit over there.

Not all software needs to care about those problems. For a library, one needs to decide which set of tasks and languages to support. If the target is all text processing tasks for the entire world, one may end up with strange-sounding ideas, such as a variable number of code units per character, or that random access to strings is a lower priority.

Then there are constraints. Coming across as having unnecessarily doubled the app's memory use might earn library designers a seriously bad reputation. Refusing to, say, display all files in a directory may get users upset, even if the filenames aren't valid by some standard or another.

Of course, when there are other goals - perhaps the software needs to handle any text but treats it as an opaque blob, or perhaps the author values beauty of internal design more than supporting languages in far-flung corners of the world, or the app is such that butchering the names of 50% of the world's population will have no dire consequences - one will likely end up with a different design.

To give you a taste of some of the complex issues, here are a few quotes on South Asian scripts from http://www.unicode.org/versions/Unicode5.0.0/ch09.pdf:
The writing systems that employ Devanagari and other Indic scripts constitute abugidas -- a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. [...] Devanagari characters, like characters from many other scripts, can combine or change shape depending on their context. [...] Additionally, a few Devanagari characters cause a change in the order of the displayed characters. [...] Some Devanagari consonant letters have alternative presentation forms whose choice depends on neighboring consonants. [...] Devanagari has a collection of nonspacing dependent vowel signs that may appear above or below a consonant letter, as well as spacing dependent vowel signs that may occur to the right or to the left of a consonant letter or consonant cluster. [...] If the superscript mark RAsup is to be applied to a dead consonant that is subsequently replaced by its half-consonant form, then the mark is positioned so that it applies to the form that serves as the base of the consonant cluster. [...]
You might want to also read:

http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
http://blog.mozilla.com/dmandelin/2008/02/14/wtf-16/

Regards,
Lassi
participants (6)

- bernardH
- Brent Spillner
- Dave Abrahams
- Edward Diener
- Lassi Tuura
- Mathias Gaunard