
Dave Abrahams <dave <at> boostpro.com> writes:
At Wed, 19 Jan 2011 23:25:34 +0000, Brent Spillner wrote:
On 1/19/2011 11:33 AM, Peter Dimov wrote:
This was the prevailing thinking once. First this number of bits was 16, an incorrect assumption that claimed Microsoft and Java as victims; then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?
Even if it's theoretically possible, the best standards organization the world has come up with for addressing these issues was unable to produce a standard that did it.
I must confess a lack of knowledge wrt encodings, but my understanding is that a string can be viewed as a sequence of raw data (code units, with no semantics attached), of code points, or of glyphs. The current/upcoming std::string, std::u16string and std::u32string would be the raw data containers, with char*, char16_t* and char32_t* as random access iterators.

I believe that, wrt encoding, one size does not fit all because of the domain- and architecture-specific trade-offs between memory consumption and random access speed. (However, maybe two sizes fit all, namely UTF-8 for compact representation and UTF-32 for random access.) So my uninformed wish would be for something along these lines (disregarding constness issues for the moment):

    namespace std { namespace unicode {

      template<typename CharT>
      struct code_points {
        typedef implementation-defined iterator;

        explicit code_points(std::basic_string<CharT>& s_) : s(s_) {}

        iterator begin();
        iterator end();
        // ...

        std::basic_string<CharT>& s;
      };

      // convenience function
      template<typename CharT>
      code_points<CharT> as_code_points(std::basic_string<CharT>& s)
      { return code_points<CharT>(s); }

    }}

code_points<> would be specialized so that code_points<char32_t>::iterator is a random access iterator, while code_points<char>::iterator would only be a forward iterator. Algorithms processing sequences of code points could then be specialized to take advantage of random access when it is available. (A more concrete sketch of the char case is appended at the end of this message.)

A template<typename CharT> struct glyphs {}; view would also be provided, but it could not offer random access for any encoding (UTF-64, anyone? :) ).

Note that the usual idiom

    for ( ; b != e; ++b) { process(*b); }

would not be as efficient as possible for variable-length encodings of code points (e.g. UTF-8), because process() almost certainly performs the same work as ++b to retrieve the whole code point. We should therefore prefer

    while (b != e) { b = process(b); }

(see the second sketch appended below).

The problem is that I don't have the knowledge to know whether processing code points (instead of glyphs) is truly relevant in practice. If it is, I believe that something along the lines of my proposal would: 1°) leverage the existing std::basic_string<>, and 2°) empower the end-user to select the memory consumption / algorithmic complexity trade-off when processing code points.

What do others think of this?

Best Regards,
Bernard
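
To make the char case concrete, here is a minimal, self-contained sketch of what a forward code-point iterator over a UTF-8 std::string could look like. It only illustrates the proposed shape, not a definitive design: the names (unicode_sketch, utf8_code_point_iterator, sequence_length, code_points_utf8) are invented for this sketch, it lives in an ordinary namespace rather than namespace std, and it assumes well-formed UTF-8 input with no validation.

    #include <cstddef>
    #include <iterator>
    #include <string>

    namespace unicode_sketch {

    // Number of code units in the UTF-8 sequence that starts with lead byte b.
    // (Assumes well-formed input; a real implementation would have to validate.)
    inline std::size_t sequence_length(unsigned char b)
    {
        if (b < 0x80)           return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        return 4;
    }

    // Forward iterator over the code points of a UTF-8 encoded std::string.
    class utf8_code_point_iterator
    {
    public:
        typedef std::forward_iterator_tag iterator_category;
        typedef char32_t                  value_type;
        typedef std::ptrdiff_t            difference_type;
        typedef const char32_t*           pointer;
        typedef char32_t                  reference;

        explicit utf8_code_point_iterator(std::string::const_iterator p) : p_(p) {}

        // Decode the code point starting at the current position.
        char32_t operator*() const
        {
            unsigned char lead = static_cast<unsigned char>(*p_);
            std::size_t len = sequence_length(lead);
            if (len == 1) return lead;
            char32_t cp = lead & (0x7F >> len);   // strip the length-marker bits
            for (std::size_t i = 1; i < len; ++i)
                cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
            return cp;
        }

        // Step over one whole code point, however many bytes it occupies.
        utf8_code_point_iterator& operator++()
        {
            p_ += sequence_length(static_cast<unsigned char>(*p_));
            return *this;
        }

        friend bool operator==(const utf8_code_point_iterator& a,
                               const utf8_code_point_iterator& b) { return a.p_ == b.p_; }
        friend bool operator!=(const utf8_code_point_iterator& a,
                               const utf8_code_point_iterator& b) { return !(a == b); }

    private:
        std::string::const_iterator p_;
    };

    // What the code_points<char> specialization might boil down to:
    // a lightweight view over the string, exposing forward iteration only.
    struct code_points_utf8
    {
        typedef utf8_code_point_iterator iterator;

        explicit code_points_utf8(const std::string& s) : s_(s) {}

        iterator begin() const { return iterator(s_.begin()); }
        iterator end()   const { return iterator(s_.end()); }

    private:
        const std::string& s_;
    };

    } // namespace unicode_sketch

A code_points<char32_t> specialization, by contrast, could simply hand back the string's own random access iterators, since every code point occupies exactly one char32_t code unit.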
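
And a small sketch of the while (b != e) b = process(b); style argued for above, again assuming well-formed UTF-8 and using invented helper names (decode_utf8, process, consume). The point is simply that each multi-byte sequence is decoded exactly once, whereas the for ( ; b != e; ++b) process(*b); form decodes in operator* and then walks the same bytes again in operator++.

    #include <cstddef>
    #include <string>
    #include <utility>

    typedef std::string::const_iterator byte_iterator;

    // Hypothetical helper: decode the (assumed well-formed) UTF-8 sequence
    // starting at p and return the code point plus the position just past it.
    std::pair<char32_t, byte_iterator> decode_utf8(byte_iterator p)
    {
        unsigned char lead = static_cast<unsigned char>(*p);
        std::size_t len = (lead < 0x80)           ? 1
                        : ((lead & 0xE0) == 0xC0) ? 2
                        : ((lead & 0xF0) == 0xE0) ? 3 : 4;
        char32_t cp = (len == 1) ? char32_t(lead) : char32_t(lead & (0x7F >> len));
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p[i]) & 0x3F);
        return std::make_pair(cp, p + len);
    }

    // One processing step in the b = process(b) style: decode once, use the
    // code point, and hand the advanced position back to the caller.
    byte_iterator process(byte_iterator b)
    {
        std::pair<char32_t, byte_iterator> r = decode_utf8(b);
        // ... do something with the code point r.first ...
        return r.second;
    }

    void consume(const std::string& utf8)
    {
        byte_iterator b = utf8.begin(), e = utf8.end();
        while (b != e)
            b = process(b);   // each sequence is walked exactly once
    }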