
Dave Abrahams <dave <at> boostpro.com> writes:
At Wed, 19 Jan 2011 23:25:34 +0000, Brent Spillner wrote:
On 1/19/2011 11:33 AM, Peter Dimov wrote:
This was the prevailing thinking once. First this number of bits was 16, an incorrect assumption that claimed Microsoft and Java as victims; then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
The OED lists ~600,000 words, so 32 bits is enough space to provide a fully pictographic alphabet for over 7,000 languages as rich as English, with room for a few line-drawing characters left over. Surely that's enough?
Even if it's theoretically possible, the best standards organization the world has come up with for addressing these issues was unable to produce a standard that did it.
I must confess a lack of knowledge wrt encodings, but my understanding is that a string can be viewed as a sequence of raw data (code units, with no semantics attached), of code points, or of glyphs. The current/upcoming std::string, std::u16string and std::u32string would be the raw data containers, with char*, char16_t* and char32_t* as random access iterators.

I believe that, wrt encoding, one size does not fit all because of the domain- and architecture-specific trade-offs between memory consumption and random access speed. (However, maybe two sizes fit all, namely UTF-8 for compact representation and UTF-32 for random access.) So my uninformed wish would be for something along these lines (disregarding constness issues for the moment):

    namespace std { namespace unicode {

      template<typename CharT>
      struct code_points {
        typedef implementation-defined iterator;

        explicit code_points(std::basic_string<CharT>& s_) : s(s_) {}

        iterator begin();
        iterator end();
        // ...

        std::basic_string<CharT>& s;
      };

      // convenience function
      template<typename CharT>
      code_points<CharT> as_code_points(std::basic_string<CharT>& s)
      { return code_points<CharT>(s); }

    }}

code_points<> would be specialized so that code_points<char32_t>::iterator is a random access iterator, while code_points<char>::iterator would only be a forward iterator. Algorithms processing sequences of code points could then be specialized to take advantage of random access when it is available. (A more concrete sketch of the char case is appended at the end of this message.)

A template<typename CharT> struct glyphs {}; view would also be provided, but it could not offer random access for any encoding (UTF-64, anyone? :) ).

Note that the usual idiom

    for ( ; b != e; ++b) { process(*b); }

would not be as efficient as possible for variable-length encodings of code points (e.g. UTF-8), because process() almost certainly performs the same work as ++b to retrieve the whole code point. We should therefore prefer

    while (b != e) { b = process(b); }

(see the second sketch appended below).

The problem is that I don't have the knowledge to know whether processing code points (instead of glyphs) is truly relevant in practice. If it is, I believe that something along the lines of my proposal would: 1°) leverage the existing std::basic_string<>, and 2°) empower the end-user to select the memory consumption / algorithmic complexity trade-off when processing code points.

What do others think of this?

Best Regards,
Bernard
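
To make the char case concrete, here is a minimal, self-contained sketch of what a forward code-point iterator over a UTF-8 std::string could look like. It only illustrates the proposed shape, not a definitive design: the names (unicode_sketch, utf8_code_point_iterator, sequence_length, code_points_utf8) are invented for this sketch, it lives in an ordinary namespace rather than namespace std, and it assumes well-formed UTF-8 input with no validation.

    #include <cstddef>
    #include <iterator>
    #include <string>

    namespace unicode_sketch {

    // Number of code units in the UTF-8 sequence that starts with lead byte b.
    // (Assumes well-formed input; a real implementation would have to validate.)
    inline std::size_t sequence_length(unsigned char b)
    {
        if (b < 0x80)           return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        return 4;
    }

    // Forward iterator over the code points of a UTF-8 encoded std::string.
    class utf8_code_point_iterator
    {
    public:
        typedef std::forward_iterator_tag iterator_category;
        typedef char32_t                  value_type;
        typedef std::ptrdiff_t            difference_type;
        typedef const char32_t*           pointer;
        typedef char32_t                  reference;

        explicit utf8_code_point_iterator(std::string::const_iterator p) : p_(p) {}

        // Decode the code point starting at the current position.
        char32_t operator*() const
        {
            unsigned char lead = static_cast<unsigned char>(*p_);
            std::size_t len = sequence_length(lead);
            if (len == 1) return lead;
            char32_t cp = lead & (0x7F >> len);   // strip the length-marker bits
            for (std::size_t i = 1; i < len; ++i)
                cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
            return cp;
        }

        // Step over one whole code point, however many bytes it occupies.
        utf8_code_point_iterator& operator++()
        {
            p_ += sequence_length(static_cast<unsigned char>(*p_));
            return *this;
        }

        friend bool operator==(const utf8_code_point_iterator& a,
                               const utf8_code_point_iterator& b) { return a.p_ == b.p_; }
        friend bool operator!=(const utf8_code_point_iterator& a,
                               const utf8_code_point_iterator& b) { return !(a == b); }

    private:
        std::string::const_iterator p_;
    };

    // What the code_points<char> specialization might boil down to:
    // a lightweight view over the string, exposing forward iteration only.
    struct code_points_utf8
    {
        typedef utf8_code_point_iterator iterator;

        explicit code_points_utf8(const std::string& s) : s_(s) {}

        iterator begin() const { return iterator(s_.begin()); }
        iterator end()   const { return iterator(s_.end()); }

    private:
        const std::string& s_;
    };

    } // namespace unicode_sketch

A code_points<char32_t> specialization, by contrast, could simply hand back the string's own random access iterators, since every code point occupies exactly one char32_t code unit.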
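
And a small sketch of the while (b != e) b = process(b); style argued for above, again assuming well-formed UTF-8 and using invented helper names (decode_utf8, process, consume). The point is simply that each multi-byte sequence is decoded exactly once, whereas the for ( ; b != e; ++b) process(*b); form decodes in operator* and then walks the same bytes again in operator++.

    #include <cstddef>
    #include <string>
    #include <utility>

    typedef std::string::const_iterator byte_iterator;

    // Hypothetical helper: decode the (assumed well-formed) UTF-8 sequence
    // starting at p and return the code point plus the position just past it.
    std::pair<char32_t, byte_iterator> decode_utf8(byte_iterator p)
    {
        unsigned char lead = static_cast<unsigned char>(*p);
        std::size_t len = (lead < 0x80)           ? 1
                        : ((lead & 0xE0) == 0xC0) ? 2
                        : ((lead & 0xF0) == 0xE0) ? 3 : 4;
        char32_t cp = (len == 1) ? char32_t(lead) : char32_t(lead & (0x7F >> len));
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p[i]) & 0x3F);
        return std::make_pair(cp, p + len);
    }

    // One processing step in the b = process(b) style: decode once, use the
    // code point, and hand the advanced position back to the caller.
    byte_iterator process(byte_iterator b)
    {
        std::pair<char32_t, byte_iterator> r = decode_utf8(b);
        // ... do something with the code point r.first ...
        return r.second;
    }

    void consume(const std::string& utf8)
    {
        byte_iterator b = utf8.begin(), e = utf8.end();
        while (b != e)
            b = process(b);   // each sequence is walked exactly once
    }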