
----- Original Message -----
From: Beman Dawes <bdawes@acm.org>
To: Boost Developers List <boost@lists.boost.org>
Cc:
Sent: Saturday, January 28, 2012 6:46 PM
Subject: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
Beman.github.com/string-interoperability/interop_white_paper.html describes Boost components intended to ease string interoperability in general and Unicode string interoperability in particular.
These proposals are the Boost version of the TR2 proposals made in N3336, Adapting Standard Library Strings and I/O to a Unicode World. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.
I'm very interested in hearing comments about either the Boost or the TR2 proposal. Are these useful additions? Is there a better way to achieve the same easy interoperability goals?
Where is the best home for the Boost proposals? A separate library? Part of some existing library?
Are these proposals orthogonal to the need for deeper Unicode functionality, such as Mathias Gaunard's Unicode components?
--Beman
Before I address specific points in the draft, I'd like to say: this is not the way to go. In order to make Unicode work, we need two things:

1. First of all, the standard should say that any compiler must be able to treat literals and input text as UTF-8, and recommend that this be the default. That would make developers' lives much easier, whether they develop for "wide" Unicode or for narrow UTF-8.

2. The standard does not define which locales are actually supported or how they are named. It should state explicitly that UTF-8 locales must be supported.

The rest becomes trivial: std::wcout << L"שלום" and std::cout << "שלום" would just work, and much more.

We are all working so hard to work around a design flaw of C++ and the C++ standard library, which allow ANSI encodings and are built around them. If the standard required, or at least recommended, UTF-8 handling by default, we would not need boost::filesystem::path::imbue and all the other machinery that makes life a nightmare.

If we want to go forward with Unicode we need to deprecate non-UTF encodings: UTF-8, UTF-16 and UTF-32 should be the defaults, or selected at compile time, and the standard library should handle them. Take a look at what Go did. All modern languages are Unicode by nature; let C++ be as well. Everything else is just a workaround for a deeper problem and makes programming harder. This would be possible if the standard committee voted for it.
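For contrast, here is roughly what it takes today just to get the std::wcout << L"שלום" line above to print correctly on a typical Linux system. This is only a sketch under my own assumptions: an environment locale such as en_US.UTF-8, a UTF-8 terminal, and a compiler that stores the narrow literal as UTF-8 bytes.

    #include <iostream>
    #include <locale>

    int main()
    {
        // Under the default "C" locale std::wcout cannot narrow the wide
        // literal, so the program has to opt in to the user's (hopefully
        // UTF-8) locale by hand:
        std::locale::global(std::locale(""));
        std::wcout.imbue(std::locale());
        std::wcout << L"שלום" << L"\n";

        // std::cout << "שלום" needs no locale at all -- it just copies
        // bytes -- but it is correct only if the literal really is UTF-8
        // and the terminal expects UTF-8, which is exactly what the
        // standard currently refuses to guarantee.
    }

If UTF-8 were the required default, none of this ceremony would be needed.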
--------------------------------------------

Now some specific points about the converting iterator: it is fine for Unicode encoding conversion, but it is very problematic for non-Unicode encodings. A small note first:

Interfaces don't work well with generic programming techniques, particularly iterators.
An iterator is a bad design for general encoding conversion, for several reasons.

In many cases conversion is stateful, and an iterator is not the best concept for that. Some conversions require complex algorithms that cannot simply be inlined but have to be implemented through inheritance; an iterator would then need several virtual function calls per character for a trivial operation, or would become very complex because it would need buffering techniques inside the iterator. That is why the codecvt interface is actually a good fit for encoding conversion (even though it has a design flaw: mbstate_t is useless for implementing stateful encoders; if mbstate_t were something reasonable, it would be a very good interface). Let me explain why:

1. In some cases you may want to perform normalization or other operations before conversion, because it is not always correct to assume that one character of the XYZ encoding maps to one Unicode code point. Sometimes several characters join into a single code point, and the other way around.

2. When you operate on complex encodings it is better to pass a buffer, for performance: a conversion algorithm works much better on a chunk of text than through a per-character API. Just look at the MSVC standard library: its wide-to-narrow conversion calls codecvt for **every** code point rather than using buffers. So do you really expect implementations to actually create efficient iterators? (See the sketch at the end of this mail.)

Bottom line: an "iterator range" is not a good method for handling this.

-------------------------------------

Iterator concept.

The paper does not say which iterator concept is required. Input? Output? Forward? Bidirectional? Random access? For some encodings it can work as a bidirectional or even a random-access iterator; for others it may be a forward iterator only.

-----------------------------

So I don't really think this is the way to go.

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/
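P.S. The buffer-oriented sketch mentioned above, showing chunk-at-a-time conversion through codecvt. It is only an illustration: the helper name widen(), the 256-character buffer, and the assumption that the given locale carries a suitable codecvt facet are all mine.

    #include <locale>
    #include <stdexcept>
    #include <string>

    // Convert a narrow (assumed UTF-8) string to the wide encoding,
    // calling the codecvt facet once per chunk instead of once per
    // code point.
    std::wstring widen(std::string const & in, std::locale const & loc)
    {
        typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
        cvt_type const & cvt = std::use_facet<cvt_type>(loc);

        std::wstring out;
        std::mbstate_t state = std::mbstate_t();
        char const * from = in.data();
        char const * const from_end = from + in.size();

        while(from != from_end) {
            wchar_t chunk[256];
            wchar_t * to_next = chunk;
            char const * from_next = from;
            std::codecvt_base::result const r =
                cvt.in(state, from, from_end, from_next,
                       chunk, chunk + 256, to_next);
            // Stop on an invalid sequence, or if no progress was made
            // because the input ends with an incomplete sequence.
            if(r == std::codecvt_base::error || from_next == from)
                throw std::runtime_error("bad or incomplete sequence");
            out.append(chunk, to_next);  // whatever fit into this chunk
            from = from_next;
        }
        return out;
    }

Called as, say, widen(utf8_bytes, std::locale("en_US.UTF-8")), this makes one facet call per 256 characters, which is exactly the granularity an iterator interface cannot express without hidden buffering.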