Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter

11 Aug 2011

From: Soares Chen Ruo Fei <crf@hypershell.org>
To: boost@lists.boost.org
Sent: Tuesday, August 9, 2011 10:53 AM
Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
> My post has probably slipped under the radar, so I'm just going to
> bump this post again. Please feel free to criticize if you think that
> my library has any fundamental design flaw.
>
> As a student and GSoC participant, I think the most important thing
> for me is to learn what I did wrong in the project so that I will not
> repeat the same mistake, and also to gain enough experience so that I
> can make really useful contributions to the open source community in
> future.
>
> Any feedback is very much appreciated. Thanks.

Hello,

First of all, I should say that, as the author of the Boost.Locale
library, I have very strong opinions on how strings and Unicode
should be handled.

My strong opinion is:

a. A string should be just a container object with a default encoding
   and some useful API to handle it.
b. The default encoding MUST be UTF-8.
c. There are several ways to implement strings: COW, mutable, immutable,
   with small-string optimization, and so on. One way or another,
   std::string is the de facto string, and I think we should live with
   it and use alternative containers where it matters.
d. Code points and code units are meaningless unless you develop
   some Unicode algorithm - and you don't - you use one written
   by experts.

So my biggest problem is motivation:
-----------------------------------

> The main reason that Boost.Ustr is developed is because current
> raw string types such as std::string require developers to make
> assumptions on the encoding of the string content, such as UTF-8
> for std::string. This creates inconsistency when a string passed
> to library APIs has a different encoding from what the library
> expects.

Ustr does not solve this problem, as it does not really provide
some kind of

  adapter<generic encoding> {
    string content
  }

Something like that might be useful, but that is not what
you have here. Basically, your library provides a wrapper
around a string and outputs Unicode code points, but it
does so for UTF encodings only!

That is not much of a benefit. You provide encoding traits,
but they are basically meaningless for the purpose you gave,
because:

It does not provide traits for non-Unicode encodings
such as, say, Shift-JIS or ISO-8859-8.

BTW, you can't create traits for many encodings. For
example, you can't implement the traits requirements:

http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_enc...

for popular encodings like Shift-JIS or GBK...

Homework: tell me why ;-)

Also, it is likely that the encoding is something that
can change at run time, not at compile time, and
it seems that this adapter does not support such an
option.
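
To make the run-time point concrete, here is a minimal sketch of
a string that carries its encoding as a run-time value. The names
tagged_string and utf8_only_api are hypothetical and exist only
for this illustration; they are not part of Boost.Ustr or
Boost.Locale, and this is just one possible reading of the
"adapter<generic encoding>" idea.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical type for illustration only.
// The encoding is a run-time value, not a template parameter.
struct tagged_string {
    std::string encoding;  // e.g. "UTF-8", "Shift_JIS", "ISO-8859-8"
    std::string bytes;     // raw content in that encoding
};

// Hypothetical API boundary that accepts UTF-8 input only.
// Because the tag is a run-time value, the check is a run-time check.
std::size_t utf8_only_api(const tagged_string& s)
{
    if (s.encoding != "UTF-8")
        throw std::invalid_argument("expected UTF-8, got " + s.encoding);
    return s.bytes.size();  // length in code units (bytes)
}
```

With such a type, a mismatch between what the caller has and what
the library expects shows up as an error at the API boundary
instead of as silently misinterpreted bytes.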

> The problem mainly arises because there is a small minority of
> developers who use a different encoding for the same string type.

If someone uses strings with different encodings, he usually
knows their encoding...
The problem is that the API is inconsistent: on Windows a narrow
string is in some ANSI code page, and everywhere else it is UTF-8.

This is an entirely different problem, and such adapters don't
really solve it; they actually make it worse...

Other problem is
================

I don't believe that string adapter would solve any real problems
because:

   a) If you iterate over code points, you are very likely doing something
      wrong, because code point != character. This is a very common mistake.

   b) If you want to iterate over code points, it is better to have some
      kind of utf_iterator that receives a range and iterates over it.
      It would be more generic and would not require an additional
      class.

      For example, Boost.Locale has utf_traits, which allows you to
      implement iteration over code points quite easily.

      See:
       http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1...
       http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/structboost_1_1loc...

      And you don't need any kind of specific adapters.

   c) The problem in Boost is not a missing Unicode string, and we do
      not even need yet-another-unicode-string in order to have
      good Unicode support.

      The problem is policy: Boost just can't decide once
      and for all that std::string is UTF-8...
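
To illustrate points a) and b), here is a minimal sketch of
range-based code point iteration over a plain std::string. It is
hand-written for this example and is not the Boost.Locale
utf_traits implementation; the names next_code_point and
code_points are hypothetical. The decoder is deliberately
simplified: it checks lead and continuation bytes but does not
reject overlong forms or surrogates.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode one code point from [it, end) and advance it.
// Precondition: it != end.
template <typename Iterator>
std::uint32_t next_code_point(Iterator& it, Iterator end)
{
    const unsigned char lead = static_cast<unsigned char>(*it);
    ++it;
    if (lead < 0x80)
        return lead;  // single-byte (ASCII) sequence

    int trail = 0;         // number of continuation bytes expected
    std::uint32_t cp = 0;  // payload bits collected so far
    if ((lead & 0xE0) == 0xC0)      { trail = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { trail = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { trail = 3; cp = lead & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    for (; trail > 0; --trail) {
        if (it == end)
            throw std::runtime_error("truncated UTF-8 sequence");
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("invalid UTF-8 continuation byte");
        ++it;
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}

// Works on any range of UTF-8 code units; no string adapter class is needed.
template <typename Iterator>
std::vector<std::uint32_t> code_points(Iterator first, Iterator last)
{
    std::vector<std::uint32_t> out;
    while (first != last)
        out.push_back(next_code_point(first, last));
    return out;
}
```

For the string "e\xCC\x81" ('e' followed by U+0301 COMBINING ACUTE
ACCENT), code_points yields two values, 0x65 and 0x301, although a
reader sees a single character. That is why counting or splitting
text by code point is rarely what you actually want.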

But don't get me wrong. This is My Opinion, and many
would disagree with me.

=================================

Bottom line,

Unicode strings, cool string adapters, UTF iterators,
and even Boost.Unicode and Boost.Locale would not solve
the problem that Boost libraries use inconsistent
encodings on different platforms.

IMHO: the only way to solve it is POLICY.

Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/