
On Thu, Aug 11, 2011, Artyom Beilis wrote:
My strong opinion is:
a. Strings should be just a container object with a default encoding and some useful API to handle it.
b. The default encoding MUST be UTF-8.
c. There are several ways to implement strings: COW, mutable, immutable, with small string optimization and so on. One way or another, std::string is the de-facto string and I think we should live with it and use some alternative containers where it matters.
d. Code point and code unit are meaningless unless you develop some Unicode algorithm - and you don't - you use one written by experts.
This Ustr does not solve this problem, as it does not really provide some kind of
adapter<generic encoding> { string content }
That kind of thing may be useful, but not in this case. Basically your library provides a wrapper around a string and outputs Unicode code points, but it does so for UTF encodings only!
It does not add much benefit. You provide encoding traits, but they are basically meaningless for the purpose you had given, as:
It does not provide traits for non-Unicode encodings like, let's say, Shift-JIS or ISO-8859-8.
The library is designed to be flexible without trying to include every possible encoding by default. The point is that external developers can leverage the EncodingTraits template parameter to implement the desired encoding *themselves*. The core library should be as small as possible and not bloated with translation tables between encodings that are not commonly used by the rest of the world. You may request that I add a sub-library for Shift-JIS or other encodings, and I'll consider implementing it if there is popular demand.
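Purely to illustrate the general shape of such a traits-based customization point, here is a hypothetical sketch. None of these names or signatures come from Boost.Ustr; the actual requirements are in the custom-encoding documentation linked below.

    #include <cstdint>

    // Hypothetical encoding traits sketch (names are illustrative only,
    // not the Boost.Ustr requirements).
    struct my_encoding_traits {
        typedef char codeunit_type;

        // Decode one code point starting at 'it', advancing 'it' past the
        // code units consumed; 'end' guards against reading past the buffer.
        template <typename Iterator>
        static char32_t decode(Iterator& it, Iterator end) {
            (void)end;                                  // single-byte toy encoding
            return static_cast<unsigned char>(*it++);
        }

        // Encode 'cp' into zero or more code units written to 'out'; return
        // false if 'cp' is not representable, so the caller can substitute a
        // replacement character or throw, per its policy.
        template <typename OutputIterator>
        static bool encode(char32_t cp, OutputIterator out) {
            if (cp > 0x7F) return false;
            *out++ = static_cast<char>(cp);
            return true;
        }
    };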
BTW you can't create traits for many encodings; for example, you can't implement the traits requirements:
http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_enc...
For popular encodings like Shift-JIS or GBK...
Homework: tell me why ;-)
I was trying to write a few lines of prototype code to show you that it would work, but I ran out of time and have missed so many replies, so I'll show you next time. But why not? There is already a standard translation table offered by the Unicode Consortium at ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT, so all that is needed is to make the encoder/decoder work with that translation table.

Perhaps you are referring to the non-roundtrip conversion of some Shift-JIS characters, as mentioned by Microsoft at http://support.microsoft.com/kb/170559. But the objective is to make a best-effort emulation, not a completely perfect one. Such a problem cannot be solved by any other implementation means anyway, so if you convert Shift-JIS strings to Unicode manually before passing them to Unicode-oriented functions, you are screwed in exactly the same way.

Or perhaps you mean that you can't properly encode non-Japanese characters into the Shift-JIS encoding. In that case the character will simply be substituted by a replacement character, or an exception will be thrown, according to the provided policy. But the user of such a Unicode-emulated string should only read it and not modify it anyway, i.e. it is probably a bad idea to create a unicode_string_adapter_builder instance of the string and pass it to a mutation function that assumes full Unicode encoding functionality. Then again, you are also screwed the same way if you manually convert a Unicode-encoded string to a Shift-JIS-encoded string and pass it to a Shift-JIS-oriented function.

The conclusion is that Boost.Ustr with custom encoding traits is intended for the convenience of automatically converting between two encodings at library boundaries. It will not solve any encoding conversion problem that you can't already solve with manual conversion.
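For illustration only, here is a minimal sketch of what a table-driven, best-effort Shift-JIS decoder could look like, assuming the mapping from the SHIFTJIS.TXT file mentioned above has been loaded into an in-memory table. The function name and the tiny hard-coded table are hypothetical and not part of Boost.Ustr.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Hypothetical mapping table: Shift-JIS double-byte sequence -> code point.
    // In practice this would be populated from SHIFTJIS.TXT.
    static const std::unordered_map<uint16_t, char32_t> sjis_to_unicode = {
        { 0x8140, U'\u3000' },  // IDEOGRAPHIC SPACE
        { 0x82A0, U'\u3042' },  // HIRAGANA LETTER A
    };

    // Best-effort decode: unmapped sequences become U+FFFD (replacement character).
    std::u32string decode_sjis(const std::string& in) {
        std::u32string out;
        for (std::size_t i = 0; i < in.size(); ) {
            unsigned char lead = static_cast<unsigned char>(in[i]);
            if (lead < 0x80) {                           // single-byte ASCII range
                out.push_back(lead);
                ++i;
            } else if (lead >= 0xA1 && lead <= 0xDF) {   // half-width katakana
                out.push_back(0xFF61 + (lead - 0xA1));
                ++i;
            } else if (i + 1 < in.size()) {              // double-byte sequence
                uint16_t key = static_cast<uint16_t>(
                    (lead << 8) | static_cast<unsigned char>(in[i + 1]));
                auto it = sjis_to_unicode.find(key);
                out.push_back(it != sjis_to_unicode.end() ? it->second : U'\uFFFD');
                i += 2;
            } else {
                out.push_back(U'\uFFFD');                // truncated trailing byte
                ++i;
            }
        }
        return out;
    }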
Also, it is likely that encoding is something that can be changed at runtime, not compile time, and it seems that this adapter does not support such an option.
It is always possible to add a dynamic layer on top of the static encoding layer, but not the other way round. It shouldn't be too hard to write a class with virtual interfaces that calls the proper template instance of unicode_string_adapter. But that is currently outside the scope, and the fundamental design will still be static regardless.
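Purely as an illustration of that layering (the class names here are hypothetical and not part of Boost.Ustr), a dynamic dispatch layer over statically-encoded strings could look roughly like this:

    #include <memory>
    #include <string>
    #include <utility>

    // Hypothetical dynamic interface layered on top of the static encoding layer.
    struct any_unicode_string {
        virtual ~any_unicode_string() {}
        virtual std::string to_utf8() const = 0;
    };

    // Hypothetical wrapper: one derived class per statically-encoded string type.
    // In the real library the body would forward to the appropriate
    // unicode_string_adapter<...> instance instead of returning the raw bytes.
    class utf8_string : public any_unicode_string {
    public:
        explicit utf8_string(std::string s) : str_(std::move(s)) {}
        std::string to_utf8() const override { return str_; }
    private:
        std::string str_;
    };

    // Callers that only know the encoding at runtime hold a
    // std::unique_ptr<any_unicode_string> and never see the template machinery.

The point is only that the runtime choice happens at construction time, while each concrete wrapper still goes through the compile-time encoding machinery underneath.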
If someone uses strings with different encodings he usually knows their encoding...
The problem is that the API is inconsistent: on Windows a narrow string is in some ANSI code page, and anywhere else it is UTF-8.
This is an entirely different problem, and such adapters don't really solve it but actually make it worse...
If I'm not wrong, however, the wide version of strings on Windows is always UTF-16 encoded, am I correct? So a manual solution for constructing UTF-8 strings on Windows would be similar to:

    std::wstring wide_str = L"世界你好";
    std::string u8_str;
    generic_conversion::u16_to_u8(wide_str.begin(), wide_str.end(), std::back_inserter(u8_str));

except that it will not be portable across Unix systems. But with Boost.Ustr you can achieve the same thing with

    unicode_string_adapter<std::string> u8_str = USTR("世界你好");

which gets expanded into

    unicode_string_adapter<std::string> u8_str =
        unicode_string_adapter<std::wstring>( std::wstring(L"世界你好") );

So while it is hard to construct UTF-8 string literals on Windows, it should still be possible by writing code that manually inserts UTF-8 code units into std::string. After all, std::string does not place any restriction on which bytes we can manually insert into it. Perhaps you are talking about printing the string out to std::cout, but that is another hard problem that we have to tackle separately.
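For reference, the manual conversion above boils down to decoding UTF-16 code units (including surrogate pairs) into code points and re-encoding them as UTF-8. A minimal, self-contained sketch of that step (not the generic_conversion implementation itself, and assuming well-formed UTF-16 input with no error handling):

    #include <cstdint>
    #include <iterator>
    #include <string>

    // Convert UTF-16 code units in [first, last) to UTF-8 bytes written to 'out'.
    template <typename InIt, typename OutIt>
    void u16_to_u8_sketch(InIt first, InIt last, OutIt out) {
        while (first != last) {
            char32_t cp = static_cast<char16_t>(*first++);
            if (cp >= 0xD800 && cp <= 0xDBFF && first != last) {
                // Combine a surrogate pair into one code point.
                char32_t low = static_cast<char16_t>(*first++);
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            }
            if (cp < 0x80) {
                *out++ = static_cast<char>(cp);
            } else if (cp < 0x800) {
                *out++ = static_cast<char>(0xC0 | (cp >> 6));
                *out++ = static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                *out++ = static_cast<char>(0xE0 | (cp >> 12));
                *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                *out++ = static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                *out++ = static_cast<char>(0xF0 | (cp >> 18));
                *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                *out++ = static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
    }

With the wide literal above it would be called as u16_to_u8_sketch(wide_str.begin(), wide_str.end(), std::back_inserter(u8_str)), assuming a 16-bit wchar_t as on Windows.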
Other problem is
================
I don't believe that a string adapter would solve any real problems because:
a) If you iterate over code points you are very likely doing something wrong, as code point != character, and this is a very common mistake.
I am well aware of it, but I decided to separate the concerns into different layers and tackle the abstract character problem at a higher layer. The unicode_string_adapter class has a well defined role, which is to offer *code point* level access. I did plan to write another class that does the character iteration, but I haven't had enough time to do it yet. But I believe you can pass the code point iterators to Mathias' Boost.Unicode methods to create abstract character iterators that do the job you want.
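As a concrete example of the code point vs. character distinction (standard C++ only, nothing from Boost.Ustr):

    #include <iostream>
    #include <string>

    int main() {
        // One user-perceived character, "e with acute accent", written in
        // decomposed form as U+0065 LATIN SMALL LETTER E followed by
        // U+0301 COMBINING ACUTE ACCENT:
        std::u32string decomposed = U"e\u0301";

        // Two code points, yet a grapheme-aware (abstract character)
        // iterator should report a single character.
        std::cout << "code points: " << decomposed.size() << std::endl; // prints 2
    }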
b) If you want to iterate over code points it is better to have some kind of utf_iterator that receives a range and iterates over it; it would be more generic and would not require an additional class.
For example, Boost.Locale has utf_traits that allow you to implement iteration over code points quite easily.
See: http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1... http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/structboost_1_1loc...
And you don't need any kind of specific adapters.
I added a static method unicode_string_adapter::make_codepoint_iterator to do what you have requested. It accepts three code unit iterator parameters, current, begin, and end, so that it can iterate in both directions without going out of bounds. I hope that is what you are looking for.
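To illustrate why the current/begin/end triple is needed for bidirectional iteration, here is a generic UTF-8 sketch (not the Boost.Ustr implementation) of stepping backwards to the previous code point:

    #include <string>

    // Step back from 'it' to the start of the previous UTF-8 code point.
    // 'begin' is needed so we never walk off the front of the buffer.
    std::string::const_iterator prev_codepoint(std::string::const_iterator it,
                                               std::string::const_iterator begin) {
        while (it != begin) {
            --it;
            // Continuation bytes have the form 10xxxxxx; stop at a lead byte.
            if ((static_cast<unsigned char>(*it) & 0xC0) != 0x80)
                break;
        }
        return it;
    }

Without begin, the loop would have no way to know where to stop once the very first code point is reached; end plays the symmetric role when stepping forward.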
c) The problem in Boost is not a missing Unicode string, and we don't even need yet another Unicode string in order to have good Unicode support.
The problem is policy: Boost just can't decide once and for all that std::string is UTF-8...
But don't get me wrong. This is My Opinion; many would disagree with me.
Bottom line,
Unicode strings, cool string adapters, UTF iterators, and even Boost.Unicode and Boost.Locale would not solve the problem that Boost libraries use inconsistent encodings on different platforms.
IMHO: the only way to solve it is POLICY.
It is OK if you disagree with my approach, but keep in mind that the focus of this thread is just to decide whether my current proposed solution is good, not to propose a better solution. Thanks.

cheers,
Soares