
Hi everyone,

Thank you very much, and I appreciate all your feedback! :)

I have talked privately with Chad Nelson and Mathias Gaunard over the past few days and they have given me a lot of useful suggestions. Based on their feedback and some study of Chad's Unicode library, Boost.Unicode and Boost.Filesystem, I have come up with some ideas on what kind of library I should build in this GSoC project.

== Observation ==

Before I go into the concept of the library I'm proposing, I would like to point out a few observations.

First, for Chad's code, I noticed that his utf*_t classes have signatures similar to the following:

    class utf8_t  : public specialized_string_t<utf8_t,  std::basic_string<char>>
    class utf16_t : public specialized_string_t<utf16_t, std::basic_string<char16_t>>
    class utf32_t : public specialized_string_t<utf32_t, std::basic_string<char32_t>>

where char16_t and char32_t are custom typedefs for 16-bit and 32-bit characters when not compiling as C++0x. Notice that the classes are all derived from a template called specialized_string_t that provides a generic interface to access the underlying string. This makes it possible to add Unicode encoding semantics to any string class that only handles raw bytes, by creating new template instances following the pattern `specialized_string_t<ClassName, RawStringContainerClass>`.

This pattern is actually somewhat similar to the view<> concept mentioned by Dean Michael Berris in the boost::string discussion. Dean's view concept has the signature `class view<Encoding>` and wraps the proposed boost::string as its underlying container. Notice that the view template could be generalized to wrap other strings, such as std::string, by adding one template parameter to make it `class view<Encoding, StringT>`.

In the boost::string discussion it was also generally agreed that a string class should really just be a dumb container that stores raw bytes and does not care about the meaning of those bytes. This is also why even the newly proposed boost::string class (now called Boost.Chain) does not attempt to add Unicode semantics; instead, the view<> class is used one level above boost::string to add encoding semantics to the raw string container.

This pattern can also be seen in Boost.Filesystem, which uses a special class to represent a path rather than the raw std::basic_string<> variants. The path class has the following signature:

    template <class StringT, class PathTraits> class basic_path;

where StringT is the type of the internal raw string container, and PathTraits contains two conversion functions that know how to convert one type of external (incoming) string into the type of the underlying string container. This allows Boost.Filesystem's developers to choose a consistent internal string format, such as a 16-bit wchar_t string, while still being able to compare it against other string formats, such as the 8-bit std::string.

There is one inefficiency I notice in the basic_path design: the path traits are restricted to converting between only two string types, instead of converting arbitrary external string types to one internal string type. This means, for example, that if the developer chose path traits that convert between 8-bit and 16-bit character strings, it is not possible to implicitly convert a 32-bit character string into that path type. Fortunately this is probably fine for Boost.Filesystem, as at this moment Windows uses 16-bit wchar_t* in its filesystem API while all other operating systems use 8-bit char*.
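To illustrate that restriction concretely, here is a rough, purely illustrative sketch of the two-type traits pattern described above. The names are mine and this is not the actual Boost.Filesystem interface:

    #include <string>

    // Illustrative only -- NOT the actual Boost.Filesystem path traits.
    struct illustrative_path_traits
    {
        typedef std::wstring internal_string_type; // what basic_path stores
        typedef std::string  external_string_type; // the only other type it converts

        // Conversions are defined between exactly these two types...
        static void convert(const external_string_type& from, internal_string_type& to);
        static void convert(const internal_string_type& from, external_string_type& to);

        // ...so a third string type, say std::basic_string<char32_t>, cannot be
        // fed to basic_path without a manual conversion first.
    };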
However, the design does not scale to general usage, as there are definitely more than two string types in use in C++ today. It could also become a problem for Boost.Filesystem one day in the future, if some OS developers ever decide to use 32-bit character strings in their filesystem API. (Well, it most probably will never happen. 16-bit ought to be enough for everyone, but who knows? :P)

== Proposal ==

Following the observations above, there are two conclusions that I can draw:

1. A string class should be a dumb container for character sequences. Its main focus is to enable manipulation of the character sequences, and it should not concern itself with the meaning of those characters.

2. It is generally agreed that it is good to have higher-level classes that take care of the meaning of the characters in a string. These classes, which I'll call _string wrapper classes_ or _string adapter classes_, wrap around the raw string classes and bring in semantics in the form of character encoding. A string wrapper class does not care about the internal workings of character manipulation, but makes sure that the end result of any manipulation is always semantically valid.

I believe the string wrapper pattern has probably been applied in places I do not know about, but as far as I know I have not seen this pattern formally studied or designed as a general solution. So I would like to take this opportunity to design a Unicode string wrapper library that operates one level above the raw string classes and brings consistent UTF encoding semantics to those string classes.

The string wrapper class that I propose will have the following signature:

    template <typename StringT, typename StringTraits, typename EncodingTraits, typename Policy>
    class unicode_string_adapter;

where:

- StringT is any kind of string class, of any character size, that may or may not have encoding semantics, including but not limited to std::basic_string<>.

- StringTraits generalizes the interface for accessing a given string type: getting the code unit iterator/range, modifying string content, appending characters, concatenating strings, copying strings, and creating/destroying strings. It also provides type information such as the character type, character size, and character traits.

- EncodingTraits provides a generic interface for processing code unit iterators. The interface accepts a generic character type and compares characters through their CodeUnitTraits, which provides a generic way to access character information. An encoding traits class must at least specify the code unit size, which determines whether the string is encoded in UTF-8/16/32. Note that with this generic interface it is even possible to create encoding traits that process non-Unicode encodings, and thus make a non-Unicode string "pretend" to act like a Unicode string, although this is definitely not within the initial scope of this project.

- Policy is the policy class that handles errors that occur during Unicode processing. When an invalid Unicode code point is found, the policy class determines whether to throw an exception, ignore it, replace the code point, or do something else.
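To make the shape of these four parameters a little more concrete, here is a rough sketch of how they might fit together. Everything below is hypothetical, with my own illustrative names rather than a finished design:

    #include <string>
    #include <cstddef>

    // Hypothetical StringTraits, specialized here for std::basic_string<char>.
    template <typename StringT>
    struct string_traits;

    template <>
    struct string_traits< std::basic_string<char> >
    {
        typedef char                                    char_type;
        typedef std::basic_string<char>::const_iterator code_unit_iterator;
        static const std::size_t code_unit_size = 8;    // bits per code unit

        static code_unit_iterator begin(const std::basic_string<char>& s) { return s.begin(); }
        static code_unit_iterator end(const std::basic_string<char>& s)   { return s.end(); }
        static void append(std::basic_string<char>& s, char_type c)       { s.push_back(c); }
    };

    // Hypothetical EncodingTraits, chosen by code unit size (8/16/32), which
    // would decode and encode code points over code unit iterators.
    template <std::size_t CodeUnitSize>
    struct utf_encoding_traits;

    // Hypothetical error-handling policies (discussed further below).
    struct throw_on_error_policy;
    struct replace_on_error_policy;

    // The proposed adapter, with defaults derived from the raw string type.
    template <typename StringT,
              typename StringTraits   = string_traits<StringT>,
              typename EncodingTraits = utf_encoding_traits<StringTraits::code_unit_size>,
              typename Policy         = throw_on_error_policy>
    class unicode_string_adapter;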
Upon completion, it should be trivial to define the commonly used Unicode string classes with simple typedefs:

    typedef unicode_string_adapter<std::string, .....>                 utf8_string;
    typedef unicode_string_adapter<std::basic_string<wchar_t>, .....>  utf16_string;
    typedef unicode_string_adapter<std::basic_string<char32_t>, .....> utf32_string;

It should also be possible to build adapters for other commonly used string types:

    typedef unicode_string_adapter<QString, .....>       utf16_qstring;
    typedef unicode_string_adapter<boost::chain, .....>  utf8_chain;       // for Dean's proposed Boost.Chain string class
    typedef unicode_string_adapter<const char*, .....>   utf8_raw_string;
    typedef unicode_string_adapter<UnicodeString, .....> utf16_icu_string; // ICU's Unicode string

== Benefits of Using unicode_string_adapter ==

So why should developers use a template instance of unicode_string_adapter in their library APIs instead of the plain old std::string? Sure, the added safety of encoding correctness is nice, but it is quite tedious to wrap everything just for that safety, as many people can be expected to complain.

However, there is another great benefit of wrapping raw strings inside unicode_string_adapter: it provides automatic conversion between any two template instances of unicode_string_adapter. This means that if the caller of a library uses a string format different from the one the library accepts, the implicit constructor of unicode_string_adapter will be called and the caller's string will be transparently converted into the library's string format. Here is a use case for a simple program that uses Qt's GUI framework to retrieve a file name from the user and load that file from the filesystem:

    // hypothetical functions
    utf16_qstring Qt::promptInput(utf16_qstring question);
    void Filesystem::loadFile(utf16_string path);
    utf8_string Config::getConfigValue(utf8_string key);

    int main()
    {
        ....
        utf8_string document_dir = Config::getConfigValue(utf8_string("doc_dir"));
        utf16_qstring file_name = Qt::promptInput(utf16_qstring("Enter file name: "));

        // implicit conversion between the three string types
        utf16_string file_path = document_dir + utf8_string("/") + file_name;
        Filesystem::loadFile(file_path);
        ....
    }

Here the program uses three libraries that were independently developed by different developers, and each of them has chosen a different string format for its API for various reasons. With the traditional std::string approach, the developer would have to manually convert the UTF-16 QString to std::string for the GUI prompt, then convert the std::string again to std::basic_string<wchar_t> for filesystem access. But with unicode_string_adapter, operator =() and operator +() are generalized so that all string conversions happen transparently. Not only does this significantly reduce the conversion code needed, it also gives developers the freedom to use their favorite string type without having to follow one "true way" of using Unicode strings. The template also makes it possible to create generic Unicode string processing utilities that accept any template instance and return the result in the same template instance type.
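For illustration, here is a minimal, simplified sketch of the machinery that would make the implicit conversions in the example above possible: a templated converting constructor plus a generic operator +(). This is only an outline of the intended shape, not an actual implementation:

    // Simplified sketch only; member bodies and traits plumbing are omitted.
    template <typename StringT, typename StringTraits, typename EncodingTraits, typename Policy>
    class unicode_string_adapter
    {
    public:
        // Implicit converting constructor: decodes the other adapter's content
        // through its EncodingTraits and re-encodes it with this one's.
        template <typename S2, typename ST2, typename ET2, typename P2>
        unicode_string_adapter(const unicode_string_adapter<S2, ST2, ET2, P2>& other);

    private:
        StringT raw_; // the wrapped raw string
    };

    // Generic concatenation: the right-hand side is converted to the left-hand
    // side's encoding before being appended, so mixed-encoding "+" just works.
    template <typename S1, typename ST1, typename ET1, typename P1,
              typename S2, typename ST2, typename ET2, typename P2>
    unicode_string_adapter<S1, ST1, ET1, P1>
    operator +(const unicode_string_adapter<S1, ST1, ET1, P1>& lhs,
               const unicode_string_adapter<S2, ST2, ET2, P2>& rhs);

With declarations along these lines, the file_path expression in the example would compile because each mixed-type operand is first converted through the templated constructor.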
As an example of such a generic Unicode processing utility, a non-modifying toUpper() could be declared roughly as:

    template <typename StringT, typename StringTraits, typename EncodingTraits, typename Policy>
    unicode_string_adapter<StringT, StringTraits, EncodingTraits, Policy>
        toUpper(const unicode_string_adapter<StringT, StringTraits, EncodingTraits, Policy>& arg);

== What Will I Do in This Project ==

For the main objective of this GSoC project, I will implement a complete generic version of unicode_string_adapter, together with the std::basic_string specializations for the UTF-8/16/32 encodings. I will use Mathias' Boost.Unicode as the back end for encoding and decoding Unicode characters. I will also provide use cases and test cases for each piece of functionality to make sure that the class can serve real-world needs.

If the main objective is completed and there is still time remaining, I will also implement as many template specializations as I can for string classes from other non-Boost projects, such as QString and ICU's UnicodeString. As this would require significant effort to study each of these potentially large libraries, I cannot guarantee the number of specializations I can finish in time. If there is still time remaining after I implement template specializations for all these string classes, I will help Mathias Gaunard improve his Boost.Unicode library.

Within the project period I wish to work with Chad Nelson as my mentor and develop the project as an independent library. After the project finishes, I would be glad to merge it with Boost.Unicode so that they live under one Boost project, depending on what Mathias thinks.

== Things to Consider ==

There are many things that I need to take into consideration when designing the class. Some of these problems are quite controversial and might bring intense discussion to the Boost community. However, they need to be resolved before the GSoC project period ends in order to produce a workable library. Until the answers are agreed on by the majority of the community, the following questions remain open ended and do not yet have clear answers.

Should code point replacement be allowed in the middle of a string? UTF-8 and UTF-16 both encode code points with variable length. If a code point in the middle of a string is replaced with a code point of larger encoded size, the replacement operation can turn into a much more expensive insertion operation. It could also invalidate iterators and cause undefined behavior.

What should the type for a single Unicode combining character or grapheme be? Unicode combining character sequences and graphemes (i.e. abstract characters) can consist of an arbitrary number of code points. This means that, unlike basic types such as char that can be placed on the stack, the value of even a single abstract character must live on the heap due to its variable size. Currently Boost.Unicode uses a range of code points to represent a single abstract character. However, ranges and iterators do not generally claim ownership of the underlying memory, so it is not possible to retain such a range beyond the string object's scope. Another option is to allow unicode_string_adapter to hold substring marks on its underlying string, so that abstract characters have the same type as unicode_string_adapter, with the original string and the abstract character string sharing the same string content behind the scenes. If this is the intent, then unicode_string_adapter should support a fast substring operation.
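Purely as an illustration of that shared-substring idea (hypothetical names, with traits, policy and many details omitted), the adapter could keep a shared handle to the raw buffer plus a pair of substring marks:

    #include <cstddef>
    #include <boost/shared_ptr.hpp>

    // Illustrative sketch only: the marks are shown as plain code unit
    // indices for simplicity.
    template <typename StringT>
    class adapter_with_substring_marks
    {
    public:
        explicit adapter_with_substring_marks(const StringT& s)
            : buffer_(new StringT(s)), first_(0), last_(s.size()) {}

        // Returns an adapter that shares the same raw buffer, so extracting an
        // abstract character does not allocate a new string.
        adapter_with_substring_marks substr(std::size_t first, std::size_t last) const
        {
            return adapter_with_substring_marks(buffer_, first, last);
        }

    private:
        adapter_with_substring_marks(boost::shared_ptr<const StringT> buffer,
                                     std::size_t first, std::size_t last)
            : buffer_(buffer), first_(first), last_(last) {}

        boost::shared_ptr<const StringT> buffer_; // shared, unmodified raw string
        std::size_t first_;                       // substring mark: first code unit
        std::size_t last_;                        // substring mark: one past the end
    };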
On the other hand, if the abstract character string has its own raw string buffer, then iterating over the characters becomes too expensive, as a dynamic memory allocation is required for each abstract character extracted. Yet another option is to give unicode_string_adapter extra space for a single code point, so that an abstract character string consisting of a single code point does not need to allocate dynamic memory; but doing so would make the code more complex and increase the object size as well.

Should unicode_string_adapter support a fast substring operation? As mentioned in the problem above, a substring operation should be supported if no type distinction is made between a multi-character string and a single combining character/grapheme string. However, there is a tradeoff in sharing the same string buffer: mutable operations become more expensive. If unicode_string_adapter holds a substring mark on its underlying string, then its end() iterator may not necessarily be the one-past-the-end iterator of the underlying string. The substring mark also increases the object size of unicode_string_adapter beyond the underlying string's object size, making the adapter more expensive to copy. There are two possible ways to mark the substring region of a unicode_string_adapter: either by index or by iterator range.

Should unicode_string_adapter be immutable? Mutable operations are often hard to code and error-prone, while functional programming has shown that immutable types not only work but also reduce the chance of making mistakes. In the boost::string discussion, the immutable string design also received support from many members. It is possible to make unicode_string_adapter immutable while mutable operations are performed by retrieving its underlying string; however, doing so would make unicode_string_adapter lose the ability to perform transparent string concatenation between arbitrary string types and encodings.

Should unicode_string_adapter be append-only? One alternative to a fully immutable string is to make unicode_string_adapter append-only. As character replacement can potentially turn into an insertion operation, and an insertion operation is almost as expensive as creating a new string, it may be better to simply create a new string. The append operation, however, is more valuable, as it allows users to build a new string in steps by appending character by character.

How should unicode_string_adapter handle an underlying immutable string type? The string type behind unicode_string_adapter can be immutable: either the string is inherently immutable, like the one proposed in Boost.Chain, or the string type carries a const modifier. In that case the mutable operations of unicode_string_adapter, if any, should be disabled using template techniques (such as SFINAE with enable_if) that I have not yet fully learned.

Should invalid code points be preserved in the raw string? When unicode_string_adapter is constructed from a raw string, or when new content is inserted or appended, it is possible that the raw content contains invalid code points. Things are easier if the class has an exception-throwing policy; however, for a code-point-replacement policy it is not clear whether unicode_string_adapter should modify the raw content, or replace the offending code points on the fly when users access the content through iterators. The benefit of preserving the raw string is that there is no loss of information, and it also works with immutable string types.
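To make the policy question more concrete, here is a hypothetical sketch of the two error-handling policies mentioned earlier; a replacing policy like this could be applied either when the raw content is stored or lazily as the content is read through the code point iterators:

    #include <stdexcept>
    #include <boost/cstdint.hpp>

    typedef boost::uint32_t code_point;

    // Hypothetical policy: reject invalid input outright.
    struct error_policy_throw
    {
        static code_point handle_invalid()
        {
            throw std::runtime_error("invalid Unicode code point");
        }
    };

    // Hypothetical policy: substitute U+FFFD REPLACEMENT CHARACTER and continue.
    struct error_policy_replace
    {
        static code_point handle_invalid()
        {
            return 0xFFFD;
        }
    };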
Even though the preservation question can be factored into the policy-based design, it is still desirable to choose a sensible default policy.

Should the constructor that accepts the original raw string be implicit or explicit? An implicit constructor has to make an assumption about the encoding of the raw string, but it allows a library to change its API without breaking an old code base that passes raw strings such as std::string as parameters. On the other hand, an explicit constructor forces the caller of the library API to explicitly state the encoding of the string, guaranteeing that the correct encoding is used. But this makes it hard for existing libraries to migrate to a new version of an API that uses unicode_string_adapter, as it will break existing code unless the old API co-exists with the new API to ease the migration.

Should the raw string be accessible via operator *() or a custom named function? Chad Nelson's original utf*_t string classes have operator *() to access the underlying std::basic_string object. However, this was generally not accepted by the community, as the utf*_t classes were seen as alternative string classes contending to replace std::basic_string, rather than as higher-level adapters for std::basic_string. As my observations above suggest, the original utf*_t classes and unicode_string_adapter actually operate one level higher than the raw strings and complement them. In that case it makes sense that unicode_string_adapter "contains" its underlying string, and operator *() can be used to retrieve the actual content the class contains. I'd expect this question alone to attract quite a lot of debate from the community.

== Conclusion ==

I am sorry for making this proposal draft so long, but as the topic is quite controversial I needed to provide more solid arguments to support my project idea. I hope that it at least convinces you that it is better to have string wrapper classes to ensure encoding correctness. While some of you might still disagree with my proposed implementation, I believe that as long as we agree the project is worthwhile, an ideal implementation design can eventually emerge as the project starts and goes along.

I might have gotten some facts wrong, such as the basic_path usage in Boost.Filesystem. If I have made any mistakes, please feel free to correct me. Please also note that this is a draft proposal I am preparing to submit to GSoC, and the main objective of this thread is to get my project accepted into GSoC. The proposal is nowhere near complete or well thought out enough, and is not yet in any way ready to meet Boost's standards. If there are still many doubts and controversies, I think it might be better for me to write a technical report of some sort at the end of this project, covering all aspects of the problems and solutions around Unicode strings. It might take a long while for this to be accepted by everyone in Boost, but everything has to start somewhere, right? So I hope that this GSoC project will be the first step of the journey towards the ideal solution. :)

Thanks!

Best Regards,
Soares Chen

On Sat, Mar 19, 2011 at 12:01 AM, Artyom <artyomtnk@yahoo.com> wrote:
From: Soares Chen <crf@hypershell.org>
Hi all,
[snip]
I think there are several options that I can choose for my project:

1. Use Chad Nelson's code as a base, try to incorporate other ideas proposed on the mailing list, integrate with Boost.Locale, and bring it to Boost quality to submit for review. If this option is chosen, I wish for Chad Nelson to be my mentor.

2. Start a new code base, gather and compile ideas suggested on the mailing list, with final design decisions made by me and my mentor rather than the community (to keep the project moving fast), bring it to Boost quality and submit for review.

3. Start the boost::string project, where another, better string is reinvented to fix all the weaknesses of std::string.

4. Adopt a different proposal and improve an existing project such as Boost.Unicode [2] or Boost.Locale [3] so that it really solves the encoding awareness problem.

5. Any other suggestion?
Hello,
I want you to address several points:
It would be very hard to get consensus about the way to solve the problem.
Probably the best and most wishful-thinking solution is to assume that all strings are UTF-8 based; however, that is not the reality.
The problem is actually not the string but rather the way you code.
Even if you create a perfect UTF-8 string and then call
fopen(your_perfect_string.c_str(),"r")
Under Windows... And it would not work <sigh... damn Windows>
As you can see from multiple discussions, there are many contradicting requirements about how a string should look and what it should provide.
If you want to provide better Unicode awareness to Boost you don't need a new cool utf-XYZ string, you need a policy.
I think boost::filesystem v3 is a big step forward, it allows you to use UTF-8 strings on Windows which I think is a really good beginning.
This is my opinion.
Boost.Locale and several other projects of mine (CppCMS, CppDB) live happily with std::string.
The problem is that in the vast majority of cases you don't need an encoding-aware string, as many of the operations you usually do on strings are encoding-agnostic. But that is another story.
Bottom line, if you want to improve the Unicode awareness of Boost, I think you need to adopt a Boost.Filesystem v3-like policy all over the Boost code base.
1. Use Wide API as native one in Boost everywhere under Windows
2. Use char * API as native one in Boost everywhere under non-Windows platforms
3. Use std::codecvt to handle this (after many tricks... )
The Unicode string / encoding-aware string is the last thing to do, not the first.
Why?
1. Because you will never get consensus about what the "right thing" to do is (wide, narrow, utf-8, utf-16, etc.).
Projects that are handled and directed by a single source or management, like Qt, GTK(mm), Java, C#, Python, or others, may decide what the right thing is.
This will never happen in Boost, as it is too pluralistic, even in cases where that does not always make sense, simply because of the way libraries are developed, reviewed, and accepted - based on public reviews that ultimately encourage diversity.
2. Because you would not likely be able to force users to actually use your string, as Boost is more about collaboration than enforcement of a specific style.
3. Even the heavy discussions so far haven't reached any conclusion. So what would happen at the final review of your library?
My $0.02
Artyom