GSoC Proposal Preparation for an Encoding-Aware String

15 Mar 2011

Hi all,

I am an undergraduate student at the National University of
Singapore, and I am interested in taking part in this year's GSoC with
the Boost community. I am currently preparing a proposal to add
encoding awareness to strings through a new or existing Boost
project, but I would first like to ask some questions to gauge
the community's interest in such a project.

The inspiration came from several lengthy discussions I found in
the Boost mailing list archive, which took place recently, from mid
January to mid February. [5][6][7] As a brief overview, according to
my own understanding, the heated debates were mainly about different
ways to ensure consistency between the encoding that library code
expects when it accepts strings and the actual encoding of the strings
passed by users. The problem arises because a small minority of
developers store std::string data in encodings other than UTF-8, so an
implicit assumption that std::string is UTF-8 encoded leads to
inconsistency and causes numerous bugs that are outside the scope of
the Boost libraries.

From the discussion, I found several proposals that have been made to
solve the inconsistencies of std::string encoding:
1. Create new classes that wrap std::string, std::u16string, and
std::u32string, one for each encoding, and ensure encoding correctness
simply through C++ type safety features. The classes are tentatively
called the utf*_t classes (a name that many disliked) - Proposed by
Chad Nelson with a working prototype available. [1]
2. Continue to strongly enforce the assumption that all std::strings
are UTF-8 encoded. Deprecate other encodings in std::string, or make
them hard to use.
3. Reinvent std::string and introduce boost::string. The new string
class is proposed to be immutable and to delegate encoding awareness
to templated view<> classes that wrap boost::string; IMHO the view<>
classes are similar to proposal (1). [7]
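
To make the idea behind proposal (1) concrete, here is a minimal sketch
of how a wrapper type could carry the encoding in the type system. This
is only my own illustration and is not Chad Nelson's actual code: the
names encoded_string, utf8_tag, latin1_tag, and byte_length are all
hypothetical, and the wrapper does not validate its contents.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical tag types naming an encoding (not from any existing library).
struct utf8_tag {};
struct latin1_tag {};

// Thin wrapper that records the encoding in its type.
// It stores the bytes as given and performs no validation.
template <typename Encoding>
class encoded_string {
public:
    // Explicit, so a plain std::string never converts silently.
    explicit encoded_string(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

typedef encoded_string<utf8_tag> utf8_string;
typedef encoded_string<latin1_tag> latin1_string;

// Library code states the encoding it expects in its signature.
std::size_t byte_length(const utf8_string& s) { return s.bytes().size(); }

// Passing the wrong encoding, or an untagged std::string, is a compile error.
static_assert(!std::is_convertible<latin1_string, utf8_string>::value,
              "Latin-1 text must not convert implicitly to UTF-8 text");
static_assert(!std::is_convertible<std::string, utf8_string>::value,
              "untagged strings must be wrapped explicitly");
```

With this sketch, byte_length(utf8_string("abc")) compiles, while
byte_length(latin1_string("abc")) and byte_length(std::string("abc"))
are both rejected by the compiler.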

I will try not to go into the details and pros & cons of each proposal,
to avoid turning this thread into yet another string discussion. The
original discussions have 542 messages in total and spanned a whole
month, and I failed to find any conclusion that everyone could agree
on. What I noticed in the discussions is that there are several groups
of people who have strong opinions on different ways to solve the
problem and generally could not agree with each other. I also found
that the discussions often drifted and lost focus on the original
problem, but every now and then someone would mention the problem
again and propose a solution similar to an earlier proposal.

Nevertheless, the discussions were extremely informative and
insightful. I learned a lot just by reading through them. But since I
intend to start a GSoC project on this subject, I am unsure what I
should actually do in this project, as I feel that there is no
general agreement in the Boost community on how to solve this problem.
Although I have some ideas and I personally lean towards proposal (1)
by Chad Nelson, I think it would be best if my project fits the
interests of the majority of Boost community members. I also think it
is best to avoid reopening the discussion on this topic and instead
actively make the design decisions, as the time period for GSoC is
limited and the discussion tends to be never-ending.

I think there are several options that I can choose from for my
project:
1. Use Chad Nelson's code as a base, try to incorporate other ideas
proposed on the mailing list, integrate it with Boost.Locale, and bring
it up to Boost quality to submit for review. If this option is chosen,
I hope that Chad Nelson can be my mentor.
2. Start a new code base, gather and compile the ideas suggested on the
mailing list, have the final design decisions made by me and my mentor
rather than the community (to keep the project moving fast), bring it
up to Boost quality, and submit it for review.
3. Start the boost::string project, in which a better string is
designed from scratch to fix all the weaknesses of std::string.
4. Adopt a different proposal and improve an existing project such as
Boost.Unicode [2] or Boost.Locale [3] so that it really solves the
encoding awareness problem.
5. Any other suggestions?

I hope to get feedback from you on what I should really focus on in
this project. Of course, I also hope that this subject is mature enough
to be accepted as a GSoC project, as I can see great interest in the
community in solving this problem. I would also like to clarify
that I do not intend to solve actual Unicode handling problems in this
project - there are already excellent libraries such as Boost.Locale
designed for that. My main objective is to design a set of interfaces
that help to ensure encoding correctness and consistency when strings
are passed between different functions. I look forward to hearing from
anyone who is interested in this project and willing to be my
mentor.
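
As a rough illustration of the kind of interface I have in mind, the
sketch below shows a function boundary where the caller must convert
explicitly before passing a string on. All names here (utf8_text,
latin1_text, to_utf8, store) are hypothetical. The hand-written
Latin-1 conversion is only there to keep the example self-contained;
in a real design I would expect the conversion itself to be delegated
to an existing library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding-tagged types; each is just a labelled byte string.
struct utf8_text { std::string bytes; };
struct latin1_text { std::string bytes; };

// Explicit, named conversion. Every Latin-1 byte is the code point
// U+0000..U+00FF, so it needs at most two bytes in UTF-8.
utf8_text to_utf8(const latin1_text& in) {
    utf8_text out;
    for (unsigned char c : in.bytes) {
        if (c < 0x80) {
            // ASCII range: one byte, unchanged.
            out.bytes.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: lead byte 110xxxxx, continuation byte 10xxxxxx.
            out.bytes.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.bytes.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Hypothetical library function that only accepts UTF-8 text.
// It returns the number of bytes it was given.
std::size_t store(const utf8_text& s) { return s.bytes.size(); }
```

A caller holding Latin-1 data would then write store(to_utf8(input)),
which makes the conversion visible at the call site instead of leaving
it as an unstated assumption.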

Lastly, I apologize for my grammar and for any misunderstanding
caused by my writing. Please do correct me if I have missed
anything or misunderstood some aspects. I will write a complete and
formal proposal once I hear feedback from you.

Thank you very much. I hope that I can start contributing to the
Boost community!

Best Regards,

Soares Chen
National University of Singapore

References:
[1] The Oak Circle C++ (Unicode) Toolkit, by Chad Nelson.
http://www.oakcircle.com/toolkit.html
[2] Boost.Unicode, by Mathias Gaunard.
http://mathias.gaunard.com/unicode/doc/html/
[3] Boost.Locale. http://cppcms.sourceforge.net/boost_locale/html/index.html
[4] Should UTF-16 be considered harmful?, Stack Overflow.
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
[5] Always treat std::strings as UTF-8?, Boost Mailing List
Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/13966c1a3d4ceadd/1be0173d252deb62
[6] What will string handling in C++ look like in the future, Boost
Mailing List Discussion.
http://groups.google.com/group/boost-developers-archive/browse_thread/thread/deed8f95125dce02/c6e517b77f403eda
[7] [string] proposal, Boost Mailing List Discussion.
http://groups.google.com/group/boost-devel-archive/browse_thread/thread/f8516df28af22c4b/400f2e616de10ef0

(Sorry for linking to the mailing list archive on Google Groups, but I
feel that Google Groups provides a better interface for reading the
archives for those who haven't read the discussions.)
