GSoC Proposal Preparation For Encoding Awared String

Hi all,

I am an undergraduate student from the National University of Singapore, and I am interested in taking part in this year's GSoC with the Boost community. I am currently preparing a proposal to add support for encoding awareness in strings through a new or existing Boost project, but I would first like to ask some questions to gauge the community's interest in such a project.

The inspiration came from several lengthy discussions I found in the Boost mailing list archive, which took place recently, from mid-January to mid-February. [5][6][7] As a brief overview, according to my own understanding, the heated debates were mainly about different ways to ensure consistency between the encoding that library code expects when it accepts strings and the actual encoding of the strings passed by users. The problem arises because a small minority of developers use std::string with encodings other than UTF-8, and the implicit assumption that std::string is UTF-8 encoded brings inconsistency and causes numerous bugs that are outside the scope of the Boost library.

From the discussion, I found several proposals that have been made to solve the inconsistencies of std::string encoding:

1. Create new classes that wrap std::string, std::u16string, and std::u32string, one for each encoding, and ensure encoding correctness simply through C++ type safety. The classes are tentatively called the utf*_t classes (a name that many disliked). Proposed by Chad Nelson, with a working prototype available. [1]
2. Continue to strongly enforce the assumption that all std::strings are UTF-8 encoded. Deprecate other encodings in std::string, or make them hard to use.
3. Reinvent std::string and introduce boost::string. The new string class is proposed to be immutable and to delegate encoding awareness to templated view<> classes that wrap boost::string. IMHO the view<> classes are similar to proposal (1). [7]

I will try not to go into the details and pros and cons of each proposal, to avoid turning this thread into yet another string discussion. The original discussions have 542 messages in total and spanned a whole month, and I failed to find any conclusion that everyone could agree on. What I noticed is that there are several groups of people with strong opinions on different ways to solve the problem, and they could not generally agree with each other. I also found that the discussions often drifted and lost focus on the original problem, but every now and then someone would mention the problem again and propose a solution similar to an earlier proposal.

Nevertheless, the discussions were extremely informative and insightful, and I learned a lot just by reading through them. But since I intend to start a GSoC project on this subject, I am unsure what I should actually do in the project, as I feel there is no general agreement in the Boost community on how to solve this problem. Although I have some ideas and I personally lean towards proposal (1) by Chad Nelson, I think it would be best if my project fits the interests of the majority of Boost community members. I also think it is best to avoid further discussion on this topic and to actively make the design decisions, as the time period for GSoC is limited and the discussion tends to be never-ending.

I think there are several options that I can choose from for my project:

1. Use Chad Nelson's code as a base, try to incorporate other ideas proposed on the mailing list, integrate with Boost.Locale, and bring it up to Boost quality to submit for review. If this option is chosen, I hope that Chad Nelson can be my mentor.
2. Start a new code base, gather and compile the ideas suggested on the mailing list, have the final design decisions made by me and my mentor rather than by the community (to keep the project moving quickly), bring it up to Boost quality, and submit it for review.
3. Start the boost::string project, in which a better string is reinvented that fixes all the weaknesses of std::string.
4. Adopt a different proposal, and improve an existing project such as Boost.Unicode [2] or Boost.Locale [3] so that it really solves the encoding awareness problem.
5. Any other suggestions?

I hope to get feedback from you on what I should really focus on in this project. Of course, I also hope that this subject is mature enough to be accepted as a GSoC project, as I can see great interest in the community in solving this problem. I would also like to clarify again that I do not intend to solve actual Unicode handling problems in this project; there are already excellent libraries, such as Boost.Locale, designed for that. My main objective is to design a set of interfaces that help ensure encoding correctness and consistency when strings are passed between different functions. I look forward to hearing from anyone who is interested in this project and is willing to be my mentor.

Lastly, I apologize for my grammar and for any misunderstanding caused by my writing. Please do correct me if I have missed anything or misunderstood some aspects. I will write a complete and formal proposal once I hear feedback from you.

Thank you very much, and I hope that I can start contributing to the Boost community!

Best Regards,
Soares Chen
National University of Singapore

References:
[1] The Oak Circle C++ (Unicode) Toolkit, by Chad Nelson. http://www.oakcircle.com/toolkit.html
[2] Boost.Unicode, by Mathias Gaunard. http://mathias.gaunard.com/unicode/doc/html/
[3] Boost.Locale. http://cppcms.sourceforge.net/boost_locale/html/index.html
[4] Should UTF-16 be considered harmful?, Stack Overflow. http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
[5] Always treat std::strings as UTF-8?, Boost Mailing List Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/13966c1a3d4ceadd/1be0173d252deb62
[6] What will string handling in C++ look like in the future, Boost Mailing List Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/deed8f95125dce02/c6e517b77f403eda
[7] [string] proposal, Boost Mailing List Discussion. http://groups.google.com/group/boost-devel-archive/browse_thread/thread/f8516df28af22c4b/400f2e616de10ef0

(Sorry for linking to the mailing list archive on Google Groups, but I feel that Google Groups provides a better interface for reading the archives for those who haven't read the discussions.)

Just to add one more proposal that I missed: I remember that Mathias Gaunard suggested somewhere in the discussion using a range of two iterators to represent code points, characters, and strings. I don't remember the exact details, but I think that is at least how Boost.Unicode represents Unicode strings, using boost::iterator_range. I see similarities with the other proposals, but I'll leave that for later discussion. Mathias Gaunard, please correct me if I have misunderstood. Thanks.

On Wed, Mar 16, 2011 at 1:20 AM, Soares Chen <crf@hypershell.org> wrote:
[...]

Hi Soares,
I am an undergraduate student from the National University of Singapore and I am interested to take part in this year's GSoC with the Boost community.
Welcome! This is quite possibly the most comprehensive summary of a Boost discussion by a prospective GSoC student I've ever seen. Be sure to include this in your proposal as part of the background research (or a summary thereof).
I think there are several options that I can choose for my project: 1. To use Chad Nelson's code as base, try to incorporate other ideas proposed in the mailing list, integrate with Boost.Locale, and make it Boost quality to submit for review. If this option is chosen, I wish that Chad Nelson can be my mentor.
This seems feasible. Unfortunately, I don't know if Chad is available or willing to be a mentor.
2. To start a new code base, gather and compile ideas suggested in mailing list, final design decisions made by me and my mentor but not the community (to keep the project going on fast), make it Boost quality and submit for review.
I have a feeling that this will be a part of any project that you propose.
3. To start the boost::string project, where another better string is reinvented and fix all the weaknesses of std::string.
I think that this project may end up being a minefield. Everybody has their favorite string characteristics, and whatever string you eventually implement will eventually fail somebody's requirements :)
4. Adopt different proposal, and improve on existing project such as Boost.Unicode [2] or Boost.Locale [3] such that it really solves the encoding awareness problem.
This also seems feasible. Was there any consensus on why std::string could or could not be parameterized with UTF-specific character traits? That seems, on the surface, like a possible solution. I was following the discussions but not as closely as others :)
I hope to get feedback from you on what should I really focus on in this project.
Hopefully, one of the participants in those discussions will be able to provide better feedback than I can.
I think that one thing you should consider in your proposal is how you actually want to use your library. Consider trying to design your interface from a user's perspective. I think that there is sometimes a tendency to focus on the technical aspects of a library, and it's easy to forget the end goal of writing a library: so somebody else can use it.
Please don't let a lack of communication dissuade you from submitting a proposal. This list can be high traffic and it's easy to miss good posts.
Best regards,
Andrew Sutton

On Thu, 17 Mar 2011 13:36:31 -0500 Andrew Sutton <asutton.list@gmail.com> wrote:
I think there are several options that I can choose for my project: 1. To use Chad Nelson's code as base, try to incorporate other ideas proposed in the mailing list, integrate with Boost.Locale, and make it Boost quality to submit for review. If this option is chosen, I wish that Chad Nelson can be my mentor.
This seems feasible.
Unfortunately, I don't know if Chad is available or willing to be a mentor.
I'm willing, and I should be available, depending on the time it requires. Up to a couple hours a day shouldn't be a problem.
[...] Was there any consensus on why std::string could or could not be parameterized with UTF-specific character traits? That seems, on the surface, like a possible solution. I was following the discussions but not as closely as others :)
I believe I mentioned trying it, and Artyom responded with some well-thought-out reasons why it wouldn't work, but I can't find the message now.
--
Chad Nelson
Oak Circle Software, Inc.

________________________________
From: boost-bounces@lists.boost.org on behalf of Chad Nelson
Sent: Thu 3/17/2011 6:02 PM
To: boost@lists.boost.org
Subject: Re: [boost] GSoC Proposal Preparation For Encoding Awared String

On Thu, 17 Mar 2011 13:36:31 -0500 Andrew Sutton <asutton.list@gmail.com> wrote:
[...] Was there any consensus on why std::string could or could not be parameterized with UTF-specific character traits? That seems, on the surface, like a possible solution. I was following the discussions but not as closely as others :)
I believe I mentioned trying it, and Artyom responded with some well-thought-out reasons why it wouldn't work, but I can't find the message now.
I think I saw some discussion of this in regard to Glib::ustring, in glibmm.
http://library.gnome.org/devel/glibmm/2.24/classGlib_1_1ustring.html#_detail...
Matt

Hi Andrew,
Welcome!
This is quite possibly the most comprehensive summary of a Boost discussion by a prospective GSoC student I've ever seen. Be sure to include this in your proposal as part of the background research (or a summary thereof).
Thanks for the praise! But I think you can probably take that back now, as several other excellent proposals have been posted since GSoC's mentor organization list was announced. :)
I think that this project may end up being a minefield. Everybody has their favorite string characteristics, and whatever string you eventually implement will eventually fail somebody's requirements :)
Yes, indeed, this could be quite a dangerous project. Hopefully it will go better with the new proposal that I've just posted.
Was there any consensus on why std::string could or could not be parameterized with UTF-specific character traits? That seems, on the surface, like a possible solution. I was following the discussions but not as closely as others :)
I think the problem is that character traits operate on code units, but code unit != code point. Also, because a UTF-8 code point is encoded as a variable number of code units, it makes no sense to compare two individual code units for equality. Nor can character traits prevent invalid code points from getting into the string.
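As a rough illustration of the code unit versus code point distinction, here is a minimal C++ sketch. The names count_code_points and demo are hypothetical and made up for this example; count_code_points assumes well-formed UTF-8 and does no validation, which is exactly the gap described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes well-formed input; no validation.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical self-check; call it from main() to try the sketch.
inline void demo() {
    // "naive" with a diaeresis: U+00EF is encoded as the code units C3 AF.
    const std::string s = std::string("na") + "\xC3\xAF" + "ve";
    assert(s.size() == 6);               // six code units
    assert(count_code_points(s) == 5);   // five code points

    // char_traits only ever sees individual code units.
    assert(std::char_traits<char>::length("\xC3\xAF") == 2);

    // std::string accepts bytes that can never occur in UTF-8, such as 0xFF.
    const std::string bad("\xFF");
    assert(bad.size() == 1);
    assert(count_code_points(bad) == 1); // counted blindly, not rejected
}
```

Nothing in std::string or std::char_traits<char> objects to the last string, so any validity guarantee would have to come from a layer above them.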
I think that one thing you should consider in your proposal is how you actually want to use your library. Consider trying to design your interface from a user perspective. I think that there is sometimes a tendency to focus on the technical aspects of a library, and It's easy to forget the end goal of writing a library: so somebody else can use it.
Thanks for pointing that out. Yes, I will include different use cases and test cases in the project to make sure that the code can fulfill real-world needs. I might also fork some Boost projects and make minor modifications to their APIs to show the benefits of using my unicode_string_adapter over plain old std::string.
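To give a feel for the kind of API change meant here, the following is a minimal sketch of a type-safe wrapper, written under my own assumptions. The names utf8_string, is_valid_utf8, and byte_length are hypothetical; this is not the actual interface of unicode_string_adapter or of Chad Nelson's utf*_t classes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical validator: accepts only well-formed UTF-8 (rejects truncated
// sequences, overlong forms, surrogates, and values above U+10FFFF).
inline bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                     // stray continuation or bad lead
        if (i + len > n) return false;         // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}

// Hypothetical wrapper: an instance can only exist if its bytes are valid UTF-8.
class utf8_string {
public:
    // Throws std::invalid_argument if the bytes are not well-formed UTF-8.
    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {
        if (!is_valid_utf8(bytes_)) throw std::invalid_argument("not valid UTF-8");
    }
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

// A library function that states its encoding requirement in its signature.
inline std::size_t byte_length(const utf8_string& s) { return s.bytes().size(); }
```

Because the constructor is explicit, a plain std::string cannot be passed to byte_length by accident; the caller has to write utf8_string(...) and thereby opt in to the encoding check at the boundary.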
Please don't let a lack of communication dissuade you from submitting a proposal. This list can be high traffic and its easy to miss good posts.
Thanks for the tip. I think I'll try to make my posts shorter next time, so that more people have time to read and reply. :)
participants (4)
- Andrew Sutton
- Chad Nelson
- Gruenke, Matt
- Soares Chen