
15 Mar
2011
5:20 p.m.
Hi all,

I am an undergraduate student from the National University of Singapore, and I am interested in taking part in this year's GSoC with the Boost community. I am currently preparing a proposal to add encoding awareness to strings through a new or existing Boost project, but first I would like to ask some questions to gauge the community's interest in such a project.

The inspiration came from several lengthy discussions I found in the Boost mailing list archive, which took place recently, from mid-January to mid-February. [5][6][7] As a brief overview, according to my own understanding, the heated debates were mainly about different ways to ensure consistency between the encoding that library code expects when it accepts strings and the actual encoding of the strings passed by users. The problem arises because a small minority of developers use std::string with encodings other than UTF-8, and the implicit assumption that std::string is UTF-8 encoded brings inconsistency and causes numerous bugs that are outside the scope of the Boost library.

From the discussion, I found several proposals that have been made to solve the inconsistencies of std::string encoding:

1. Create new classes that wrap std::string, std::u16string, and std::u32string, one for each encoding, and ensure encoding correctness simply through C++ type safety. The classes are tentatively called the utf*_t classes (a name that many disliked). Proposed by Chad Nelson, with a working prototype available. [1]
2. Continue to strongly enforce the assumption that all std::strings are UTF-8 encoded. Deprecate other encodings in std::string, or make them hard to use.
3. Reinvent std::string and introduce boost::string. The new string class is proposed to be immutable and to delegate encoding awareness to templated view<> classes that wrap boost::string; IMHO the view<> classes are similar to proposal (1).
   [7]

Now, I will try not to go into the details and pros and cons of each proposal, to avoid turning this thread into yet another string discussion. The original discussions have 542 messages in total and spanned a whole month, and I failed to find any conclusion that everyone could agree on. What I noticed is that there are several groups of people who have strong opinions on different ways to solve the problem and who could not generally agree with each other. I also found that the discussions often drifted away and lost focus on the original problem, but every now and then someone would mention the problem again and propose a solution similar to earlier proposals.

Nevertheless, the discussions were extremely informative and insightful, and I learned a lot just by reading through them. But since I intend to start a GSoC project on this subject, I am hesitant about what I should really do in this project, as I feel there is no general agreement in the Boost community on how to solve this problem. Although I have some ideas and I personally lean towards proposal (1) by Chad Nelson, I think it will be best if my project fits the interests of the majority of Boost community members. I think it is also best to avoid further discussion on this topic and instead actively make the design decisions, as the time period for GSoC is limited and the discussion tends to be never-ending.

I think there are several options that I can choose from for my project:

1. Use Chad Nelson's code as a base, try to incorporate other ideas proposed on the mailing list, integrate with Boost.Locale, and bring it to Boost quality to submit for review. If this option is chosen, I wish that Chad Nelson could be my mentor.
2. Start a new code base, gather and compile the ideas suggested on the mailing list, have the final design decisions made by me and my mentor rather than by the community (to keep the project moving fast), bring it to Boost quality, and submit it for review.
3. Start the boost::string project, in which a better string class is reinvented to fix all the weaknesses of std::string.
4. Adopt a different proposal and improve an existing project such as Boost.Unicode [2] or Boost.Locale [3] so that it really solves the encoding awareness problem.
5. Any other suggestion?

I hope to get feedback from you on what I should really focus on in this project. Of course, I also hope that this subject is mature enough to be accepted as a GSoC project, as I can see great interest in the community in solving this problem.

I would also like to clarify again that I do not intend to solve actual Unicode handling problems in this project; there are already excellent libraries such as Boost.Locale designed for that. My main objective is to design a set of interfaces that help ensure encoding correctness and consistency when strings are passed between different functions.

I look forward to hearing from anyone who is interested in this project and willing to be my mentor. Lastly, I apologize for my grammar and for any misunderstanding caused by my poor writing. Please do correct me if I have missed anything or misunderstood some aspects. I will write a complete and formal proposal once I hear feedback from you.

Thank you very much, and I hope that I can start contributing to the Boost community!

Best Regards,
Soares Chen
National University of Singapore

References:

[1] The Oak Circle C++ (Unicode) Toolkit, by Chad Nelson. http://www.oakcircle.com/toolkit.html
[2] Boost.Unicode, by Mathias Gaunard. http://mathias.gaunard.com/unicode/doc/html/
[3] Boost.Locale. http://cppcms.sourceforge.net/boost_locale/html/index.html
[4] Should UTF-16 be considered harmful?, Stack Overflow. http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
[5] Always treat std::strings as UTF-8?, Boost Mailing List Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/13966c1a3d4ceadd/1be0173d252deb62
[6] What will string handling in C++ look like in the future, Boost Mailing List Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/deed8f95125dce02/c6e517b77f403eda
[7] [string] proposal, Boost Mailing List Discussion. http://groups.google.com/group/boost-devel-archive/browse_thread/thread/f8516df28af22c4b/400f2e616de10ef0

(Sorry for linking the mailing list archive to Google Groups, but I feel that Google Groups provides a better interface for reading the archives for those who haven't read the discussions.)
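
For readers who have not followed the threads, the type-safety idea behind proposal (1) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions: the names encoded_string, utf8_tag, latin1_tag, to_utf8, and count_code_points are hypothetical and are not taken from Chad Nelson's prototype or from any Boost library, and the constructor does not validate its input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical tag types that name an encoding (illustration only).
struct utf8_tag {};
struct latin1_tag {};

// Thin wrapper around std::string that carries the encoding in its type.
// Assumption: the caller guarantees the bytes really are in that encoding;
// this sketch performs no validation.
template <typename Encoding>
class encoded_string {
public:
    explicit encoded_string(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

typedef encoded_string<utf8_tag> utf8_string;
typedef encoded_string<latin1_tag> latin1_string;

// The two wrappers are unrelated types, so one cannot be passed where the
// other is expected without an explicit conversion.
static_assert(!std::is_convertible<latin1_string, utf8_string>::value,
              "encodings must not convert implicitly");

// Explicit conversion: every Latin-1 byte is the code point U+0000..U+00FF.
utf8_string to_utf8(const latin1_string& in) {
    std::string out;
    for (unsigned char c : in.bytes()) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return utf8_string(std::move(out));
}

// A "library" function that states its expected encoding in its signature.
std::size_t count_code_points(const utf8_string& s) {
    std::size_t n = 0;
    for (unsigned char c : s.bytes()) {
        if ((c & 0xC0) != 0x80) ++n;  // skip UTF-8 continuation bytes
    }
    return n;
}

// Usage: "cafe" with an e-acute is 4 bytes in Latin-1, 5 bytes in UTF-8.
inline void example() {
    const latin1_string legacy("caf\xE9");
    const utf8_string converted = to_utf8(legacy);
    assert(converted.bytes().size() == 5);
    assert(count_code_points(converted) == 4);
    // count_code_points(legacy);  // would not compile: wrong encoding type
}
```

The point of the sketch is only that a mismatch between the expected and the actual encoding becomes a compile-time error instead of a runtime bug; a real design would also have to decide on validation, on std::u16string and std::u32string support, and on how conversions are spelled.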