
From: Soares Chen Ruo Fei <crf@hypershell.org>
To: boost@lists.boost.org
Sent: Tuesday, August 9, 2011 10:53 AM
Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter

My post has probably slipped under the radar, so I'm going to bump it. Please feel free to criticize if you think that my library has any fundamental design flaw. As a student and GSoC participant, I think the most important thing for me is to learn what I did wrong in the project so that I do not repeat the same mistakes, and to gain enough experience to make useful contributions to the open source community in the future.
Any feedback is much appreciated. Thanks.

Hello,

First of all, I should say that, as the author of the Boost.Locale library, I have a very strong opinion on how strings and Unicode should be handled. My strong opinion is:

a. Strings should be just container objects with a default encoding and some useful API to handle them.
b. The default encoding MUST be UTF-8.
c. There are several ways to implement strings: COW, mutable, immutable, with small string optimization, and so on. One way or another, std::string is the de facto string type, and I think we should live with it and use alternative containers where it matters.
d. Code points and code units are meaningless unless you are developing a Unicode algorithm, and you don't: you use one written by experts.

So my biggest problem is motivation:
-----------------------------------

The main reason that Boost.Ustr is developed is because current raw string types such as std::string require developers to make assumptions about the encoding of the string content, such as UTF-8 for std::string. This creates inconsistency when a string passed to library APIs has a different encoding from the one the library expects.

Ustr does not solve this problem, as it does not really provide some kind of adapter<generic encoding> { string content }. That kind of thing may be useful, but not in this case. Basically, your library provides a wrapper around a string and outputs Unicode code points, but it does so for UTF encodings only! That does not bring much benefit.

You provide encoding traits, but they are basically meaningless for the purpose you gave: the library does not provide traits for non-Unicode encodings like, say, Shift-JIS or ISO-8859-8.

BTW, you can't create traits for many encodings. For example, you can't implement the traits requirements

http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_enc...

for popular encodings like Shift-JIS or GBK... Homework: tell me why ;-)

Also, the encoding is likely to be something that can change at run time rather than at compile time, and it seems that this adapter does not support that.

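To make the run-time point concrete, here is a minimal sketch of a decoder whose encoding is an ordinary function argument rather than a template parameter. Everything in it is hypothetical and for illustration only: the `Encoding` enum, `decode`, and `check_decode` are not part of Boost.Ustr or Boost.Locale, ISO-8859-1 (Latin-1) stands in for a non-UTF encoding because its bytes map directly to code points, and the UTF-8 branch skips the checks for overlong forms and surrogates that a real decoder needs.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical names, for illustration only.
enum class Encoding { Latin1, Utf8 };

// Decode a byte string to code points; the encoding is chosen at run time.
std::vector<char32_t> decode(const std::string& bytes, Encoding enc)
{
    std::vector<char32_t> out;
    if (enc == Encoding::Latin1) {
        // In ISO-8859-1 every byte value is its own code point.
        for (unsigned char b : bytes) out.push_back(b);
        return out;
    }
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        char32_t cp;
        int trail;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        trail = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trail = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trail = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trail = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        for (; trail > 0; --trail) {
            if (i >= bytes.size()) throw std::runtime_error("truncated UTF-8");
            unsigned char c = static_cast<unsigned char>(bytes[i++]);
            if ((c & 0xC0) != 0x80) throw std::runtime_error("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// The same two bytes give different results under each encoding.
inline void check_decode()
{
    assert(decode("\xC3\xA9", Encoding::Utf8).size() == 1);    // U+00E9
    assert(decode("\xC3\xA9", Encoding::Latin1).size() == 2);  // U+00C3, U+00A9
}
```

The bytes C3 A9 decode to the single code point U+00E9 as UTF-8 and to two code points as Latin-1, so the caller has to know the encoding, and here that knowledge is plain run-time data.
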
The problem mainly arises because a small minority of developers use different encodings for the same string type.

If someone uses strings with different encodings, he usually knows their encoding...

The problem is that the API is inconsistent: on Windows a narrow string is in some ANSI code page, and everywhere else it is UTF-8. This is an entirely different problem, and such adapters don't really solve it but actually make it worse...

Other problem is
================

I don't believe that a string adapter would solve any real problems because:

a) If you iterate over code points, you are very likely doing something wrong, as code point != character. This is a very common mistake.

b) If you want to iterate over code points, it is better to have some kind of utf_iterator that receives a range and iterates over it. It would be more generic and would not require an additional class.

   For example, Boost.Locale has utf_traits, which allow you to implement iteration over code points quite easily. See:

   http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1...
   http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/structboost_1_1loc...

   And you don't need any kind of specific adapter.

c) The problem in Boost is not a missing Unicode string, and yet another Unicode string is not even required in order to have good Unicode support.

   The problem is policy: Boost just can't decide once and for all that std::string is UTF-8...

But don't get me wrong. This is my opinion; many would disagree with me.

=================================

Bottom line: Unicode strings, cool string adapters, UTF iterators, and even Boost.Unicode and Boost.Locale will not solve the problem that Boost libraries use inconsistent encodings on different platforms.

IMHO, the only way to solve it is POLICY.

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/
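
A small sketch of points (a) and (b) above, under stated assumptions: `count_code_points` and `check_counts` are hypothetical names, not Boost.Locale's utf_traits, and the input is assumed to be well-formed UTF-8. It shows a free function that works on any iterator range without an adapter class, and it shows why a code point count is not a character count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a range of UTF-8 code units.
// Assumes well-formed UTF-8; no validation is performed.
template <typename Iterator>
std::size_t count_code_points(Iterator first, Iterator last)
{
    std::size_t n = 0;
    for (; first != last; ++first) {
        // Every byte that is not a continuation byte (10xxxxxx)
        // starts a new code point.
        if ((static_cast<unsigned char>(*first) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// One user-perceived character, two spellings.
inline void check_counts()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT (UTF-8: CC 81).
    const std::string decomposed = "e\xCC\x81";
    // Precomposed U+00E9 (UTF-8: C3 A9).
    const std::string precomposed = "\xC3\xA9";

    assert(decomposed.size() == 3);                                         // code units
    assert(count_code_points(decomposed.begin(), decomposed.end()) == 2);   // code points
    assert(count_code_points(precomposed.begin(), precomposed.end()) == 1);
}
```

Both strings display as a single "é", yet they contain two and one code points respectively, so neither the code unit count nor the code point count tells you how many characters the user sees.
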