
Hi everyone,

Thank you very much, and I appreciate all your feedback! :)

I have talked privately with Chad Nelson and Mathias Gaunard over the past few days and they have given me a lot of useful suggestions. Based on their feedback and some study of Chad's Unicode library, Boost.Unicode and Boost.Filesystem, I have come up with some ideas on what kind of library I should build in this GSoC project.

== Observation ==

Before I go into the concept of the library I'm proposing, I would like to point out a few observations.

First, for Chad's code, I noticed that his utf*_t classes have signatures similar to the following:

    class utf8_t  : public specialized_string_t<utf8_t,  std::basic_string<char>>
    class utf16_t : public specialized_string_t<utf16_t, std::basic_string<char16_t>>
    class utf32_t : public specialized_string_t<utf32_t, std::basic_string<char32_t>>

where char16_t and char32_t are custom typedefs for 16-bit and 32-bit characters when not compiling as C++0x. Notice that the classes are all derived from a template called specialized_string_t that provides a generic interface to access the underlying string. This makes it possible to add Unicode encoding semantics to any string class that only handles raw bytes, by creating new template instances following the pattern `specialized_string_t<ClassName, RawStringContainerClass>`.

This pattern is actually somewhat similar to the view<> concept mentioned by Dean Michael Berris in the boost::string discussion. Dean's view concept has the signature `class view<Encoding>` and wraps the proposed boost::string as its underlying container. Notice that the view template could be generalized to wrap other strings, such as std::string, by adding one template parameter to make it `class view<Encoding, StringT>`.

In the boost::string discussion it was also generally agreed that a string class should really just be a dumb container that stores raw bytes and does not care about the meaning of those bytes. This is also why even the newly proposed boost::string class (now called Boost.Chain) does not attempt to add Unicode semantics; instead, the view<> class is used one level above boost::string to add encoding semantics to the raw string container.

This pattern can also be seen in Boost.Filesystem, which uses a special class to represent a path rather than the raw std::basic_string<> variants. The path class has the following signature:

    template <class StringT, class PathTraits> class basic_path;

where StringT is the type of the internal raw string container, and PathTraits contains two conversion functions that know how to convert one type of external (incoming) string into the type of the underlying string container. This allows Boost.Filesystem's developers to choose a consistent internal string format, such as a 16-bit wchar_t string, while still being able to compare it against other string formats, such as the 8-bit std::string.

There is one inefficiency I notice in the basic_path design: the path traits are restricted to converting between only two string types, instead of converting arbitrary external string types to one internal string type. This means, for example, that if the developer chose path traits that convert between 8-bit and 16-bit character strings, it is not possible to implicitly convert a 32-bit character string into that path type. Fortunately this is probably fine for Boost.Filesystem, as at this moment Windows uses 16-bit wchar_t* in its filesystem API while all other operating systems use 8-bit char*.
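To illustrate that restriction concretely, here is a rough, purely illustrative sketch of the two-type traits pattern described above. The names are mine and this is not the actual Boost.Filesystem interface:

    #include <string>

    // Illustrative only -- NOT the actual Boost.Filesystem path traits.
    struct illustrative_path_traits
    {
        typedef std::wstring internal_string_type; // what basic_path stores
        typedef std::string  external_string_type; // the only other type it converts

        // Conversions are defined between exactly these two types...
        static void convert(const external_string_type& from, internal_string_type& to);
        static void convert(const internal_string_type& from, external_string_type& to);

        // ...so a third string type, say std::basic_string<char32_t>, cannot be
        // fed to basic_path without a manual conversion first.
    };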
However, the design does not scale to general usage, as there are definitely more than two string types in use in C++ today. It could also become a problem for Boost.Filesystem one day in the future, if some OS developers ever decide to use 32-bit character strings in their filesystem API. (Well, it most probably will never happen. 16-bit ought to be enough for everyone, but who knows? :P)

== Proposal ==

Following the observations above, there are two conclusions that I can draw:

1. A string class should be a dumb container for character sequences. Its main focus is to enable manipulation of the character sequences, and it should not concern itself with the meaning of those characters.

2. It is generally agreed that it is good to have higher-level classes that take care of the meaning of the characters in a string. These classes, which I'll call _string wrapper classes_ or _string adapter classes_, wrap around the raw string classes and bring in semantics in the form of character encoding. A string wrapper class does not care about the internal workings of character manipulation, but makes sure that the end result of any manipulation is always semantically valid.

I believe the string wrapper pattern has probably been applied in places I do not know about, but as far as I know I have not seen this pattern formally studied or designed as a general solution. So I would like to take this opportunity to design a Unicode string wrapper library that operates one level above the raw string classes and brings consistent UTF encoding semantics to those string classes.

The string wrapper class that I propose will have the following signature:

    template <typename StringT, typename StringTraits, typename EncodingTraits, typename Policy>
    class unicode_string_adapter;

where:

- StringT is any kind of string class, of any character size, that may or may not have encoding semantics, including but not limited to std::basic_string<>.

- StringTraits generalizes the interface for accessing a given string type: getting the code unit iterator/range, modifying string content, appending characters, concatenating strings, copying strings, and creating/destroying strings. It also provides type information such as the character type, character size, and character traits.

- EncodingTraits provides a generic interface for processing code unit iterators. The interface accepts a generic character type and compares characters through their CodeUnitTraits, which provides a generic way to access character information. An encoding traits class must at least specify the code unit size, which determines whether the string is encoded in UTF-8/16/32. Note that with this generic interface it is even possible to create encoding traits that process non-Unicode encodings, and thus make a non-Unicode string "pretend" to act like a Unicode string, although this is definitely not within the initial scope of this project.

- Policy is the policy class that handles errors that occur during Unicode processing. When an invalid Unicode code point is found, the policy class determines whether to throw an exception, ignore it, replace the code point, or do something else.
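To make the shape of these four parameters a little more concrete, here is a rough sketch of how they might fit together. Everything below is hypothetical, with my own illustrative names rather than a finished design:

    #include <string>
    #include <cstddef>

    // Hypothetical StringTraits, specialized here for std::basic_string<char>.
    template <typename StringT>
    struct string_traits;

    template <>
    struct string_traits< std::basic_string<char> >
    {
        typedef char                                    char_type;
        typedef std::basic_string<char>::const_iterator code_unit_iterator;
        static const std::size_t code_unit_size = 8;    // bits per code unit

        static code_unit_iterator begin(const std::basic_string<char>& s) { return s.begin(); }
        static code_unit_iterator end(const std::basic_string<char>& s)   { return s.end(); }
        static void append(std::basic_string<char>& s, char_type c)       { s.push_back(c); }
    };

    // Hypothetical EncodingTraits, chosen by code unit size (8/16/32), which
    // would decode and encode code points over code unit iterators.
    template <std::size_t CodeUnitSize>
    struct utf_encoding_traits;

    // Hypothetical error-handling policies (discussed further below).
    struct throw_on_error_policy;
    struct replace_on_error_policy;

    // The proposed adapter, with defaults derived from the raw string type.
    template <typename StringT,
              typename StringTraits   = string_traits<StringT>,
              typename EncodingTraits = utf_encoding_traits<StringTraits::code_unit_size>,
              typename Policy         = throw_on_error_policy>
    class unicode_string_adapter;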
Upon completion, it should be trivial to define the commonly used Unicode string classes with simple typedefs:

    typedef unicode_string_adapter<std::string, .....>                 utf8_string;
    typedef unicode_string_adapter<std::basic_string<wchar_t>, .....>  utf16_string;
    typedef unicode_string_adapter<std::basic_string<char32_t>, .....> utf32_string;

It should also be possible to build adapters for other commonly used string types:

    typedef unicode_string_adapter<QString, .....>       utf16_qstring;
    typedef unicode_string_adapter<boost::chain, .....>  utf8_chain;       // for Dean's proposed Boost.Chain string class
    typedef unicode_string_adapter<const char*, .....>   utf8_raw_string;
    typedef unicode_string_adapter<UnicodeString, .....> utf16_icu_string; // ICU's Unicode string

== Benefits of Using unicode_string_adapter ==

So why should developers use a template instance of unicode_string_adapter in their library APIs instead of the plain old std::string? Sure, the added safety of encoding correctness is nice, but it is quite tedious to wrap everything just for that safety, as many people can be expected to complain.

However, there is another great benefit of wrapping raw strings inside unicode_string_adapter: it provides automatic conversion between any two template instances of unicode_string_adapter. This means that if the caller of a library uses a string format different from the one the library accepts, the implicit constructor of unicode_string_adapter will be called and the caller's string will be transparently converted into the library's string format. Here is a use case for a simple program that uses Qt's GUI framework to retrieve a file name from the user and load that file from the filesystem:

    // hypothetical functions
    utf16_qstring Qt::promptInput(utf16_qstring question);
    void Filesystem::loadFile(utf16_string path);
    utf8_string Config::getConfigValue(utf8_string key);

    int main()
    {
        ....
        utf8_string document_dir = Config::getConfigValue(utf8_string("doc_dir"));
        utf16_qstring file_name = Qt::promptInput(utf16_qstring("Enter file name: "));

        // implicit conversion between the three string types
        utf16_string file_path = document_dir + utf8_string("/") + file_name;
        Filesystem::loadFile(file_path);
        ....
    }

Here the program uses three libraries that were independently developed by different developers, and each of them has chosen a different string format for its API for various reasons. With the traditional std::string approach, the developer would have to manually convert the UTF-16 QString to std::string for the GUI prompt, then convert the std::string again to std::basic_string<wchar_t> for filesystem access. But with unicode_string_adapter, operator =() and operator +() are generalized so that all string conversions happen transparently. Not only does this significantly reduce the conversion code needed, it also gives developers the freedom to use their favorite string type without having to follow one "true way" of using Unicode strings. The template also makes it possible to create generic Unicode string processing utilities that accept any template instance and return the result in the same template instance type.
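For illustration, here is a minimal, simplified sketch of the machinery that would make the implicit conversions in the example above possible: a templated converting constructor plus a generic operator +(). This is only an outline of the intended shape, not an actual implementation:

    // Simplified sketch only; member bodies and traits plumbing are omitted.
    template <typename StringT, typename StringTraits, typename EncodingTraits, typename Policy>
    class unicode_string_adapter
    {
    public:
        // Implicit converting constructor: decodes the other adapter's content
        // through its EncodingTraits and re-encodes it with this one's.
        template <typename S2, typename ST2, typename ET2, typename P2>
        unicode_string_adapter(const unicode_string_adapter<S2, ST2, ET2, P2>& other);

    private:
        StringT raw_; // the wrapped raw string
    };

    // Generic concatenation: the right-hand side is converted to the left-hand
    // side's encoding before being appended, so mixed-encoding "+" just works.
    template <typename S1, typename ST1, typename ET1, typename P1,
              typename S2, typename ST2, typename ET2, typename P2>
    unicode_string_adapter<S1, ST1, ET1, P1>
    operator +(const unicode_string_adapter<S1, ST1, ET1, P1>& lhs,
               const unicode_string_adapter<S2, ST2, ET2, P2>& rhs);

With declarations along these lines, the file_path expression in the example would compile because each mixed-type operand is first converted through the templated constructor.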
As an example of such a generic Unicode processing utility, a non-modifying toUpper() could be declared roughly as:

    template <typename StringT, typename StringTraits, typename EncodingTraits, typename Policy>
    unicode_string_adapter<StringT, StringTraits, EncodingTraits, Policy>
        toUpper(const unicode_string_adapter<StringT, StringTraits, EncodingTraits, Policy>& arg);

== What Will I Do in This Project ==

For the main objective of this GSoC project, I will implement a complete generic version of unicode_string_adapter, together with the std::basic_string specializations for the UTF-8/16/32 encodings. I will use Mathias' Boost.Unicode as the back end for encoding and decoding Unicode characters. I will also provide use cases and test cases for each piece of functionality to make sure that the class can serve real-world needs.

If the main objective is completed and there is still time remaining, I will also implement as many template specializations as I can for string classes from other non-Boost projects, such as QString and ICU's UnicodeString. As this would require significant effort to study each of these potentially large libraries, I cannot guarantee the number of specializations I can finish in time. If there is still time remaining after I implement template specializations for all these string classes, I will help Mathias Gaunard improve his Boost.Unicode library.

Within the project period I wish to work with Chad Nelson as my mentor and develop the project as an independent library. After the project finishes, I would be glad to merge it with Boost.Unicode so that they live under one Boost project, depending on what Mathias thinks.

== Things to Consider ==

There are many things that I need to take into consideration when designing the class. Some of these problems are quite controversial and might bring intense discussion to the Boost community. However, they need to be resolved before the GSoC project period ends in order to produce a workable library. Until the answers are agreed on by the majority of the community, the following questions remain open ended and do not yet have clear answers.

Should code point replacement be allowed in the middle of a string? UTF-8 and UTF-16 both encode code points with variable length. If a code point in the middle of a string is replaced with a code point of larger encoded size, the replacement operation can turn into a much more expensive insertion operation. It could also invalidate iterators and cause undefined behavior.

What should the type for a single Unicode combining character or grapheme be? Unicode combining character sequences and graphemes (i.e. abstract characters) can consist of an arbitrary number of code points. This means that, unlike basic types such as char that can be placed on the stack, the value of even a single abstract character must live on the heap due to its variable size. Currently Boost.Unicode uses a range of code points to represent a single abstract character. However, ranges and iterators do not generally claim ownership of the underlying memory, so it is not possible to retain such a range beyond the string object's scope. Another option is to allow unicode_string_adapter to hold substring marks on its underlying string, so that abstract characters have the same type as unicode_string_adapter, with the original string and the abstract character string sharing the same string content behind the scenes. If this is the intent, then unicode_string_adapter should support a fast substring operation.
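Purely as an illustration of that shared-substring idea (hypothetical names, with traits, policy and many details omitted), the adapter could keep a shared handle to the raw buffer plus a pair of substring marks:

    #include <cstddef>
    #include <boost/shared_ptr.hpp>

    // Illustrative sketch only: the marks are shown as plain code unit
    // indices for simplicity.
    template <typename StringT>
    class adapter_with_substring_marks
    {
    public:
        explicit adapter_with_substring_marks(const StringT& s)
            : buffer_(new StringT(s)), first_(0), last_(s.size()) {}

        // Returns an adapter that shares the same raw buffer, so extracting an
        // abstract character does not allocate a new string.
        adapter_with_substring_marks substr(std::size_t first, std::size_t last) const
        {
            return adapter_with_substring_marks(buffer_, first, last);
        }

    private:
        adapter_with_substring_marks(boost::shared_ptr<const StringT> buffer,
                                     std::size_t first, std::size_t last)
            : buffer_(buffer), first_(first), last_(last) {}

        boost::shared_ptr<const StringT> buffer_; // shared, unmodified raw string
        std::size_t first_;                       // substring mark: first code unit
        std::size_t last_;                        // substring mark: one past the end
    };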
On the other hand, if the abstract character string has its own raw string buffer, then iterating over the characters becomes too expensive, as a dynamic memory allocation is required for each abstract character extracted. Yet another option is to give unicode_string_adapter extra space for a single code point, so that an abstract character string consisting of a single code point does not need to allocate dynamic memory; but doing so would make the code more complex and increase the object size as well.

Should unicode_string_adapter support a fast substring operation? As mentioned in the problem above, a substring operation should be supported if no type distinction is made between a multi-character string and a single combining character/grapheme string. However, there is a tradeoff in sharing the same string buffer: mutable operations become more expensive. If unicode_string_adapter holds a substring mark on its underlying string, then its end() iterator may not necessarily be the one-past-the-end iterator of the underlying string. The substring mark also increases the object size of unicode_string_adapter beyond the underlying string's object size, making the adapter more expensive to copy. There are two possible ways to mark the substring region of a unicode_string_adapter: either by index or by iterator range.

Should unicode_string_adapter be immutable? Mutable operations are often hard to code and error-prone, while functional programming has shown that immutable types not only work but also reduce the chance of making mistakes. In the boost::string discussion, the immutable string design also received support from many members. It is possible to make unicode_string_adapter immutable while mutable operations are performed by retrieving its underlying string; however, doing so would make unicode_string_adapter lose the ability to perform transparent string concatenation between arbitrary string types and encodings.

Should unicode_string_adapter be append-only? One alternative to a fully immutable string is to make unicode_string_adapter append-only. As character replacement can potentially turn into an insertion operation, and an insertion operation is almost as expensive as creating a new string, it may be better to simply create a new string. The append operation, however, is more valuable, as it allows users to build a new string in steps by appending character by character.

How should unicode_string_adapter handle an underlying immutable string type? The string type behind unicode_string_adapter can be immutable: either the string is inherently immutable, like the one proposed in Boost.Chain, or the string type carries a const modifier. In that case the mutable operations of unicode_string_adapter, if any, should be disabled using template techniques (such as SFINAE with enable_if) that I have not yet fully learned.

Should invalid code points be preserved in the raw string? When unicode_string_adapter is constructed from a raw string, or when new content is inserted or appended, it is possible that the raw content contains invalid code points. Things are easier if the class has an exception-throwing policy; however, for a code-point-replacement policy it is not clear whether unicode_string_adapter should modify the raw content, or replace the offending code points on the fly when users access the content through iterators. The benefit of preserving the raw string is that there is no loss of information, and it also works with immutable string types.
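To make the policy question more concrete, here is a hypothetical sketch of the two error-handling policies mentioned earlier; a replacing policy like this could be applied either when the raw content is stored or lazily as the content is read through the code point iterators:

    #include <stdexcept>
    #include <boost/cstdint.hpp>

    typedef boost::uint32_t code_point;

    // Hypothetical policy: reject invalid input outright.
    struct error_policy_throw
    {
        static code_point handle_invalid()
        {
            throw std::runtime_error("invalid Unicode code point");
        }
    };

    // Hypothetical policy: substitute U+FFFD REPLACEMENT CHARACTER and continue.
    struct error_policy_replace
    {
        static code_point handle_invalid()
        {
            return 0xFFFD;
        }
    };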
Even though the preservation question can be factored into the policy-based design, it is still desirable to choose a sensible default policy.

Should the constructor that accepts the original raw string be implicit or explicit? An implicit constructor has to make an assumption about the encoding of the raw string, but it allows a library to change its API without breaking an old code base that passes raw strings such as std::string as parameters. On the other hand, an explicit constructor forces the caller of the library API to explicitly state the encoding of the string, guaranteeing that the correct encoding is used. But this makes it hard for existing libraries to migrate to a new version of an API that uses unicode_string_adapter, as it will break existing code unless the old API co-exists with the new API to ease the migration.

Should the raw string be accessible via operator *() or a custom named function? Chad Nelson's original utf*_t string classes have operator *() to access the underlying std::basic_string object. However, this was generally not accepted by the community, as the utf*_t classes were seen as alternative string classes contending to replace std::basic_string, rather than as higher-level adapters for std::basic_string. As my observations above suggest, the original utf*_t classes and unicode_string_adapter actually operate one level higher than the raw strings and complement them. In that case it makes sense that unicode_string_adapter "contains" its underlying string, and operator *() can be used to retrieve the actual content the class contains. I'd expect this question alone to attract quite a lot of debate from the community.

== Conclusion ==

I am sorry for making this proposal draft so long, but as the topic is quite controversial I needed to provide more solid arguments to support my project idea. I hope that it at least convinces you that it is better to have string wrapper classes to ensure encoding correctness. While some of you might still disagree with my proposed implementation, I believe that as long as we agree the project is worthwhile, an ideal implementation design can eventually emerge as the project starts and goes along.

I might have gotten some facts wrong, such as the basic_path usage in Boost.Filesystem. If I have made any mistakes, please feel free to correct me. Please also note that this is a draft proposal I am preparing to submit to GSoC, and the main objective of this thread is to get my project accepted into GSoC. The proposal is nowhere near complete or well thought out enough, and is not yet in any way ready to meet Boost's standards. If there are still many doubts and controversies, I think it might be better for me to write a technical report of some sort at the end of this project, covering all aspects of the problems and solutions around Unicode strings. It might take a long while for this to be accepted by everyone in Boost, but everything has to start somewhere, right? So I hope that this GSoC project will be the first step of the journey towards the ideal solution. :)

Thanks!

Best Regards,
Soares Chen

On Sat, Mar 19, 2011 at 12:01 AM, Artyom <artyomtnk@yahoo.com> wrote:
From: Soares Chen <crf@hypershell.org>
Hi all,
[snip]
I think there are several options that I can choose for my project:

1. Use Chad Nelson's code as a base, try to incorporate other ideas proposed on the mailing list, integrate with Boost.Locale, and bring it to Boost quality to submit for review. If this option is chosen, I wish for Chad Nelson to be my mentor.

2. Start a new code base, gather and compile ideas suggested on the mailing list, with final design decisions made by me and my mentor rather than the community (to keep the project moving fast), bring it to Boost quality and submit for review.

3. Start the boost::string project, where another, better string is reinvented to fix all the weaknesses of std::string.

4. Adopt a different proposal and improve an existing project such as Boost.Unicode [2] or Boost.Locale [3] so that it really solves the encoding awareness problem.

5. Any other suggestion?
Hello,
I want you to address several points:
It would be very hard to get consensus about the way to solve the problem.
Probably the best and most wishful-thinking solution is to assume that all strings are UTF-8 based; however, that is not the reality.
The problem is actually not the string but rather the way you code.
Even if you create a perfect UTF-8 string and then call
fopen(your_perfect_string.c_str(),"r")
Under Windows... And it would not work <sigh... damn Windows>
As you can see from multiple discussions, there are many contradicting requirements about how a string should look and what it should provide.
If you want to provide better Unicode awareness to Boost you don't need a new cool utf-XYZ string, you need a policy.
I think boost::filesystem v3 is a big step forward, it allows you to use UTF-8 strings on Windows which I think is a really good beginning.
This is my opinion.
Boost.Locale and several other projects of mine (CppCMS, CppDB) live happily with std::string.
The problem is that in the vast majority of cases you don't need an encoding-aware string, as many of the operations you usually do on strings are encoding-agnostic. But that is another story.
Bottom line, if you want to improve the Unicode awareness of Boost, I think you need to adopt a Boost.Filesystem v3-like policy all over the Boost code base.
1. Use Wide API as native one in Boost everywhere under Windows
2. Use char * API as native one in Boost everywhere under non-Windows platforms
3. Use std::codecvt to handle this (after many tricks... )
The Unicode string / encoding-aware string is the last thing to do, not the first.
Why?
1. Because you will never get consensus about what the "right thing" to do is (wide, narrow, utf-8, utf-16, etc.).
Projects that are handled and directed by a single source or management, like Qt, GTK(mm), Java, C#, Python, or others, may decide what the right thing is.
This will never happen in Boost, as it is too pluralistic, even in cases where that does not always make sense, simply because of the way libraries are developed, reviewed, and accepted - based on public reviews that ultimately encourage diversity.
2. Because you would not likely be able to force users to actually use your string, as Boost is more about collaboration than enforcement of a specific style.
3. Even the heavy discussions so far haven't reached any conclusion. So what would happen at the final review of your library?
My $0.02
Artyom