Re: [boost] GSoC Proposal Preparation For Encoding Awared String

Artyom

18 Mar 2011

4:01 p.m.

From: Soares Chen <crf@hypershell.org>

Hi all,

[snip]

I think there are several options that I can choose for my project:

1. To use Chad Nelson's code as base, try to incorporate other ideas proposed in the mailing list, integrate with Boost.Locale, and make it Boost quality to submit for review. If this option is chosen, I wish that Chad Nelson can be my mentor.
2. To start a new code base, gather and compile ideas suggested in mailing list, final design decisions made by me and my mentor but not the community (to keep the project going on fast), make it Boost quality and submit for review.
3. To start the boost::string project, where another better string is reinvented and fix all the weaknesses of std::string.
4. Adopt different proposal, and improve on existing project such as Boost.Unicode [2] or Boost.Locale [3] such that it really solves the encoding awareness problem.
5. Any other suggestion?

Hello,

I want to address several points:

It would be very hard to reach consensus about how to solve the problem.

Probably the best (and most wishful) solution is to assume that all strings are UTF-8, but that is not the reality.

The problem is actually not the string but rather the way you code.

Even if you create a perfect UTF-8 string and then call

    fopen(your_perfect_string.c_str(),"r")

under Windows, it will not work <sigh... damn Windows>.

As you can see from the many discussions, there are many contradictory requirements about what a string should look like and what it should bring with it.

If you want to provide better Unicode awareness in Boost, you don't need a cool new utf-XYZ string; you need a policy.

I think boost::filesystem v3 is a big step forward: it allows you to use UTF-8 strings on Windows, which I think is a really good beginning.

This is my opinion.

Boost.Locale and several of my other projects (CppCMS, CppDB) live happily with std::string.

The point is that in the vast majority of cases you don't need an encoding-aware string, as so many of the operations you usually do on strings are encoding agnostic. But that is another story.

Bottom line: if you want to improve the Unicode awareness of Boost, I think you need to adopt a Boost.Filesystem v3-like policy all over the Boost code base.

1. Use the wide API as the native one in Boost everywhere under Windows.
2. Use the char * API as the native one in Boost everywhere on non-Windows platforms.
3. Use std::codecvt to handle this (after many tricks...).

The Unicode string/encoding-aware string is the last thing to do, not the first.

Why?

1. Because you will never get consensus about what is the "right thing" to do (wide, narrow, UTF-8, UTF-16, etc.).

   Projects that are handled and directed by a single source or management, like Qt, GTK(mm), Java, C#, Python and others, can decide what the right thing is.

   This will never happen in Boost, as it is too pluralistic, even in cases where that does not always make sense, simply because of the way libraries are developed, reviewed and accepted: through public reviews, which eventually encourage diversity.

2. Because you are unlikely to be able to force users to actually use your string, as Boost is more about collaboration than enforcement of a specific style.

3. Even the heavy discussions so far have not reached any conclusion. So what would happen at the final review of your library?

My $0.02

Artyom
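The policy in the three numbered points above (UTF-8 at the caller's side, wide API on Windows, char * API elsewhere) might look roughly like the following. This is only a minimal sketch: `open_utf8` and `widen_utf8` are hypothetical names that appear nowhere in Boost, and the conversion uses the Win32 call `MultiByteToWideChar` rather than the std::codecvt route mentioned in point 3.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <wchar.h>
#include <windows.h>

// Hypothetical helper: UTF-8 -> UTF-16 via the Win32 API.
// Returns an empty string if the input is empty or not valid UTF-8.
static std::wstring widen_utf8(const std::string& s) {
    if (s.empty()) return std::wstring();
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      s.data(), static_cast<int>(s.size()),
                                      nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        s.data(), static_cast<int>(s.size()), &out[0], n);
    return out;
}
#endif

// Hypothetical wrapper: callers always pass UTF-8 in a plain std::string;
// the platform split stays inside this one function.
std::FILE* open_utf8(const std::string& path, const std::string& mode) {
#ifdef _WIN32
    // Windows: the wide API is treated as the native one.
    return _wfopen(widen_utf8(path).c_str(), widen_utf8(mode).c_str());
#else
    // Elsewhere: the char * API is the native one, bytes pass through as-is.
    return std::fopen(path.c_str(), mode.c_str());
#endif
}
```

With a wrapper like this the calling code never needs an encoding-aware string type; it only needs to agree that narrow strings are UTF-8.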

Soares Chen

24 Mar

3:14 a.m.

New subject: GSoC Proposal Preparation For Encoding Awared String

Hi everyone,

Thank you very much and I appreciate all your feedback! :)

I have talked privately with Chad Nelson and Mathias Gaunard over the past few days, and they have given me a lot of useful suggestions. Based on the feedback and some study of Chad's Unicode library, Boost.Unicode and Boost.Filesystem, I have come up with some ideas about what kind of library I should build in this GSoC project.

== Observation ==

Before I go into the concept of the library I'm proposing, I would like to point out a few observations.

First, for Chad's code, I notice that his utf*_t classes have signatures similar to the following:

    class utf8_t : public specialized_string_t<utf8_t, std::basic_string<char>>
    class utf16_t : public specialized_string_t<utf16_t, std::basic_string<char16_t>>
    class utf32_t : public specialized_string_t<utf32_t, std::basic_string<char32_t>>

where char16_t and char32_t are custom typedefs for 16-bit and 32-bit characters when not compiling as C++0x. Notice that the classes are all derived from a template called specialized_string_t, which has a generic interface for accessing the underlying string. This makes it possible to add Unicode encoding semantics to any string class that only handles raw bytes, by creating new template instances following the pattern `specialized_string_t<ClassName, RawStringContainerClass>`.

This pattern is actually somewhat similar to the view<> concept mentioned by Dean Michael Berris in the boost::string discussion. Dean's view concept has the signature `class view<Encoding>` and wraps the proposed boost::string as its underlying container. Notice that the view template can actually be generalized to wrap other strings, such as std::string, by adding one template parameter to make it `class view<Encoding, StringT>`.

In the boost::string discussion, it was also generally agreed that a string class should really be just a dumb container that stores raw bytes and does not care about the meaning of those bytes.
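To make the `specialized_string_t<ClassName, RawStringContainerClass>` pattern concrete, here is a rough sketch of how such a base could forward to the derived class. It is not Chad's actual code: `specialized_string_sketch`, `utf8_sketch`, `utf32_sketch` and all member names are made up for illustration, and the UTF-8 counter assumes the input is already valid.

```cpp
#include <cstddef>
#include <string>

// Hypothetical base: owns the raw container and knows nothing about encodings.
// Derived supplies the encoding-specific logic (curiously recurring template).
template <class Derived, class RawString>
class specialized_string_sketch {
public:
    explicit specialized_string_sketch(const RawString& raw) : raw_(raw) {}

    // The untouched code units, for code that wants the dumb container.
    const RawString& raw() const { return raw_; }
    std::size_t code_unit_count() const { return raw_.size(); }

    // Forwarded statically to Derived, so no virtual call is involved.
    std::size_t code_point_count() const {
        return static_cast<const Derived&>(*this).count_code_points(raw_);
    }

private:
    RawString raw_;
};

class utf8_sketch : public specialized_string_sketch<utf8_sketch, std::string> {
public:
    explicit utf8_sketch(const std::string& raw)
        : specialized_string_sketch<utf8_sketch, std::string>(raw) {}

    // Counts every byte that is not a 10xxxxxx continuation byte.
    std::size_t count_code_points(const std::string& raw) const {
        std::size_t n = 0;
        for (std::size_t i = 0; i < raw.size(); ++i)
            if ((static_cast<unsigned char>(raw[i]) & 0xC0) != 0x80) ++n;
        return n;
    }
};

class utf32_sketch : public specialized_string_sketch<utf32_sketch, std::u32string> {
public:
    explicit utf32_sketch(const std::u32string& raw)
        : specialized_string_sketch<utf32_sketch, std::u32string>(raw) {}

    // In UTF-32 every code unit is one code point.
    std::size_t count_code_points(const std::u32string& raw) const {
        return raw.size();
    }
};
```

For the UTF-8 bytes of "héllo" (`"h\xC3\xA9llo"`), the first class reports six code units and five code points, while the UTF-32 form reports five of each; the raw container itself never needs to know.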
That "dumb container" principle is also why even the newly proposed boost::string class (now called Boost.Chain) does not attempt to add Unicode semantics. Instead, the view<> class is used one level above boost::string to add encoding semantics to the raw string container.

This pattern can also be seen in Boost.Filesystem, which uses a special class to represent a path rather than the raw std::basic_string<> variants. The path class has the following signature:

    template <class StringT, class PathTraits> class basic_path;

where StringT is the type of the internal raw string container, and PathTraits contains two conversion functions that know how to convert one type of external (incoming) string into the type of its underlying string container. This allows Boost.Filesystem's developers to choose a consistent internal string format, such as the 16-bit wchar_t, while still being able to compare it against other string formats, such as the 8-bit std::string.

There is one limitation I notice in the basic_path design: the path traits can only convert between two string types, rather than from arbitrary external string types to one internal string type. This means that, for example, if the developer chose path traits that convert between 8-bit and 16-bit character strings, then it is not possible to implicitly convert a 32-bit character string into that path type.

Fortunately this is probably fine for Boost.Filesystem because, at this moment, Windows uses 16-bit wchar_t* in its filesystem API while all other OSes use 8-bit char*. However, the design does not scale to general usage, as there are definitely more than two string types in use in C++ today. It could also cause problems for Boost.Filesystem one day, if some OS developers ever decide to use 32-bit character strings in their filesystem API. (Well, it most probably will never happen. 16-bit ought to be enough for everyone, but who knows?
:P)

== Proposal ==

Following the observations above, there are two conclusions that I can draw:

1. A string class should be a dumb container for character sequences. Its main focus is to enable manipulation of the character sequences, and it should not focus on the meaning of those characters.

2. It is generally agreed that it is good to have higher-level classes that take care of the meaning of the characters in a string. These classes, which I'll call _string wrapper classes_ or _string adapter classes_, wrap around the raw string classes to bring in semantics in the form of character encoding. A string wrapper class does not care about the internal workings of character manipulation, but makes sure that the end result of the manipulation is always semantically valid.

I believe that the string wrapper pattern has probably been applied in places I do not know about, but as far as I know, this pattern has not been formally studied or designed as a general solution. So I would like to take this opportunity to design a Unicode string wrapper library that operates one level above the raw string classes and brings consistent UTF encoding semantics to these string classes.

The string wrapper class that I propose will have the following signature:

    template <typename StringT, typename StringTraits, typename EncodingTraits, typename Policy>
    class unicode_string_adapter;

where StringT is any kind of string class with any character size, which may or may not have encoding semantics, including but not limited to std::basic_string<>.

StringTraits generalizes the interface for accessing a given string type, such as getting the code unit iterator/range, modifying string content, appending characters, concatenating strings, copying strings, and creating/destroying strings. It also provides type information such as the character type, character size, and character traits.
EncodingTraits provides a generic interface for processing code unit iterators. The interface can accept a generic character type and compare characters through their CodeUnitTraits, which provides a generic way to access character information. An encoding traits class must at least specify the code unit size, which determines whether the string is encoded in UTF-8/16/32. Note that with the generic interface, it is even possible to create encoding traits that process non-Unicode encodings, thus making a non-Unicode string "pretend" to act like a Unicode string, although this is definitely not within the initial scope of this project.

Policy is the policy class that handles errors that occur during Unicode processing. When an invalid Unicode code point is found, the policy class determines whether to throw an exception, ignore it, replace the code point, or do anything else.

Upon completion, it should be trivial to define commonly used Unicode string classes with simple typedefs:

    typedef unicode_string_adapter<std::string, .....> utf8_string;
    typedef unicode_string_adapter<std::basic_string<wchar_t>, .....> utf16_string;
    typedef unicode_string_adapter<std::basic_string<char32_t>, .....> utf32_string;

It should also be possible to build adapters for other commonly used string types:

    typedef unicode_string_adapter<QString, .....> utf16_qstring;
    typedef unicode_string_adapter<boost::chain, .....> utf8_chain; // for Dean's proposed Boost.Chain string class
    typedef unicode_string_adapter<const char*, .....> utf8_raw_string;
    typedef unicode_string_adapter<UnicodeString, ....> utf16_icu_string; // ICU's Unicode string

== Benefits of Using unicode_string_adapter ==

So why should developers use a template instance of unicode_string_adapter in their library APIs instead of plain old std::string? Sure, the added safety of encoding correctness is nice, but wrapping everything just for that safety is quite tedious, and most people can be expected to complain.
However, there is another great benefit of wrapping raw strings inside unicode_string_adapter: it provides automatic conversion between any template instances of unicode_string_adapter. This means that when the caller of a library uses a string format different from the one the library accepts, the implicit constructor of unicode_string_adapter is called and the caller's string is transparently converted into the library's string format.

Here is a use case: a simple program that uses Qt's GUI framework to get a file name from the user and load the file from the filesystem:

    // hypothetical functions
    utf16_qstring Qt::promptInput(utf16_qstring question);
    void Filesystem::loadFile(utf16_string path);
    utf8_string Config::getConfigValue(utf8_string key);

    int main() {
        ....
        utf8_string document_dir = Config::getConfigValue(utf8_string("doc_dir"));
        utf16_qstring file_name = Qt::promptInput(utf16_qstring("Enter file name: "));

        // implicit conversion between the three string types
        utf16_string file_path = document_dir + utf8_string("/") + file_name;

        Filesystem::loadFile(file_path);
        ....
    }

Here the program uses three libraries that were developed independently by different developers, each of whom chose a different string format for their API for various reasons. With the traditional std::string approach, the developer would have to manually convert the UTF-16 QString to std::string for the GUI prompt, then convert std::string again to std::basic_string<wchar_t> for filesystem access. With unicode_string_adapter, operator=() and operator+() are generalized so that all string conversions happen transparently. Not only does this significantly reduce the code needed to perform conversions, it also gives developers the freedom to use their favorite string type without having to follow the "one true way" of using Unicode strings.
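The transparent conversion described above can be approximated in a few dozen lines. What follows is a deliberately reduced sketch, not the proposed design: the adapter takes a single hypothetical traits parameter instead of the four in the proposal, every name ending in `_sketch` is invented here, only std::string and std::u16string are covered, and error handling is limited to a few thrown exceptions instead of a Policy class.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical encoding traits: decode one code point at s[i] (advancing i),
// or append one code point to out. Validation is intentionally minimal.
struct utf8_traits_sketch {
    typedef std::string string_type;
    static char32_t decode(const string_type& s, std::size_t& i) {
        const unsigned char lead = static_cast<unsigned char>(s[i++]);
        if (lead < 0x80) return lead;
        if (lead < 0xC0) throw std::runtime_error("stray UTF-8 continuation byte");
        int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        char32_t cp = lead & (0x3F >> extra);   // payload bits of the lead byte
        for (; extra > 0; --extra) {
            if (i >= s.size()) throw std::runtime_error("truncated UTF-8 sequence");
            cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        }
        return cp;
    }
    static void encode(char32_t cp, string_type& out) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};

struct utf16_traits_sketch {
    typedef std::u16string string_type;
    static char32_t decode(const string_type& s, std::size_t& i) {
        const char32_t unit = s[i++];
        if (unit < 0xD800 || unit > 0xDBFF) return unit;   // not a lead surrogate
        if (i >= s.size()) throw std::runtime_error("truncated surrogate pair");
        const char32_t trail = s[i++];
        return 0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00);
    }
    static void encode(char32_t cp, string_type& out) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
};

template <class Traits>
class adapter_sketch {
public:
    typedef typename Traits::string_type string_type;
    // Explicit: the caller states the encoding by choosing the adapter type.
    explicit adapter_sketch(const string_type& raw) : raw_(raw) {}
    // Implicit: any other adapter is transcoded code point by code point.
    template <class Other>
    adapter_sketch(const adapter_sketch<Other>& other) {
        const typename Other::string_type& src = other.raw();
        for (std::size_t i = 0; i < src.size(); )
            Traits::encode(Other::decode(src, i), raw_);
    }
    const string_type& raw() const { return raw_; }
private:
    string_type raw_;
};

// The result takes the left operand's encoding; the right one is transcoded.
template <class A, class B>
adapter_sketch<A> operator+(const adapter_sketch<A>& a, const adapter_sketch<B>& b) {
    return adapter_sketch<A>(a.raw() + adapter_sketch<A>(b).raw());
}

typedef adapter_sketch<utf8_traits_sketch> utf8_string_sketch;
typedef adapter_sketch<utf16_traits_sketch> utf16_string_sketch;
```

With these definitions, `utf16_string_sketch path = utf8_string_sketch("docs/") + utf16_string_sketch(u"a.txt");` compiles and transcodes twice: once inside operator+ and once in the converting constructor. That double conversion is one of the costs a real design would have to weigh.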
The template also makes it possible to write generic Unicode string processing utilities that accept any template instance and return the resulting Unicode string as the same template instance type. For example, a non-modifying Unicode toUpper() function:

    template< unicode_string_adapter<...> >
    unicode_string_adapter<...> toUpper(const unicode_string_adapter<...>& arg);

== What Will I Do in This Project ==

For the main objective of this GSoC project, I will implement a complete generic version of unicode_string_adapter, along with the std::basic_string specializations for UTF-8/16/32 encodings. I will use Mathias' Boost.Unicode as the back end for encoding and decoding Unicode characters. I will also provide use cases and test cases for each piece of functionality to make sure that the class can serve real-world needs.

If the main objective is completed in time and there is still time remaining, I will also implement as many template specializations as I can for string classes from non-Boost projects, such as QString and ICU's UnicodeString. As this would require significant effort to study each of these potentially large libraries, I cannot guarantee the number of specializations I can finish in time. However, if there is still time remaining after I implement those specializations, I will help Mathias Gaunard improve his Boost.Unicode library.

During the project period, I wish to work with Chad Nelson as my mentor and develop the project as an independent library. After the project finishes, I would be glad to merge it with Boost.Unicode into one Boost project, depending on what Mathias thinks.

== Things to Consider ==

There are many things that I need to take into consideration in designing the class. Some of these problems are quite controversial and might provoke intense discussion in the Boost community.
However, these problems need to be resolved before the GSoC project period ends in order to produce a workable library. Until the majority of the community agrees on the answers, the following questions remain open:

Should code point replacement be allowed in the middle of the string?

UTF-8 and UTF-16 both encode code points with variable length. If a code point in the middle of a string is replaced with a code point of larger size, the replacement can turn into a much more expensive insertion. It could also invalidate iterators and cause undefined behavior.

What should be the type for a single Unicode combining character sequence and grapheme?

Unicode combining character sequences and graphemes (a.k.a. abstract characters) can consist of an arbitrary number of code points. This means that, unlike basic types such as char that can be placed on the stack, the value of even a single abstract character must live on the heap because of its variable size.

Currently, Boost.Unicode uses a range of code points to represent a single abstract character. However, ranges and iterators generally do not claim ownership of the underlying memory, so it is not possible to keep a range beyond the lifetime of the string object.

One alternative is to allow unicode_string_adapter to hold substring marks on its underlying string, so that abstract characters have the same type as unicode_string_adapter, with the original string and the abstract character string sharing the same string content behind the scenes. If this is intended, then unicode_string_adapter should support a fast substring operation. On the other hand, if the abstract character string has its own raw string buffer, then iterating over the characters would become too expensive, as a dynamic memory allocation would be required for each abstract character extracted.
Another alternative is to allow unicode_string_adapter to hold extra space for a single code point, so that an abstract character string consisting of a single code point does not need to allocate dynamic memory. But doing so would make the code more complex and increase the object size as well.

Should unicode_string_adapter support a fast substring operation?

As mentioned above, the substring operation should be supported if no type distinction is made between a multi-character string and a single combining character/grapheme string. However, sharing the same string buffer comes with a tradeoff: mutating operations become more expensive.

If unicode_string_adapter has substring marks on its underlying string, then its end() iterator may not necessarily be the one-past-the-end iterator of the underlying string. The substring marks also increase the object size of unicode_string_adapter beyond that of the underlying string, making the adapter more expensive to copy. There are two possible ways to mark the substring region of a unicode_string_adapter: by index or by iterator range.

Should unicode_string_adapter be immutable?

Mutating operations are often hard to code and error prone, while functional programming has shown that immutable types not only work but also reduce the chance of making mistakes. In the boost::string discussion, the immutable string design also received support from many members. It is possible to make unicode_string_adapter immutable while still allowing mutation through its underlying string, but doing so would make unicode_string_adapter lose the ability to perform transparent string concatenation between arbitrary string types and encodings.

Should unicode_string_adapter be append-only?

One possible alternative to an immutable string is to make unicode_string_adapter append-only.
Since character replacement can turn into insertion, and insertion is almost as expensive as creating a new string, it may be better to simply create a new string. The append operation, however, is more valuable, as it allows users to build a new string in steps by appending character by character.

How should unicode_string_adapter handle an underlying immutable string type?

The string type behind unicode_string_adapter can be immutable. The string may be inherently immutable, like the one proposed in Boost.Chain, or the string type may have a const qualifier. In that case the mutating operations in unicode_string_adapter, if any, should be disabled using template techniques that I have not yet learned.

Should invalid code points be preserved in the raw string?

When a unicode_string_adapter is constructed from a raw string, or when new content is inserted or appended, the raw content may contain invalid code points. Things are easier if the class has an exception-throwing policy. For a code point replacement policy, however, it is not clear whether unicode_string_adapter should modify the raw content, or replace the offending code point on the fly when users access the content through iterators. The benefits of preserving the raw string are that no information is lost and that it works with immutable string types. Even though this decision can be factored into the policy-based design, it is still desirable to choose a preferred default policy.

Should the constructor that accepts the original raw string be implicit or explicit?

An implicit constructor has to make an assumption about the encoding of the raw string, but it allows library code to change its API without breaking old code that passes raw strings such as std::string as parameters.
On the other hand, an explicit constructor forces the caller of the library API to state the encoding of the string explicitly, guaranteeing that the correct encoding is used. But this makes it hard for existing libraries to migrate to a new version of their API that uses unicode_string_adapter, as it breaks existing code unless the old API co-exists with the new API to ease the migration.

Should the raw string be accessible via operator*() or a custom named function?

Chad Nelson's original utf*_t string classes have operator*() to access the underlying std::basic_string object. However, it was generally not accepted by the community, because the utf*_t classes were seen as alternative string classes competing to replace std::basic_string, rather than as higher-level adapters for std::basic_string. As my observations above show, though, the original utf*_t classes and unicode_string_adapter actually operate one level above the raw strings, so the two levels complement each other. In this case it makes sense that unicode_string_adapter "contains" its underlying string, and operator*() can be used to retrieve the actual content that the class contains. I expect that this question alone will generate quite a lot of debate in the community.

== Conclusion ==

I am sorry this proposal draft is so long, but as the topic is quite controversial I need to provide more solid arguments to support my project idea. I hope that this can at least convince you that it is better to have string wrapper classes to ensure encoding correctness. While some of you might still disagree with my proposed implementation, I believe that as long as we agree that this project is worthwhile, a good implementation design can eventually be worked out as the project starts and goes along.

I might have gotten some facts wrong, such as the basic_path usage in Boost.Filesystem. If I have made any mistakes, please feel free to correct me.
Please also note that this is a draft proposal I am preparing to submit to GSoC, and the main objective of this thread is to get my project accepted into GSoC. The proposal is nowhere near complete or well thought out enough, and is not yet ready in any way to be acceptable by Boost standards.

If there are still many doubts and controversies, I think it might be better for me to write a technical report of some sort at the end of this project, covering all aspects of the problems and solutions of Unicode strings. It might take a long while for this to be accepted by everyone in Boost, but everything has to start somewhere, right? So I hope that this GSoC project will be the first step of the journey towards the ideal solution. :)

Thanks!

Best Regards,
Soares Chen

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Chad Nelson

3:03 p.m.

On Thu, 24 Mar 2011 11:14:08 +0800 Soares Chen <crf@hypershell.org> wrote:

...

[...] What should be the type for a single Unicode combining character sequence and grapheme? Unicode combining character sequences and graphemes (a.k.a. abstract characters) can consist of an arbitrary number of code points. This means that, unlike basic types such as char that can be placed on the stack, the value of even a single abstract character must live on the heap because of its variable size. [...]

Maybe not. The "Stream-Safe Text Format" is designed specifically for this. From <http://www.unicode.org/reports/tr15/index.html#Stream_Safe_Text_Format>:

    A Unicode string is said to be in Stream-Safe Text Format if it would not contain any sequences of non-starters longer than 30 characters in length when normalized to NFKD. Such a string can be normalized in buffered serialization with a buffer size of 32 characters, which would require no more than 128 bytes in any Unicode Encoding Form.

It might be feasible to require graphemes to be in this format. I was planning to do so if I ever wrote a grapheme iterator.

Of course, it still might not be feasible to use a fixed-size structure for graphemes, depending on how many you need to store at once, but for an iterator it would be reasonable.

--
Chad Nelson
Oak Circle Software, Inc.
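As an illustration of the fixed-size idea, a grapheme value that lives entirely on the stack could look like the sketch below. Everything here is an assumption made for the example: `grapheme_sketch` is a made-up type, the capacity of 32 code points simply mirrors the buffer size in the quoted passage, and the 128-byte figure only holds where char32_t is 4 bytes wide. It does not itself check that the text is in Stream-Safe Text Format.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical fixed-capacity grapheme: no heap allocation, so a grapheme
// iterator could return it by value.
struct grapheme_sketch {
    static constexpr std::size_t capacity = 32;   // mirrors the 32-character buffer

    char32_t code_points[capacity];
    std::uint8_t length;

    grapheme_sketch() : code_points(), length(0) {}

    // Appends one code point; input longer than the capacity is rejected
    // rather than silently truncated.
    void push_back(char32_t cp) {
        if (length == capacity)
            throw std::length_error("grapheme exceeds 32 code points");
        code_points[length++] = cp;
    }
};

// 32 code points * 4 bytes = 128 bytes of payload on mainstream platforms.
static_assert(sizeof(char32_t) * grapheme_sketch::capacity == 128,
              "this sketch assumes a 4-byte char32_t");
```

The tradeoff mentioned in the message is visible in the layout: each value carries the full 128-byte payload even for a single code point, which is acceptable for one iterator but costly for a container of graphemes.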

Anders Dalvander

5:35 p.m.

On 20:59, Soares Chen wrote:

...

Hi everyone,

Thank you very much and I appreciate all your feedback! :)

I have talked privately with Chad Nelson and Mathias Gaunard over the past few days, and they have given me a lot of useful suggestions. Based on the feedback and some study of Chad's Unicode library, Boost.Unicode and Boost.Filesystem, I have come up with some ideas about what kind of library I should build in this GSoC project.

Hi Soares,

Some of the issues you talk about are addressed or could be addressed with my Boost.Text library. Although I haven't worked on it since I posted it earlier this year.

http://www.dalvander.com/boost_text/

Please feel free to use any parts of it as you please. I'd be happy if it was of any use.

Regards,
Anders Dalvander

--
WWFSMD?

Soares Chen

26 Mar

9:49 p.m.

Hi Anders,

Thanks for the link to Boost.Text! I think I missed it in the discussion I read.

Yes, indeed I think we all share similar ideas. Other than differences in design, I think the concept is quite similar to Chad's Unicode string library, though I don't know whether you two have seen each other's code.

Basically, my concept is also something like adding one more template parameter to your basic_text class to generalize string_type. Other than that, you have pretty much demonstrated my motivation to build such a class.

One simple question: may I know why your class holds shared_ptr<basic_string<>> as the private member instead of simply basic_string<>? If I'm not wrong, basic_string<> is already a smart pointer to the underlying buffer, and most implementations offer handy COW semantics. Or is there a need to wrap basic_string<> in shared_ptr<> because not all basic_string<> implementations are efficient?

Thanks.

Sebastian Redl

28 Mar

8:27 a.m.

On 26.03.2011 22:49, Soares Chen wrote:

...

One simple question: may I know why your class holds shared_ptr<basic_string<>> as the private member instead of simply basic_string<>? If I'm not wrong, basic_string<> is already a smart pointer to the underlying buffer, and most implementations offer handy COW semantics.

On the contrary, the only CoW string I know is that from GCC's libstdc++, and that one will have to change: C++0x prohibits CoW implementations of std::basic_string.

Sebastian

Artyom

8:33 a.m.

...

On 26.03.2011 22:49, Soares Chen wrote:

...
One simple question to ask: May I know why is your class holding shared_ptr<basic_string<>> as the private member instead of simply basic_string<>? If I'm not wrong, basic_string<> is already a smart pointer to the underlying buffer and most implementation offers handy COW semantics.

On the contrary, the only CoW string I know is that from GCC's libstdc++, and that one will have to change: C++0x prohibits CoW implementations of std::basic_string.

Almost all implementations I know (except MSVC) use CoW strings: GCC, HP, SunCC and more. Shame on C++0x that CoW was removed for really irrelevant reasons.

Artyom

Anders Dalvander

7:18 p.m.

On 20:59, Soares Chen wrote:

...

Thanks for the link to Boost.Text! I think I missed it in the discussion I read.

You're welcome.

...

Yes, indeed I think we all share similar ideas. Other than difference in design, I think the concept is quite similar to Chad's Unicode string library, though I don't know whether you two have seen each other's code.

I've reviewed Chad's library and his ideas of a Unicode string class gave inspiration for my Unicode string class, although with different design and intermediate goals.

...

Basically, my concept is also something like adding one more template parameter to your basic_text class to generalize string_type. Other than that, you have pretty much demonstrated my motivation to build such a class.

Thank you. I'm happy if my class can give anyone an ounce of inspiration or motivation.

...

One simple question to ask: May I know why is your class holding shared_ptr<basic_string<>> as the private member instead of simply basic_string<>? If I'm not wrong, basic_string<> is already a smart pointer to the underlying buffer and most implementation offers handy COW semantics. Or is there a need to wrap basic_string<> in shared_ptr<> because not all basic_string<> implementation is efficient?

I did so to ensure that the data was shared between instances, as some implementations do not use COW semantics. The basic_text class was initially designed to be immutable, but I don't know if that is always good. Now I think it would be better if instances could be appended to at the end; that way one could use Mathias Gaunard's Unicode library without any intermediates.
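The sharing described above can be sketched roughly as follows. This is only an illustration under assumptions: the class name shared_text and its members are hypothetical and are not taken from Boost.Text's basic_text. It shows how holding the string through shared_ptr makes copies share one buffer regardless of whether the standard library's basic_string uses COW, and how an "append" can return a new value so existing instances stay immutable.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable text type (illustration only, not Boost.Text).
class shared_text {
public:
    explicit shared_text(std::string s)
        : data_(std::make_shared<const std::string>(std::move(s))) {}

    // Read-only view of the shared buffer.
    const std::string& str() const { return *data_; }

    // "Appending" builds a new value; the original is left untouched.
    shared_text appended(const std::string& tail) const {
        return shared_text(*data_ + tail);
    }

private:
    // Copies of shared_text copy this pointer, not the characters.
    std::shared_ptr<const std::string> data_;
};

// Small self-check showing the intended behaviour.
inline void shared_text_demo() {
    shared_text a("hello");
    shared_text b = a;                  // shares a's buffer
    assert(&a.str() == &b.str());
    shared_text c = a.appended("!");    // new buffer, a unchanged
    assert(a.str() == "hello" && c.str() == "hello!");
}
```

Whether appending should create a new value or modify in place is a design choice; the sketch picks the former only because it keeps the sharing safe.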

...

Thanks.

You're, once again, welcome.

Regards, Anders Dalvander
-- WWFSMD?

Soares Chen Ruo Fei

8:53 p.m.

Sebastian Redl wrote:

...

On the contrary, the only CoW string I know is that from GCC's libstdc++, and that one will have to change: C++0x prohibits CoW implementations of std::basic_string.

Artyom wrote:

...

Almost All implementations I know (but MSVC) use CoW strings, GCC, HP, SunCC and more. Shame on C++0x that CoW was removed for really irrelevant reasons.

Thanks for pointing out that C++0x actually bans CoW in basic_string, Sebastian and Artyom. So far I've only found N2668 [1] mentioning this change, and there is little I can find through a Google search. If you have links that provide detailed information on this change, please do let me know.

As I understand it, the ban is basically due to the inefficiency of performing concurrent mutation of basic_string, especially when the mutation is done through iterators. I believe this serves as evidence that immutable strings can be more desirable, and that mutable iterators are hard to get right and the only reason to use them is to stay compatible with STL algorithms.

I also see that the standards committee recommends that a rope proposal be considered for inclusion in Library TR2. Probably this is what Boost.Chain should be heading towards?

Stewart, Robert wrote:

...

Please read <http://www.boost.org/community/policy.html#quoting>.

I am really sorry that I did not know about the quoting convention on this mailing list. Hopefully, starting from this post, I am following the policy correctly. Thanks for letting me know.

Anders Dalvander wrote:

...

I've reviewed Chad's library and his ideas of a Unicode string class gave inspiration for my Unicode string class, although with different design and intermediate goals.

I'm glad to see someone supporting Chad's library and the idea of a Unicode string class. I believe that as long as we all have the same goal in mind, we can eventually sort out the design and implementation details to create a Unicode string library that is useful for everyone. Perhaps I should also discuss my ideas with you over IRC or private mail.

...

I did so to ensure that the data was shared between instances, as some implementations does not use COW semantics.

Since, as Sebastian and Artyom mentioned, CoW is prohibited in C++0x, I guess we have no choice but to deal with non-CoW strings.
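To make the CoW concern more concrete, here is a minimal sketch of a toy copy-on-write string. It is an assumption-laden illustration: cow_string and its members are hypothetical and do not reproduce any real standard library implementation. It shows one commonly cited difficulty: a write through a non-const accessor can silently detach the string from a shared buffer, so a pointer obtained earlier no longer refers to that string's data. C++0x does not allow non-const element access on basic_string to invalidate pointers, references or iterators in this way.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Toy copy-on-write string (hypothetical, for illustration only).
class cow_string {
public:
    explicit cow_string(const std::string& s)
        : buf_(std::make_shared<std::string>(s)) {}

    // Read access never copies.
    const char* data() const { return buf_->data(); }

    // Write access detaches from a shared buffer before handing out
    // a mutable reference.
    char& at(std::size_t i) {
        if (buf_.use_count() > 1)
            buf_ = std::make_shared<std::string>(*buf_);
        return (*buf_)[i];
    }

private:
    std::shared_ptr<std::string> buf_;
};

// Returns true if a pointer taken before the write no longer points
// into the string it was taken from.
inline bool pointer_goes_stale() {
    cow_string s("abc");
    const char* p = s.data();  // points into s's buffer
    cow_string t(s);           // t now shares that buffer
    s.at(0) = 'x';             // s detaches to a fresh buffer
    // p still points at the old buffer, which now belongs to t only.
    return p != s.data() && p == t.data() && *p == 'a';
}
```

The use_count() check in this toy is also not safe under concurrent use, which hints at the extra synchronization a real CoW string would need.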

...

The basic_text class was initially designed to be immutable, but I don't know if that is always good. Now I think it would be better if instances could be appended at the end, that way one could use Mathias Gaunard's Unicode library without any intermediates.

I agree with you that an append operation is desirable, as I mentioned in an earlier post. Currently I have some ideas on how to solve these design challenges, but they are not yet mature enough for me to share here.

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2668.htm

Best Regards, Soares Chen

Dean Michael Berris

30 Mar

2:06 a.m.

On Tue, Mar 29, 2011 at 4:53 AM, Soares Chen Ruo Fei <crf@hypershell.org> wrote:

...

I also see that the standard committee recommend a rope proposal be considered for inclusion in Library TR2. Probably this is what Boost.Chain should be heading to?

Well, yes and no. ;)

I aim to address the issue of immutability mainly and efficient concatenation/traversal with an implementation that is memory-efficient (in use and requirements). The only overlap between a chain and a rope is the efficient concatenation/traversal characteristics. A chain will basically be immutable as compared to a rope which still enables mutation.

HTH

--
Dean Michael Berris
http://about.me/deanberris

Soares Chen Ruo Fei

5 Apr

6:59 p.m.

Hi all,

I have prepared a formal draft proposal at https://docs.google.com/document/pub?id=1oBxQWRFF5wjmK9WCR4BhVuL2dTLzNYGIrEk.... Feel free to look at it before I submit it to GSoC. Specifically, I would like to know whether my proposal is too long, and whether I have missed anything that I should include in it.

Thanks.

Best Regards, Soares Chen

Soares Chen

24 Mar

4:36 a.m.

Hi Artyom,

...

I think boost::filesystem v3 is a big step forward, it allows you to use UTF-8 strings on Windows which I think is a really good beginning.

Bottom line, if you want to improve Unicode awareness of Boost I think you need to adopt Boost.Filesystem v3 like policy all over the code base of Boost.

1. Use Wide API as native one in Boost everywhere under Windows 2. Use char * API as native one in Boost everywhere under non-Windows platforms 3. Use std::codecvt to handle this (after many tricks... )

Thanks for the suggestion to look at Boost.Filesystem. I have looked at how basic_path<> is constructed in Boost.Filesystem and used a similar design in my proposal. Feel free to take a look at it.

...

Even if you create a perfect UTF-8 string and then call

fopen(your_perfect_string.c_str(),"r")

Under windows... And it would not work <sigh... damn Windows>

I'd take a shortcut in the explanation and ask you to look at the proposal I just posted. Using my proposed class, the call can be written with something like this:

    typedef unicode_string_adapter<std::string, ....> utf8_string;
    typedef unicode_string_adapter<std::basic_string<wchar_t>, ....> utf16_string;

    utf8_string my_path("/path/to/file");

    #ifdef WINDOWS
    // transparent conversion through constructor;
    // fopen() only takes char*, so the wide CRT variant is used here
    _wfopen(utf16_string(my_path).str().c_str(), L"r");
    #else
    fopen(my_path.str().c_str(), "r");
    #endif
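As a rough illustration of the conversion step that such an adapter would hide, here is a self-contained sketch. Everything in it is an assumption made for the example: utf8_to_utf16 and open_utf8_for_read are hypothetical names, not part of the proposed unicode_string_adapter, and the decoder assumes well-formed UTF-8 with no error handling. On Windows it uses the CRT's _wfopen; elsewhere it passes the UTF-8 bytes straight to fopen.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Decode UTF-8 into UTF-16 code units.
// Assumption: the input is well-formed UTF-8; malformed input is not detected.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }  // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte sequence
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte sequence
        else               { cp = b & 0x07; len = 4; }  // 4-byte sequence
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Open a file whose path is given as UTF-8 (hypothetical helper).
inline std::FILE* open_utf8_for_read(const std::string& utf8_path) {
#ifdef _WIN32
    const std::u16string u16 = utf8_to_utf16(utf8_path);
    const std::wstring wide(u16.begin(), u16.end());  // wchar_t is 16 bits here
    return _wfopen(wide.c_str(), L"r");
#else
    return std::fopen(utf8_path.c_str(), "r");
#endif
}

// Small self-check of the conversion for a two-byte sequence (U+00E9).
inline void utf8_demo() {
    assert(utf8_to_utf16("\xC3\xA9") == std::u16string(1, char16_t(0x00E9)));
}
```

The point of an adapter type is that callers would not write the #ifdef themselves; the sketch only shows what has to happen somewhere underneath.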

...

Boost.Locale and several other my projects (CppCMS, CppDB) live happily with std::string.

I think that it is not hard to maintain encoding consistency if the developer only uses libraries maintained by the same group of people. But as more libraries are used and mixed together, for example if CppCMS had a module system and different libraries used different Unicode processing backends from Boost.Locale, weird bugs will start to appear randomly. Most Unicode bugs are at least not fatal, and usually amount to an annoyance when end users see weird text appear on the screen.

...

The problem is that in vast majority of cases you don't need encoding aware string, as so many operations you usually do on strings are encoding agnostic. But this is other story.

Yes, in many cases there is no need to look at the content of the string, and the main thing the code does is pass on everything that is inside it. But since it doesn't matter what gets passed along, it would not hurt to pass along content plus encoding information, as the external operation is essentially the same. The only real difference is the type of the string. And functions that accept a certain type of object but do not really care about its type and content seem like good candidates to become generalized templates anyway.

Anyway, back to the topic: for legacy code that accepts std::string instead of my proposed unicode_string_adapter, one can easily convert to its internal string type by simply calling operator *() or a similarly named function. The template class may even have an operator StringT() to implicitly convert itself to its underlying string type when needed.
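The interoperability idea above can be sketched as follows. This is a minimal illustration under assumptions: encoded_string and legacy_length are hypothetical names, and the real unicode_string_adapter in the proposal may look quite different. The sketch returns const references from both conversions to avoid copies, which is one possible reading of "operator StringT()".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper around an arbitrary string type (illustration only).
template <class StringT>
class encoded_string {
public:
    explicit encoded_string(StringT s) : s_(std::move(s)) {}

    // Explicit access to the underlying string.
    const StringT& operator*() const { return s_; }

    // Implicit conversion, so legacy code taking StringT keeps working.
    operator const StringT&() const { return s_; }

private:
    StringT s_;
};

// Stand-in for legacy code that only knows about std::string.
inline std::size_t legacy_length(const std::string& s) { return s.size(); }

// Both call styles reach the same underlying std::string.
inline void encoded_string_demo() {
    encoded_string<std::string> name("path");
    assert(legacy_length(*name) == 4);  // explicit, via operator*
    assert(legacy_length(name) == 4);   // implicit, via conversion operator
}
```

Note that the implicit conversion only helps with non-template functions; a function template that deduces its string type would still see the wrapper type.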

...

Why?

1. Because you will never get the consensus about what is the "right-thing" to do (wide, narrow, utf-8, utf-16) etc.

Project that are handled and directed by a single source or management like Qt, GTK(mm), Java, C#, Python or others may decide what is the right thing.

This will never happen in Boost as it is too pluralistic even in cases where it does not always make sense, just because the way libraries are developed, reviewed and got in - based on public reviews that eventually encourages diversity.

2. Because you would not likely be able to force users to actually use your string, as Boost is more about collaboration than enforcement of a specific style.

You are right. Because there is no one single "right" way to use strings, my new proposal actually provides a generic solution on top of existing strings. Hopefully the ability to choose any string type as the underlying container will give the best of both worlds.

...

3. Even heavy discussions there hadn't got to any conclusion. So what would happen and final review of your library?

Heavy discussions indicate that the problem does exist and that there is a need for a solution. However, I think the reason the previous discussion was never-ending is that nobody was willing to take action and write a library, other than Chad and now Dean. Talk is cheap, so it is easy to criticize a solution without solid evidence. I think the best way for the discussion to move forward is to get more people to take action and implement their own code, if they are really so opposed to the existing solutions. Let the code fork, merge, and speak for itself, and let the code with the best solution be the winner. After all, this is how open source works, right? ;)

Thanks for your feedback!

participants (7)

Anders Dalvander
Artyom
Chad Nelson
Dean Michael Berris
Sebastian Redl
Soares Chen
Soares Chen Ruo Fei