[gsoc]Boost.Ustr Unicode String Adapter First Preview

Hi all,
I would like to let you know that the development repository for the Boost Unicode String Adapter (which I now call Boost.Ustr) is available on GitHub: https://github.com/crf00/boost.ustr. I have been working closely with my mentor Chad Nelson on the design of the library, and I would now like to show it to the community to gather feedback before proceeding further.
Currently the main class, unicode_string_adapter, provides a uniform way to access code points in various string containers without having to worry about the underlying encoding and code units. unicode_string_adapter will not handle higher-level Unicode processing tasks such as abstract characters; I plan to leave that functionality to another class, unicode_abstract_character, which I will build later.
At the moment my library only works under GCC with C++0x enabled, as I was focusing on the design issues first. I also understand that I have not yet adopted the Boost way of building the project. While I am now going to spend more time fixing these issues, I hope that this discussion can focus on the design instead.
Feel free to let me know about any potential issues with the class design so that I can fix them before it is too late. Thank you!
cheers,
Soares Chen

On 20:59, Soares Chen Ruo Fei wrote:
> I would like to inform you all that the development repository for Boost Unicode String Adapter (now I call Boost.Ustr) is now available at GitHub: https://github.com/crf00/boost.ustr. I have been working closely with my mentor Chad Nelson on the design of the library, and now I'd like to show it to the community to gather some feedback before proceeding further on.
> Currently the main class, unicode_string_adapter, provides a uniform way for code point access from various string containers without having to concern about the underlying encoding and code units. unicode_string_adapter will not handle higher level Unicode processing tasks such as abstract characters, as I plan to leave the functionality to another class called unicode_abstract_character which I will build later.
The library looks very interesting. I like your approach of having the default iterator iterate over code points rather than code units. I also like the way you handle both mutable and immutable strings.
Some documentation or a tutorial would be very helpful. Even a few lines of sample code would make the library easier to understand.
How does this library work with Artyom Beilis's Boost.Locale library, Mathias Gaunard's Boost.Unicode library, and ICU? If the interoperability is good, then you probably don't need to create a unicode_abstract_character class.
> At the moment my library only works under GCC with C++0x enabled, as I was focusing on the design issues first. I also understand that I have not adopted the Boost way of building the project. While I am now going to spend more time on fixing these issues, I hope that this discussion can have more focus on the design issues instead.
It would be great if it would work under VC++2010 as well. Would be a lot easier (for me at least) to test and play with.
> Feel free to let me know any potential issues on the class design so that I can fix it before it is too late. Thank you!
I see a potential issue with the `unicode_string_adapter(const raw_char_type* other)` constructor, as it won't know the encoding of the string literal. See discussion between Ryou Ezoe and Artyom Beilis during the Boost.Locale review.
> cheers,
> Soares Chen
Best regards, Anders Dalvander -- WWFSMD?

On Sat, Jun 18, 2011 at 7:00 PM, Anders Dalvander <boost@dalvander.com> wrote:
> The library looks very interesting. I like your approach that the default iterator iterates over codepoints and not codeunits. I also like the way you handle both mutable and immutable strings.
> It would be very helpful with some kind of documentation or tutorial. Just a few lines of sample code would help understand it better.
Thank you, I am glad you like the design approach. I am currently working on the documentation and figuring out how to use Doxygen for Boost-style documentation. It will probably take a while before I can publish the first draft, so I will explain the library briefly here.
The unicode_string_adapter class is intended to replace traditional string containers such as std::string and std::vector by wrapping them and "downgrading" them to raw code unit containers. unicode_string_adapter then provides a uniform way to decode code points from, and encode code points into, those code unit containers regardless of the actual underlying encoding.
I have just written a simple hello world example and uploaded it to GitHub. You can view it at https://github.com/crf00/boost.ustr/blob/master/libs/ustr/example/hello/hell.... Hopefully this example helps in understanding the library.
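To make the code unit versus code point distinction concrete, here is a minimal sketch of decoding a UTF-8 std::string into code points. It is not the Boost.Ustr interface: next_code_point and code_points are hypothetical names chosen for this illustration, and the decoder is deliberately simplified (it rejects truncated or malformed sequences but does not check for overlong encodings or surrogate values).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost.Ustr: decodes the UTF-8 sequence
// that starts at s[i], advances i past it, and returns the code point.
inline char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;  // single byte: ASCII

    int trailing;
    char32_t cp;
    if ((lead & 0xE0) == 0xC0)      { trailing = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { trailing = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { trailing = 3; cp = lead & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    for (; trailing > 0; --trailing) {
        if (i >= s.size()) throw std::runtime_error("truncated UTF-8 sequence");
        const unsigned char c = static_cast<unsigned char>(s[i++]);
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
    }
    return cp;
}

// The std::string is treated purely as a container of UTF-8 code units;
// callers only ever see whole code points.
inline std::vector<char32_t> code_points(const std::string& utf8) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < utf8.size(); )
        out.push_back(next_code_point(utf8, i));
    return out;
}
```

For example, code_points("\xE4\xB8\x96") returns a single element, U+4E16, even though the underlying std::string holds three code units.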
> How does this library work with Artyom Beilis Boost.Locale library, Mathias Gaunard Boost.Unicode library and ICU?
Since unicode_string_adapter produces code point iterators, it should work with the functions in Mathias's Boost.Unicode library that accept code point iterators, such as grapheme_segment. I have not tested this, though.
unicode_string_adapter is specially designed for libraries that need to provide APIs accepting strings in different encodings, such as Artyom's Boost.Locale and also Boost.Filesystem. The idea is to replace the legacy parameter types char*, wchar_t*, and std::string with a single unicode_string_adapter template. Although the solution sounds easy, the biggest challenge for existing libraries is that it breaks existing APIs unless the library author is willing to support unicode_string_adapter and legacy strings side by side.
I intend to make ICU's UnicodeString class one of the code unit containers usable with unicode_string_adapter in the future. An ICU Unicode string could then be written as unicode_string_adapter<UnicodeString>, making it easily convertible to other string types such as unicode_string_adapter<std::string> when needed.
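As a rough sketch of what "one template instead of several legacy overloads" could look like, the following uses a plain function template as a stand-in for an API taking unicode_string_adapter. Everything here is hypothetical: to_code_points and count_code_points are illustrative names, not Boost.Ustr functions, and the UTF-16 decoder is lenient (an unpaired surrogate is passed through unchanged).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-encoding decoders standing in for the library's
// encoding traits. UTF-16: join surrogate pairs into one code point.
inline std::vector<char32_t> to_code_points(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            u = 0x10000 + ((u - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        out.push_back(u);
    }
    return out;
}

// UTF-32: every code unit already is a code point.
inline std::vector<char32_t> to_code_points(const std::u32string& s) {
    return std::vector<char32_t>(s.begin(), s.end());
}

// A single entry point: the caller's container type selects the decoder,
// and the function body only deals with code points.
template <typename String>
std::size_t count_code_points(const String& text) {
    return to_code_points(text).size();
}
```

With this shape, count_code_points(std::u16string(u"\U0001F600")) and count_code_points(std::u32string(U"\U0001F600")) both report one code point, although the first container holds two code units.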
> If the interoperability is good, then you probably don't need to create an unicode_abstract_character class.
There are some features I have in mind that could be greatly simplified by a class like this. The advantage of unicode_abstract_character is that we can construct independent objects that each represent a single abstract character, which can then be used for higher-level purposes. Since this class is only planned for the future, I think we can leave it for a later discussion.
>> At the moment my library only works under GCC with C++0x enabled, as I was focusing on the design issues first. I also understand that I have not adopted the Boost way of building the project. While I am now going to spend more time on fixing these issues, I hope that this discussion can have more focus on the design issues instead.
> It would be great if it would work under VC++2010 as well. Would be a lot easier (for me at least) to test and play with.
I apologize for only starting to fix portability issues this late. There are currently a few compilation errors that I am not familiar with, so it might take me a while to fix them.
>> Feel free to let me know any potential issues on the class design so that I can fix it before it is too late. Thank you!
> I see a potential issue with the `unicode_string_adapter(const raw_char_type* other)` constructor, as it won't know the encoding of the string literal. See discussion between Ryou Ezoe and Artyom Beilis during the Boost.Locale review.
I did read the discussions in the Boost.Locale review, and I agree that this is a very challenging problem. I tentatively added the `unicode_string_adapter(const raw_char_type* other)` constructor later on to make construction from raw strings work in the unit tests, but I agree with you that it most probably should be removed.
The main challenge I found is that there is actually no portable way to embed static Unicode strings in C++ source code. From my understanding, the encoding of string literals in source code depends on the locale the compiler is using and on the operating system, so it is not possible to statically choose a single encoder/decoder for processing them.
I have a solution in mind that would allow developers to statically construct Unicode strings in a portable way: wrapping every source code string in a USTR() macro before it is passed to a unicode_string_adapter constructor. The construction of source code strings would then look something like:
    unicode_string_adapter<std::string> my_string( USTR(L"世界你好") );
The USTR() macro would expand into another unicode_string_adapter constructor whose template arguments match the encoding the compiler is using. So it would, for example, be expanded as follows on different platforms:
    // UTF-16 source encoding
    unicode_string_adapter<std::string> my_string(
        unicode_string_adapter< std::basic_string<char16_t> >(L"世界你好")
    );
    // UTF-32 source encoding
    unicode_string_adapter<std::string> my_string(
        unicode_string_adapter< std::basic_string<char32_t> >(L"世界你好")
    );
    // GB2312 Chinese source encoding
    // Not sure if this could happen in Windows, but even if it does, USTR() can still handle that
    unicode_string_adapter<std::string> my_string(
        unicode_string_adapter<
            std::basic_string<char16_t>,
            string_traits< std::basic_string<char16_t> >,
            gb2312_encoding_traits< string_traits< std::basic_string<char16_t> > >
        >(L"世界你好")
    );
Until the USTR() macro is built, I don't think it is even possible to use unicode_string_adapter to handle Unicode strings in source code. Unfortunately, there seems to be no better solution.
Best Regards,
Chen Ruo Fei
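The USTR() idea described above could be sketched roughly as follows. This is only an illustration under a loud assumption: it selects the decoder statically from sizeof(wchar_t) (2 bytes treated as UTF-16, 4 bytes as UTF-32), which is a stand-in for "the encoding the compiler is using" and not how the proposed macro is specified. wide_literal_decoder is a hypothetical name, and the result is a plain vector of code points rather than a unicode_string_adapter.

```cpp
#include <cstddef>
#include <vector>

// Primary template left undefined on purpose: an unsupported wchar_t size
// becomes a compile-time error instead of a silent misdecode.
template <std::size_t WideCharSize>
struct wide_literal_decoder;

// 4-byte wchar_t: assume each unit is one UTF-32 code point.
template <>
struct wide_literal_decoder<4> {
    static std::vector<char32_t> decode(const wchar_t* s) {
        std::vector<char32_t> out;
        for (; *s; ++s) out.push_back(static_cast<char32_t>(*s));
        return out;
    }
};

// 2-byte wchar_t: assume UTF-16 and join surrogate pairs.
template <>
struct wide_literal_decoder<2> {
    static std::vector<char32_t> decode(const wchar_t* s) {
        std::vector<char32_t> out;
        while (*s) {
            char32_t u = static_cast<char32_t>(*s++);
            // *s is still valid here: at worst it is the terminating null.
            const char32_t next = static_cast<char32_t>(*s);
            if (u >= 0xD800 && u <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00);
                ++s;  // the low surrogate has been consumed
            }
            out.push_back(u);
        }
        return out;
    }
};

// The decoder is chosen at compile time, so call sites stay identical
// across platforms.
#define USTR(literal) wide_literal_decoder<sizeof(wchar_t)>::decode(literal)
```

A call such as USTR(L"\u4E16\u754C") then yields the same two code points whether wchar_t is 2 or 4 bytes wide on the target platform.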