[gsoc] Request Feedback for Boost.Ustr Unicode String Adapter

Hi all,

A while ago I gave some previews of my Unicode String Adapter library to the boost community, but I didn't receive much feedback. Now that GSoC is ending I'd like you all to take a look at my project again and provide feedback on the usefulness of the library. Following are the links to my project repository and documentation:

GitHub repository: https://github.com/crf00/boost.ustr
Documentation: http://crf.scriptmatrix.net/ustr/index.html

Recently there were some threads on the mailing list about Unicode issues and about forcing the use of UTF-8 encoding in std::string again. I feel certain that a few of the key people involved in those debates have at least read my project proposal and understand what Boost.Ustr is trying to achieve, yet it seems my approach is not much favored by anyone. So here I'd also like to hear from these people why my approach of using a Unicode string adapter is not the best way to solve the Unicode problem in C++.

I understand that the requirement to alter existing APIs might be the biggest barrier to Boost.Ustr being adopted by everyone, but I think the key question is this: if you had the chance to go back in time and restart your project from scratch, would you be willing to use Boost.Ustr in your library APIs, and do you think it could have solved the Unicode problems you're having right now?

Thanks.

Best Regards,
Soares Chen

My post has probably slipped under the radar, so I'm just going to bump it again. Please feel free to criticize if you think that my library has any fundamental design flaw. As a student and GSoC participant, I think the most important thing for me is to learn what I did wrong in the project so that I will not repeat the same mistakes, and to gain enough experience to make genuinely useful contributions to the open source community in the future. Any feedback is much appreciated. Thanks.

cheers,
Soares Chen

Hi, I'm reading the documentation and I must say it's very clear and easy to understand (at least for someone who did follow the recent discussions about the subject on this mailing list).

Minor error: missing '>' in http://crf.scriptmatrix.net/ustr/ustr/unicode_string_adapter.html (String Concatenation):

unicode_string_adapter< std::vector< char16_t > second_string = USTR("你好");

Also, in the same section, maybe adding information about potential stream operators (I don't see them so far) might help?

if you have the chance to go back in time and restart your project from scratch, are you willing to use Boost.Ustr in your library APIs and do you think that it could have solved the Unicode problems you're having right now?
From the documentation, that library seems to solve the problem as described by several boost authors. However, real-world usage and experimentation would help figure out whether there is any important flaw in the design.
As a boost user, I find it interesting even for my non-library projects, but I wouldn't expect all boost libraries to expose encoded strings in their interfaces. Maybe just Boost.Locale and Boost.FileSystem. If I can find time in the coming weeks, I'll try to use it in a prototype and provide feedback.

Joël Lamotte

2011/8/9 Klaim - Joël Lamotte wrote:
I'm reading the documentation and I must say it's very clear and easy to understand (at least for someone who did follow the recent discussions about the subject on this mailing list).
Thanks for the feedback. This brings encouragement and motivation for me to continue improving this library. :)
Minor error: missing '>' in http://crf.scriptmatrix.net/ustr/ustr/unicode_string_adapter.html (String Concatenation)
unicode_string_adapter< std::vector< char16_t > second_string = USTR("你好");
Ahh I see, I missed that. Thanks for noticing it; I've updated it on my website.
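For the record, the corrected declaration just adds the missing closing angle bracket:

    unicode_string_adapter< std::vector< char16_t > > second_string = USTR("你好");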
Also, in the same section, maybe adding information about potential stream operators (I don't see them so far) might help?
Currently Boost.Ustr provides limited support for I/O. Due to portability issues I think it is very hard for Boost.Ustr alone to solve problems such as printing Unicode strings to the screen. In fact, I have no idea how to print even raw Unicode strings onto the Windows terminal. (Does any Windows expert know how to solve this?)

My current solution is to rely on the raw string class to provide the actual I/O operations. So, for example, it is possible to print a `unicode_string_adapter<std::string>` by passing the const reference of the raw string to std::cout via operator *(). I have also implemented a convenience function that automatically exposes the internal raw string when the string adapter is passed to std::cout through operator <<().

On the other hand, I haven't given much thought to input streams. `unicode_string_adapter_builder` already has code point and code unit output iterators, so I think it shouldn't be too hard to perform input stream operations through those output iterators. However, I didn't provide operator *() for the mutable builder class, as I think exposing the raw string class in the string adapter builder would re-enable read operations on the mutable string. (`unicode_string_adapter_builder` purposely forbids read operations to discourage programmers from reading and writing strings at the same time.)

My conclusion is that instead of trying to make `unicode_string_adapter` work with the old iostream libraries, we could instead leverage the encoding-agnostic-API advantage of Boost.Ustr to implement a truly portable I/O library for Unicode strings. For example, a `print(str)` function that accepts a `unicode_string_adapter`, or a `scan(mstr)` function that accepts a `unicode_string_adapter_builder`, could correctly print or scan Unicode strings of any encoding regardless of the actual encoding the system uses. (No more pain of choosing between the wide and non-wide versions of the functions.)

That said, I should disclose that I am not familiar with the C++ iostream library, as I feel that its design is too complex and I personally don't like it. My bias might be wrong, and I'm willing to add more functionality to Boost.Ustr to work with iostream if there are actually simple ways to do it.
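To make the output path described above concrete, here is a minimal sketch; unicode_string_adapter, USTR, operator*() and operator<<() are the names discussed in this thread, while the header path and the boost::ustr namespace are my assumptions about the library layout:

    #include <iostream>
    #include <string>
    #include <boost/ustr/unicode_string_adapter.hpp> // assumed header path

    int main() {
        // adapter over a std::string whose content is UTF-8 code units
        boost::ustr::unicode_string_adapter<std::string> s = USTR("hello");
        std::cout << *s << std::endl;  // operator*() exposes the raw std::string
        std::cout << s << std::endl;   // convenience operator<<() forwards the raw string
        return 0;
    }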
if you have the chance to go back in time and restart your project from scratch, are you willing to use Boost.Ustr in your library APIs and do you think that it could have solved the Unicode problems you're having right now?
From the documentation, that library seems to solve the problem as described by several boost authors. However, real-world usage and experimentation would help figure out whether there is any important flaw in the design.
As a boost user, I find it interesting even for my non-library projects, but I wouldn't expect all boost libraries to expose encoded strings in their interfaces. Maybe just Boost.Locale and Boost.FileSystem.
If I can find time in the coming weeks, I'll try to use it in a prototype and provide feedback.
Thanks. I'm looking forward to hearing your feedback. By the way, do feel free to email me if you face any problem compiling the code, as I have not yet thoroughly tested it on all platforms, though it should work on at least Ubuntu Linux, Mac OS X, and Windows 7. You might also want to add the C++11 flag to your bjam command line options, as it is not enabled by default.

cheers,
Soares

On 9 August 2011 11:45, Soares Chen Ruo Fei <crf@hypershell.org> wrote:
Currently Boost.Ustr provides limited support towards I/O. Due to portability issues I think it is very hard for Boost.Ustr alone to solve problems such as printing Unicode strings to the screen. In fact, I have no idea on how to print even raw Unicode strings onto the Windows terminal. (Any Windows expert knows how to solve this?)
I'm not a Windows expert, but I needed to do this for quickbook. I wasn't able to find a complete solution, but what I've got sort of works. Maybe someone else knows better. I've a horrible feeling that someone is going to point out a much simpler solution that makes what I do look silly and pointless.

Quickbook always uses a wide stream for output on windows. When running from an IDE this worked fine, but at the command line the output would be converted to the current code page - losing characters that the code page doesn't support. So when running from the console you need to tell windows to use UTF-16:

#include <io.h>
#include <fcntl.h>

int main() {
    if (_isatty(_fileno(stdout))) _setmode(_fileno(stdout), _O_U16TEXT);
    if (_isatty(_fileno(stderr))) _setmode(_fileno(stderr), _O_U16TEXT);
}

Annoyingly, _O_U16TEXT was largely undocumented until recently; I don't know if it was available before Visual Studio 2008. The last time I checked, it wasn't available in mingw. Here's the MSDN page for _setmode: http://msdn.microsoft.com/en-us/library/tw4k6df8%28v=VS.100%29.aspx

The '_isatty(_fileno(stdout))' call checks that you are writing to the console. You don't want to write UTF-16 when output is piped into a program that expects 8 bit characters. A better solution might be to use the UTF-8 code page for output, but that didn't seem to work, at least not on XP. Finally, remember to make sure your console is using a font that can display the characters you're outputting.

On Tue, Aug 9, 2011 at 3:45 AM, Soares Chen Ruo Fei <crf@hypershell.org> wrote:
Currently Boost.Ustr provides limited support towards I/O. Due to portability issues I think it is very hard for Boost.Ustr alone to solve problems such as printing Unicode strings to the screen. In fact, I have no idea on how to print even raw Unicode strings onto the Windows terminal. (Any Windows expert knows how to solve this?)
On Windows, you can use WriteConsoleW() with GetStdHandle(STD_OUTPUT_HANDLE). I have no clue how this will behave if your stdlib buffers output. In VC++ specifically (2005 and up) you can do _setmode(_fileno(stdout), _O_U16TEXT), and wprintf() and wcout will behave properly for all Unicode text. You still need to set the console's font to one supporting Unicode glyphs, though. -- Cory Nelson http://int64.org
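A minimal sketch of the WriteConsoleW() route described above; note that it writes straight to the console handle and bypasses any stdlib buffering, so mixing it with printf()/cout output can reorder text:

    #include <windows.h>

    void write_console_utf16(const wchar_t* text, DWORD length) {
        HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD written = 0;
        // UTF-16 code units go directly to the console; no code-page conversion
        WriteConsoleW(out, text, length, &written, NULL);
    }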

From: Soares Chen Ruo Fei <crf@hypershell.org>
To: boost@lists.boost.org Sent: Tuesday, August 9, 2011 10:53 AM Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
My post has probably slipped through the radar so I'm just going to bump this post again. Please feel free to criticize if you think that my library has any fundamental design flaw. As a student and GSoC participant, I think the most important thing is for me is to learn what I did wrong in the project so that I will not repeat the same mistake, and also to allow me to gain enough experience so that I can really give useful contribution to the open source community in future.
Any feedback is really much appreciated. Thanks.
Hello,

First of all I want to say that, as the author of the Boost.Locale library, I have a very strong opinion on how strings and Unicode should be handled. My strong opinion is:

a. Strings should be just a container object with a default encoding and some useful API to handle it.
b. The default encoding MUST be UTF-8.
c. There are several ways to implement strings: COW, mutable, immutable, with small string optimization and so on. One way or another, std::string is the de-facto string, and I think we should live with it and use alternative containers where it matters.
d. Code points and code units are meaningless unless you develop some Unicode algorithm - and you don't - you use one written by experts.

So my biggest problem is motivation:
-----------------------------------
The main reason Boost.Ustr was developed is that current raw string types such as std::string require developers to make assumptions about the encoding of the string content, such as UTF-8 for std::string. This creates inconsistency when a string passed to a library API has a different encoding than the library expects.
Ustr does not solve this problem, because it does not really provide some kind of adapter<generic encoding> { string content }. That is the kind of thing that might be useful, but this is not it. Basically your library provides a wrapper around a string and outputs Unicode code points - but it does so for UTF encodings only! That does not buy much. You provide encoding traits, but they are basically meaningless for the purpose you stated, as there are no traits for non-Unicode encodings like, let's say, Shift-JIS or ISO-8859-8.

BTW, you can't create traits for many encodings; for example, you can't implement the traits requirements at http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_enc... for popular encodings like Shift-JIS or GBK... Homework: tell me why ;-)

Also, the encoding is often something that changes at run time, not compile time, and it seems that this adapter does not support such an option.
The problem mainly arises because a small minority of developers use a different encoding for the same string type.

If someone uses strings with different encodings, he usually knows their encoding...
The real problem is that the API is inconsistent: on Windows a narrow string is in some ANSI code page, while anywhere else it is UTF-8. That is an entirely different problem, and such adapters don't really solve it - they actually make it worse...

The other problem is
================

I don't believe that a string adapter would solve any real problems, because:

a) If you iterate over code points, you are very likely doing something wrong, since a code point != a character - this is a very common mistake.
b) If you want to iterate over code points, it is better to have some kind of utf_iterator that receives a range and iterates over it; that would be more generic and would not require an additional class. For example, Boost.Locale has utf_traits that make it quite easy to implement iteration over code points (a sketch follows at the end of this message). See:
http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1...
http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/structboost_1_1loc...
And you don't need any kind of specific adapters.
c) The problem in Boost is not a missing Unicode string, and we do not even need yet-another-unicode-string to have good Unicode support. The problem is policy: Boost just can't decide once and forever that std::string is UTF-8...

But don't get me wrong. This is My Opinion; many would disagree with me.
=================================

Bottom line: Unicode strings, cool string adapters, UTF-iterators and even Boost.Unicode and Boost.Locale will not solve the problem that Boost libraries use inconsistent encodings on different platforms. IMHO the only way to solve it is POLICY.

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/
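A minimal sketch of the utf_traits-based code point iteration mentioned in point (b), using boost/locale/utf.hpp as documented; error handling is reduced to a simple bail-out:

    #include <boost/locale/utf.hpp>
    #include <string>

    void for_each_code_point(const std::string& s) {
        namespace utf = boost::locale::utf;
        std::string::const_iterator p = s.begin(), e = s.end();
        while (p != e) {
            // decode() consumes one UTF-8 sequence and advances p
            utf::code_point c = utf::utf_traits<char>::decode(p, e);
            if (c == utf::illegal || c == utf::incomplete)
                break; // invalid or truncated sequence
            // ... use the code point c ...
        }
    }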

On 11 August 2011 12:03, Artyom Beilis <artyomtnk@yahoo.com> wrote:
The problem is policy: Boost just can't decide once and forever that std::string is UTF-8...
Even if there was a consensus within boost, that isn't feasible. We don't own std::string, so we don't have a say in what it represents. There's a lot of existing code which is not based on that assumption - we can't just wish it out of existence and boost should be compatible with it.

From: Daniel James <dnljms@gmail.com>
Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
On 11 August 2011 12:03, Artyom Beilis <artyomtnk@yahoo.com> wrote:
The problem is policy: Boost just can't decide once and forever that std::string is UTF-8...
Even if there was a consensus within boost, that isn't feasible. We don't own std::string, so we don't have a say in what it represents.
std::string represents a sequence of "char" objects that happens to be useful for text processing. It can represent a text in any encoding. The question is how we treat this sequence... And this is a matter of policy and requirements of the library.
There's a lot of existing code which is not based on that assumption - we can't just wish it out of existence and boost should be compatible with it.
Then cross platform, Unicode aware programming will always (I'm sorry) suck with Boost :-) That's it...

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/

On 11 August 2011 12:57, Artyom Beilis <artyomtnk@yahoo.com> wrote:
There's a lot of existing code which is not based on that assumption - we can't just wish it out of existence and boost should be compatible with it.
Then cross platform, Unicode aware programming will always (I'm sorry) suck with Boost :-)
That's it...
Unless a different solution can be found.

On Fri, Aug 12, 2011 at 9:57 AM, Daniel James <dnljms@gmail.com> wrote:
On 11 August 2011 12:57, Artyom Beilis <artyomtnk@yahoo.com> wrote:
There's a lot of existing code which is not based on that assumption - we can't just wish it out of existence and boost should be compatible with it.
Then cross platform, Unicode aware programming will always (I'm sorry) suck with Boost :-)
That's it...
Unless a different solution can be found.
I see the old flam .. er discussion on text handling is back :)
From the previous debate(s) I now accept that it would be a bad idea just to force the encoding of std::string to be utf8. So a (nearly) ideal text handling class should IMO look like this (see usage below):
// text encoding tag types for conversion function dispatching
namespace /*or struct*/ textenc {
    struct utf8 {};
    struct utf16 {};
    struct utf32 {};
    struct winapi {};
    struct posix {};
    struct stdlib {};
    struct sqlite {};
    struct libpq {};
    ...
    struct libxyz {};

#if WE_ARE_ON_WINDOWS
    typedef winapi os;
#elif WE_ARE_ON_POSIX
    typedef posix os;
#elif ...
#endif

    struct gcc {};
    struct msvc {};
    struct icc {};
    struct clang {};

#if COMPILING_WITH_GCC
    typedef gcc compiler;
#elif COMPILING_WITH_MSVC
    typedef msvc compiler;
#elif ...
#endif
};

class text {
public:
    // *** construction ***

    // by default expect UTF8
    text(const char* cstr)
    {
        assert(is_utf8(cstr));
        store(cstr);
    }

    // by default expect UTF8
    text(const std::string& str)
    {
        assert(is_utf8(str.begin(), str.end()));
        store(str);
    }

    // otherwise use the tag type to do any necessary conversions
    template <typename Char, typename EncodingTag>
    text(const Char* cstr, EncodingTag encoding)
    {
        // use an overload to convert from the encoding:
        // basically, if the tag is textenc::winapi then use the
        // winapi-supplied functions and convert to utf8;
        // if it's posix, look at the locale and convert with the posix function;
        // if the tag is textenc::msvc, convert the msvc literal from
        // whatever crazy encoding it uses to utf8, ...etc.
        convert_and_store(cstr, encoding);
    }

    template <typename Char, typename EncodingTag>
    text(const std::basic_string<Char>& str, EncodingTag encoding)
    {
        convert_and_store(str.begin(), str.end(), encoding);
    }

    // *** conversion ***

    // by default output in utf8
    const char* c_str(void) const;

    // by default in utf8 (could be a friend fn instead of a member)
    std::string str(void) const;

    // (could be a friend fn instead of a member)
    template <typename EncodingTag>
    std::string str(EncodingTag encoding) const
    {
        return convert_from(encoding);
    }

    // wide char string output
    template <typename EncodingTag>
    std::wstring wstr(EncodingTag encoding) const
    {
        return wconvert_from(encoding);
    }

    // implement whatever functionality
    // makes sense for utf8-encoded text
};

// usage
text t1 = "blahblah"; // must be utf8
// whatever encoding the compiler uses for wide literals
text t2(L"blablablabl", textenc::compiler());
text t3(some_posix_function(), textenc::posix());
text t4(SomeWinapiFunc(), textenc::winapi());
text t5(SomeWinapiFuncW(), textenc::winapi());
text t6(pq_some_func(), textenc::libpq());
text t7 = concat(t1, t2, t3, t4, t5, t6);

std::ostream& out = get_outs();
out << t7; // output in utf8

text t8;
std::istream& in = get_ins();
in.read_line(t8);

text t9;
in.read(t9, 1024);

some_function_expecting_utf8(t9.c_str());
SomeWinapiFunction(t8.str(textenc::winapi()).c_str());
SomeWinapiFunctionW(concat(t9, text::newline(), t8).wstr(textenc::winapi()).c_str());
some_posix_function(transform(concat(t4, t7, t9)).str(textenc::posix()).c_str());
some_wrapped_os_function(str(t8, textenc::os()));
some_stdlib_function(str(head(substring_after(t9, t2), 10), textenc::stdlib()));

i.e. besides the fact that the string "uses utf8" (there is already a whole heap of such strings), it must also handle all the conversions between utf8 and whatever the OS and the major libraries and APIs expect and use, conveniently (and effectively). Otherwise the effort is IMHO wasted. Boost libraries (at the very least those wrapping OS functionality) should adopt this text class and do the conversions "just-in-time" when making the OS API call.

My 0.02Euro

Best,
Matus

On Fri, Aug 12, 2011 at 12:00, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Aug 12, 2011 at 9:57 AM, Daniel James <dnljms@gmail.com> wrote:
On 11 August 2011 12:57, Artyom Beilis <artyomtnk@yahoo.com> wrote:
There's a lot of existing code which is not based on that assumption - we can't just wish it out of existence and boost should be compatible with it.
Then cross platform, Unicode aware programming will always (I'm sorry) suck with Boost :-)
That's it...
Unless a different solution can be found.
I see the old flam .. er discussion on text handling is back :)
From the previous debate(s) I now accept that it would be a bad idea just to force the encoding of std::string to be utf8. So a (nearly) ideal text handling class should IMO look like this (see usage below):
[...]
// by default expect UTF8
text(const std::string& str)
{
    assert(is_utf8(str.begin(), str.end()));
    store(str);
}
What you are doing is, in fact, forcing the assumed encoding of std::string to UTF-8. You just said you think it's a bad idea.
[...] text t1 = "blahblah"; // must be utf8
// whatever encoding the compiler uses for wide literals
text t2(L"blablablabl", textenc::compiler());
text t3(some_posix_function(), textenc::posix());
text t4(SomeWinapiFunc(), textenc::winapi());
text t5(SomeWinapiFuncW(), textenc::winapi());
How is it better than:

string t4 = from_narrow(SomeWinapiFuncA()); // use the default encoding used by the system for narrow strings
string t5 = from_wide(SomeWinapiFuncW());   // wchar_t on windows is always utf16
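For illustration, a minimal Windows-only sketch of what the from_wide() helper in the example above could do (from_narrow would be analogous with CP_ACP); the helper name comes from the example, not from an existing Boost API:

    #include <windows.h>
    #include <string>

    std::string from_wide(const std::wstring& w) {
        if (w.empty()) return std::string();
        // first call measures, second call converts UTF-16 to UTF-8
        int n = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(), NULL, 0, NULL, NULL);
        std::string out(n, '\0');
        WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(), &out[0], n, NULL, NULL);
        return out;
    }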
text t6(pq_some_func(), textenc::libpq());
You don't need it. You're proposing a design that tries to solve a non-existing problem. There is no such diversity of encodings in the interfaces. I don't know what libpq is, but it either uses UTF-8, in which case you write:

string t6 = pq_some_func();

or the default system encoding, in which case you write:

string t6 = from_narrow(pq_some_func());

As you start using more libraries with UTF-8 as the default encoding, you will use from_* less frequently. (It's possible to use a single to_utf8 instead of the from_narrow/from_wide combination.)

[...]
SomeWinapiFunction(t8.str(textenc::winapi()).c_str());
SomeWinapiFunctionW(concat(t9, text::newline(), t8).wstr(textenc::winapi()).c_str());
Same as above. 'text' as a distinct type doesn't play any role here. If t9 is std::string, this becomes:

SomeWinapiFunctionA(to_narrow(t8).c_str()); // to the default narrow system-encoding
SomeWinapiFunctionW(to_wide(t9 + "\r\n" + t8).c_str()); // what kind of newline is expected is defined by the API, not the system
[...] i.e. besides the fact that the string "uses utf8" (there is already a whole heap of such strings) it must also handle all the conversions between utf8 and whatever the OS and the major libraries and APIs expect and use; conveniently (and effectively). Otherwise the effort is IMHO wasted.
Your 'text' doesn't do this in a transparent way. In fact, you cannot do it in a transparent way, because 'const char*' doesn't carry the necessary semantic information. The burden of deciding what encoding to convert to/from falls on the programmer *anyway*. You gain nothing from defining yet another string type.

Boost libraries (at the very least those wrapping OS functionality)
should adopt this text class, and do the conversions, "just-in-time" when making the OS API call.
In light of the above, your 'text' class won't catch bugs like:

char str[1024];
GetWindowTextA(hwnd, str, sizeof(str));
boost::function_with_text_parameter(str);

Therefore, I don't think we should adopt this text class.

-- Yakov

On Fri, Aug 12, 2011 at 1:08 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Fri, Aug 12, 2011 at 12:00, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Aug 12, 2011 at 9:57 AM, Daniel James <dnljms@gmail.com> wrote:
On 11 August 2011 12:57, Artyom Beilis <artyomtnk@yahoo.com> wrote:
[...]
// by default expect UTF8
text(const std::string& str)
{
    assert(is_utf8(str.begin(), str.end()));
    store(str);
}
What you are doing is, in fact, forcing the assumed encoding of std::string to UTF-8. You just said you think it's a bad idea.
No, I'm proposing to implement a *new* class that will store the text in UTF8 encoding; if no encoding is specified during construction, it is assumed that the particular std::string is already in UTF8. This is *very* different from imposing an encoding on std::string, which is already used in many situations with other encodings; i.e. my approach does not break any existing code.
[...] text t1 = "blahblah"; // must be utf8
// whatever encoding the compiler uses for wide literals
text t2(L"blablablabl", textenc::compiler());
text t3(some_posix_function(), textenc::posix());
text t4(SomeWinapiFunc(), textenc::winapi());
text t5(SomeWinapiFuncW(), textenc::winapi());
How is it better than:

string t4 = from_narrow(SomeWinapiFuncA()); // use the default encoding used by the system for narrow strings
string t5 = from_wide(SomeWinapiFuncW());   // wchar_t on windows is always utf16
I believe that it is more generic to use a combination of function + tag than just a function, because there are other APIs besides the OS's that use various encodings, and my approach scales better. Or do you like from_narrow_os(), from_wide_os(), from_narrow_stdlib(), from_narrow_lib1(), ... from_wide_libN() more?
text t6(pq_some_func(), textenc::libpq());
You don't need it. You're proposing a design that tries to solve a non-existing problem. There is no such diversity of encodings in the interfaces. I don't know what is libpq, but it either uses UTF-8 in which case you write:
string t6 = pq_some_func();
or the default system encoding, in which case you write:
This is just an example. OK, libpq already uses UTF8, but there are others that do not. Besides, it does not harm you in any way to do this: if the returned string is already in UTF8, no conversion is done, but if you are using a very old version of libpq (not using UTF8), the transcoding is handled automatically. The same goes for other libraries/APIs.
string t6 = from_narrow(pq_some_func());
As you start using more libraries with UTF-8 default encoding, you will use from_* less frequently. (It's possible to use a single to_utf8 instead of from_narrow/from_wide combination.)
The approach that I proposed *does not* force you to specify the utf8 encoding explicitly either, if you are 100% sure that the string is in UTF8 and that this does not change under any circumstances (like when somebody changes the locale).
[...]
SomeWinapiFunction(t8.str(textenc::winapi()).c_str());
SomeWinapiFunctionW(concat(t9, text::newline(), t8).wstr(textenc::winapi()).c_str());
Same as above. 'text' as a distinct type doesn't play any role here. If t9 is std::string, this becomes:
SomeWinapiFunctionA(to_narrow(t8).c_str()); // to the default narrow system-encoding.
And what if t8 was read from another source (not UTF8 and not WINAPI), which may for example use the locale's encoding or some arbitrary encoding?
SomeWinapiFunctionW(to_wide(t9 + "\r\n" + t8).c_str()); // what kind of newline is expected defined by the API, not the system.
[...] i.e. besides the fact that the string "uses utf8" (there is already a whole heap of such strings) it must also handle all the conversions between utf8 and whatever the OS and the major libraries and APIs expect and use; conveniently (and effectively). Otherwise the effort is IMHO wasted.
Your 'text' doesn't do this in a transparent way. In fact you cannot do it in transparent way because 'const char*' doesn't carry the necessary semantic information. The burden of deciding what encoding to convert to/from falls on the programmer *anyway*. You don't benefit anything from defining yet-another string type.
I didn't say *transparent*, I said *convenient*. Of course you cannot do this completely transparently, for the reasons you mentioned (const char* can be encoded in any way). You need to specify the source from which the text comes (by the symbolic tag) and the library handles the details for you. If the source is UTF8, do nothing; otherwise do the transcoding. And by doing to_narrow/from_narrow you are trying to do exactly that "transparently". But again, there are other sources of text which use other encodings, besides the OS API.
Boost libraries (at the very least those wrapping OS functionality)
should adopt this text class, and do the conversions, "just-in-time" when making the OS API call.
In the light of the said above, your 'text' class won't catch bugs like:
char str[1024];
GetWindowTextA(hwnd, str, sizeof(str));
boost::function_with_text_parameter(str);
No, I didn't suggest doing it this way, so sorry, but this is a strawman. It should look like:

char cstr[1024];
GetWindowTextA(hwnd, cstr, sizeof(cstr));
text str(cstr, textenc::winapi());
boost::function_with_text_parameter(str);

Best,
Matus

On Fri, Aug 12, 2011 at 15:04, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Aug 12, 2011 at 1:08 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Fri, Aug 12, 2011 at 12:00, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Aug 12, 2011 at 9:57 AM, Daniel James <dnljms@gmail.com> wrote:
On 11 August 2011 12:57, Artyom Beilis <artyomtnk@yahoo.com> wrote:
[...]
// by default expect UTF8
text(const std::string& str)
{
    assert(is_utf8(str.begin(), str.end()));
    store(str);
}
What you are doing is, in fact, forcing the assumed encoding of std::string to UTF-8. You just said you think it's a bad idea.
No, I'm proposing to implement a *new* class that will store the text in UTF8 encoding and if during the construction no encoding is specified, then it is assumed that the particular std::string is already in UTF8.
This is *very* different from imposing an encoding on std::string which is already used in many situations with other encodings. i.e. my approach does not break any existing code.
Sorry, your arguments are starting to look non-constructive to me. Correct me where I'm wrong in the following reasoning.

(1) You object to UTF-8 strings in the boost interface because someone may pass something other than UTF-8 there and it's going to go undetected at compile time:

namespace boost { void func(const std::string& a); } // UTF-8
boost::func(non_utf_string); // oops

You're proposing a `text` class that is meant to somehow overcome this problem. So you change the boost interface to accept `text`, but user code is left unchanged...:

namespace boost { void func(const text& a); }
boost::func(non_utf_string); // oops, the implicit constructor from std::string is called.

Yes, you can make this constructor explicit, so the above code stops compiling and the user must write explicitly:

boost::func(text(non_utf_string));

But then there is nothing in your proposal that makes std::string utf-8 encoded by 'default'. Default == implicit.

[...]
I believe that it is more generic to use a combination of function + tag than just a function because there are other APIs besides the OS's that use various encodings and my approach scales better.
Or do you like from_narrow_os(), from_wide_os(), from_narrow_stdlib(), from_narrow_lib1(), ... from_wide_libN() more?
(2) No, I never proposed that. I repeat this again: the only encodings that matter are 'system default', UTF-8, and UTF-16. I would like to see a list of widely used libraries which use other encodings, please. [ Note: not including libraries used for encoding conversions. — end note. ] Even if there is such a library out there, the user is *already* converting to/from its exotic encoding.
[...] Besides, it does not harm you in any way
It does. I already use UTF-8 for all my strings, even on windows, and I don't want the code-bloat of all these conversions (even if they're no-ops).
[...]
string t6 = from_narrow(pq_some_func());
As you start using more libraries with UTF-8 as the default encoding, you will use from_* less frequently. (It's possible to use a single to_utf8 instead of the from_narrow/from_wide combination.)
The approach that I proposed *does not* force you to specify the utf8 encoding explicitly either, if you are 100% sure that the string is in UTF8 and that this does not change under any circumstances (like when somebody changes the locale)
Huh? Neither does mine. Again, what you say here contradicts (1).
[...] And what if t8 was read from another source (not UTF8 and not WINAPI) which may for example use the locale's encoding or some arbitrary encoding?
See (2). [...]
You need to specify the source from which the text comes (by the symbolic tag) and the library handles the details for you. If the source is UTF8 do nothing otherwise do the transcoding.
So can I summarize this debate as 'the programmer specifies the library and boost chooses the encoding' versus 'the programmer goes to the documentation of the library and tells boost what encoding to use'? If yes, then it's quite a minor design decision. According to (2), I claim that there will be fewer encodings than libraries.

And by doing to_narrow/from_narrow
you are trying to do that "transparently". But again, there are other sources of text which use other encodings, besides the OS API.
See (2).
Boost libraries (at the very least those wrapping OS functionality)
should adopt this text class, and do the conversions, "just-in-time" when making the OS API call.
In the light of the said above, your 'text' class won't catch bugs like:
char str[1024];
GetWindowTextA(hwnd, str, sizeof(str));
boost::function_with_text_parameter(str);
No I didn't suggest doing it this way so sorry but this is strawman. This should look like:
char cstr[1024];
GetWindowTextA(hwnd, cstr, sizeof(cstr));
text str(cstr, textenc::winapi());
boost::function_with_text_parameter(str);
Neither did I suggest passing a non-utf-8 string to an assumed-utf-8 string. It's not about the way you proposed to write the code; it's that your proposal doesn't solve the problem it was advocated to solve, nor is it better than Artyom's and my proposal. See (1).

You say that there is some code that you don't want to break, code you want to be compatible with. Which code? This code:

char str[1024];
GetWindowTextA(hwnd, str, sizeof(str));
boost::function_with_text_parameter(str); // currently assumes system encoding

Let's leave aside the fact that this code uses the deprecated winapi interface and is thus unicode-unaware. Yes, it *should* be written as:

char cstr[1024];
GetWindowTextA(hwnd, cstr, sizeof(cstr));
text str(cstr, textenc::winapi());
boost::function_with_text_parameter(str);

BUT (!!!), until the user rewrites this code, you've silently broken his code. This is *exactly* the same situation as assuming std::string is utf-8 in the first place, and the way the user has to rewrite his code is almost the same as mine:

char str[1024];
GetWindowTextA(hwnd, str, sizeof(str));
boost::function_with_text_parameter(from_narrow(str)); // accepts UTF-8 std::string
// versus:
boost::function_with_text_parameter(text(str, textenc::winapi())); // accepts your::text

(3) The only way to avoid silent breakage is to trap it at compile time by disabling the implicit conversion from string and char*. (!!!) By making the constructor explicit you just break user code at compile time rather than (silently) at run time. Indeed it's a bit better than assuming utf-8 by default, but now your string is going to be hell to use, even for those who already use utf-8 encoded std::strings:

std::string str = get_utf_8_string();
boost::function_with_text_parameter(str); // error, explicit constructor of text not called. please specify your intent.
boost::function_with_text_parameter(text(str)); // wait, don't we want to encourage utf-8 std::strings?

-- Yakov

On Mon, Aug 15, 2011 at 1:08 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Fri, Aug 12, 2011 at 15:04, Matus Chochlik <chochlik@gmail.com> wrote:
[...]
// by default expect UTF8
text(const std::string& str)
{
    assert(is_utf8(str.begin(), str.end()));
    store(str);
}
OK, to clarify. I certainly do not insist on implicit conversion from const char* and std::string. In fact I would like this and the other constructor to be explicit.
What you are doing is, in fact, forcing the assumed encoding of
std::string
to UTF-8. You just said you think it's a bad idea.
No, I'm proposing to implement a *new* class that will store the text in UTF8 encoding and if during the construction no encoding is specified, then it is assumed that the particular std::string is already in UTF8.
This is *very* different from imposing an encoding on std::string which is already used in many situations with other encodings. i.e. my approach does not break any existing code.
Sorry, your arguments start to look non-constructive to me. Correct me where I'm wrong in the following reasoning.
(1) You object to UTF-8 strings in boost interface because someone may pass something other than UTF-8 there and it's going to be undetected at compile time:
namespace boost { void func(const std::string& a); } // UTF-8 boost::func(non_utf_string); //oops
Yes, this is my concern, but see below.
You're proposing a `text` class that is meant to somehow overcome this problem. So you change the boost interface to accept `text` but user code is left unchanged...:
No, I never said that the client code should be left unchanged, and I'm sorry if you came to that conclusion because I did not express myself clearly. The text class should come with documentation that clearly states that it uses Unicode and UTF8, and that if you are constructing text from a string then either you must be sure that the string is already in UTF8 (which is not always the case, so I'm not enforcing std::string to be utf8), OR you must specify (by means of the symbolic tag) where the string came from: the OS, or some external library that uses neither Unicode nor the OS's conventions. I prefer the symbolic tags because the logic of whether the conversion needs to be done at all, and from/to which source/destination encoding, will be hidden from the user inside the library.
namespace boost { void func(const text& a); } boost::func(non_utf_string); //oops, the std::string default constructor is called.
Yes, you can make this constructor explicit, so the above code stops compiling and the user must write explicitly: boost::func(text(non_utf_string));
Yes, this is the idea:

boost::func(text(non_utf_string, textenc::symbolic_tag()));

The advantage is that if the authors of the library which produced the non_utf_string change their mind and in a new version start to encode their strings in UTF8, this code does not have to be touched. You update the *text* library, which will take the change into account, and recompile your application (using the code above).
But then there is nothing in your proposal that makes std::string utf-8 encoded by 'default'. Default == implicit.
The idea is that the documentation will say so. See above.
[...]
I believe that it is more generic to use a combination of function + tag than just a function because there are other APIs besides the OS's that use various encodings and my approach scales better.
r do you like from_narrow_os(), from_wide_os(), from_narrow_stdlib(), from_narrow_lib1(), ... from_wide_libN(); more ?
(2) No, I'd never proposed that. I repeat this again: The only encodings which matter are 'system default', UTF-8, and UTF-16. I would like to see a list of widely used libraries which use other encodings, please. [ Note: Not including libraries used for encoding conversions. — end note. ]
I'm not saying that every library uses some different encoding, but the situation is not as ideal as you put it either (i.e. that everybody uses the OS's conventions). But sorry, I'm not a software API encyclopedia, so no list ;).
Even if there is such a library out there, the user is *already* converting to/from its exotic encoding.
Yes, and this is precisely the most annoying part of working with text in C++. I do not say that the proposed library is completely wrong, but IMO the *ultimate Unicode string* must handle two things:

A) the *Unicode stuff*, which the proposed library basically does, and there are others which do as well: Boost.Locale, Boost.Unicode;
B) (equally important) handle *conveniently* the conversions from/to external APIs.

What good is a Unicode library if you have to do, for example, the *WINAPI-text-string-conversion-voodoo* before or after every call to WINAPI, of which there are hundreds (at least in the apps that I work on)? And these have to work with code using Qt, wxWidgets, mysql, libpq, odbc, openGL, openSSL, xml-parsers, etc. etc. Many (I don't say all) have their own conventions about text encoding, and these conventions change over time and are not always consistent with the OS. My proposal only allows you to hide the annoying details in one (easily extensible) library, which in turn results in cleaner, less cluttered and more stable application code.
[...] Besided it does not harm you in any way
It does. I already use UTF-8 for all my strings, even on windows, and I don't want the code-bloat of all these conversions (even if they're no-ops).
Again (if you know that it is UTF8 you don't have to say it out loud):
The approach that I proposed *does not* force you to specify the utf8 encoding explicitly either, if you are 100% sure that the string is in UTF8 and that this does not change under any circumstances (like when somebody changes the locale)
Huh? Neither mine. Again, what you say here contradicts (1).
I never said that yours does; I'm just saying that mine doesn't either.
[...]
You need to specify the source from which the text comes (by the symbolic tag) and the library handles the details for you. If the source is UTF8 do nothing otherwise do the transcoding.
So can I summarize this debate as 'the programmer specifies the library and boost chooses the encoding' versus 'the programmer goes to the documentation of the library and says boost what encoding to use'? If yes, then it's a quite minor design decision. According to (2) I claim that there will be less encodings than libraries.
Bingo. To summarize my points: besides handling Unicode well, it must also play nice with other libraries. To do that - to move the burden off the application programmer and to remove code repetition - hide the conversion logic, and let the user just say where the text is coming from or going to. And again, I don't remember saying that there will be more encodings than libraries. The libpq was just an example (I have used it for more than 10 years, and it wasn't always using UTF8, like it does now).
And by doing to_narrow/from_narrow
you are trying to do that "transparently". But again, that are other sources of text which use other encodings, besides the OS API.
[...]
Neither did I suggest passing a non-utf-8 string to an assumed-utf-8 string. It's not about the way you proposed to write the code; it's that your proposal doesn't solve the problem it was advocated to solve, nor is it better than Artyom's and my proposal. See (1). You say that there is some code that you don't want to break, code you want to be compatible with. Which code? This code:
No, maybe I didn't say this clearly but I do not want implicit conversion between text and string.
char str[1024];
GetWindowTextA(hwnd, str, sizeof(str));
boost::function_with_text_parameter(str); // currently assumes system encoding
Let's leave aside the fact that this code uses the deprecated winapi interface and is thus unicode-unaware. Yes, it *should* be written as:
char cstr[1024];
GetWindowTextA(hwnd, cstr, sizeof(cstr));
text str(cstr, textenc::winapi());
boost::function_with_text_parameter(str);
BUT (!!!), until the user rewrites this code, you've silently broken his code. This is *exactly* the same situation as assuming std::string is utf-8 in the first place, and the way the user has to rewrite his code is almost the same as mine:
char str[1024];
GetWindowTextA(hwnd, str, sizeof(str));
boost::function_with_text_parameter(from_narrow(str)); // accepts UTF-8 std::string
// versus:
boost::function_with_text_parameter(text(str, textenc::winapi())); // accepts your::text
This is exactly how it should look! And I certainly do not have anything against a syntactic-sugar function, for example:

boost::function_with_text_parameter(text::from_os(str));

which would hide the ugly(?) tags and possible no-ops in the most important conversion cases, like the one where you are talking to the OS API.

[...]

Matus

on Thu Aug 11 2011, Artyom Beilis <artyomtnk-AT-yahoo.com> wrote:
From: Daniel James <dnljms@gmail.com>
Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
On 11 August 2011 12:03, Artyom Beilis <artyomtnk@yahoo.com> wrote:
The problem is policy: Boost just can't decide once and forever that std::string is UTF-8...
Even if there was a consensus within boost, that isn't feasible. We don't own std::string, so we don't have a say in what it represents.
std::string represents a sequence of "char" objects that happens to be useful for text processing. It can represent a text in any encoding.
The question is how we treat this sequence... And this is a matter of policy and requirements of the library.
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 13 August 2011 19:02, Dave Abrahams <dave@boostpro.com> wrote:
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8?
Because if the native encoding isn't UTF-8 that will give the wrong result for cases such as:

int main(int argc, char** argv) {
    // ....
    boost::filesystem::path p(argv[0]);

On 13 August 2011 20:10, Daniel James <dnljms@gmail.com> wrote:
On 13 August 2011 19:02, Dave Abrahams <dave@boostpro.com> wrote:
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8?
Because if the native encoding isn't UTF-8 that will give the wrong result for cases such as:
int main(int argc, char** argv) {
    // ....
    boost::filesystem::path p(argv[0]);
As a reader of the long discussions of a new string class, it seems to me the only solution left is to pass the encoding as a separate entity from the string to those functions that need it. Because:

* A new string class only pushes the problem one level further up from the library, and imposes unnecessary copying of data on those who don't need/want it. There's a myriad of string classes already; yet another adapter/container doesn't make things cleaner.
* Enforcing UTF-8 possibly breaks existing applications, which assume the current behaviour (whatever that is).

With the above options discarded, I see this (in some form):

enum string_encoding {
    platform_specific,
    utf_8,
};

boost::filesystem::path(const char* str, boost::string_encoding e = boost::platform_specific);

If boost were to settle for only two viable encodings, i.e. platform_specific (or whatever name matches the current behaviour in the related libraries) and utf_8, it would at least imply that utf_8 is the preferred option for portable code, even if libraries default to platform_specific for backward compatibility. The utf_8 encoding would take the route that Artyom advocates, but in a more explicit way.

Well, that's my two euro cents ;)

cheers,
- Christian
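Hypothetical usage of the proposed interface (the enum, the extra constructor parameter, and the variable names below are part of Christian's sketch and my illustration, not an existing Boost.Filesystem API):

    // default: current platform-specific behaviour, existing code unaffected
    boost::filesystem::path p1(argv[1]);
    // explicit: the caller states that the bytes are UTF-8
    boost::filesystem::path p2(name_from_config, boost::utf_8);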

on Sat Aug 13 2011, Daniel James <dnljms-AT-gmail.com> wrote:
On 13 August 2011 19:02, Dave Abrahams <dave@boostpro.com> wrote:
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8?
Because if the native encoding isn't UTF-8 that will give the wrong result for cases such as:
int main(int argc, char** argv) {
    // ....
    boost::filesystem::path p(argv[0]);
? I don't see any utf-8 here -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Monday 15 August 2011 20:07:36 Dave Abrahams wrote:
on Sat Aug 13 2011, Daniel James <dnljms-AT-gmail.com> wrote:
On 13 August 2011 19:02, Dave Abrahams <dave@boostpro.com> wrote:
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8?
Because if the native encoding isn't UTF-8 that will give the wrong result for cases such as:
int main(int argc, char** argv) {
    // ....
    boost::filesystem::path p(argv[0]);
? I don't see any utf-8 here
File names and command line arguments are encoded as utf-8 these days, which is an example of ascii-encoded strings transparently transitioning to utf-8. Indeed, as utf-8 was designed to be backwards compatible with ascii, it is natural to treat any string or char* as utf-8.

On 16 August 2011 13:15, Marius Stoica <letto2@gmail.com> wrote:
File names and comand line arguments are encoded as utf-8 these days
That's not true for windows. It isn't necessarily true for linux either (although it usually is nowadays). The point is that libraries like filesystem and program options should use the native encoding by default.

On Tue, Aug 16, 2011 at 12:27:51PM +0100, Daniel James wrote:
On 16 August 2011 13:15, Marius Stoica <letto2@gmail.com> wrote:
File names and comand line arguments are encoded as utf-8 these days
That's not true for windows. It isn't necessarily true for linux either (although it usually is nowadays). The point is that libraries like filesystem and program options should use the native encoding by default.
This is handled by Filesystem v3: anything that accepts a string input also takes a codecvt, which defaults to a default-constructed codecvt() (which, as far as I understand it, uses the current locale). I ran into a bug with this the other day on Windows (out of ignorance) where a client machine tried to store UTF-8 paths into a Boost.Filesystem path without using a utf8_codecvt_facet. Speaking of which, the one in detail/utf8_codecvt_facet.{cpp,hpp} is quite unfriendly to end programmers.

-- Lars Viklund | zao@acc.umu.se
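For anyone hitting the same bug, a sketch of one way to force UTF-8 narrow-path conversion in Filesystem v3 via path::imbue(); since the facet lives under detail/, the exact header and class name below are an assumption rather than a stable public API:

    #include <boost/filesystem.hpp>
    #include <boost/filesystem/detail/utf8_codecvt_facet.hpp> // assumed location
    #include <locale>

    void force_utf8_paths() {
        // replace the locale used for narrow-string path conversions
        std::locale loc(std::locale(), new boost::filesystem::detail::utf8_codecvt_facet);
        boost::filesystem::path::imbue(loc);
    }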

on Sat Aug 13 2011, Daniel James <dnljms-AT-gmail.com> wrote:
On 13 August 2011 19:02, Dave Abrahams <dave@boostpro.com> wrote:
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8?
Because if the native encoding isn't UTF-8 that will give the wrong result for cases such as:
int main(int argc, char** argv) {
    // ....
    boost::filesystem::path p(argv[0]);
? I don't see any std::string here. Is there an implicit conversion? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 16 August 2011 01:08, Dave Abrahams <dave@boostpro.com> wrote:
on Sat Aug 13 2011, Daniel James <dnljms-AT-gmail.com> wrote:
On 13 August 2011 19:02, Dave Abrahams <dave@boostpro.com> wrote:
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8?
Because if the native encoding isn't UTF-8 that will give the wrong result for cases such as:
int main(int argc, char** argv) {
    // ....
    boost::filesystem::path p(argv[0]);
? I don't see any std::string here. Is there an implicit conversion?
Well no, but it'd be an odd choice to make this do something different (wrt. the encoding of the argument):

int main(int argc, char** argv) {
    std::vector<std::string> arguments(argv, argv + argc);
    // ....
    boost::filesystem::path p(arguments[0]);

Dave Abrahams wrote:
std::string represents a sequence of "char" objects that happens to be useful for text processing. It can represent a text in any encoding.
The question is how we treat this sequence... And this is a matter of policy and requirements of the library.
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8?
hmmm - why can't we just leave it at "std::string represents a sequence of 'char'" and define some derivative class, "a refinement of std::string which supports UTF-8 functionality"?

Robert Ramey

On Sun, Aug 14, 2011, Robert Ramey wrote:
hmmm - why can't we just leave it at "std::string represents a sequence of "char"" and define some derivative class which defines it as a "a refinement of std::string which supports UTF-8 functionality" ?
Actually my design is based on this exact idea, except that I added an indirection through smart pointers. But if you look at my alt_string_traits implementation at https://github.com/crf00/boost.ustr/blob/master/boost/ustr/detail/alt_string..., it eliminates all the smart pointers, effectively making it exactly the same as what you have described.

On Thu, Aug 11, 2011 at 14:41, Daniel James <dnljms@gmail.com> wrote:
On 11 August 2011 12:03, Artyom Beilis <artyomtnk@yahoo.com> wrote:
The problem is policy: Boost just can't decide once and forever that std::string is UTF-8...
Even if there was a consensus within boost, that isn't feasible. We don't own std::string, so we don't have a say in what it represents.
Of course it's feasible. We have the right to say what it represents in the interfaces of *our* libraries. If Boost.ProgramOptions, Boost.Locale and Sqlite did it, surely we can adopt this policy for the rest of the libraries.

There's a lot of existing code which is not based on that assumption -
we can't just wish it out of existence and boost should be compatible with it.
Most existing code working with plain chars is either encoding-agnostic or already wrong.

As for the design of the proposed library: it mixes two orthogonal concepts, namely encoding and storage. The two shall be separate. I don't like reference-counted strings: passing strings by reference is not that hard, and moreover, lots of atomic memory-bus locks in a multiprocessor system degrade performance. The 'unicode' support (codepoint iteration, etc.) is purely algorithmic and thus shall be independent of the way the data is stored. I would like to see something like `codepoints(any_char_iterator_range)` returning a range of codepoints, as sketched below.

-- Yakov
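A minimal sketch of what such a codepoints() function could look like, eagerly decoding into a vector for simplicity (a lazy range would be the real goal); it reuses the boost/locale/utf.hpp traits mentioned earlier in the thread, and the function name comes from the suggestion above:

    #include <boost/locale/utf.hpp>
    #include <iterator>
    #include <vector>

    template <typename Iterator>
    std::vector<boost::locale::utf::code_point> codepoints(Iterator begin, Iterator end) {
        namespace utf = boost::locale::utf;
        typedef typename std::iterator_traits<Iterator>::value_type char_type;
        std::vector<utf::code_point> result;
        while (begin != end) {
            utf::code_point c = utf::utf_traits<char_type>::decode(begin, end);
            if (c == utf::illegal || c == utf::incomplete)
                break; // stop at the first invalid or truncated sequence
            result.push_back(c);
        }
        return result;
    }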

On 11 August 2011 13:12, Yakov Galka <ybungalobill@gmail.com> wrote:
On Thu, Aug 11, 2011 at 14:41, Daniel James <dnljms@gmail.com> wrote:
Even if there was a consensus within boost, that isn't feasible. We don't own std::string, so we don't have a say in what it represents.
Of course it's feasible. We have the right to say what it represents in the interface of *our* libraries.
Not really, boost is intended to be interoperable with the C++ standard library. That limits us to following its conventions and policies.
If Boost.ProgramOptions, Boost.Locale and Sqlite did it, surely we can adopt this policy to the rest of the libraries.
According to its documentation, Program Options doesn't require UTF-8; it uses the standard locale facet. Locale can dictate localization issues, since it's a localization library and its users have chosen to use it. Users of boost's other libraries haven't made that decision. As far as I can tell, sqlite doesn't use std::string, so I'm not sure why it's relevant. Regardless of that, it doesn't have the same requirements as us.

On Fri, Aug 12, 2011 at 10:58, Daniel James <dnljms@gmail.com> wrote:
On Thu, Aug 11, 2011 at 14:41, Daniel James <dnljms@gmail.com> wrote:
Even if there was a consensus within boost, that isn't feasible. We don't own std::string, so we don't have a say in what it represents.
On 11 August 2011 13:12, Yakov Galka <ybungalobill@gmail.com> wrote:
Of course it's feasible. We have the right to say what it represents in the interface of *our* libraries.
Not really, boost is intended to be interoperable with the C++ standard library. That limits us to following its conventions and policies.
The standard library doesn't have any conventions. As a result, even in C++11 you cannot open a unicode filename in a portable way, even among systems that do support unicode. Boost's role is to provide us the tools to do things in a portable manner, to *hide* the differences between the platforms. The best way to accomplish this is to standardize things.
If Boost.ProgramOptions, Boost.Locale and Sqlite did it, surely we can adopt this policy to the rest of the libraries.
According to its documentation, Program Options doesn't require UTF-8; it uses the standard locale facet.
Oops, you're right. My claim was based on http://www.boost.org/doc/libs/1_47_0/doc/html/program_options/design.html, which happens to be a LIE. [...]
As far as I can tell sqlite doesn't use std::string, so I'm not sure why it's relevant. Regardless of that, it doesn't have the same requirements as us.
We are talking here not just about std::string but about any 'sequence of chars'. sqlite accepts UTF-8 filenames on *windows*! I'm not sure what requirements you are talking about; boost is a library after all, just as sqlite is.
On Fri, Aug 12, 2011 at 10:57, Daniel James <dnljms@gmail.com> wrote:
On 11 August 2011 12:57, Artyom Beilis <artyomtnk@yahoo.com> wrote:
There's a lot of existing code which is not based on that assumption - we can't just wish it out of existence and boost should be compatible with it.
Then cross-platform, Unicode-aware programming will always (I'm sorry) suck with Boost :-)
That's it...
Unless a different solution can be found.
Exactly. We are proposing a solution that has already been proven to work. You're resisting the change and prefer to be stuck with the status quo that, as we see, does not solve the problem. -- Yakov

Hi, I just wanted to express my worry that this discussion is drifting toward "what would have been the best solution?" instead of "how well does the proposed solution solve the problem?". Whatever our preferred solution, couldn't we just focus on using/experimenting/trying the proposed solution and see where we get with it, in practice? I think it would be more helpful. After all, the points of each solution have already been discussed a lot in past months. My 0.2 cents. Joël Lamotte

On 12 August 2011 10:30, Yakov Galka <ybungalobill@gmail.com> wrote:
On Fri, Aug 12, 2011 at 10:58, Daniel James <dnljms@gmail.com> wrote:
Not really, boost is intended to be interoperable with the C++ standard library. That limits us to following its conventions and policies.
The standard library doesn't have any conventions.
My mistake, I should have said the de facto conventions.
Oops, you're right. My claim was based on http://www.boost.org/doc/libs/1_47_0/doc/html/program_options/design.html, which happens to be a LIE.
I doubt it's a lie. Sometimes we change our mind and forget to update the rationale. Or maybe it uses the locale when converting from narrow to wide, but uses UTF-8 when dealing with narrow strings. Anyway, this is getting increasingly off topic.
We are talking here not just about std::string but about any 'sequences of chars'. sqlite accepts UTF-8 filenames on *windows*! I'm not sure about what requirements are you talking. boost is a library after all, just as sqlite is.
A part of the popularity of boost is because it works well with existing code. So we need to work with the strings we get at the command line, from streams etc.
Unless a different solution can be found.
Exactly.
I meant a different solution to assuming that std::string is always UTF-8. It appears unlikely that your proposal will be accepted by boost, so what other possibilities are there? Perhaps a distinct string type, maybe some mechanism to specify what encoding strings are using, or something else entirely?

On Fri, Aug 12, 2011 at 15:29, Daniel James <dnljms@gmail.com> wrote:
[...] A part of the popularity of boost is because it works well with existing code. So we need to work with the strings we get at the command line, from streams etc.
It will work well with existing code. The user will just need to convert her strings to UTF-8 if they aren't UTF-8 already.
Unless a different solution can be found.
Exactly.
I meant a different solution to assuming that std::string is always UTF-8. It appears unlikely that your proposal will be accepted by boost, so what other possibilities are there? Perhaps a distinct string type, maybe some mechanism to specify what encoding strings are using, or something else entirely?
I assume that we all agree that we want to encourage UTF-8, as it's the only way to handle Unicode on all systems except windows. Assuming we *do* want to move to UTF-8, we have just one problem: compatibility. We cannot both use UTF-8 and avoid breaking or changing any existing code. It's just impossible. See (3) in my previous mail.
(4) I've already proposed another solution on another thread. We can add a compile-time flag that, if set, makes all the narrow char and std::string interfaces assume UTF-8 encoding. This flag can be off by default as long as we feel comfortable. When more people find this useful we can make it the default and deprecate the non-UTF-8 configuration. This is *the only way* to kill two birds with one stone. Any other solution either silently breaks existing code, or requires boilerplate code to be added even in code that uses UTF-8 std::strings. I think this is the simplest, least painful way to switch to UTF-8. In fact, if the experiment fails, we can just deprecate the feature, remove it later and pretend it never happened.
Unfortunately, the community mostly ignored this proposal. The only reply from a library author showed complete unwillingness to soil his holy code with such unimportant things as portable Unicode support on windows. He sent me to request UTF-8 support from microsoft itself. -- Yakov
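(For reference, the conversion at the OS boundary that such a UTF-8 policy implies on Windows is short; a minimal sketch using the Win32 API, with error handling reduced to a single exception:)

#include <windows.h>
#include <stdexcept>
#include <string>

// Convert a UTF-8 std::string to UTF-16 for the Win32 "W" APIs.
std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  NULL, 0);
    if (len == 0) throw std::runtime_error("invalid UTF-8");
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &wide[0], len);
    return wide;
}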

On Thu, Aug 11, 2011 Yakov Galka wrote:
As per the design of the proposed library: It mixes two orthogonal concepts, namely encoding and storage. The two shall be separate. I don't like reference counted strings. Passing strings by reference is not that hard. Moreover, lots of atomic memory-bus locks in a multiprocessor system degrade performance.
If you don't like smart pointers to strings, I have just created alternative string traits; you can look at them in the detail folder. The alt_string_traits defines all smart pointer types as the original string type, effectively making unicode_string_adapter hold the string object directly as its member. I hope that you'd agree with me that the pattern:

class unicode_string {
public:
    // decorated methods here
    ...
private:
    std::string _str;
};

is effectively the same as:

class unicode_string : public std::string {
public:
    // decorated methods here
    ...
};

and that a unicode class like that is an extension of the original string.
The 'unicode' support (codepoint iteration, etc) is purely algorithmic and thus shall be independent of the way the data is stored. I would like to see something like `codepoints(any_char_iterator_range)` returning a range of codepoints.
Due to popular demand I've added the static method unicode_string_adapter::make_codepoint_iterator() to satisfy your request. It takes three arguments, current, begin, and end, to traverse in both directions without going out of bounds. I am sorry that I don't understand why this request is so insisted upon when the same functionality already exists in Boost.Unicode and other Unicode libraries. But anyway, here it is.
On Thu, Aug 11, 2011 Phil Endecott wrote:
Soares Chen Ruo Fei wrote:
you can assume the class to have the following signature with identical functionality:
template <typename StringT> class unicode_string_adapter : public std::shared_ptr<const StringT>;
What is your rationale for that? What precedent is there for a wrapper/adapter/facade that behaves like a pointer to the wrapped object? I'm not aware of any precedents; this is a new pattern to me. Why have you chosen to do this, rather than providing an accessor member like impl()?
Let's just say it's purely syntactic taste that I think can ease the transition without too much confusion. But since the method consists of just a few lines of code, why don't we have a democratic vote on whether to keep operator *()? I don't know, maybe let's say if six or more people out of ten here vote no, then I'll delete those few lines and make everyone happy. :) Actually I can't find any precedents that use this pattern either, so I'm not sure whether it is a new pattern. The basis of the pattern is simple: use shared_ptr to store a const version of the object so that it can be shared cheaply; clone the object for modification and store the cloned mutable object in something like unique_ptr to disable copying and sharing; and, if possible, find a way to disable access to const methods on this mutable object to prevent the user from reading and writing at the same time. I won't say this pattern is flawless, but just because nobody has used it before doesn't mean it is bad by default. It'll need some practical use to see whether it can lead to better programming constructs and fewer bugs, especially for amateur developers who are new to C++.
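(As an illustration of the pattern described above - a minimal hypothetical sketch, not Boost.Ustr's actual code: share an immutable object through shared_ptr-to-const, and clone it for modification:)

#include <memory>
#include <string>

class shared_const_string {
public:
    explicit shared_const_string(std::string s)
        : impl_(std::make_shared<const std::string>(std::move(s))) {}

    // Read access shares the immutable object; copying the wrapper is cheap.
    const std::string& operator*() const { return *impl_; }

    // Mutation clones the underlying string first (copy-on-write).
    shared_const_string append(const std::string& tail) const {
        std::string clone = *impl_;  // clone for modification
        clone += tail;
        return shared_const_string(std::move(clone));
    }

private:
    std::shared_ptr<const std::string> impl_;
};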
On Fri, Aug 12, 2011 at 8:19 PM, Daniel James <dnljms@gmail.com> wrote:
I'm not a Windows expert, but I needed to do this for quickbook. I wasn't able to find a complete solution, but what I've got sort of works. Maybe someone else knows better. I've a horrible feeling that someone is going to point out a much simpler solution that makes what I do look silly and pointless.
[...]
Annoyingly _O_U16TEXT was largely undocumented until recently, I don't know if it was available before Visual Studio 2008. The last time I checked, it wasn't available in mingw. Here's the MSDN page for _setmode:
http://msdn.microsoft.com/en-us/library/tw4k6df8%28v=VS.100%29.aspx
The '_isatty(_fileno(stdout))' checks that you are writing to the console. You don't want to write UTF-16 when output is piped into a program that expects 8 bit characters.
A better solution might be to use the UTF-8 code page for output, but that didn't seem to work, at least not on XP.
Finally, remember to make sure your console is using a font that can display the characters you're outputting.
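(A minimal sketch of the technique described above, assuming the MSVC runtime:)

#include <io.h>     // _isatty, _setmode
#include <fcntl.h>  // _O_U16TEXT
#include <stdio.h>  // _fileno, wprintf

int main() {
    // Switch stdout to UTF-16 mode only when writing directly to the
    // console; piped output should stay in the default narrow mode.
    if (_isatty(_fileno(stdout)))
        _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x4F60\x597D\n");  // prints 你好
    return 0;
}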
Thanks a lot for pointing out these obscure techniques! Now that I have a clue I can find out more about them through Google.

On Aug 12, 2011, at 2:48 PM, Soares Chen Ruo Fei <crf@hypershell.org> wrote:
What is your rationale for that? What precedent is there for a wrapper/adapter/facade that behaves like a pointer to the wrapped object? I'm not aware of any precedents; this is a new pattern to me. Why have you chosen to do this, rather than providing an accessor member like impl()?
Let's just say it's purely syntactic taste that I think can ease the transition without too much confusion. But since the method consists of just a few lines of code, why don't we have a democratic vote on whether to keep operator *()?
Not a vote.
Actually I can't find any precedents that use this pattern as well, so I'm not sure whether it is a new pattern.
IIUC Vladimir Batov's Pimpl library works much like this. I don't know if it makes sense here, but you might want to take a look.

Soares Chen Ruo Fei wrote:
Due to popular demand I've added the static method unicode_string_adapter::make_codepoint_iterator() to satisfy your request. It takes three arguments, current, begin, and end, to traverse in both directions without going out of bounds.
Please decouple it more, e.g. make the codepoint iterator a public type and provide a constructor:

const char* first = ...;
const char* last = ...;
const_utf8_codepoint_iterator iter(first, first, last);
I am sorry that I don't understand why is this request so insisted when the same functionality is already exist in Boost.Unicode and other Unicode libraries. But anyway here is it.
As I think I said before, I don't mind who implements it as long as we get it into Boost. Boost.Unicode is not yet an accepted library; it's not even on the review schedule. It seems to me that all the different opinions around Unicode and strings mean that we miss the opportunity to provide this core, uncontroversial, functionality. Regards, Phil.

On Aug 13, 2011, at 7:02 AM, "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
As I think I said before, I don't mind who implements it as long as we get it into Boost. Boost.Unicode is not yet an accepted library; it's not even on the review schedule.
My understanding is that Mathias is just waiting for a good application or two and then he'll put it up for review. IMHO this library should use that one and then you can focus on the std::string-like stuff and he can do the range stuff. This might require some intermediate frameworks to deal with the other encodings. Please watch the video of Mathias' talk - after the first half it sort of turned into an impromptu review where among other things people were discussing how to make it useful by connecting it to a string wrapper class.

On Aug 13, 2011, at 7:20 AM, Gordon Woodhull <gordon@woodhull.com> wrote:
it sort of turned into an impromptu review where among other things people were discussing how to make it useful by connecting it to a string wrapper class.
s/make it useful/bring it to a wider audience/

On Sat, Aug 13, 2011, Gordon Woodhull wrote:
As I think I said before, I don't mind who implements it as long as we get it into Boost. Boost.Unicode is not yet an accepted library; it's not even on the review schedule.
My understanding is that Mathias is just waiting for a good application or two and then he'll put it up for review.
Yup, you can count on Mathias' Boost.Unicode in its mature phase much more than you can count on my Boost.Ustr in its infancy. :)
IMHO this library should use that one and then you can focus on the std::string-like stuff and he can do the range stuff. This might require some intermediate frameworks to deal with the other encodings.
Please watch the video of Mathias' talk - after the first half it sort of turned into an impromptu review where among other things people were discussing how to make it useful by connecting it to a string wrapper class.
I just briefly watched it and yeah, my idea is basically the same as the commentator's - to add Unicode functionality specifically to *strings*, keep it as simple as possible, sacrificing generality, and design it specially for users who *don't* care about Unicode processing. Thanks for pointing to the videos; I'd forgotten about them while busy with other things.

On Thu, Aug 11, 2011, Artyom Beilis wrote:
My strong opinion is:
a. Strings should be just a container object with a default encoding and some useful API to handle it.
b. The default encoding MUST be UTF-8.
c. There are several ways to implement strings: COW, mutable, immutable, with small string optimization and so on. This way or other, std::string is the de-facto string and I think we should live with it and use some alternative containers where it matters.
d. Code point and code unit are meaningless unless you develop some Unicode algorithm - and you don't - you use one written by experts.
This Ustr does not solve this problem, as it does not really provide some kind of
adapter<generic encoding> { string content }
This is the kind of thing that may be useful, but not in this case. Basically your library provides a wrapper around a string and outputs Unicode code points, but it does so for UTF encodings only!
It does not bring much benefit. You provide encoding traits, but they are basically meaningless for the purpose you had given, as:
It does not provide traits for non-Unicode encodings like, let's say, Shift-JIS or ISO-8859-8
The library is designed to be flexible without intending to include every possible encoding by default. The point is that external developers can leverage the EncodingTraits template parameter to implement the desired encoding *themselves*. The core library should be as small as possible, not bloated by translation tables for encodings that are not commonly used by the rest of the world. You may request that I add a sub-library for Shift-JIS or other encodings, and I'll consider implementing it on popular demand.
BTW you can't create traits for many encodings, for example you can't implement traits requirements:
http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_enc...
For popular encodings like Shift-JIS or GBK...
Homework: tell me why ;-)
I was trying to write a few lines of prototype code to show you that it'd work, but I've run out of time and missed so many replies, so I'll show you next time. But why not? There is already a standard translation table offered by the Unicode Consortium at ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT, so all that is needed is to make the encoder/decoder work with that translation table.
Perhaps you are referring to the non-roundtrip conversion of some Shift-JIS characters, as mentioned by Microsoft at http://support.microsoft.com/kb/170559. But the objective is best-effort emulation, not complete perfection. Such a problem cannot be solved by any other implementation means anyway, so if you are trying to convert Shift-JIS strings to Unicode before passing them to Unicode-oriented functions, you are still screwed the same way.
Or perhaps you mean that you can't properly encode non-Japanese characters into the Shift-JIS encoding. In that case the character will just be substituted by a replacement character, or an exception will be thrown, according to the provided policy. But the user of such a Unicode-emulated string should only read and not modify it anyway, i.e. it is probably a bad idea to create a unicode_string_adapter_builder instance of the string and pass it to a mutation function that assumes full Unicode encoding functionality. Then again, you are also screwed the same way if you do a manual conversion from a Unicode-encoded string to a Shift-JIS-encoded string and pass it to a Shift-JIS-oriented function.
The conclusion is that Boost.Ustr with custom encoding traits is intended for the convenience of automatically converting between two encodings at library boundaries. It will not solve any encoding conversion problem that you can't solve with manual conversion either.
Also, it is likely that encoding is something that can be changed at runtime, not compile time, and it seems that this adapter does not support such an option.
It is always possible to add a dynamic layer on top of the static encoding layer but not the other way round. It shouldn't be too hard to write a class with virtual interfaces to call the proper template instance of unicode_string_adapter. But that is currently outside of the scope and the fundamental design will still be static regardless.
If someone uses strings with different encodings he usually knows their encoding...
The problem is that the API is inconsistent: on Windows a narrow string is in some ANSI code page, and anywhere else it is UTF-8.
This is an entirely different problem, and such adapters don't really solve it but actually make it worse...
If I'm not wrong, however, the wide version of strings on Windows is always UTF-16 encoded, am I correct? So a manual solution for constructing UTF-8 strings on Windows would be similar to:

std::wstring wide_str = L"世界你好";
std::string u8_str;
generic_conversion::u16_to_u8(wide_str.begin(), wide_str.end(), std::back_inserter(u8_str));

except that it will not be portable across Unix systems. But with Boost.Ustr you can achieve the same thing with

unicode_string_adapter<std::string> u8_str = USTR("世界你好");

which gets expanded into

unicode_string_adapter<std::string> u8_str = unicode_string_adapter<std::wstring>(std::wstring(L"世界你好"));

So while it's hard to construct UTF-8 string literals on Windows, it should still be possible by writing code that manually inserts UTF-8 code units into std::string. After all std::string does not have any restriction on which bytes we can manually insert into it. Perhaps you are talking about printing the string out to std::cout, but that's another hard problem that we have to tackle separately.
Other problem is ================
I don't believe that a string adapter would solve any real problems because:
a) If you iterate over code points you are very likely doing something wrong, as code point != character, and this is a very common mistake.
I am well aware of it, but I decided to separate the concerns into different layers and tackle the abstract character problem at a higher layer. The unicode_string_adapter class has a well defined role which is to offer *code point* level access. I did plan to write another class that does the character iteration, but I don't have enough time to do it yet. But I believe you can pass the code point iterators to Mathias' Boost.Unicode methods to create abstract character iterators that do the job you want.
b) If you want to iterate over code points it is better to have some kind of utf_iterator that receives a range and iterates over it; it would be more generic and would not require an additional class.
For example, Boost.Locale has utf_traits that allow implementing iteration over code points quite easily.
See: http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1... http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/structboost_1_1loc...
And you don't need any kind of specific adapters.
I added a static method unicode_string_adapter::make_codepoint_iterator to do what you have requested. It accepts three code unit iterator parameters, current, begin, and end, so that it can iterate in both directions without going out of bounds. Hope that is what you're looking for.
c) The problem in Boost is not a missing Unicode string, and yet-another-unicode-string is not even required for us to have good Unicode support.
The problem is policy: the problem is that Boost just can't decide once and forever that std::string is UTF-8...
But don't get me wrong. This is My Opinion, many would disagree with me.
Bottom line,
Unicode strings, cool string adapters, UTF-iterators and even Boost.Unicode and Boost.Locale will not solve the problem that Boost libraries use inconsistent encodings on different platforms.
IMHO: the only way to solve it is POLICY.
It is ok if you disagree with my approach, but keep in mind that the focus of this thread is just to tell whether my current proposed solution is good, and not to propose a better solution. Thanks. cheers, Soares

It does not provide traits for non-Unicode encodings like, let's say, Shift-JIS or ISO-8859-8
The library is designed to be flexible without intending to include every possible encoding by default. The point is that external developers can leverage the EncodingTraits template parameter to implement the desired encoding *themselves*.
I understand, and I do not expect you to implement every encoding, but some encodings, let's say Latin1 or ASCII, should be provided to at least serve as an example.
BTW you can't create traits for many encodings, for example you can't implement traits requirements:
http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_enc...
For popular encodings like Shift-JIS or GBK...
Homework: tell me why ;-)
Perhaps you are referring to the non-roundtrip conversion of some Shift-JIS characters, as mentioned by Microsoft at http://support.microsoft.com/kb/170559.
No
Or perhaps you mean that you can't properly encode non-Japanese characters into the Shift-JIS encoding
No. The problem is that, unlike variable-width UTF encodings, which have a clear separation between the lead and the trail code units, multi-byte encodings like Shift-JIS or GBK have no such separation. The UTF-8 and UTF-16 encodings are so-called self-synchronizing: you can go forward, you can go backward without any problem, and even if you lose the position you can find the next valid position in either direction and continue. However, with non-Unicode CJK encodings like Shift-JIS or GBK there is no way to go backward because it is ambiguous, and in order to decode text you always have to go forward. That is why the traits model you provided has a conceptual flaw: it is impossible to implement a bidirectional iterator over most non-UTF CJK multi-byte encodings.
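(To illustrate the difference: UTF-8 trail bytes are always of the form 10xxxxxx and can never be confused with lead bytes, so a backward step just skips trail bytes, as in the sketch below. In Shift-JIS, a byte such as 0x41 can be a standalone 'A' or the second byte of a two-byte character, so no equivalent test exists:)

// UTF-8 trail bytes are 10xxxxxx; lead bytes never match this pattern.
inline bool is_utf8_trail(unsigned char b) { return (b & 0xC0) == 0x80; }

// Step one code point backward in UTF-8 (a sketch; assumes valid input).
inline const char* utf8_prev(const char* p, const char* begin) {
    do { --p; } while (p > begin && is_utf8_trail(static_cast<unsigned char>(*p)));
    return p;
}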
It is always possible to add a dynamic layer on top of the static encoding layer but not the other way round. It shouldn't be too hard
Is the encoding traits object a member of unicode_string_adapter?
From what I have seen in the code it is not - correct me if I'm wrong.
So in the current situation you can't make dynamic encoding traits.
to write a class with virtual interfaces to call the proper template instance of unicode_string_adapter. But that is currently outside of the scope and the fundamental design will still be static regardless.
This is a problem because, for example, in the typical use case the ANSI code page used as the default encoding is defined by the OS the program runs on.
If someone uses strings with different encodings he usually knows their encoding...
The problem is that the API is inconsistent: on Windows a narrow string is in some ANSI code page, and anywhere else it is UTF-8.
This is an entirely different problem, and such adapters don't really solve it but actually make it worse...
If I'm not wrong however, the wide version of strings on Windows is always UTF-16 encoded, am I correct?
Yes, but wide characters are useless for cross-platform development, as they are UTF-16 only on Windows; on other OSes they are UTF-32.
So while it's hard to construct UTF-8 string literals on Windows, it should still be possible by writing code that manually inserts UTF-8 code units into std::string. After all std::string does not have any restriction on which bytes we can manually insert into it.
Sorry, but IMHO this is quite an ugly solution...
Perhaps you are talking about printing the string out to std::cout, but that's another hard problem that we have to tackle separately.
Actually this is quite simple even on Windows. In the console, write chcp 65001 and UTF-8 output will then be shown correctly.
The unicode_string_adapter class has a well defined role which is to offer *code point* level access.
No problem with that, but shouldn't it then maybe be a code_point iterator and not a "String-Adaptor"?
I did plan to write another class that does the character iteration, but I don't have enough time to do it yet.
If you implement character iteration it will not be a compact library, as character iteration requires access to the Unicode database of character properties. And actually Boost.Locale already provides character iteration. http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1... And Boost.Unicode should provide grapheme segmentation with a different interface.
But I believe you can pass the code point iterators to Mathias' Boost.Unicode methods to create abstract character iterators that do the job you want.
I see, this is better.
I added a static method unicode_string_adapter::make_codepoint_iterator to do what you have requested. It accepts three code unit iterator parameters, current, begin, and end, so that it can iterate in both directions without going out of bounds. Hope that is what you're looking for.
Yes, this is better, but it should not be a static method of string_adaptor but a free algorithm.
It is ok if you disagree with my approach, but keep in mind that the focus of this thread is just to tell whether my current proposed solution is good, and not to propose a better solution.
Thanks.
I understand that, and I have pointed out the problems with the string adaptor. But I also question the usability and motivation of the library. Best, Artyom

On Sat, Aug 13, Artyom Beilis wrote:
I understand and I do not expect from you implementing every encoding, but some encodings lets say Latin1 or ASCII should be given to at least provide an example.
That's an easier task. :) I'll try to find time to implement the ASCII and URL-encoded encoding traits, but I'm not sure if I can make it before the GSoC deadline next week.
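(For what it's worth, the Latin-1 case is nearly trivial, since Latin-1 coincides with the first 256 Unicode code points. A library-agnostic sketch of the codec such traits would wrap; this is not Boost.Ustr's actual traits interface:)

#include <stdexcept>

inline char32_t latin1_decode(unsigned char unit) {
    return unit;  // Latin-1 maps 1:1 onto U+0000..U+00FF
}

inline unsigned char latin1_encode(char32_t cp) {
    if (cp > 0xFF)
        throw std::range_error("code point not representable in Latin-1");
    return static_cast<unsigned char>(cp);
}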
The problem is that, unlike variable-width UTF encodings, which have a clear separation between the lead and the trail code units, multi-byte encodings like Shift-JIS or GBK have no such separation.
The UTF-8 and UTF-16 encodings are so-called self-synchronizing: you can go forward, you can go backward without any problem, and even if you lose the position you can find the next valid position in either direction and continue.
However, with non-Unicode CJK encodings like Shift-JIS or GBK there is no way to go backward because it is ambiguous, and in order to decode text you always have to go forward.
That is why the traits model you provided has a conceptual flaw: it is impossible to implement a bidirectional iterator over most non-UTF CJK multi-byte encodings.
Ahh I see, so that's quite nasty, but actually it can still be done with a sacrifice in efficiency. Basically, since the iterator already has the begin and end boundary iterators, it can simply reiterate from the beginning of the string. Although doing so is roughly O(N^2), it shouldn't make a significant impact, as developers rarely use these multi-byte encodings and even more seldom use the reverse decoding function. We can also look for cases where we can determine the previous character position without having to go back to the beginning, and if that is actually the norm then the penalty will become even less significant.
It is always possible to add a dynamic layer on top of the static encoding layer but not the other way round. It shouldn't be too hard
Is encoding traits object is a member of unicode_string_adapter?
From what I had seen in the code it does not - correct me if I wrong.
So in current situation you can't make dynamic encoding traits.
No, the encoding traits are static according to my design. But what I mean by dynamic encoding is something different from what you thought. See https://github.com/crf00/boost.ustr/blob/master/boost/ustr/dynamic_unicode_s... which is the dynamic encoding string class that I have just written. I haven't included full functionality, but you can get the basic idea of my dynamic string from there.
to write a class with virtual interfaces to call the proper template instance of unicode_string_adapter. But that is currently outside of the scope and the fundamental design will still be static regardless.
This is a problem as for example typical use case where ANSI code page is used as default encodings is something that is defined by the OS the program runs on.
Basically Boost.Ustr is designed to be completely locale-agnostic, so it does not try to play well with locale rules. As I said above, the dynamic encoding string is probably the feature you want, but actually I think the problem you mention can still be solved using purely static encoding. It can be something like:

unicode_string_adapter<std::string> get_string_from_locale_sensitive_system() {
    const char* raw_string = get_locale_dependent_system_string();
    CodePage codepage = get_system_codepage();
    if(codepage == CodePage::UTF8_CodePage) {
        return unicode_string_adapter<std::string>(raw_string);
    } else if(codepage == CodePage::932_CodePage) {
        return unicode_string_adapter<std::string, ..., ShiftJisEncoder, ...>(raw_string);
    } else if(codepage == CodePage::950_CodePage) {
        return unicode_string_adapter<std::string, ..., Big5Encoder, ...>(raw_string);
    }
}

It is probably not a good idea to pass a string encoded in an uncommon encoding and let it slip through the entire system, even with the proper encoding tag. Such a design would eventually lead to bugs even with the best possible help from Unicode utilities. The better idea is to make use of the automatic conversion and convert the string back to a UTF-8 string as soon as the string in the uncommon encoding is no longer needed.
Yes, but wide characters useless for cross platform development as they UTF-16 only on Windows on other OSes they are UTF-32
Since unicode_string_adapter uses template metaprogramming to choose either the UTF-16 or the UTF-32 encoding for wchar_t strings, using wchar_t together with Boost.Ustr should be much less painful IMHO, and it can always be converted to a UTF-8 string with an extra line of code.
So while it's hard to construct UTF-8 string literals on Windows, it should still be possible by writing code that manually inserts UTF-8 code units into std::string. After all std::string does not have any restriction on which bytes we can manually insert into it.
Sorry, but IMHO this is quite an ugly solution...
The implementation may be ugly, but it is already encapsulated. The main issue is whether it makes the *user* code uglier or more elegant. Most users would not care about the extra overhead of converting short string literals at runtime, and IMHO it's much more worthwhile to save developers' time than to avoid run-time overhead.
Perhaps you are talking about printing the string out to std::cout, but that's another hard problem that we have to tackle separately.
Actually this is quite simple even on Windows.
In the console, write chcp 65001 and UTF-8 output will then be shown correctly.
The unicode_string_adapter class has a well defined role which is to offer *code point* level access.
No problem with that, but shouldn't it then maybe be a code_point iterator and not a "String-Adaptor"?
You can suggest a better name; the current name "unicode_string_adapter" is a bit too long anyway. But the code point iterator is just a small part of the whole design. There are many more design issues in making it work closely with the original string than just producing a code point iterator.
I did plan to write another class that does the character iteration, but I don't have enough time to do it yet.
If you implement character iteration it will not be a compact library, as character iteration requires access to the Unicode database of character properties.
And actually Boost.Locale already provides character iteration.
http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1...
And Boost.Unicode should provide grapheme segmentation with a different interface.
Yup... That's why this is not a high-priority task for me. I try to implement just the string-related Unicode functions and leave the rest to Boost.Unicode and Boost.Locale. :) And even if I write an abstract character class it will probably be separate from the core class and rely on these other Unicode libraries for the actual functionality.
I added a static method unicode_string_adapter::make_codepoint_iterator to do what you have requested. It accepts three code unit iterator parameters, current, begin, and end, so that it can iterate in both directions without going out of bounds. Hope that is what you're looking for.
Yes, this is better, but it should not be a static method of string_adaptor but a free algorithm.
Actually the code point iterator class is now independent and can be used for generic purposes. However, it requires a few more template parameters, so it may be even less convenient to use it directly. Currently its signature is as follows:

template <typename CodeunitIterator, typename Encoder, typename Policy>
class codepoint_iterator {
public:
    codepoint_iterator(
        codeunit_iterator_type codeunit_it,
        codeunit_iterator_type begin,
        codeunit_iterator_type end);
};

If you like this class I'll try to implement some convenience functions to construct UTF-8/16/32 iterators. Thanks. cheers, Soares

----- Original Message -----
From: Soares Chen Ruo Fei <crf@hypershell.org>
The problem is that, unlike variable-width UTF encodings, which have a clear separation between the lead and the trail code units, multi-byte encodings like Shift-JIS or GBK have no such separation.
The UTF-8 and UTF-16 encodings are so-called self-synchronizing: you can go forward, you can go backward without any problem, and even if you lose the position you can find the next valid position in either direction and continue.
However, with non-Unicode CJK encodings like Shift-JIS or GBK there is no way to go backward because it is ambiguous, and in order to decode text you always have to go forward.
That is why the traits model you provided has a conceptual flaw: it is impossible to implement a bidirectional iterator over most non-UTF CJK multi-byte encodings.
Ahh I see, so that's quite nasty, but actually it can still be done with a sacrifice in efficiency. Basically, since the iterator already has the begin and end boundary iterators, it can simply reiterate from the beginning of the string. Although doing so is roughly O(N^2), it shouldn't make a significant impact, as developers rarely use these multi-byte encodings and even more seldom use the reverse decoding
Except that these are the default narrow string encodings on Windows (and sometimes even on Linux) in China, Japan and Korea... No, they are not rare encodings. The better solution would be to create an index of Shift-JIS -> code-point and use it; you can probably build it lazily on the first attempt at backward iteration. This is what I do for Boost.Locale.
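(A sketch of that lazy index idea; sjis_next, a forward decoder returning the start of the next character, is hypothetical:)

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical forward decoder: returns the pointer just past the
// (possibly multi-byte) character starting at p.
const char* sjis_next(const char* p, const char* end);

class sjis_backward_index {
public:
    // One forward pass recording where every character starts; this can
    // be deferred until the first backward step.
    void build(const char* begin, const char* end) {
        for (const char* p = begin; p < end; p = sjis_next(p, end))
            starts_.push_back(static_cast<std::size_t>(p - begin));
    }

    // Start offset of the character preceding the one at 'offset'.
    std::size_t prev(std::size_t offset) const {
        std::vector<std::size_t>::const_iterator it =
            std::lower_bound(starts_.begin(), starts_.end(), offset);
        return it == starts_.begin() ? 0 : *(it - 1);
    }

private:
    std::vector<std::size_t> starts_;
};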
It is always possible to add a dynamic layer on top of the static encoding layer but not the other way round. It shouldn't be too hard
Is the encoding traits object a member of unicode_string_adapter?
From what I have seen in the code it is not - correct me if I'm wrong.
So in the current situation you can't make dynamic encoding traits.
No, the encoding traits are static according to my design. But what I mean by dynamic encoding is something different from what you thought. See https://github.com/crf00/boost.ustr/blob/master/boost/ustr/dynamic_unicode_s... which is the dynamic encoding string class that I have just written. I haven't included full functionality, but you can get the basic idea of my dynamic string from there.
Ok I see this code:

class dynamic_codepoint_iterator_object :
    public std::iterator<std::bidirectional_iterator_tag, codepoint_type>
{
public:
    virtual const codepoint_type dereference() const = 0;
    virtual void increment() const = 0;
    virtual void decrement() const = 0;
    virtual bool equals(const dynamic_codepoint_iterator_object* other) const = 0;
    virtual const unicode_string_type& get_type() const = 0;
    virtual void* get_raw_iterator() const = 0;
    virtual ~dynamic_codepoint_iterator_object() { }
};

THIS IS VERY BAD DESIGN.
------------------
I've been there. Think of

while(pos != end) {
    code_point = *pos;
    ++pos;
}

How many virtual calls are required for one code point?
1. equals
2. dereference
3. increment
This is a horrible way to do things. I've started to work on a generic "abstract iterator" for Boost.Locale but haven't completed the work yet. It allows reducing virtual calls per character to below one. This is not the way to go. (BTW you had forgotten the clone() member function)
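(For comparison, a sketch of the batching idea behind such an "abstract iterator": one virtual call refills a whole chunk of code points, amortizing the per-character cost to well below one virtual call. All names here are hypothetical, not Boost.Locale's actual design:)

#include <cstddef>
#include <vector>

class abstract_codepoint_source {
public:
    virtual ~abstract_codepoint_source() {}
    // Fill 'buf' with up to 'n' code points; return the count (0 = end).
    virtual std::size_t fill(char32_t* buf, std::size_t n) = 0;
};

class buffered_codepoint_iterator {
public:
    explicit buffered_codepoint_iterator(abstract_codepoint_source* src)
        : src_(src), pos_(0) { refill(); }

    bool at_end() const { return buf_.empty(); }
    char32_t operator*() const { return buf_[pos_]; }  // no virtual call

    buffered_codepoint_iterator& operator++() {
        if (++pos_ == buf_.size()) refill();  // one virtual call per chunk
        return *this;
    }

private:
    void refill() {
        buf_.resize(64);
        buf_.resize(src_->fill(&buf_[0], buf_.size()));
        pos_ = 0;
    }
    abstract_codepoint_source* src_;
    std::vector<char32_t> buf_;
    std::size_t pos_;
};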
to write a class with virtual interfaces to call the proper template instance of unicode_string_adapter. But that is currently outside of the scope and the fundamental design will still be static regardless.
This is a problem as for example typical use case where ANSI code page is used as default encodings is something that is defined by the OS the program runs on.
Basically Boost.Ustr is designed to be completely locale-agnostic, so it does not try to play well with locale rules. As I said above, the dynamic encoding string is probably the feature you want, but actually I think the problem you mention can still be solved using purely static encoding. It can be something like:
unicode_string_adapter<std::string> get_string_from_locale_sensitive_system() {
    const char* raw_string = get_locale_dependent_system_string();
    CodePage codepage = get_system_codepage();
    if(codepage == CodePage::UTF8_CodePage) {
        return unicode_string_adapter<std::string>(raw_string);
    } else if(codepage == CodePage::932_CodePage) {
        return unicode_string_adapter<std::string, ..., ShiftJisEncoder, ...>(raw_string);
    } else if(codepage == CodePage::950_CodePage) {
        return unicode_string_adapter<std::string, ..., Big5Encoder, ...>(raw_string);
    }
}
The entire motivation behind this library was to provide some "Unicode wrapping" over different encodings. And telling me that the library is "locale"-agnostic makes it unsuitable for the proposed motivation, because non-Unicode encodings do change at run time. But forget locale: the motivation at least requires supporting runtime conversion from the OS ANSI code page to Unicode, which it does not. This is the biggest flaw of the current library.
It is probably not a good idea to pass a string encoded in an uncommon encoding and let it slip through the entire system, even with the proper encoding tag. Such a design would eventually lead to bugs even with the best possible help from Unicode utilities. The better idea is to make use of the automatic conversion and convert the string back to a UTF-8 string as soon as the string in the uncommon encoding is no longer needed.
Or maybe just convert the "uncommon encoding" to UTF-8/UTF-16/UTF-32 and forget all the wrappers?
=====================================================================
=====================================================================
Dear Soares Chen Ruo Fei, (Sorry, I have no idea which is the first name :-) )
Don't get me wrong. I see what you are trying to do and it is a good thing. There is one big problem: you have entered a very-very-very dangerous swamp. It is very easy to make wrong assumptions, not because you don't try your best, but because even such a small problem is very diverse. And the "best" you can do is to assume that if something can go wrong, it will.
It is not an accident that there is no "unicode string" in Boost, and that there are too many "fancy" Unicode strings around, like QString, icu::UnicodeString, gtk::ustring and many others, that somehow fail to provide the goodies you need.
====================================================================
Best Regards, Artyom Beilis -------------- CppCMS - C++ Web Framework: http://cppcms.sf.net/ CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/

On Sun, Aug 14, 2011, Artyom Beilis wrote:
Except that these are the default narrow string encodings on Windows (and sometimes even on Linux) in China, Japan and Korea...
No, they are not rare encodings.
Well, we are talking about *cough* Windows *cough* here; it's not like it is widely used by developers by their *own* preference (except Shift-JIS). But I'd think of these encodings as deprecated encodings that developers should not use in newer programs. Of course we have old-generation programmers who don't care about portability and insist on using their old-time favorite encodings, and we know how long it takes to fully deprecate something. But Boost.Ustr's intended audience is those who *want* everything Unicode badly but are forced to somehow deal with a small portion of legacy code that still uses the old-time MBCS encodings. On the other hand, I don't really consider any use case of making Boost.Ustr easy for hard-core developers who *insist* on continuing to use the MBCS encodings while expecting Boost.Ustr to let them use new Unicode libraries on their MBCS strings. Sure, you can do that, but it is currently outside my scope and intention to support it.
The better solution would be to create an index of Shift-JIS -> code-point and use it; you can probably build it lazily on the first attempt at backward iteration.
This is what I do for Boost.Locale.
Sorry, I thought an index is the same as a translation table; isn't that how the decoding is supposed to work?
Ok I see this code:
class dynamic_codepoint_iterator_object : public std::iterator<std::bidirectional_iterator_tag, codepoint_type> [...]
THIS IS VERY BAD DESIGN.
------------------
I've been there.
Think of
while(pos != end) {
    code_point = *pos;
    ++pos;
}
How many virtual calls are required for one code point?
1. equals
2. dereference
3. increment
This is a horrible way to do things.
It depends on how you look at it, actually, but I'm not surprised that most C++ programmers would complain about such a design and anything that involves virtual functions. (I remember someone also complained that your Boost.Locale library contains even a minimal number of virtual functions. :) People can shout at it the same way they shout at how many machine instructions are spent on a for-loop in Python or JavaScript. It is really a matter of preference between minimal coding with slower performance and more coding with better performance. For me, I'd just choose the right design for the right situation. I cleanly separated the static and dynamic parts into two separate classes, so that you have full freedom to choose whichever class you see fit. The reason I designed dynamic_unicode_string this way is so that you can work transparently with any type of string, be it std::string, std::u16string, std::wstring, std::vector, or anything else. And surely, with such a flexible design, the only way to achieve it is by using virtual functions. If you are performance-critical, you can just use unicode_string_adapter and ignore this dynamic_unicode_string class.
I've started to work on a generic "abstract iterator" for Boost.Locale but haven't completed the work yet. It allows reducing virtual calls per character to below one.
Can you show me the code of your abstract-iterator? Perhaps your objective is different from mine so our designs are different, or perhaps you do have a better design that I can learn from.
This is not the way to go.
(BTW you had forgotten the clone() member function)
Oh yah, ok I'll add that when I have the time. Thanks!
The entire motivation behind this library was to provide some "Unicode wrapping" over different encodings.
And telling me that the library is "locale"-agnostic makes it unsuitable for the proposed motivation, because non-Unicode encodings do change at run time.
But forget locale: the motivation at least requires supporting runtime conversion from the OS ANSI code page to Unicode, which it does not.
This is the biggest flaw of the current library.
I am sorry, but can you give me an example of how that actually happens? I'm not familiar with Windows development, so I don't know most of its quirks. By changing locale at run time, do you mean that an existing char* string stored on the stack/heap can suddenly change its byte content to fit an encoding that is changed at run time? Or does the run-time change only affect new char* strings obtained from the Windows API, while old char* strings retain their old byte content and encoding? If it is the latter, I don't see why the snippet proposed in my previous message doesn't solve your problem. By locale-agnostic what I actually mean is to leave the locale-related problems to a higher layer. I think locale and encodings are two separate issues and they should really be implemented in different classes and libraries. And actually I don't like the design of a locale-aware function that doesn't delegate the actual functionality to a locale-agnostic function. For me, all locale-aware functions should be implemented in two steps: one that detects the locale and then delegates to another specific function that does the work and ignores the current locale.
Or maybe just convert the "uncommon encoding" to UTF-8/UTF-16/UTF-32 and forget all the wrappers?
Yes, if you can do that and assume all std::string is UTF-8, then of course you don't need Boost.Ustr anymore. But until that happens, Boost.Ustr is, I think, what we will have for the moment. :)
Dear Soares Chen Ruo Fei,
(Sorry, I have no idea which is the first name :-) )
You can call me Soares, which is the unofficial English name that I gave myself because my given name is hard for English speakers to pronounce. Chen is my family name. I keep my given name in the email because that is my official name and I thought it might make it easier for GSoC to track my posts.
Don't get me wrong. I see what you are trying to do and it is a good thing.
There is one big problem:
You have entered a very-very-very dangerous swamp.
It is very easy to make wrong assumptions, not because you don't try your best, but because even such a small problem is very diverse. And the "best" you can do is to assume that if something can go wrong, it will.
I do realize that I am *always* trying to do something very dangerous in almost all my previous and current projects. I don't like the harsh criticism I get for my radical ideas and ambitions, and I don't like choosing projects that are "dangerous". But since that's who I really am, I just have no choice but to live with it and hope to strive for eventual success.
It is not an accident that there is no "unicode string" in Boost, and that there are too many "fancy" Unicode strings around, like QString, icu::UnicodeString, gtk::ustring and many others, that somehow fail to provide the goodies you need.
That doesn't stop me from trying and learning something valuable in case it still fails. After all, this is what GSoC should really be about, right: trying to solve something challenging and learning by making mistakes. ;) cheers, Soares

Soares Chen Ruo Fei wrote:
with non-Unicode CJK encodings like Shift-JIS or GBK there is no way to go backward
Ahh I see, so that's quite nasty, but actually it can still be done with a sacrifice in efficiency. Basically, since the iterator already has the begin and end boundary iterators, it can simply reiterate from the beginning of the string. Although doing so is roughly O(N^2), it shouldn't make a significant impact, as developers rarely use these multi-byte encodings and even more seldom use the reverse decoding function.
As a general point, I believe it's a bad idea to hide a surprise like O(N^2) instead of O(N) complexity in a "rare" case. Doing so means that users will implement something that seems to work, and then get bitten later when it doesn't work in the field. (For example, the first time that a customer in Japan tries to process a 1 MB file and it takes a million times longer than expected.) It would be better to not provide the inefficient case at all. Compare with how std::list doesn't provide random access, even though it could do so in O(N). Looking at your character set iterator, it seems to me that you could have a forward-only iterator and a bidirectional iterator for UTF, but only the former for these other encodings. Not storing the begin iterator when only forward iteration is needed also saves space. Regards, Phil.

On Sun, Aug 14, 2011, Phil Endecott wrote:
As a general point, I believe it's a bad idea to hide a surprise like O(N^2) instead of O(N) complexity in a "rare" case. Doing so means that users will implement something that seems to work, and then get bitten later when it doesn't work in the field. (For example, the first time that a customer in Japan tries to process a 1 MB file and it takes a million times longer than expected.)
It would be better to not provide the inefficient case at all. Compare with how std::list doesn't provide random access, even though it could do so in O(N). Looking at your character set iterator, it seems to me that you could have a forward-only iterator and a bidirectional iterator for UTF, but only the former for these other encodings. Not storing the begin iterator when only forward iteration is needed also saves space.
Hmm it is possible that I can use SFINAE to disable the decrement function in the code point iterator, but I think it would probably impact function APIs that accept a generic template of unicode_string_adapter and expect the same behavior for all template instances. It would also require the developer to manually look up the documentation when using a string adapter with custom encoding traits. Of course, having just one conditionally enabled method doesn't hurt that much, but I'd be wary of letting in too many conditional variations across different template instances of unicode_string_adapter. Anyway, I wonder if there is any use case where a developer stores content as large as 1 MB in a single std::string object, and I doubt any operation done on that string would be efficient. Since Boost.Ustr is a string adapter it can make the same reasonable assumptions that general-purpose string classes do, and I don't think any general-purpose string class would expect to scale to that size. cheers, Soares

Soares Chen Ruo Fei wrote:
Anyway, I wonder if there is any use case where a developer stores content as large as 1 MB in a single std::string object, and I doubt any operation done on that string would be efficient. Since Boost.Ustr is a string adapter it can make the same reasonable assumptions that general-purpose string classes do, and I don't think any general-purpose string class would expect to scale to that size.
Err... No, I disagree. I should be able to have a string as large as my virtual memory will allow, and it should continue to perform according to its documented complexity guarantees however large it gets. For example, I have an IMAP mail server that uses std::string internally. This is not code that needs to be especially fast, but it does require that O(1) operations don't become O(N) operations, or that O(N) operations become O(N^2) operations. Phil.

on Sun Aug 14 2011, Soares Chen Ruo Fei <crf-AT-hypershell.org> wrote:
Hmm it is possible that I can use SFINAE to disable the decrement function in the code point iterator,
Don't use SFINAE; just use iterator_facade and don't implement decrement for iterators that can only go forward. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
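(A minimal sketch of that approach: with boost::iterator_facade, choosing forward_traversal_tag and simply not providing decrement() yields a forward-only iterator. The decoding helpers here are hypothetical:)

#include <boost/iterator/iterator_facade.hpp>

class u8_codepoint_iterator
    : public boost::iterator_facade<
          u8_codepoint_iterator,         // Derived
          char32_t,                      // Value
          boost::forward_traversal_tag,  // forward-only: no decrement()
          char32_t>                      // Reference: decoded by value
{
public:
    explicit u8_codepoint_iterator(const char* p = 0) : p_(p) {}

private:
    friend class boost::iterator_core_access;
    char32_t dereference() const { return decode_at(p_); }
    void increment() { p_ = next(p_); }
    bool equal(const u8_codepoint_iterator& other) const { return p_ == other.p_; }

    // Hypothetical decoding helpers:
    static char32_t decode_at(const char* p);
    static const char* next(const char* p);

    const char* p_;
};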

On Aug 14, 2011, at 10:12 AM, "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
It would be better to not provide the inefficient case at all. Compare with how std::list doesn't provide random access, even though it could do so in O(N). Looking at your character set iterator, it seems to me that you could have a forward-only iterator and a bidirectional iterator for UTF, but only the former for these other encodings. Not storing the begin iterator when only forward iteration is needed also saves space.
+1 Please use standard iterator concepts and provide the expected behavior. Emulating bidirectional or random access iterators on top of a forward iterator is never good. People can use std::advance if they really need to. It's not just an issue for really long strings. Consider someone trying to read a moderate sized string backward. If only forward access is possible efficiently, there should be no operator--(). Cheers Gordon

On Mon, Aug 15, 2011 at 4:37 AM, Gordon Woodhull <gordon@woodhull.com> wrote:
+1
Please use standard iterator concepts and provide the expected behavior. Emulating bidirectional or random access iterators on top of a forward iterator is never good. People can use std::advance if they really need to.
It's not just an issue for really long strings. Consider someone trying to read a moderate sized string backward. If only forward access is possible efficiently, there should be no operator--().
On Mon, Aug 15, 2011, Phil Endecott wrote:
Err...
No, I disagree. I should be able to have a string as large as my virtual memory will allow, and it should continue to perform according to its documented complexity guarantees however large it gets. For example, I have an IMAP mail server that uses std::string internally. This is not code that needs to be especially fast, but it does require that O(1) operations don't become O(N) operations, or that O(N) operations become O(N^2) operations.
Ok I got what you mean. I think it'll be easier to just remove the decrement function completely. During implementation I also wondered if there is any real use for a reverse code point iterator, but since I still had time I implemented it anyway, just in case. (Actually it's also because I don't know if there is any way to conditionally let the code point iterator inherit from either std::forward_iterator or std::bidirectional_iterator)

From: Soares Chen Ruo Fei <crf@hypershell.org>
but it does require that O(1) operations don't become O(N) operations, or that O(N) operations become O(N^2) operations.
Ok I got what you mean. I think it'll be easier to just remove the decrement function completely. During implementation I also wondered if there is any real use for a reverse code point iterator, but since I still had time I implemented it anyway, just in case. (Actually it's also because I don't know if there is any way to conditionally let the code point iterator inherit from either std::forward_iterator or std::bidirectional_iterator)
Ok... Now I will make your life even harder :-) Many Unicode algorithms like segmentation or collation require random access... So you do need a random access or bidirectional iterator. Bottom line... Don't bother. It is better to use Unicode in the first place :-) Artyom Beilis -------------- CppCMS - C++ Web Framework: http://cppcms.sf.net/ CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/

On Mon, Aug 15, 2011, Artyom Beilis wrote:
From: Soares Chen Ruo Fei <crf@hypershell.org>
but it does require that O(1) operations don't become O(N) operations, or that O(N) operations become O(N^2) operations.
Ok I got what you mean. I think it'll be easier to just remove the decrement function completely. During implementation I also wondered if there is any real use for a reverse code point iterator, but since I still had time I implemented it anyway, just in case. (Actually it's also because I don't know if there is any way to conditionally let the code point iterator inherit from either std::forward_iterator or std::bidirectional_iterator)
Ok... Now I will make your life even harder :-)
Many Unicode algorithms like segmentation or collation require random access...
So you do need a random access or bidirectional iterator.
Bottom line... Don't bother.
It is better to use Unicode in the first place :-)
So if that's the case, then the only way for the code point iterators of those MBCS adapters to work with these Unicode algorithms is to enable the O(N^2) decrement function. Or is it better to make it yield a compilation error and force the developer to manually convert the string into another string adapter before passing it to the Unicode algorithms? Random access is quite easy, I think, once we have a bidirectional iterator - just move forward/backward N times. I don't think there is any other way to jump to a random location in O(1) without having to decode the string.

Soares Chen Ruo Fei wrote:
I think it'll be easier to just remove the decrement function completely.
No, don't do that. (That would be like removing random access from std::vector because std::list can't implement it efficiently.)

I'm not familiar with the algorithms requiring bidirectional access that Artyom mentions, but a standard way to make them work with iterators for various different encodings would be to specialise the algorithms. You would have a main implementation that requires the bidirectional (or random access) iterator, and a forwarding implementation that looks like this:

template <typename FORWARD_ITER>
void algorithm(FORWARD_ITER begin, FORWARD_ITER end)
{
    // Make a copy of the range into a bidirectional container:
    std::vector< typename FORWARD_ITER::value_type > v(begin, end);
    // Call the other specialisation:
    algorithm(v.begin(), v.end());
}

That is the standard time-vs-space complexity trade-off.
(Actually it's also because I don't know if there is any way to conditionally let the code point iterator inherit from either std::forward_iterator or std::bidirectional_iterator)
You don't mean "inherit from". You mean "be a model of". See Artyom's "VERY BAD DESIGN" post. There should not be any virtual methods anywhere in this library. If you don't understand how that can be done, we should discuss that urgently. Phil.

On Tue, Aug 16, 2011, Phil Endecott wrote:
Soares Chen Ruo Fei wrote:
I think it'll be easier to just remove the decrement function completely.
No, don't do that. (That would be like removing random access from std::vector because std::list can't implement it efficiently.)
I'm not familiar with the algorithms requiring bidirectional access that Artyom mentions, but a standard way to make them work with iterators for various different encodings would be to specialise the algorithms. You would have a main implementation that requires the bidirectional (or random access) iterator, and a forwarding implementation that looks like this:
template <typename FORWARD_ITER>
void algorithm(FORWARD_ITER begin, FORWARD_ITER end)
{
    // Make a copy of the range into a bidirectional container:
    std::vector< typename FORWARD_ITER::value_type > v(begin, end);
    // Call the other specialisation:
    algorithm(v.begin(), v.end());
}
That is the standard time-vs-space complexity trade-off.
Well, I don't think forcing all generic Unicode algorithms to provide a specialized version for forward-only iterators is any better than providing a less-efficient bidirectional iterator. Such a burden is too high for the algorithm developers. Or perhaps a better decision is to simply let the compiler yield a (friendly?) error when the generic algorithm uses the decrement/random access operator, and find a way to inform the user to convert the string to a standard UTF string before passing it to the Unicode algorithms.

Or perhaps I could find a way to let template instances of unicode_string_adapter with MBCS encodings convert the string to a UTF string during construction and store the UTF-encoded string instead. The only problem with this is that during conversion back to the raw string, the string adapter would have to reconvert the internally stored UTF-encoded string back to the MBCS-encoded string. This can be expensive if the user regularly wants to access the raw string, unless we store two smart pointers within the string adapter - one for the MBCS string and one for the converted UTF string - but doing so would waste storage space as well.
(Actually it's also because I don't know if there is any way to conditionally let the code point iterator inherit from either std::forward_iterator or std::bidirectional_iterator)
You don't mean "inherit from". You mean "be a model of". See Artyom's "VERY BAD DESIGN" post. There should not be any virtual methods anywhere in this library. If you don't understand how that can be done, we should discuss that urgently.
The virtual functions are used in my prototype file dynamic_unicode_string.hpp. The design hasn't been given much thought; I wrote it just to demonstrate to Artyom that dynamically encoded strings can be implemented at a higher layer by using virtual functions. There might be more efficient ways to do so, but I'll leave that for another discussion thread. Soares

Soares Chen Ruo Fei wrote:
On Tue, Aug 16, 2011, Phil Endecott wrote:
I'm not familiar with the algorithms requiring bidirectional access that Artyom mentions, but a standard way to make them work with iterators for various different encodings would be to specialise the algorithms. You would have a main implementation that requires the bidirectional (or random access) iterator, and a forwarding implementation that looks like this:

template <typename FORWARD_ITER>
void algorithm(FORWARD_ITER begin, FORWARD_ITER end)
{
    // Make a copy of the range into a bidirectional container:
    std::vector< typename FORWARD_ITER::value_type > v(begin, end);
    // Call the other specialisation:
    algorithm(v.begin(), v.end());
}
That is the standard time-vs-space complexity trade-off.
Well, I don't think forcing all generic Unicode algorithms to provide a specialized version for forward-only iterators is any better than providing a less-efficient bidirectional iterator. Such a burden is too high for the algorithm developers. Or perhaps a better decision is to simply let the compiler yield a (friendly?) error when the generic algorithm uses the decrement/random access operator, and find a way to inform the user to convert the string to a standard UTF string before passing it to the Unicode algorithms.
The "less-efficient" O(N^2) bidirectional iterator is completely unreasonable. Algorithms are not being "forced" to do anything. Have a look at how the standard library does things. std::lower_bound() and std::rotate(), for example, have specialisations that select different algorithms depending on the type of iterator that is supplied; on the other hand, std::random_shuffle() only takes random access iterators and it would be the user's responsibility to choose what to do if they had some other kind of range.
Or perhaps I could find a way to let template instances of unicode_string_adapter with MBCS encodings convert the string to a UTF string during construction and store the UTF-encoded string instead. The only problem with this is that during conversion back to the raw string, the string adapter would have to reconvert the internally stored UTF-encoded string back to the MBCS-encoded string. This can be expensive if the user regularly wants to access the raw string, unless we store two smart pointers within the string adapter - one for the MBCS string and one for the converted UTF string - but doing so would waste storage space as well.
No, don't do that. Just provide the iterators that can be provided efficiently. Phil.
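For reference, the iterator-category selection Phil points to (as used by std::advance and std::lower_bound) works by tag dispatching; a rough sketch, with algorithm/algorithm_impl as made-up names:

#include <iterator>
#include <vector>

template <typename Iter>
void algorithm_impl(Iter first, Iter last, std::bidirectional_iterator_tag)
{
    // ... main implementation, free to walk backwards ...
}

template <typename Iter>
void algorithm_impl(Iter first, Iter last, std::forward_iterator_tag)
{
    // Forward-only range: copy into a bidirectional (indeed random
    // access) container first, then delegate to the main implementation.
    std::vector<typename std::iterator_traits<Iter>::value_type> v(first, last);
    algorithm_impl(v.begin(), v.end(), std::bidirectional_iterator_tag());
}

template <typename Iter>
void algorithm(Iter first, Iter last)
{
    // Random access tags convert to bidirectional, so random access
    // iterators pick the efficient overload automatically.
    algorithm_impl(first, last,
        typename std::iterator_traits<Iter>::iterator_category());
}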

On Wed, Aug 17, 2011, Phil Endecott wrote:
Well, I don't think forcing all generic Unicode algorithms to provide a specialized version for forward-only iterators is any better than providing a less-efficient bidirectional iterator. Such a burden is too high for the algorithm developers. Or perhaps a better decision is to simply let the compiler yield a (friendly?) error when the generic algorithm uses the decrement/random access operator, and find a way to inform the user to convert the string to a standard UTF string before passing it to the Unicode algorithms.
The "less-efficient" O(N^2) bidirectional iterator is completely unreasonable. Algorithms are not being "forced" to do anything.
Have a look at how the standard library does things. std::lower_bound() and std::rotate(), for example, have specialisations that select different algorithms depending on the type of iterator that is supplied; on the other hand, std::random_shuffle() only takes random access iterators and it would be the user's responsibility to choose what to do if they had some other kind of range.
Or perhaps I could find a way to let template instances of unicode_string_adapter with MBCS encodings convert the string to a UTF string during construction and store the UTF-encoded string instead. The only problem with this is that during conversion back to the raw string, the string adapter would have to reconvert the internally stored UTF-encoded string back to the MBCS-encoded string. This can be expensive if the user regularly wants to access the raw string, unless we store two smart pointers within the string adapter - one for the MBCS string and one for the converted UTF string - but doing so would waste storage space as well.
No, don't do that. Just provide the iterators that can be provided efficiently.
Alright, thanks for your advice - I'll try to implement it within the next few days. Now I've learned something new. :) Soares

on Sat Aug 13 2011, Soares Chen Ruo Fei <crf-AT-hypershell.org> wrote:
Ahh I see, so that's quite nasty, but it can still be done by sacrificing efficiency. Basically, since the iterator already has the begin and end boundary iterators, it can simply re-iterate from the beginning of the string. Although doing so is roughly O(N^2), it shouldn't have a significant impact, as developers rarely use these multi-byte encodings and even more rarely use the reverse decoding function.
What you're describing is *not* a bidirectional iterator. Efficiency guarantees (e.g. --p is O(1)) are part of the concept requirements. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Soares Chen Ruo Fei wrote:
A while ago I gave some previews of my Unicode String Adapter library to the boost community but I didn't receive much feedback. Now that GSoC is ending I'd like you all to take a look at my project again and provide feedback on the usefulness of the library. Following are the links to my project repository and documentation:
GitHub repository: https://github.com/crf00/boost.ustr Documentation: http://crf.scriptmatrix.net/ustr/index.html
I think there are probably as many ways to implement a "better" string as there are potential users, and previous long discussions here have considered those possibilities at great length. In summary your proposal is for a string that is:

- Immutable.
- Reference counted.
- Iterated by default over unicode code points.
- Provides access to the code units via operator* and operator->, i.e.

s.begin()  // Returns a code point iterator.
s->begin() // Returns a code unit iterator.

I won't comment about the merits or otherwise of those points, apart from the last, where I'll note that it is not to my taste. It looks like it's "over clever". Imagine that I wrote some code using your library, and then a colleague who was not familiar with it had to look at it later. Would they have any idea about the difference between those two cases? No, not unless I added a comment every time I used it. Please let's have an obvious syntax like:

s.begin()       // Code points.
s.impl.begin()  // Code units.

or

s.units_begin() // Code units.

Personally, I don't want a new clever string class. What I want is a few well-written building-blocks for Unicode. For example, I'd like to be able to iterate over the code points in a block of UTF-8 data in raw memory, so some sort of iterator adaptor is needed. Your library does have this functionality, but it is hidden in an implementation detail. Please can you consider bringing out your core UTF encoding and decoding functions to the public interface?

I would also like to see some benchmarks for the core UTF conversion functions. If you post some benchmarks that decouple the UTF conversion from the rest of the string class, I will compare the performance with my own code. Regards, Phil.

Hi Phil, On Aug 9, 2011, Phil Endecott wrote:
I think there are probably as many ways to implement a "better" string as there are potential users, and previous long discussions here have considered those possibilities at great length. In summary your proposal is for a string that is:
- Immutable.
- Reference counted.
- Iterated by default over unicode code points.
I think you misunderstood my point. Boost.Ustr does not attempt to redesign another string class to begin with. Instead it wraps an existing string class that is provided through the template parameter and relies on that string class for the actual container operations. The immutability of the string adapter is actually achieved by holding a smart pointer to the const version of the raw string.
- Provides access to the code units via operator* and operator->, i.e.

s.begin()  // Returns a code point iterator.
s->begin() // Returns a code unit iterator.
I won't comment about the merits or otherwise of those points, apart from the last, where I'll note that it is not to my taste. It looks like it's "over clever". Imagine that I wrote some code using your library, and then a colleague who was not familiar with it had to look at it later. Would they have any idea about the difference between those two cases? No, not unless I added a comment every time I used it. Please let's have an obvious syntax like:
s.begin()       // Code points.
s.impl.begin()  // Code units.

or

s.units_begin() // Code units.
The actual intention of operator ->() is not to provide access to the code unit iterator; instead, it is for programmers to access raw string functionality that unicode_string_adapter is not able to provide. After all, unicode_string_adapter is a wrapper/decorator for the raw string class, so it is supposed to add, not subtract, functionality from the original string class. The availability of str->begin() is actually a side effect of enabling access to methods of the raw string class. I'm sorry that the example provided in the documentation probably confused the intention of operator ->(). In the documentation I did specify that calling str->begin() is strongly discouraged, but it looks like it was a bad counterexample. I'll change the usage example to str->c_str() to illustrate the usefulness of accessing raw string methods in some cases.

Back to your question about a method for accessing the code unit iterator: I did initially consider making two distinct methods, str.codepoint_begin() and str.codeunit_begin(), for the different levels of access. But I finally concluded that reading and comparing code units would make the code far less portable, so there should not be official support for accessing code units. For example, if a developer writes a function that checks just the first UTF-8 code unit to determine whether the first code point belongs to the Basic Multilingual Plane, the same function would not work well if he later decides to allow checking UTF-16 strings as well.

The other problem with supporting read access to code units is that Boost.Ustr would then not be able to handle errors of malformed encoding. Currently the default implementation of unicode_string_adapter returns the replacement character � on the fly when a malformed code unit is found during decoding, and it does not alter the original malformed raw string. So it is troublesome to handle these malformed strings in code unit iterators, unless Boost.Ustr leaves the error handling to the caller, or Boost.Ustr explicitly makes a properly encoded copy of the raw string during construction, which would also cause a performance slowdown.

Though, in the end I also added append_codeunit() and codeunit_begin() methods to the unicode_string_adapter_builder class, because I thought of the use case of reading encoded text from I/O and storing it directly into strings without decoding and re-encoding the text again. In this case it has to be carefully assumed that the encoding of the incoming text has already been determined, and even so I'm still worried that exposing this code unit output iterator could eventually introduce numerous bugs.
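To illustrate the replacement-character behaviour described above, here is a generic sketch of one UTF-8 decoding step - not Boost.Ustr's actual decoder - that yields U+FFFD on malformed input without touching the raw string (overlong-form and surrogate checks are omitted for brevity):

template <typename Iter>
char32_t decode_one(Iter& it, Iter end)
{
    static const char32_t replacement = 0xFFFD;
    unsigned char b0 = static_cast<unsigned char>(*it++);
    if (b0 < 0x80) return b0;                       // single-byte (ASCII)

    int extra;
    char32_t cp;
    if      ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return replacement;                        // invalid lead byte

    for (int i = 0; i < extra; ++i) {
        if (it == end) return replacement;          // truncated sequence
        unsigned char b = static_cast<unsigned char>(*it);
        if ((b & 0xC0) != 0x80) return replacement; // bad continuation byte
        cp = (cp << 6) | (b & 0x3F);
        ++it;
    }
    return cp;
}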
Personally, I don't want a new clever string class. What I want is a few well-written building-blocks for Unicode. For example, I'd like to be able to iterate over the code points in a block of UTF-8 data in raw memory, so some sort of iterator adaptor is needed.
Boost.Ustr's design objective is not to provide a complete toolset for processing arbitrary Unicode data; if you are looking for features such as decoding Unicode from raw memory, I think Mathias' Boost.Unicode library already provides excellent support for this. Instead, Boost.Ustr's main objective is to allow developers who don't care about encoding issues to add encoding awareness to existing string classes. For example, the use case scenarios are:

- Here is a Unicode string with this given content. I don't care how it is encoded, but I want to pass this string to any Unicode-enabled function.
- I'd like to write a function that accepts a Unicode string. I don't care whether it is UTF-8 or UTF-16 encoded, but I want to know if the decoded code point sequence of this string matches a certain pattern.
Your library does have this functionality, but it is hidden in an implementation detail. Please can you consider bringing out your core UTF encoding and decoding functions to the public interface?
My encoder/decoder functions are actually quite similar to Mathias' implementation (in fact I referred to his design before implementing my own). However, these function interfaces are specifically designed to fit the internal usage of Boost.Ustr, albeit I made them generic enough. The reason I did not directly use/copy Mathias' implementation is because the interfaces are slightly different and I wanted to avoid obscure bugs, because the algorithm is simple enough to re-implement, and also because I wanted to take this chance to learn the encoding algorithms (and I did learn something). :) But I'd agree that it shouldn't be hard to refactor the encoders and merge them with Mathias' implementation when the time comes. Currently I have no plans to build iterator adapters on top of these encoding/decoding functions, and I think it would also be a bit redundant, as Mathias has already gone through the mess of generating these functions using macros and template metaprogramming. ;)
I would also like to see some benchmarks for the core UTF conversion functions. If you post some benchmarks that decouple the UTF conversion from the rest of the string class, I will compare the performance with my own code.
At this time I am focusing on design issues rather than optimizations, so I haven't thought much about benchmarks. I'd guess that the encoding/decoding speed is probably inferior to other encoder/decoder functions. You can see in my implementation that I did not use obscure hacks that shorten the code while remaining mathematically equivalent. Instead I focused on readability first, so that even amateurs can read the code and easily learn how the encoding/decoding process works. So if you are writing a performance-critical application that encodes/decodes huge amounts of Unicode text, I'd say that Boost.Ustr is probably not for you (yet). Thanks for the feedback. Hope this answers your questions. cheers, Soares

Soares Chen Ruo Fei wrote:
Hi Phil,
On Aug 9, 2011, Phil Endecott wrote:
I think there are probably as many ways to implement a "better" string as there are potential users, and previous long discussions here have considered those possibilities at great length. In summary your proposal is for a string that is:

- Immutable.
- Reference counted.
- Iterated by default over unicode code points.
I think you misunderstood my point.
No, I believe I understand what you are doing.
Boost.Ustr does not attempt to redesign another string class to begin with. Instead it wraps an existing string class that is provided through the template parameter and relies on that string class for the actual container operations.
No, because:
The immutability of the string adapter is actually achieved by holding a smart pointer to the const version of the raw string.
If you were just wrapping an existing string class, you wouldn't do that; you'd just wrap the existing string class. By adding this extra bit, you're making a string that is immutable, copy-on-write and reference counted - whether the underlying string is or not.
- Provides access to the code units via operator* and operator->, i.e.

s.begin()  // Returns a code point iterator.
s->begin() // Returns a code unit iterator.

I won't comment about the merits or otherwise of those points, apart from the last, where I'll note that it is not to my taste. It looks like it's "over clever". Imagine that I wrote some code using your library, and then a colleague who was not familiar with it had to look at it later. Would they have any idea about the difference between those two cases? No, not unless I added a comment every time I used it. Please let's have an obvious syntax like:

s.begin()       // Code points.
s.impl.begin()  // Code units.

or

s.units_begin() // Code units.
The actual intention of operator ->() is not to provide access to the code unit iterator; instead, it is for programmers to access raw string functionality that unicode_string_adapter is not able to provide.
Whatever. The point is that you have this operator* and operator-> overload whose purpose is non-obvious to someone looking at code that uses it. What is your rationale for doing that, rather than providing e.g. an impl() or base() or similar accessor? Can you give examples of any precedents for this usage? What names or syntax do other wrapper/adaptor/facade implementations use?
Your library does have [raw UTF encoding and decoding functions], but it is hidden in an implementation detail. Please can you consider bringing out your core UTF encoding and decoding functions to the public interface?
My encoder/decoder functions are actually quite similar to Mathias' implementation (in fact I referred to his design before implementing my own). However, these function interfaces are specifically designed to fit the internal usage of Boost.Ustr, albeit I made them generic enough. The reason I did not directly use/copy Mathias' implementation is because the interfaces are slightly different and I wanted to avoid obscure bugs, because the algorithm is simple enough to re-implement, and also because I wanted to take this chance to learn the encoding algorithms (and I did learn something). :) But I'd agree that it shouldn't be hard to refactor the encoders and merge them with Mathias' implementation when the time comes.
Currently I have no plans to build iterator adapters on top of these encoding/decoding functions, and I think it would also be a bit redundant, as Mathias has already gone through the mess of generating these functions using macros and template metaprogramming. ;)
Well I don't really care who does it, but I think we should have these UTF encoding and decoding functions somewhere in Boost that is not an implementation detail of some other library.
I would also like to see some benchmarks for the core UTF conversion functions. If you post some benchmarks that decouple the UTF conversion from the rest of the string class, I will compare the performance with my own code.
At this time I am focusing on design issues rather than optimizations, so I haven't thought much about benchmarks. I'd guess that the encoding/decoding speed is probably inferior to other encoder/decoder functions. You can see in my implementation that I did not use obscure hacks that shorten the code while remaining mathematically equivalent. Instead I focused on readability first, so that even amateurs can read the code and easily learn how the encoding/decoding process works. So if you are writing a performance-critical application that encodes/decodes huge amounts of Unicode text, I'd say that Boost.Ustr is probably not for you (yet).
OK, it's not for me, that's a shame. Maybe if you're lucky someone who DOES want this functionality will now post a reply to your request for comments... Regards, Phil.

Phil Endecott wrote:
No, because:
The immutability of the string adapter is actually achieved by holding a smart pointer to the const version of the raw string.
If you were just wrapping an existing string class, you wouldn't do that; you'd just wrap the existing string class. By adding this extra bit, you're making a string that is immutable, copy-on-write and reference counted - whether the underlying string is or not.
I wouldn't argue with you about the definition of a new string class, but as long as you understand the design goal then it's fine. For me, I'd say that the unicode_string_adapter class is more like a glorified smart pointer, as you can assume the class to have the following signature with identical functionality:

template <typename StringT>
class unicode_string_adapter : public std::shared_ptr<const StringT>;

and I'm just using composition over inheritance because that brings better organization to the code. Also, the class does not do exactly the same copy-on-write that std::string used to. It *always* copies when the edit() method is called, regardless of whether the reference count is one or many. So there is no nasty overhead of making sure there is only one reference count during mutation.
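A simplified sketch of that structure (illustrative names and details, not the real Boost.Ustr source): a shared pointer to an immutable raw string, with pointer-like access and an edit() that always copies.

#include <memory>
#include <utility>

template <typename StringT>
class ustr_sketch
{
public:
    explicit ustr_sketch(StringT s)
        : str_(std::make_shared<const StringT>(std::move(s))) {}

    // Smart-pointer-style access to the wrapped raw string.
    const StringT& operator*() const { return *str_; }
    const StringT* operator->() const { return str_.get(); }

    // Always copy on edit, regardless of the reference count, so no
    // use_count() check is ever needed before mutation.
    StringT edit() const { return *str_; }

private:
    std::shared_ptr<const StringT> str_;
};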
Whatever. The point is that you have this operator* and operator-> overload whose purpose is non-obvious to someone looking at code that uses it. What is your rationale for doing that, rather than providing e.g. an impl() or base() or similar accessor? Can you give examples of any precedents for this usage? What names or syntax do other wrapper/adaptor/facade implementations use?
I'd say the purpose of operator *() is pretty obvious: to retain backward compatibility with the original raw string class. One of the biggest obstacles to creating a new string class is that it breaks compatibility with legacy library APIs that accept std::string as a function parameter. My goal is to make it as easy as possible for users of Boost.Ustr to get back their original raw string at any time, so that migration is less painful. Ultimately a developer should be able to use `unicode_string_adapter<std::string>` with only his existing knowledge of std::string. The developer does not need to learn Boost.Ustr at all if he does not care about the encoding and content of the string; all he has to do to migrate to Boost.Ustr is replace str.string_method() with str->string_method(), and existing_function(str) with existing_function(*str). As a result, the syntax makes it extremely easy to migrate with minimal changes. There is already a member function that does the actual implementation, str.to_string(), so removing operator *() would just be a matter of deleting three lines of code. But if you look at unicode_string_adapter itself as a smart pointer to the raw string, then operator *() makes more sense.
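The migration described above might look roughly like this, with legacy_print and example as hypothetical names (any adapter with pointer-like access works, e.g. unicode_string_adapter<std::string> or the sketch shown earlier):

#include <cstddef>
#include <string>

// Hypothetical legacy API that accepts a plain std::string:
void legacy_print(const std::string& s);

// Only the pointer-like syntax changes at each call site.
template <typename AdaptedString>
void example(const AdaptedString& str)
{
    std::size_t n = str->size();  // was: str.size()
    legacy_print(*str);           // was: legacy_print(str)
    (void)n;                      // silence unused-variable warning
}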
Well I don't really care who does it, but I think we should have these UTF encoding and decoding functions somewhere in Boost that is not an implementation detail of some other library.
I'd agree with you that Boost needs a complete Unicode toolset. But since that is out of my project's scope, I'll leave it to others to answer this question.
OK, it's not for me, that's a shame. Maybe if you're lucky someone who DOES want this functionality will now post a reply to your request for comments...
Yup... Basically, any raw string processing algorithm that cannot work well enough with the standard std::string implementation should not be expected to work with Boost.Ustr either. Actually, I don't think there is any general-purpose string class that can do the job you want.

Soares Chen Ruo Fei wrote:
you can assume the class to have the following signature with identical functionality:
template <typename StringT>
class unicode_string_adapter : public std::shared_ptr<const StringT>;
What is your rationale for that? What precedent is there for a wrapper/adapter/facade that behaves like a pointer to the wrapped object? I'm not aware of any precedents; this is a new pattern to me. Why have you chosen to do this, rather than providing an accessor member like impl()? Regards, Phil.
participants (14)

- Artyom Beilis
- Christian Holmquist
- Cory Nelson
- Daniel James
- Dave Abrahams
- Gordon Woodhull
- Klaim - Joël Lamotte
- Lars Viklund
- Marius Stoica
- Matus Chochlik
- Phil Endecott
- Robert Ramey
- Soares Chen Ruo Fei
- Yakov Galka