
On Thu, Aug 11, 2011, Artyom Beilis wrote:
My strong opinion is:
a. Strings should be just a container object with a default encoding and some useful API to handle it.
b. The default encoding MUST be UTF-8.
c. There are several ways to implement strings: COW, mutable, immutable, with small string optimization and so on. One way or another, std::string is the de-facto string and I think we should live with it and use some alternative containers where it matters.
d. Code point and code unit are meaningless unless you develop some Unicode algorithm - and you don't - you use one written by experts.
This Ustr does not solve this problem, as it does not really provide some kind of
adapter<generic encoding> { string content }
That kind of thing may be useful, but not in this case. Basically your library provides a wrapper around a string and outputs Unicode code points, but it does so for UTF encodings only!
It does not add much benefit. You provide encoding traits, but they are basically meaningless for the purpose you had given, as:
It does not provide traits for non-Unicode encodings like, let's say, Shift-JIS or ISO-8859-8.
The library is designed to be flexible without trying to include every possible encoding by default. The point is that external developers can leverage the EncodingTraits template parameter to implement the desired encoding *themselves*. The core library should be as small as possible and not bloated with translation tables between encodings that are not commonly used by the rest of the world. You may request that I add a sub-library for Shift-JIS or other encodings, and I'll consider implementing it if there is popular demand.
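Purely to illustrate the general shape of such a traits-based customization point, here is a hypothetical sketch. None of these names or signatures come from Boost.Ustr; the actual requirements are in the custom-encoding documentation linked below.

    #include <cstdint>

    // Hypothetical encoding traits sketch (names are illustrative only,
    // not the Boost.Ustr requirements).
    struct my_encoding_traits {
        typedef char codeunit_type;

        // Decode one code point starting at 'it', advancing 'it' past the
        // code units consumed; 'end' guards against reading past the buffer.
        template <typename Iterator>
        static char32_t decode(Iterator& it, Iterator end) {
            (void)end;                                  // single-byte toy encoding
            return static_cast<unsigned char>(*it++);
        }

        // Encode 'cp' into zero or more code units written to 'out'; return
        // false if 'cp' is not representable, so the caller can substitute a
        // replacement character or throw, per its policy.
        template <typename OutputIterator>
        static bool encode(char32_t cp, OutputIterator out) {
            if (cp > 0x7F) return false;
            *out++ = static_cast<char>(cp);
            return true;
        }
    };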
BTW you can't create traits for many encodings; for example, you can't implement the traits requirements:
http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_enc...
For popular encodings like Shift-JIS or GBK...
Homework: tell me why ;-)
I was trying to write a few lines of prototype code to show you that it would work, but I ran out of time and have missed so many replies, so I'll show you next time. But why not? There is already a standard translation table offered by the Unicode Consortium at ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT, so all that is needed is to make the encoder/decoder work with that translation table.

Perhaps you are referring to the non-roundtrip conversion of some Shift-JIS characters, as mentioned by Microsoft at http://support.microsoft.com/kb/170559. But the objective is to make a best-effort emulation, not a completely perfect one. Such a problem cannot be solved by any other implementation means anyway, so if you convert Shift-JIS strings to Unicode manually before passing them to Unicode-oriented functions, you are screwed in exactly the same way.

Or perhaps you mean that you can't properly encode non-Japanese characters into the Shift-JIS encoding. In that case the character will simply be substituted by a replacement character, or an exception will be thrown, according to the provided policy. But the user of such a Unicode-emulated string should only read it and not modify it anyway, i.e. it is probably a bad idea to create a unicode_string_adapter_builder instance of the string and pass it to a mutation function that assumes full Unicode encoding functionality. Then again, you are also screwed the same way if you manually convert a Unicode-encoded string to a Shift-JIS-encoded string and pass it to a Shift-JIS-oriented function.

The conclusion is that Boost.Ustr with custom encoding traits is intended for the convenience of automatically converting between two encodings at library boundaries. It will not solve any encoding conversion problem that you can't already solve with manual conversion.
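For illustration only, here is a minimal sketch of what a table-driven, best-effort Shift-JIS decoder could look like, assuming the mapping from the SHIFTJIS.TXT file mentioned above has been loaded into an in-memory table. The function name and the tiny hard-coded table are hypothetical and not part of Boost.Ustr.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Hypothetical mapping table: Shift-JIS double-byte sequence -> code point.
    // In practice this would be populated from SHIFTJIS.TXT.
    static const std::unordered_map<uint16_t, char32_t> sjis_to_unicode = {
        { 0x8140, U'\u3000' },  // IDEOGRAPHIC SPACE
        { 0x82A0, U'\u3042' },  // HIRAGANA LETTER A
    };

    // Best-effort decode: unmapped sequences become U+FFFD (replacement character).
    std::u32string decode_sjis(const std::string& in) {
        std::u32string out;
        for (std::size_t i = 0; i < in.size(); ) {
            unsigned char lead = static_cast<unsigned char>(in[i]);
            if (lead < 0x80) {                           // single-byte ASCII range
                out.push_back(lead);
                ++i;
            } else if (lead >= 0xA1 && lead <= 0xDF) {   // half-width katakana
                out.push_back(0xFF61 + (lead - 0xA1));
                ++i;
            } else if (i + 1 < in.size()) {              // double-byte sequence
                uint16_t key = static_cast<uint16_t>(
                    (lead << 8) | static_cast<unsigned char>(in[i + 1]));
                auto it = sjis_to_unicode.find(key);
                out.push_back(it != sjis_to_unicode.end() ? it->second : U'\uFFFD');
                i += 2;
            } else {
                out.push_back(U'\uFFFD');                // truncated trailing byte
                ++i;
            }
        }
        return out;
    }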
Also, it is likely that encoding is something that can be changed at runtime, not compile time, and it seems that this adapter does not support such an option.
It is always possible to add a dynamic layer on top of the static encoding layer, but not the other way round. It shouldn't be too hard to write a class with virtual interfaces that calls the proper template instance of unicode_string_adapter. But that is currently outside the scope, and the fundamental design will still be static regardless.
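Purely as an illustration of that layering (the class names here are hypothetical and not part of Boost.Ustr), a dynamic dispatch layer over statically-encoded strings could look roughly like this:

    #include <memory>
    #include <string>
    #include <utility>

    // Hypothetical dynamic interface layered on top of the static encoding layer.
    struct any_unicode_string {
        virtual ~any_unicode_string() {}
        virtual std::string to_utf8() const = 0;
    };

    // Hypothetical wrapper: one derived class per statically-encoded string type.
    // In the real library the body would forward to the appropriate
    // unicode_string_adapter<...> instance instead of returning the raw bytes.
    class utf8_string : public any_unicode_string {
    public:
        explicit utf8_string(std::string s) : str_(std::move(s)) {}
        std::string to_utf8() const override { return str_; }
    private:
        std::string str_;
    };

    // Callers that only know the encoding at runtime hold a
    // std::unique_ptr<any_unicode_string> and never see the template machinery.

The point is only that the runtime choice happens at construction time, while each concrete wrapper still goes through the compile-time encoding machinery underneath.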
If someone uses strings with different encodings he usually knows their encoding...
The problem is that the API is inconsistent: on Windows a narrow string is in some ANSI code page, and anywhere else it is UTF-8.
This is an entirely different problem, and such adapters don't really solve it but actually make it worse...
If I'm not wrong, however, the wide version of strings on Windows is always UTF-16 encoded, am I correct? So a manual solution for constructing UTF-8 strings on Windows would be similar to:

    std::wstring wide_str = L"世界你好";
    std::string u8_str;
    generic_conversion::u16_to_u8(wide_str.begin(), wide_str.end(), std::back_inserter(u8_str));

except that it will not be portable across Unix systems. But with Boost.Ustr you can achieve the same thing with

    unicode_string_adapter<std::string> u8_str = USTR("世界你好");

which gets expanded into

    unicode_string_adapter<std::string> u8_str =
        unicode_string_adapter<std::wstring>( std::wstring(L"世界你好") );

So while it is hard to construct UTF-8 string literals on Windows, it should still be possible by writing code that manually inserts UTF-8 code units into std::string. After all, std::string does not place any restriction on which bytes we can manually insert into it. Perhaps you are talking about printing the string out to std::cout, but that is another hard problem that we have to tackle separately.
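For reference, the manual conversion above boils down to decoding UTF-16 code units (including surrogate pairs) into code points and re-encoding them as UTF-8. A minimal, self-contained sketch of that step (not the generic_conversion implementation itself, and assuming well-formed UTF-16 input with no error handling):

    #include <cstdint>
    #include <iterator>
    #include <string>

    // Convert UTF-16 code units in [first, last) to UTF-8 bytes written to 'out'.
    template <typename InIt, typename OutIt>
    void u16_to_u8_sketch(InIt first, InIt last, OutIt out) {
        while (first != last) {
            char32_t cp = static_cast<char16_t>(*first++);
            if (cp >= 0xD800 && cp <= 0xDBFF && first != last) {
                // Combine a surrogate pair into one code point.
                char32_t low = static_cast<char16_t>(*first++);
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            }
            if (cp < 0x80) {
                *out++ = static_cast<char>(cp);
            } else if (cp < 0x800) {
                *out++ = static_cast<char>(0xC0 | (cp >> 6));
                *out++ = static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                *out++ = static_cast<char>(0xE0 | (cp >> 12));
                *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                *out++ = static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                *out++ = static_cast<char>(0xF0 | (cp >> 18));
                *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                *out++ = static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
    }

With the wide literal above it would be called as u16_to_u8_sketch(wide_str.begin(), wide_str.end(), std::back_inserter(u8_str)), assuming a 16-bit wchar_t as on Windows.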
Other problem is
================
I don't believe that a string adapter would solve any real problems because:
a) If you iterate over code points you are very likely doing something wrong, as code point != character, and this is a very common mistake.
I am well aware of it, but I decided to separate the concerns into different layers and tackle the abstract character problem at a higher layer. The unicode_string_adapter class has a well defined role, which is to offer *code point* level access. I did plan to write another class that does the character iteration, but I haven't had enough time to do it yet. But I believe you can pass the code point iterators to Mathias' Boost.Unicode methods to create abstract character iterators that do the job you want.
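As a concrete example of the code point vs. character distinction (standard C++ only, nothing from Boost.Ustr):

    #include <iostream>
    #include <string>

    int main() {
        // One user-perceived character, "e with acute accent", written in
        // decomposed form as U+0065 LATIN SMALL LETTER E followed by
        // U+0301 COMBINING ACUTE ACCENT:
        std::u32string decomposed = U"e\u0301";

        // Two code points, yet a grapheme-aware (abstract character)
        // iterator should report a single character.
        std::cout << "code points: " << decomposed.size() << std::endl; // prints 2
    }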
b) If you want to iterate over code points it is better to have some kind of utf_iterator that receives a range and iterates over it; it would be more generic and would not require an additional class.
For example, Boost.Locale has utf_traits that allow you to implement iteration over code points quite easily.
See: http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1... http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/structboost_1_1loc...
And you don't need any kind of specific adapters.
I added a static method unicode_string_adapter::make_codepoint_iterator to do what you have requested. It accepts three code unit iterator parameters, current, begin, and end, so that it can iterate in both directions without going out of bounds. I hope that is what you are looking for.
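To illustrate why the current/begin/end triple is needed for bidirectional iteration, here is a generic UTF-8 sketch (not the Boost.Ustr implementation) of stepping backwards to the previous code point:

    #include <string>

    // Step back from 'it' to the start of the previous UTF-8 code point.
    // 'begin' is needed so we never walk off the front of the buffer.
    std::string::const_iterator prev_codepoint(std::string::const_iterator it,
                                               std::string::const_iterator begin) {
        while (it != begin) {
            --it;
            // Continuation bytes have the form 10xxxxxx; stop at a lead byte.
            if ((static_cast<unsigned char>(*it) & 0xC0) != 0x80)
                break;
        }
        return it;
    }

Without begin, the loop would have no way to know where to stop once the very first code point is reached; end plays the symmetric role when stepping forward.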
c) The problem in Boost is not a missing Unicode string, and we don't even need yet another Unicode string in order to have good Unicode support.
The problem is policy: Boost just can't decide once and for all that std::string is UTF-8...
But don't get me wrong. This is My Opinion; many would disagree with me.
Bottom line,
Unicode strings, cool string adapters, UTF iterators, and even Boost.Unicode and Boost.Locale would not solve the problem that Boost libraries use inconsistent encodings on different platforms.
IMHO: the only way to solve it is POLICY.
It is OK if you disagree with my approach, but keep in mind that the focus of this thread is just to decide whether my current proposed solution is good, not to propose a better solution. Thanks.

cheers,
Soares