Idea to support unicode strings

Hi,

I wrote a wrapper around John Maddock's unicode iterators (thanks to Tomas for the hint), which provides a std::string-like interface to access utf8, utf16 or utf32 encoded strings.

<example>
utf8_string u8("unicode string"); // construct from a utf8 encoded char[]
u8 += 0x0020; // add some chars
u8 += 0x0391; // alpha
u8 += 0x0392; // beta
u8 += 0x0393; // gamma
std::cout << u8.raw() << std::endl; // access encoded string
std::copy(u8.begin(), u8.end(), std::ostream_iterator<utf32_char>(std::cout, ", "));
std::cout << std::endl;
utf32_string u32=u8; // assign and convert to utf32
std::copy(u32.begin(), u32.end(), std::ostream_iterator<utf32_char>(std::cout, ", "));
std::cout << std::endl;
</example>

The wrapper can be extended to support additional encodings like latin-1 or windows-1252 by providing encode and decode iterators.

The source for the wrapper:
http://opensource.nicai-systems.com/unicode/unicode.h

And some test code:
http://opensource.nicai-systems.com/unicode/test_unicode.cpp

If there is any interest, I can extend the code to support more std::basic_string methods and add additional encodings...

Regards,
Nils
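For readers who want to see what appending a code point such as 0x0391 to a utf8 encoded string involves, here is a minimal sketch. It is not taken from unicode.h; the function name append_utf8 is made up for illustration, and it assumes a C++11 compiler for char32_t.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of unicode.h): append one Unicode code point
// to a std::string as UTF-8 code units.
inline void append_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);               // outside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);   // surrogates are not valid scalar values

    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

With this sketch, appending 0x0391 (alpha) adds the two bytes 0xCE 0x91 to the raw string, which is why the number of code units and the number of characters differ.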

I have seen your example and source, and I already appreciate your contribution. I especially like the constructor which converts between the various types of unicode strings. It is also nice that you try to mimic the basic_string interface.

However, it isn't basic_string, which means it is isolated from the rest of the standard library. In a perfect world I would expect to read/write utf_strings from std::streams in the same way as for std::string, i.e. all the operations like operator>>, getline and so on should be usable on utf_strings. But these rely on specializing char_traits, codecvt and other locale facets, so I am afraid that to achieve this it is necessary to write it the way the standard does. So in this area I basically identify with Matt Austern's proposal for C++0x ( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2035.pdf ). Of course it is also necessary to provide utf32 access for each utf_string specialization (effectively implemented through iterators), which is missing there and which your implementation already has.

Best Regards
Tomas

The central difference between ansi strings and utf8 strings is that character access by index is simple for ansi strings but difficult for utf8 encoded strings. std::basic_string can hold utf8, utf16 and utf32 encoded strings, but it gives no access to the decoded string, i.e. to the unicode values of the characters.
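To illustrate why indexed access is expensive, here is a small sketch that finds the byte offset of the n-th character in a utf8 encoded std::string. The function name utf8_offset_of is hypothetical and the code assumes the input is valid UTF-8; it is not how the wrapper is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the n-th code point (0-based) in a
// valid UTF-8 string, or std::string::npos if the string is too short.
// There is no shortcut: every byte before the result has to be inspected.
inline std::size_t utf8_offset_of(const std::string& s, std::size_t n)
{
    // Valid UTF-8 never starts with a continuation byte (10xxxxxx).
    assert(s.empty() || (static_cast<unsigned char>(s[0]) & 0xC0) != 0x80);

    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) == 0x80)
            continue;          // continuation byte, belongs to the previous character
        if (n == 0)
            return i;          // this lead byte starts the requested character
        --n;
    }
    return std::string::npos;
}
```

For an ansi string the same lookup is simply s[n], which is constant time. Here the cost grows linearly with n.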
> However, it isn't basic_string, which means it is isolated from the rest of the standard library. In a perfect world I would expect to read/write utf_strings from std::streams in the same way as for std::string, i.e. all the operations like operator>>, getline and so on should be usable on utf_strings.

It is always possible to access the basic_string<> data by calling raw()! The standard requires character access, which can't be implemented efficiently for utf8 and utf16 encoded strings.

> So in this area I basically identify with Matt Austern's proposal for C++0x ( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2035.pdf ).

I see my approach as an addition to Matt Austern's proposal. While Matt handles encoded strings, my approach deals with decoded strings. The encoded string types are std::string, std::ustring, and std::u32string. These strings allow access to the raw values of the encoded words. My wrapper allows access to the strings at a symbolic level. It allows conversion between the different encodings and also access to the unicode values of the characters as char32_t values.
8bit word array  -> std::string
16bit word array -> std::ustring
32bit word array -> std::u32string

utf8 encoded strings  -> utf8_string (based on std::string)
utf16 encoded strings -> utf16_string (based on std::ustring)
utf32 encoded strings -> utf32_string (based on std::u32string)

but the approach also allows:

latin-1 encoded strings      -> latin1_string (based on std::string)
windows-1252 encoded strings -> windows_1252_string (based on std::string)

Regards,
Nils
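As an illustration of how a single-byte encoding such as latin-1 fits this scheme, here is a minimal sketch of an encode/decode pair. The names latin1_sketch, decode_all and encode_all are hypothetical and do not come from the wrapper; the sketch only relies on the fact that latin-1 bytes map one-to-one to the code points U+0000 to U+00FF, and it assumes C++11 for char32_t and std::u32string.

```cpp
#include <cassert>
#include <string>

// Hypothetical traits for illustration only.
struct latin1_sketch {
    // A latin-1 byte has the same numeric value as its Unicode code point.
    static char32_t decode(char unit)
    {
        return static_cast<unsigned char>(unit);
    }
    // Only U+0000..U+00FF can be represented in latin-1.
    static char encode(char32_t cp)
    {
        assert(cp <= 0xFF);
        return static_cast<char>(cp);
    }
};

// Decode a raw encoded string into code points using the given traits.
template <class Traits>
std::u32string decode_all(const std::string& raw)
{
    std::u32string out;
    out.reserve(raw.size());   // exact for single-byte encodings
    for (char unit : raw)
        out += Traits::decode(unit);
    return out;
}

// Encode code points back into a raw string using the given traits.
template <class Traits>
std::string encode_all(const std::u32string& text)
{
    std::string out;
    out.reserve(text.size());
    for (char32_t cp : text)
        out += Traits::encode(cp);
    return out;
}
```

An encoding like windows-1252 would need a lookup table for the bytes 0x80 to 0x9F instead of the direct cast, but the overall shape would stay the same.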

Nils Springob wrote:
>> However, it isn't basic_string, which means it is isolated from the rest of the standard library. In a perfect world I would expect to read/write utf_strings from std::streams in the same way as for std::string, i.e. all the operations like operator>>, getline and so on should be usable on utf_strings.
>
> It is always possible to access the basic_string<> data by calling raw()! The standard requires character access, which can't be implemented efficiently for utf8 and utf16 encoded strings.
I see! I have some tips for your code:

1) insert(size_type, uint32_t &). This doesn't compile without using boost. It should be written as insert(size_type, utf32_char &).

2) I also have another compiler problem on the line

static const size_type npos = string_type::npos;

With my msvc8 I am getting error C2057: expected constant expression. I don't understand it, because string_type::npos IS a constant expression. Do you know what the problem is here?

3) In the append, assign and ctor members you should call raw_string.reserve before appending characters, to be more efficient. I know that sometimes the exact number of characters (code units) isn't known, but it's better to reserve at least the minimal count.

4) This is rather cosmetic. I think something like ustring<windows_1252> would be both more tabular and more boost-way than the current windows_1252_string. You can do it easily by uniting all your utf*_traits classes into one template and specializing on predefined struct tags.

Regards
Tomas
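Tip 4 can be sketched roughly as follows. All names here (utf8_tag, latin1_tag, encoding_traits, ustring_sketch) are invented for illustration and are not taken from unicode.h; the point is only the shape of one traits template specialized on empty tag structs.

```cpp
#include <cassert>
#include <string>

// Empty tag types that name an encoding (hypothetical names).
struct utf8_tag {};
struct latin1_tag {};

// One traits template; the primary template is left undefined so that an
// unsupported tag fails at compile time.
template <class Tag> struct encoding_traits;

template <> struct encoding_traits<utf8_tag> {
    typedef std::string string_type;
    static const char* name() { return "utf8"; }
};

template <> struct encoding_traits<latin1_tag> {
    typedef std::string string_type;
    static const char* name() { return "latin1"; }
};

// String wrapper parameterized on the tag instead of on a traits class.
template <class Tag>
class ustring_sketch {
public:
    typedef encoding_traits<Tag> traits_type;
    typedef typename traits_type::string_type string_type;

    explicit ustring_sketch(const char* encoded)
    {
        assert(encoded != 0);      // a null pointer is not a valid C string
        raw_string = encoded;
    }

    const string_type& raw() const { return raw_string; }
    static const char* encoding_name() { return traits_type::name(); }

private:
    string_type raw_string;
};
```

A user would then write ustring_sketch<latin1_tag> rather than a separately named class per encoding. Whether this reads better than a typedef per encoding is a matter of taste.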

> 1) insert(size_type, uint32_t &). This doesn't compile without using boost. It should be written as insert(size_type, utf32_char &).

The bug is fixed.

> 2) I also have another compiler problem on the line static const size_type npos = string_type::npos; With my msvc8 I am getting error C2057: expected constant expression. I don't understand it, because string_type::npos IS a constant expression. Do you know what the problem is here?

I don't know, but I changed the code to use the BOOST_STATIC_CONSTANT macro; perhaps this will fix the problem.

> 3) In the append, assign and ctor members you should call raw_string.reserve before appending characters, to be more efficient. I know that sometimes the exact number of characters (code units) isn't known, but it's better to reserve at least the minimal count.

That's problematic... The size of an encoded string can only be calculated by iteration. The other problem is that iterators would be invalid after assignment, because you can't predict whether reallocation will occur, even if you reserved some memory before.

> 4) This is rather cosmetic. I think something like ustring<windows_1252> would be both more tabular and more boost-way than the current windows_1252_string. You can do it easily by uniting all your utf*_traits classes into one template and specializing on predefined struct tags.

utf8_string is only a shortcut for unicode_string<utf8_traits>.
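Regarding point 3, the two options under discussion can be sketched like this: either compute the exact utf8 size with one extra pass over the source, or reserve only the lower bound of one code unit per character. The function names are hypothetical and the code assumes the source is available as a C++11 std::u32string; it does not reflect what the wrapper does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of UTF-8 code units needed for one code point.
inline std::size_t utf8_units_for(char32_t cp)
{
    assert(cp <= 0x10FFFF);   // outside the Unicode range
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Exact encoded size: needs a full iteration over the source, as noted above.
inline std::size_t utf8_encoded_size(const std::u32string& text)
{
    std::size_t units = 0;
    for (char32_t cp : text)
        units += utf8_units_for(cp);
    return units;
}

// Cheap alternative: every code point needs at least one code unit, so
// text.size() is a safe lower bound for the additional space.
inline void reserve_for_append(std::string& raw, const std::u32string& text)
{
    raw.reserve(raw.size() + text.size());
}
```

Note that reserve may itself reallocate, so iterators into the raw string still have to be treated as invalid after either variant. The sketch only reduces the number of reallocations during the append.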
I added a new file containing some additional encodings, allowing access to latin1, windows_1252 and cp437 encoded strings.

* unicode_string<latin1_traits>
* unicode_string<windows_1252_traits>
* unicode_string<cp437_traits>

Additionally I updated the existing files. All files can be accessed here:
http://opensource.nicai-systems.com/unicode/unicode.h
http://opensource.nicai-systems.com/unicode/test_unicode.cpp
http://opensource.nicai-systems.com/unicode/encodings.h

Regards,
Nils
Participants (2): Nils Springob, Tomas Pecholt